GSoC 2018 Project Ideas

Ethan White edited this page Mar 2, 2018 · 3 revisions

Expansion of Spatial Data Support to the Data Retriever

Please ask questions here. Tag @ethanwhite and @henrykironde.

Rationale

The Data Retriever is a package manager for your data. It automatically finds, downloads, and pre-processes publicly available datasets and stores them in a ready-to-analyse state. The cleaned datasets can be stored in any of the supported data storage systems (CSV, XML, JSON, MySQL, Postgres, SQLite, and raw files).

The Data Retriever platform enables scientists to spend less time on data pre-processing, hence reducing the time taken to go from an idea to results.

Approach

The goal of this project is to scale up support for spatial data on the Data Retriever platform. Currently, PostGIS support in the Data Retriever is being tested. PostGIS is an extension of Postgres that enables spatial data manipulation.

This project would extend spatial processing to other RDBMS engines such as SQLite and MySQL, and possibly to GeoJSON. These RDBMS engines have extensions that support spatial data storage.

Specifically this would involve:

  • Object oriented programming in Python
  • Working with spatial data using Python and database management systems
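
As a rough illustration of the GeoJSON side of this work, the following minimal sketch converts tabular rows with coordinate columns into a GeoJSON FeatureCollection using only the standard library. The function name `rows_to_geojson`, the column names, and the sample rows are assumptions made for the example; none of this is part of the Data Retriever's API.

```python
import json


def rows_to_geojson(rows, lon_key="longitude", lat_key="latitude"):
    """Convert dict rows with coordinate columns into a GeoJSON FeatureCollection.

    Hypothetical helper: the column names are assumptions for this sketch.
    """
    features = []
    for row in rows:
        # Everything that is not a coordinate becomes a feature property.
        properties = {k: v for k, v in row.items() if k not in (lon_key, lat_key)}
        features.append({
            "type": "Feature",
            # GeoJSON orders point coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [float(row[lon_key]), float(row[lat_key])],
            },
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample rows, as they might come out of a CSV reader.
sample_rows = [
    {"site": "A", "longitude": "-82.35", "latitude": "29.64"},
    {"site": "B", "longitude": "-111.83", "latitude": "41.74"},
]
collection = rows_to_geojson(sample_rows)
geojson_text = json.dumps(collection, indent=2)
```

Storing the same points in SpatiaLite or MySQL would go through those engines' spatial extensions instead, which this sketch does not cover.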

Involved toolkits or projects

  • The Data Retriever
  • Python
  • MySQL
  • SQLite (extensions SpatiaLite, Rasterlite)
  • GeoJSON
  • HDF

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of spatial data and spatial database add-ons

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite

Status Server and Dashboard

Please ask questions here. Tag @ethanwhite and @henrysenyondo.

Rationale

The Data Retriever is a platform that enables scientists to automate data pre-processing. The Data Retriever specifically automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database management system. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.

As we aim to support more datasets and more data formats, we need to keep validating that these datasets still work, since data producers or assemblers may change data sources at any time.

Approach

The goal of the project is to create a platform that enables users and maintainers to keep track of which datasets are working and which are failing. The design will include a dashboard that shows the status of each dataset. The retriever would be run periodically against each dataset, logging which datasets are working and which are not, and detecting datasets that have changed.

Specifically this would involve:

  • Front end (dashboard) that displays the status of each dataset
  • Back end (Python) that periodically runs the retriever on all datasets to validate their functionality
  • Downloading each dataset and testing for any changes
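
One possible shape for the back-end check is sketched below, assuming each dataset can be fetched as raw bytes and that a change is detected by comparing SHA-256 hashes between runs. The names `check_dataset`, `fingerprint`, and the dataset names are hypothetical, and the fetchers are stand-ins rather than real Data Retriever calls.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest used to detect changed content."""
    return hashlib.sha256(data).hexdigest()


def check_dataset(name, fetch, previous_hashes):
    """Run one status check.

    `fetch` is a callable returning the dataset's raw bytes (an assumption
    for this sketch); `previous_hashes` maps dataset names to the digest
    recorded on the last successful run.
    """
    record = {"dataset": name, "checked_at": datetime.now(timezone.utc).isoformat()}
    try:
        data = fetch()
    except Exception as exc:
        # Any error while fetching marks the dataset as failing.
        record.update(status="failing", detail=str(exc))
        return record
    digest = fingerprint(data)
    old = previous_hashes.get(name)
    record["status"] = "changed" if old is not None and old != digest else "working"
    record["sha256"] = digest
    return record


def broken_fetch():
    raise OSError("source URL returned 404")


# Stand-in fetchers and hashes; a real run would download from each source.
previous = {
    "stable-data": fingerprint(b"a,b\n1,2\n"),
    "moved-data": fingerprint(b"old contents"),
}
report = [
    check_dataset("stable-data", lambda: b"a,b\n1,2\n", previous),
    check_dataset("moved-data", lambda: b"new contents", previous),
    check_dataset("broken-data", broken_fetch, previous),
]
statuses = {r["dataset"]: r["status"] for r in report}
```

A dashboard front end could then render the `report` records, for example after they are written out as JSON by a scheduled job.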

Involved toolkits or projects

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Knowledge of full stack development

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite

Implement a provenance feature for the Data Retriever

Please ask questions here.

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analyzing it.

One of the challenges in data science is reproducing an analysis after a change to the data source. This is because most of the data-related steps (downloading, cleaning up, and restructuring) are either done manually or with one-off scripts.

Changes in the data sources can break these scripts, require manual cleaning to be redone, and make it difficult to understand the relationship between analyses on different versions of the data.

The Data Retriever currently solves many of the problems related to getting the data ready for analysis, but it doesn't support re-running analyses using older versions of data and the package management scripts associated with these older versions. This prevents older analyses from being fully reproduced and compared to those based on updated data.

Approach

This project would extend the Data Retriever with a provenance feature using Python. This will enable users to automatically store the downloaded versions of a data package (dataset) and to run the Data Retriever on the archived data using the same versions of both the Data Retriever and the data package script with which it was originally created.

Specifically this would involve:

  • Object oriented programming in Python
  • Working with Docker to automate running different versions of the Data Retriever
  • Familiarity with Git's provenance feature
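
To make the archiving idea concrete, here is a minimal sketch that stores a downloaded file alongside a manifest recording which versions produced it, and verifies the archive when it is loaded again. The directory layout, the function names, and the version strings are all assumptions for illustration, not an existing Data Retriever design.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def archive_download(archive_dir, dataset, data, retriever_version, script_version):
    """Store raw bytes in a content-addressed folder with a provenance manifest."""
    digest = hashlib.sha256(data).hexdigest()
    target = Path(archive_dir) / dataset / digest
    target.mkdir(parents=True, exist_ok=True)
    (target / "raw.bin").write_bytes(data)
    manifest = {
        "dataset": dataset,
        "sha256": digest,
        # Versions needed to re-run the original processing later.
        "retriever_version": retriever_version,
        "script_version": script_version,
    }
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return target


def load_archive(path):
    """Read an archived download and check it still matches its manifest."""
    path = Path(path)
    manifest = json.loads((path / "manifest.json").read_text())
    data = (path / "raw.bin").read_bytes()
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError("archived data does not match its manifest")
    return manifest, data


# Hypothetical dataset name and version strings, used only for the demo.
with tempfile.TemporaryDirectory() as tmp:
    stored = archive_download(tmp, "example-dataset", b"id,value\n1,10\n", "2.1.0", "1.0.0")
    manifest, restored = load_archive(stored)
```

With a manifest like this, a Docker image pinned to the recorded `retriever_version` could in principle be started to reprocess the archived bytes, which is the part the project itself would need to design.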

Involved toolkits or projects

  • The Data Retriever
  • Python
  • Relational database management systems (RDBMS) including MySQL, PostgreSQL, SQLite

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python
  • Some experience with Docker

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite

Improve Data Retriever efficiency for out-of-memory scale datasets

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database or CSV files. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analysing it.

The Data Retriever is designed to work with out-of-memory scale data, but is still slower than desirable when doing so. This project would involve both making the Data Retriever more efficient on large datasets and making querying them from the resulting databases more efficient.

Approach

This project would extend the Data Retriever using Python to increase the speed while maintaining a low memory footprint, and to allow indexes to be added to the databases for efficient querying.

Specifically this would involve:

  • Object oriented programming in Python
  • Using profilers to determine slow and/or memory-intensive areas of the codebase
  • Working with relational database management systems
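
The two halves of this work, profiling and indexing, can be sketched with the standard library alone. The example below profiles a toy loading function with `cProfile` and then checks, via SQLite's `EXPLAIN QUERY PLAN`, that a query uses a newly added index. The table, column, and index names are made up for the sketch, and an in-memory SQLite database stands in for whichever engine is actually in use.

```python
import cProfile
import io
import pstats
import sqlite3


def load_rows(conn, n):
    """Toy stand-in for a loading step: insert n generated rows."""
    conn.execute("CREATE TABLE observations (id INTEGER, species TEXT, count INTEGER)")
    # A generator keeps memory use low: rows are produced one at a time.
    rows = ((i, f"sp{i % 100}", i % 7) for i in range(n))
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")

# Profile the load to see where time is spent.
profiler = cProfile.Profile()
profiler.enable()
load_rows(conn, 10000)
profiler.disable()

buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
profile_report = buffer.getvalue()  # text table of the five costliest calls

# Add an index and confirm the query planner picks it up.
query = "SELECT SUM(count) FROM observations WHERE species = ?"
conn.execute("CREATE INDEX idx_observations_species ON observations (species)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, ("sp42",)).fetchall()
# The human-readable plan detail is the last column of each row.
uses_index = any("idx_observations_species" in row[-1] for row in plan_after)
```

Printing `profile_report` shows the cumulative time per function, which is one way to find the slow areas the list above refers to. Memory profiling would need a separate tool such as the standard `tracemalloc` module.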

Involved toolkits or projects

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of Python

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub.

Mentors

  • @henrysenyondo
  • @ethanwhite

Synchronize R API with Python API and improve usability of R package

Rationale

The Data Retriever automates the tasks of finding, downloading, and cleaning up publicly available data, and then stores them in a local database or CSV files. It's a package manager for data. This lets data analysts spend less time cleaning up and managing data, and more time analysing it.

The Data Retriever is written in Python and has an R package that wraps the command-line interface to allow the retriever to be used directly from R. The R package needs significant improvements to be maximally useful to users and to allow them to move as seamlessly as possible between the Python interface and the R interface.

Approach

This project would improve several aspects of the user interface and user experience with the R package including:

  • Match the R install API to the Python install API while maintaining backward compatibility. Currently the R API uses a single install function instead of the suite of different install functions in the Python interface.
  • Expand the R API to include all features of the Python API.
  • Improve detection of the Python package by R and RStudio.
  • Improve passing of Python errors to the user.
  • Improve testing.

Specifically this would involve:

  • Programming in R
  • Familiarity with basic programming in Python (to allow engaging with the core Python code base)
  • Working with relational database management systems

Involved toolkits or projects

Degree of difficulty and needed skills

  • Moderate Difficulty
  • Knowledge of R

Involved developer communities

The Data Retriever primarily interacts via issues and pull requests on GitHub and through a gitter chat channel.

Mentors

  • @henrysenyondo
  • @ethanwhite