A Python data science pipeline package.
At de Volksbank, our data scientists used to write a lot of overhead code from scratch for every experiment. To help them focus on the more exciting and value-adding parts of their jobs, we created this package. Using this package you can easily create and reuse your pipeline code (consisting of frequently used data transformations and modeling steps) across experiments.
This package has (among others) the following features:
- Make easy-to-follow model pipelines of fits and transforms (what exactly is a pipeline?)
- Make a graph of the pipeline
- Output graphics, data, metadata, etc. from the pipeline steps
- Data preprocessing such as filtering feature and observation outliers
- Adding and merging intermediate dataframes
- Every pipe stores all intermediate output, so the output can be inspected later on
- Transforms can store the outputs of previous runs, so the data from different transforms can be compared in one graph
- Data is in Pandas DataFrame format
- Parameters for every pipe can be given with the pipeline fit_transform() and transform() methods
This package was developed specifically for fast prototyping with relatively small datasets on a single machine. Because the intermediate output of each pipeline step is stored, this package may underperform on larger datasets (100,000 rows or more).
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. For a more extensive overview of all the features, see the docs directory.
This package requires Python 3 and has been tested and developed with Python 3.6.
The easiest way to install the library for general use is:
pip install dvb.datascience
To install the checked-out repository for developing dvb.datascience (run in the checkout directory):
pipenv install --dev
For using dvb.datascience in your project:
pipenv install dvb.datascience
To create and activate an environment and install the package (run in the checkout directory):
conda create --name dvb.datascience
conda activate dvb.datascience
pip install -e .
or use it via:
pip install dvb.datascience
When working with longer pipelines in a Jupyter notebook, the output can become quite long. It is advisable to install jupyter_contrib_nbextensions and enable its toc2 extension:
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install
Next, start a Jupyter notebook, navigate to edit > nbextensions config, enable the toc2 extension, and optionally set other properties. After that, navigate back to your notebook (refresh) and click the toc2 icon in the menu to load the table of contents in the side panel.
This example loads the Iris dataset and makes some plots of it:
import dvb.datascience as ds

p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))  # load the bundled Iris sample data
p.addPipe('split', ds.transform.TrainTestSplit(test_size=0.3), [("read", "df", "df")])  # 70/30 train/test split
p.addPipe('boxplot', ds.eda.BoxPlot(), [("split", "df", "df")])  # box plot of the split data
p.fit_transform(transform_params={'split': {'train': True}})
This example shows a number of features of the package and its usage:
- Adding 3 steps to the pipeline using addPipe()
- Linking the 3 steps using [("read", "df", "df")]: the "df" output (2nd element) of the "read" pipe (1st element) is connected to the "df" input (3rd element) of the "split" pipe
- The usage of 3 subpackages: ds.data, ds.transform and ds.eda. The other 2 subpackages are ds.predictor and ds.score
- The last call, p.fit_transform(), takes as a parameter additional input for running the defined pipeline; this input can be different for each call to p.fit_transform() or p.transform()
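As an illustration of the last point, a fitted pipeline can be run again with different per-pipe parameters. A minimal sketch building on the example above; the 'train': False value is an assumption here, meant to make the split pipe emit its test part instead of its train part:

# fit and run the pipeline on the train part of the split (as above)
p.fit_transform(transform_params={'split': {'train': True}})
# run the fitted pipeline again with different parameters (assumed: select the test part)
p.transform(transform_params={'split': {'train': False}})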
This example applies the KNeighborsClassifier from sklearn to the Iris dataset:
import dvb.datascience as ds
from sklearn.neighbors import KNeighborsClassifier

p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))  # load the bundled Iris sample data
# wrap a sklearn classifier as a pipe; it needs both the data and its metadata
p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [("read", "df", "df"), ("read", "df_metadata", "df_metadata")])
# score the predictions made by the classifier
p.addPipe('score', ds.score.ClassificationScore(), [("clf", "predict", "predict"), ("clf", "predict_metadata", "predict_metadata")])
p.fit_transform()
This example shows:
- The use of the KNeighborsClassifier from sklearn
- Coupling multiple outputs of one pipe to the inputs of another: [("read", "df", "df"), ("read", "df_metadata", "df_metadata")]
For a more extensive overview of all the features, see the docs directory.
The unit tests for the project can be run using pytest:
pytest
Pytest will also output the coverage to the console.
To generate an HTML report, you can use:
py.test --cov-report html
Code styling is done using Black
For an extensive list, see setup.py
- scipy / numpy / pandas / matplotlib - For calculations and visualizations
- sklearn - Machine learning algorithms
- statsmodels - Statistics
- mlxtend - Feature selection
- tabulate - Printing tabular data
- imblearn - SMOTE
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
We use SemVer for versioning. For the versions available, see the tags on this repository.
- Marc Rijken - Initial work - mrijken
- Wouter Poncin - Maintenance - wpbs
- Daan Knoope - Contributor - daanknoope
- Christopher Huijting - Contributor - chuijting
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE file for details
For any questions please don't hesitate to contact us at [email protected]
- Adding support for multiclass classification problems
- Adding support for regression problems
- Adding support for Apache Spark ML