A Python data science pipeline package.
At de Volksbank, our data scientists used to write a lot of overhead code from scratch for every experiment. To help them focus on the more exciting and value-adding parts of their jobs, we created this package. Using this package you can easily create and reuse your pipeline code (consisting of frequently used data transformations and modeling steps) across experiments.
This package has (among others) the following features:
- Make easy-to-follow model pipelines of fits and transforms (what exactly is a pipeline?)
- Make a graph of the pipeline
- Output graphics, data, metadata, etc. from the pipeline steps
- Data preprocessing such as filtering feature and observation outliers
- Adding and merging intermediate dataframes
- Every pipe stores all intermediate output, so the output can be inspected later on
- Transforms can store the outputs of previous runs, so the data from different transforms can be compared in one graph
- Data is in Pandas DataFrame format
- Parameters for every pipe can be given with the pipeline fit_transform() and transform() methods
This package was developed specifically for fast prototyping with relatively small datasets on a single machine. Because the intermediate output of each pipeline step is stored, this package may underperform on larger datasets (100,000 rows or more).
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. For a more extensive overview of all the features, see the docs directory.
This package requires Python 3 and has been tested and developed with Python 3.6.
The easiest way to install the library for general use is:
pip install dvb.datascience
To install the checked-out repository for developing dvb.datascience (run in the checkout directory):
pipenv install --dev
For using dvb.datascience in your project:
pipenv install dvb.datascience
To create and activate an environment and install the package (run in the checkout directory):
conda create --name dvb.datascience
conda activate dvb.datascience
pip install -e .
or use it via:
pip install dvb.datascience
When working with longer pipelines in a Jupyter notebook, the output can become quite long. It is advisable to install jupyter_contrib_nbextensions and enable its toc2 extension:
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install
Next, start a Jupyter notebook, navigate to edit > nbextensions config, enable the toc2 extension, and optionally set other properties. After that, navigate back to your notebook (refresh) and click the toc2 icon in the menu to load the table of contents in the side panel.
This example loads the Iris dataset and makes some plots of it:
import dvb.datascience as ds

p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))  # load the bundled Iris sample data
p.addPipe('split', ds.transform.TrainTestSplit(test_size=0.3), [("read", "df", "df")])  # 70/30 train/test split
p.addPipe('boxplot', ds.eda.BoxPlot(), [("split", "df", "df")])  # box plot of the split data
p.fit_transform(transform_params={'split': {'train': True}})
This example shows a number of features of the package and its usage:
- Adding 3 steps to the pipeline using addPipe()
- Linking the 3 steps using [("read", "df", "df")]: the "df" output (2nd element) of the "read" pipe (1st element) is connected to the "df" input (3rd element) of the "split" pipe
- The usage of 3 subpackages: ds.data, ds.transform and ds.eda. The other 2 subpackages are ds.predictor and ds.score
- The last call, p.fit_transform(), takes as a parameter additional input for running the defined pipeline; this input can be different for each call to p.fit_transform() or p.transform()
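As an illustration of the last point, a fitted pipeline can be run again with different per-pipe parameters. A minimal sketch building on the example above; the 'train': False value is an assumption here, meant to make the split pipe emit its test part instead of its train part:

# fit and run the pipeline on the train part of the split (as above)
p.fit_transform(transform_params={'split': {'train': True}})
# run the fitted pipeline again with different parameters (assumed: select the test part)
p.transform(transform_params={'split': {'train': False}})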
This example applies the KNeighborsClassifier from sklearn to the Iris dataset:
import dvb.datascience as ds
from sklearn.neighbors import KNeighborsClassifier

p = ds.Pipeline()
p.addPipe('read', ds.data.SampleData('iris'))  # load the bundled Iris sample data
# wrap a sklearn classifier as a pipe; it needs both the data and its metadata
p.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [("read", "df", "df"), ("read", "df_metadata", "df_metadata")])
# score the predictions made by the classifier
p.addPipe('score', ds.score.ClassificationScore(), [("clf", "predict", "predict"), ("clf", "predict_metadata", "predict_metadata")])
p.fit_transform()
This example shows:
- The use of the KNeighborsClassifier from sklearn
- Coupling multiple outputs of one pipe to the inputs of another: [("read", "df", "df"), ("read", "df_metadata", "df_metadata")]
For a more extensive overview of all the features, see the docs directory.
The unit tests for the project can be run using pytest:
pytest
Pytest will also output the coverage to the console.
To generate an HTML report, you can use:
py.test --cov-report html
Code styling is done using Black
For an extensive list, see setup.py
- scipy / numpy / pandas / matplotlib - For calculations and visualizations
- sklearn - Machine learning algorithms
- statsmodels - Statistics
- mlxtend - Feature selection
- tabulate - Printing tabular data
- imblearn - SMOTE
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
We use SemVer for versioning. For the versions available, see the tags on this repository.
- Marc Rijken - Initial work - mrijken
- Wouter Poncin - Maintenance - wpbs
- Daan Knoope - Contributor - daanknoope
- Christopher Huijting - Contributor - chuijting
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE file for details
For any questions please don't hesitate to contact us at [email protected]
- Adding support for multiclass classification problems
- Adding support for regression problems
- Adding support for Apache Spark ML