Commit

Merge pull request #953 from EpistasisLab/development
Version 0.11.0 release
weixuanfu authored Nov 5, 2019
2 parents 815b0e2 + 8b71687 commit e473d73
Showing 63 changed files with 2,033 additions and 3,947 deletions.
.appveyor.yml (10 changes: 3 additions & 7 deletions)
@@ -4,10 +4,7 @@ environment:
matrix:
- PYTHON_VERSION: 3.7
MINICONDA: C:/Miniconda36-x64
DASK_ML_VERSION: 0.13.0
- PYTHON_VERSION: 2.7
MINICONDA: C:/Miniconda-x64
DASK_ML_VERSION: 0.12.0
DASK_ML_VERSION: 1.0.0

platform:
- x64
@@ -21,10 +18,9 @@ install:
- conda config --set always_yes yes --set changeps1 no
- conda update -q conda
- conda info -a
- conda create -q -n test-environment python=%PYTHON_VERSION% numpy scipy scikit-learn nose cython pandas pywin32 joblib
- conda create -q -n test-environment python=%PYTHON_VERSION% numpy scipy scikit-learn nose cython pandas joblib
- activate test-environment
- pip install deap tqdm update_checker stopit dask[delayed] cloudpickle==0.5.6
- pip install dask_ml==%DASK_ML_VERSION%
- pip install deap tqdm update_checker stopit xgboost dask[delayed] dask[dataframe] cloudpickle==0.5.6 fsspec>=0.3.3 dask_ml==%DASK_ML_VERSION%


test_script:
.travis.yml (11 changes: 3 additions & 8 deletions)
@@ -4,24 +4,19 @@ matrix:
include:
- name: "Python 3.7 on Xenial Linux"
dist: xenial # required for Python >= 3.7
env: PYTHON_VERSION="3.7" DASK_ML_VERSION="0.13.0"
env: PYTHON_VERSION="3.7" DASK_ML_VERSION="1.0.0"
before_install:
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- name: "Python 3.7 on Xenial Linux with coverage"
dist: xenial # required for Python >= 3.7
env: PYTHON_VERSION="3.7" COVERAGE="true" DASK_ML_VERSION="0.13.0"
before_install:
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- name: "Python 2.7 on Xenial Linux"
dist: xenial
env: PYTHON_VERSION="2.7" DASK_ML_VERSION="0.12.0"
env: PYTHON_VERSION="3.7" COVERAGE="true" DASK_ML_VERSION="1.0.0"
before_install:
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- name: "Python 3.7 on macOS"
os: osx
osx_image: xcode10.2 # Python 3.7.2 running on macOS 10.14.3
language: shell # 'language: python' is an error on Travis CI macOS
env: PYTHON_VERSION="3.7" DASK_ML_VERSION="0.13.0"
env: PYTHON_VERSION="3.7" DASK_ML_VERSION="1.0.0"
before_install:
- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh
install: source ./ci/.travis_install.sh
README.md (85 changes: 58 additions & 27 deletions)
@@ -6,16 +6,15 @@ Development status: [![Development Build Status - Mac/Linux](https://travis-ci.o
[![Development Build Status - Windows](https://ci.appveyor.com/api/projects/status/b7bmpwpkjhifrm7v/branch/development?svg=true)](https://ci.appveyor.com/project/weixuanfu/tpot?branch=development)
[![Development Coverage Status](https://coveralls.io/repos/github/EpistasisLab/tpot/badge.svg?branch=development)](https://coveralls.io/github/EpistasisLab/tpot?branch=development)

Package information: [![Python 2.7](https://img.shields.io/badge/python-2.7-blue.svg)](https://www.python.org/download/releases/2.7/)
[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)
Package information: [![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)
[![License: LGPL v3](https://img.shields.io/badge/license-LGPL%20v3-blue.svg)](http://www.gnu.org/licenses/lgpl-3.0)
[![PyPI version](https://badge.fury.io/py/TPOT.svg)](https://badge.fury.io/py/TPOT)

<p align="center">
<img src="https://raw.githubusercontent.com/EpistasisLab/tpot/master/images/tpot-logo.jpg" width=300 />
</p>

Consider TPOT your **Data Science Assistant**. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
**TPOT** stands for **T**ree-based **P**ipeline **O**ptimization **T**ool. Consider TPOT your **Data Science Assistant**. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

![TPOT Demo](https://github.com/EpistasisLab/tpot/blob/master/images/tpot-demo.gif "TPOT Demo")

@@ -55,7 +54,7 @@ Click on the corresponding links to find more information on TPOT usage in the d

### Classification

Below is a minimal working example with the practice MNIST data set.
Below is a minimal working example with the optical recognition of handwritten digits dataset.

```python
from tpot import TPOTClassifier
@@ -64,32 +63,43 @@ from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
train_size=0.75, test_size=0.25)
train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
tpot.export('tpot_digits_pipeline.py')
```

Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the `tpot_mnist_pipeline.py` file and look similar to the following:
Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the `tpot_digits_pipeline.py` file and look similar to the following:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the class is labeled 'target' in the data file
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
train_test_split(features, tpot_data['target'].values, random_state=None)


exported_pipeline = KNeighborsClassifier(n_neighbors=6, weights="distance")

exported_pipeline.fit(training_features, training_classes)
train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```
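
The exported script expects a CSV file whose outcome column is named `target`. As a minimal sketch, the digits data used above could be written out in that format so the placeholder path has something to point at; the file name `digits.csv` is hypothetical and this helper is not part of TPOT's exported code:

```python
# Sketch: save the digits data to a CSV whose outcome column is named 'target',
# so the exported script's 'PATH/TO/DATA/FILE' placeholder can point at a real
# file. The file name 'digits.csv' is an assumption, not something TPOT writes.
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()
data = pd.DataFrame(digits.data)
data['target'] = digits.target
data.to_csv('digits.csv', index=False)
```

The exported script's `read_csv` call could then read `digits.csv` with `sep=','`.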

@@ -104,9 +114,9 @@ from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
train_size=0.75, test_size=0.25)
train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2)
tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')
@@ -117,20 +127,27 @@ which should result in a pipeline that achieves about 12.77 mean squared error (
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the class is labeled 'target' in the data file
# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
train_test_split(features, tpot_data['target'].values, random_state=None)
train_test_split(features, tpot_data['target'], random_state=42)

exported_pipeline = GradientBoostingRegressor(alpha=0.85, learning_rate=0.1, loss="ls",
max_features=0.9, min_samples_leaf=5,
min_samples_split=6)
# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_classes)
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```
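
The exported regression script stops at `predict`. A short sketch of checking the held-out mean squared error mentioned above, assuming it is appended to the end of that script after the placeholder path is filled in (these lines are not part of TPOT's exported code):

```python
# Sketch: compare the predictions with the held-out targets to get an MSE
# figure comparable to the one quoted above. Assumes these lines are appended
# to the exported script, which defines `testing_target` and `results`.
from sklearn.metrics import mean_squared_error

print(mean_squared_error(testing_target, results))
```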

@@ -150,6 +167,20 @@ Please [check the existing open and closed issues](https://github.com/EpistasisL

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu and Jason H. Moore (2019). [Scaling tree-based automated machine learning to biomedical big data with a feature set selector](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz470/5511404). *Bioinformatics*. 2019 Jun 4.

BibTeX entry:

```bibtex
@article{le2019scaling,
title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector.},
author={Le, TT and Fu, W and Moore, JH},
journal={Bioinformatics (Oxford, England)},
year={2019}
}
```


Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). [Automating biomedical data science through tree-based pipeline optimization](http://link.springer.com/chapter/10.1007/978-3-319-31204-0_9). *Applications of Evolutionary Computation*, pages 123-137.

BibTeX entry:
ci/.travis_install.sh (3 changes: 1 addition & 2 deletions)
@@ -38,8 +38,7 @@ conda create -n testenv --yes python=$PYTHON_VERSION pip nose \
source activate testenv

pip install deap tqdm update_checker stopit \
dask[delayed] xgboost cloudpickle==0.5.6
pip install dask_ml==$DASK_ML_VERSION
dask[delayed] dask[dataframe] xgboost cloudpickle==0.5.6 dask_ml==$DASK_ML_VERSION fsspec>=0.3.3

if [[ "$COVERAGE" == "true" ]]; then
pip install coverage coveralls
docs/404.html (24 changes: 12 additions & 12 deletions)
@@ -13,12 +13,11 @@

<link rel="stylesheet" href="/tpot/css/theme.css" type="text/css" />
<link rel="stylesheet" href="/tpot/css/theme_extra.css" type="text/css" />
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css">
<link rel="stylesheet" href="/tpot/css/highlight.css">

<script src="/tpot/js/jquery-2.1.1.min.js" defer></script>
<script src="/tpot/js/modernizr-2.8.3.min.js" defer></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script src="/tpot/js/jquery-2.1.1.min.js"></script>
<script src="/tpot/js/modernizr-2.8.3.min.js"></script>
<script type="text/javascript" src="/tpot/js/highlight.pack.js"></script>

</head>

@@ -29,10 +28,10 @@

<nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav">
<div class="wy-side-nav-search">
<a href="/tpot/." class="icon icon-home"> TPOT</a>
<a href="/tpot/" class="icon icon-home"> TPOT</a>
<div role="search">
<form id ="rtd-search-form" class="wy-form" action="/tpot/search.html" method="get">
<input type="text" name="q" placeholder="Search docs" title="Type search term here" />
<input type="text" name="q" placeholder="Search docs" />
</form>
</div>
</div>
@@ -43,7 +42,7 @@

<li class="toctree-l1">

<a class="" href="/tpot/.">Home</a>
<a class="" href="/tpot/">Home</a>
</li>

<li class="toctree-l1">
@@ -101,15 +100,15 @@

<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="/tpot/.">TPOT</a>
<a href="/tpot/">TPOT</a>
</nav>


<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="/tpot/.">Docs</a> &raquo;</li>
<li><a href="/tpot/">Docs</a> &raquo;</li>


<li class="wy-breadcrumbs-aside">
@@ -161,8 +160,9 @@ <h1 id="404-page-not-found">404</h1>
</span>
</div>
<script>var base_url = '/tpot';</script>
<script src="/tpot/js/theme.js" defer></script>
<script src="/tpot/search/main.js" defer></script>
<script src="/tpot/js/theme.js"></script>
<script src="/tpot/search/require.js"></script>
<script src="/tpot/search/search.js"></script>

</body>
</html>