Update docs for release
rhiever committed Sep 27, 2017
1 parent c431a9d commit c100f2a
Showing 10 changed files with 159 additions and 82 deletions.
24 changes: 20 additions & 4 deletions docs/api/index.html
Original file line number Diff line number Diff line change
@@ -273,6 +273,7 @@ <h1 id="classification">Classification</h1>
<li>Python dictionary, TPOT will use your custom configuration,</li>
<li>string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or</li>
<li>string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or</li>
<li>string 'TPOT sparse', TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or</li>
<li>None, TPOT will use the default TPOTClassifier configuration.</li>
</ul>
See the <a href="../using/#built-in-tpot-configurations">built-in configurations</a> section for the list of configurations included with TPOT, and the <a href="../using/#customizing-tpots-operators-and-parameters">custom configuration</a> section for more information and examples of how to create your own TPOT configurations.
@@ -287,16 +288,23 @@ <h1 id="classification">Classification</h1>

<strong>periodic_checkpoint_folder</strong>: path string, optional (default: None)
<blockquote>
If supplied, a folder in which tpot will periodically save the best pipeline so far while optimizing.<br /><br />
If supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing.<br /><br />
Currently once per generation but not more often than once per 30 seconds.<br /><br />
Useful in multiple cases:
<ul>
<li>Sudden death before tpot could save optimized pipeline</li>
<li>Sudden death before TPOT could save optimized pipeline</li>
<li>Track its progress</li>
<li>Grab pipelines while it's still optimizing</li>
</ul>
</blockquote>
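The save cadence described above (once per generation, but no more often than once per 30 seconds) can be sketched in plain Python. This is only an illustration of the throttling logic, not TPOT's actual implementation; the class name and the simulated clock are hypothetical:

```python
import time

class CheckpointThrottle:
    """Allow a save at most once per `min_interval` seconds."""

    def __init__(self, min_interval=30.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock  # injectable clock, so the logic is testable
        self._last_save = float('-inf')

    def should_save(self):
        """Return True (and reset the timer) if enough time has passed."""
        now = self.clock()
        if now - self._last_save >= self.min_interval:
            self._last_save = now
            return True
        return False

# Hypothetical optimization loop: one check per generation, with a fake
# clock that advances 10 seconds per generation.
ticks = iter(range(0, 100, 10))
throttle = CheckpointThrottle(min_interval=30.0, clock=lambda: next(ticks))
saves = [throttle.should_save() for _ in range(10)]
print(saves)  # saves happen on generations 0, 3, 6, 9 (every 30 simulated seconds)
```

With a real `time.monotonic` clock, the optimizer would simply call `should_save()` once per generation and write the current best pipeline whenever it returns True.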

<strong>early_stop</strong>: integer, optional (default: None)
<blockquote>
The number of generations TPOT waits for an improvement in the optimization process.
<br /><br />
TPOT ends the optimization process if no improvement occurs within the given number of generations.
</blockquote>
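The early-stopping rule can be sketched as a simple counter over per-generation best scores. This is a minimal illustration of the idea, not TPOT's internal code:

```python
def run_with_early_stop(scores_per_generation, early_stop):
    """Iterate over per-generation best scores and stop after
    `early_stop` consecutive generations without improvement.
    Returns the number of generations actually run."""
    best = float('-inf')
    stale = 0  # generations since the last improvement
    for gen, score in enumerate(scores_per_generation, start=1):
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
        if stale >= early_stop:
            break
    return gen

# Improvement stops after generation 4; with early_stop=3 the run
# ends after generation 7 (three stale generations in a row).
print(run_with_early_stop([0.81, 0.84, 0.84, 0.86, 0.86, 0.86, 0.86, 0.86],
                          early_stop=3))  # → 7
```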

<strong>verbosity</strong>: integer, optional (default=0)
<blockquote>
How much information TPOT communicates while it's running.
@@ -700,6 +708,7 @@ <h1 id="regression">Regression</h1>
<li>Python dictionary, TPOT will use your custom configuration,</li>
<li>string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or</li>
<li>string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or</li>
<li>string 'TPOT sparse', TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or</li>
<li>None, TPOT will use the default TPOTRegressor configuration.</li>
</ul>
See the <a href="../using/#built-in-tpot-configurations">built-in configurations</a> section for the list of configurations included with TPOT, and the <a href="../using/#customizing-tpots-operators-and-parameters">custom configuration</a> section for more information and examples of how to create your own TPOT configurations.
@@ -714,16 +723,23 @@ <h1 id="regression">Regression</h1>

<strong>periodic_checkpoint_folder</strong>: path string, optional (default: None)
<blockquote>
If supplied, a folder in which tpot will periodically save the best pipeline so far while optimizing.<br /><br />
If supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing.<br /><br />
Currently once per generation but not more often than once per 30 seconds.<br /><br />
Useful in multiple cases:
<ul>
<li>Sudden death before tpot could save optimized pipeline</li>
<li>Sudden death before TPOT could save optimized pipeline</li>
<li>Track its progress</li>
<li>Grab pipelines while it's still optimizing</li>
</ul>
</blockquote>

<strong>early_stop</strong>: integer, optional (default: None)
<blockquote>
The number of generations TPOT waits for an improvement in the optimization process.
<br /><br />
TPOT ends the optimization process if no improvement occurs within the given number of generations.
</blockquote>

<strong>verbosity</strong>: integer, optional (default=0)
<blockquote>
How much information TPOT communicates while it's running.
2 changes: 1 addition & 1 deletion docs/index.html
@@ -210,5 +210,5 @@

<!--
MkDocs version : 0.16.3
Build Date UTC : 2017-07-18 21:14:49
Build Date UTC : 2017-09-27 17:00:29
-->
27 changes: 16 additions & 11 deletions docs/mkdocs/search_index.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/related/index.html
@@ -135,7 +135,7 @@
<div role="main">
<div class="section">

<p>Other automated machine-learning tools:</p>
<p>Other Automated Machine Learning (AutoML) tools and related projects:</p>
<table>
<tr>
<th width="20%">Name</th>
41 changes: 40 additions & 1 deletion docs/releases/index.html
@@ -82,6 +82,9 @@
<a class="current" href="./">Release Notes</a>
<ul class="subnav">

<li class="toctree-l2"><a href="#version-09">Version 0.9</a></li>


<li class="toctree-l2"><a href="#version-08">Version 0.8</a></li>


@@ -159,7 +162,43 @@
<div role="main">
<div class="section">

<h1 id="version-08">Version 0.8</h1>
<h1 id="version-09">Version 0.9</h1>
<ul>
<li>
<p><strong>TPOT now supports sparse matrices</strong> with a new built-in TPOT configuration, "TPOT sparse". We are using a custom OneHotEncoder implementation that supports missing values and continuous features.</p>
</li>
<li>
<p>We have added an "early stopping" option for stopping the optimization process if no improvement is made within a set number of generations. Look up the <code>early_stop</code> parameter to access this functionality.</p>
</li>
<li>
<p>TPOT now reduces the number of duplicated pipelines between generations, which saves you time during the optimization process.</p>
</li>
<li>
<p>TPOT now supports custom scoring functions in command-line mode.</p>
</li>
<li>
<p>We have added a new optional argument, <code>periodic_checkpoint_folder</code>, that allows TPOT to periodically save the best pipeline so far to a local folder during the optimization process.</p>
</li>
<li>
<p>TPOT no longer uses <code>sklearn.externals.joblib</code> when <code>n_jobs=1</code> to avoid the potential freezing issue <a href="http://scikit-learn.org/stable/faq.html#why-do-i-sometime-get-a-crash-freeze-with-n-jobs-1-under-osx-or-linux">that scikit-learn suffers from</a>.</p>
</li>
<li>
<p>We have added <code>pandas</code> as a dependency to read input datasets instead of <code>numpy.recfromcsv</code>. NumPy's <code>recfromcsv</code> function is unable to parse datasets with complex data types.</p>
</li>
<li>
<p>Fixed a bug where <code>DEFAULT</code> in the parameters of a nested estimator raised a <code>KeyError</code> when exporting pipelines.</p>
</li>
<li>
<p>Fixed a bug related to setting <code>random_state</code> in nested estimators. The issue occurred in pipelines with <code>SelectFromModel</code> (with <code>ExtraTreesClassifier</code> as the nested estimator) or <code>StackingEstimator</code> when the nested estimator has a <code>random_state</code> parameter.</p>
</li>
<li>
<p>Fixed a bug in the missing value imputation function in TPOT to impute along columns instead of rows.</p>
</li>
<li>
<p>Refined input checking for sparse matrices in TPOT.</p>
</li>
</ul>
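The first item above mentions a one-hot encoder that tolerates missing values. The core idea, giving missing values their own indicator column, can be sketched in plain Python; this is purely an illustration, and TPOT's actual OneHotEncoder implementation differs:

```python
def one_hot_encode(column, missing=None):
    """One-hot encode a single categorical column, reserving a
    dedicated indicator column for missing values."""
    categories = sorted({v for v in column if v is not missing}, key=str)
    categories.append(missing)  # missing values get their own column
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in column]

# Columns: 'a', 'b', missing.
print(one_hot_encode(['a', 'b', None, 'a']))
# → [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

The resulting rows are mostly zeros, which is why one-hot encoded data is a natural fit for the sparse-matrix operators in the "TPOT sparse" configuration.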
<h1 id="version-08">Version 0.8</h1>
<ul>
<li>
<p><strong>TPOT now detects whether there are missing values in your dataset</strong> and replaces them with the median value of the column.</p>
20 changes: 10 additions & 10 deletions docs/sitemap.xml
@@ -4,79 +4,79 @@

<url>
<loc>http://rhiever.github.io/tpot/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/installing/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/using/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/api/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/examples/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/contributing/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/releases/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/citing/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/support/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>



<url>
<loc>http://rhiever.github.io/tpot/related/</loc>
<lastmod>2017-07-18</lastmod>
<lastmod>2017-09-27</lastmod>
<changefreq>daily</changefreq>
</url>

79 changes: 46 additions & 33 deletions docs/using/index.html
@@ -80,7 +80,7 @@
<li class="toctree-l2"><a href="#customizing-tpots-operators-and-parameters">Customizing TPOT's operators and parameters</a></li>


<li class="toctree-l2"><a href="#customizing-tpots-starting-population">Customizing TPOT's starting population</a></li>
<li class="toctree-l2"><a href="#crashfreeze-issue-with-n_jobs-1-under-osx-or-linux">Crash/freeze issue with n_jobs &gt; 1 under OSX or Linux</a></li>


</ul>
@@ -345,7 +345,7 @@ <h1 id="tpot-on-the-command-line">TPOT on the command line</h1>
<tr>
<td>-cv</td>
<td>CV</td>
<td>Any integer >1</td>
<td>Any integer > 1</td>
<td>Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.</td>
</tr>
<td>-sub</td>
@@ -386,6 +386,21 @@ <h1 id="tpot-on-the-command-line">TPOT on the command line</h1>
Set this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.</td>
</tr>
<tr>
<td>-config</td>
<td>CONFIG_FILE</td>
<td>String or file path</td>
<td>Operators and parameter configurations in TPOT:
<br /><br />
<ul>
<li>Path for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process</li>
<li>string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors</li>
<li>string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies</li>
<li>string 'TPOT sparse', TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.</li>
</ul>
See the <a href="../using/#built-in-tpot-configurations">built-in configurations</a> section for the list of configurations included with TPOT, and the <a href="../using/#customizing-tpots-operators-and-parameters">custom configuration</a> section for more information and examples of how to create your own TPOT configurations.
</td>
</tr>
<tr>
<td>-cf</td>
<td>CHECKPOINT_FOLDER</td>
<td>Folder path</td>
@@ -406,6 +421,15 @@ <h1 id="tpot-on-the-command-line">TPOT on the command line</h1>
-cf ./my_checkpoints
</tr>
<tr>
<td>-es</td>
<td>EARLY_STOP</td>
<td>Any positive integer</td>
<td>
The number of generations TPOT waits for an improvement in the optimization process.
<br /><br />
TPOT ends the optimization process if no improvement occurs within the set number of generations.</td>
</tr>
<tr>
<td>-v</td>
<td>VERBOSITY</td>
<td>{0, 1, 2, 3}</td>
@@ -499,6 +523,17 @@ <h1 id="built-in-tpot-configurations">Built-in TPOT configurations</h1>
<br /><br />
<a href="https://github.com/rhiever/tpot/blob/master/tpot/config/regressor_mdr.py">Regression</a></td>
</tr>

<tr>
<td>TPOT sparse</td>
<td>TPOT uses a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.
<br /><br />
This configuration works for both the TPOTClassifier and TPOTRegressor.</td>
<td align="center"><a href="https://github.com/rhiever/tpot/blob/master/tpot/config/classifier_sparse.py">Classification</a>
<br /><br />
<a href="https://github.com/rhiever/tpot/blob/master/tpot/config/regressor_sparse.py">Regression</a></td>
</tr>

</table>

<p>To use any of these configurations, simply pass the string name of the configuration to the <code>config_dict</code> parameter (or <code>-config</code> on the command line). For example, to use the "TPOT light" configuration:</p>
@@ -576,42 +611,20 @@ <h1 id="customizing-tpots-operators-and-parameters">Customizing TPOT's operators
<p>When using the command-line interface, the configuration file specified in the <code>-config</code> parameter <em>must</em> name its custom TPOT configuration <code>tpot_config</code>. Otherwise, TPOT will not be able to locate the configuration dictionary.</p>
<p>For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for <a href="https://github.com/rhiever/tpot/blob/master/tpot/config/classifier.py">classification</a> and <a href="https://github.com/rhiever/tpot/blob/master/tpot/config/regressor.py">regression</a> in TPOT's source code.</p>
<p>Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it considers.</p>
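A custom configuration dictionary of this shape is plain Python data, so it can be built and inspected before handing it to TPOT. The dictionary below is a small hypothetical example in the documented format (full estimator import path mapped to a hyperparameter grid), and the `resolve` helper is an assumption for illustration, not part of TPOT's API:

```python
import importlib

# Hypothetical custom configuration: two naive Bayes variants.
tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {
        # no hyperparameters to tune
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False],
    },
}

def resolve(path):
    """Resolve a full import path like the dictionary keys above
    into the class it names."""
    module_name, class_name = path.rsplit('.', 1)
    return getattr(importlib.import_module(module_name), class_name)

print(sorted(tpot_config))
```

Because the keys are full import paths, an operator class can be loaded dynamically (e.g. `resolve('sklearn.naive_bayes.GaussianNB')`), which is also why every listed package must actually be installed for its operators to be usable.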
<h1 id="customizing-tpots-starting-population">Customizing TPOT's starting population</h1>
<p>TPOT allows for the initial population of pipelines to be seeded. This can be done either through the <code>population_seeds</code> parameter in the TPOT constructor, or through a <code>population_seeds</code> attribute in a custom config file.</p>
<pre><code class="Python">population_seeds = [
    'BernoulliNB(GaussianNB(input_matrix), BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=False)',
    'BernoulliNB(input_matrix, BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)'
]

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      config_dict=tpot_config, population_seeds=population_seeds)
</code></pre>

<p>If specified through a config file, your config file would look like this:</p>
<pre><code class="Python">tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {
    },

    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },

    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}

population_seeds = [
    'BernoulliNB(GaussianNB(input_matrix), BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=False)',
    'BernoulliNB(input_matrix, BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)'
]
</code></pre>

<p>As with <code>tpot_config</code>, when using a custom config file the seeds <em>must</em> have the standardized name of "population_seeds". It should only ever be a list of strings.</p>
<p>If fewer seeds are provided than there are to be individuals in the entire population, then the remainder will be filled with random individuals.</p>
<p>If the <code>population_seeds</code> parameter is provided along with seeds from a configuration file, the configuration file's seeds will take precedence.</p>
<h1 id="crashfreeze-issue-with-n_jobs-1-under-osx-or-linux">Crash/freeze issue with n_jobs &gt; 1 under OSX or Linux</h1>
<p>TPOT supports parallel computing for speeding up the optimization process, but it may crash/freeze with n_jobs &gt; 1 under OSX or Linux <a href="http://scikit-learn.org/stable/faq.html#why-do-i-sometime-get-a-crash-freeze-with-n-jobs-1-under-osx-or-linux">as scikit-learn does</a>, especially with large datasets.</p>
<p>One solution is to configure Python's <code>multiprocessing</code> module to use the <code>forkserver</code> start method (instead of the default <code>fork</code>) to manage the process pools. You can enable the <code>forkserver</code> mode globally for your program by putting the following code into your main script:</p>
<pre><code class="Python">import multiprocessing

# other imports, custom code, load data, define model...

if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')

    # call scikit-learn utils or tpot utils with n_jobs &gt; 1 here
</code></pre>

<p>More information about these start methods can be found in the <a href="https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods">multiprocessing documentation</a>.</p>

</div>
</div>