Python dictionary, TPOT will use your custom configuration,
string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or
string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or
+string 'TPOT sparse', TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or
None, TPOT will use the default TPOTClassifier configuration.
See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
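For instance, a built-in configuration can be selected by passing its name as a string. The sketch below mirrors the 'TPOT light' example from the usage documentation; the generations and population_size values are illustrative:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Search only the fast models and preprocessors in the 'TPOT light' configuration
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      config_dict='TPOT light')
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))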
@@ -278,6 +286,25 @@
Classification
Setting warm_start=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.
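A minimal sketch of that workflow (the dataset and run lengths are illustrative): with warm_start=True, a second call to fit reuses the population from the first call instead of starting over.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

tpot = TPOTClassifier(generations=2, population_size=20,
                      warm_start=True, verbosity=2)
tpot.fit(X, y)  # short first run; check the results so far
tpot.fit(X, y)  # resumes the search from the previous population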
+periodic_checkpoint_folder: path string, optional (default: None)
+
+If supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing.
+Currently this happens once per generation, but no more often than once per 30 seconds.
+Useful in multiple cases:
+
+Sudden death before TPOT could save an optimized pipeline
+Progress tracking
+Grabbing a pipeline while TPOT is working
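For example, checkpointing can be enabled from code as follows (a sketch; the folder name is illustrative, and the folder should already exist, as the command-line documentation notes):

import os

from tpot import TPOTClassifier

os.makedirs('my_checkpoints', exist_ok=True)  # create the illustrative checkpoint folder

tpot = TPOTClassifier(generations=50, population_size=50, verbosity=2,
                      periodic_checkpoint_folder='my_checkpoints')
# During tpot.fit(X_train, y_train), the best pipeline found so far is
# exported into my_checkpoints/ once per generation, at most every 30 seconds.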
Python dictionary, TPOT will use your custom configuration,
string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or
string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or
+string 'TPOT sparse', TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or
None, TPOT will use the default TPOTRegressor configuration.
See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
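For instance, the 'TPOT sparse' configuration can be paired with a sparse feature matrix. A sketch under illustrative settings; load_diabetes simply provides a small regression dataset to convert:

from scipy import sparse
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75,
                                                    test_size=0.25)

# 'TPOT sparse' restricts the search to operators that accept sparse input
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2,
                     config_dict='TPOT sparse')
tpot.fit(sparse.csr_matrix(X_train), y_train)
print(tpot.score(sparse.csr_matrix(X_test), y_test))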
@@ -691,6 +721,25 @@
Regression
Setting warm_start=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.
+early_stop: integer, optional (default: None)
+
+How many generations TPOT checks for improvement in the optimization process.
+
+TPOT ends the optimization process if there is no improvement within the given number of generations.
+
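A sketch of pairing early stopping with a long run (the values are illustrative):

from tpot import TPOTRegressor

# Allow up to 100 generations, but stop the search if the best score
# has not improved for 5 consecutive generations.
tpot = TPOTRegressor(generations=100, population_size=50,
                     early_stop=5, verbosity=2)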
verbosity: integer, optional (default=0)
How much information TPOT communicates while it's running.
diff --git a/docs/citing/index.html b/docs/citing/index.html
index aa02c20e..bf25e50f 100644
--- a/docs/citing/index.html
+++ b/docs/citing/index.html
@@ -95,6 +95,11 @@
Support
+stopit
+pandas
Most of the necessary Python packages can be installed via the Anaconda Python distribution, which we strongly recommend that you use. We also strongly recommend that you use Python 3 over Python 2 if you're given the choice.
-NumPy, SciPy, and scikit-learn can be installed in Anaconda via the command:
-conda install numpy scipy scikit-learn
+NumPy, SciPy, scikit-learn and pandas can be installed in Anaconda via the command:
+conda install numpy scipy scikit-learn pandas
-DEAP, update_checker, and tqdm can be installed with pip via the command:
-pip install deap update_checker tqdm
+DEAP, update_checker, tqdm and stopit can be installed with pip via the command:
+pip install deap update_checker tqdm stopit
For Windows users, the pywin32 module is required if Python is NOT installed via the Anaconda Python distribution; it can be installed with pip:

pip install pywin32
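Once TPOT itself is installed (pip install tpot, per the installation instructions), a quick import check confirms the setup; exposing tpot.__version__ is assumed here, as in most Python packages:

python -c "import tpot; print(tpot.__version__)"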
diff --git a/docs/mkdocs/search_index.json b/docs/mkdocs/search_index.json
index 5d3fbf27..a32e704b 100644
--- a/docs/mkdocs/search_index.json
+++ b/docs/mkdocs/search_index.json
@@ -7,12 +7,12 @@
},
{
"location": "/installing/",
- "text": "TPOT is built on top of several existing Python libraries, including:\n\n\n\n\n\n\nNumPy\n\n\n\n\n\n\nSciPy\n\n\n\n\n\n\nscikit-learn\n\n\n\n\n\n\nDEAP\n\n\n\n\n\n\nupdate_checker\n\n\n\n\n\n\ntqdm\n\n\n\n\n\n\nMost of the necessary Python packages can be installed via the \nAnaconda Python distribution\n, which we strongly recommend that you use. We also strongly recommend that you use of Python 3 over Python 2 if you're given the choice.\n\n\nNumPy, SciPy, and scikit-learn can be installed in Anaconda via the command:\n\n\nconda install numpy scipy scikit-learn\n\n\n\n\nDEAP, update_checker, and tqdm can be installed with \npip\n via the command:\n\n\npip install deap update_checker tqdm\n\n\n\n\nFor the Windows users\n, the pywin32 module is required if Python is NOT installed via the \nAnaconda Python distribution\n and can be installed with \npip\n:\n\n\npip install pywin32\n\n\n\n\nOptionally\n, you can install \nXGBoost\n if you would like TPOT to use the eXtreme Gradient Boosting models. XGBoost is entirely optional, and TPOT will still function normally without XGBoost if you do not have it installed.\n\n\nconda install py-xgboost\n\n\n\n\nIf you have issues installing XGBoost, check the \nXGBoost installation documentation\n.\n\n\nIf you plan to use the \nTPOT-MDR configuration\n, make sure to install \nscikit-mdr\n and \nscikit-rebate\n:\n\n\npip install scikit-mdr skrebate\n\n\n\n\nFinally to install TPOT itself, run the following command:\n\n\npip install tpot\n\n\n\n\nPlease \nfile a new issue\n if you run into installation problems.",
+ "text": "TPOT is built on top of several existing Python libraries, including:\n\n\n\n\n\n\nNumPy\n\n\n\n\n\n\nSciPy\n\n\n\n\n\n\nscikit-learn\n\n\n\n\n\n\nDEAP\n\n\n\n\n\n\nupdate_checker\n\n\n\n\n\n\ntqdm\n\n\n\n\n\n\nstopit\n\n\n\n\n\n\npandas\n\n\n\n\n\n\nMost of the necessary Python packages can be installed via the \nAnaconda Python distribution\n, which we strongly recommend that you use. We also strongly recommend that you use of Python 3 over Python 2 if you're given the choice.\n\n\nNumPy, SciPy, scikit-learn and pandas can be installed in Anaconda via the command:\n\n\nconda install numpy scipy scikit-learn pandas\n\n\n\n\nDEAP, update_checker, tqdm and stopit can be installed with \npip\n via the command:\n\n\npip install deap update_checker tqdm stopit\n\n\n\n\nFor the Windows users\n, the pywin32 module is required if Python is NOT installed via the \nAnaconda Python distribution\n and can be installed with \npip\n:\n\n\npip install pywin32\n\n\n\n\nOptionally\n, you can install \nXGBoost\n if you would like TPOT to use the eXtreme Gradient Boosting models. XGBoost is entirely optional, and TPOT will still function normally without XGBoost if you do not have it installed.\n\n\nconda install py-xgboost\n\n\n\n\nIf you have issues installing XGBoost, check the \nXGBoost installation documentation\n.\n\n\nIf you plan to use the \nTPOT-MDR configuration\n, make sure to install \nscikit-mdr\n and \nscikit-rebate\n:\n\n\npip install scikit-mdr skrebate\n\n\n\n\nFinally to install TPOT itself, run the following command:\n\n\npip install tpot\n\n\n\n\nPlease \nfile a new issue\n if you run into installation problems.",
"title": "Installation"
},
{
"location": "/using/",
- "text": "What to expect from AutoML software\n\n\nAutomated machine learning (AutoML) takes a higher-level approach to machine learning than most practitioners are used to,\nso we've gathered a handful of guidelines on what to expect when running AutoML software such as TPOT.\n\n\nAutoML algorithms aren't intended to run for only a few minutes\n\n\n\nOf course, you \ncan\n run TPOT for only a few minutes and it will find a reasonably good pipeline for your dataset.\nHowever, if you don't run TPOT for very long, it may not find the best pipeline possible for your dataset.\nOften it is worthwhile to run multiple instances of TPOT in parallel for a long time (hours to days) to allow TPOT to thoroughly search\nthe pipeline space for your dataset.\n\n\nAutoML algorithms can take a long time to finish their search\n\n\n\nAutoML algorithms aren't as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms\n(random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling,\nPCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways\nto ensemble or stack the algorithms within the pipeline.\n\n\nAs such, TPOT will take a while to run on larger datasets, but it's important to realize why. With the default TPOT settings\n(100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing.\nTo put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm\nand how long that grid search will take. That is 10,000 model configurations to evaluate with 10-fold cross-validation,\nwhich means that roughly 100,000 models are fit and evaluated on the training data in one grid search.\nThat's a time-consuming procedure, even for simpler models like decision trees.\n\n\nTypical TPOT runs will take hours to days to finish (unless it's a small dataset), but you can always interrupt\nthe run partway through and see the best results so far. TPOT also \nprovides\n a \nwarm_start\n parameter that\nlets you restart a TPOT run from where it left off.\n\n\nAutoML algorithms can recommend different solutions for the same dataset\n\n\n\nIf you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs\nmay result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means\nthat it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different\npipelines, this means that the TPOT runs didn't converge due to lack of time \nor\n that multiple pipelines\nperform more-or-less the same on your dataset.\n\n\nThis is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives\nyou ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you\nmight have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such\nas grid search.\n\n\nTPOT with code\n\n\nWe've taken care to design the TPOT interface to be as similar as possible to scikit-learn.\n\n\nTPOT can be imported just like any regular Python module. 
To import TPOT, type:\n\n\nfrom tpot import TPOTClassifier\n\n\n\n\nthen create an instance of TPOT as follows:\n\n\npipeline_optimizer = TPOTClassifier()\n\n\n\n\nIt's also possible to use TPOT for regression problems with the \nTPOTRegressor\n class. Other than the class name,\na \nTPOTRegressor\n is used the same way as a \nTPOTClassifier\n. You can read more about the \nTPOTClassifier\n and \nTPOTRegressor\n classes in the \nAPI documentation\n.\n\n\nSome example code with custom TPOT parameters might look like:\n\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\n\n\n\n\nNow TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the \nfit\n function:\n\n\npipeline_optimizer.fit(X_train, y_train)\n\n\n\n\nThe \nfit\n function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then\ninitializes the genetic programming algoritm to find the best pipeline based on average k-fold score.\n\n\nYou can then proceed to evaluate the final pipeline on the testing set with the \nscore\n function:\n\n\nprint(pipeline_optimizer.score(X_test, y_test))\n\n\n\n\nFinally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the \nexport\n function:\n\n\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nOnce this code finishes running, \ntpot_exported_pipeline.py\n will contain the Python code for the optimized pipeline.\n\n\nBelow is a full example script using TPOT to optimize a pipeline, score it, and export the best pipeline to a file.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(X_train, y_train)\nprint(pipeline_optimizer.score(X_test, y_test))\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nCheck our \nexamples\n to see TPOT applied to some specific data sets.\n\n\nTPOT on the command line\n\n\nTo use TPOT via the command line, enter the following command with a path to the data file:\n\n\ntpot /path_to/data_file.csv\n\n\n\n\nAn example command-line call to TPOT may look like:\n\n\ntpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2\n\n\n\n\nTPOT offers several arguments that can be provided at the command line. To see brief descriptions of these arguments,\nenter the following command:\n\n\ntpot --help\n\n\n\n\nDetailed descriptions of the command-line arguments are below.\n\n\n\n\n\n\nArgument\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\n-is\n\n\nINPUT_SEPARATOR\n\n\nAny string\n\n\nCharacter used to separate columns in the input file.\n\n\n\n\n\n\n-target\n\n\nTARGET_NAME\n\n\nAny string\n\n\nName of the target column in the input file.\n\n\n\n\n\n\n-mode\n\n\nTPOT_MODE\n\n\n['classification', 'regression']\n\n\nWhether TPOT is being used for a supervised classification or regression problem.\n\n\n\n\n\n\n-o\n\n\nOUTPUT_FILE\n\n\nString path to a file\n\n\nFile to export the code for the final optimized pipeline.\n\n\n\n\n\n\n-g\n\n\nGENERATIONS\n\n\nAny positive integer\n\n\nNumber of iterations to run the pipeline optimization process. 
Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-p\n\n\nPOPULATION_SIZE\n\n\nAny positive integer\n\n\nNumber of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-os\n\n\nOFFSPRING_SIZE\n\n\nAny positive integer\n\n\nNumber of offspring to produce in each GP generation.\n\n\nBy default, OFFSPRING_SIZE = POPULATION_SIZE.\n\n\n\n\n\n\n-mr\n\n\nMUTATION_RATE\n\n\n[0.0, 1.0]\n\n\nGP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\n-xr\n\n\nCROSSOVER_RATE\n\n\n[0.0, 1.0]\n\n\nGP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\n\n\n-scoring\n\n\nSCORING_FN\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',\n'f1',\n'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error',\n'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro',\n'precision_samples', 'precision_weighted',\n'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples',\n'recall_weighted', 'roc_auc'\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized.\n\n\nSee the section on \nscoring functions\n for more details.\n\n\n\n\n\n\n-cv\n\n\nCV\n\n\nAny integer >1\n\n\nNumber of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.\n\n\n\n\n-sub\n\n\nSUBSAMPLE\n\n\n(0.0, 1.0]\n\n\nSubsample ratio of the training instance. 
Setting it to 0.5 means that TPOT randomly collects half of training samples for pipeline optimization process.\n\n\n\n\n\n\n-njobs\n\n\nNUM_JOBS\n\n\nAny positive integer or -1\n\n\nNumber of CPUs for evaluating pipelines in parallel during the TPOT optimization process.\n\n\nAssigning this to -1 will use as many cores as available on the computer.\n\n\n\n\n\n\n-maxtime\n\n\nMAX_TIME_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time.\n\n\n\n\n\n\n-maxeval\n\n\nMAX_EVAL_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to consider more complex pipelines but will also allow TPOT to run longer.\n\n\n\n\n\n\n-s\n\n\nRANDOM_STATE\n\n\nAny positive integer\n\n\nRandom number generator seed for reproducibility.\n\n\nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.\n\n\n\n\n\n\n-config\n\n\nCONFIG_FILE\n\n\nFile path or string\n\n\nA path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process.\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\n\n\n-v\n\n\nVERBOSITY\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it is running.\n\n\n0 = none, 1 = minimal, 2 = high, 3 = all.\n\n\nA setting of 2 or higher will add a progress bar during the optimization procedure.\n\n\n\n\n\n\n--no-update-check\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n--version\n\n\nShow TPOT's version number and exit.\n\n\n\n\n\n\n--help\n\n\nShow TPOT's help documentation and exit.\n\n\n\n\n\n\n\nScoring functions\n\n\nTPOT makes use of \nsklearn.model_selection.cross_val_score\n for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:\n\n\n\n\n\n\nYou can pass in a string to the \nscoring\n parameter from the list above. Any other strings will cause TPOT to throw an exception.\n\n\n\n\n\n\nYou can pass a function with the signature \nscorer(y_true, y_pred)\n, where \ny_true\n are the true target values and \ny_pred\n are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.\n\n\n\n\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ndef my_custom_accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=my_custom_accuracy)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nBuilt-in TPOT configurations\n\n\nTPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. 
Below is a list of the current built-in configurations that come with TPOT.\n\n\n\n\n\n\nConfiguration Name\n\n\nDescription\n\n\nOperators\n\n\n\n\n\n\n\nDefault TPOT\n\n\nTPOT will search over a broad range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Some of these operators are complex and may take a long time to run, especially on larger datasets.\n\n\n\nNote: This is the default configuration for TPOT.\n To use this configuration, use the default value (None) for the config_dict parameter.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT light\n\n\nTPOT will search over a restricted range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Only simpler and fast-running operators will be used in these pipelines, so TPOT light is useful for finding quick and simple pipelines for a classification or regression problem.\n\n\nThis configuration works for both the TPOTClassifier and TPOTRegressor.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT MDR\n\n\nTPOT will search over a series of feature selectors and \nMultifactor Dimensionality Reduction\n models to find a series of operators that maximize prediction accuracy. The TPOT MDR configuration is specialized for \ngenome-wide association studies (GWAS)\n, and is described in detail online \nhere\n.\n\n\nNote that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTo use any of these configurations, simply pass the string name of the configuration to the \nconfig_dict\n parameter (or \n-config\n on the command line). For example, to use the \"TPOT light\" configuration:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict='TPOT light')\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\n\nCustomizing TPOT's operators and parameters\n\n\nBeyond the default configurations that come with TPOT, in some cases it is useful to limit the algorithms and parameters that TPOT considers. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.\n\n\nThe custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., \nsklearn.naive_bayes.MultinomialNB\n) and the second level key is the corresponding parameter name for that operator (e.g., \nfit_prior\n). 
The second level key should point to a list of parameter values for that parameter, e.g., \n'fit_prior': [True, False]\n.\n\n\nFor a simple example, the configuration could be:\n\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\n\n\n\nin which case TPOT would only consider pipelines containing \nGaussianNB\n, \nBernoulliNB\n, \nMultinomialNB\n, and tune those algorithm's parameters in the ranges provided. This dictionary can be passed directly within the code to the \nTPOTClassifier\n/\nTPOTRegressor\n \nconfig_dict\n parameter, described above. For example:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict=tpot_config)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nCommand-line users must create a separate \n.py\n file with the custom configuration and provide the path to the file to the \ntpot\n call. For example, if the simple example configuration above is saved in \ntpot_classifier_config.py\n, that configuration could be used on the command line with the command:\n\n\ntpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py\n\n\n\n\nWhen using the command-line interface, the configuration file specified in the \n-config\n parameter \nmust\n name its custom TPOT configuration \ntpot_config\n. Otherwise, TPOT will not be able to locate the configuration dictionary.\n\n\nFor more detailed examples of how to customize TPOT's operator configuration, see the default configurations for \nclassification\n and \nregression\n in TPOT's source code.\n\n\nNote that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it considers.",
+ "text": "What to expect from AutoML software\n\n\nAutomated machine learning (AutoML) takes a higher-level approach to machine learning than most practitioners are used to,\nso we've gathered a handful of guidelines on what to expect when running AutoML software such as TPOT.\n\n\nAutoML algorithms aren't intended to run for only a few minutes\n\n\n\nOf course, you \ncan\n run TPOT for only a few minutes and it will find a reasonably good pipeline for your dataset.\nHowever, if you don't run TPOT for very long, it may not find the best pipeline possible for your dataset.\nOften it is worthwhile to run multiple instances of TPOT in parallel for a long time (hours to days) to allow TPOT to thoroughly search\nthe pipeline space for your dataset.\n\n\nAutoML algorithms can take a long time to finish their search\n\n\n\nAutoML algorithms aren't as simple as fitting one model on the dataset; they are considering multiple machine learning algorithms\n(random forests, linear models, SVMs, etc.) in a pipeline with multiple preprocessing steps (missing value imputation, scaling,\nPCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways\nto ensemble or stack the algorithms within the pipeline.\n\n\nAs such, TPOT will take a while to run on larger datasets, but it's important to realize why. With the default TPOT settings\n(100 generations with 100 population size), TPOT will evaluate 10,000 pipeline configurations before finishing.\nTo put this number into context, think about a grid search of 10,000 hyperparameter combinations for a machine learning algorithm\nand how long that grid search will take. That is 10,000 model configurations to evaluate with 10-fold cross-validation,\nwhich means that roughly 100,000 models are fit and evaluated on the training data in one grid search.\nThat's a time-consuming procedure, even for simpler models like decision trees.\n\n\nTypical TPOT runs will take hours to days to finish (unless it's a small dataset), but you can always interrupt\nthe run partway through and see the best results so far. TPOT also \nprovides\n a \nwarm_start\n parameter that\nlets you restart a TPOT run from where it left off.\n\n\nAutoML algorithms can recommend different solutions for the same dataset\n\n\n\nIf you're working with a reasonably complex dataset or run TPOT for a short amount of time, different TPOT runs\nmay result in different pipeline recommendations. TPOT's optimization algorithm is stochastic in nature, which means\nthat it uses randomness (in part) to search the possible pipeline space. When two TPOT runs recommend different\npipelines, this means that the TPOT runs didn't converge due to lack of time \nor\n that multiple pipelines\nperform more-or-less the same on your dataset.\n\n\nThis is actually an advantage over fixed grid search techniques: TPOT is meant to be an assistant that gives\nyou ideas on how to solve a particular machine learning problem by exploring pipeline configurations that you\nmight have never considered, then leaves the fine-tuning to more constrained parameter tuning techniques such\nas grid search.\n\n\nTPOT with code\n\n\nWe've taken care to design the TPOT interface to be as similar as possible to scikit-learn.\n\n\nTPOT can be imported just like any regular Python module. 
To import TPOT, type:\n\n\nfrom tpot import TPOTClassifier\n\n\n\n\nthen create an instance of TPOT as follows:\n\n\npipeline_optimizer = TPOTClassifier()\n\n\n\n\nIt's also possible to use TPOT for regression problems with the \nTPOTRegressor\n class. Other than the class name,\na \nTPOTRegressor\n is used the same way as a \nTPOTClassifier\n. You can read more about the \nTPOTClassifier\n and \nTPOTRegressor\n classes in the \nAPI documentation\n.\n\n\nSome example code with custom TPOT parameters might look like:\n\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\n\n\n\n\nNow TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the \nfit\n function:\n\n\npipeline_optimizer.fit(X_train, y_train)\n\n\n\n\nThe \nfit\n function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then\ninitializes the genetic programming algoritm to find the best pipeline based on average k-fold score.\n\n\nYou can then proceed to evaluate the final pipeline on the testing set with the \nscore\n function:\n\n\nprint(pipeline_optimizer.score(X_test, y_test))\n\n\n\n\nFinally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the \nexport\n function:\n\n\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nOnce this code finishes running, \ntpot_exported_pipeline.py\n will contain the Python code for the optimized pipeline.\n\n\nBelow is a full example script using TPOT to optimize a pipeline, score it, and export the best pipeline to a file.\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\npipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,\n random_state=42, verbosity=2)\npipeline_optimizer.fit(X_train, y_train)\nprint(pipeline_optimizer.score(X_test, y_test))\npipeline_optimizer.export('tpot_exported_pipeline.py')\n\n\n\n\nCheck our \nexamples\n to see TPOT applied to some specific data sets.\n\n\nTPOT on the command line\n\n\nTo use TPOT via the command line, enter the following command with a path to the data file:\n\n\ntpot /path_to/data_file.csv\n\n\n\n\nAn example command-line call to TPOT may look like:\n\n\ntpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2\n\n\n\n\nTPOT offers several arguments that can be provided at the command line. To see brief descriptions of these arguments,\nenter the following command:\n\n\ntpot --help\n\n\n\n\nDetailed descriptions of the command-line arguments are below.\n\n\n\n\n\n\nArgument\n\n\nParameter\n\n\nValid values\n\n\nEffect\n\n\n\n\n\n\n-is\n\n\nINPUT_SEPARATOR\n\n\nAny string\n\n\nCharacter used to separate columns in the input file.\n\n\n\n\n\n\n-target\n\n\nTARGET_NAME\n\n\nAny string\n\n\nName of the target column in the input file.\n\n\n\n\n\n\n-mode\n\n\nTPOT_MODE\n\n\n['classification', 'regression']\n\n\nWhether TPOT is being used for a supervised classification or regression problem.\n\n\n\n\n\n\n-o\n\n\nOUTPUT_FILE\n\n\nString path to a file\n\n\nFile to export the code for the final optimized pipeline.\n\n\n\n\n\n\n-g\n\n\nGENERATIONS\n\n\nAny positive integer\n\n\nNumber of iterations to run the pipeline optimization process. 
Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-p\n\n\nPOPULATION_SIZE\n\n\nAny positive integer\n\n\nNumber of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.\n\n\n\n\n\n\n-os\n\n\nOFFSPRING_SIZE\n\n\nAny positive integer\n\n\nNumber of offspring to produce in each GP generation.\n\n\nBy default, OFFSPRING_SIZE = POPULATION_SIZE.\n\n\n\n\n\n\n-mr\n\n\nMUTATION_RATE\n\n\n[0.0, 1.0]\n\n\nGP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\n\n\n-xr\n\n\nCROSSOVER_RATE\n\n\n[0.0, 1.0]\n\n\nGP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\n\n\n-scoring\n\n\nSCORING_FN\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy',\n'f1',\n'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error',\n'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro',\n'precision_samples', 'precision_weighted',\n'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples',\n'recall_weighted', 'roc_auc', 'my_module.scorer_name*'\n\n\nFunction used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized.\n\n\nmy_module.scorer_name: You can also specify your own function or a full python path to an existing one.\n\n\nSee the section on \nscoring functions\n for more details.\n\n\n\n\n\n\n-cv\n\n\nCV\n\n\nAny integer > 1\n\n\nNumber of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.\n\n\n\n\n-sub\n\n\nSUBSAMPLE\n\n\n(0.0, 1.0]\n\n\nSubsample ratio of the training instance. 
Setting it to 0.5 means that TPOT randomly collects half of training samples for pipeline optimization process.\n\n\n\n\n\n\n-njobs\n\n\nNUM_JOBS\n\n\nAny positive integer or -1\n\n\nNumber of CPUs for evaluating pipelines in parallel during the TPOT optimization process.\n\n\nAssigning this to -1 will use as many cores as available on the computer.\n\n\n\n\n\n\n-maxtime\n\n\nMAX_TIME_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time.\n\n\n\n\n\n\n-maxeval\n\n\nMAX_EVAL_MINS\n\n\nAny positive integer\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to consider more complex pipelines but will also allow TPOT to run longer.\n\n\n\n\n\n\n-s\n\n\nRANDOM_STATE\n\n\nAny positive integer\n\n\nRandom number generator seed for reproducibility.\n\n\nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.\n\n\n\n\n\n\n-config\n\n\nCONFIG_FILE\n\n\nString or file path\n\n\nOperators and parameter configurations in TPOT:\n\n\n\n\n\nPath for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process\n\n\nstring 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors\n\n\nstring 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies\n\n\nstring 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.\n\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\n\n\n\n-cf\n\n\nCHECKPOINT_FOLDER\n\n\nFolder path\n\n\n\nIf supplied, a folder you created, in which tpot will periodically save the best pipeline so far while optimizing.\n\n\nThis is useful in multiple cases:\n\n\n\nsudden death before tpot could save an optimized pipeline\n\n\nprogress tracking\n\n\ngrabbing a pipeline while tpot is working\n\n\n\n\n\nExample:\n\n\nmkdir my_checkpoints\n\n\n-cf ./my_checkpoints\n\n\n\n\n\n-es\n\n\nEARLY_STOP\n\n\nAny positive integer\n\n\n\nHow many generations TPOT checks whether there is no improvement in optimization process.\n\n\nEnd optimization process if there is no improvement in the set number of generations.\n\n\n\n\n\n-v\n\n\nVERBOSITY\n\n\n{0, 1, 2, 3}\n\n\nHow much information TPOT communicates while it is running.\n\n\n0 = none, 1 = minimal, 2 = high, 3 = all.\n\n\nA setting of 2 or higher will add a progress bar during the optimization procedure.\n\n\n\n\n\n\n--no-update-check\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\n\n\n\n\n--version\n\n\nShow TPOT's version number and exit.\n\n\n\n\n\n\n--help\n\n\nShow TPOT's help documentation and exit.\n\n\n\n\n\n\n\nScoring functions\n\n\nTPOT makes use of \nsklearn.model_selection.cross_val_score\n for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:\n\n\n\n\n\n\nYou can pass in a string to the \nscoring\n parameter from the list above. 
Any other strings will cause TPOT to throw an exception.\n\n\n\n\n\n\nYou can pass a function with the signature \nscorer(y_true, y_pred)\n, where \ny_true\n are the true target values and \ny_pred\n are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.\n\n\n\n\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ndef my_custom_accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=my_custom_accuracy)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\n\n\nmy_module.scorer_name\n: you can also use your manual \nscorer(y_true, y_pred)\n function through the command line, just add an argument \n-scoring my_module.scorer\n and TPOT will import your module and take the function from there. TPOT will also include current workdir when importing the module, so you can just put it in the same folder where you are going to run.\nExample: \n-scoring sklearn.metrics.auc\n will use the function auc from sklearn.metrics module.\n\n\n\n\nBuilt-in TPOT configurations\n\n\nTPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT.\n\n\n\n\n\n\nConfiguration Name\n\n\nDescription\n\n\nOperators\n\n\n\n\n\n\n\nDefault TPOT\n\n\nTPOT will search over a broad range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Some of these operators are complex and may take a long time to run, especially on larger datasets.\n\n\n\nNote: This is the default configuration for TPOT.\n To use this configuration, use the default value (None) for the config_dict parameter.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT light\n\n\nTPOT will search over a restricted range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Only simpler and fast-running operators will be used in these pipelines, so TPOT light is useful for finding quick and simple pipelines for a classification or regression problem.\n\n\nThis configuration works for both the TPOTClassifier and TPOTRegressor.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT MDR\n\n\nTPOT will search over a series of feature selectors and \nMultifactor Dimensionality Reduction\n models to find a series of operators that maximize prediction accuracy. 
The TPOT MDR configuration is specialized for \ngenome-wide association studies (GWAS)\n, and is described in detail online \nhere\n.\n\n\nNote that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\nTPOT sparse\n\n\nTPOT uses a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.\n\n\nThis configuration works for both the TPOTClassifier and TPOTRegressor.\n\n\nClassification\n\n\n\n\nRegression\n\n\n\n\n\n\n\n\nTo use any of these configurations, simply pass the string name of the configuration to the \nconfig_dict\n parameter (or \n-config\n on the command line). For example, to use the \"TPOT light\" configuration:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict='TPOT light')\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\n\nCustomizing TPOT's operators and parameters\n\n\nBeyond the default configurations that come with TPOT, in some cases it is useful to limit the algorithms and parameters that TPOT considers. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.\n\n\nThe custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., \nsklearn.naive_bayes.MultinomialNB\n) and the second level key is the corresponding parameter name for that operator (e.g., \nfit_prior\n). The second level key should point to a list of parameter values for that parameter, e.g., \n'fit_prior': [True, False]\n.\n\n\nFor a simple example, the configuration could be:\n\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\n\n\n\nin which case TPOT would only consider pipelines containing \nGaussianNB\n, \nBernoulliNB\n, \nMultinomialNB\n, and tune those algorithm's parameters in the ranges provided. This dictionary can be passed directly within the code to the \nTPOTClassifier\n/\nTPOTRegressor\n \nconfig_dict\n parameter, described above. 
For example:\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict=tpot_config)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nCommand-line users must create a separate \n.py\n file with the custom configuration and provide the path to the file to the \ntpot\n call. For example, if the simple example configuration above is saved in \ntpot_classifier_config.py\n, that configuration could be used on the command line with the command:\n\n\ntpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py\n\n\n\n\nWhen using the command-line interface, the configuration file specified in the \n-config\n parameter \nmust\n name its custom TPOT configuration \ntpot_config\n. Otherwise, TPOT will not be able to locate the configuration dictionary.\n\n\nFor more detailed examples of how to customize TPOT's operator configuration, see the default configurations for \nclassification\n and \nregression\n in TPOT's source code.\n\n\nNote that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it considers.\n\n\nCrash/freeze issue with n_jobs > 1 under OSX or Linux\n\n\nTPOT supports parallel computing for speeding up the optimization process, but it may crash/freeze with n_jobs > 1 under OSX or Linux \nas scikit-learn does\n, especially with large datasets.\n\n\nOne solution is to configure Python's \nmultiprocessing\n module to use the \nforkserver\n start method (instead of the default \nfork\n) to manage the process pools. You can enable the \nforkserver\n mode globally for your program by putting the following codes into your main script:\n\n\nimport multiprocessing\n\n# other imports, custom code, load data, define model...\n\nif __name__ == '__main__':\n multiprocessing.set_start_method('forkserver')\n\n # call scikit-learn utils or tpot utils with n_jobs > 1 here\n\n\n\n\nMore information about these start methods can be found in the \nmultiprocessing documentation\n.",
"title": "Using TPOT"
},
{
@@ -27,17 +27,17 @@
},
{
"location": "/using/#tpot-on-the-command-line",
- "text": "To use TPOT via the command line, enter the following command with a path to the data file: tpot /path_to/data_file.csv An example command-line call to TPOT may look like: tpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2 TPOT offers several arguments that can be provided at the command line. To see brief descriptions of these arguments,\nenter the following command: tpot --help Detailed descriptions of the command-line arguments are below. Argument Parameter Valid values Effect -is INPUT_SEPARATOR Any string Character used to separate columns in the input file. -target TARGET_NAME Any string Name of the target column in the input file. -mode TPOT_MODE ['classification', 'regression'] Whether TPOT is being used for a supervised classification or regression problem. -o OUTPUT_FILE String path to a file File to export the code for the final optimized pipeline. -g GENERATIONS Any positive integer Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. -p POPULATION_SIZE Any positive integer Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. -os OFFSPRING_SIZE Any positive integer Number of offspring to produce in each GP generation. \nBy default, OFFSPRING_SIZE = POPULATION_SIZE. -mr MUTATION_RATE [0.0, 1.0] GP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. -xr CROSSOVER_RATE [0.0, 1.0] GP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. -scoring SCORING_FN 'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1',\n'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error',\n'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro',\n'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples',\n'recall_weighted', 'roc_auc' Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. \nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. \nSee the section on scoring functions for more details. -cv CV Any integer >1 Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process. -sub SUBSAMPLE (0.0, 1.0] Subsample ratio of the training instance. Setting it to 0.5 means that TPOT randomly collects half of training samples for pipeline optimization process. -njobs NUM_JOBS Any positive integer or -1 Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. 
\nAssigning this to -1 will use as many cores as available on the computer. -maxtime MAX_TIME_MINS Any positive integer How many minutes TPOT has to optimize the pipeline. \nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time. -maxeval MAX_EVAL_MINS Any positive integer How many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to consider more complex pipelines but will also allow TPOT to run longer. -s RANDOM_STATE Any positive integer Random number generator seed for reproducibility. \nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future. -config CONFIG_FILE File path or string A path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process. \nSee the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations. -v VERBOSITY {0, 1, 2, 3} How much information TPOT communicates while it is running. \n0 = none, 1 = minimal, 2 = high, 3 = all. \nA setting of 2 or higher will add a progress bar during the optimization procedure. --no-update-check Flag indicating whether the TPOT version checker should be disabled. --version Show TPOT's version number and exit. --help Show TPOT's help documentation and exit.",
+ "text": "To use TPOT via the command line, enter the following command with a path to the data file: tpot /path_to/data_file.csv An example command-line call to TPOT may look like: tpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2 TPOT offers several arguments that can be provided at the command line. To see brief descriptions of these arguments,\nenter the following command: tpot --help Detailed descriptions of the command-line arguments are below. Argument Parameter Valid values Effect -is INPUT_SEPARATOR Any string Character used to separate columns in the input file. -target TARGET_NAME Any string Name of the target column in the input file. -mode TPOT_MODE ['classification', 'regression'] Whether TPOT is being used for a supervised classification or regression problem. -o OUTPUT_FILE String path to a file File to export the code for the final optimized pipeline. -g GENERATIONS Any positive integer Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. -p POPULATION_SIZE Any positive integer Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. \nTPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total. -os OFFSPRING_SIZE Any positive integer Number of offspring to produce in each GP generation. \nBy default, OFFSPRING_SIZE = POPULATION_SIZE. -mr MUTATION_RATE [0.0, 1.0] GP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. -xr CROSSOVER_RATE [0.0, 1.0] GP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to \"breed\" every generation. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. -scoring SCORING_FN 'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1',\n'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error',\n'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro',\n'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples',\n'recall_weighted', 'roc_auc', 'my_module.scorer_name*' Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. \nTPOT assumes that any function with \"error\" or \"loss\" in the name is meant to be minimized, whereas any other functions will be maximized. \nmy_module.scorer_name: You can also specify your own function or a full python path to an existing one. \nSee the section on scoring functions for more details. -cv CV Any integer > 1 Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process. -sub SUBSAMPLE (0.0, 1.0] Subsample ratio of the training instance. Setting it to 0.5 means that TPOT randomly collects half of training samples for pipeline optimization process. 
-njobs NUM_JOBS Any positive integer or -1 Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. \nAssigning this to -1 will use as many cores as available on the computer. -maxtime MAX_TIME_MINS Any positive integer How many minutes TPOT has to optimize the pipeline. \nIf provided, this setting will override the \"generations\" parameter and allow TPOT to run until it runs out of time. -maxeval MAX_EVAL_MINS Any positive integer How many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to consider more complex pipelines but will also allow TPOT to run longer. -s RANDOM_STATE Any positive integer Random number generator seed for reproducibility. \nSet this seed if you want your TPOT run to be reproducible with the same seed and data set in the future. -config CONFIG_FILE String or file path Operators and parameter configurations in TPOT: Path for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices. \nSee the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations. -cf CHECKPOINT_FOLDER Folder path \nIf supplied, a folder you created, in which tpot will periodically save the best pipeline so far while optimizing. \nThis is useful in multiple cases: sudden death before tpot could save an optimized pipeline progress tracking grabbing a pipeline while tpot is working \nExample: \nmkdir my_checkpoints \n-cf ./my_checkpoints -es EARLY_STOP Any positive integer \nHow many generations TPOT checks whether there is no improvement in optimization process. \nEnd optimization process if there is no improvement in the set number of generations. -v VERBOSITY {0, 1, 2, 3} How much information TPOT communicates while it is running. \n0 = none, 1 = minimal, 2 = high, 3 = all. \nA setting of 2 or higher will add a progress bar during the optimization procedure. --no-update-check Flag indicating whether the TPOT version checker should be disabled. --version Show TPOT's version number and exit. --help Show TPOT's help documentation and exit.",
"title": "TPOT on the command line"
},
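As a quick illustration of the new -cf (checkpoint folder) and -es (early stop) options documented above, a minimal sketch of the equivalent Python API usage follows; the folder name my_checkpoints and the small generation/population values are illustrative assumptions, not defaults.

    # Sketch: roughly equivalent to the command-line call
    #   tpot data/mnist.csv -is , -target class -g 5 -p 20 -cf ./my_checkpoints -es 3 -v 2
    # The folder name and parameter values are illustrative assumptions.
    import os

    from tpot import TPOTClassifier
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                        train_size=0.75, test_size=0.25)

    os.makedirs('my_checkpoints', exist_ok=True)  # the checkpoint folder should already exist

    tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                          periodic_checkpoint_folder='my_checkpoints',
                          early_stop=3)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))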
{
"location": "/using/#scoring-functions",
- "text": "TPOT makes use of sklearn.model_selection.cross_val_score for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT: You can pass in a string to the scoring parameter from the list above. Any other strings will cause TPOT to throw an exception. You can pass a function with the signature scorer(y_true, y_pred) , where y_true are the true target values and y_pred are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation. from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ndef my_custom_accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=my_custom_accuracy)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')",
+ "text": "TPOT makes use of sklearn.model_selection.cross_val_score for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT: You can pass in a string to the scoring parameter from the list above. Any other strings will cause TPOT to throw an exception. You can pass a function with the signature scorer(y_true, y_pred) , where y_true are the true target values and y_pred are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation. from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ndef my_custom_accuracy(y_true, y_pred):\n return float(sum(y_pred == y_true)) / len(y_true)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n scoring=my_custom_accuracy)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py') my_module.scorer_name : you can also use your manual scorer(y_true, y_pred) function through the command line, just add an argument -scoring my_module.scorer and TPOT will import your module and take the function from there. TPOT will also include current workdir when importing the module, so you can just put it in the same folder where you are going to run.\nExample: -scoring sklearn.metrics.auc will use the function auc from sklearn.metrics module.",
"title": "Scoring functions"
},
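A minimal sketch of the my_module.scorer pattern described above; the file name my_module.py and the function body are placeholders for illustration, reusing the accuracy logic from the example.

    # my_module.py -- a custom scorer importable from the command line via
    #   -scoring my_module.scorer
    # (file and function names are placeholders for illustration)
    def scorer(y_true, y_pred):
        # fraction of exactly matching predictions, as in the accuracy example above
        return float(sum(y_pred == y_true)) / len(y_true)

Placing this file in the directory from which TPOT is run is enough for the import to succeed, since TPOT includes the current working directory on the module search path.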
{
"location": "/using/#built-in-tpot-configurations",
- "text": "TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT. Configuration Name Description Operators Default TPOT TPOT will search over a broad range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Some of these operators are complex and may take a long time to run, especially on larger datasets. Note: This is the default configuration for TPOT. To use this configuration, use the default value (None) for the config_dict parameter. Classification Regression TPOT light TPOT will search over a restricted range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Only simpler and fast-running operators will be used in these pipelines, so TPOT light is useful for finding quick and simple pipelines for a classification or regression problem. \nThis configuration works for both the TPOTClassifier and TPOTRegressor. Classification Regression TPOT MDR TPOT will search over a series of feature selectors and Multifactor Dimensionality Reduction models to find a series of operators that maximize prediction accuracy. The TPOT MDR configuration is specialized for genome-wide association studies (GWAS) , and is described in detail online here . \nNote that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets. Classification Regression To use any of these configurations, simply pass the string name of the configuration to the config_dict parameter (or -config on the command line). For example, to use the \"TPOT light\" configuration: from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict='TPOT light')\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')",
+ "text": "TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT. Configuration Name Description Operators Default TPOT TPOT will search over a broad range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Some of these operators are complex and may take a long time to run, especially on larger datasets. Note: This is the default configuration for TPOT. To use this configuration, use the default value (None) for the config_dict parameter. Classification Regression TPOT light TPOT will search over a restricted range of preprocessors, feature constructors, feature selectors, models, and parameters to find a series of operators that minimize the error of the model predictions. Only simpler and fast-running operators will be used in these pipelines, so TPOT light is useful for finding quick and simple pipelines for a classification or regression problem. \nThis configuration works for both the TPOTClassifier and TPOTRegressor. Classification Regression TPOT MDR TPOT will search over a series of feature selectors and Multifactor Dimensionality Reduction models to find a series of operators that maximize prediction accuracy. The TPOT MDR configuration is specialized for genome-wide association studies (GWAS) , and is described in detail online here . \nNote that TPOT MDR may be slow to run because the feature selection routines are computationally expensive, especially on large datasets. Classification Regression TPOT sparse TPOT uses a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices. \nThis configuration works for both the TPOTClassifier and TPOTRegressor. Classification Regression To use any of these configurations, simply pass the string name of the configuration to the config_dict parameter (or -config on the command line). For example, to use the \"TPOT light\" configuration: from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict='TPOT light')\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')",
"title": "Built-in TPOT configurations"
},
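To complement the 'TPOT light' example above, here is a sketch of the new 'TPOT sparse' configuration applied to a sparse feature matrix; the 20-newsgroups text data and the CountVectorizer step are illustrative assumptions, not part of the original example, and any scipy sparse matrix would do.

    # Sketch: the 'TPOT sparse' configuration on a sparse feature matrix.
    from tpot import TPOTClassifier
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    # Illustrative choice of a naturally sparse dataset (downloads on first use)
    news = fetch_20newsgroups(subset='train', categories=['sci.med', 'sci.space'])
    X = CountVectorizer().fit_transform(news.data)  # scipy sparse matrix

    X_train, X_test, y_train, y_test = train_test_split(X, news.target,
                                                        train_size=0.75, test_size=0.25)

    tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                          config_dict='TPOT sparse')
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))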
{
@@ -45,19 +45,24 @@
"text": "Beyond the default configurations that come with TPOT, in some cases it is useful to limit the algorithms and parameters that TPOT considers. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters. The custom TPOT configuration must be in nested dictionary format, where the first level key is the path and name of the operator (e.g., sklearn.naive_bayes.MultinomialNB ) and the second level key is the corresponding parameter name for that operator (e.g., fit_prior ). The second level key should point to a list of parameter values for that parameter, e.g., 'fit_prior': [True, False] . For a simple example, the configuration could be: tpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n} in which case TPOT would only consider pipelines containing GaussianNB , BernoulliNB , MultinomialNB , and tune those algorithm's parameters in the ranges provided. This dictionary can be passed directly within the code to the TPOTClassifier / TPOTRegressor config_dict parameter, described above. For example: from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot_config = {\n 'sklearn.naive_bayes.GaussianNB': {\n },\n\n 'sklearn.naive_bayes.BernoulliNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n },\n\n 'sklearn.naive_bayes.MultinomialNB': {\n 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],\n 'fit_prior': [True, False]\n }\n}\n\ntpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,\n config_dict=tpot_config)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py') Command-line users must create a separate .py file with the custom configuration and provide the path to the file to the tpot call. For example, if the simple example configuration above is saved in tpot_classifier_config.py , that configuration could be used on the command line with the command: tpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py When using the command-line interface, the configuration file specified in the -config parameter must name its custom TPOT configuration tpot_config . Otherwise, TPOT will not be able to locate the configuration dictionary. For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for classification and regression in TPOT's source code. Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import nor use XGBoost in the pipelines it considers.",
"title": "Customizing TPOT's operators and parameters"
},
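For concreteness, a sketch of the tpot_classifier_config.py file referenced in the command above; it contains the same dictionary as the example, bound to the required name tpot_config.

    # tpot_classifier_config.py
    # The -config option requires the dictionary to be named exactly tpot_config.
    tpot_config = {
        'sklearn.naive_bayes.GaussianNB': {
        },

        'sklearn.naive_bayes.BernoulliNB': {
            'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
            'fit_prior': [True, False]
        },

        'sklearn.naive_bayes.MultinomialNB': {
            'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
            'fit_prior': [True, False]
        }
    }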
+ {
+ "location": "/using/#crashfreeze-issue-with-n_jobs-1-under-osx-or-linux",
+ "text": "TPOT supports parallel computing for speeding up the optimization process, but it may crash/freeze with n_jobs > 1 under OSX or Linux as scikit-learn does , especially with large datasets. One solution is to configure Python's multiprocessing module to use the forkserver start method (instead of the default fork ) to manage the process pools. You can enable the forkserver mode globally for your program by putting the following codes into your main script: import multiprocessing\n\n# other imports, custom code, load data, define model...\n\nif __name__ == '__main__':\n multiprocessing.set_start_method('forkserver')\n\n # call scikit-learn utils or tpot utils with n_jobs > 1 here More information about these start methods can be found in the multiprocessing documentation .",
+ "title": "Crash/freeze issue with n_jobs > 1 under OSX or Linux"
+ },
{
"location": "/api/",
- "text": "Classification\n\n\nclass\n tpot.\nTPOTClassifier\n(\ngenerations\n=100, \npopulation_size\n=100,\n \noffspring_size\n=None, \nmutation_rate\n=0.9,\n \ncrossover_rate\n=0.1,\n \nscoring\n='accuracy', \ncv\n=5,\n \nsubsample\n=1.0, \nn_jobs\n=1,\n \nmax_time_mins\n=None, \nmax_eval_time_mins\n=5,\n \nrandom_state\n=None, \nconfig_dict\n=None,\n \nwarm_start\n=False, \nverbosity\n=0,\n \ndisable_update_check\n=False\n)\n\n\n\nsource\n\n\n\nAutomated machine learning for supervised classification tasks.\n\n\nThe TPOTClassifier performs an intelligent search over machine learning pipelines that can contain supervised classification models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the \nscikit-learn API\n.\nThe TPOTClassifier will also search over the hyperparameters of all objects in the pipeline.\n\n\nBy default, TPOTClassifier will search over a broad range of supervised classification algorithms, transformers, and their parameters.\nHowever, the algorithms, transformers, and hyperparameters that the TPOTClassifier searches over can be fully customized using the \nconfig_dict\n parameter.\n\n\nRead more in the \nUser Guide\n.\n\n\n\n\n\n\nParameters:\n\n\n\n\ngenerations\n: int, optional (default=100)\n\n\nNumber of iterations to the run pipeline optimization process. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate \npopulation_size\n + \ngenerations\n \u00d7 \noffspring_size\n pipelines in total.\n\n\n\n\npopulation_size\n: int, optional (default=100)\n\n\nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline.\n\n\n\n\noffspring_size\n: int, optional (default=100)\n\n\nNumber of offspring to produce in each genetic programming generation. Must be a positive number.\n\n\n\n\nmutation_rate\n: float, optional (default=0.9)\n\n\nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\ncrossover_rate\n: float, optional (default=0.1)\n\n\nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\nscoring\n: string or callable, optional (default='accuracy')\n\n\nFunction used to evaluate the quality of a given pipeline for the classification problem. 
The following built-in scoring functions can be used:\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss','precision',\n'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'\n\n\nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature \nscorer(y_true, y_pred)\n. See the section on \nscoring functions\n for more details.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized.\n\n\n\n\ncv\n: int, cross-validation generator, or an iterable, optional (default=5)\n\n\nCross-validation strategy used when evaluating pipelines.\n\n\nPossible inputs:\n\n\n\ninteger, to specify the number of folds in a StratifiedKFold,\n\n\nAn object to be used as a cross-validation generator, or\n\n\nAn iterable yielding train/test splits.\n\n\n\n\n\nsubsample\n: float, optional (default=1.0)\n\n\nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0].\n\n\nSetting \nsubsample\n=0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process.\n\n\n\n\nn_jobs\n: integer, optional (default=1)\n\n\nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process.\n\n\nSetting \nn_jobs\n=-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets\n\n\n\n\nmax_time_mins\n: integer or None, optional (default=None)\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf not None, this setting will override the \ngenerations\n parameter and allow TPOT to run until \nmax_time_mins\n minutes elapse.\n\n\n\n\nmax_eval_time_mins\n: integer, optional (default=5)\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. 
Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines.\n\n\n\n\nrandom_state\n: integer or None, optional (default=None)\n\n\nThe seed of the pseudo random number generator used in TPOT.\n\n\nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.\n\n\n\n\nconfig_dict\n: Python dictionary, string, or None, optional (default=None)\n\n\nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.\n\n\nPossible inputs are:\n\n\n\nPython dictionary, TPOT will use your custom configuration,\n\n\nstring 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or\n\n\nstring 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or\n\n\nNone, TPOT will use the default TPOTClassifier configuration.\n\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\nwarm_start\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT instance will reuse the population from previous calls to \nfit()\n.\n\n\nSetting \nwarm_start\n=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.\n\n\n\n\nverbosity\n: integer, optional (default=0)\n\n\nHow much information TPOT communicates while it's running.\n\n\nPossible inputs are:\n\n\n\n0, TPOT will print nothing,\n\n\n1, TPOT will print minimal information,\n\n\n2, TPOT will print more information and provide a progress bar, or\n\n\n3, TPOT will print everything and provide a progress bar.\n\n\n\n\n\n\n\ndisable_update_check\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\nThe update checker will tell you when a new version of TPOT has been released.\n\n\n\n\n\n\n\n\n\n\nAttributes:\n\n\n\n\nfitted_pipeline_\n: scikit-learn Pipeline object\n\n\nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset.\n\n\n\n\npareto_front_fitted_pipelines_\n: Python dictionary\n\n\nDictionary containing the all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset.\n\n\nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline.\n\n\nNote: \npareto_front_fitted_pipelines_\n is only available when \nverbosity\n=3.\n\n\n\n\nevaluated_individuals_\n: Python dictionary\n\n\nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline).\n\n\nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated.\n\n\n\n\n\n\n\n\n\n\nExample\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n 
train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nFunctions\n\n\n\n\n\n\nfit\n(features, classes[, sample_weight, groups])\n\n\nRun the TPOT optimization process on the given training data.\n\n\n\n\n\n\n\npredict\n(features)\n\n\nUse the optimized pipeline to predict the classes for a feature set.\n\n\n\n\n\n\n\npredict_proba\n(features)\n\n\nUse the optimized pipeline to estimate the class probabilities for a feature set.\n\n\n\n\n\n\n\nscore\n(testing_features, testing_classes)\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\n\n\n\n\n\nexport\n(output_file_name)\n\n\nExport the optimized pipeline as Python code.\n\n\n\n\n\n\n\n\n\nfit(features, classes, sample_weight=None, groups=None)\n\n\n\n\n\nRun the TPOT optimization process on the given training data.\n\n\nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and classes. Performs internal stratified k-fold cross-validaton to avoid overfitting on the provided data.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing \nmedian value imputation\n.\n\n\nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT.\n\n\n\n\nclasses\n: array-like {n_samples}\n\n\nList of class labels for prediction\n\n\n\n\nsample_weight\n: array-like {n_samples}, optional\n\n\nPer-sample weights. Higher weights force TPOT to put more emphasis on those points.\n\n\n\n\ngroups\n: array-like, with shape {n_samples, }, optional\n\n\nGroup labels for the samples used when performing cross-validation.\n\n\nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as \nsklearn.model_selection.GroupKFold\n.\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\nself\n: object\n\n\nReturns a copy of the fitted TPOT object\n\n\n\n\n\n\n\n\n\n\n\n\n\n\npredict(features)\n\n\n\n\n\nUse the optimized pipeline to predict the classes for a feature set.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\npredictions\n: array-like {n_samples}\n\n\nPredicted classes for the samples in the feature matrix\n\n\n\n\n\n\n\n\n\n\n\n\n\n\npredict_proba(features)\n\n\n\n\n\nUse the optimized pipeline to estimate the class probabilities for a feature set.\n\n\nNote: This function will only work for pipelines whose final classifier supports the \npredict_proba\n function. 
TPOT will raise an error otherwise.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\npredictions\n: array-like {n_samples, n_classes}\n\n\nThe class probabilities of the input samples\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nscore(testing_features, testing_classes)\n\n\n\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\nThe default scoring function for TPOTClassifier is 'accuracy'.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\ntesting_features\n: array-like {n_samples, n_features}\n\n\nFeature matrix of the testing set\n\n\n\n\ntesting_classes\n: array-like {n_samples}\n\n\nList of class labels for prediction in the testing set\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\naccuracy_score\n: float\n\n\nThe estimated test set accuracy according to the user-specified scoring function.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nexport(output_file_name)\n\n\n\n\n\nExport the optimized pipeline as Python code.\n\n\nSee the \nusage documentation\n for example usage of the export function.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\noutput_file_name\n: string\n\n\nString containing the path and file name of the desired output file\n\n\n\n\n\n\n\nReturns:\n\n\n\nDoes not return anything\n\n\n\n\n\n\n\n\n\n\nRegression\n\n\nclass\n tpot.\nTPOTRegressor\n(\ngenerations\n=100, \npopulation_size\n=100,\n \noffspring_size\n=None, \nmutation_rate\n=0.9,\n \ncrossover_rate\n=0.1,\n \nscoring\n='neg_mean_squared_error', \ncv\n=5,\n \nsubsample\n=1.0, \nn_jobs\n=1,\n \nmax_time_mins\n=None, \nmax_eval_time_mins\n=5,\n \nrandom_state\n=None, \nconfig_dict\n=None,\n \nwarm_start\n=False, \nverbosity\n=0,\n \ndisable_update_check\n=False\n)\n\n\n\nsource\n\n\n\nAutomated machine learning for supervised regression tasks.\n\n\nThe TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the \nscikit-learn API\n.\nThe TPOTRegressor will also search over the hyperparameters of all objects in the pipeline.\n\n\nBy default, TPOTRegressor will search over a broad range of supervised regression models, transformers, and their hyperparameters.\nHowever, the models, transformers, and parameters that the TPOTRegressor searches over can be fully customized using the \nconfig_dict\n parameter.\n\n\nRead more in the \nUser Guide\n.\n\n\n\n\n\n\nParameters:\n\n\n\n\ngenerations\n: int, optional (default=100)\n\n\nNumber of iterations to the run pipeline optimization process. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate \npopulation_size\n + \ngenerations\n \u00d7 \noffspring_size\n pipelines in total.\n\n\n\n\npopulation_size\n: int, optional (default=100)\n\n\nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline.\n\n\n\n\noffspring_size\n: int, optional (default=100)\n\n\nNumber of offspring to produce in each genetic programming generation. Must be a positive number.\n\n\n\n\nmutation_rate\n: float, optional (default=0.9)\n\n\nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. 
This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\ncrossover_rate\n: float, optional (default=0.1)\n\n\nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\nscoring\n: string or callable, optional (default='neg_mean_squared_error')\n\n\nFunction used to evaluate the quality of a given pipeline for the regression problem. The following built-in scoring functions can be used:\n\n\n'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'r2'\n\n\nNote that we recommend using the \nneg\n version of mean squared error and related metrics so TPOT will minimize (instead of maximize) the metric.\n\n\nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature \nscorer(y_true, y_pred)\n. See the section on \nscoring functions\n for more details.\n\n\nTPOT assumes that any custom scoring function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized.\n\n\n\n\ncv\n: int, cross-validation generator, or an iterable, optional (default=5)\n\n\nCross-validation strategy used when evaluating pipelines.\n\n\nPossible inputs:\n\n\n\ninteger, to specify the number of folds in a KFold,\n\n\nAn object to be used as a cross-validation generator, or\n\n\nAn iterable yielding train/test splits.\n\n\n\n\n\n\n\nsubsample\n: float, optional (default=1.0)\n\n\nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0].\n\n\nSetting \nsubsample\n=0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process.\n\n\n\n\nn_jobs\n: integer, optional (default=1)\n\n\nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process.\n\n\nSetting \nn_jobs\n=-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets\n\n\n\n\nmax_time_mins\n: integer or None, optional (default=None)\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf not None, this setting will override the \ngenerations\n parameter and allow TPOT to run until \nmax_time_mins\n minutes elapse.\n\n\n\n\nmax_eval_time_mins\n: integer, optional (default=5)\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. 
Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines.\n\n\n\n\nrandom_state\n: integer or None, optional (default=None)\n\n\nThe seed of the pseudo random number generator used in TPOT.\n\n\nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.\n\n\n\n\nconfig_dict\n: Python dictionary, string, or None, optional (default=None)\n\n\nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.\n\n\nPossible inputs are:\n\n\n\nPython dictionary, TPOT will use your custom configuration,\n\n\nstring 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or\n\n\nstring 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or\n\n\nNone, TPOT will use the default TPOTRegressor configuration.\n\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\nwarm_start\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT instance will reuse the population from previous calls to \nfit()\n.\n\n\nSetting \nwarm_start\n=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.\n\n\n\n\nverbosity\n: integer, optional (default=0)\n\n\nHow much information TPOT communicates while it's running.\n\n\nPossible inputs are:\n\n\n\n0, TPOT will print nothing,\n\n\n1, TPOT will print minimal information,\n\n\n2, TPOT will print more information and provide a progress bar, or\n\n\n3, TPOT will print everything and provide a progress bar.\n\n\n\n\n\n\n\ndisable_update_check\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\nThe update checker will tell you when a new version of TPOT has been released.\n\n\n\n\n\n\n\n\n\n\nAttributes:\n\n\n\n\nfitted_pipeline_\n: scikit-learn Pipeline object\n\n\nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset.\n\n\n\n\npareto_front_fitted_pipelines_\n: Python dictionary\n\n\nDictionary containing the all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset.\n\n\nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline.\n\n\nNote: \n_pareto_front_fitted_pipelines\n is only available when \nverbosity\n=3.\n\n\n\n\nevaluated_individuals_\n: Python dictionary\n\n\nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline).\n\n\nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated.\n\n\n\n\n\n\n\n\n\n\nExample\n\n\nfrom tpot import TPOTRegressor\nfrom sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_boston()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n 
train_size=0.75, test_size=0.25)\n\ntpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_boston_pipeline.py')\n\n\n\n\nFunctions\n\n\n\n\n\n\nfit\n(features, target[, sample_weight, groups])\n\n\nRun the TPOT optimization process on the given training data.\n\n\n\n\n\n\n\npredict\n(features)\n\n\nUse the optimized pipeline to predict the target values for a feature set.\n\n\n\n\n\n\n\nscore\n(testing_features, testing_target)\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\n\n\n\n\n\nexport\n(output_file_name)\n\n\nExport the optimized pipeline as Python code.\n\n\n\n\n\n\n\n\n\nfit(features, target, sample_weight=None, groups=None)\n\n\n\n\n\nRun the TPOT optimization process on the given training data.\n\n\nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. Performs internal k-fold cross-validaton to avoid overfitting on the provided data.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing \nmedian value imputation\n.\n\n\nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT.\n\n\n\n\ntarget\n: array-like {n_samples}\n\n\nList of target labels for prediction\n\n\n\n\nsample_weight\n: array-like {n_samples}, optional\n\n\nPer-sample weights. 
Higher weights force TPOT to put more emphasis on those points.\n\n\n\n\ngroups\n: array-like, with shape {n_samples, }, optional\n\n\nGroup labels for the samples used when performing cross-validation.\n\n\nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as \nsklearn.model_selection.GroupKFold\n.\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\nself\n: object\n\n\nReturns a copy of the fitted TPOT object\n\n\n\n\n\n\n\n\n\n\n\n\n\n\npredict(features)\n\n\n\n\n\nUse the optimized pipeline to predict the target values for a feature set.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\npredictions\n: array-like {n_samples}\n\n\nPredicted target values for the samples in the feature matrix\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nscore(testing_features, testing_target)\n\n\n\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\nThe default scoring function for TPOTClassifier is 'mean_squared_error'.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\ntesting_features\n: array-like {n_samples, n_features}\n\n\nFeature matrix of the testing set\n\n\n\n\ntesting_target\n: array-like {n_samples}\n\n\nList of target labels for prediction in the testing set\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\naccuracy_score\n: float\n\n\nThe estimated test set accuracy according to the user-specified scoring function.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nexport(output_file_name)\n\n\n\n\n\nExport the optimized pipeline as Python code.\n\n\nSee the \nusage documentation\n for example usage of the export function.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\noutput_file_name\n: string\n\n\nString containing the path and file name of the desired output file\n\n\n\n\n\n\n\nReturns:\n\n\n\nDoes not return anything",
+ "text": "Classification\n\n\nclass\n tpot.\nTPOTClassifier\n(\ngenerations\n=100, \npopulation_size\n=100,\n \noffspring_size\n=None, \nmutation_rate\n=0.9,\n \ncrossover_rate\n=0.1,\n \nscoring\n='accuracy', \ncv\n=5,\n \nsubsample\n=1.0, \nn_jobs\n=1,\n \nmax_time_mins\n=None, \nmax_eval_time_mins\n=5,\n \nrandom_state\n=None, \nconfig_dict\n=None,\n \nwarm_start\n=False,\n \nperiodic_checkpoint_folder\n=None,\n \nverbosity\n=0,\n \ndisable_update_check\n=False\n)\n\n\n\nsource\n\n\n\nAutomated machine learning for supervised classification tasks.\n\n\nThe TPOTClassifier performs an intelligent search over machine learning pipelines that can contain supervised classification models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the \nscikit-learn API\n.\nThe TPOTClassifier will also search over the hyperparameters of all objects in the pipeline.\n\n\nBy default, TPOTClassifier will search over a broad range of supervised classification algorithms, transformers, and their parameters.\nHowever, the algorithms, transformers, and hyperparameters that the TPOTClassifier searches over can be fully customized using the \nconfig_dict\n parameter.\n\n\nRead more in the \nUser Guide\n.\n\n\n\n\n\n\nParameters:\n\n\n\n\ngenerations\n: int, optional (default=100)\n\n\nNumber of iterations to the run pipeline optimization process. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate \npopulation_size\n + \ngenerations\n \u00d7 \noffspring_size\n pipelines in total.\n\n\n\n\npopulation_size\n: int, optional (default=100)\n\n\nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline.\n\n\n\n\noffspring_size\n: int, optional (default=100)\n\n\nNumber of offspring to produce in each genetic programming generation. Must be a positive number.\n\n\n\n\nmutation_rate\n: float, optional (default=0.9)\n\n\nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\ncrossover_rate\n: float, optional (default=0.1)\n\n\nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\nscoring\n: string or callable, optional (default='accuracy')\n\n\nFunction used to evaluate the quality of a given pipeline for the classification problem. 
The following built-in scoring functions can be used:\n\n\n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'precision',\n'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'\n\n\nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature \nscorer(y_true, y_pred)\n. See the section on \nscoring functions\n for more details.\n\n\nTPOT assumes that any function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized.\n\n\n\n\ncv\n: int, cross-validation generator, or an iterable, optional (default=5)\n\n\nCross-validation strategy used when evaluating pipelines.\n\n\nPossible inputs:\n\n\n\ninteger, to specify the number of folds in a StratifiedKFold,\n\n\nAn object to be used as a cross-validation generator, or\n\n\nAn iterable yielding train/test splits.\n\n\n\n\n\nsubsample\n: float, optional (default=1.0)\n\n\nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0].\n\n\nSetting \nsubsample\n=0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process.\n\n\n\n\nn_jobs\n: integer, optional (default=1)\n\n\nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process.\n\n\nSetting \nn_jobs\n=-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets.\n\n\n\n\nmax_time_mins\n: integer or None, optional (default=None)\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf not None, this setting will override the \ngenerations\n parameter and allow TPOT to run until \nmax_time_mins\n minutes elapse.\n\n\n\n\nmax_eval_time_mins\n: integer, optional (default=5)\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. 
Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines.\n\n\n\n\nrandom_state\n: integer or None, optional (default=None)\n\n\nThe seed of the pseudo random number generator used in TPOT.\n\n\nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.\n\n\n\n\nconfig_dict\n: Python dictionary, string, or None, optional (default=None)\n\n\nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.\n\n\nPossible inputs are:\n\n\n\nPython dictionary, TPOT will use your custom configuration,\n\n\nstring 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or\n\n\nstring 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or\n\n\nstring 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or\n\n\nNone, TPOT will use the default TPOTClassifier configuration.\n\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\nwarm_start\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT instance will reuse the population from previous calls to \nfit()\n.\n\n\nSetting \nwarm_start\n=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.\n\n\n\n\nperiodic_checkpoint_folder\n: path string, optional (default=None)\n\n\nIf supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing.\n\nCurrently this happens once per generation, but no more often than once per 30 seconds.\n\nUseful in multiple cases:\n\n\n\nSudden death before TPOT could save an optimized pipeline\n\n\nTracking progress\n\n\nGrabbing a pipeline while TPOT is still optimizing\n\n\n\n\n\n\n\nearly_stop\n: integer, optional (default=None)\n\n\nHow many generations TPOT checks for improvement in the optimization process.\n\n\nEnds the optimization process early if there is no improvement within the given number of generations.\n\n\n\n\nverbosity\n: integer, optional (default=0)\n\n\nHow much information TPOT communicates while it's running.\n\n\nPossible inputs are:\n\n\n\n0, TPOT will print nothing,\n\n\n1, TPOT will print minimal information,\n\n\n2, TPOT will print more information and provide a progress bar, or\n\n\n3, TPOT will print everything and provide a progress bar.\n\n\n\n\n\n\n\ndisable_update_check\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\nThe update checker will tell you when a new version of TPOT has been released.\n\n\n\n\n\n\n\n\n\n\nAttributes:\n\n\n\n\nfitted_pipeline_\n: scikit-learn Pipeline object\n\n\nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset.\n\n\n\n\npareto_front_fitted_pipelines_\n: Python dictionary\n\n\nDictionary containing all of the pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset.\n\n\nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the 
pipeline) and the predictive performance of the pipeline.\n\n\nNote: \npareto_front_fitted_pipelines_\n is only available when \nverbosity\n=3.\n\n\n\n\nevaluated_individuals_\n: Python dictionary\n\n\nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline).\n\n\nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated.\n\n\n\n\n\n\n\n\n\n\nExample\n\n\nfrom tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py')\n\n\n\n\nFunctions\n\n\n\n\n\n\nfit\n(features, classes[, sample_weight, groups])\n\n\nRun the TPOT optimization process on the given training data.\n\n\n\n\n\n\n\npredict\n(features)\n\n\nUse the optimized pipeline to predict the classes for a feature set.\n\n\n\n\n\n\n\npredict_proba\n(features)\n\n\nUse the optimized pipeline to estimate the class probabilities for a feature set.\n\n\n\n\n\n\n\nscore\n(testing_features, testing_classes)\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\n\n\n\n\n\nexport\n(output_file_name)\n\n\nExport the optimized pipeline as Python code.\n\n\n\n\n\n\n\n\n\nfit(features, classes, sample_weight=None, groups=None)\n\n\n\n\n\nRun the TPOT optimization process on the given training data.\n\n\nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and classes. Performs internal stratified k-fold cross-validation to avoid overfitting on the provided data.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing \nmedian value imputation\n.\n\n\nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT.\n\n\n\n\nclasses\n: array-like {n_samples}\n\n\nList of class labels for prediction\n\n\n\n\nsample_weight\n: array-like {n_samples}, optional\n\n\nPer-sample weights. 
Higher weights force TPOT to put more emphasis on those points.\n\n\n\n\ngroups\n: array-like, with shape {n_samples, }, optional\n\n\nGroup labels for the samples used when performing cross-validation.\n\n\nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as \nsklearn.model_selection.GroupKFold\n.\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\nself\n: object\n\n\nReturns a copy of the fitted TPOT object\n\n\n\n\n\n\n\n\n\n\n\n\n\n\npredict(features)\n\n\n\n\n\nUse the optimized pipeline to predict the classes for a feature set.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\npredictions\n: array-like {n_samples}\n\n\nPredicted classes for the samples in the feature matrix\n\n\n\n\n\n\n\n\n\n\n\n\n\n\npredict_proba(features)\n\n\n\n\n\nUse the optimized pipeline to estimate the class probabilities for a feature set.\n\n\nNote: This function will only work for pipelines whose final classifier supports the \npredict_proba\n function. TPOT will raise an error otherwise.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\npredictions\n: array-like {n_samples, n_classes}\n\n\nThe class probabilities of the input samples\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nscore(testing_features, testing_classes)\n\n\n\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\nThe default scoring function for TPOTClassifier is 'accuracy'.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\ntesting_features\n: array-like {n_samples, n_features}\n\n\nFeature matrix of the testing set\n\n\n\n\ntesting_classes\n: array-like {n_samples}\n\n\nList of class labels for prediction in the testing set\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\naccuracy_score\n: float\n\n\nThe estimated test set accuracy according to the user-specified scoring function.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nexport(output_file_name)\n\n\n\n\n\nExport the optimized pipeline as Python code.\n\n\nSee the \nusage documentation\n for example usage of the export function.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\noutput_file_name\n: string\n\n\nString containing the path and file name of the desired output file\n\n\n\n\n\n\n\nReturns:\n\n\n\nDoes not return anything\n\n\n\n\n\n\n\n\n\n\nRegression\n\n\nclass\n tpot.\nTPOTRegressor\n(\ngenerations\n=100, \npopulation_size\n=100,\n \noffspring_size\n=None, \nmutation_rate\n=0.9,\n \ncrossover_rate\n=0.1,\n \nscoring\n='neg_mean_squared_error', \ncv\n=5,\n \nsubsample\n=1.0, \nn_jobs\n=1,\n \nmax_time_mins\n=None, \nmax_eval_time_mins\n=5,\n \nrandom_state\n=None, \nconfig_dict\n=None,\n \nwarm_start\n=False,\n \nperiodic_checkpoint_folder\n=None,\n \nverbosity\n=0,\n \ndisable_update_check\n=False\n)\n\n\n\nsource\n\n\n\nAutomated machine learning for supervised regression tasks.\n\n\nThe TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the \nscikit-learn API\n.\nThe TPOTRegressor will also search over the hyperparameters of all objects in the pipeline.\n\n\nBy default, TPOTRegressor will search over a broad range of supervised regression models, transformers, and their hyperparameters.\nHowever, the models, transformers, and parameters that the TPOTRegressor searches over can be fully 
customized using the \nconfig_dict\n parameter.\n\n\nRead more in the \nUser Guide\n.\n\n\n\n\n\n\nParameters:\n\n\n\n\ngenerations\n: int, optional (default=100)\n\n\nNumber of iterations to run the pipeline optimization process. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline.\n\n\nTPOT will evaluate \npopulation_size\n + \ngenerations\n \u00d7 \noffspring_size\n pipelines in total.\n\n\n\n\npopulation_size\n: int, optional (default=100)\n\n\nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number.\n\n\nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline.\n\n\n\n\noffspring_size\n: int, optional (default=100)\n\n\nNumber of offspring to produce in each genetic programming generation. Must be a positive number.\n\n\n\n\nmutation_rate\n: float, optional (default=0.9)\n\n\nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.\n\n\n\n\ncrossover_rate\n: float, optional (default=0.1)\n\n\nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation.\n\n\n\nmutation_rate\n + \ncrossover_rate\n cannot exceed 1.0.\n\n\nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.\n\n\n\n\nscoring\n: string or callable, optional (default='neg_mean_squared_error')\n\n\nFunction used to evaluate the quality of a given pipeline for the regression problem. The following built-in scoring functions can be used:\n\n\n'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'r2'\n\n\nNote that we recommend using the \nneg\n version of mean squared error and related metrics so TPOT will minimize (instead of maximize) the metric.\n\n\nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature \nscorer(y_true, y_pred)\n. See the section on \nscoring functions\n for more details.\n\n\nTPOT assumes that any custom scoring function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized.\n\n\n\n\ncv\n: int, cross-validation generator, or an iterable, optional (default=5)\n\n\nCross-validation strategy used when evaluating pipelines.\n\n\nPossible inputs:\n\n\n\ninteger, to specify the number of folds in a KFold,\n\n\nAn object to be used as a cross-validation generator, or\n\n\nAn iterable yielding train/test splits.\n\n\n\n\n\n\n\nsubsample\n: float, optional (default=1.0)\n\n\nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0].\n\n\nSetting \nsubsample\n=0.5 tells TPOT to use a random subsample of half of the training data. 
This subsample will remain the same during the entire pipeline optimization process.\n\n\n\n\nn_jobs\n: integer, optional (default=1)\n\n\nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process.\n\n\nSetting \nn_jobs\n=-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets.\n\n\n\n\nmax_time_mins\n: integer or None, optional (default=None)\n\n\nHow many minutes TPOT has to optimize the pipeline.\n\n\nIf not None, this setting will override the \ngenerations\n parameter and allow TPOT to run until \nmax_time_mins\n minutes elapse.\n\n\n\n\nmax_eval_time_mins\n: integer, optional (default=5)\n\n\nHow many minutes TPOT has to evaluate a single pipeline.\n\n\nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines.\n\n\n\n\nrandom_state\n: integer or None, optional (default=None)\n\n\nThe seed of the pseudo random number generator used in TPOT.\n\n\nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.\n\n\n\n\nconfig_dict\n: Python dictionary, string, or None, optional (default=None)\n\n\nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process.\n\n\nPossible inputs are:\n\n\n\nPython dictionary, TPOT will use your custom configuration,\n\n\nstring 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or\n\n\nstring 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or\n\n\nstring 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or\n\n\nNone, TPOT will use the default TPOTRegressor configuration.\n\n\n\nSee the \nbuilt-in configurations\n section for the list of configurations included with TPOT, and the \ncustom configuration\n section for more information and examples of how to create your own TPOT configurations.\n\n\n\n\nwarm_start\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT instance will reuse the population from previous calls to \nfit()\n.\n\n\nSetting \nwarm_start\n=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.\n\n\n\n\nperiodic_checkpoint_folder\n: path string, optional (default=None)\n\n\nIf supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing.\n\nCurrently this happens once per generation, but no more often than once per 30 seconds.\n\nUseful in multiple cases:\n\n\n\nSudden death before TPOT could save an optimized pipeline\n\n\nTracking progress\n\n\nGrabbing a pipeline while TPOT is still optimizing\n\n\n\n\n\n\n\nearly_stop\n: integer, optional (default=None)\n\n\nHow many generations TPOT checks for improvement in the optimization process.\n\n\nEnds the optimization process early if there is no improvement within the given number of generations.\n\n\n\n\nverbosity\n: integer, optional (default=0)\n\n\nHow much information TPOT communicates while it's running.\n\n\nPossible inputs are:\n\n\n\n0, TPOT will print nothing,\n\n\n1, TPOT will print minimal information,\n\n\n2, TPOT will print more 
information and provide a progress bar, or\n\n\n3, TPOT will print everything and provide a progress bar.\n\n\n\n\n\n\n\ndisable_update_check\n: boolean, optional (default=False)\n\n\nFlag indicating whether the TPOT version checker should be disabled.\n\n\nThe update checker will tell you when a new version of TPOT has been released.\n\n\n\n\n\n\n\n\n\n\nAttributes:\n\n\n\n\nfitted_pipeline_\n: scikit-learn Pipeline object\n\n\nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset.\n\n\n\n\npareto_front_fitted_pipelines_\n: Python dictionary\n\n\nDictionary containing all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset.\n\n\nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline.\n\n\nNote: \npareto_front_fitted_pipelines_\n is only available when \nverbosity\n=3.\n\n\n\n\nevaluated_individuals_\n: Python dictionary\n\n\nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline).\n\n\nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated.\n\n\n\n\n\n\n\n\n\n\nExample\n\n\nfrom tpot import TPOTRegressor\nfrom sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\n\nhousing = load_boston()\nX_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_boston_pipeline.py')\n\n\n\n\nFunctions\n\n\n\n\n\n\nfit\n(features, target[, sample_weight, groups])\n\n\nRun the TPOT optimization process on the given training data.\n\n\n\n\n\n\n\npredict\n(features)\n\n\nUse the optimized pipeline to predict the target values for a feature set.\n\n\n\n\n\n\n\nscore\n(testing_features, testing_target)\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\n\n\n\n\n\nexport\n(output_file_name)\n\n\nExport the optimized pipeline as Python code.\n\n\n\n\n\n\n\n\n\nfit(features, target, sample_weight=None, groups=None)\n\n\n\n\n\nRun the TPOT optimization process on the given training data.\n\n\nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. 
Performs internal k-fold cross-validation to avoid overfitting on the provided data.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing \nmedian value imputation\n.\n\n\nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT.\n\n\n\n\ntarget\n: array-like {n_samples}\n\n\nList of target labels for prediction\n\n\n\n\nsample_weight\n: array-like {n_samples}, optional\n\n\nPer-sample weights. Higher weights force TPOT to put more emphasis on those points.\n\n\n\n\ngroups\n: array-like, with shape {n_samples, }, optional\n\n\nGroup labels for the samples used when performing cross-validation.\n\n\nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as \nsklearn.model_selection.GroupKFold\n.\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\nself\n: object\n\n\nReturns a copy of the fitted TPOT object\n\n\n\n\n\n\n\n\n\n\n\n\n\n\npredict(features)\n\n\n\n\n\nUse the optimized pipeline to predict the target values for a feature set.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\nfeatures\n: array-like {n_samples, n_features}\n\n\nFeature matrix\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\npredictions\n: array-like {n_samples}\n\n\nPredicted target values for the samples in the feature matrix\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nscore(testing_features, testing_target)\n\n\n\n\n\nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function.\n\n\nThe default scoring function for TPOTRegressor is 'neg_mean_squared_error'.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\ntesting_features\n: array-like {n_samples, n_features}\n\n\nFeature matrix of the testing set\n\n\n\n\ntesting_target\n: array-like {n_samples}\n\n\nList of target labels for prediction in the testing set\n\n\n\n\n\n\n\n\n\nReturns:\n\n\n\n\naccuracy_score\n: float\n\n\nThe estimated test set accuracy according to the user-specified scoring function.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nexport(output_file_name)\n\n\n\n\n\nExport the optimized pipeline as Python code.\n\n\nSee the \nusage documentation\n for example usage of the export function.\n\n\n\n\n\n\n\nParameters:\n\n\n\n\noutput_file_name\n: string\n\n\nString containing the path and file name of the desired output file\n\n\n\n\n\n\n\nReturns:\n\n\n\nDoes not return anything",
"title": "TPOT API"
},
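The scoring parameter above also accepts a callable with the signature scorer(y_true, y_pred). Below is a minimal sketch of that convention for TPOTRegressor; the metric function is our own illustration, not part of the TPOT API. Because its name contains "error", the naming rule described above means TPOT will treat it as a metric to minimize.

import numpy as np
from tpot import TPOTRegressor

# Hypothetical custom metric; "error" in the name tells TPOT to minimize it.
def root_mean_squared_error(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

tpot = TPOTRegressor(generations=5, population_size=50,
                     scoring=root_mean_squared_error, verbosity=2)

A custom metric without "error" or "loss" in its name would instead be maximized, so such a metric should return a negated value, as the built-in 'neg_mean_squared_error' does.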
{
"location": "/api/#classification",
- "text": "class tpot. TPOTClassifier ( generations =100, population_size =100,\n offspring_size =None, mutation_rate =0.9,\n crossover_rate =0.1,\n scoring ='accuracy', cv =5,\n subsample =1.0, n_jobs =1,\n max_time_mins =None, max_eval_time_mins =5,\n random_state =None, config_dict =None,\n warm_start =False, verbosity =0,\n disable_update_check =False ) source Automated machine learning for supervised classification tasks. The TPOTClassifier performs an intelligent search over machine learning pipelines that can contain supervised classification models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API .\nThe TPOTClassifier will also search over the hyperparameters of all objects in the pipeline. By default, TPOTClassifier will search over a broad range of supervised classification algorithms, transformers, and their parameters.\nHowever, the algorithms, transformers, and hyperparameters that the TPOTClassifier searches over can be fully customized using the config_dict parameter. Read more in the User Guide . Parameters: generations : int, optional (default=100) \nNumber of iterations to the run pipeline optimization process. Must be a positive number. \nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate population_size + generations \u00d7 offspring_size pipelines in total. population_size : int, optional (default=100) \nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number. \nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline. offspring_size : int, optional (default=100) \nNumber of offspring to produce in each genetic programming generation. Must be a positive number. mutation_rate : float, optional (default=0.9) \nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. crossover_rate : float, optional (default=0.1) \nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. scoring : string or callable, optional (default='accuracy') \nFunction used to evaluate the quality of a given pipeline for the classification problem. The following built-in scoring functions can be used: \n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss','precision',\n'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' \nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature scorer(y_true, y_pred) . See the section on scoring functions for more details. \nTPOT assumes that any function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized. 
cv : int, cross-validation generator, or an iterable, optional (default=5) \nCross-validation strategy used when evaluating pipelines. \nPossible inputs: integer, to specify the number of folds in a StratifiedKFold, An object to be used as a cross-validation generator, or An iterable yielding train/test splits. subsample : float, optional (default=1.0) \nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0]. \nSetting subsample =0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process. n_jobs : integer, optional (default=1) \nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process. \nSetting n_jobs =-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets max_time_mins : integer or None, optional (default=None) \nHow many minutes TPOT has to optimize the pipeline. \nIf not None, this setting will override the generations parameter and allow TPOT to run until max_time_mins minutes elapse. max_eval_time_mins : integer, optional (default=5) \nHow many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines. random_state : integer or None, optional (default=None) \nThe seed of the pseudo random number generator used in TPOT. \nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. config_dict : Python dictionary, string, or None, optional (default=None) \nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process. \nPossible inputs are: Python dictionary, TPOT will use your custom configuration, string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or None, TPOT will use the default TPOTClassifier configuration. \nSee the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations. warm_start : boolean, optional (default=False) \nFlag indicating whether the TPOT instance will reuse the population from previous calls to fit() . \nSetting warm_start =True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off. verbosity : integer, optional (default=0) \nHow much information TPOT communicates while it's running. \nPossible inputs are: 0, TPOT will print nothing, 1, TPOT will print minimal information, 2, TPOT will print more information and provide a progress bar, or 3, TPOT will print everything and provide a progress bar. disable_update_check : boolean, optional (default=False) \nFlag indicating whether the TPOT version checker should be disabled. \nThe update checker will tell you when a new version of TPOT has been released. 
Attributes: fitted_pipeline_ : scikit-learn Pipeline object \nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset. pareto_front_fitted_pipelines_ : Python dictionary \nDictionary containing the all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset. \nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline. \nNote: pareto_front_fitted_pipelines_ is only available when verbosity =3. evaluated_individuals_ : Python dictionary \nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline). \nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated. Example from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py') Functions fit (features, classes[, sample_weight, groups]) Run the TPOT optimization process on the given training data. predict (features) Use the optimized pipeline to predict the classes for a feature set. predict_proba (features) Use the optimized pipeline to estimate the class probabilities for a feature set. score (testing_features, testing_classes) Returns the optimized pipeline's score on the given testing data using the user-specified scoring function. export (output_file_name) Export the optimized pipeline as Python code. fit(features, classes, sample_weight=None, groups=None) \nRun the TPOT optimization process on the given training data. \nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and classes. Performs internal stratified k-fold cross-validaton to avoid overfitting on the provided data. Parameters: features : array-like {n_samples, n_features} \nFeature matrix \nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing median value imputation . \nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT. classes : array-like {n_samples} \nList of class labels for prediction sample_weight : array-like {n_samples}, optional \nPer-sample weights. Higher weights force TPOT to put more emphasis on those points. groups : array-like, with shape {n_samples, }, optional \nGroup labels for the samples used when performing cross-validation. \nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as sklearn.model_selection.GroupKFold . 
Returns: self : object \nReturns a copy of the fitted TPOT object predict(features) \nUse the optimized pipeline to predict the classes for a feature set. Parameters: features : array-like {n_samples, n_features} \nFeature matrix Returns: predictions : array-like {n_samples} \nPredicted classes for the samples in the feature matrix predict_proba(features) \nUse the optimized pipeline to estimate the class probabilities for a feature set. \nNote: This function will only work for pipelines whose final classifier supports the predict_proba function. TPOT will raise an error otherwise. Parameters: features : array-like {n_samples, n_features} \nFeature matrix Returns: predictions : array-like {n_samples, n_classes} \nThe class probabilities of the input samples score(testing_features, testing_classes) \nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function. \nThe default scoring function for TPOTClassifier is 'accuracy'. Parameters: testing_features : array-like {n_samples, n_features} \nFeature matrix of the testing set testing_classes : array-like {n_samples} \nList of class labels for prediction in the testing set Returns: accuracy_score : float \nThe estimated test set accuracy according to the user-specified scoring function. export(output_file_name) \nExport the optimized pipeline as Python code. \nSee the usage documentation for example usage of the export function. Parameters: output_file_name : string \nString containing the path and file name of the desired output file Returns: \nDoes not return anything",
+ "text": "class tpot. TPOTClassifier ( generations =100, population_size =100,\n offspring_size =None, mutation_rate =0.9,\n crossover_rate =0.1,\n scoring ='accuracy', cv =5,\n subsample =1.0, n_jobs =1,\n max_time_mins =None, max_eval_time_mins =5,\n random_state =None, config_dict =None,\n warm_start =False,\n periodic_checkpoint_folder =None,\n verbosity =0,\n disable_update_check =False ) source Automated machine learning for supervised classification tasks. The TPOTClassifier performs an intelligent search over machine learning pipelines that can contain supervised classification models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API .\nThe TPOTClassifier will also search over the hyperparameters of all objects in the pipeline. By default, TPOTClassifier will search over a broad range of supervised classification algorithms, transformers, and their parameters.\nHowever, the algorithms, transformers, and hyperparameters that the TPOTClassifier searches over can be fully customized using the config_dict parameter. Read more in the User Guide . Parameters: generations : int, optional (default=100) \nNumber of iterations to the run pipeline optimization process. Must be a positive number. \nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate population_size + generations \u00d7 offspring_size pipelines in total. population_size : int, optional (default=100) \nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number. \nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline. offspring_size : int, optional (default=100) \nNumber of offspring to produce in each genetic programming generation. Must be a positive number. mutation_rate : float, optional (default=0.9) \nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. crossover_rate : float, optional (default=0.1) \nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. scoring : string or callable, optional (default='accuracy') \nFunction used to evaluate the quality of a given pipeline for the classification problem. The following built-in scoring functions can be used: \n'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss','precision',\n'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc' \nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature scorer(y_true, y_pred) . See the section on scoring functions for more details. 
\nTPOT assumes that any function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized. cv : int, cross-validation generator, or an iterable, optional (default=5) \nCross-validation strategy used when evaluating pipelines. \nPossible inputs: integer, to specify the number of folds in a StratifiedKFold, An object to be used as a cross-validation generator, or An iterable yielding train/test splits. subsample : float, optional (default=1.0) \nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0]. \nSetting subsample =0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process. n_jobs : integer, optional (default=1) \nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process. \nSetting n_jobs =-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets. max_time_mins : integer or None, optional (default=None) \nHow many minutes TPOT has to optimize the pipeline. \nIf not None, this setting will override the generations parameter and allow TPOT to run until max_time_mins minutes elapse. max_eval_time_mins : integer, optional (default=5) \nHow many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines. random_state : integer or None, optional (default=None) \nThe seed of the pseudo random number generator used in TPOT. \nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. config_dict : Python dictionary, string, or None, optional (default=None) \nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process. \nPossible inputs are: Python dictionary, TPOT will use your custom configuration, string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or None, TPOT will use the default TPOTClassifier configuration. \nSee the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations. warm_start : boolean, optional (default=False) \nFlag indicating whether the TPOT instance will reuse the population from previous calls to fit() . \nSetting warm_start =True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off. periodic_checkpoint_folder : path string, optional (default: None) \nIf supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing. \nCurrently, TPOT saves once per generation, but no more often than once per 30 seconds. 
\nUseful in multiple cases: Sudden death of the process before TPOT could save the optimized pipeline Tracking TPOT's progress Grabbing pipelines while TPOT is still optimizing early_stop : integer, optional (default: None) \nHow many generations TPOT checks for improvement in the optimization process. \nEnds the optimization process if there is no improvement in the given number of generations. verbosity : integer, optional (default=0) \nHow much information TPOT communicates while it's running. \nPossible inputs are: 0, TPOT will print nothing, 1, TPOT will print minimal information, 2, TPOT will print more information and provide a progress bar, or 3, TPOT will print everything and provide a progress bar. disable_update_check : boolean, optional (default=False) \nFlag indicating whether the TPOT version checker should be disabled. \nThe update checker will tell you when a new version of TPOT has been released. Attributes: fitted_pipeline_ : scikit-learn Pipeline object \nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset. pareto_front_fitted_pipelines_ : Python dictionary \nDictionary containing all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset. \nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline. \nNote: pareto_front_fitted_pipelines_ is only available when verbosity =3. evaluated_individuals_ : Python dictionary \nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline). \nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated. Example from tpot import TPOTClassifier\nfrom sklearn.datasets import load_digits\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_digits()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_mnist_pipeline.py') Functions fit (features, classes[, sample_weight, groups]) Run the TPOT optimization process on the given training data. predict (features) Use the optimized pipeline to predict the classes for a feature set. predict_proba (features) Use the optimized pipeline to estimate the class probabilities for a feature set. score (testing_features, testing_classes) Returns the optimized pipeline's score on the given testing data using the user-specified scoring function. export (output_file_name) Export the optimized pipeline as Python code. fit(features, classes, sample_weight=None, groups=None) \nRun the TPOT optimization process on the given training data. \nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and classes. Performs internal stratified k-fold cross-validation to avoid overfitting on the provided data. 
Parameters: features : array-like {n_samples, n_features} \nFeature matrix \nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing median value imputation . \nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT. classes : array-like {n_samples} \nList of class labels for prediction sample_weight : array-like {n_samples}, optional \nPer-sample weights. Higher weights force TPOT to put more emphasis on those points. groups : array-like, with shape {n_samples, }, optional \nGroup labels for the samples used when performing cross-validation. \nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as sklearn.model_selection.GroupKFold . Returns: self : object \nReturns a copy of the fitted TPOT object predict(features) \nUse the optimized pipeline to predict the classes for a feature set. Parameters: features : array-like {n_samples, n_features} \nFeature matrix Returns: predictions : array-like {n_samples} \nPredicted classes for the samples in the feature matrix predict_proba(features) \nUse the optimized pipeline to estimate the class probabilities for a feature set. \nNote: This function will only work for pipelines whose final classifier supports the predict_proba function. TPOT will raise an error otherwise. Parameters: features : array-like {n_samples, n_features} \nFeature matrix Returns: predictions : array-like {n_samples, n_classes} \nThe class probabilities of the input samples score(testing_features, testing_classes) \nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function. \nThe default scoring function for TPOTClassifier is 'accuracy'. Parameters: testing_features : array-like {n_samples, n_features} \nFeature matrix of the testing set testing_classes : array-like {n_samples} \nList of class labels for prediction in the testing set Returns: accuracy_score : float \nThe estimated test set accuracy according to the user-specified scoring function. export(output_file_name) \nExport the optimized pipeline as Python code. \nSee the usage documentation for example usage of the export function. Parameters: output_file_name : string \nString containing the path and file name of the desired output file Returns: \nDoes not return anything",
"title": "Classification"
},
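To make the config_dict parameter above concrete, here is a minimal sketch of a custom configuration, assuming the dictionary format described in the custom configuration section: keys are operator import paths, and values map each hyperparameter to the list or range of values TPOT may search over. The two operators chosen here are illustrative only.

from tpot import TPOTClassifier

# A deliberately tiny search space: one parameterless operator and one
# tunable operator with two hyperparameters.
tpot_config = {
    'sklearn.naive_bayes.GaussianNB': {},
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(1, 11),
    },
}

tpot = TPOTClassifier(generations=5, population_size=50,
                      config_dict=tpot_config, verbosity=2)

Restricting the search space this way trades breadth for speed, in the same spirit as the built-in 'TPOT light' configuration.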
{
"location": "/api/#regression",
- "text": "class tpot. TPOTRegressor ( generations =100, population_size =100,\n offspring_size =None, mutation_rate =0.9,\n crossover_rate =0.1,\n scoring ='neg_mean_squared_error', cv =5,\n subsample =1.0, n_jobs =1,\n max_time_mins =None, max_eval_time_mins =5,\n random_state =None, config_dict =None,\n warm_start =False, verbosity =0,\n disable_update_check =False ) source Automated machine learning for supervised regression tasks. The TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API .\nThe TPOTRegressor will also search over the hyperparameters of all objects in the pipeline. By default, TPOTRegressor will search over a broad range of supervised regression models, transformers, and their hyperparameters.\nHowever, the models, transformers, and parameters that the TPOTRegressor searches over can be fully customized using the config_dict parameter. Read more in the User Guide . Parameters: generations : int, optional (default=100) \nNumber of iterations to the run pipeline optimization process. Must be a positive number. \nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate population_size + generations \u00d7 offspring_size pipelines in total. population_size : int, optional (default=100) \nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number. \nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline. offspring_size : int, optional (default=100) \nNumber of offspring to produce in each genetic programming generation. Must be a positive number. mutation_rate : float, optional (default=0.9) \nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. crossover_rate : float, optional (default=0.1) \nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. scoring : string or callable, optional (default='neg_mean_squared_error') \nFunction used to evaluate the quality of a given pipeline for the regression problem. The following built-in scoring functions can be used: \n'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'r2' \nNote that we recommend using the neg version of mean squared error and related metrics so TPOT will minimize (instead of maximize) the metric. \nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature scorer(y_true, y_pred) . See the section on scoring functions for more details. \nTPOT assumes that any custom scoring function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized. 
cv : int, cross-validation generator, or an iterable, optional (default=5) \nCross-validation strategy used when evaluating pipelines. \nPossible inputs: integer, to specify the number of folds in a KFold, An object to be used as a cross-validation generator, or An iterable yielding train/test splits. subsample : float, optional (default=1.0) \nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0]. \nSetting subsample =0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process. n_jobs : integer, optional (default=1) \nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process. \nSetting n_jobs =-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets max_time_mins : integer or None, optional (default=None) \nHow many minutes TPOT has to optimize the pipeline. \nIf not None, this setting will override the generations parameter and allow TPOT to run until max_time_mins minutes elapse. max_eval_time_mins : integer, optional (default=5) \nHow many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines. random_state : integer or None, optional (default=None) \nThe seed of the pseudo random number generator used in TPOT. \nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. config_dict : Python dictionary, string, or None, optional (default=None) \nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process. \nPossible inputs are: Python dictionary, TPOT will use your custom configuration, string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or None, TPOT will use the default TPOTRegressor configuration. \nSee the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations. warm_start : boolean, optional (default=False) \nFlag indicating whether the TPOT instance will reuse the population from previous calls to fit() . \nSetting warm_start =True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off. verbosity : integer, optional (default=0) \nHow much information TPOT communicates while it's running. \nPossible inputs are: 0, TPOT will print nothing, 1, TPOT will print minimal information, 2, TPOT will print more information and provide a progress bar, or 3, TPOT will print everything and provide a progress bar. disable_update_check : boolean, optional (default=False) \nFlag indicating whether the TPOT version checker should be disabled. \nThe update checker will tell you when a new version of TPOT has been released. 
Attributes: fitted_pipeline_ : scikit-learn Pipeline object \nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset. pareto_front_fitted_pipelines_ : Python dictionary \nDictionary containing the all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset. \nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline. \nNote: _pareto_front_fitted_pipelines is only available when verbosity =3. evaluated_individuals_ : Python dictionary \nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline). \nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated. Example from tpot import TPOTRegressor\nfrom sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\n\ndigits = load_boston()\nX_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_boston_pipeline.py') Functions fit (features, target[, sample_weight, groups]) Run the TPOT optimization process on the given training data. predict (features) Use the optimized pipeline to predict the target values for a feature set. score (testing_features, testing_target) Returns the optimized pipeline's score on the given testing data using the user-specified scoring function. export (output_file_name) Export the optimized pipeline as Python code. fit(features, target, sample_weight=None, groups=None) \nRun the TPOT optimization process on the given training data. \nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. Performs internal k-fold cross-validaton to avoid overfitting on the provided data. Parameters: features : array-like {n_samples, n_features} \nFeature matrix \nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing median value imputation . \nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT. target : array-like {n_samples} \nList of target labels for prediction sample_weight : array-like {n_samples}, optional \nPer-sample weights. Higher weights force TPOT to put more emphasis on those points. groups : array-like, with shape {n_samples, }, optional \nGroup labels for the samples used when performing cross-validation. \nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as sklearn.model_selection.GroupKFold . Returns: self : object \nReturns a copy of the fitted TPOT object predict(features) \nUse the optimized pipeline to predict the target values for a feature set. 
Parameters: features : array-like {n_samples, n_features} \nFeature matrix Returns: predictions : array-like {n_samples} \nPredicted target values for the samples in the feature matrix score(testing_features, testing_target) \nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function. \nThe default scoring function for TPOTClassifier is 'mean_squared_error'. Parameters: testing_features : array-like {n_samples, n_features} \nFeature matrix of the testing set testing_target : array-like {n_samples} \nList of target labels for prediction in the testing set Returns: accuracy_score : float \nThe estimated test set accuracy according to the user-specified scoring function. export(output_file_name) \nExport the optimized pipeline as Python code. \nSee the usage documentation for example usage of the export function. Parameters: output_file_name : string \nString containing the path and file name of the desired output file Returns: \nDoes not return anything",
+ "text": "class tpot. TPOTRegressor ( generations =100, population_size =100,\n offspring_size =None, mutation_rate =0.9,\n crossover_rate =0.1,\n scoring ='neg_mean_squared_error', cv =5,\n subsample =1.0, n_jobs =1,\n max_time_mins =None, max_eval_time_mins =5,\n random_state =None, config_dict =None,\n warm_start =False,\n periodic_checkpoint_folder =None,\n verbosity =0,\n disable_update_check =False ) source Automated machine learning for supervised regression tasks. The TPOTRegressor performs an intelligent search over machine learning pipelines that can contain supervised regression models,\npreprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API .\nThe TPOTRegressor will also search over the hyperparameters of all objects in the pipeline. By default, TPOTRegressor will search over a broad range of supervised regression models, transformers, and their hyperparameters.\nHowever, the models, transformers, and parameters that the TPOTRegressor searches over can be fully customized using the config_dict parameter. Read more in the User Guide . Parameters: generations : int, optional (default=100) \nNumber of iterations to the run pipeline optimization process. Must be a positive number. \nGenerally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. \nTPOT will evaluate population_size + generations \u00d7 offspring_size pipelines in total. population_size : int, optional (default=100) \nNumber of individuals to retain in the genetic programming population every generation. Must be a positive number. \nGenerally, TPOT will work better when you give it more individuals with which to optimize the pipeline. offspring_size : int, optional (default=100) \nNumber of offspring to produce in each genetic programming generation. Must be a positive number. mutation_rate : float, optional (default=0.9) \nMutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the mutation rate affects GP algorithms. crossover_rate : float, optional (default=0.1) \nCrossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the genetic programming algorithm how many pipelines to \"breed\" every generation. mutation_rate + crossover_rate cannot exceed 1.0. \nWe recommend using the default parameter unless you understand how the crossover rate affects GP algorithms. scoring : string or callable, optional (default='neg_mean_squared_error') \nFunction used to evaluate the quality of a given pipeline for the regression problem. The following built-in scoring functions can be used: \n'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'r2' \nNote that we recommend using the neg version of mean squared error and related metrics so TPOT will minimize (instead of maximize) the metric. \nIf you would like to use a custom scoring function, you can pass a callable function to this parameter with the signature scorer(y_true, y_pred) . See the section on scoring functions for more details. \nTPOT assumes that any custom scoring function with \"error\" or \"loss\" in the function name is meant to be minimized, whereas any other functions will be maximized. 
cv : int, cross-validation generator, or an iterable, optional (default=5) \nCross-validation strategy used when evaluating pipelines. \nPossible inputs: integer, to specify the number of folds in a KFold, An object to be used as a cross-validation generator, or An iterable yielding train/test splits. subsample : float, optional (default=1.0) \nFraction of training samples that are used during the TPOT optimization process. Must be in the range (0.0, 1.0]. \nSetting subsample =0.5 tells TPOT to use a random subsample of half of the training data. This subsample will remain the same during the entire pipeline optimization process. n_jobs : integer, optional (default=1) \nNumber of processes to use in parallel for evaluating pipelines during the TPOT optimization process. \nSetting n_jobs =-1 will use as many cores as available on the computer. Beware that using multiple processes on the same machine may cause memory issues for large datasets. max_time_mins : integer or None, optional (default=None) \nHow many minutes TPOT has to optimize the pipeline. \nIf not None, this setting will override the generations parameter and allow TPOT to run until max_time_mins minutes elapse. max_eval_time_mins : integer, optional (default=5) \nHow many minutes TPOT has to evaluate a single pipeline. \nSetting this parameter to higher values will allow TPOT to evaluate more complex pipelines, but will also allow TPOT to run longer. Use this parameter to help prevent TPOT from wasting time on evaluating time-consuming pipelines. random_state : integer or None, optional (default=None) \nThe seed of the pseudo random number generator used in TPOT. \nUse this parameter to make sure that TPOT will give you the same results each time you run it against the same data set with that seed. config_dict : Python dictionary, string, or None, optional (default=None) \nA configuration dictionary for customizing the operators and parameters that TPOT searches in the optimization process. \nPossible inputs are: Python dictionary, TPOT will use your custom configuration, string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or None, TPOT will use the default TPOTRegressor configuration. \nSee the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations. warm_start : boolean, optional (default=False) \nFlag indicating whether the TPOT instance will reuse the population from previous calls to fit() . \nSetting warm_start =True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off. periodic_checkpoint_folder : path string, optional (default: None) \nIf supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing. \nCurrently, TPOT saves once per generation, but no more often than once per 30 seconds. \nUseful in multiple cases: Sudden death of the process before TPOT could save the optimized pipeline Tracking TPOT's progress Grabbing pipelines while TPOT is still optimizing early_stop : integer, optional (default: None) \nHow many generations TPOT checks for improvement in the optimization process. 
\nEnds the optimization process if there is no improvement in the given number of generations. verbosity : integer, optional (default=0) \nHow much information TPOT communicates while it's running. \nPossible inputs are: 0, TPOT will print nothing, 1, TPOT will print minimal information, 2, TPOT will print more information and provide a progress bar, or 3, TPOT will print everything and provide a progress bar. disable_update_check : boolean, optional (default=False) \nFlag indicating whether the TPOT version checker should be disabled. \nThe update checker will tell you when a new version of TPOT has been released. Attributes: fitted_pipeline_ : scikit-learn Pipeline object \nThe best pipeline that TPOT discovered during the pipeline optimization process, fitted on the entire training dataset. pareto_front_fitted_pipelines_ : Python dictionary \nDictionary containing all pipelines on the TPOT Pareto front, where the key is the string representation of the pipeline and the value is the corresponding pipeline fitted on the entire training dataset. \nThe TPOT Pareto front provides a trade-off between pipeline complexity (i.e., the number of steps in the pipeline) and the predictive performance of the pipeline. \nNote: pareto_front_fitted_pipelines_ is only available when verbosity =3. evaluated_individuals_ : Python dictionary \nDictionary containing all pipelines that were evaluated during the pipeline optimization process, where the key is the string representation of the pipeline and the value is a tuple containing (# of steps in pipeline, accuracy metric for the pipeline). \nThis attribute is primarily for internal use, but may be useful for looking at the other pipelines that TPOT evaluated. Example from tpot import TPOTRegressor\nfrom sklearn.datasets import load_boston\nfrom sklearn.model_selection import train_test_split\n\nhousing = load_boston()\nX_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,\n train_size=0.75, test_size=0.25)\n\ntpot = TPOTRegressor(generations=5, population_size=50, verbosity=2)\ntpot.fit(X_train, y_train)\nprint(tpot.score(X_test, y_test))\ntpot.export('tpot_boston_pipeline.py') Functions fit (features, target[, sample_weight, groups]) Run the TPOT optimization process on the given training data. predict (features) Use the optimized pipeline to predict the target values for a feature set. score (testing_features, testing_target) Returns the optimized pipeline's score on the given testing data using the user-specified scoring function. export (output_file_name) Export the optimized pipeline as Python code. fit(features, target, sample_weight=None, groups=None) \nRun the TPOT optimization process on the given training data. \nUses genetic programming to optimize a machine learning pipeline that maximizes the score on the provided features and target. Performs internal k-fold cross-validation to avoid overfitting on the provided data. Parameters: features : array-like {n_samples, n_features} \nFeature matrix \nTPOT and all scikit-learn algorithms assume that the features will be numerical and there will be no missing values.\nAs such, when a feature matrix is provided to TPOT, all missing values will automatically be replaced (i.e., imputed)\nusing median value imputation . \nIf you wish to use a different imputation strategy than median imputation, please make sure to apply imputation to your feature set prior to passing it to TPOT. 
target : array-like {n_samples} \nList of target labels for prediction sample_weight : array-like {n_samples}, optional \nPer-sample weights. Higher weights force TPOT to put more emphasis on those points. groups : array-like, with shape {n_samples, }, optional \nGroup labels for the samples used when performing cross-validation. \nThis parameter should only be used in conjunction with sklearn's Group cross-validation functions, such as sklearn.model_selection.GroupKFold . Returns: self : object \nReturns a copy of the fitted TPOT object predict(features) \nUse the optimized pipeline to predict the target values for a feature set. Parameters: features : array-like {n_samples, n_features} \nFeature matrix Returns: predictions : array-like {n_samples} \nPredicted target values for the samples in the feature matrix score(testing_features, testing_target) \nReturns the optimized pipeline's score on the given testing data using the user-specified scoring function. \nThe default scoring function for TPOTRegressor is 'neg_mean_squared_error'. Parameters: testing_features : array-like {n_samples, n_features} \nFeature matrix of the testing set testing_target : array-like {n_samples} \nList of target labels for prediction in the testing set Returns: accuracy_score : float \nThe estimated test set accuracy according to the user-specified scoring function. export(output_file_name) \nExport the optimized pipeline as Python code. \nSee the usage documentation for example usage of the export function. Parameters: output_file_name : string \nString containing the path and file name of the desired output file Returns: \nDoes not return anything",
"title": "Regression"
},
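Since cv accepts a cross-validation generator and fit accepts a groups argument, the two can be combined when samples are grouped. A minimal sketch with synthetic data (the shapes and group labels are assumptions for illustration only):

import numpy as np
from sklearn.model_selection import GroupKFold
from tpot import TPOTRegressor

# Hypothetical data: 100 samples in 10 groups (e.g., repeated measurements
# per subject); samples from one group must not span train and test folds.
X = np.random.rand(100, 5)
y = np.random.rand(100)
group_labels = np.repeat(np.arange(10), 10)

# GroupKFold keeps each group within a single fold during pipeline evaluation.
tpot = TPOTRegressor(generations=5, population_size=50,
                     cv=GroupKFold(n_splits=5), verbosity=2)
tpot.fit(X, y, groups=group_labels)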
{
@@ -117,9 +122,14 @@
},
{
"location": "/releases/",
- "text": "Version 0.8\n\n\n\n\n\n\nTPOT now detects whether there are missing values in your dataset\n and replaces them with the median value of the column.\n\n\n\n\n\n\nTPOT now allows you to set a \ngroup\n parameter in the \nfit\n function so you can use the \nGroupKFold\n cross-validation strategy.\n\n\n\n\n\n\nTPOT now allows you to set a subsample ratio of the training instance with the \nsubsample\n parameter. For example, setting \nsubsample\n=0.5 tells TPOT to create a fixed subsample of half of the training data for the pipeline optimization process. This parameter can be useful for speeding up the pipeline optimization process, but may give less accurate performance estimates from cross-validation.\n\n\n\n\n\n\nTPOT now has more \nbuilt-in configurations\n, including TPOT MDR and TPOT light, for both classification and regression problems.\n\n\n\n\n\n\nTPOTClassifier\n and \nTPOTRegressor\n now expose three useful internal attributes, \nfitted_pipeline_\n, \npareto_front_fitted_pipelines_\n, and \nevaluated_individuals_\n. These attributes are described in the \nAPI documentation\n.\n\n\n\n\n\n\nOh, \nTPOT now has \nthorough API documentation\n. Check it out!\n\n\n\n\n\n\nFixed a reproducibility issue where setting \nrandom_seed\n didn't necessarily result in the same results every time. This bug was present since TPOT v0.7.\n\n\n\n\n\n\nRefined input checking in TPOT.\n\n\n\n\n\n\nRemoved Python 2 uncompliant code.\n\n\n\n\n\n\nVersion 0.7\n\n\n\n\n\n\nTPOT now has multiprocessing support.\n TPOT allows you to use multiple processes in parallel to accelerate the pipeline optimization process in TPOT with the \nn_jobs\n parameter.\n\n\n\n\n\n\nTPOT now allows you to \ncustomize the operators and parameters considered during the optimization process\n, which can be accomplished with the new \nconfig_dict\n parameter. The format of this customized dictionary can be found in the \nonline documentation\n, along with a list of \nbuilt-in configurations\n.\n\n\n\n\n\n\nTPOT now allows you to \nspecify a time limit for evaluating a single pipeline\n (default limit is 5 minutes) in optimization process with the \nmax_eval_time_mins\n parameter, so TPOT won't spend hours evaluating overly-complex pipelines.\n\n\n\n\n\n\nWe tweaked TPOT's underlying evolutionary optimization algorithm to work even better, including using the \nmu+lambda algorithm\n. This algorithm gives you more control of how many pipelines are generated every iteration with the \noffspring_size\n parameter.\n\n\n\n\n\n\nRefined the default operators and parameters in TPOT, so TPOT 0.7 should work even better than 0.6.\n\n\n\n\n\n\nTPOT now supports sample weights in the fitness function if some if your samples are more important to classify correctly than others. The sample weights option works the same as in scikit-learn, e.g., \ntpot.fit(x_train, y_train, sample_weights=sample_weights)\n.\n\n\n\n\n\n\nThe default scoring metric in TPOT has been changed from balanced accuracy to accuracy, the same default metric for classification algorithms in scikit-learn. Balanced accuracy can still be used by setting \nscoring='balanced_accuracy'\n when creating a TPOT instance.\n\n\n\n\n\n\nVersion 0.6\n\n\n\n\n\n\nTPOT now supports regression problems!\n We have created two separate \nTPOTClassifier\n and \nTPOTRegressor\n classes to support classification and regression problems, respectively. 
The \ncommand-line interface\n also supports this feature through the \n-mode\n parameter.\n\n\n\n\n\n\nTPOT now allows you to \nspecify a time limit\n for the optimization process with the \nmax_time_mins\n parameter, so you don't need to guess how long TPOT will take any more to recommend a pipeline to you.\n\n\n\n\n\n\nAdded a new operator that performs feature selection using \nExtraTrees\n feature importance scores.\n\n\n\n\n\n\nXGBoost\n has been added as an optional dependency to TPOT.\n If you have XGBoost installed, TPOT will automatically detect your installation and use the \nXGBoostClassifier\n and \nXGBoostRegressor\n in its pipelines.\n\n\n\n\n\n\nTPOT now offers a verbosity level of 3 (\"science mode\"), which outputs the entire Pareto front instead of only the current best score. This feature may be useful for users looking to make a trade-off between pipeline complexity and score.\n\n\n\n\n\n\nVersion 0.5\n\n\n\n\nMajor refactor: Each operator is defined in a separate class file. Hooray for easier-to-maintain code!\n\n\nTPOT now \nexports directly to scikit-learn Pipelines\n instead of hacky code.\n\n\nInternal representation of individuals now uses scikit-learn pipelines.\n\n\nParameters for each operator have been optimized so TPOT spends less time exploring useless parameters.\n\n\nWe have removed pandas as a dependency and instead use numpy matrices to store the data.\n\n\nTPOT now uses \nk-fold cross-validation\n when evaluating pipelines, with a default k = 3. This k parameter can be tuned when creating a new TPOT instance.\n\n\nImproved \nscoring function support\n: Even though TPOT uses balanced accuracy by default, you can now have TPOT use \nany of the scoring functions\n that \ncross_val_score\n supports.\n\n\nAdded the scikit-learn \nNormalizer\n preprocessor.\n\n\nMinor text fixes.\n\n\n\n\nVersion 0.4\n\n\nIn TPOT 0.4, we've made some major changes to the internals of TPOT and added some convenience functions. 
We've summarized the changes below.\n\n\n\n\nAdded new sklearn models and preprocessors\n\n\n\n\nAdaBoostClassifier\n\n\nBernoulliNB\n\n\nExtraTreesClassifier\n\n\nGaussianNB\n\n\nMultinomialNB\n\n\nLinearSVC\n\n\nPassiveAggressiveClassifier\n\n\nGradientBoostingClassifier\n\n\nRBFSampler\n\n\nFastICA\n\n\nFeatureAgglomeration\n\n\nNystroem\n\n\n\n\nAdded operator that inserts virtual features for the count of features with values of zero\n\n\nReworked parameterization of TPOT operators\n\n\n\nReduced parameter search space with information from a scikit-learn benchmark\n\n\nTPOT no longer generates arbitrary parameter values, but uses a fixed parameter set instead\n\n\n\n\nRemoved XGBoost as a dependency\n\n\n\nToo many users were having install issues with XGBoost\n\n\nReplaced with scikit-learn's GradientBoostingClassifier\n\n\n\n\nImproved descriptiveness of TPOT command line parameter documentation\n\n\nRemoved min/max/avg details during fit() when verbosity > 1\n\n\n\n\nReplaced with tqdm progress bar\n\n\nAdded tqdm as a dependency\n\n\n\n\nAdded \nfit_predict()\n convenience function\n\n\nAdded \nget_params()\n function so TPOT can operate in scikit-learn's \ncross_val_score\n & related functions\n\n\n\n\n\nVersion 0.3\n\n\n\n\nWe revised the internal optimization process of TPOT to make it more efficient, in particular in regards to the model parameters that TPOT optimizes over.\n\n\n\n\nVersion 0.2\n\n\n\n\n\n\nTPOT now has the ability to export the optimized pipelines to sklearn code.\n\n\n\n\n\n\nLogistic regression, SVM, and k-nearest neighbors classifiers were added as pipeline operators. Previously, TPOT only included decision tree and random forest classifiers.\n\n\n\n\n\n\nTPOT can now use arbitrary scoring functions for the optimization process.\n\n\n\n\n\n\nTPOT now performs multi-objective Pareto optimization to balance model complexity (i.e., # of pipeline operators) and the score of the pipeline.\n\n\n\n\n\n\nVersion 0.1\n\n\n\n\n\n\nFirst public release of TPOT.\n\n\n\n\n\n\nOptimizes pipelines with decision trees and random forest classifiers as the model, and uses a handful of feature preprocessors.",
+ "text": "Version 0.9\n\n\n\n\n\n\nTPOT now supports sparse matrices\n with a new built-in TPOT configuration, \"TPOT sparse\". We are using a custom OneHotEncoder implementation that supports missing values and continuous features.\n\n\n\n\n\n\nWe have added an \"early stopping\" option for stopping the optimization process if no improvement is made within a set number of generations. Look up the \nearly_stop\n parameter to access this functionality.\n\n\n\n\n\n\nTPOT now reduces the number of duplicated pipelines between generations, which saves you time during the optimization process.\n\n\n\n\n\n\nTPOT now supports custom scoring functions via the command-line mode.\n\n\n\n\n\n\nWe have added a new optional argument, \nperiodic_checkpoint_folder\n, that allows TPOT to periodically save the best pipeline so far to a local folder during the optimization process.\n\n\n\n\n\n\nTPOT no longer uses \nsklearn.externals.joblib\n when \nn_jobs=1\n to avoid the potential freezing issue \nthat scikit-learn suffers from\n.\n\n\n\n\n\n\nWe have added \npandas\n as a dependency to read input datasets instead of \nnumpy.recfromcsv\n. NumPy's \nrecfromcsv\n function is unable to parse datasets with complex data types.\n\n\n\n\n\n\nFixed a bug where \nDEFAULT\n in the parameter(s) of a nested estimator raised a \nKeyError\n when exporting pipelines.\n\n\n\n\n\n\nFixed a bug related to setting \nrandom_state\n in nested estimators. The issue would happen with pipelines containing \nSelectFromModel\n (\nExtraTreesClassifier\n as the nested estimator) or \nStackingEstimator\n if the nested estimator has a \nrandom_state\n parameter.\n\n\n\n\n\n\nFixed a bug in the missing value imputation function in TPOT to impute along columns instead of rows.\n\n\n\n\n\n\nRefined input checking for sparse matrices in TPOT.\n\n\n\n\n\n\nVersion 0.8\n\n\n\n\n\n\nTPOT now detects whether there are missing values in your dataset\n and replaces them with the median value of the column.\n\n\n\n\n\n\nTPOT now allows you to set a \ngroup\n parameter in the \nfit\n function so you can use the \nGroupKFold\n cross-validation strategy.\n\n\n\n\n\n\nTPOT now allows you to set a subsample ratio of the training instance with the \nsubsample\n parameter. For example, setting \nsubsample\n=0.5 tells TPOT to create a fixed subsample of half of the training data for the pipeline optimization process. This parameter can be useful for speeding up the pipeline optimization process, but may give less accurate performance estimates from cross-validation.\n\n\n\n\n\n\nTPOT now has more \nbuilt-in configurations\n, including TPOT MDR and TPOT light, for both classification and regression problems.\n\n\n\n\n\n\nTPOTClassifier\n and \nTPOTRegressor\n now expose three useful internal attributes, \nfitted_pipeline_\n, \npareto_front_fitted_pipelines_\n, and \nevaluated_individuals_\n. These attributes are described in the \nAPI documentation\n.\n\n\n\n\n\n\nOh, \nTPOT now has \nthorough API documentation\n. Check it out!\n\n\n\n\n\n\nFixed a reproducibility issue where setting \nrandom_seed\n didn't necessarily result in the same results every time. 
This bug was present since TPOT v0.7.\n\n\n\n\n\n\nRefined input checking in TPOT.\n\n\n\n\n\n\nRemoved Python 2 uncompliant code.\n\n\n\n\n\n\nVersion 0.7\n\n\n\n\n\n\nTPOT now has multiprocessing support.\n TPOT allows you to use multiple processes in parallel to accelerate the pipeline optimization process in TPOT with the \nn_jobs\n parameter.\n\n\n\n\n\n\nTPOT now allows you to \ncustomize the operators and parameters considered during the optimization process\n, which can be accomplished with the new \nconfig_dict\n parameter. The format of this customized dictionary can be found in the \nonline documentation\n, along with a list of \nbuilt-in configurations\n.\n\n\n\n\n\n\nTPOT now allows you to \nspecify a time limit for evaluating a single pipeline\n (default limit is 5 minutes) in optimization process with the \nmax_eval_time_mins\n parameter, so TPOT won't spend hours evaluating overly-complex pipelines.\n\n\n\n\n\n\nWe tweaked TPOT's underlying evolutionary optimization algorithm to work even better, including using the \nmu+lambda algorithm\n. This algorithm gives you more control of how many pipelines are generated every iteration with the \noffspring_size\n parameter.\n\n\n\n\n\n\nRefined the default operators and parameters in TPOT, so TPOT 0.7 should work even better than 0.6.\n\n\n\n\n\n\nTPOT now supports sample weights in the fitness function if some of your samples are more important to classify correctly than others. The sample weights option works the same as in scikit-learn, e.g., \ntpot.fit(x_train, y_train, sample_weights=sample_weights)\n.\n\n\n\n\n\n\nThe default scoring metric in TPOT has been changed from balanced accuracy to accuracy, the same default metric for classification algorithms in scikit-learn. Balanced accuracy can still be used by setting \nscoring='balanced_accuracy'\n when creating a TPOT instance.\n\n\n\n\n\n\nVersion 0.6\n\n\n\n\n\n\nTPOT now supports regression problems!\n We have created two separate \nTPOTClassifier\n and \nTPOTRegressor\n classes to support classification and regression problems, respectively. The \ncommand-line interface\n also supports this feature through the \n-mode\n parameter.\n\n\n\n\n\n\nTPOT now allows you to \nspecify a time limit\n for the optimization process with the \nmax_time_mins\n parameter, so you don't need to guess how long TPOT will take any more to recommend a pipeline to you.\n\n\n\n\n\n\nAdded a new operator that performs feature selection using \nExtraTrees\n feature importance scores.\n\n\n\n\n\n\nXGBoost\n has been added as an optional dependency to TPOT.\n If you have XGBoost installed, TPOT will automatically detect your installation and use the \nXGBoostClassifier\n and \nXGBoostRegressor\n in its pipelines.\n\n\n\n\n\n\nTPOT now offers a verbosity level of 3 (\"science mode\"), which outputs the entire Pareto front instead of only the current best score. This feature may be useful for users looking to make a trade-off between pipeline complexity and score.\n\n\n\n\n\n\nVersion 0.5\n\n\n\n\nMajor refactor: Each operator is defined in a separate class file. 
Hooray for easier-to-maintain code!\n\n\nTPOT now \nexports directly to scikit-learn Pipelines\n instead of hacky code.\n\n\nInternal representation of individuals now uses scikit-learn pipelines.\n\n\nParameters for each operator have been optimized so TPOT spends less time exploring useless parameters.\n\n\nWe have removed pandas as a dependency and instead use numpy matrices to store the data.\n\n\nTPOT now uses \nk-fold cross-validation\n when evaluating pipelines, with a default k = 3. This k parameter can be tuned when creating a new TPOT instance.\n\n\nImproved \nscoring function support\n: Even though TPOT uses balanced accuracy by default, you can now have TPOT use \nany of the scoring functions\n that \ncross_val_score\n supports.\n\n\nAdded the scikit-learn \nNormalizer\n preprocessor.\n\n\nMinor text fixes.\n\n\n\n\nVersion 0.4\n\n\nIn TPOT 0.4, we've made some major changes to the internals of TPOT and added some convenience functions. We've summarized the changes below.\n\n\n\n\nAdded new sklearn models and preprocessors\n\n\n\n\nAdaBoostClassifier\n\n\nBernoulliNB\n\n\nExtraTreesClassifier\n\n\nGaussianNB\n\n\nMultinomialNB\n\n\nLinearSVC\n\n\nPassiveAggressiveClassifier\n\n\nGradientBoostingClassifier\n\n\nRBFSampler\n\n\nFastICA\n\n\nFeatureAgglomeration\n\n\nNystroem\n\n\n\n\nAdded operator that inserts virtual features for the count of features with values of zero\n\n\nReworked parameterization of TPOT operators\n\n\n\nReduced parameter search space with information from a scikit-learn benchmark\n\n\nTPOT no longer generates arbitrary parameter values, but uses a fixed parameter set instead\n\n\n\n\nRemoved XGBoost as a dependency\n\n\n\nToo many users were having install issues with XGBoost\n\n\nReplaced with scikit-learn's GradientBoostingClassifier\n\n\n\n\nImproved descriptiveness of TPOT command line parameter documentation\n\n\nRemoved min/max/avg details during fit() when verbosity > 1\n\n\n\n\nReplaced with tqdm progress bar\n\n\nAdded tqdm as a dependency\n\n\n\n\nAdded \nfit_predict()\n convenience function\n\n\nAdded \nget_params()\n function so TPOT can operate in scikit-learn's \ncross_val_score\n & related functions\n\n\n\n\n\nVersion 0.3\n\n\n\n\nWe revised the internal optimization process of TPOT to make it more efficient, in particular in regards to the model parameters that TPOT optimizes over.\n\n\n\n\nVersion 0.2\n\n\n\n\n\n\nTPOT now has the ability to export the optimized pipelines to sklearn code.\n\n\n\n\n\n\nLogistic regression, SVM, and k-nearest neighbors classifiers were added as pipeline operators. Previously, TPOT only included decision tree and random forest classifiers.\n\n\n\n\n\n\nTPOT can now use arbitrary scoring functions for the optimization process.\n\n\n\n\n\n\nTPOT now performs multi-objective Pareto optimization to balance model complexity (i.e., # of pipeline operators) and the score of the pipeline.\n\n\n\n\n\n\nVersion 0.1\n\n\n\n\n\n\nFirst public release of TPOT.\n\n\n\n\n\n\nOptimizes pipelines with decision trees and random forest classifiers as the model, and uses a handful of feature preprocessors.",
"title": "Release Notes"
},
+ {
+ "location": "/releases/#version-09",
+ "text": "TPOT now supports sparse matrices with a new built-in TPOT configuration, \"TPOT sparse\". We are using a custom OneHotEncoder implementation that supports missing values and continuous features. We have added an \"early stopping\" option for stopping the optimization process if no improvement is made within a set number of generations. Look up the early_stop parameter to access this functionality. TPOT now reduces the number of duplicated pipelines between generations, which saves you time during the optimization process. TPOT now supports custom scoring functions via the command-line mode. We have added a new optional argument, periodic_checkpoint_folder , that allows TPOT to periodically save the best pipeline so far to a local folder during the optimization process. TPOT no longer uses sklearn.externals.joblib when n_jobs=1 to avoid the potential freezing issue that scikit-learn suffers from . We have added pandas as a dependency to read input datasets instead of numpy.recfromcsv . NumPy's recfromcsv function is unable to parse datasets with complex data types. Fixed a bug where DEFAULT in the parameter(s) of a nested estimator raised a KeyError when exporting pipelines. Fixed a bug related to setting random_state in nested estimators. The issue would happen with pipelines containing SelectFromModel ( ExtraTreesClassifier as the nested estimator) or StackingEstimator if the nested estimator has a random_state parameter. Fixed a bug in the missing value imputation function in TPOT to impute along columns instead of rows. Refined input checking for sparse matrices in TPOT.",
+ "title": "Version 0.9"
+ },
{
"location": "/releases/#version-08",
"text": "TPOT now detects whether there are missing values in your dataset and replaces them with the median value of the column. TPOT now allows you to set a group parameter in the fit function so you can use the GroupKFold cross-validation strategy. TPOT now allows you to set a subsample ratio of the training instance with the subsample parameter. For example, setting subsample =0.5 tells TPOT to create a fixed subsample of half of the training data for the pipeline optimization process. This parameter can be useful for speeding up the pipeline optimization process, but may give less accurate performance estimates from cross-validation. TPOT now has more built-in configurations , including TPOT MDR and TPOT light, for both classification and regression problems. TPOTClassifier and TPOTRegressor now expose three useful internal attributes, fitted_pipeline_ , pareto_front_fitted_pipelines_ , and evaluated_individuals_ . These attributes are described in the API documentation . Oh, TPOT now has thorough API documentation . Check it out! Fixed a reproducibility issue where setting random_seed didn't necessarily result in the same results every time. This bug was present since TPOT v0.7. Refined input checking in TPOT. Removed Python 2 uncompliant code.",
@@ -169,6 +179,11 @@
"location": "/support/",
"text": "TPOT was developed in the \nComputational Genetics Lab\n with funding from the \nNIH\n under grant R01 AI117694. We're incredibly grateful for their support during the development of this project.\n\n\nThe TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.",
"title": "Support"
+ },
+ {
+ "location": "/related/",
+ "text": "Other Automated Machine Learning (AutoML) tools and related projects:\n\n\n\n\n\n\nName\n\n\nLanguage\n\n\nLicense\n\n\nDescription\n\n\n\n\n\n\nAuto-WEKA\n\n\nJava\n\n\nGPL-v3\n\n\nAutomated hyper-parameter tuning for WEKA models.\n\n\n\n\n\n\nauto-sklearn\n\n\nPython\n\n\nBSD-3-Clause\n\n\nAn automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.\n\n\n\n\n\n\nauto_ml\n\n\nPython\n\n\nMIT\n\n\nAutomated machine learning for analytics & production. Supports manual feature type declarations.\n\n\n\n\n\n\ndevol\n\n\nPython\n\n\nMIT\n\n\nAutomated deep neural network design via genetic programming.\n\n\n\n\n\n\nMLBox\n\n\nPython\n\n\nBSD-3-Clause\n\n\nAccurate hyper-parameter optimization in high-dimensional space with support for distributed computing.\n\n\n\n\n\n\nRecipe\n\n\nC\n\n\nGPL-v3\n\n\nMachine-learning pipeline optimization through genetic programming. Uses grammars to define pipeline structure.\n\n\n\n\n\n\nXcessiv\n\n\nPython\n\n\nApache-2.0\n\n\nA web-based application for quick, scalable, and automated hyper-parameter tuning and stacked ensembling in Python.",
+ "title": "Related"
}
]
}
\ No newline at end of file
diff --git a/docs/related/index.html b/docs/related/index.html
new file mode 100644
index 00000000..258d62d1
--- /dev/null
+++ b/docs/related/index.html
@@ -0,0 +1,235 @@
+
+
+
+
+
+
+
+
+
+
+ Related - TPOT
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
TPOT now supports sparse matrices with a new built-in TPOT configuration, "TPOT sparse". We are using a custom OneHotEncoder implementation that supports missing values and continuous features.
+
+
+
We have added an "early stopping" option for stopping the optimization process if no improvement is made within a set number of generations. Look up the early_stop parameter to access this functionality.
+
+
+
TPOT now reduces the number of duplicated pipelines between generations, which saves you time during the optimization process.
+
+
+
TPOT now supports custom scoring functions via the command-line mode.
+
+
+
We have added a new optional argument, periodic_checkpoint_folder, that allows TPOT to periodically save the best pipeline so far to a local folder during the optimization process.
+
+
+
TPOT no longer uses sklearn.externals.joblib when n_jobs=1 to avoid the potential freezing issue that scikit-learn suffers from.
+
+
+
We have added pandas as a dependency to read input datasets instead of numpy.recfromcsv. NumPy's recfromcsv function is unable to parse datasets with complex data types.
+
+
+
Fixed a bug where DEFAULT in the parameter(s) of a nested estimator raised a KeyError when exporting pipelines.
+
+
+
Fixed a bug related to setting random_state in nested estimators. The issue would happen with pipelines containing SelectFromModel (ExtraTreesClassifier as the nested estimator) or StackingEstimator if the nested estimator has a random_state parameter.
+
+
+
Fixed a bug in the missing value imputation function in TPOT to impute along columns instead of rows.
+
+
+
Refined input checking for sparse matrices in TPOT.
+
+
+
Version 0.8
TPOT now detects whether there are missing values in your dataset and replaces them with the median value of the column.
diff --git a/docs/search.html b/docs/search.html
index caea363e..23277c3f 100644
--- a/docs/search.html
+++ b/docs/search.html
@@ -88,6 +88,11 @@
Support
Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression.
TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other function will be maximized.
+my_module.scorer_name: You can also specify your own function or a full Python path to an existing one.
+
Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.
-sub
@@ -378,10 +388,46 @@
TPOT on the command line
-config
CONFIG_FILE
-
File path or string
-
A path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process.
+
String or file path
+
Operators and parameter configurations in TPOT:
-See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
+
+
Path for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process
+
string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors
+
string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies
+
string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.
+
+See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
+
+
+
+
-cf
+
CHECKPOINT_FOLDER
+
Folder path
+
+If supplied, a folder you created in which TPOT will periodically save the best pipeline so far while optimizing.
+
+This is useful in multiple cases:
+
+
Sudden death before TPOT could save an optimized pipeline
+The number of generations TPOT waits without improvement before ending the optimization process.
+
+Ends the optimization process if there is no improvement within the set number of generations.
-v
@@ -435,6 +481,10 @@
Scoring functions
tpot.export('tpot_mnist_pipeline.py')
+
+
my_module.scorer_name: you can also use your own scorer(y_true, y_pred) function through the command line. Just add the argument -scoring my_module.scorer and TPOT will import your module and take the function from there. TPOT also includes the current working directory when importing the module, so you can simply put the scorer in the same folder from which you run TPOT.
+Example: -scoring sklearn.metrics.auc will use the function auc from the sklearn.metrics module.
+
Built-in TPOT configurations
TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT.
To use any of these configurations, simply pass the string name of the configuration to the config_dict parameter (or -config on the command line). For example, to use the "TPOT light" configuration:
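A minimal sketch of that call (assuming the rest of the training script is in place):

```Python
from tpot import TPOTClassifier

# Select the built-in "TPOT light" configuration by name
tpot = TPOTClassifier(config_dict='TPOT light')
```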
@@ -550,6 +611,20 @@
Customizing TPOT's operators
When using the command-line interface, the configuration file specified in the -config parameter must name its custom TPOT configuration tpot_config. Otherwise, TPOT will not be able to locate the configuration dictionary.
For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for classification and regression in TPOT's source code.
Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import or use XGBoost in the pipelines it considers.
+
Crash/freeze issue with n_jobs > 1 under OSX or Linux
+
TPOT supports parallel computing for speeding up the optimization process, but it may crash/freeze with n_jobs > 1 under OSX or Linux as scikit-learn does, especially with large datasets.
+
One solution is to configure Python's multiprocessing module to use the forkserver start method (instead of the default fork) to manage the process pools. You can enable the forkserver mode globally for your program by putting the following code into your main script:
+
import multiprocessing
+
+# other imports, custom code, load data, define model...
+
+if __name__ == '__main__':
+ multiprocessing.set_start_method('forkserver')
+
+ # call scikit-learn utils or tpot utils with n_jobs > 1 here
+
Python dictionary, TPOT will use your custom configuration,
string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or
string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or
+
string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or
None, TPOT will use the default TPOTClassifier configuration.
See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
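To make the accepted values concrete, here is a minimal sketch (the one-operator custom dictionary is purely illustrative):

```Python
from tpot import TPOTClassifier

# None selects the default TPOTClassifier configuration
tpot_default = TPOTClassifier(config_dict=None)

# A built-in configuration can be selected by its string name
tpot_sparse = TPOTClassifier(config_dict='TPOT sparse')

# A custom configuration maps operator import paths to the
# parameter values TPOT may search over ({} means defaults only)
custom_config = {'sklearn.naive_bayes.GaussianNB': {}}
tpot_custom = TPOTClassifier(config_dict=custom_config)
```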
@@ -144,6 +147,25 @@ Flag indicating whether the TPOT instance will reuse the population from previou
Setting warm_start=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.
+periodic_checkpoint_folder: path string, optional (default: None)
+
+If supplied, a folder in which TPOT will periodically save the best pipeline so far while optimizing.
+Currently, TPOT saves once per generation, but no more often than once every 30 seconds.
+Useful in multiple cases:
+
+
Sudden death before TPOT could save an optimized pipeline
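+
+For example, a long run that checkpoints its best pipeline might look like the following minimal sketch (the folder name and the other parameter values are illustrative):
+
+```Python
+from tpot import TPOTClassifier
+
+# TPOT writes the best pipeline seen so far into this folder,
+# roughly once per generation and at most once every 30 seconds.
+tpot = TPOTClassifier(generations=100, population_size=50,
+                      periodic_checkpoint_folder='tpot_checkpoints',
+                      verbosity=2)
+```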
Python dictionary, TPOT will use your custom configuration,
string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors, or
string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies, or
+
string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices, or
None, TPOT will use the default TPOTRegressor configuration.
See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
@@ -577,6 +602,25 @@ Flag indicating whether the TPOT instance will reuse the population from previou
Setting warm_start=True can be useful for running TPOT for a short time on a dataset, checking the results, then resuming the TPOT run from where it left off.
+The number of generations TPOT waits without improvement before ending the optimization process.
+
+Ends the optimization process if there is no improvement within the given number of generations.
+
+
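+For example, to end a long regression run once it plateaus (a minimal sketch; the value of 10 is illustrative):
+
+```Python
+from tpot import TPOTRegressor
+
+# Stop early if the best score has not improved for 10 consecutive generations
+tpot = TPOTRegressor(generations=100, early_stop=10, verbosity=2)
+```
+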
verbosity: integer, optional (default=0)
How much information TPOT communicates while it's running.
diff --git a/docs_sources/installing.md b/docs_sources/installing.md
index 53b50e5e..a971a79b 100644
--- a/docs_sources/installing.md
+++ b/docs_sources/installing.md
@@ -12,19 +12,22 @@ TPOT is built on top of several existing Python libraries, including:
* [tqdm](https://github.com/tqdm/tqdm)
+* [stopit](https://github.com/glenfant/stopit)
+
+* [pandas](http://pandas.pydata.org)
Most of the necessary Python packages can be installed via the [Anaconda Python distribution](https://www.continuum.io/downloads), which we strongly recommend that you use. We also strongly recommend that you use Python 3 over Python 2 if you're given the choice.
-NumPy, SciPy, and scikit-learn can be installed in Anaconda via the command:
+NumPy, SciPy, scikit-learn and pandas can be installed in Anaconda via the command:
```Shell
-conda install numpy scipy scikit-learn
+conda install numpy scipy scikit-learn pandas
```
-DEAP, update_checker, and tqdm can be installed with `pip` via the command:
+DEAP, update_checker, tqdm and stopit can be installed with `pip` via the command:
```Shell
-pip install deap update_checker tqdm
+pip install deap update_checker tqdm stopit
```
**For the Windows users**, the pywin32 module is required if Python is NOT installed via the [Anaconda Python distribution](https://www.continuum.io/downloads) and can be installed with `pip`:
diff --git a/docs_sources/related.md b/docs_sources/related.md
new file mode 100644
index 00000000..8d257162
--- /dev/null
+++ b/docs_sources/related.md
@@ -0,0 +1,52 @@
+Other Automated Machine Learning (AutoML) tools and related projects:
+
+
A web-based application for quick, scalable, and automated hyper-parameter tuning and stacked ensembling in Python.
+
+
diff --git a/docs_sources/releases.md b/docs_sources/releases.md
index 5a5bf54a..cabe1064 100644
--- a/docs_sources/releases.md
+++ b/docs_sources/releases.md
@@ -1,3 +1,28 @@
+# Version 0.9
+
+* **TPOT now supports sparse matrices** with a new built-in TPOT configuration, "TPOT sparse". We are using a custom OneHotEncoder implementation that supports missing values and continuous features.
+
+* We have added an "early stopping" option for stopping the optimization process if no improvement is made within a set number of generations. Look up the `early_stop` parameter to access this functionality.
+
+* TPOT now reduces the number of duplicated pipelines between generations, which saves you time during the optimization process.
+
+* TPOT now supports custom scoring functions via the command-line mode.
+
+* We have added a new optional argument, `periodic_checkpoint_folder`, that allows TPOT to periodically save the best pipeline so far to a local folder during the optimization process.
+
+* TPOT no longer uses `sklearn.externals.joblib` when `n_jobs=1` to avoid the potential freezing issue [that scikit-learn suffers from](http://scikit-learn.org/stable/faq.html#why-do-i-sometime-get-a-crash-freeze-with-n-jobs-1-under-osx-or-linux).
+
+* We have added `pandas` as a dependency to read input datasets instead of `numpy.recfromcsv`. NumPy's `recfromcsv` function is unable to parse datasets with complex data types.
+
+* Fixed a bug where `DEFAULT` in the parameter(s) of a nested estimator raised a `KeyError` when exporting pipelines.
+
+* Fixed a bug related to setting `random_state` in nested estimators. The issue would happen with pipelines containing `SelectFromModel` (`ExtraTreesClassifier` as the nested estimator) or `StackingEstimator` if the nested estimator has a `random_state` parameter.
+
+* Fixed a bug in the missing value imputation function in TPOT to impute along columns instead of rows.
+
+* Refined input checking for sparse matrices in TPOT.
+
+
# Version 0.8
* **TPOT now detects whether there are missing values in your dataset** and replaces them with the median value of the column.
diff --git a/docs_sources/using.md b/docs_sources/using.md
index 939aceb7..ef8e633a 100644
--- a/docs_sources/using.md
+++ b/docs_sources/using.md
@@ -212,17 +212,19 @@ We recommend using the default parameter unless you understand how the crossover
'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'neg_log_loss', 'neg_mean_absolute_error',
'neg_mean_squared_error', 'neg_median_absolute_error', 'precision', 'precision_macro', 'precision_micro',
'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples',
-'recall_weighted', 'roc_auc'
+'recall_weighted', 'roc_auc', 'my_module.scorer_name*'
Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression.
TPOT assumes that any function with "error" or "loss" in the name is meant to be minimized, whereas any other function will be maximized.
+my_module.scorer_name: You can also specify your own function or a full Python path to an existing one.
+
Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.
-sub
@@ -265,10 +267,46 @@ Set this seed if you want your TPOT run to be reproducible with the same seed an
-config
CONFIG_FILE
-
File path or string
-
A path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process.
+
String or file path
+
Operators and parameter configurations in TPOT:
+
+
+
Path for configuration file: TPOT will use the path to a configuration file for customizing the operators and parameters that TPOT uses in the optimization process
+
string 'TPOT light', TPOT will use a built-in configuration with only fast models and preprocessors
+
string 'TPOT MDR', TPOT will use a built-in configuration specialized for genomic studies
+
string 'TPOT sparse': TPOT will use a configuration dictionary with a one-hot encoder and the operators normally included in TPOT that also support sparse matrices.
+
+See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
+
+
+
+
-cf
+
CHECKPOINT_FOLDER
+
Folder path
+
+If supplied, a folder you created in which TPOT will periodically save the best pipeline so far while optimizing.
+
+This is useful in multiple cases:
+
+
Sudden death before TPOT could save an optimized pipeline
+The number of generations TPOT waits without improvement before ending the optimization process.
-See the built-in configurations section for the list of configurations included with TPOT, and the custom configuration section for more information and examples of how to create your own TPOT configurations.
+Ends the optimization process if there is no improvement within the set number of generations.
-v
@@ -321,6 +359,9 @@ print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
```
+* **my_module.scorer_name**: you can also use your own `scorer(y_true, y_pred)` function through the command line. Just add the argument `-scoring my_module.scorer` and TPOT will import your module and take the function from there. TPOT also includes the current working directory when importing the module, so you can simply put the scorer in the same folder from which you run TPOT.
+Example: `-scoring sklearn.metrics.auc` will use the function `auc` from the `sklearn.metrics` module.
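+
+A minimal sketch of such a module (the file name `my_module.py` and the accuracy-based metric are illustrative, not part of TPOT):
+
+```Python
+# my_module.py -- a custom scorer usable via `-scoring my_module.scorer`
+from sklearn.metrics import accuracy_score
+
+def scorer(y_true, y_pred):
+    # Any scorer(y_true, y_pred) function that returns a float works here
+    return accuracy_score(y_true, y_pred)
+```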
+
# Built-in TPOT configurations
TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. Below is a list of the current built-in configurations that come with TPOT.
@@ -361,6 +402,17 @@ Note that TPOT MDR may be slow to run because the feature selection routines are