update environment and notebook 2 with dask-searchcv #104

Merged: 2 commits into cognoma:master on Jun 23, 2017

Conversation

@rdvelazquez (Member):

Here's the very simple PR to add dask-searchcv to the environment and dask-searchcv's implementation of GridSearchCV to notebook 2.

A few notes:

  1. I evaluated the speed gains of dask-searchcv in Upload dask-searchCV notebook #101, where dask-searchcv seemed to make GridSearchCV about 3 times faster. The wall times for GridSearchCV in this PR are similar to those from @dhimmel's notebook 2 with the scikit-learn GridSearchCV, not because dask-searchcv doesn't speed things up, but because my computer is slow. :(
  2. It's probably worth having someone else update their conda environment and run notebook 2 to confirm that everything works.
  3. I didn't rename GridSearchCV. I debated renaming it to include "dask" (e.g. from dask_searchcv import GridSearchCV as dask_GridSearchCV), which would make it easier to see that it isn't the scikit-learn implementation, but there's potential for confusion either way, so I kept the name the same.
  4. Just as an FYI - I tried using dask_searchcv in the marginal gain notebook and got the following error at line 30:
# Train with covariates data
all_best_estimator_aurocs_covariates, all_grid_aurocs_covariates = train_and_evaluate(covariates, pipeline)


AttributeError: 'GridSearchCV' object has no attribute 'grid_scores_'

I'm not sure what caused this. I searched the dask-searchcv GitHub repo for 'grid_scores_' and didn't see anything. If this becomes an issue in the future, we can look into it further or revert that specific case to scikit-learn's GridSearchCV.

Closes #94

@dhimmel (Member) commented Jun 22, 2017

I agree that from dask_searchcv import GridSearchCV is the most pythonic implementation. This is supposed to be swapped in for GridSearchCV so it makes sense not to change the name.
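
For reference, the swap amounts to a one-line import change; everything downstream keeps the scikit-learn call signature. A minimal sketch (the parameter grid below is illustrative, not the notebook's actual grid):

# dask-searchcv mirrors scikit-learn's GridSearchCV API, so only the import changes.
from dask_searchcv import GridSearchCV  # was: from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}  # illustrative values
cv_pipeline = GridSearchCV(
    estimator=SGDClassifier(random_state=0),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3),
)
# cv_pipeline.fit(X_train, y_train)  # identical call to the scikit-learn version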

Regarding the 'GridSearchCV' object has no attribute 'grid_scores_' error: that's expected. See http://scikit-learn.org/stable/whats_new.html#model-selection-enhancements-and-api-changes:

The new cv_results_ attribute (of model_selection.GridSearchCV and model_selection.RandomizedSearchCV) introduced in lieu of the grid_scores_ attribute is a dict of 1D arrays with elements in each array corresponding to the parameter settings (i.e. search candidates).
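
So the marginal-gain notebook would need to read cv_results_ instead. A minimal sketch (grid_search here is a hypothetical name for the fitted search object):

import pandas as pd

# cv_results_ is a dict of 1D arrays with one element per parameter setting,
# so it converts directly to a DataFrame for inspection.
cv_results_df = pd.DataFrame(grid_search.cv_results_)
print(cv_results_df[['params', 'mean_test_score', 'std_test_score']])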

@dhimmel (Member) left a review:

Do you have any indication how dask-searchcv changed the runtime on your machine?

Here are the results with dask-searchcv:

Fitting CV for model: full
	runtime: 0:03:09.570861
Fitting CV for model: expressions
	runtime: 0:03:14.775702
Fitting CV for model: covariates
	runtime: 0:00:27.543525

Also, GitHub is not showing the diff for 2.mutation-classifier.ipynb. I'm assuming an image changed, causing the large diff. Any idea what changed, from a local git diff?

environment.yml (review comment on an outdated diff):

- conda-forge::vega=0.4.4
- pip:
  - neo4j-driver==1.3.1

@dhimmel (Member):

There should be a newline at the end of file, but no spaces on the newline

environment.yml (review comment on an outdated diff):

@@ -12,6 +12,8 @@ dependencies:
 - anaconda::setuptools=27.2.0
 - anaconda::statsmodels=0.8.0
 - conda-forge::altair=1.2.0
+- conda-forge::dask-searchcv
@dhimmel (Member):

Let's also specify a dask channel and version

@rdvelazquez (Member Author):

I think a version number for dask-searchcv would be good. Are you also thinking we would explicitly import dask? So it would be:

conda-forge::dask=0.14.3
conda-forge::dask-searchcv=0.0.2

@dhimmel (Member):

Explicitly add dask-searchcv and dask to environment.yml, both with versions. As far as importing dask in the notebook, do so only if it's necessary. Also consider using the anaconda channel for dask unless that doesn't work; currently, we prioritize the anaconda channel over conda-forge.

@rdvelazquez (Member Author):

I added version numbers to both dask and dask-searchcv and used anaconda for dask. Everything works fine.
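
For the record, the resulting environment.yml entries would presumably look something like this (a sketch based on the versions discussed above; the merged file is authoritative):

- anaconda::dask=0.14.3
- conda-forge::dask-searchcv=0.0.2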

@rdvelazquez (Member Author):

Do you have any indication how dask-searchcv changed the runtime on your machine?

dask-searchcv was about twice as fast. Average of three runs without dask-searchcv = 14.5 minutes; average of three runs with dask-searchcv = 6.5 minutes.

Any idea what changed from a local git diff?

Is this what you're looking for?

+++ b/2.mutation-classifier.ipynb
@@ -23,7 +23,8 @@
     "from sklearn.decomposition import PCA\n",
     "from sklearn.linear_model import SGDClassifier\n",
     "from sklearn.metrics import roc_auc_score, roc_curve\n",
-    "from sklearn.model_selection import train_test_split, GridSearchCV, Strati
fiedKFold\n",
+    "from sklearn.model_selection import train_test_split, StratifiedKFold\n",
+    "from dask_searchcv import GridSearchCV\n",
     "from sklearn.pipeline import Pipeline, FeatureUnion\n",
     "from sklearn.preprocessing import StandardScaler, FunctionTransformer\n",
     "from vega import Vega\n",
@@ -85,14 +86,15 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "metadata": {},
+   "metadata": {
+    "collapsed": false
+   },
    "outputs": [

@dhimmel (Member) commented Jun 23, 2017

Is this what you're looking for?

Yeah. Just make sure all changes in there make sense.

@dhimmel (Member) commented Jun 23, 2017

dask-searchcv was about twice as fast. Average of three runs without dask-searchcv = 14.5 minutes; average of three runs with dask-searchcv = 6.5 minutes.

Nice, hopefully we also reduced the memory footprint. Was this with only 1 core? In production, we could parallelize.
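
A sketch of what that might look like, assuming the scheduler and n_jobs keyword arguments that dask-searchcv 0.0.2 documents (verify against the installed version):

from dask_searchcv import GridSearchCV
from sklearn.linear_model import SGDClassifier

param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}  # illustrative values

# Local threaded scheduler across all available cores.
cv = GridSearchCV(SGDClassifier(random_state=0), param_grid,
                  scheduler='threading', n_jobs=-1)

# Or hand the search to a dask.distributed cluster in production:
# from dask.distributed import Client
# client = Client('scheduler-address:8786')  # hypothetical address
# cv = GridSearchCV(SGDClassifier(random_state=0), param_grid, scheduler=client)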

@rdvelazquez (Member Author):

Just make sure all changes in there make sense.

Both changes make sense to me. The first change was importing GridSearchCV from dask_searchcv. The second change was not something I edited, but rather the output of how long it took to load the data: my notebook only output wall time, whereas yours also output CPU times. This is likely just a result of running my notebook on Windows (see the related Stack Overflow question).

@rdvelazquez (Member Author):

Was this with only 1 core?

Yep, only 1 core. It will be cool to see how fast this runs in production!
The slowest part of the notebook for me is now loading the data. It may soon be time to revisit pickle (#85).
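
In case it helps when revisiting #85, a minimal sketch of the pickle approach (file paths hypothetical):

import pandas as pd

# One-time conversion: parse the TSV once, then cache a binary pickle.
expression_df = pd.read_table('download/expression-matrix.tsv.bz2', index_col=0)
expression_df.to_pickle('download/expression-matrix.pkl')

# Subsequent runs load the pickle and skip TSV parsing entirely.
expression_df = pd.read_pickle('download/expression-matrix.pkl')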

@dhimmel dhimmel merged commit 57f7bd0 into cognoma:master Jun 23, 2017
@rdvelazquez rdvelazquez deleted the dask-searchcv-update branch June 24, 2017 14:05