update environment and notebook 2 with dask-searchcv #104

Merged: 2 commits into cognoma:master on Jun 23, 2017

Conversation

@rdvelazquez (Member):

Here's the very simple PR to add dask-searchcv to the environment and dask-searchcv's implementation of GridSearchCV to notebook 2.

A few notes:

  1. I evaluated the speed gains of dask-searchcv in Upload dask-searchCV notebook #101, where dask-searchcv seemed to make GridSearchCV about 3 times faster. The wall times for GridSearchCV in this PR are similar to those from @dhimmel's notebook 2 with the scikit-learn GridSearchCV, not because dask-searchcv doesn't speed things up, but because my computer is slow. :(
  2. It's probably worth having someone else update their conda environment and run notebook 2 to confirm that everything works.
  3. I didn't rename GridSearchCV. I debated renaming it to include "dask" (e.g. from dask_searchcv import GridSearchCV as dask_GridSearchCV), which would make it easier to see that it isn't the scikit-learn implementation, but there's potential for confusion either way, so I kept the name the same.
  4. Just as an FYI - I tried using dask_searchcv in the marginal gain notebook and got the following error at line 30:
# Train with covariates data
all_best_estimator_aurocs_covariates, all_grid_aurocs_covariates = train_and_evaluate(covariates, pipeline)


AttributeError: 'GridSearchCV' object has no attribute 'grid_scores_'

I'm not sure what caused this. I searched the dask-searchcv GitHub repo for 'grid_scores_' and didn't see anything. If this becomes an issue in the future, we can look into it further or revert that specific case to scikit-learn's GridSearchCV.

Closes #94

@dhimmel (Member) commented Jun 22, 2017

I agree that from dask_searchcv import GridSearchCV is the most pythonic implementation. This is supposed to be swapped in for GridSearchCV so it makes sense not to change the name.
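
For reference, the swap amounts to a one-line import change; everything downstream keeps the scikit-learn call signature. A minimal sketch (the parameter grid below is illustrative, not the notebook's actual grid):

# dask-searchcv mirrors scikit-learn's GridSearchCV API, so only the import changes.
from dask_searchcv import GridSearchCV  # was: from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}  # illustrative values
cv_pipeline = GridSearchCV(
    estimator=SGDClassifier(random_state=0),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3),
)
# cv_pipeline.fit(X_train, y_train)  # identical call to the scikit-learn version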

Regarding the 'GridSearchCV' object has no attribute 'grid_scores_' error: that's expected. See http://scikit-learn.org/stable/whats_new.html#model-selection-enhancements-and-api-changes:

The new cv_results_ attribute (of model_selection.GridSearchCV and model_selection.RandomizedSearchCV) introduced in lieu of the grid_scores_ attribute is a dict of 1D arrays with elements in each array corresponding to the parameter settings (i.e. search candidates).
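
So the marginal-gain notebook would need to read cv_results_ instead. A minimal sketch (grid_search here is a hypothetical name for the fitted search object):

import pandas as pd

# cv_results_ is a dict of 1D arrays with one element per parameter setting,
# so it converts directly to a DataFrame for inspection.
cv_results_df = pd.DataFrame(grid_search.cv_results_)
print(cv_results_df[['params', 'mean_test_score', 'std_test_score']])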

@dhimmel (Member) left a review:

Do you have any indication how dask-searchcv changed the runtime on your machine?

Here are the results with dask-searchcv:

Fitting CV for model: full
	runtime: 0:03:09.570861
Fitting CV for model: expressions
	runtime: 0:03:14.775702
Fitting CV for model: covariates
	runtime: 0:00:27.543525

Also, GitHub is not showing the diff for 2.mutation-classifier.ipynb. I'm assuming an image changed, causing the large diff. Any idea what changed, from a local git diff?

environment.yml (review comment on an outdated diff):

- conda-forge::vega=0.4.4
- pip:
  - neo4j-driver==1.3.1

@dhimmel (Member):

There should be a newline at the end of file, but no spaces on the newline

environment.yml (review comment on an outdated diff):

@@ -12,6 +12,8 @@ dependencies:
 - anaconda::setuptools=27.2.0
 - anaconda::statsmodels=0.8.0
 - conda-forge::altair=1.2.0
+- conda-forge::dask-searchcv
@dhimmel (Member):

Let's also specify a dask channel and version

@rdvelazquez (Member Author):

I think a version number for dask-searchcv would be good. Are you also thinking we would explicitly import dask? So it would be:

conda-forge::dask=0.14.3
conda-forge::dask-searchcv=0.0.2

@dhimmel (Member):

Explicitly add dask-searchcv and dask to environment.yml, both with versions. As far as importing dask in the notebook, do so only if it's necessary. Also consider using the anaconda channel for dask unless that doesn't work; currently, we prioritize the anaconda channel over conda-forge.

@rdvelazquez (Member Author):

I added version numbers to both dask and dask-searchcv and used anaconda for dask. Everything works fine.
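
For the record, the resulting environment.yml entries would presumably look something like this (a sketch based on the versions discussed above; the merged file is authoritative):

- anaconda::dask=0.14.3
- conda-forge::dask-searchcv=0.0.2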

@rdvelazquez (Member Author):

Do you have any indication how dask-searchcv changed the runtime on your machine?

dask-searchcv was about twice as fast. Average of three runs without dask-searchcv = 14.5 minutes; average of three runs with dask-searchcv = 6.5 minutes.

Any idea what changed from a local git diff?

Is this what you're looking for?

+++ b/2.mutation-classifier.ipynb
@@ -23,7 +23,8 @@
     "from sklearn.decomposition import PCA\n",
     "from sklearn.linear_model import SGDClassifier\n",
     "from sklearn.metrics import roc_auc_score, roc_curve\n",
-    "from sklearn.model_selection import train_test_split, GridSearchCV, Strati
fiedKFold\n",
+    "from sklearn.model_selection import train_test_split, StratifiedKFold\n",
+    "from dask_searchcv import GridSearchCV\n",
     "from sklearn.pipeline import Pipeline, FeatureUnion\n",
     "from sklearn.preprocessing import StandardScaler, FunctionTransformer\n",
     "from vega import Vega\n",
@@ -85,14 +86,15 @@
   {
    "cell_type": "code",
    "execution_count": 4,
-   "metadata": {},
+   "metadata": {
+    "collapsed": false
+   },
    "outputs": [

@dhimmel (Member) commented Jun 23, 2017

Is this what you're looking for?

Yeah. Just make sure all changes in there make sense.

@dhimmel (Member) commented Jun 23, 2017

dask-searchcv was about twice as fast. Average of three runs without dask-searchcv = 14.5 minutes; average of three runs with dask-searchcv = 6.5 minutes.

Nice, hopefully we also reduced the memory footprint. Was this with only 1 core? In production, we could parallelize.
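
A sketch of what that might look like, assuming the scheduler and n_jobs keyword arguments that dask-searchcv 0.0.2 documents (verify against the installed version):

from dask_searchcv import GridSearchCV
from sklearn.linear_model import SGDClassifier

param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}  # illustrative values

# Local threaded scheduler across all available cores.
cv = GridSearchCV(SGDClassifier(random_state=0), param_grid,
                  scheduler='threading', n_jobs=-1)

# Or hand the search to a dask.distributed cluster in production:
# from dask.distributed import Client
# client = Client('scheduler-address:8786')  # hypothetical address
# cv = GridSearchCV(SGDClassifier(random_state=0), param_grid, scheduler=client)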

@rdvelazquez (Member Author):

Just make sure all changes in there make sense.

Both changes make sense to me. The first change was importing GridSearchCV from dask_searchcv. The second change was not something I edited, but rather the output of how long it took to load the data: my notebook only output wall time, whereas yours also output CPU times. This is likely just a result of running my notebook on Windows (see the related Stack Overflow question).

@rdvelazquez (Member Author):

Was this with only 1 core?

Yep, only 1 core. It will be cool to see how fast this runs in production!
The slowest part of the notebook for me is now loading the data. It may soon be time to revisit pickle (#85).
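
In case it helps when revisiting #85, a minimal sketch of the pickle approach (file paths hypothetical):

import pandas as pd

# One-time conversion: parse the TSV once, then cache a binary pickle.
expression_df = pd.read_table('download/expression-matrix.tsv.bz2', index_col=0)
expression_df.to_pickle('download/expression-matrix.pkl')

# Subsequent runs load the pickle and skip TSV parsing entirely.
expression_df = pd.read_pickle('download/expression-matrix.pkl')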

@dhimmel dhimmel merged commit 57f7bd0 into cognoma:master Jun 23, 2017
@rdvelazquez rdvelazquez deleted the dask-searchcv-update branch June 24, 2017 14:05