Evaluate dask-searchcv to speed up GridSearchCV #94

Closed
dhimmel opened this issue May 26, 2017 · 7 comments

@dhimmel
Member

dhimmel commented May 26, 2017

I'm excited about trying out dask-searchcv as a drop-in replacement for GridSearchCV. For info on dask-searchcv, see the blog post, GitHub repo, docs, and video.

I'm hoping using dask-searchcv for GridSearchCV will help solve the following problems:

  1. High memory usage (e.g. Memory issue #70) caused by joblib overhead.
  2. Slow pipeline performance when cross-validation is implemented properly. See the discussion at Finding which features are passed to the final estimator of an sklearn pipeline scikit-learn/scikit-learn#7536 (comment). The built-in GridSearchCV repeats the same transform steps for every parameter combination, making it brutally slow.
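Roughly, the swap should be as simple as changing an import. Here's a minimal sketch; the classifier and parameter grid are placeholders, not our actual pipeline:

```python
# Hedged sketch of the drop-in swap; classifier and grid are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# from sklearn.model_selection import GridSearchCV  # scikit-learn version
from dask_searchcv import GridSearchCV  # dask-searchcv drop-in

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}

search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```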

I initially mentioned dask-searchcv in #93 (comment), a PR by @patrick-miller. I thought this would be a good issue for @rdvelazquez to work on. @rdvelazquez are you interested?

We'll have to add some additional dependencies to our environment. It may be a good time to also update the package versions of existing packages (especially pandas).

@rdvelazquez
Member

Yeah, I'm definitely interested! I'll have a little time to look into this before next Tuesday's meet-up, so I'll come ready with some questions and suggestions to discuss.

@patrick-miller: Did you look into dask-searchcv yet? I don't want to duplicate work.

@patrick-miller
Member

patrick-miller commented May 26, 2017

I only read the docs, so go for it!

It isn't directly relevant to dask-searchcv, but one thing we need to keep in mind when implementing PCA on our features is that we want to run PCA only on the expression features and not the covariates. I have created a new issue for just this: #96.
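One possible shape for that, as a hypothetical sketch: a FeatureUnion with a hand-rolled column selector, so PCA touches only the expression columns. The column indices and n_components below are made up.

```python
# Hypothetical sketch: PCA on expression features only; covariates pass through.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns by integer index."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[:, self.columns]

# Placeholder indices: first 100 columns = expression, last 10 = covariates
expression_cols = np.arange(0, 100)
covariate_cols = np.arange(100, 110)

features = FeatureUnion([
    ('expression_pca', Pipeline([
        ('select', ColumnSelector(expression_cols)),
        ('pca', PCA(n_components=30)),
    ])),
    ('covariates', ColumnSelector(covariate_cols)),
])

X = np.random.RandomState(0).randn(200, 110)
X_combined = features.fit_transform(X)  # shape (200, 40): 30 PCA + 10 covariates
```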

@rdvelazquez
Member

I'm still looking into this and hope to post a WIP pull request early next week so you can see what I'm working on. So far dask-searchcv itself is very easy to implement, but it does not seem to eliminate all the issues associated with running PCA in the pipeline... I'll have more soon.

@rdvelazquez
Member

Here's my WIP notebook: dask-searchCV

@dhimmel and @patrick-miller, some questions on where to go from here:

  1. What should I include in the pull request? I'm thinking: update environment.yml to include dask-searchcv, and replace scikit-learn's grid search with dask-searchcv in the notebooks we want to keep up to date moving forward. (Which notebooks should those be? And should I wait until Add covariates-only model for comparison in the main notebook #93 is merged?)
  2. How important is speed? As you'll see in the WIP notebook, searching over a range of n_components can take a while, and it will only take longer if we increase the number of CV splits and/or widen the range of alpha we search across. If speed is an important issue, we could consider pre-processing (scale, PCA, scale) the whole training dataset; this would speed things up while still keeping training and testing data isolated... our cross-validation just wouldn't be totally accurate. (A sketch of this shortcut follows the list.)
  3. Any advice on selecting the range of n_components to search across? I'm tagging @htcai because he may have already looked at this. It seems like unbalanced genes/queries (which in our case will mostly be genes with few mutations) will perform better with fewer components (in the 30-100 range), whereas balanced genes/queries (equal numbers of mutated and non-mutated samples) will perform better with more components. This question may be better addressed as a separate issue/PR.
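For question 2, here's a hedged sketch of what that pre-processing shortcut could look like; the data, alpha grid, and n_components are placeholders:

```python
# Hypothetical speed shortcut: fit scale -> PCA -> scale once on the full
# training matrix, then grid-search only the classifier. Because the
# transforms are fit outside the CV splits, CV scores will be somewhat
# optimistic, as noted above.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from dask_searchcv import GridSearchCV

X_train, y_train = make_classification(n_samples=500, n_features=200,
                                       random_state=0)

pre = make_pipeline(StandardScaler(), PCA(n_components=50), StandardScaler())
X_reduced = pre.fit_transform(X_train)  # fit once, outside cross-validation

search = GridSearchCV(SGDClassifier(random_state=0),
                      {'alpha': [1e-4, 1e-3, 1e-2]}, cv=3)
search.fit(X_reduced, y_train)
```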

Thanks in advance for the input!

@dhimmel
Member Author

dhimmel commented Jun 6, 2017

What should I include in the pull request?

I'd open an exploration PR. In other words, a PR with the notebook you link to above. Then after #93 by @patrick-miller is merged, you can update the main notebooks.

Any advice on selecting the range of n_components to search across?

For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller). This way we can avoid the computational burden of grid searching across a range of component numbers. This will only work if the optimal n_components is consistent across classifiers with a similar number of positives. Let's save this research for another PR. I believe @htcai may also be working on it.
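Schematically, such a preset could be as simple as a lookup; the thresholds here are invented placeholders, not validated values:

```python
# Hypothetical preset: pick n_components from the minority-class size
# instead of grid searching it. The cut-points below are made up.
def preset_n_components(n_minority):
    if n_minority < 50:
        return 30
    elif n_minority < 500:
        return 50
    else:
        return 100
```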

If speed is an important issue we could consider pre-processing (scale, pca, scale) the whole training data-set; this would speed things up while still keeping training and testing data isolated... our cross validation just wouldn't be totally accurate.

Let's keep this in mind and decide later. But your reasoning is correct: in my experience here, the CV performance will not be majorly inflated by this shortcut.

@rdvelazquez
Member

Thanks @dhimmel! I'll open an exploration PR now.

For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller).

I looked into this a bit and didn't find a clear correlation between the number of positives and the ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline. @htcai, let me know if this is something you are working on... if not, I can look into it.

@dhimmel
Member Author

dhimmel commented Jun 6, 2017

This is getting off topic, but regarding:

I looked into this a bit and didn't find a clear correlation between the number of positives and the ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline.

@rdvelazquez when working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross-validation or a large number of StratifiedShuffleSplit iterations. See the discussion on #71 by @htcai. We ended up going with:

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
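A hedged usage sketch of that splitter (the classifier, data, and class weights are placeholders):

```python
# Illustrative only: score a placeholder classifier across the 100 splits.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
scores = cross_val_score(SGDClassifier(random_state=0), X, y,
                         cv=sss, scoring='roc_auc')
print(scores.mean(), scores.std())
```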

Let's open a new issue if we want to continue this discussion.
