Evaluate dask-searchcv to speed up GridSearchCV #94

Closed
dhimmel opened this issue May 26, 2017 · 7 comments

@dhimmel
Member

dhimmel commented May 26, 2017

I'm excited about trying out dask-searchcv as a drop-in replacement for GridSearchCV. For info on dask-searchcv, see the blog post, GitHub repo, docs, and video.

I'm hoping using dask-searchcv for GridSearchCV will help solve the following problems:

  1. High memory usage (e.g. Memory issue #70) caused by joblib overhead.
  2. Slow pipeline performance when cross-validation is implemented properly. See the discussion at Finding which features are passed to the final estimator of an sklearn pipeline scikit-learn/scikit-learn#7536 (comment). The built-in GridSearchCV repeats the same transform steps for every parameter combination, making it brutally slow.
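Roughly, the swap should be as simple as changing an import. Here's a minimal sketch; the classifier and parameter grid are placeholders, not our actual pipeline:

```python
# Hedged sketch of the drop-in swap; classifier and grid are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# from sklearn.model_selection import GridSearchCV  # scikit-learn version
from dask_searchcv import GridSearchCV  # dask-searchcv drop-in

X, y = make_classification(n_samples=500, n_features=50, random_state=0)
param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}

search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```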

I initially mentioned dask-searchcv in #93 (comment), a PR by @patrick-miller. I thought this would be a good issue for @rdvelazquez to work on. @rdvelazquez are you interested?

We'll have to add some additional dependencies to our environment. It may be a good time to also update the package versions of existing packages (especially pandas).

@rdvelazquez
Member

Yeah, I'm definitely interested! I'll have a little time to look into this before next Tuesday's meet-up, so I'll come ready with some questions and suggestions to discuss.

@patrick-miller: Did you look into dask-searchcv yet? I don't want to duplicate work.

@patrick-miller
Member

patrick-miller commented May 26, 2017

I only read the docs, so go for it!

It isn't directly relevant to dask-searchcv, but one thing we need to keep in mind when implementing PCA on our features is that we want to run PCA only on the expression features and not the covariates. I have created a new issue for just this: #96.
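One possible shape for that, as a hypothetical sketch: a FeatureUnion with a hand-rolled column selector, so PCA touches only the expression columns. The column indices and n_components below are made up.

```python
# Hypothetical sketch: PCA on expression features only; covariates pass through.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns by integer index."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[:, self.columns]

# Placeholder indices: first 100 columns = expression, last 10 = covariates
expression_cols = np.arange(0, 100)
covariate_cols = np.arange(100, 110)

features = FeatureUnion([
    ('expression_pca', Pipeline([
        ('select', ColumnSelector(expression_cols)),
        ('pca', PCA(n_components=30)),
    ])),
    ('covariates', ColumnSelector(covariate_cols)),
])

X = np.random.RandomState(0).randn(200, 110)
X_combined = features.fit_transform(X)  # shape (200, 40): 30 PCA + 10 covariates
```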

@rdvelazquez
Member

I'm still looking into this and hope to post a WIP pull request early next week so you can see what I'm working on. So far dask-searchcv itself is very easy to implement, but it does not seem to eliminate all the issues associated with running PCA in the pipeline... I'll have more soon.

@rdvelazquez
Member

Here's my WIP notebook: dask-searchCV

@dhimmel and @patrick-miller, some questions on where to go from here:

  1. What should I include in the pull request? I'm thinking: update environment.yml to include dask-searchcv, and replace scikit-learn's grid search with dask-searchcv in the notebooks we want to keep up to date moving forward. (Which notebooks should those be? And should I wait until Add covariates-only model for comparison in the main notebook #93 is merged?)
  2. How important is speed? As you'll see in the WIP notebook, searching over a range of n_components can take a while, and it will only take longer if we increase the number of CV splits and/or widen the range of alpha we search across. If speed is an important issue, we could consider pre-processing (scale, PCA, scale) the whole training dataset; this would speed things up while still keeping training and testing data isolated... our cross-validation just wouldn't be totally accurate. (A sketch of this shortcut follows the list.)
  3. Any advice on selecting the range of n_components to search across? I'm tagging @htcai because he may have already looked at this. It seems like unbalanced genes/queries (which in our case will mostly be genes with few mutations) will perform better with fewer components (in the 30-100 range), whereas balanced genes/queries (equal numbers of mutated and non-mutated samples) will perform better with more components. This question may be better addressed as a separate issue/PR.
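For question 2, here's a hedged sketch of what that pre-processing shortcut could look like; the data, alpha grid, and n_components are placeholders:

```python
# Hypothetical speed shortcut: fit scale -> PCA -> scale once on the full
# training matrix, then grid-search only the classifier. Because the
# transforms are fit outside the CV splits, CV scores will be somewhat
# optimistic, as noted above.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from dask_searchcv import GridSearchCV

X_train, y_train = make_classification(n_samples=500, n_features=200,
                                       random_state=0)

pre = make_pipeline(StandardScaler(), PCA(n_components=50), StandardScaler())
X_reduced = pre.fit_transform(X_train)  # fit once, outside cross-validation

search = GridSearchCV(SGDClassifier(random_state=0),
                      {'alpha': [1e-4, 1e-3, 1e-2]}, cv=3)
search.fit(X_reduced, y_train)
```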

Thanks in advance for the input!

@dhimmel
Member Author

dhimmel commented Jun 6, 2017

What should I include in the pull request?

I'd open an exploration PR. In other words, a PR with the notebook you link to above. Then after #93 by @patrick-miller is merged, you can update the main notebooks.

Any advice on selecting the range of n_components to search across?

For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller). This way we can avoid the computational burden of grid searching across a range of component numbers. This will only work if the optimal n_components is consistent across classifiers with a similar number of positives. Let's save this research for another PR. I believe @htcai may also be working on it.
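Schematically, such a preset could be as simple as a lookup; the thresholds here are invented placeholders, not validated values:

```python
# Hypothetical preset: pick n_components from the minority-class size
# instead of grid searching it. The cut-points below are made up.
def preset_n_components(n_minority):
    if n_minority < 50:
        return 30
    elif n_minority < 500:
        return 50
    else:
        return 100
```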

If speed is an important issue we could consider pre-processing (scale, pca, scale) the whole training data-set; this would speed things up while still keeping training and testing data isolated... our cross validation just wouldn't be totally accurate.

Let's keep this in mind and decide later. But your reasoning is correct: in my experience here, the CV performance will not be majorly inflated by this shortcut.

@rdvelazquez
Member

Thanks @dhimmel! I'll open an exploration PR now.

For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller).

I looked into this a bit and didn't find a clear correlation between the number of positives and the ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline. @htcai, let me know if this is something you are working on... if not, I can look into it.

@dhimmel
Member Author

dhimmel commented Jun 6, 2017

This is getting off topic, but regarding:

I looked into this a bit and didn't find a clear correlation between the number of positives and the ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline.

@rdvelazquez when working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross-validation or a large number of StratifiedShuffleSplit iterations. See the discussion on #71 by @htcai. We ended up going with:

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
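A hedged usage sketch of that splitter (the classifier, data, and class weights are placeholders):

```python
# Illustrative only: score a placeholder classifier across the 100 splits.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
scores = cross_val_score(SGDClassifier(random_state=0), X, y,
                         cv=sss, scoring='roc_auc')
print(scores.mean(), scores.std())
```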

Let's open a new issue if we want to continue this discussion.
