Evaluate dask-searchcv to speed up GridSearchCV #94
Yea, I'm definitely interested! I'll have a little time to look into this before next Tuesday's meet-up, so I'll come ready with some questions and suggestions to discuss. @patrick-miller: did you look into dask-searchcv yet? I don't want to duplicate work. |
I only read the docs, so go for it! It isn't directly relevant to dask-searchcv, but one thing we need to keep in mind when implementing PCA on our features is that we want to run PCA only on the expression features and not on the covariates. I have created a new issue just for this: #96. |
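A minimal sketch of the "PCA on expression features only" idea, assuming a scikit-learn version that provides ColumnTransformer (0.20 or later). The data is synthetic and the column layout (expression columns first, covariates last) is a hypothetical one chosen for illustration, not the project's actual feature matrix.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 "expression" columns followed by 3 "covariate" columns.
rng = np.random.RandomState(0)
n_samples, n_expression, n_covariates = 60, 20, 3
X = rng.normal(size=(n_samples, n_expression + n_covariates))
y = np.tile([0, 1], n_samples // 2)

# Hypothetical layout: expression features occupy the first columns.
expression_cols = list(range(n_expression))

# Scale and reduce only the expression columns; pass covariates through untouched.
expression_pca = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5, random_state=0)),
])
preprocess = ColumnTransformer(
    transformers=[("expression", expression_pca, expression_cols)],
    remainder="passthrough",
)

model = Pipeline([
    ("preprocess", preprocess),
    ("classify", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)

# 5 principal components plus the 3 untouched covariates.
X_transformed = model.named_steps["preprocess"].transform(X)
print(X_transformed.shape)
```

Because the PCA step lives inside the pipeline, it is refit on each training fold during cross-validation rather than on the full dataset.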
I'm still looking into this and hope to post a WIP pull request early next week so you can see what I'm working on. So far dask-searchcv itself is very easy to implement but it does not seem to negate all the issues associated with running PCA in the pipeline... I'll have more soon. |
Here's my WIP notebook: dask-searchCV. @dhimmel and @patrick-miller, some questions on where to go from here:
Thanks in advance for the input! |
I'd open an exploration PR. In other words, a PR with the notebook you link to above. Then after #93 by @patrick-miller is merged, you can update the main notebooks.
For efficiency reasons, I'm hoping we can have presets for
Let's keep this in mind and decide later. But your reasoning is correct. In my experience here, the CV performance will not be majorly inflated due to this shortcut. |
Thanks @dhimmel I'll open an exploration PR now.
I looked into this a bit and didn't find a clear correlation between number of positives and ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline. @htcai let me know if this is something you are working on... if not I can look into it. |
This is getting off topic, but on the topic of:
@rdvelazquez when working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross-validation or a large number of stratified shuffle splits, for example sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0). Let's open a new issue if we want to continue this discussion. |
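To make the suggestion above concrete, here is a rough sketch of scoring a classifier over many stratified shuffle splits when positives are scarce. The dataset is synthetic, and the classifier and the roc_auc metric are assumptions for illustration only; the splitter arguments are the ones quoted in the comment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic, imbalanced data: 20 positives out of 200 samples.
rng = np.random.RandomState(0)
n_samples, n_positives = 200, 20
X = rng.normal(size=(n_samples, 5))
y = np.zeros(n_samples, dtype=int)
y[:n_positives] = 1
X[y == 1] += 1.0  # shift positives so there is some signal to detect

# Many random stratified splits give a steadier estimate than a single k-fold pass.
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=sss, scoring="roc_auc"
)
print(len(scores), scores.mean(), scores.std())
```

Stratification keeps the positive fraction roughly constant in every test split, so each split contains positives and the metric stays defined.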
I'm excited about trying out dask-searchcv as a drop-in replacement for GridSearchCV. For info on dask-searchcv, see the blog post, github, docs, and video. I'm hoping using dask-searchcv for GridSearchCV will help solve the following problems:
I initially mentioned dask-searchcv in #93 (comment), a PR by @patrick-miller. I thought this would be a good issue for @rdvelazquez to work on. @rdvelazquez are you interested?
We'll have to add some additional dependencies to our environment. It may be a good time to also update the package versions of existing packages (especially pandas).
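As a sketch of what "drop-in replacement" could look like in practice: the snippet below assumes the dask_searchcv package exposes a GridSearchCV class with the same constructor and fit interface as scikit-learn's, and falls back to scikit-learn's own class when the package is not installed. The pipeline, parameter grid, and data are hypothetical placeholders, not the project's real configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Only the import changes; everything below is backend-agnostic.
try:
    from dask_searchcv import GridSearchCV  # assumed to mirror the sklearn interface
    backend = "dask-searchcv"
except ImportError:
    from sklearn.model_selection import GridSearchCV
    backend = "scikit-learn"

# Synthetic stand-in data with balanced classes.
rng = np.random.RandomState(0)
X = rng.normal(size=(90, 10))
y = np.tile([0, 1, 1], 30)
X[y == 1] += 0.5

pipeline = Pipeline([
    ("pca", PCA(random_state=0)),
    ("classify", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: 2 x 2 = 4 candidate parameter combinations.
param_grid = {
    "pca__n_components": [2, 5],
    "classify__C": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
print(backend, search.best_params_)
```

With PCA inside the searched pipeline, the same PCA fit is shared by every value of classify__C within a fold, which is the kind of repeated work a smarter search scheduler could avoid.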