
Memory issue #70

Closed
htcai opened this issue Nov 19, 2016 · 9 comments

Comments

@htcai
Member

htcai commented Nov 19, 2016

I am running my notebook, obtained by revising the latest 2.TCGA-MLexample, on Ubuntu on my laptop (8 GB RAM & 8 GB swap). I used over-sampling, which increased the size of the training data by about 7%. My machine keeps running into memory problems: OSError: [Errno 12] Cannot allocate memory, as well as other exceptions.

There is no problem after I discard the pipeline. I will use my MacBook (which uses compressed memory) to run the notebook, but it will be much slower.

@dhimmel
Member

dhimmel commented Dec 9, 2016

Okay, I think this memory issue probably started after we merged #54. It may be worth considering reverting our pipeline to the old, incorrect ordering.

@htcai
Member Author

htcai commented Dec 9, 2016

@dhimmel Thanks for your reply! In the older version of the pipeline, k is fixed inside the pipeline, while in the current version we feed a singleton list (e.g., [2000]) to the grid search. Does this lead to the difference that GridSearchCV fits SelectKBest only once in the former case, while it is refitted for each training fold in the latter?

@dhimmel
Member

dhimmel commented Dec 9, 2016

Does this lead to the difference that GridSearchCV will run SelectKBest only once in the former case while it will be run for each training fold in the latter?

Previously, the grid_search only included the SGDClassifier. Now the grid_search includes the entire pipeline. Therefore, cross-validation now refits the feature selection (if used) and standardization (if used) on every training fold rather than once on the entire X_train.
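The contrast can be sketched as follows; the variable names, k value, and alpha grid are illustrative, not the notebook's actual ones:

```python
# Sketch of the two orderings discussed above (synthetic data, assumed parameters).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 50)
y = rng.randint(0, 2, 100)

# Old ordering (cheap, but leaks information across folds): feature
# selection is fit once on all of X, and only the classifier is
# cross-validated inside GridSearchCV.
select = SelectKBest(f_classif, k=10).fit(X, y)
X_reduced = select.transform(X)
grid_old = GridSearchCV(SGDClassifier(random_state=0),
                        {'alpha': [1e-4, 1e-3]}, cv=3)
grid_old.fit(X_reduced, y)

# New ordering (correct, but memory-hungry): the whole pipeline is
# cross-validated, so SelectKBest and StandardScaler are refit on every
# training fold for every parameter combination.
pipeline = Pipeline([
    ('select', SelectKBest(f_classif)),
    ('scale', StandardScaler()),
    ('classify', SGDClassifier(random_state=0)),
])
grid_new = GridSearchCV(pipeline,
                        {'select__k': [10], 'classify__alpha': [1e-4, 1e-3]},
                        cv=3)
grid_new.fit(X, y)
```

The second form keeps per-fold copies of the transformed training data alive during the search, which is consistent with the higher peak memory reported in this thread.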

@htcai
Member Author

htcai commented Dec 9, 2016

Daniel, thank you for your confirmation! This information is very helpful.

@KT12
Contributor

KT12 commented Jan 16, 2017

There is a fix suggested here.

After implementing the fix, I tried using Isomap to do some dimensionality reduction but my Jupyter Notebook still yielded OSError: [Errno 12] Cannot allocate memory.

@dhimmel
Member

dhimmel commented Jan 16, 2017

@KT12 So you kept n_jobs=1 in sklearn.manifold.Isomap? It's possible that even with only one job running Isomap, you could run out of memory.

You can also set n_jobs=1 in GridSearchCV; I'm not sure exactly which stage is causing you to run out of memory. See #43 (comment) for more information on memory usage at different stages of our pipeline.
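A minimal sketch of the single-worker setup suggested above, on synthetic data (the dimensionality and alpha grid are assumptions, not the notebook's values):

```python
# Each extra worker holds its own copy of the data, so parallelism
# multiplies peak memory; n_jobs=1 keeps both steps single-process.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.manifold import Isomap
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(60, 20)
y = rng.randint(0, 2, 60)

# Single-job Isomap embedding for dimensionality reduction.
embedding = Isomap(n_components=5, n_jobs=1).fit_transform(X)

# Single-job grid search over the classifier.
grid = GridSearchCV(SGDClassifier(random_state=0),
                    {'alpha': [1e-4, 1e-3]}, cv=3, n_jobs=1)
grid.fit(embedding, y)
```

Even with n_jobs=1 everywhere, Isomap itself builds a dense pairwise-distance structure, so it can still exhaust memory on large inputs.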

@KT12
Contributor

KT12 commented Jan 16, 2017

I kept the default n_jobs=1 in Isomap. I'll try it again using all cores.

The SGDClassifier had n_jobs=-1.

On various attempts, it's been mostly the classifier that ran out of memory. The few times I was able to run the classifier, the "Investigate the predictions" block is what gave me an issue.

@dhimmel
Member

dhimmel commented Jan 16, 2017

I kept the default n_jobs=1 in Isomap. I'll try it again using all cores.

That may speed things up, but it will only make the memory issues worse!

@htcai
Member Author

htcai commented Jan 16, 2017

I just found and experimented with a solution for the memory issue in Ubuntu 14.04. For 16.04, a similar solution is also available. Mainly, more swap space can be added via a swap file. I added a swap file of 16 GB and finished running the latest version of the sample notebook 2.TCGA-MLexample.ipynb for the first time. I will restore the usage of the pipeline in my own notebook.

However, it takes ~40 min to finish the training, and the highest memory usage is beyond 25 GB according to the activity monitor.
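For reference, the usual swap-file recipe on Ubuntu looks roughly like the following; the size and path are examples, and the commands require root:

```shell
# Create and enable a 16 GB swap file (path and size are illustrative).
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile   # restrict access; swapon warns otherwise
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify the new swap is active.
swapon --show

# To persist across reboots, append this line to /etc/fstab:
# /swapfile none swap sw 0 0
```

On filesystems where fallocate is not supported for swap, `dd if=/dev/zero of=/swapfile bs=1M count=16384` is the common fallback.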
