Hi Kevin, thanks for creating this wonderful library! I had been using it without problems, but recently I seem to be running into (GPU) OOMs even though I have a 32GB V100. Basically, the Jupyter notebook always dies at validation (k-means) time of the first epoch.

I've already set both the train and test batch sizes to 4, and the embedder's output dimension to 16, which is obviously pretty small. I also tried to force CPU-only execution by reinstalling faiss-cpu (naively setting `os.environ['CUDA_VISIBLE_DEVICES'] = '-1'` and `device = torch.device('cpu')` does not work), and it then ran smoothly with a batch size as large as 64 and an output dimension as large as 64 (using 80GB of RAM). I would greatly appreciate any advice/suggestions!
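A minimal sketch of the CPU-only setup, for reference. Note that `CUDA_VISIBLE_DEVICES` only takes effect if it is set before `torch` (or a CUDA-enabled `faiss`) initializes CUDA, which may be why setting it naively inside the notebook did not work:

```python
# Minimal CPU-only sketch. CUDA_VISIBLE_DEVICES must be set before torch
# (or a CUDA-enabled faiss) is imported; once CUDA has been initialized,
# changing the variable has no effect.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide all GPUs from this process

import torch
import faiss  # with faiss-cpu installed, k-means runs entirely on the CPU

device = torch.device("cpu")
assert not torch.cuda.is_available()  # confirm no GPU is visible
```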
-
Batch size shouldn't affect k-means memory consumption; it is only used by the Tester class for computing the embeddings.

Does it use the GPU even though you've specified CPU?

So it takes 80GB of memory using 64-dim embeddings? How big is your dataset? Could you check how much memory it takes with 16-dim embeddings?
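A rough sketch of both suggestions, assuming a pytorch-metric-learning version where `GlobalEmbeddingSpaceTester` accepts `batch_size` and `data_device` in its constructor; the memory printout can be run right before and after the validation step to see whether the GPU is actually in use:

```python
# Rough sketch, assuming GlobalEmbeddingSpaceTester accepts batch_size and
# data_device (names taken from recent pytorch-metric-learning releases).
import torch
from pytorch_metric_learning import testers

tester = testers.GlobalEmbeddingSpaceTester(
    batch_size=32,                    # only affects embedding computation
    dataloader_num_workers=2,
    data_device=torch.device("cpu"),  # keep the tester off the GPU
)

# Check whether the GPU is actually in use around validation time:
if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```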