Increasingly occupied GPU memory #108
I guess it may be related to the RNN compact weights warning, but I don't know how to fix it.
Hi, have you found a fix for this? I am having a similar issue. On PyTorch 0.4.1 it worked: the compact weight warning was displayed only once at the beginning, and training continued normally until the end. Thanks a lot.
Also having this issue. I don't think it's related to the flatten_parameters() warnings. It seems to be correlated with the optimizer: specifically, the memory usage only starts to increase after the switch to ASGD.
@rewicks good call, the memory usage increases only with the ASGD optimizer. I think I have found the cause, but I am not sure how to solve it. I printed the tensors living in GPU memory using the memory-profiling code mentioned at https://discuss.pytorch.org/t/how-to-debug-causes-of-gpu-memory-leaks/6741/3 (a sketch of that approach appears after this comment), and used the PyCharm debugger to inspect the variables during training. The ASGD
As the epochs go on and on,
Should we change the
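For reference, this is roughly the profiling loop described in that PyTorch forum thread: it walks the objects tracked by Python's garbage collector and prints every CUDA tensor that is still reachable. The helper name dump_cuda_tensors is ours, not part of the repository.

```python
import gc
import torch

def dump_cuda_tensors():
    """Print every CUDA tensor the garbage collector can still reach."""
    count = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                count += 1
                print(type(obj).__name__, tuple(obj.size()))
        except Exception:
            # some tracked objects raise on attribute access; skip them
            pass
    print('CUDA tensors alive:', count)
```

Calling this once per epoch makes it easy to see which tensors, and how many of them, accumulate over time.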
I have found a solution. If it works for others as well, this issue can be closed. I modified the ASGD optimizer following @mourga's port of AWD-LSTM for PyTorch 1.2.0, from: https://github.com/mourga/awd-lstm-lm In particular, in
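Since the comment above is cut off, here is a small, hypothetical check (the helper report_asgd_state is not part of either repository) for verifying whether the leak is gone: torch.optim.ASGD keeps an averaged copy of every parameter under the 'ax' key in optimizer.state, so the count and total size of those tensors should stay constant across epochs once a fix is in place.

```python
import torch

def report_asgd_state(optimizer):
    """Report how many 'ax' tensors ASGD holds and how much memory they occupy."""
    n_tensors, n_bytes = 0, 0
    for param, state in optimizer.state.items():
        ax = state.get('ax')
        if torch.is_tensor(ax):
            n_tensors += 1
            n_bytes += ax.numel() * ax.element_size()
    print(f"'ax' tensors in optimizer.state: {n_tensors}, ~{n_bytes / 1024 ** 2:.1f} MB")
```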
Hi @AndreaLK3, it works for me as well. However, I don't reach the perplexities that the instructions report.
I only get perplexities of 64.74/62.23 (validation/testing) with the same command.
Hi, I ran into a problem while running your code on GPU. During training, the program unexpectedly consumes more and more GPU memory, e.g. 2000 MB -> 3000 MB -> ... until it finally runs out of memory. I use Python 3.6, PyTorch 0.4, and a GPU with 12 GB of memory.
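As a first sanity check (this helper is not part of the repository), logging the CUDA allocator's counters once per epoch shows whether usage really grows monotonically as described:

```python
import torch

def log_gpu_memory(epoch):
    """Print allocated and cached GPU memory in MB for the current device."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024 ** 2
        # memory_cached() was renamed to memory_reserved() in later PyTorch releases
        cached = torch.cuda.memory_cached() / 1024 ** 2
        print(f'epoch {epoch}: allocated {allocated:.0f} MB, cached {cached:.0f} MB')
```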