
Out of memory error when running BERT for PyTorch #118

Open
adloph1234 opened this issue Jan 29, 2021 · 6 comments

@adloph1234

adloph1234 commented Jan 29, 2021

When running BERT with the PyTorch docker image provided by NVIDIA, at fp32 precision with batch size 32 or above it reports out of memory. The parameters and hardware configuration are the same as https://github.com/Oneflow-Inc/DLPerf/tree/master/NVIDIADeepLearningExamples/PyTorch/BERT — what could be the cause?

@Flowingsun007
Contributor

Hello. First, please make sure the GPU environment is: GPU: Tesla V100-SXM2-16GB x 8. Another possible cause is that the docker container was started without a large enough shared-memory size, e.g. missing:
--shm-size=16g
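
For reference, a minimal sketch of starting the container with a larger shared-memory segment (the image tag and host paths below are placeholders, not the exact command used in this setup):

# hypothetical launch; adjust the image tag and mounted paths to your environment
docker run --gpus all -it --rm \
    --shm-size=16g \
    -v /path/to/bert/data:/workspace/examples/bert/data \
    nvcr.io/nvidia/pytorch:20.03-py3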

@adloph1234
Author

Thanks.
With df I can see that the container's shm size is 16g (pasted as text since the screenshot cannot be uploaded):
tmpfs 131862444 0 131862444 0% /sys/fs/cgroup
shm 16777216 0 16777216 0% /dev/shm
/dev/mapper/node105--vg-root 1920488384 1688205160 134705036 93% /etc/hosts
tmpfs 131862444 12 131862432 1% /proc/driver/nvidia

Other training parameters:

  • python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes 1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 /workspace/examples/bert/run_pretraining.py --input_dir=/workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ --output_dir=/workspace/examples/bert/results/checkpoints --config_file=/workspace/examples/bert/bert_config.json --bert_model=bert-base-uncased --train_batch_size=48 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=120 --warmup_proportion=1 --num_steps_per_checkpoint=1000 --learning_rate=6e-3 --seed=42 --do_train --json-summary /workspace/examples/bert/dllogger.json
    device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: False
    DLL 2021-01-29 03:51:02.285200 - PARAMETER Config : ["Namespace(allreduce_post_accumulation=False, allreduce_post_accumulation_fp16=False, bert_model='bert-base-uncased', checkpoint_activations=False, config_file='/workspace/examples/bert/bert_config.json', disable_progress_bar=False, do_train=True, fp16=False, gradient_accumulation_steps=1, init_checkpoint=None, input_dir='/workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/', json_summary='/workspace/examples/bert/dllogger.json', learning_rate=0.006, local_rank=0, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=20, max_seq_length=128, max_steps=120.0, n_gpu=1, num_steps_per_checkpoint=1000, num_train_epochs=3.0, output_dir='/workspace/examples/bert/results/checkpoints', phase1_end_step=7038, phase2=False, resume_from_checkpoint=False, resume_step=-1, seed=42, skip_checkpoint=False, train_batch_size=48, use_env=False, warmup_proportion=1.0)"]

Error message:
Iteration: 0%| | 0/12776 [00:00<?, ?it/s]Traceback (most recent call last):
File "/workspace/examples/bert/run_pretraining.py", line 654, in
args, final_loss, train_time_raw, global_step = main()
File "/workspace/examples/bert/run_pretraining.py", line 571, in main
prediction_scores, seq_relationship_score = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 889, in forward
encoded_layers, pooled_output = self.bert(input_ids, token_type_ids, attention_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 824, in forward
encoded_layers = self.encoder(embedding_output, extended_attention_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 508, in forward
hidden_states = layer_module(hidden_states, attention_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 470, in forward
intermediate_output = self.intermediate(attention_output)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 443, in forward
hidden_states = self.dense_act(hidden_states)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 174, in forward
return self.biased_act_fn(self.bias, F.linear(input, self.weight, None))
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.78 GiB total capacity; 14.78 GiB already allocated; 9.44 MiB free; 14.83 GiB reserved in total by PyTorch)

@adloph1234
Author

Hardware info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:2D:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:32:00.0 Off | 0 |
| N/A 42C P0 58W / 300W | 4076MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:5B:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:5F:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:B5:00.0 Off | 0 |
| N/A 37C P0 42W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:BE:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:DF:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:E7:00.0 Off | 0 |
| N/A 38C P0 56W / 300W | 4240MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|

@nlqq
Contributor

nlqq commented Jan 29, 2021

In cases like this, another possibility is a wrong path: the code in the NVIDIA repository reports a path error as GPU OOM. Please check that every path you specified exists and that the dataset path is valid.
Answers to some common questions can be found in this article: https://zhuanlan.zhihu.com/p/276154597
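
As a quick sanity check, the paths from the launch command above can be verified inside the container before training, e.g.:

# confirm the dataset directory and config file used in the command above exist
ls /workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ | head
ls -l /workspace/examples/bert/bert_config.json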

@adloph1234
Author

@nlqq Thanks.
With batch size = 16 it runs fine, which rules out a path error. I looked at NVIDIA's own test results for the PyTorch docker image, https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-on-multiple-nvidia-dgx-1-with-16g, and the batch size in NVIDIA's report is also 16. So I'd like to ask: did this benchmark use any special settings?

@nlqq
Contributor

nlqq commented Feb 24, 2021

You also need to modify the /workspace/examples/bert_config.json file inside the container as follows:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Some of the parameters were modified as above in order to be able to run BERT on a single machine.
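
As a side note, the parameter dump above shows that run_pretraining.py also exposes fp16 and gradient_accumulation_steps options, which can lower peak memory at a given batch size. A sketch of such a variant of the earlier launch command (not the exact benchmark configuration; the flag values here are illustrative):

# illustrative variant: mixed precision plus gradient accumulation to reduce peak memory
python3 -m torch.distributed.launch --nproc_per_node=1 \
    /workspace/examples/bert/run_pretraining.py \
    --input_dir=/workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ \
    --output_dir=/workspace/examples/bert/results/checkpoints \
    --config_file=/workspace/examples/bert/bert_config.json \
    --bert_model=bert-base-uncased \
    --train_batch_size=48 --gradient_accumulation_steps=2 --fp16 \
    --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=120 \
    --warmup_proportion=1 --learning_rate=6e-3 --seed=42 --do_train \
    --json-summary /workspace/examples/bert/dllogger.json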
