
Out of memory error when running BERT for PyTorch #118

Open
adloph1234 opened this issue Jan 29, 2021 · 6 comments

@adloph1234

adloph1234 commented Jan 29, 2021

When running BERT with the PyTorch docker image provided by NVIDIA, at fp32 precision with batch size 32 or above it reports out of memory. The parameters and hardware configuration are the same as https://github.com/Oneflow-Inc/DLPerf/tree/master/NVIDIADeepLearningExamples/PyTorch/BERT — what could be the cause?

@Flowingsun007
Contributor

Hello. First, please make sure the GPU environment is: GPU: Tesla V100-SXM2-16GB x 8. Another possible cause is that the docker container was started without a large enough shared-memory size, e.g. missing:
--shm-size=16g
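
For reference, a minimal sketch of starting the container with a larger shared-memory segment (the image tag and host paths below are placeholders, not the exact command used in this setup):

# hypothetical launch; adjust the image tag and mounted paths to your environment
docker run --gpus all -it --rm \
    --shm-size=16g \
    -v /path/to/bert/data:/workspace/examples/bert/data \
    nvcr.io/nvidia/pytorch:20.03-py3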

@adloph1234
Author

Thanks.
With df I can see that the container's shm size is 16g (pasted as text since the screenshot cannot be uploaded):
tmpfs 131862444 0 131862444 0% /sys/fs/cgroup
shm 16777216 0 16777216 0% /dev/shm
/dev/mapper/node105--vg-root 1920488384 1688205160 134705036 93% /etc/hosts
tmpfs 131862444 12 131862432 1% /proc/driver/nvidia

Other training parameters:

  • python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes 1 --node_rank=0 --master_addr=127.0.0.1 --master_port=29500 /workspace/examples/bert/run_pretraining.py --input_dir=/workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ --output_dir=/workspace/examples/bert/results/checkpoints --config_file=/workspace/examples/bert/bert_config.json --bert_model=bert-base-uncased --train_batch_size=48 --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=120 --warmup_proportion=1 --num_steps_per_checkpoint=1000 --learning_rate=6e-3 --seed=42 --do_train --json-summary /workspace/examples/bert/dllogger.json
    device: cuda:0 n_gpu: 1, distributed training: True, 16-bits training: False
    DLL 2021-01-29 03:51:02.285200 - PARAMETER Config : ["Namespace(allreduce_post_accumulation=False, allreduce_post_accumulation_fp16=False, bert_model='bert-base-uncased', checkpoint_activations=False, config_file='/workspace/examples/bert/bert_config.json', disable_progress_bar=False, do_train=True, fp16=False, gradient_accumulation_steps=1, init_checkpoint=None, input_dir='/workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/', json_summary='/workspace/examples/bert/dllogger.json', learning_rate=0.006, local_rank=0, log_freq=1.0, loss_scale=0.0, max_predictions_per_seq=20, max_seq_length=128, max_steps=120.0, n_gpu=1, num_steps_per_checkpoint=1000, num_train_epochs=3.0, output_dir='/workspace/examples/bert/results/checkpoints', phase1_end_step=7038, phase2=False, resume_from_checkpoint=False, resume_step=-1, seed=42, skip_checkpoint=False, train_batch_size=48, use_env=False, warmup_proportion=1.0)"]

Error message:
Iteration: 0%| | 0/12776 [00:00<?, ?it/s]Traceback (most recent call last):
File "/workspace/examples/bert/run_pretraining.py", line 654, in
args, final_loss, train_time_raw, global_step = main()
File "/workspace/examples/bert/run_pretraining.py", line 571, in main
prediction_scores, seq_relationship_score = model(input_ids=input_ids, token_type_ids=segment_ids, attention_mask=input_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/apex/parallel/distributed.py", line 560, in forward
result = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 889, in forward
encoded_layers, pooled_output = self.bert(input_ids, token_type_ids, attention_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 824, in forward
encoded_layers = self.encoder(embedding_output, extended_attention_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 508, in forward
hidden_states = layer_module(hidden_states, attention_mask)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 470, in forward
intermediate_output = self.intermediate(attention_output)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 443, in forward
hidden_states = self.dense_act(hidden_states)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in call
result = self.forward(*input, **kwargs)
File "/workspace/examples/bert/modeling.py", line 174, in forward
return self.biased_act_fn(self.bias, F.linear(input, self.weight, None))
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 15.78 GiB total capacity; 14.78 GiB already allocated; 9.44 MiB free; 14.83 GiB reserved in total by PyTorch)

@adloph1234
Author

Hardware info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:2D:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:32:00.0 Off | 0 |
| N/A 42C P0 58W / 300W | 4076MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:5B:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:5F:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:B5:00.0 Off | 0 |
| N/A 37C P0 42W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:BE:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:DF:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 12MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:E7:00.0 Off | 0 |
| N/A 38C P0 56W / 300W | 4240MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|

@nlqq
Contributor

nlqq commented Jan 29, 2021

In cases like this, another possibility is a wrong path: the code in the NVIDIA repository reports a path error as GPU OOM. Please check that every path you specified exists and that the dataset path is valid.
Answers to some common questions can be found in this article: https://zhuanlan.zhihu.com/p/276154597
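
As a quick sanity check, the paths from the launch command above can be verified inside the container before training, e.g.:

# confirm the dataset directory and config file used in the command above exist
ls /workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ | head
ls -l /workspace/examples/bert/bert_config.json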

@adloph1234
Author

@nlqq Thanks.
With batch size = 16 it runs fine, which rules out a path error. I looked at NVIDIA's own test results for the PyTorch docker image, https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-on-multiple-nvidia-dgx-1-with-16g, and the batch size in NVIDIA's report is also 16. So I'd like to ask: did this benchmark use any special settings?

@nlqq
Contributor

nlqq commented Feb 24, 2021

You also need to modify the /workspace/examples/bert_config.json file inside the container as follows:

{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

Some of the parameters were modified as above in order to be able to run BERT on a single machine.
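
As a side note, the parameter dump above shows that run_pretraining.py also exposes fp16 and gradient_accumulation_steps options, which can lower peak memory at a given batch size. A sketch of such a variant of the earlier launch command (not the exact benchmark configuration; the flag values here are illustrative):

# illustrative variant: mixed precision plus gradient accumulation to reduce peak memory
python3 -m torch.distributed.launch --nproc_per_node=1 \
    /workspace/examples/bert/run_pretraining.py \
    --input_dir=/workspace/examples/bert/data/hdf5_lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5/wikicorpus_en/ \
    --output_dir=/workspace/examples/bert/results/checkpoints \
    --config_file=/workspace/examples/bert/bert_config.json \
    --bert_model=bert-base-uncased \
    --train_batch_size=48 --gradient_accumulation_steps=2 --fp16 \
    --max_seq_length=128 --max_predictions_per_seq=20 --max_steps=120 \
    --warmup_proportion=1 --learning_rate=6e-3 --seed=42 --do_train \
    --json-summary /workspace/examples/bert/dllogger.json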
