Out of memory error when running BERT for PyTorch #118
Comments
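Hello. First, please make sure the GPU environment is Tesla V100-SXM2-16GB x 8. Second, one possible cause is that the Docker container was started without enough memory allocated, for example: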
Thanks. Other training parameters:
Error message:
Hardware information:
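Another possible cause in this situation is a wrong path. When a path is wrong, the code in the NVIDIA repository reports the failure as a GPU out-of-memory (OOM) error. Please check that every path you supplied exists and that the dataset path is valid.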
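The path check suggested above can be done before launching training. The following is only a minimal sketch using the Python standard library: the helper name `find_missing_paths` and the dataset path are hypothetical placeholders, not part of the NVIDIA scripts, and should be replaced with the paths you actually pass to the training script.

```python
import os


def find_missing_paths(named_paths):
    """Return the subset of {label: path} whose path does not exist on disk."""
    return {label: path for label, path in named_paths.items()
            if not os.path.exists(path)}


if __name__ == "__main__":
    # Hypothetical placeholders: substitute the paths used in your own run.
    configured = {
        "dataset directory": "/path/to/dataset",
        "BERT config": "/workspace/examples/bert_config.json",
    }
    missing = find_missing_paths(configured)
    if missing:
        for label, path in missing.items():
            print(f"Missing {label}: {path}")
    else:
        print("All configured paths exist.")
```

Run it inside the container, because paths that exist on the host are only visible there if they have been mounted.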
@nlqq Thanks.
You also need to modify the /workspace/examples/bert_config.json file in the container as follows:
The parameters above were changed so that BERT can run on a single machine.
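When I run BERT with the PyTorch Docker image provided by NVIDIA at fp32 precision, a batch size of 32 or larger fails with an out of memory error. The parameters and hardware configuration are the same as in https://github.com/Oneflow-Inc/DLPerf/tree/master/NVIDIADeepLearningExamples/PyTorch/BERT . What is causing this?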