
Running the BERT model on a server with three GPUs keeps crashing suddenly #13

Open
yzc1103 opened this issue Oct 21, 2019 · 5 comments

Comments

@yzc1103

yzc1103 commented Oct 21, 2019

I suspect the batch_size may be too large, but reducing it doesn't help. Could it be load imbalance across the GPUs? Are there any suggestions or solutions?

@yzc1103
Author

yzc1103 commented Oct 21, 2019

@yilifzf

@yilifzf
Owner

yilifzf commented Oct 21, 2019

How much GPU memory do you have? If the memory is overloaded, you can reduce max_seq_length and train_batch_size. In my experience, a single 8 GB GPU can handle max_seq_length=128 with batch_size=24, so three GPUs should be more than enough. I'd suggest watching the GPU status with nvidia-smi and adjusting accordingly.
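A sketch of that workflow as shell commands. The script name and flag names here assume the standard Google BERT `run_classifier.py` fine-tuning script; this repo's actual entry point and flags may differ.

```shell
# Assumed entry point and flags (standard BERT fine-tuning script);
# adjust to this repo's actual training script.
python run_classifier.py \
  --max_seq_length=128 \
  --train_batch_size=24

# In a second terminal, refresh GPU memory/utilization once per second
# to see whether memory is overloaded or one device is saturated:
watch -n 1 nvidia-smi
```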

@yzc1103
Author

yzc1103 commented Oct 21, 2019

I tried reducing max_seq_length to 32 and 64, and train_batch_size to 24, 16, and 8, but it didn't help.
GPU memory is 12196 MiB. While running, memory usage doesn't look high, but suddenly one GPU's utilization spikes to 99%. Should I pin the job to a single GPU?

@yzc1103
Author

yzc1103 commented Oct 21, 2019

I pinned the job to a single GPU, and it looks like it is running successfully now. Thank you!
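The usual way to pin a process to one GPU, as described above, is to set `CUDA_VISIBLE_DEVICES` before the framework initializes CUDA. A minimal sketch, where the device index `0` is an example choice (pick any one GPU listed by nvidia-smi):

```python
import os

# Restrict this process to a single GPU. This must be set *before*
# TensorFlow/PyTorch first touches CUDA, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Any framework imported after this point sees only the chosen device
# (exposed to it as device 0).
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Equivalently, it can be set on the command line: `CUDA_VISIBLE_DEVICES=0 python train.py` (script name here is a placeholder).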

@littlefenliu

How long did the run above take?
