If you have trouble in training in DDP, here is THE solution #91
wanyunfeiAlex
May 30, 2023
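Hello, I ran into problems when using train_ddp.py for multi-GPU training. Is there a good way to solve this?
The author's version does not run; you need to rewrite it following my steps above.
OK, thanks, I will give it a try. I would also like to ask about my custom multi-class detection training: it never converges and the results are poor. Do you have any suggestions?
Lower the lr / freeze part of the training parameters at first / check whether the model can overfit a small batch of data first. Work through these with your own network in mind.
OK, I will try. I am just starting to learn this. Thank you.
Good luck!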
Note that I have successfully trained on multiple GPUs (8× A100). I would like to upload the code, but my lab's regulations do not allow it, so I will describe the steps you need to take instead.
Here are the instructions:
I rebuilt the DDP procedure following this repository:
https://github.com/rentainhe/pytorch-distributed-training/tree/master
That repository is written in Chinese, so feel free to use any other DDP reference that is accessible to you.
As for the dataloader, model construction, and optimizer, you can just copy them from train_ddp.py.
In class ModelWithLoss, the "from_logits" parameter of the class "TverskyLoss" needs to be set to "True", or the loss values will come out wrong.
Still in class ModelWithLoss, the "mode" parameter of both "TverskyLoss" and "FocalLossSeg" needs to be set to "self.model.seg_mode", or the training process will collapse.
During the training iterations, there is a loss check to make sure the loss is not nan/inf. That is reasonable on a single card, but it causes a hang in multi-card training. For example, card A detects a nan and skips the iteration, while card B sees no nan/inf and proceeds into the backward pass, where it waits for a response from card A that will never come. The training process then hangs forever. To fix this, you have to call torch.distributed.all_gather: once a nan/inf is detected on any of your cards, you have to skip the step on all of your cards.
PS: I suspect that quite a lot of valid data may be skipped under this fix, so I would train for more epochs to compensate for this drawback.
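The cross-card nan/inf check described above could be sketched roughly as follows. This is not the code from train_ddp.py or from my own training script; the helper name should_skip_step is hypothetical, and it assumes the loss tensor lives on the device your backend expects (a GPU tensor for NCCL). Each card builds a one-element flag, the flags are exchanged with torch.distributed.all_gather, and every card reaches the same skip decision.

```python
import torch
import torch.distributed as dist


def should_skip_step(loss: torch.Tensor) -> bool:
    """Hypothetical helper: True on every rank if the loss is nan/inf on any rank."""
    # Flag is 1.0 when this rank's loss contains a nan/inf, otherwise 0.0.
    bad = (~torch.isfinite(loss.detach())).any().to(torch.float32).reshape(1)

    # No process group (plain single-card run): decide locally.
    if not (dist.is_available() and dist.is_initialized()):
        return bool(bad.item() > 0)

    # Exchange one flag per rank; afterwards every rank holds the same list.
    flags = [torch.zeros_like(bad) for _ in range(dist.get_world_size())]
    dist.all_gather(flags, bad)

    # Skip everywhere if at least one rank reported a bad loss.
    return bool(torch.stack(flags).sum().item() > 0)
```

In the training loop, this would replace the per-card check: call should_skip_step(loss) before loss.backward(), and if it returns True, zero the gradients and continue to the next batch on all cards. Because all cards call all_gather on every iteration, none of them is left waiting in the backward pass.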