Grad Norm Differences Across Nodes #2240
Comments
I think this may just be a logging issue, actually. To get the correct grad_norm we need to call [...]. I think right now we're just reporting the (unreduced) total norm from rank 0. If I call [...]
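For context, here is a minimal sketch of computing a grad norm that is reduced across all ranks rather than reported from a single rank's local shards. It assumes `torch.distributed` is already initialized and that gradients are plain sharded tensors; the helper `global_grad_norm` is illustrative only and is not the actual change that landed in #2248 (DTensor gradients under FSDP2 would need DTensor-aware handling, e.g. materializing the reduced result with `.full_tensor()`).

```python
import torch
import torch.distributed as dist


def global_grad_norm(parameters, norm_type: float = 2.0) -> torch.Tensor:
    """Grad norm reduced over all ranks, not just the local shards."""
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    # Sum |g|^p over the shards this rank owns.
    local_pow_sum = torch.stack(
        [torch.linalg.vector_norm(g, norm_type) ** norm_type for g in grads]
    ).sum()
    if dist.is_available() and dist.is_initialized():
        # Combine the partial sums from every rank before taking the root,
        # so every rank logs the same (global) norm.
        dist.all_reduce(local_pow_sum, op=dist.ReduceOp.SUM)
    return local_pow_sum ** (1.0 / norm_type)
```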
Thanks @EugenHotaj for the issue and the update. I agree that [...]
@ebsmothers sorry, just saw your comment / pull request. Thanks for the quick fix!
Closing now that #2248 has landed.
Continuing the discussion from #2172 (thanks @mirceamironenco, @ebsmothers for the fix!).
We have runs on the exact same dataset / hparams, except that we change the number of nodes from 8 -> 2 -> 1. We noticed that when we reduce the number of nodes, the gradient norm goes up:
Here is an 8 node run: [grad norm plot]
Here is a 2 node run: [grad norm plot]
Here is a 1 node run: [grad norm plot]
We can see that the grad norm at initialization differs by ~4x between the 8 node and 1 node runs. With the fix in #2172, I would expect the grad norms to be similar regardless of the world size. The only difference between the runs is the global batch size (64 on 1 node, 512 on 8 nodes), but I would not expect this to cause such a big difference (a rough back-of-envelope is sketched below).
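For a rough sense of scale, under the strong, illustrative assumption that per-sample gradients at initialization are i.i.d. with mean $\mu$ and total (trace) variance $\sigma^2$, the averaged gradient for a global batch of size $B$ satisfies

$$
g_B = \frac{1}{B}\sum_{i=1}^{B} g_i,
\qquad
\mathbb{E}\,\lVert g_B \rVert^2 = \lVert \mu \rVert^2 + \frac{\sigma^2}{B}.
$$

If the noise term dominates at init, $\lVert g_B \rVert$ shrinks roughly like $1/\sqrt{B}$, so going from $B = 512$ to $B = 64$ alone could account for up to a $\sqrt{8} \approx 2.8\times$ gap; a gap much larger than that would point back at how the gradients are reduced or scaled.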
Is it possible there are still some issues in how we compute / scale the gradients?