[BUG] PT parallel training neighbor stat OOM #4594
Comments
It looks like all ranks run on the first GPU.
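For reference, when every DDP rank allocates on GPU 0, the usual cause is that a process never selects its local device before creating CUDA tensors. A minimal sketch of the standard torchrun/DDP setup that pins each rank to its own GPU (illustrative only, not DeePMD-kit's actual code):

```python
# Sketch only: pin each DDP rank to its own GPU so that tensors created with
# device="cuda" do not all land on GPU 0. Not DeePMD-kit's actual code.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)           # make the bare "cuda" device mean this rank's GPU
dist.init_process_group(backend="nccl")

x = torch.zeros(1, device="cuda")           # allocated on GPU local_rank, not GPU 0
```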
Hi @njzjz ,
V100. Indeed, computing only on rank 0 is enough. Other ranks can obtain the results from rank 0.
Hi @njzjz , I'm not able to reproduce the error. The memory seems to be distributed evenly across GPUs.
You may use 16 GB V100 cards to trigger the error.
@njzjz would you attach a snapshot of
@njzjz 🤔 I met this problem long ago; I remember it being a bug with DDP? Would you try
Edit: the problem I encountered behaves like this
Again, I don't see the need to calculate it on every rank. This issue also has another problem: the automatic batch size module fails to catch this error message, which I have never seen before (deepmd-kit/deepmd/pt/utils/auto_batch_size.py, lines 52 to 56 at e5eac4a).
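For context, automatic batch size handlers typically decide whether to shrink the batch by pattern-matching the text of the raised exception, so an OOM message with unfamiliar wording escapes the handler instead of triggering a retry. A hypothetical sketch of that pattern (function names are illustrative; this is not the actual auto_batch_size.py code):

```python
# Hypothetical sketch of OOM-message matching in an automatic batch size loop.
# This is NOT the actual deepmd/pt/utils/auto_batch_size.py implementation.
import torch


def is_gpu_oom(e: RuntimeError) -> bool:
    # Only recognized wordings count as OOM; any other message is re-raised
    # unchanged, so the batch size is never reduced for it.
    return isinstance(e, torch.cuda.OutOfMemoryError) or "CUDA out of memory" in str(e)


def run_with_auto_batch_size(fn, batch_size: int):
    while batch_size >= 1:
        try:
            return fn(batch_size)
        except RuntimeError as e:
            if not is_gpu_oom(e):
                raise  # unrecognized error message: propagate instead of shrinking
            batch_size //= 2  # halve the batch and retry on a recognized OOM
    raise RuntimeError("out of memory even with batch size 1")
```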
Yes, you are right. For large datasets (e.g. DPA pretraining), this step takes quite a long time (a few hours). In practice, we do the neighbor stat in one CPU process and then run training on GPUs with the saved info. A possible solution would be to calculate the neighbor stat only on rank 0 and throw a warning if there are other ranks?
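A minimal sketch of that rank-0-only approach, assuming a hypothetical `compute_neighbor_stat` callable rather than DeePMD-kit's actual API:

```python
# Sketch only: run the expensive neighbor stat on rank 0 and broadcast the
# result, warning on the other ranks. Helper names are hypothetical.
import logging

import torch.distributed as dist


def neighbor_stat_rank0_only(compute_neighbor_stat):
    if not (dist.is_available() and dist.is_initialized()):
        return compute_neighbor_stat()  # single-process case: just compute
    if dist.get_rank() == 0:
        result = compute_neighbor_stat()
    else:
        logging.warning(
            "Rank %d skips the neighbor stat and waits for rank 0.", dist.get_rank()
        )
        result = None
    obj = [result]
    dist.broadcast_object_list(obj, src=0)  # every rank receives rank 0's result
    return obj[0]
```

Broadcasting the result would keep the other GPUs idle during this step, which matches the observation above that computing on rank 0 alone is sufficient.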
Bug summary
Parallel training using the PyTorch backend throws an OOM error during the neighbor statistics step.
DeePMD-kit Version
v3.0.1
Backend and its version
PyTorch v2.4.1.post302
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Steps to Reproduce
cd examples/water/se_atten
torchrun --nproc_per_node=4 --no-python dp --pt train input.json
Further Information, Files, and Links
No response