
[BUG] PT parallel training neighbor stat OOM #4594

Open · njzjz opened this issue Feb 10, 2025 · 10 comments

njzjz (Member) commented Feb 10, 2025

Bug summary

Parallel training with the PyTorch backend throws a CUDA out-of-memory error during the neighbor statistics step.

DeePMD-kit Version

v3.0.1

Backend and its version

PyTorch v2.4.1.post302

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

py child error file (/tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json)
Traceback (most recent call last):
  File "/root/deepmd-kit/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
dp FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-10_17:56:48
  host      : bohrium-156-1256408
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1349)
  error_file: /tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 527, in main
      train(
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 317, in train
      config["model"], min_nbor_dist = BaseModel.update_sel(
                                       ^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/model/base_model.py", line 192, in update_sel
      return cls.update_sel(train_data, type_map, local_jdata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/dp_model.py", line 45, in update_sel
      local_jdata_cpy["descriptor"], min_nbor_dist = BaseDescriptor.update_sel(
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/descriptor/make_base_descriptor.py", line 238, in update_sel
      return cls.update_sel(train_data, type_map, local_jdata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa1.py", line 739, in update_sel
      min_nbor_dist, sel = UpdateSel().update_one_sel(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 33, in update_one_sel
      min_nbor_dist, tmp_sel = self.get_nbor_stat(
                               ^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 122, in get_nbor_stat
      min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/neighbor_stat.py", line 66, in get_stat
      for mn, dt, jj in self.iterator(data):
                        ^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 159, in iterator
      minrr2, max_nnei = self.auto_batch_size.execute_all(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 197, in execute_all
      n_batch, result = self.execute(execute_with_batch_size, index, natoms)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 111, in execute
      raise e
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 108, in execute
      n_batch, result = callable(max(batch_nframes, 1), start_index)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 174, in execute_with_batch_size
      return (end_index - start_index), callable(
                                        ^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 186, in _execute
      minrr2, max_nnei = self.op(
                         ^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      return forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: CUDA error: out of memory
  CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  For debugging consider passing CUDA_LAUNCH_BLOCKING=1
  Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Steps to Reproduce

cd examples/water/se_atten
torchrun --nproc_per_node=4 --no-python dp --pt train input.json

Further Information, Files, and Links

No response

njzjz added the bug label Feb 10, 2025

njzjz (Member, Author) commented Feb 10, 2025

It looks like all ranks run on the first GPU.
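One way to confirm this, as a rough diagnostic (nothing deepmd-kit-specific; LOCAL_RANK is the environment variable set by torchrun):

import os

import torch

# If every worker prints cuda:0 while LOCAL_RANK differs, all ranks are
# allocating on the first GPU; torch.cuda.current_device() stays at 0 unless
# torch.cuda.set_device() is called.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if torch.cuda.is_available():
    print(f"LOCAL_RANK={local_rank} -> cuda:{torch.cuda.current_device()}")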

caic99 (Member) commented Feb 11, 2025

Hi @njzjz,
Which GPU are you using? The neighbor statistics step should run entirely on the CPU. I'll try to reproduce this error.

njzjz (Member, Author) commented Feb 11, 2025

> Which GPU are you using?

V100.

Indeed, computing only on rank 0 is enough. Other ranks can obtain the results from rank 0.
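For example, along these lines with torch.distributed (a minimal sketch, assuming the process group is already initialized; compute_neighbor_stat is a placeholder name, not the real deepmd-kit call):

import torch.distributed as dist

# Rank 0 computes the statistics; the other ranks pass a dummy entry that
# broadcast_object_list fills in place.
if dist.get_rank() == 0:
    obj = [compute_neighbor_stat()]  # -> (min_nbor_dist, max_nbor_size)
else:
    obj = [None]
dist.broadcast_object_list(obj, src=0)
min_nbor_dist, max_nbor_size = obj[0]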

njzjz closed this as completed Feb 11, 2025
njzjz reopened this Feb 11, 2025
caic99 (Member) commented Feb 11, 2025

Hi @njzjz,

I'm not able to reproduce the error. The memory seems to be distributed evenly across the GPUs.
Below is a snapshot of GPU memory when the neighbor statistics step finishes.


Tue Feb 11 06:07:08 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   33C    P0              71W / 400W |   8993MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          On  | 00000000:13:00.0 Off |                    0 |
| N/A   30C    P0              71W / 400W |   7741MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM4-80GB          On  | 00000000:29:00.0 Off |                    0 |
| N/A   31C    P0              70W / 400W |   7741MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM4-80GB          On  | 00000000:2D:00.0 Off |                    0 |
| N/A   31C    P0              71W / 400W |   7741MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

njzjz (Member, Author) commented Feb 11, 2025

You may use 16 GB V100 cards to trigger the error.

caic99 (Member) commented Feb 11, 2025

@njzjz, would you attach a snapshot of nvidia-smi when the OOM happens? I was not able to reproduce the uneven memory usage.

njzjz (Member, Author) commented Feb 11, 2025

Tue Feb 11 14:28:35 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0    68W / 300W |  15245MiB / 16384MiB |     44%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:0A.0 Off |                    0 |
| N/A   35C    P0    61W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:0B.0 Off |                    0 |
| N/A   35C    P0    60W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   34C    P0    57W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:00:0D.0 Off |                    0 |
| N/A   34C    P0    59W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:00:0E.0 Off |                    0 |
| N/A   34C    P0    56W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:00:0F.0 Off |                    0 |
| N/A   34C    P0    55W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:00:10.0 Off |                    0 |
| N/A   35C    P0    60W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

caic99 (Member) commented Feb 11, 2025

@njzjz 🤔 I ran into this problem a while ago. I recall it being a bug related to DDP? Would you try calling torch.cuda.set_device explicitly?
rwth-i6/returnn#1469

Edit: the problem I encountered behaves like this one.
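For reference, the usual DDP pattern is roughly this (a generic sketch, not deepmd-kit code):

import os

import torch
import torch.distributed as dist

# Pin each worker to its own GPU before anything touches CUDA, so that
# context and workspace memory do not all land on GPU 0.
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
device = torch.device(f"cuda:{local_rank}")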

njzjz (Member, Author) commented Feb 11, 2025

Again, I don't see the need to calculate it on every rank.

This issue also reveals another problem: the automatic batch-size module does not catch this error message, which I have never seen before:

if isinstance(e, RuntimeError) and (
    "CUDA out of memory." in e.args[0]
    or "CUDA driver error: out of memory" in e.args[0]
    or "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR" in e.args[0]
):
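A broadened check might look like this (just a sketch, not a tested patch; the extra string is the one from the traceback above):

OOM_MESSAGES = (
    "CUDA out of memory.",
    "CUDA driver error: out of memory",
    "CUDA error: out of memory",  # the message seen in this traceback
    "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR",
)


def is_oom_error(e: Exception) -> bool:
    # Treat any known OOM message as recoverable by the automatic
    # batch-size search.
    return isinstance(e, RuntimeError) and any(m in e.args[0] for m in OOM_MESSAGES)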

caic99 (Member) commented Feb 11, 2025

> Again, I don't see the need to calculate it on every rank.

Yes, you are right.

For large datasets (e.g. DPA pretraining), this step takes quite a long time (a few hours). In practice, we run the neighbor statistics in a single CPU process and then run training on the GPUs with the saved results.

A possible solution would be to calculate the neighbor statistics only on rank 0 and throw a warning if there are other ranks?
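Roughly along these lines (a sketch only; compute_neighbor_stat stands in for the real call):

import logging

import torch.distributed as dist

log = logging.getLogger(__name__)


def neighbor_stat_rank0_only(compute_neighbor_stat):
    """Compute neighbor statistics on rank 0 only and broadcast the result."""
    if not dist.is_initialized():
        return compute_neighbor_stat()
    if dist.get_rank() == 0:
        result = [compute_neighbor_stat()]
    else:
        log.warning(
            "Neighbor statistics are computed on rank 0 only; "
            "rank %d waits for the broadcast result.",
            dist.get_rank(),
        )
        result = [None]
    dist.broadcast_object_list(result, src=0)
    return result[0]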
