
[BUG] PT parallel training neighbor stat OOM #4594

Open · njzjz opened this issue Feb 10, 2025 · 10 comments

njzjz (Member) commented Feb 10, 2025

Bug summary

Parallel training with the PyTorch backend throws a CUDA out-of-memory error during the neighbor statistics step.

DeePMD-kit Version

v3.0.1

Backend and its version

PyTorch v2.4.1.post302

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

py child error file (/tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json)
Traceback (most recent call last):
  File "/root/deepmd-kit/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
dp FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-10_17:56:48
  host      : bohrium-156-1256408
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1349)
  error_file: /tmp/torchelastic_mkfvemdy/none_9h49vpn4/attempt_0/4/error.json
  traceback : Traceback (most recent call last):
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 527, in main
      train(
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/entrypoints/main.py", line 317, in train
      config["model"], min_nbor_dist = BaseModel.update_sel(
                                       ^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/model/base_model.py", line 192, in update_sel
      return cls.update_sel(train_data, type_map, local_jdata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/dp_model.py", line 45, in update_sel
      local_jdata_cpy["descriptor"], min_nbor_dist = BaseDescriptor.update_sel(
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/dpmodel/descriptor/make_base_descriptor.py", line 238, in update_sel
      return cls.update_sel(train_data, type_map, local_jdata)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/descriptor/dpa1.py", line 739, in update_sel
      min_nbor_dist, sel = UpdateSel().update_one_sel(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 33, in update_one_sel
      min_nbor_dist, tmp_sel = self.get_nbor_stat(
                               ^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/update_sel.py", line 122, in get_nbor_stat
      min_nbor_dist, max_nbor_size = neistat.get_stat(train_data)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/neighbor_stat.py", line 66, in get_stat
      for mn, dt, jj in self.iterator(data):
                        ^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 159, in iterator
      minrr2, max_nnei = self.auto_batch_size.execute_all(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 197, in execute_all
      n_batch, result = self.execute(execute_with_batch_size, index, natoms)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 111, in execute
      raise e
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 108, in execute
      n_batch, result = callable(max(batch_nframes, 1), start_index)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/utils/batch_size.py", line 174, in execute_with_batch_size
      return (end_index - start_index), callable(
                                        ^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/utils/neighbor_stat.py", line 186, in _execute
      minrr2, max_nnei = self.op(
                         ^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/root/deepmd-kit/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
      return forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  RuntimeError: CUDA error: out of memory
  CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
  For debugging consider passing CUDA_LAUNCH_BLOCKING=1
  Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Steps to Reproduce

cd examples/water/se_atten
torchrun --nproc_per_node=4 --no-python dp --pt train input.json

Further Information, Files, and Links

No response

njzjz added the bug label Feb 10, 2025

njzjz (Member, Author) commented Feb 10, 2025

It looks like all ranks run on the first GPU.
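One way to confirm this, as a rough diagnostic (nothing deepmd-kit-specific; LOCAL_RANK is the environment variable set by torchrun):

import os

import torch

# If every worker prints cuda:0 while LOCAL_RANK differs, all ranks are
# allocating on the first GPU; torch.cuda.current_device() stays at 0 unless
# torch.cuda.set_device() is called.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if torch.cuda.is_available():
    print(f"LOCAL_RANK={local_rank} -> cuda:{torch.cuda.current_device()}")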

caic99 (Member) commented Feb 11, 2025

Hi @njzjz,
Which GPU are you using? The neighbor statistics step should run entirely on the CPU. I'll try to reproduce this error.

njzjz (Member, Author) commented Feb 11, 2025

> Which GPU are you using?

V100.

Indeed, computing only on rank 0 is enough. Other ranks can obtain the results from rank 0.
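For example, along these lines with torch.distributed (a minimal sketch, assuming the process group is already initialized; compute_neighbor_stat is a placeholder name, not the real deepmd-kit call):

import torch.distributed as dist

# Rank 0 computes the statistics; the other ranks pass a dummy entry that
# broadcast_object_list fills in place.
if dist.get_rank() == 0:
    obj = [compute_neighbor_stat()]  # -> (min_nbor_dist, max_nbor_size)
else:
    obj = [None]
dist.broadcast_object_list(obj, src=0)
min_nbor_dist, max_nbor_size = obj[0]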

njzjz closed this as completed Feb 11, 2025
njzjz reopened this Feb 11, 2025
caic99 (Member) commented Feb 11, 2025

Hi @njzjz,

I'm not able to reproduce the error. The memory seems to be distributed evenly across the GPUs.
Below is a snapshot of GPU memory when the neighbor statistics step finishes.


Tue Feb 11 06:07:08 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800-SXM4-80GB          On  | 00000000:0D:00.0 Off |                    0 |
| N/A   33C    P0              71W / 400W |   8993MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM4-80GB          On  | 00000000:13:00.0 Off |                    0 |
| N/A   30C    P0              71W / 400W |   7741MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A800-SXM4-80GB          On  | 00000000:29:00.0 Off |                    0 |
| N/A   31C    P0              70W / 400W |   7741MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A800-SXM4-80GB          On  | 00000000:2D:00.0 Off |                    0 |
| N/A   31C    P0              71W / 400W |   7741MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

njzjz (Member, Author) commented Feb 11, 2025

You may use 16 GB V100 cards to trigger the error.

caic99 (Member) commented Feb 11, 2025

@njzjz, would you attach a snapshot of nvidia-smi when the OOM happens? I was not able to reproduce the uneven memory usage.

njzjz (Member, Author) commented Feb 11, 2025

Tue Feb 11 14:28:35 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   36C    P0    68W / 300W |  15245MiB / 16384MiB |     44%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:0A.0 Off |                    0 |
| N/A   35C    P0    61W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:00:0B.0 Off |                    0 |
| N/A   35C    P0    60W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   34C    P0    57W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  Off  | 00000000:00:0D.0 Off |                    0 |
| N/A   34C    P0    59W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  Off  | 00000000:00:0E.0 Off |                    0 |
| N/A   34C    P0    56W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  Off  | 00000000:00:0F.0 Off |                    0 |
| N/A   34C    P0    55W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  Off  | 00000000:00:10.0 Off |                    0 |
| N/A   35C    P0    60W / 300W |   1513MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

caic99 (Member) commented Feb 11, 2025

@njzjz 🤔 I ran into this problem a while ago. I recall it being a bug related to DDP? Would you try calling torch.cuda.set_device explicitly?
rwth-i6/returnn#1469

Edit: the problem I encountered behaves like this one.
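For reference, the usual DDP pattern is roughly this (a generic sketch, not deepmd-kit code):

import os

import torch
import torch.distributed as dist

# Pin each worker to its own GPU before anything touches CUDA, so that
# context and workspace memory do not all land on GPU 0.
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")
device = torch.device(f"cuda:{local_rank}")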

njzjz (Member, Author) commented Feb 11, 2025

Again, I don't see the need to calculate it on every rank.

This issue also reveals another problem: the automatic batch-size module does not catch this error message, which I have never seen before:

if isinstance(e, RuntimeError) and (
    "CUDA out of memory." in e.args[0]
    or "CUDA driver error: out of memory" in e.args[0]
    or "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR" in e.args[0]
):
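A broadened check might look like this (just a sketch, not a tested patch; the extra string is the one from the traceback above):

OOM_MESSAGES = (
    "CUDA out of memory.",
    "CUDA driver error: out of memory",
    "CUDA error: out of memory",  # the message seen in this traceback
    "cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR",
)


def is_oom_error(e: Exception) -> bool:
    # Treat any known OOM message as recoverable by the automatic
    # batch-size search.
    return isinstance(e, RuntimeError) and any(m in e.args[0] for m in OOM_MESSAGES)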

caic99 (Member) commented Feb 11, 2025

> Again, I don't see the need to calculate it on every rank.

Yes, you are right.

For large datasets (e.g. DPA pretraining), this step takes quite a long time (a few hours). In practice, we run the neighbor statistics in a single CPU process and then run training on the GPUs with the saved results.

A possible solution would be to calculate the neighbor statistics only on rank 0 and throw a warning if there are other ranks?
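Roughly along these lines (a sketch only; compute_neighbor_stat stands in for the real call):

import logging

import torch.distributed as dist

log = logging.getLogger(__name__)


def neighbor_stat_rank0_only(compute_neighbor_stat):
    """Compute neighbor statistics on rank 0 only and broadcast the result."""
    if not dist.is_initialized():
        return compute_neighbor_stat()
    if dist.get_rank() == 0:
        result = [compute_neighbor_stat()]
    else:
        log.warning(
            "Neighbor statistics are computed on rank 0 only; "
            "rank %d waits for the broadcast result.",
            dist.get_rank(),
        )
        result = [None]
    dist.broadcast_object_list(result, src=0)
    return result[0]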
