Fix issues for MCore DDP. #1474

Merged: 11 commits merged into NVIDIA:main from the denliu/fix_mcore_ddp branch on Feb 19, 2025

Conversation

@Victarry (Contributor) commented Feb 11, 2025

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

4: [rank4]: Traceback (most recent call last):
4: [rank4]:   File "/workspace/megatron-lm/pretrain_gpt.py", line 245, in <module>
4: [rank4]:     pretrain(
4: [rank4]:   File "/workspace/megatron-lm/megatron/training/training.py", line 313, in pretrain
4: [rank4]:     iteration, num_floating_point_operations_so_far = train(
4: [rank4]:                                                       ^^^^^^
4: [rank4]:   File "/workspace/megatron-lm/megatron/training/training.py", line 1157, in train
4: [rank4]:     train_step(forward_step_func,
4: [rank4]:   File "/workspace/megatron-lm/megatron/training/training.py", line 631, in train_step
4: [rank4]:     losses_reduced = forward_backward_func(
4: [rank4]:                      ^^^^^^^^^^^^^^^^^^^^^^
4: [rank4]:   File "/workspace/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 456, in forward_backward_no_pipelining
4: [rank4]:     backward_step(input_tensor, output_tensor, output_tensor_grad, model_type, config)
4: [rank4]:   File "/workspace/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 356, in backward_step
4: [rank4]:     custom_backward(output_tensor[0], output_tensor_grad[0])
4: [rank4]:   File "/workspace/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 155, in custom_backward
4: [rank4]:     Variable._execution_engine.run_backward(
4: [rank4]:   File "/workspace/megatron-lm/megatron/core/distributed/distributed_data_parallel.py", line 223, in param_hook
4: [rank4]:     param.grad is not None
4: [rank4]: AssertionError: param.grad being None is not safe when overlap_grad_reduce is True

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change prepare_for_saving from tensor_list.append(tensor.data) to tensor_list.append(tensor), since appending tensor.data drops parameter attributes such as grad_added_to_main_grad (see the sketch after this list)
  • Add .data to the CPU offload hook (details of the reasoning in the comment below, #1474 (comment))
  • Revert the wgrad return value to an empty tensor instead of None, since MCore DDP with backward overlap requires a tensor value for wgrad
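
A minimal sketch of why the first change matters (plain PyTorch, not TE code): .data returns a new Tensor object that shares storage with the original but not the Python attributes MCore sets on its parameters, so saving tensor.data silently drops markers like grad_added_to_main_grad.

    import torch

    w = torch.nn.Parameter(torch.randn(4, 4))
    w.grad_added_to_main_grad = True  # attribute MCore attaches to params

    alias = w.data                    # new Tensor object, same storage
    print(hasattr(alias, "grad_added_to_main_grad"))  # False: attribute is lost
    print(alias.data_ptr() == w.data_ptr())           # True: storage is shared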

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@Victarry Victarry force-pushed the denliu/fix_mcore_ddp branch from ec1d3ec to 4997b56 Compare February 11, 2025 04:55
@ksivaman (Member)

@Victarry Could you sign-off your commits? Here is the guide.

Signed-off-by: Dennis Liu <[email protected]>
@Victarry Victarry force-pushed the denliu/fix_mcore_ddp branch from 4997b56 to 3ffd732 Compare February 11, 2025 05:24
@Victarry (Contributor, Author)

> @Victarry Could you sign-off your commits? Here is the guide.

Thanks. Done

@ksivaman (Member)

/te-ci pytorch

@timmoon10 (Collaborator) commented Feb 11, 2025

@Victarry Just to confirm, MCore now requires param.grad to be allocated when gradient_accumulation_fusion=True? This avoids some race conditions with backward hooks (hooks are launched on a different thread if grad is None), but also adds unnecessary memory usage. Also, does the distributed optimizer also have this requirement?

@Victarry (Contributor, Author)

> MCore now requires param.grad to be allocated when gradient_accumulation_fusion=True?

MCore has always required param.grad to be allocated when gradient_accumulation_fusion=True, but TE 2.0 changed the return value from an empty tensor to None (a minimal sketch of the convention is below):
https://github.com/NVIDIA/TransformerEngine/blame/49a4535d1addd2c5743a7e280e2f4f2640f0bedf/transformer_engine/pytorch/module/linear.py#L609
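
For reference, a minimal sketch of the convention (names and shapes are assumptions, not the actual TE or MCore code): the fused kernel has already accumulated the weight gradient into weight.main_grad, so the backward only needs to hand autograd an empty placeholder tensor; returning None instead leaves param.grad unset and trips the DDP hook's assertion shown in the traceback.

    import torch

    def wgrad_placeholder(weight: torch.nn.Parameter) -> torch.Tensor:
        # Placeholder returned for the weight gradient when
        # gradient_accumulation_fusion is on; the real gradient already lives
        # in weight.main_grad. TE 2.0 started returning None here instead.
        return torch.empty(
            weight.shape, dtype=weight.dtype, device=weight.device,
            requires_grad=False,
        )

    def param_hook(param: torch.nn.Parameter) -> None:
        # Simplified form of the MCore DDP check from the traceback above.
        assert param.grad is not None, (
            "param.grad being None is not safe when overlap_grad_reduce is True"
        )
        param.grad = None  # gradient has already been fused into param.main_grad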

> Also, does the distributed optimizer also have this requirement?

I'm not familiar with the distributed optimizer. Maybe @deepakn94 can provide some comments?

@Victarry Victarry force-pushed the denliu/fix_mcore_ddp branch from ddd2c1a to 6a2d88a Compare February 13, 2025 06:55
@Victarry Victarry force-pushed the denliu/fix_mcore_ddp branch from 6a2d88a to 594ea31 Compare February 13, 2025 06:57
@yaox12 (Collaborator) commented Feb 13, 2025

/te-ci pytorch

@Victarry (Contributor, Author) commented Feb 13, 2025

> Change prepare_for_saving from tensor_list.append(tensor.data) to tensor_list.append(tensor), since appending tensor.data drops parameter attributes such as grad_added_to_main_grad

I found that the change above causes unit-test failures with CPU offloading, for the following reasons:

    tensors_to_save, tensor_objects = prepare_for_saving(
        saved_inputmat,
        weightmat,
        weight,
        bias,
    )

  1. With BF16 training, weightmat and weight point to the same tensor, so the CPU offload hook is applied to the same object twice.
  2. When the offload hook runs on weightmat, its data is copied to the CPU and the device-side data is then replaced with a blank tensor in
    tensor_on_device.data = torch.Tensor() # Force to release memory
  3. When the offload hook then runs on weight, only a blank tensor is saved, which causes a size mismatch when the tensor is restored in the backward pass.

In the original code, tensor.data created two distinct tensor objects, so force-releasing one did not affect the other; the underlying tensor data, however, was offloaded twice.
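
A minimal repro sketch of the aliasing issue (the names and the simplified hook are assumptions, not the actual TE offload handler):

    import torch

    weight = torch.randn(8, 8)  # stands in for the BF16 parameter
    weightmat = weight          # in BF16 training both names refer to one object

    def offload_hook(t: torch.Tensor) -> torch.Tensor:
        cpu_copy = t.detach().cpu().clone()  # simulate offloading to host memory
        t.data = torch.Tensor()              # force-release the original storage
        return cpu_copy

    # Saving the same Python object twice: the second hook only sees a blank tensor.
    first = offload_hook(weightmat)
    second = offload_hook(weight)
    print(first.shape, second.shape)  # torch.Size([8, 8]) torch.Size([0])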

@Victarry (Contributor, Author) commented Feb 13, 2025

To keep this fix MR simple and get MCore working as soon as possible, I added .data to the save_for_backward hook in the CPU offload handler.

@deepakn94 (Contributor)

> @Victarry Just to confirm, MCore now requires param.grad to be allocated when gradient_accumulation_fusion=True? This avoids some race conditions with backward hooks, but also adds unnecessary memory usage. Also, does the distributed optimizer also have this requirement?

Yes, the distributed optimizer also has this requirement, for the same reason.

@deepakn94 (Contributor)

> MCore now requires param.grad to be allocated when gradient_accumulation_fusion=True?
>
> MCore has always required param.grad to be allocated when gradient_accumulation_fusion=True, but TE 2.0 changed the return value from an empty tensor to None. https://github.com/NVIDIA/TransformerEngine/blame/49a4535d1addd2c5743a7e280e2f4f2640f0bedf/transformer_engine/pytorch/module/linear.py#L609

Yup, exactly. MCore has had this requirement for more than a year; changing the return value to None is a breaking change for us.

@timmoon10 (Collaborator)

/te-ci pytorch

@ptrendx ptrendx added the 2.1.0 label Feb 15, 2025
@yaox12 (Collaborator) commented Feb 17, 2025

/te-ci pytorch

@Victarry (Contributor, Author)

@timmoon10 @ksivaman could I have your approval for this bug fix? The UT failures appear to be caused by unrelated bugs. Thanks!

@ksivaman (Member)

/te-ci pytorch

@timmoon10 (Collaborator) left a comment:

LGTM

@timmoon10 merged commit 978f1d7 into NVIDIA:main on Feb 19, 2025
11 of 12 checks passed