[PyTorch] Fix fuse_wgrad_accumulation for GroupedLinear #1488

Merged · 4 commits into NVIDIA:main · Feb 19, 2025

Conversation

@yaox12 yaox12 (Collaborator) commented Feb 17, 2025

Description

Due to incorrect indentation, the wgrad computation is not called when ctx.fuse_wgrad_accumulation == True.

Also update the test to cover this case.
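
To make the shape of the fix concrete, here is a minimal, hypothetical sketch of the indentation bug; the function and variable names are illustrative only, not the actual GroupedLinear backward:

```python
# Hypothetical sketch of the bug, not the actual GroupedLinear code.
import torch

def compute_wgrad(fuse_wgrad_accumulation, grad_out, inp, main_grad):
    if fuse_wgrad_accumulation:
        wgrad = main_grad                    # accumulate directly into main_grad
    else:
        wgrad = torch.zeros_like(main_grad)  # fresh wgrad buffer
        # BUG: before this PR the GEMM below was indented into this else-branch,
        # so no wgrad was computed when fuse_wgrad_accumulation was True.
    wgrad += grad_out.t() @ inp              # FIX: runs for both branches
    return wgrad
```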

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Fix the indentation in the GroupedLinear backward so the wgrad computation runs when ctx.fuse_wgrad_accumulation == True
  • Update the GroupedLinear unit test to cover the fuse_wgrad_accumulation case

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@yaox12 yaox12 added the bug and 2.1.0 labels Feb 17, 2025
@yaox12 yaox12 requested a review from timmoon10 February 17, 2025 08:22
@yaox12 yaox12 (Collaborator, Author) commented Feb 17, 2025

/te-ci pytorch

Signed-off-by: Xin Yao <[email protected]>
@yaox12 yaox12 (Collaborator, Author) commented Feb 17, 2025

/te-ci pytorch

ctx.weights_shape_1 = weights[0].shape[1]

tensors_to_save, tensor_objects = prepare_for_saving(*inputmats, *weights_fp8, *biases)
ctx.save_for_backward(*tensors_to_save)
ctx.tensor_objects = tensor_objects

ctx.weights_requires_grad = weights[0].requires_grad
if fuse_wgrad_accumulation and ctx.weights_requires_grad:
    ctx.main_grads = [weights[i].main_grad for i in range(num_gemms)]
@timmoon10 (Collaborator) commented:
It's recommended to use ctx.save_for_backward instead of storing tensors directly in ctx. They warn about messing up the grad graph and memory leaks, although I'm not sure what cases they are specifically worried about.

@yaox12 yaox12 (Collaborator, Author) commented Feb 19, 2025

I agree. We were saving the main_grad tensors with ctx.save_for_backward in TE 1.x, but I see there is a comment here:

# Since main_grad can be modified inplace, it should not be a part of saved_tensors

I'm wondering whether we have actually seen issues with ctx.save_for_backward?

@timmoon10 (Collaborator) commented:

Interesting, we should follow the example of Linear then.

@ksivaman This change is from commit 7e58678 in the internal repo. Do you remember why we can't store main_grad in saved_tensors?

@yaox12 (Collaborator, Author) commented:

Oh, I see: the previous prepare_for_saving saved tensor.data instead of the tensor itself, so for main_grad, which needs to be modified in place, this could be an issue.

Now that #1474 has changed prepare_for_saving to save the tensor itself, this is no longer a problem.
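
For readers following along, here is a minimal, self-contained sketch (hypothetical, not TE code) contrasting the two ways of stashing tensors discussed above:

```python
# Hypothetical sketch contrasting ctx.save_for_backward with storing tensors on ctx.
import torch

class _FusedWgradDemo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inp, weight):
        # Preferred path: autograd tracks saved tensors and can detect in-place edits.
        ctx.save_for_backward(inp, weight)
        # main_grad is meant to be modified in place, so it is kept off saved_tensors
        # and stored as a plain attribute instead.
        ctx.main_grad = getattr(weight, "main_grad", None)
        return inp @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        inp, weight = ctx.saved_tensors
        grad_inp = grad_out @ weight
        wgrad = grad_out.t() @ inp
        if ctx.main_grad is not None:
            ctx.main_grad += wgrad  # fused wgrad accumulation
            wgrad = None            # nothing left for autograd to put in .grad
        return grad_inp, wgrad
```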

outputs.append(p.grad)
if getattr(p, "main_grad", None) is not None:
    outputs.append(p.main_grad)
    assert p.grad is None  # grad should be None if fuse_wgrad_accumulation is True
@timmoon10 (Collaborator) commented:
It turns out Mcore expects p.grad to not be None: #1474 (comment)
#1474 sets grad to an uninitialized tensor and assumes Mcore will ignore it.
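
A hypothetical sketch of the kind of check being adjusted in the test (names are illustrative, not the actual test code):

```python
# Hypothetical sketch, not the actual test: gather the effective weight gradients.
import torch

def collect_wgrads(module: torch.nn.Module) -> list:
    grads = []
    for p in module.parameters():
        main_grad = getattr(p, "main_grad", None)
        if main_grad is not None:
            # With fuse_wgrad_accumulation the real gradient lives in main_grad;
            # p.grad may be an uninitialized placeholder kept for Mcore compatibility.
            grads.append(main_grad)
        else:
            grads.append(p.grad)
    return grads
```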

@timmoon10 timmoon10 self-requested a review February 18, 2025 23:33
@timmoon10 timmoon10 (Collaborator) commented:
/te-ci pytorch

@timmoon10 timmoon10 merged commit fceff07 into NVIDIA:main Feb 19, 2025
11 of 12 checks passed