
Fix incorrect patch in zero.init #5921

Closed

Conversation

VeryLazyBoy
Contributor

The code below has a problem: cls.__init__ on line 525 can be modified before the assignment to _old_init, because a subclass without its own __init__ inherits the __init__ of a superclass that may already have been patched. This leads to an incorrect __init__ being backed up:

def _enable_class(cls):
    cls._old_init = cls.__init__
    cls.__init__ = partition_after(cls.__init__)


def _init_subclass(cls, **kwargs):
    cls._old_init = cls.__init__
    cls.__init__ = partition_after(cls.__init__)


# Replace .__init__() for all existing subclasses of torch.nn.Module recursively
for subclass in get_all_subclasses(torch.nn.modules.module.Module):
    _enable_class(subclass)
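
To make the failure mode concrete, here is a minimal, self-contained sketch in plain Python (not DeepSpeed code; partition_after below is just a stand-in for the real wrapper):

def partition_after(init_fn):
    # Stand-in for DeepSpeed's wrapper: call the original __init__;
    # the real code would partition the module's parameters afterwards.
    def wrapper(self, *args, **kwargs):
        init_fn(self, *args, **kwargs)
    return wrapper


class Parent:
    def __init__(self):
        pass


class Child(Parent):   # defines no __init__ of its own
    pass


def _enable_class(cls):
    cls._old_init = cls.__init__
    cls.__init__ = partition_after(cls.__init__)


original_init = Parent.__init__

# If Parent happens to be patched before Child, then Child.__init__ already
# resolves to Parent's patched wrapper, and that wrapper is what gets saved:
_enable_class(Parent)
_enable_class(Child)

assert Child._old_init is Parent.__init__     # backed up the patched wrapper
assert Child._old_init is not original_init   # the original is lost for Child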

Test Case

import deepspeed
from torch import nn


class ModelA(nn.Module):
    def __init__(self):
        super().__init__()


class ModelB(ModelA):
    pass


original_init = ModelA.__init__


ds_config = {
    'fp16': {'enabled': False},
    'bf16': {'enabled': True},
    'zero_optimization': {
        'stage': 3,
        'offload_optimizer': {
            'device': 'cpu',
            'pin_memory': True
        },
        'offload_param': {
            'device': 'cpu',
            'pin_memory': True
        },
    },
    'gradient_accumulation_steps': 1,
    'gradient_clipping': 1,
    'train_batch_size': 1,
    'train_micro_batch_size_per_gpu': 1
}


with deepspeed.zero.Init(config_dict_or_path=ds_config, enabled=True, mem_efficient_linear=False, mpu=None):
    model_a = ModelA()
    assert ModelA.__init__ != original_init

assert ModelA.__init__ == original_init
assert ModelB.__init__ == original_init   # Fails here. If it does not, run the script a few times: the failure depends on the order in which classes are patched

@VeryLazyBoy
Contributor Author

@microsoft-github-policy-service agree

@VeryLazyBoy
Contributor Author

A better solution is proposed that also handles _init_subclass.

@tjruwase tjruwase requested a review from tohtana August 15, 2024 17:53
@tohtana
Contributor

tohtana commented Aug 21, 2024

Thank you @VeryLazyBoy for the great catch!

I think the issue is that we patch the superclass's cls.__init__ when cls doesn't have its own __init__. So I tried another approach in this branch. Do you think this works?
It is less intrusive because we do not set __init__ on classes that don't define one.
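
Roughly, the idea is something like this (a sketch only, reusing the names from the snippet above; not the exact code in the branch):

def _enable_class(cls):
    # Only patch classes that define their own __init__; classes that merely
    # inherit __init__ keep resolving to their (patched) superclass.
    if '__init__' not in cls.__dict__:
        return
    cls._old_init = cls.__dict__['__init__']
    cls.__init__ = partition_after(cls._old_init)


for subclass in get_all_subclasses(torch.nn.modules.module.Module):
    _enable_class(subclass)

# Subclasses without their own __init__ fall back to the root class,
# so the root class itself must be patched as well.
_enable_class(torch.nn.modules.module.Module)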

@VeryLazyBoy
Contributor Author

@tohtana Yes! Your approach is less intrusive and much better. Let's go ahead with your new method. Should I close this pull request?

@tohtana
Contributor

tohtana commented Aug 21, 2024

@VeryLazyBoy Thank you for your response!
Let me create a PR from my branch to make sure it works. We can close this PR once all tests pass on that one.

github-merge-queue bot pushed a commit that referenced this pull request Sep 4, 2024
This PR fixes the issue reported in #5921.
With this change, we only apply the patch for parameter partitioning to
classes that define their own `__init__`, so that the patch cannot be applied
multiple times.
A class that does not define `__init__` now uses its superclass's, so this PR
also applies the patch to the root class, `torch.nn.modules.module.Module`.

Thanks @VeryLazyBoy for the report and initial solution.

---------

Co-authored-by: Logan Adams <[email protected]>
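
As a plain-Python illustration (not code from the PR) of why the root class needs the patch once only classes defining their own __init__ are wrapped:

class Root:
    def __init__(self):
        pass


class Leaf(Root):
    pass   # no __init__ of its own


assert '__init__' not in Leaf.__dict__    # nothing to patch on Leaf itself
assert Leaf.__init__ is Root.__init__     # Leaf resolves to Root's __init__
# Leaf() only goes through the partitioning wrapper if Root.__init__ is patched.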