
Why is ZeRO1 partitioning the model parameters but not the optimizer state? #1

Open
calico-niko opened this issue Jul 8, 2024 · 1 comment


@calico-niko

Hi, awesome project. I've learned a lot from it; thanks for your great work!

However, the partitioning policy confuses me. From the blog post I read, the ZeRO-1 policy only partitions the optimizer state (self.m and self.v in the code), yet the code shown below also splits the model parameters. Would you mind explaining why?

for idx, (_, param) in enumerate(self._local_params()):
    si_s, si_e = self.shard_indices[idx]
    # Copy this rank's slice of the parameters into the flat fp32 master buffer.
    self.sharded_fp32_master_param[si_s:si_e] = param.data.view(-1).float()
    # Set grad as well: make param.grad a view into the flat local gradient buffer.
    param.grad = torch.zeros_like(param.data)
    param.grad.data = self.local_grad_buffer_hp[si_s:si_e].view_as(param.data)
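For what it's worth, the last two lines use a plain PyTorch aliasing trick: param.grad is pointed at a slice of one flat buffer, so the whole rank-local gradient can live in a single contiguous tensor. A standalone sketch (my own toy example, not this repo's code):

import torch

# Toy example of the aliasing in the last two lines above: param.grad becomes a
# view into a flat buffer, so in-place writes to the buffer (e.g. after a
# gradient reduction) are immediately visible as the parameter's gradient.
param = torch.nn.Parameter(torch.randn(2, 3))
flat_grads = torch.zeros(param.numel())

param.grad = torch.zeros_like(param.data)
param.grad.data = flat_grads[: param.numel()].view_as(param.data)  # alias, no copy

flat_grads.fill_(1.0)
print(param.grad)  # all ones: the gradient tensor shares storage with flat_grads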

@MostHumble

@calico-niko Not sure if it's useful to you, but here's what I understood:
The optimizer state consists of a copy of the parameters plus the momentum/variance buffers (self.m and self.v):

[image: illustration of the optimizer state components]
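That is also where the memory savings come from. As a rough sanity check (my own back-of-the-envelope numbers following the ZeRO paper's mixed-precision Adam accounting, not anything in this repo):

# Rough memory accounting in the style of the ZeRO paper, assuming
# mixed-precision Adam: the optimizer state is K = 12 bytes per parameter
# (fp32 master copy + m + v, 4 bytes each), and ZeRO-1 divides only that
# term by the data-parallel world size.
def bytes_per_rank(num_params, world_size, K=12):
    fp16_params = 2 * num_params               # half-precision model weights
    fp16_grads = 2 * num_params                # half-precision gradients
    opt_state = K * num_params / world_size    # sharded optimizer state
    return fp16_params + fp16_grads + opt_state

print(bytes_per_rank(7.5e9, 64) / 1e9)   # ~31.4 GB per GPU with ZeRO-1
print(bytes_per_rank(7.5e9, 1) / 1e9)    # ~120 GB without sharding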

Along with the parameter sharding you mentioned, the momentum buffers are sharded here as well (the loop only visits _local_params, so the final current_offset is the size of this rank's shard, not of the full model):

current_offset = 0
# Initialize config per-shard.
for gidx, param in self._local_params():
    self.offsets.append(param.data.view(-1).size(0))
    self.shard_indices.append(
        (current_offset, current_offset + param.data.view(-1).size(0))
    )
    current_offset += param.data.view(-1).size(0)
    self.local_param_indices.add(gidx)
self.v = torch.zeros(current_offset).to(self.device)
self.m = torch.zeros(current_offset).to(self.device)
self.sharded_fp32_master_param = torch.zeros(current_offset).to(self.device)
self.local_grad_buffer_hp = torch.zeros(current_offset).to(
    self.device, dtype=forward_dtype
)
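I haven't checked exactly how _local_params assigns ownership, but judging by how it's used (it yields (global_index, param) pairs, and only for this rank's parameters), a hypothetical sketch could look like the following; the round-robin rule is my assumption, not the repo's actual logic:

# Hypothetical sketch of _local_params (assumed, not the actual implementation):
# yield only the parameters this rank owns, together with their global index.
def _local_params(self):
    for gidx, param in enumerate(self.model.parameters()):
        if gidx % self.world_size == self.rank:  # assumed round-robin ownership
            yield gidx, param

So self.m, self.v, and self.sharded_fp32_master_param are each allocated with only current_offset elements, i.e. one shard of the optimizer state per rank.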
