
An error occurred while running grpo: assert not param.ds_active_sub_modules, param.ds_summary() #299

Open
Tendo33 opened this issue Feb 13, 2025 · 9 comments

Comments

@Tendo33
Contributor

Tendo33 commented Feb 13, 2025

When using open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml for GRPO training, the following error occurred. What could be the reason?

{'loss': 0.0115, 'grad_norm': 0.5289333401881726, 'learning_rate': 3.61242838495853e-07, 'rewards/accuracy_reward': 0.2666666775941849, 'rewards/format_reward': 0.8750000357627868, 'reward': 1.1416666984558106, 'reward_std': 0.3181980699300766, 'completion_length': 317.7666748046875, 'kl': 0.92265625, 'epoch': 1.0}
 91%|████████████████████████████▏  | 378/415 [10:26:07<51:42, 83.86s/it]
Traceback (most recent call last):
[rank2]: Traceback (most recent call last):
[rank2]:   File "/workspace/sunjinfeng/github_projet/open-r1/src/open_r1/grpo.py", line 259, in <module>
[rank2]:     main(script_args, training_args, model_args)
[rank2]:   File "/workspace/sunjinfeng/github_projet/open-r1/src/open_r1/grpo.py", line 213, in main
[rank2]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2184, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank2]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 3592, in training_step
[rank2]:     inputs = self._prepare_inputs(inputs)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 477, in _prepare_inputs
[rank2]:     self._move_model_to_vllm()
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 448, in _move_model_to_vllm
[rank2]:     with unwrap_model_for_generation(
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/contextlib.py", line 144, in __exit__
[rank2]:     next(self.gen)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/models/utils.py", line 195, in unwrap_model_for_generation
[rank2]:     with deepspeed.zero.GatheredParameters(model.parameters()):
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2241, in __exit__
[rank2]:     self.params[0].partition(param_list=self.params, has_been_updated=False)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1386, in partition
[rank2]:     self._partition(param_list, has_been_updated=has_been_updated)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1535, in _partition
[rank2]:     self._partition_param(param, has_been_updated=has_been_updated)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1568, in _partition_param
[rank2]:     free_param(param)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
[rank2]:     assert not param.ds_active_sub_modules, param.ds_summary()
[rank2]: AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 233373696, 'ds_numel': 233373696, 'shape': (151936, 1536), 'ds_shape': (151936, 1536), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {372}, 'ds_tensor.shape': torch.Size([77791232])}
  File "/workspace/sunjinfeng/github_projet/open-r1/src/open_r1/grpo.py", line 259, in <module>
    main(script_args, training_args, model_args)

Update: I tried again, and the same error occurred at step 378. Could it be due to the dataset?

@Vaidurya00

same problem!!

@wuyifan18

Same issue

@Tendo33
Contributor Author

Tendo33 commented Feb 18, 2025

@nomadlx

nomadlx commented Feb 18, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

@Tendo33
Contributor Author

Tendo33 commented Feb 18, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:

per_device_train_batch_size: 16


vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.
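
For concreteness, here is a minimal sketch of that check in Python (not open-r1 or TRL code; the names train_batch_size, actor_size, and micro_batch_size come from the quoted OpenRLHF discussion, and the numbers are hypothetical for an 8-GPU node where vLLM reserves one GPU):

num_gpus = 8
actor_size = num_gpus - 1          # 7 GPUs left for training, 1 reserved for vLLM
micro_batch_size = 16              # per_device_train_batch_size
train_batch_size = 128             # hypothetical global batch size

# The constraint as written in the quoted issue.
satisfied = train_batch_size % actor_size % micro_batch_size == 0
print(satisfied)                   # False here: 128 % 7 == 2, and 2 % 16 == 2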

@nomadlx

nomadlx commented Feb 18, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:

open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml

Line 41 in d5b67f4

per_device_train_batch_size: 16
vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.

The parameter "per_device_train_batch_size" seems more like "micro_batch_size" rather than "train_batch_size", because it is the batch size on each GPU. I don't understand why this value needs to be divisible by the number of GPUs.

@luoruikun

Though I don't know how vLLM works, I found that you can change the training set size to avoid it. That is, trainset_size % batch_size == 0, where batch_size = gradient_acc * batch_size_per_gpu * gpu_num. For example, I have this setting:

sbatch --job-name=open_r1 --nodes=1 \
    train.slurm
gradient_accumulation_steps: 2
per_device_train_batch_size: 1
num_processes: 7

In this setting, the batch_size is 1 node * 7 GPUs * 1 batch_size_per_device * 2 gradient_acc = 14, and my trainset_size needs to be a multiple of that batch_size, e.g., trainset_size = 1,848.
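
A short Python sketch of that arithmetic (illustrative only; trainset_size and effective_batch_size are just descriptive names, not config keys):

nodes = 1
num_processes = 7                  # training GPUs; 1 GPU is reserved for vLLM
per_device_train_batch_size = 1
gradient_accumulation_steps = 2

effective_batch_size = (nodes * num_processes
                        * per_device_train_batch_size
                        * gradient_accumulation_steps)   # 1 * 7 * 1 * 2 = 14

trainset_size = 1848               # 132 * 14, so it divides evenly
assert trainset_size % effective_batch_size == 0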

@Tendo33
Contributor Author

Tendo33 commented Feb 19, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:
open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml
Line 41 in d5b67f4
per_device_train_batch_size: 16
vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.

The parameter "per_device_train_batch_size" seems more like "micro_batch_size" rather than "train_batch_size", because it is the batch size on each GPU. I don't understand why this value needs to be divisible by the number of GPUs.

batch_size = gradient_acc * batch_size_per_gpu * gpu_num

@nomadlx

nomadlx commented Feb 19, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:
open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml
Line 41 in d5b67f4
per_device_train_batch_size: 16
vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.

The parameter "per_device_train_batch_size" seems more like "micro_batch_size" rather than "train_batch_size", because it is the batch size on each GPU. I don't understand why this value needs to be divisible by the number of GPUs.

batch_size = gradient_acc * batch_size_per_gpu * gpu_num

Do you mean that actor_size=num_processes? So, should gradient_acc * batch_size_per_gpu * gpu_num be divisible by num_processes?
