
An error occurred while running grpo: assert not param.ds_active_sub_modules, param.ds_summary() #299

Open
Tendo33 opened this issue Feb 13, 2025 · 9 comments

Comments

@Tendo33
Contributor

Tendo33 commented Feb 13, 2025

When using open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml for GRPO training, the following error occurred. What could be the reason?

{'loss': 0.0115, 'grad_norm': 0.5289333401881726, 'learning_rate': 3.61242838495853e-07, 'rewards/accuracy_reward': 0.2666666775941849, 'rewards/format_reward': 0.8750000357627868, 'reward': 1.1416666984558106, 'reward_std': 0.3181980699300766, 'completion_length': 317.7666748046875, 'kl': 0.92265625, 'epoch': 1.0}
 91%|████████████████████████████▏  | 378/415 [10:26:07<51:42, 83.86s/it]
Traceback (most recent call last):
[rank2]: Traceback (most recent call last):
[rank2]:   File "/workspace/sunjinfeng/github_projet/open-r1/src/open_r1/grpo.py", line 259, in <module>
[rank2]:     main(script_args, training_args, model_args)
[rank2]:   File "/workspace/sunjinfeng/github_projet/open-r1/src/open_r1/grpo.py", line 213, in main
[rank2]:     train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2184, in train
[rank2]:     return inner_training_loop(
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 2490, in _inner_training_loop
[rank2]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/transformers/trainer.py", line 3592, in training_step
[rank2]:     inputs = self._prepare_inputs(inputs)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 477, in _prepare_inputs
[rank2]:     self._move_model_to_vllm()
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 448, in _move_model_to_vllm
[rank2]:     with unwrap_model_for_generation(
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/contextlib.py", line 144, in __exit__
[rank2]:     next(self.gen)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/trl/models/utils.py", line 195, in unwrap_model_for_generation
[rank2]:     with deepspeed.zero.GatheredParameters(model.parameters()):
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2241, in __exit__
[rank2]:     self.params[0].partition(param_list=self.params, has_been_updated=False)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1386, in partition
[rank2]:     self._partition(param_list, has_been_updated=has_been_updated)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1535, in _partition
[rank2]:     self._partition_param(param, has_been_updated=has_been_updated)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1568, in _partition_param
[rank2]:     free_param(param)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank2]:     ret_val = func(*args, **kwargs)
[rank2]:   File "/workspace/sunjinfeng/miniconda3/envs/openr1/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
[rank2]:     assert not param.ds_active_sub_modules, param.ds_summary()
[rank2]: AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 233373696, 'ds_numel': 233373696, 'shape': (151936, 1536), 'ds_shape': (151936, 1536), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {372}, 'ds_tensor.shape': torch.Size([77791232])}
  File "/workspace/sunjinfeng/github_projet/open-r1/src/open_r1/grpo.py", line 259, in <module>
    main(script_args, training_args, model_args)

Update: I tried again, and the same error occurred at step 378. Could it be due to the dataset?

@Vaidurya00

same problem!!

@wuyifan18

Same issue

@Tendo33
Contributor Author

Tendo33 commented Feb 18, 2025

@nomadlx

nomadlx commented Feb 18, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

@Tendo33
Contributor Author

Tendo33 commented Feb 18, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:

per_device_train_batch_size: 16


vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.
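
For concreteness, here is a minimal sketch of that check in Python (not open-r1 or TRL code; the names train_batch_size, actor_size, and micro_batch_size come from the quoted OpenRLHF discussion, and the numbers are hypothetical for an 8-GPU node where vLLM reserves one GPU):

num_gpus = 8
actor_size = num_gpus - 1          # 7 GPUs left for training, 1 reserved for vLLM
micro_batch_size = 16              # per_device_train_batch_size
train_batch_size = 128             # hypothetical global batch size

# The constraint as written in the quoted issue.
satisfied = train_batch_size % actor_size % micro_batch_size == 0
print(satisfied)                   # False here: 128 % 7 == 2, and 2 % 16 == 2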

@nomadlx

nomadlx commented Feb 18, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:

open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml

Line 41 in d5b67f4

per_device_train_batch_size: 16
vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.

The parameter "per_device_train_batch_size" seems more like "micro_batch_size" rather than "train_batch_size", because it is the batch size on each GPU. I don't understand why this value needs to be divisible by the number of GPUs.

@luoruikun

Though I don't know how vLLM works, I found that you can change the training set size to avoid it. That is, trainset_size % batch_size == 0, where batch_size = gradient_acc * batch_size_per_gpu * gpu_num. For example, I have this setting:

sbatch --job-name=open_r1 --nodes=1 \
    train.slurm
gradient_accumulation_steps: 2
per_device_train_batch_size: 1
num_processes: 7

In this setting, the batch_size is 1 node * 7 GPUs * 1 batch_size_per_device * 2 gradient_acc = 14, and my trainset_size needs to be a multiple of that batch_size, e.g., trainset_size = 1,848.
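
A short Python sketch of that arithmetic (illustrative only; trainset_size and effective_batch_size are just descriptive names, not config keys):

nodes = 1
num_processes = 7                  # training GPUs; 1 GPU is reserved for vLLM
per_device_train_batch_size = 1
gradient_accumulation_steps = 2

effective_batch_size = (nodes * num_processes
                        * per_device_train_batch_size
                        * gradient_accumulation_steps)   # 1 * 7 * 1 * 2 = 14

trainset_size = 1848               # 132 * 14, so it divides evenly
assert trainset_size % effective_batch_size == 0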

@Tendo33
Contributor Author

Tendo33 commented Feb 19, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:
open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml
Line 41 in d5b67f4
per_device_train_batch_size: 16
vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.

The parameter "per_device_train_batch_size" seems more like "micro_batch_size" rather than "train_batch_size", because it is the batch size on each GPU. I don't understand why this value needs to be divisible by the number of GPUs.

batch_size = gradient_acc * batch_size_per_gpu * gpu_num

@nomadlx

nomadlx commented Feb 19, 2025

OpenRLHF/OpenRLHF#630 (comment)

However, the config_demo.yaml configuration does not involve the train_batch_size. Which other parameters need to satisfy the divisibility relationship?

It can be set here:
open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml
Line 41 in d5b67f4
per_device_train_batch_size: 16
vLLM inference occupies 1 GPU, so the equation train_batch_size % actor_size % micro_batch_size == 0 is not satisfied.

The parameter "per_device_train_batch_size" seems more like "micro_batch_size" rather than "train_batch_size", because it is the batch size on each GPU. I don't understand why this value needs to be divisible by the number of GPUs.

batch_size = gradient_acc * batch_size_per_gpu * gpu_num

Do you mean that actor_size=num_processes? So, should gradient_acc * batch_size_per_gpu * gpu_num be divisible by num_processes?
