An error occurred while running grpo: assert not param.ds_active_sub_modules, param.ds_summary() #299
Comments
Same problem!!

Same issue
However, the config_demo.yaml configuration does not set train_batch_size. Which other parameters need to satisfy the divisibility relationship?
It can be set here:

vLLM inference occupies 1 GPU, so the equation
The parameter "per_device_train_batch_size" seems more like "micro_batch_size" than "train_batch_size", since it represents the batch size on each GPU. I don't understand why this value needs to be divisible by the number of GPUs.
Though I don't know how vLLM works, I found that you can change the train set size to avoid it. That is, trainset_size % batch_size (gradient_acc * batch_size_per_gpu * gpu_num) == 0. For example, I have this setting:

```
sbatch --job-name=open_r1 --nodes=1 \
    train.slurm
```

```yaml
gradient_accumulation_steps: 2
per_device_train_batch_size: 1
num_processes: 7
```

In this setting, the batch size is 1 node * 7 GPUs * 1 batch_size_per_device * 2 gradient_acc = 14, and my trainset_size needs to be a multiple of the batch size, e.g., trainset_size = 1,848.
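The arithmetic above can be sketched as a quick sanity check. This is a hypothetical helper, not part of open-r1; the function and variable names are made up for illustration, and it assumes the commenter's setup where one GPU is reserved for vLLM so only `num_processes` GPUs run training:

```python
def global_batch_size(nodes: int, num_processes: int,
                      per_device_train_batch_size: int,
                      gradient_accumulation_steps: int) -> int:
    # Effective batch size per optimizer step across all training GPUs.
    return (nodes * num_processes * per_device_train_batch_size
            * gradient_accumulation_steps)

def trainset_fits(trainset_size: int, batch_size: int) -> bool:
    # The workaround from the comment: trim the train set so it
    # divides evenly into global batches.
    return trainset_size % batch_size == 0

# The commenter's setting: 1 node * 7 GPUs * 1 * 2 = 14
bs = global_batch_size(nodes=1, num_processes=7,
                       per_device_train_batch_size=1,
                       gradient_accumulation_steps=2)
print(bs)                       # 14
print(trainset_fits(1848, bs))  # True: 1848 = 14 * 132
```

If the check fails, truncating the dataset to `(trainset_size // bs) * bs` examples restores divisibility.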
Do you mean that actor_size = num_processes? So, should gradient_acc * batch_size_per_gpu * gpu_num be divisible by num_processes?
When using
open-r1/recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml
for GRPO training, an error occurred. What could be the reason?
Update: I tried again, but the same error occurred at step 378. Could it be due to the dataset?