Fix broken accumulate_grad_batches behavior #287
Fix broken `accumulate_grad_batches` argument in v5 trainer

While trying to finetune some of the RWKV-7-Pile models, I found that the `accumulate_grad_batches` argument sent to the main trainer file had some bugs. These bugs occur because the training code doesn't take gradient accumulation steps into account when calculating the total tokens processed and the number of steps to resume at.
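To make the mismatch concrete, here is a toy calculation (a sketch with made-up numbers; only the `epoch_steps` and `accumulate_grad_batches` names come from the trainer arguments): with gradient accumulation, only one optimizer step happens per `accumulate_grad_batches` micro-steps, so counting micro-steps when resuming overshoots the true optimization step by that factor.

```python
# Toy numbers only; epoch_steps and accumulate_grad_batches are trainer argument
# names, the values below are made up for illustration.
epoch_steps = 1000            # micro-steps (forward/backward passes) per epoch
accumulate_grad_batches = 8   # micro-steps per optimizer step
epochs_done = 3               # epochs completed before resuming

# The run has actually taken 3 * 1000 / 8 = 375 optimizer steps ...
actual_optimizer_steps = epochs_done * epoch_steps // accumulate_grad_batches

# ... but offsetting the resume step by epoch_steps per epoch assumes 3000,
# an overshoot of exactly accumulate_grad_batches.
naive_resume_step = epochs_done * epoch_steps

print(actual_optimizer_steps, naive_resume_step)  # 375 3000
```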
To fix this, I modified the trainer code so that the step logged to W&B is the actual optimization step, not the gradient accumulation (micro) step, by dividing `args.epoch_steps` by `args.accumulate_grad_batches` in the calculation of `real_step`. This should have no effect when `args.accumulate_grad_batches` is `1`, which is the default. I also modified the learning rate schedule so that `real_tokens` and `warmup_tokens` are scaled by the number of gradient accumulation batches, so the schedule advances correctly. All other code is left unchanged, and the training progress bar still counts micro-steps.
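The sketch below illustrates both adjustments with a simplified stand-in for the trainer's argument object. Only `real_step`, `real_tokens`, `warmup_tokens`, `args.epoch_steps`, and `args.accumulate_grad_batches` are names from the actual change; the remaining field names, the token formula, and the values are assumptions for illustration, not the trainer's real code.

```python
from types import SimpleNamespace

# Stand-in for the trainer's args; only epoch_steps and accumulate_grad_batches
# are taken from the change described above, the other fields are assumed.
args = SimpleNamespace(
    epoch_steps=1000,            # micro-steps per epoch
    accumulate_grad_batches=8,   # micro-steps per optimizer step
    epoch_begin=2,               # epoch index to resume from (assumed name)
    ctx_len=4096,                # tokens per sample (assumed name)
    real_bsz=64,                 # global batch size (assumed name)
    warmup_steps=100,            # warmup length in optimizer steps (assumed name)
)

def real_step(global_step: int) -> int:
    # The resume offset is counted in optimizer steps, so the per-epoch
    # micro-step count is divided by the accumulation factor.
    steps_per_epoch = args.epoch_steps // args.accumulate_grad_batches
    return global_step + args.epoch_begin * steps_per_epoch

def token_counts(step: int) -> tuple[int, int]:
    # Each optimizer step consumes accumulate_grad_batches micro-batches of
    # tokens, so both token counters are scaled by that factor before being
    # fed to the learning rate schedule.
    tokens_per_step = args.ctx_len * args.real_bsz * args.accumulate_grad_batches
    real_tokens = step * tokens_per_step
    warmup_tokens = args.warmup_steps * tokens_per_step
    return real_tokens, warmup_tokens

step = real_step(global_step=0)          # first optimizer step after resuming
real_tokens, warmup_tokens = token_counts(step)
print(step, real_tokens, warmup_tokens)  # 250 524288000 209715200
```

With `accumulate_grad_batches=1` the division and the scaling are no-ops, which is why the default behavior is unchanged.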
In my testing, this appears to fix the issue of resuming training, and the learning rate scales properly. Below is a W&B comparison of a training run before the change and one after, both using `accumulate_grad_batches=8` (on different data, though).

Before (stopped and resumed around 15k steps):
After (stopped and resumed at ~200 and ~400 steps; the LR does scale properly with `my_exit_tokens`, though that is not visible in this image):

NOTE: While in theory this should not be a breaking change, I would still highly recommend testing on a multi-GPU setup for any remaining bugs, as I only had access to my local GPU while testing.
How to Test:

Do a training run with the `--accumulate_grad_batches` argument set to a number greater than `1`; check that the learning rate schedule works properly and that resuming from a checkpoint does not cause step gaps in the loss curve.
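As one way to check the step-gap criterion, a small script could scan the step column of a loss curve exported from W&B as a CSV; this is only a hypothetical helper, and the file name and column name below are assumptions rather than anything produced by this PR.

```python
import csv

def check_no_step_gaps(path: str = "loss_curve.csv", max_gap: int = 1) -> None:
    # "step" is an assumed column name for a CSV exported from the W&B loss panel.
    with open(path, newline="") as f:
        steps = [int(float(row["step"])) for row in csv.DictReader(f)]
    gaps = [(a, b) for a, b in zip(steps, steps[1:]) if b - a > max_gap]
    if gaps:
        raise AssertionError(f"Step gaps found after resume: {gaps[:5]}")
    print(f"OK: {len(steps)} logged steps, no gap larger than {max_gap}")

check_no_step_gaps()
```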