Fix FairseqHydraPretrainJob for better start_checkpoint behavior #554
Currently, when `fairseq_hydra_config["checkpoint"]["restore_file"]` is given as a parameter for the pretraining, resuming jobs becomes much more difficult: fairseq will always restart from the given file, even though the model might already have been trained for several more epochs. Changing the `restore_file` parameter to the new checkpoint would change the job parameters again, so the job would also not be able to continue training.

I therefore propose to handle the `restore_file` parameter inside the job instead of leaving it to fairseq. The idea is to store the given `restore_file` as a job attribute and later move it to `output/checkpoints/checkpoint_last.pt` if the latter does not exist yet, which is also the default `restore_file` inside fairseq. That way, when training has already run for several epochs, the job can still continue from `checkpoint_last.pt`, and it uses the given checkpoint if training has not run before. A sketch of this idea follows below.
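A minimal sketch of the proposed handling, assuming a Sisyphus-style job with a `run` task. The attribute name `start_checkpoint`, the `output/checkpoints` layout, and the use of a copy (rather than a move, to keep the source checkpoint intact) are illustrative assumptions, not the actual `FairseqHydraPretrainJob` implementation.

```python
import os
import shutil


class FairseqHydraPretrainJobSketch:
    """Illustrative sketch, not the real i6_core job."""

    def __init__(self, fairseq_hydra_config, **kwargs):
        # Take "restore_file" out of the config so it is not passed on to
        # fairseq; remember it as a job attribute instead.
        checkpoint_cfg = fairseq_hydra_config.get("checkpoint", {})
        self.start_checkpoint = checkpoint_cfg.pop("restore_file", None)
        self.fairseq_hydra_config = fairseq_hydra_config
        self.checkpoint_dir = "output/checkpoints"  # assumed output layout

    def run(self):
        last_ckpt = os.path.join(self.checkpoint_dir, "checkpoint_last.pt")
        # Only seed checkpoint_last.pt from the given start checkpoint when
        # the job has not produced any checkpoint yet; otherwise fairseq
        # resumes from checkpoint_last.pt as usual (its default restore_file).
        if self.start_checkpoint is not None and not os.path.exists(last_ckpt):
            os.makedirs(self.checkpoint_dir, exist_ok=True)
            shutil.copy(self.start_checkpoint, last_ckpt)
        # ... then launch fairseq-hydra-train without an explicit restore_file ...
```

With this, rerunning the job after a crash or a restarted task picks up `checkpoint_last.pt` from the output directory, while a fresh run starts from the originally given checkpoint, without the job hash depending on which checkpoint training currently resumes from.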