
Fix FairseqHydraPretrainJob for better start_checkpoint behavior #554

Open · 7 commits · base: main

Conversation

AndreasPlt (Contributor)

Currently, when fairseq_hydra_config["checkpoint"]["restore_file"] is given as a parameter for pretraining, resuming the job becomes much more difficult: training always restarts from the given file, even though the model may already have been trained for several more epochs. Changing the restore_file parameter to point at the newer checkpoint would change the job's parameters again, so the job would also not be able to continue training.

I therefore propose to handle the "restore_file" parameter inside the job itself rather than leaving it to fairseq. The idea is to store the given "restore_file" as a job attribute and copy it to "output/checkpoints/checkpoint_last.pt" if the latter does not exist yet, which is also the default restore path inside fairseq. That way, if training has already run for several epochs, the job continues from checkpoint_last.pt, and it only uses the given checkpoint when no previous training has run.
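A minimal sketch of the proposed behavior (not the actual PR diff; the attribute names, the "output/checkpoints" layout, and the config keys are assumptions for illustration):

```python
import os
import shutil


class FairseqHydraPretrainJob:
    def __init__(self, fairseq_hydra_config, **kwargs):
        # Pop restore_file so it is not written into the hydra config that
        # fairseq sees; fairseq then falls back to its default restore path,
        # checkpoints/checkpoint_last.pt.
        checkpoint_cfg = fairseq_hydra_config.get("checkpoint", {})
        self.start_checkpoint = checkpoint_cfg.pop("restore_file", None)
        self.fairseq_hydra_config = fairseq_hydra_config
        self.checkpoint_dir = "output/checkpoints"  # assumed output layout

    def run(self):
        last = os.path.join(self.checkpoint_dir, "checkpoint_last.pt")
        # Seed checkpoint_last.pt from the given start checkpoint only if the
        # job has not produced a checkpoint of its own yet; otherwise the
        # existing checkpoint_last.pt wins and training simply resumes.
        if self.start_checkpoint is not None and not os.path.exists(last):
            os.makedirs(self.checkpoint_dir, exist_ok=True)
            shutil.copy(self.start_checkpoint, last)
        # ... invoke fairseq-hydra-train as before ...
```

With this, rerunning the job after a crash picks up the latest checkpoint automatically, while a fresh run starts from the user-provided checkpoint, without ever changing the job's parameters.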

Review threads on fairseq/training.py (resolved; one outdated).