Fix FairseqHydraPretrainJob for better start_checkpoint behavior #554
Currently, when `fairseq_hydra_config["checkpoint"]["restore_file"]` is given as a parameter for the pretraining, resuming jobs becomes much more difficult: fairseq will always restart from the given file, even though the model might already have been trained for several more epochs. Changing the `restore_file` parameter to the new checkpoint would change the job parameters again, so the job would also not be able to continue training.

I therefore propose to handle the `restore_file` parameter inside the job instead of leaving it to fairseq. The idea is to store the given `restore_file` as a job attribute and later move it to `output/checkpoints/checkpoint_last.pt` if the latter does not exist yet, which is also the default `restore_file` inside fairseq. That way, when training has already run for several epochs, the job can still continue from `checkpoint_last.pt`, and it uses the given checkpoint if training has not run before. A sketch of this idea follows below.
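A minimal sketch of the proposed handling, assuming a Sisyphus-style job with a `run` task. The attribute name `start_checkpoint`, the `output/checkpoints` layout, and the use of a copy (rather than a move, to keep the source checkpoint intact) are illustrative assumptions, not the actual `FairseqHydraPretrainJob` implementation.

```python
import os
import shutil


class FairseqHydraPretrainJobSketch:
    """Illustrative sketch, not the real i6_core job."""

    def __init__(self, fairseq_hydra_config, **kwargs):
        # Take "restore_file" out of the config so it is not passed on to
        # fairseq; remember it as a job attribute instead.
        checkpoint_cfg = fairseq_hydra_config.get("checkpoint", {})
        self.start_checkpoint = checkpoint_cfg.pop("restore_file", None)
        self.fairseq_hydra_config = fairseq_hydra_config
        self.checkpoint_dir = "output/checkpoints"  # assumed output layout

    def run(self):
        last_ckpt = os.path.join(self.checkpoint_dir, "checkpoint_last.pt")
        # Only seed checkpoint_last.pt from the given start checkpoint when
        # the job has not produced any checkpoint yet; otherwise fairseq
        # resumes from checkpoint_last.pt as usual (its default restore_file).
        if self.start_checkpoint is not None and not os.path.exists(last_ckpt):
            os.makedirs(self.checkpoint_dir, exist_ok=True)
            shutil.copy(self.start_checkpoint, last_ckpt)
        # ... then launch fairseq-hydra-train without an explicit restore_file ...
```

With this, rerunning the job after a crash or a restarted task picks up `checkpoint_last.pt` from the output directory, while a fresh run starts from the originally given checkpoint, without the job hash depending on which checkpoint training currently resumes from.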