Trainer sets state.best_model_checkpoint
even when it doesn't save there; leads to training crash
#35609
System Info
transformers version: 4.49.0.dev0

Who can help?
@muellerz
@SunMarc
@seanswyi
Reproduction
Run `pytest tests/test_model_card.py::test_model_card` from setfit (link: https://github.com/huggingface/setfit/blob/main/tests/test_model_card.py#L15). Apologies for not having a convenient ready-to-go transformers-only script; I'm afraid I don't have time for that right now.
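For reference, here is a rough sketch of what a transformers-only reproduction might look like. It is untested; the model name, the in-memory dataset, and the exact combination of arguments needed to trigger the crash are my assumptions, not taken from the failing setfit test. The idea is simply to evaluate every step (so a "best" metric is determined at step 1) while saving far less often.

```python
# Hypothetical transformers-only repro sketch (untested): evaluate every step,
# but save far less often, so a "best" checkpoint can be recorded at a step
# where nothing was written to disk.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "prajjwal1/bert-tiny"  # any tiny classification model should do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tiny in-memory dataset, just enough to take a few optimizer steps.
ds = Dataset.from_dict({"text": ["good", "bad"] * 16, "label": [1, 0] * 16})
ds = ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=16),
    batched=True,
)

args = TrainingArguments(
    output_dir="repro_output",
    eval_strategy="steps",
    eval_steps=1,          # a best metric is already determined at step 1
    save_strategy="steps",
    save_steps=2,          # ...but no checkpoint is written at step 1
    save_total_limit=1,
    max_steps=4,
    per_device_train_batch_size=8,
    report_to=[],
)

trainer = Trainer(model=model, args=args, train_dataset=ds, eval_dataset=ds)
trainer.train()  # expected to crash once the non-existent best_model_checkpoint is consumed
```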
In essence, the flow is as follows:

1. Finetuning with e.g. `eval_steps=1`, `eval_strategy="steps"`.
2. After evaluating, `_determine_best_metric` is called:
   `transformers/src/transformers/trainer.py`, lines 3070 to 3075 in f63829c
3. Without `args.metric_for_best_model` set, we only set the `best_metric` in the first evaluation:
   `transformers/src/transformers/trainer.py`, line 3182 in f63829c
4. At the same time, we also set `best_model_checkpoint`:
   `transformers/src/transformers/trainer.py`, lines 3184 to 3192 in f63829c
5. But if `args.save_strategy != SaveStrategy.BEST`, then it's very possible that we're not saving a checkpoint at that step, so `best_model_checkpoint` points to a directory that was never written. When it is later used here, training crashes:
   `transformers/src/transformers/trainer.py`, lines 2680 to 2685 in f63829c
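To make the mismatch concrete, here is a self-contained toy illustration. This is hypothetical code, not the Trainer internals: it only shows why recording a checkpoint path that was never written breaks any later consumer that assumes the path exists.

```python
# Toy illustration of the failure mode (not Trainer code): the "best" path is
# recorded at an evaluation step, but no directory is created because the save
# strategy didn't fire at that step.
import os
import tempfile

output_dir = tempfile.mkdtemp()
global_step = 1

# Evaluation at step 1 decides this is the best model so far and records the
# directory it *would* have been saved to...
best_model_checkpoint = os.path.join(output_dir, f"checkpoint-{global_step}")

# ...but nothing was actually saved at that step, so the directory is missing.
print(os.path.isdir(best_model_checkpoint))  # False

# Any later code that stats or loads the recorded path blows up:
try:
    os.path.getmtime(best_model_checkpoint)
except FileNotFoundError as exc:
    print(f"crash: {exc}")
```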
Expected behavior
We should not be setting `best_model_checkpoint` unless we're confident that 1) `state.should_save` is True or 2) `args.save_strategy == "best"`. Then we'll avoid this crash.
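As a sketch of what that guard could look like (pseudocode against the `_determine_best_metric` area, not a tested patch; I'm using `self.control.should_save` as the concrete save flag and `self.args.output_dir` / `PREFIX_CHECKPOINT_DIR` for the path, which may not match the exact code at those lines):

```python
# Sketch of the proposed guard, not a tested patch: only record
# best_model_checkpoint when a checkpoint will actually be written.
if is_new_best_metric:
    self.state.best_metric = metric_value

    if self.args.save_strategy == SaveStrategy.BEST:
        self.control.should_save = True

    # Proposed change: gate the checkpoint path on an actual save happening,
    # instead of recording it unconditionally.
    if self.control.should_save or self.args.save_strategy == SaveStrategy.BEST:
        self.state.best_model_checkpoint = os.path.join(
            self.args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
        )
```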