
Trainer sets state.best_model_checkpoint even when it doesn't save there; leads to training crash #35609

tomaarsen opened this issue Jan 10, 2025 · 0 comments · May be fixed by #35885
tomaarsen commented Jan 10, 2025

System Info

  • transformers version: 4.49.0.dev0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.9.16
  • Huggingface_hub version: 0.24.7
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA GeForce RTX 3090

Who can help?

@muellerz
@SunMarc
@seanswyi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pytest tests/test_model_card.py::test_model_card from setfit (link: https://github.com/huggingface/setfit/blob/main/tests/test_model_card.py#L15)

Apologies for not having a convenient, verified transformers-only script; I'm afraid I don't have time for that right now. In essence, the flow is as follows (a rough, untested transformers-only sketch is included after the list):

  1. I start the trainer with frequent evaluations (e.g. eval_steps=1, eval_strategy="steps").
  2. When evaluating, the new _determine_best_metric is called:

     if self.control.should_evaluate:
         metrics = self._evaluate(trial, ignore_keys_for_eval)
         is_new_best_metric = self._determine_best_metric(metrics=metrics, trial=trial)

         if self.args.save_strategy == SaveStrategy.BEST:
             self.control.should_save = is_new_best_metric
  3. With args.metric_for_best_model set, the first evaluation only initializes best_metric (best_model_checkpoint stays None):

     self.state.best_metric = float("-inf") if self.args.greater_is_better else float("inf")
  4. On the 2nd evaluation, we start comparing against the first. If the model is better, we now also set best_model_checkpoint:

     if operator(metric_value, self.state.best_metric):
         run_dir = self._get_output_dir(trial=trial)
         checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
         output_dir = os.path.join(run_dir, checkpoint_folder)
         self.state.best_metric = metric_value
         self.state.best_model_checkpoint = output_dir

         is_new_best_metric = True

     However, we are not guaranteed to save at this step: if args.save_strategy != SaveStrategy.BEST, it is quite possible that no checkpoint is written here.
  5. The eventual crash occurs when "deleting old checkpoints", because no checkpoint directory exists at best_model_checkpoint, so os.path.samefile raises FileNotFoundError:

     # Delete the last checkpoint when save_total_limit=1 if it's different from the best checkpoint and process allowed to save.
     if self.args.should_save and self.state.best_model_checkpoint is not None and self.args.save_total_limit == 1:
         for checkpoint in checkpoints_sorted:
             if not os.path.samefile(checkpoint, self.state.best_model_checkpoint):
                 logger.info(f"Deleting older checkpoint [{checkpoint}] due to args.save_total_limit")
                 shutil.rmtree(checkpoint, ignore_errors=True)
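
For reference, here is a rough, untested transformers-only sketch of a setup that should hit this path. The model, dataset, and step counts are placeholders; the key is that eval_steps and save_steps are misaligned while save_total_limit=1 and metric_for_best_model are set, so best_model_checkpoint can end up pointing at a directory that was never written:

    # Untested sketch, placeholders only: evaluation happens at steps where no
    # checkpoint is saved, so state.best_model_checkpoint may point at a
    # directory that never existed, and checkpoint rotation then crashes.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model_name = "hf-internal-testing/tiny-random-bert"  # placeholder tiny model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = load_dataset("glue", "mrpc")

    def tokenize(batch):
        return tokenizer(
            batch["sentence1"], batch["sentence2"],
            truncation=True, padding="max_length", max_length=64,
        )

    dataset = dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="repro",
        eval_strategy="steps",
        eval_steps=1,                       # evaluate every step ...
        save_strategy="steps",
        save_steps=10,                      # ... but only save every 10 steps
        save_total_limit=1,                 # triggers the checkpoint rotation from step 5
        metric_for_best_model="eval_loss",  # makes _determine_best_metric track a "best" checkpoint
        greater_is_better=False,
        max_steps=20,
        per_device_train_batch_size=8,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"].select(range(160)),
        eval_dataset=dataset["validation"].select(range(32)),
    )
    trainer.train()  # expected to fail in os.path.samefile during checkpoint rotation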

Expected behavior

We should not set best_model_checkpoint unless we are confident that 1) control.should_save is True or 2) args.save_strategy == "best". That would avoid this crash.
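
For illustration, a guard along these lines inside _determine_best_metric might look like the sketch below. This is only an assumption based on the reasoning above, not the actual patch in #35885:

    if operator(metric_value, self.state.best_metric):
        self.state.best_metric = metric_value
        is_new_best_metric = True

        # Only record the checkpoint path if a checkpoint will actually be written
        # at this step (illustrative condition, not the actual fix).
        if self.args.save_strategy == SaveStrategy.BEST or self.control.should_save:
            run_dir = self._get_output_dir(trial=trial)
            checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
            self.state.best_model_checkpoint = os.path.join(run_dir, checkpoint_folder)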

  • Tom Aarsen