
Trainer sets state.best_model_checkpoint even when it doesn't save there; leads to training crash #35609

tomaarsen opened this issue Jan 10, 2025 · 0 comments · May be fixed by #35885
tomaarsen commented Jan 10, 2025

System Info

  • transformers version: 4.49.0.dev0
  • Platform: Windows-10-10.0.22631-SP0
  • Python version: 3.9.16
  • Huggingface_hub version: 0.24.7
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: Yes
  • GPU type: NVIDIA GeForce RTX 3090

Who can help?

@muellerz
@SunMarc
@seanswyi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pytest tests/test_model_card.py::test_model_card from setfit (link: https://github.com/huggingface/setfit/blob/main/tests/test_model_card.py#L15)

Apologies for not having a convenient, verified transformers-only script; I'm afraid I don't have time for that right now. In essence, the flow is as follows (a rough, untested transformers-only sketch is included after the list):

  1. I start the trainer with frequent evaluations (e.g. eval_steps=1, eval_strategy="steps").
  2. When evaluating, the new _determine_best_metric is called:

     if self.control.should_evaluate:
         metrics = self._evaluate(trial, ignore_keys_for_eval)
         is_new_best_metric = self._determine_best_metric(metrics=metrics, trial=trial)

         if self.args.save_strategy == SaveStrategy.BEST:
             self.control.should_save = is_new_best_metric
  3. With args.metric_for_best_model set, the first evaluation only initializes best_metric (best_model_checkpoint stays None):

     self.state.best_metric = float("-inf") if self.args.greater_is_better else float("inf")
  4. On the 2nd evaluation, we start comparing against the first. If the model is better, we now also set best_model_checkpoint:

     if operator(metric_value, self.state.best_metric):
         run_dir = self._get_output_dir(trial=trial)
         checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
         output_dir = os.path.join(run_dir, checkpoint_folder)
         self.state.best_metric = metric_value
         self.state.best_model_checkpoint = output_dir

         is_new_best_metric = True

     However, we are not guaranteed to save at this step: if args.save_strategy != SaveStrategy.BEST, it is quite possible that no checkpoint is written here.
  5. The eventual crash occurs when "deleting old checkpoints", because no checkpoint directory exists at best_model_checkpoint, so os.path.samefile raises FileNotFoundError:

     # Delete the last checkpoint when save_total_limit=1 if it's different from the best checkpoint and process allowed to save.
     if self.args.should_save and self.state.best_model_checkpoint is not None and self.args.save_total_limit == 1:
         for checkpoint in checkpoints_sorted:
             if not os.path.samefile(checkpoint, self.state.best_model_checkpoint):
                 logger.info(f"Deleting older checkpoint [{checkpoint}] due to args.save_total_limit")
                 shutil.rmtree(checkpoint, ignore_errors=True)
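
For reference, here is a rough, untested transformers-only sketch of a setup that should hit this path. The model, dataset, and step counts are placeholders; the key is that eval_steps and save_steps are misaligned while save_total_limit=1 and metric_for_best_model are set, so best_model_checkpoint can end up pointing at a directory that was never written:

    # Untested sketch, placeholders only: evaluation happens at steps where no
    # checkpoint is saved, so state.best_model_checkpoint may point at a
    # directory that never existed, and checkpoint rotation then crashes.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    model_name = "hf-internal-testing/tiny-random-bert"  # placeholder tiny model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    dataset = load_dataset("glue", "mrpc")

    def tokenize(batch):
        return tokenizer(
            batch["sentence1"], batch["sentence2"],
            truncation=True, padding="max_length", max_length=64,
        )

    dataset = dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir="repro",
        eval_strategy="steps",
        eval_steps=1,                       # evaluate every step ...
        save_strategy="steps",
        save_steps=10,                      # ... but only save every 10 steps
        save_total_limit=1,                 # triggers the checkpoint rotation from step 5
        metric_for_best_model="eval_loss",  # makes _determine_best_metric track a "best" checkpoint
        greater_is_better=False,
        max_steps=20,
        per_device_train_batch_size=8,
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"].select(range(160)),
        eval_dataset=dataset["validation"].select(range(32)),
    )
    trainer.train()  # expected to fail in os.path.samefile during checkpoint rotation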

Expected behavior

We should not set best_model_checkpoint unless we are confident that 1) control.should_save is True or 2) args.save_strategy == "best". That would avoid this crash.
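
For illustration, a guard along these lines inside _determine_best_metric might look like the sketch below. This is only an assumption based on the reasoning above, not the actual patch in #35885:

    if operator(metric_value, self.state.best_metric):
        self.state.best_metric = metric_value
        is_new_best_metric = True

        # Only record the checkpoint path if a checkpoint will actually be written
        # at this step (illustrative condition, not the actual fix).
        if self.args.save_strategy == SaveStrategy.BEST or self.control.should_save:
            run_dir = self._get_output_dir(trial=trial)
            checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
            self.state.best_model_checkpoint = os.path.join(run_dir, checkpoint_folder)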

  • Tom Aarsen