
Setting Dropout in MPT Prefix-LM after Exporting to HuggingFace Crashes during Fine-tuning #1046

Open
timsteuer opened this issue Mar 21, 2024 · 2 comments
Labels: bug

@timsteuer

Context

We have an MPT MoD prefix-LM trained with llm-foundry and then exported to HuggingFace (via your scripts).
For some fine-tuning experiments with the HF model, I tried to set up dropout.

However, as soon as I do, the model crashes during training.
Training works fine without setting the dropout parameter.

To reproduce

  1. Set up dropout in the exported model (is there an easier way to do this?):

    from transformers import AutoConfig, AutoModelForCausalLM

    if "attn_dropout" in hyper_conf:
        # load the exported config and override attn_config['attn_pdrop']
        model_config = AutoConfig.from_pretrained(args.model_name_or_path, trust_remote_code=True)
        if "attn_config" in model_config.to_dict() and "attn_pdrop" in model_config.to_dict()['attn_config']:
            update_dict = model_config.to_dict()['attn_config']
            update_dict['attn_pdrop'] = hyper_conf['attn_dropout']
            model_config.update({'attn_config': update_dict})
            print("### Updated attn_dropout to %s ###" % hyper_conf['attn_dropout'])

            # instantiate the model from the updated config
            model = AutoModelForCausalLM.from_config(model_config, trust_remote_code=True)
  2. Try to fine-tune the model via the HuggingFace Trainer (a minimal sketch of this step follows below).
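
For reference, here is a minimal sketch of step 2. The variable names (model, tokenizer, train_ds, data_collator) are placeholders assumed to be set up as in the original training script; this is not the actual train.py code:

    from transformers import Trainer, TrainingArguments

    training_args = TrainingArguments(
        output_dir="./out",                # placeholder path
        per_device_train_batch_size=8,
        num_train_epochs=1,
    )
    trainer = Trainer(
        model=model,                       # model built from the updated config above
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=train_ds,
        data_collator=data_collator,
    )
    trainer.train()                        # crashes in backward() once attn_pdrop > 0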

Exception

 ╭───────────────────── Traceback (most recent call last) ──────────────────────╮
 │ /opt/ml/code/entry_point.py:17 in <module>                                   │
 │                                                                              │
 │   14 │   print("Parent directory: %s" % os.path.abspath(".."))               │
 │   15 │   print(os.listdir(".."))                                             │
 │   16 │                                                                       │
 │ ❱ 17 │   main()                                                              │
 │   18                                                                         │
 │ /opt/ml/code/ds_text2text_sci_bench/train.py:293 in main                     │
 │   290 │   │   │   val_ds = ds['validation']                                  │
 │   291 │   │   else:                                                          │
 │   292 │   │   │   val_ds = None                                              │
 │ ❱ 293 │   │   model = train_causal_model(args, hyper_conf, tokenizer, train_ │
 │   294 │   │                                                                  │
 │   295 │   │   model.eval()                                                   │
 │   296                                                                        │
 │ /opt/ml/code/ds_text2text_sci_bench/train.py:380 in train_causal_model       │
 │   377 │   )                                                                  │
 │   378 │   trainer = Trainer(model=model, tokenizer=tokenizer, args=training_ │
 │   379 │   │   │   │   │     data_collator=data_collator)                     │
 │ ❱ 380 │   trainer.train()                                                    │
 │   381 │                                                                      │
 │   382 │   trainer.save_model(args.output_dir)                                │
 │   383                                                                        │
 │ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1555 in      │
 │ train                                                                        │
 │   1552 │   │   │   finally:                                                  │
 │   1553 │   │   │   │   hf_hub_utils.enable_progress_bars()                   │
 │   1554 │   │   else:                                                         │
 │ ❱ 1555 │   │   │   return inner_training_loop(                               │
 │   1556 │   │   │   │   args=args,                                            │
 │   1557 │   │   │   │   resume_from_checkpoint=resume_from_checkpoint,        │
 │   1558 │   │   │   │   trial=trial,                                          │
 │ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1860 in      │
 │ _inner_training_loop                                                         │
 │   1857 │   │   │   │   │   self.control = self.callback_handler.on_step_begi │
 │   1858 │   │   │   │                                                         │
 │   1859 │   │   │   │   with self.accelerator.accumulate(model):              │
 │ ❱ 1860 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)  │
 │   1861 │   │   │   │                                                         │
 │   1862 │   │   │   │   if (                                                  │
 │   1863 │   │   │   │   │   args.logging_nan_inf_filter                       │
 │ /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2734 in      │
 │ training_step                                                                │
 │   2731 │   │   │   with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
 │   2732 │   │   │   │   scaled_loss.backward()                                │
 │   2733 │   │   else:                                                         │
 │ ❱ 2734 │   │   │   self.accelerator.backward(loss)                           │
 │   2735 │   │                                                                 │
 │   2736 │   │   return loss.detach() / self.args.gradient_accumulation_steps  │
 │   2737                                                                       │
 │ /opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py:1821 in    │
 │ backward                                                                     │
 │   1818 │   │   elif self.scaler is not None:                                 │
 │   1819 │   │   │   self.scaler.scale(loss).backward(**kwargs)                │
 │   1820 │   │   else:                                                         │
 │ ❱ 1821 │   │   │   loss.backward(**kwargs)                                   │
 │   1822 │                                                                     │
 │   1823 │   def unscale_gradients(self, optimizer=None):                      │
 │   1824 │   │   """                                                           │
 │ /opt/conda/lib/python3.10/site-packages/torch/_tensor.py:487 in backward     │
 │    484 │   │   │   │   create_graph=create_graph,                            │
 │    485 │   │   │   │   inputs=inputs,                                        │
 │    486 │   │   │   )                                                         │
 │ ❱  487 │   │   torch.autograd.backward(                                      │
 │    488 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs │
 │    489 │   │   )                                                             │
 │    490                                                                       │
 │ /opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py:200 in    │
 │   197 │   # The reason we repeat same the comment below is that              │
 │   198 │   # some Python versions print out the first line of a multi-line fu │
 │   199 │   # calls in the traceback and some print out the last line          │
 │ ❱ 200 │   Variable._execution_engine.run_backward(  # Calls into the C++ eng │
 │   201 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,    │
 │   202 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into th │
 │   203                                                                        │
 ╰──────────────────────────────────────────────────────────────────────────────╯
 RuntimeError: one of the variables needed for gradient computation has been
 modified by an inplace operation: [torch.cuda.FloatTensor [8, 16, 512, 512]],
 which is output 0 of SoftmaxBackward0, is at version 1; expected version 0
 instead. Hint: enable anomaly detection to find the operation that failed to
 compute its gradient, with torch.autograd.set_detect_anomaly(True).

To narrow it down, here is the output with anomaly detection enabled:

/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in SoftmaxBackward0. Traceback of forward call that caused the error:
  File "/opt/ml/code/entry_point.py", line 17, in <module>
    main()
  File "/opt/ml/code/ds_text2text_sci_bench/train.py", line 295, in main
    model = train_causal_model(args, hyper_conf, tokenizer, train_ds, val_ds)
  File "/opt/ml/code/ds_text2text_sci_bench/train.py", line 384, in train_causal_model
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/opt/conda/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/20240201-lr1_0e2-warm-sched-biomed-350M-var_seq1024ctx_15B/modeling_mpt.py", line 317, in forward
    outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/20240201-lr1_0e2-warm-sched-biomed-350M-var_seq1024ctx_15B/modeling_mpt.py", line 236, in forward
    (x, attn_weights, present) = block(x, past_key_value=past_key_value, attn_bias=attn_bias, rotary_emb_w_meta_info=rotary_emb_w_meta_info, attention_mask=attention_mask, is_causal=self.is_causal, output_attentions=bool(output_attentions))
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/20240201-lr1_0e2-warm-sched-biomed-350M-var_seq1024ctx_15B/blocks.py", line 35, in forward
    (b, attn_weights, past_key_value) = self.attn(a, past_key_value=past_key_value, attn_bias=attn_bias, rotary_emb_w_meta_info=rotary_emb_w_meta_info, attention_mask=attention_mask, is_causal=is_causal, needs_weights=output_attentions)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/20240201-lr1_0e2-warm-sched-biomed-350M-var_seq1024ctx_15B/attention.py", line 300, in forward
    (context, attn_weights, past_key_value) = self.attn_fn(query, key, value, self.n_heads, self.kv_n_heads, past_key_value=past_key_value, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights)
  File "/root/.cache/huggingface/modules/transformers_modules/20240201-lr1_0e2-warm-sched-biomed-350M-var_seq1024ctx_15B/attention.py", line 96, in scaled_multihead_dot_product_attention
    attn_weight = torch.softmax(attn_weight, dim=-1)
 (Triggered internally at /opt/conda/conda-bld/pytorch_1679586020379/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/opt/ml/code/entry_point.py", line 17, in <module>
    main()
  File "/opt/ml/code/ds_text2text_sci_bench/train.py", line 295, in main
    model = train_causal_model(args, hyper_conf, tokenizer, train_ds, val_ds)
  File "/opt/ml/code/ds_text2text_sci_bench/train.py", line 384, in train_causal_model
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2734, in training_step
    self.accelerator.backward(loss)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1821, in backward
    loss.backward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [8, 16, 512, 512]], which is output 0 of SoftmaxBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
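
For completeness, anomaly detection was enabled as the hint suggests. A hypothetical one-liner placed before trainer.train(), not the exact code from train.py:

    import torch

    # assumption: called once before trainer.train() so that autograd records
    # the forward stack of the operation that later fails during backward
    torch.autograd.set_detect_anomaly(True)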

Expected behavior

Fine-tuning with dropout works.

Versions

  • exported likely with llm-foundry v0.4.0; is there a quick way to check this after exporting?
  • transformers==4.35.2
timsteuer added the bug label Mar 21, 2024
@vchiley (Contributor) commented Mar 21, 2024

If you are able to modify the code, could you try setting inplace here to False?
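
For context, this matches the failure mode where an inplace dropout overwrites the tensor that SoftmaxBackward0 saved for the backward pass (note the anomaly trace points at the softmax in scaled_multihead_dot_product_attention). A minimal plain-PyTorch sketch of the issue, not the actual llm-foundry attention code:

    import torch
    import torch.nn.functional as F

    q = torch.randn(2, 4, requires_grad=True)

    # SoftmaxBackward0 saves the softmax *output* for the backward pass;
    # inplace dropout overwrites that saved tensor and bumps its version counter.
    attn = torch.softmax(q, dim=-1)
    out = F.dropout(attn, p=0.5, training=True, inplace=True)
    try:
        out.sum().backward()
    except RuntimeError as e:
        print(e)  # "... modified by an inplace operation ... SoftmaxBackward0 ..."

    # With inplace=False, dropout returns a new tensor and backward succeeds.
    attn = torch.softmax(q, dim=-1)
    out = F.dropout(attn, p=0.5, training=True, inplace=False)
    out.sum().backward()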

@timsteuer (Author)

Thank you, that seems to work. The model now trains with the dropout parameter set.
