
Fine-tuning a very small 0.5B Qwen2.5 model with PiSSA on 2× A800 (80 GB each, ~120 GB available) strangely hits an OOM error #2559

chuangzhidan opened this issue Jan 10, 2025 · 1 comment
Labels
🐛 bug Something isn't working 🏋 SFT Related to SFT

Comments

chuangzhidan commented Jan 10, 2025

System Info

Copy-paste the following information when reporting an issue:

  • Platform: Linux-5.15.0-130-generic-x86_64-with-glibc2.35
  • Python version: 3.12.4
  • PyTorch version: 2.4.0
    - CUDA device(s): NVIDIA A800 80GB PCIe, NVIDIA A800 80GB PCIe
  • Transformers version: 4.46.0
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.27.1
  • TRL version: 0.13.0
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.16.2
  • Diffusers version: not installed
  • Liger-Kernel version: 0.5.2
  • LLM-Blender version: not installed
  • OpenAI version: 1.43.1
  • PEFT version: 0.13.2

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset, Dataset
import json



model = AutoModelForCausalLM.from_pretrained(
    "/media/data/llm/Qwen2.5-0.5B-Instruct", 
    torch_dtype=torch.bfloat16, 
    # device_map={"":"cuda:1"}
    # device_map={"0": "cuda:1"} 
)#.to(device)

tokenizer = AutoTokenizer.from_pretrained("/media/data/llm/Qwen2.5-0.5B-Instruct")
tokenizer.pad_token_id = tokenizer.eos_token_id

lora_config = LoraConfig(
    # init_lora_weights="pissa",  
    init_lora_weights="pissa_niter_4",  # Initialize the PiSSA with fast SVD, which completes in just a few seconds.
)
peft_model = get_peft_model(model, lora_config)#.to(device)
peft_model.print_trainable_parameters()


# Load your local JSON file
train_file_path = "/media/data/xgp/scripts/output_format2.json"

dataset = load_dataset("json", data_files=train_file_path, split="train")


training_args = SFTConfig(
    # dataset_text_field="input",  # input_format2
    max_seq_length=400,
    output_dir="/media/data/llm/0.5b-ft-bi",
    # use_liger=True,
    packing=False, #True
    # model_init_kwargs={"torch_dtype": "bfloat16",},
    # peft_config=peft_config
)
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    # dataset_text_field="input", #input_format2
    # max_seq_length=8192,
    tokenizer=tokenizer,
    # label_col_name="output" 
    args=training_args,
)

print("-------------start training-------------")
trainer.train()
peft_model.save_pretrained("pissa-Qwen2.5-0.5B-Instruct")
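
A hedged diagnostic, separate from the run captured below: printing a couple of standard TrainingArguments properties right before trainer.train() shows how the Trainer will parallelize and which dataloader batch size it will actually use.

print("n_gpu:", trainer.args.n_gpu)                         # 2 here -> Trainer wraps the model in nn.DataParallel
print("dataloader batch:", trainer.args.train_batch_size)   # per_device_train_batch_size * max(1, n_gpu)
print("model device:", next(peft_model.parameters()).device)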

Output:

(base) ubuntu@localhost:/media/data/xgp/scripts$ python psa_bi.py
trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
/media/data/xgp/scripts/psa_bi.py:69: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
  trainer = SFTTrainer(
[2025-01-10 17:28:52,265] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-------------start training-------------
  0%|                                                                                                                                                                                              | 0/78 [00:00<?, ?it/s]/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
 78%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                       | 61/78 [01:50<00:12,  1.37it/s]Traceback (most recent call last):
  File "/media/data/xgp/scripts/psa_bi.py", line 80, in <module>
    trainer.train()
  File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2122, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3572, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3625, in compute_loss
    outputs = model(**inputs)
              ^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 186, in forward
    outputs = self.parallel_apply(replicas, inputs, module_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 201, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
torch.OutOfMemoryError: Caught OutOfMemoryError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
    output = module(*input, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/peft/peft_model.py", line 812, in forward
    return self.get_base_model()(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
    loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 46, in ForCausalLMLoss
    loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 26, in fixed_cross_entropy
    loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/functional.py", line 3104, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.31 GiB. GPU 1 has a total capacity of 79.25 GiB of which 1.95 GiB is free. Process 1522057 has 0 bytes memory in use. Process 1522056 has 0 bytes memory in use. Process 1615472 has 3.83 GiB memory in use. Including non-PyTorch memory, this process has 70.24 GiB memory in use. Of the allocated memory 67.70 GiB is allocated by PyTorch, and 1.91 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    ...
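
For reference, the traceback shows the Trainer replicating the model with torch.nn.DataParallel (the failure is in replica 1 on device 1), and the allocation that actually fails happens inside nn.functional.cross_entropy over the full vocabulary. The sketch below only collects the usual, documented PyTorch/Transformers/TRL knobs; it is an untested illustration, not a confirmed fix for this report.

import os

# Hedged sketch, not a verified fix for this issue.
# Keep a single GPU visible so the Trainer does not fall back to
# torch.nn.DataParallel (must be set before torch initializes CUDA), and
# follow the allocator hint from the error message above.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from trl import SFTConfig

training_args = SFTConfig(
    output_dir="/media/data/llm/0.5b-ft-bi",
    max_seq_length=400,
    # Shrink the per-step logits/loss buffers; gradient accumulation keeps the
    # effective batch at the default of 8.
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    bf16=True,
)

If both GPUs are really needed, launching with accelerate launch or torchrun (so DDP is used instead of DataParallel) is the more common multi-GPU setup.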

Expected behavior

This should not happen; it is a very small model, after all, and I used PiSSA, which is similar to LoRA. Once max_seq_length goes above 400 (e.g. 512), you get OOM, even though I have trained a 7B Qwen2.5 model with max_seq_length 8192, full fine-tuning, and DeepSpeed.
As max_seq_length decreases, training gets further (e.g. 56%, 62%, 72%, 78%, 100%) before it reports the OOM error and stops. I really do not get why max_seq_length should have anything to do with how far training progresses; it is the first time I have ever seen this.
Secondly, I do not know why 403 examples get you 78 steps, when on 2 GPUs you see 153 steps.
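
For what it's worth, a hedged back-of-the-envelope reading of the step counts, assuming the Transformers defaults num_train_epochs=3 and per_device_train_batch_size=8, and that the Trainer's nn.DataParallel fallback multiplies the dataloader batch by the number of visible GPUs:

import math

num_examples = 403     # size of the JSON training set
per_device_batch = 8   # TrainingArguments default
num_epochs = 3         # TrainingArguments default

for n_gpu in (1, 2):
    # With nn.DataParallel the dataloader batch is per_device_batch * n_gpu,
    # so more visible GPUs means fewer optimizer steps.
    steps = math.ceil(num_examples / (per_device_batch * n_gpu)) * num_epochs
    print(f"{n_gpu} visible GPU(s): {steps} total steps")

# 1 visible GPU(s): 153 total steps
# 2 visible GPU(s): 78 total steps

This matches both the 0/78 in the log above and the 153 seen in other runs, but it is only a reading of the defaults, not a maintainer-confirmed explanation.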

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
@chuangzhidan chuangzhidan changed the title finetune a very small 0.5B qwen2.5 model with method of pissa on 2 *A800 (80G each, available 120G ) strangely met with OOM error finetune a very small 0.5B qwen2.5 model with method of pissa on 2 *A800 (80G each, 120G available ) strangely met with OOM error Jan 11, 2025
@August-murr August-murr added 🐛 bug Something isn't working 🏋 SFT Related to SFT labels Jan 11, 2025
@Allenwutao

@chuangzhidan have you solved this issue? I ran into the same problem. cc @August-murr
