An officially supported task in the examples folder
My own task or dataset (give details below)
Reproduction
import torch
import json
from datasets import load_dataset, Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained(
    "/media/data/llm/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
    # device_map={"": "cuda:1"}
    # device_map={"0": "cuda:1"}
)  # .to(device)
tokenizer = AutoTokenizer.from_pretrained("/media/data/llm/Qwen2.5-0.5B-Instruct")
tokenizer.pad_token_id = tokenizer.eos_token_id

lora_config = LoraConfig(
    # init_lora_weights="pissa",
    init_lora_weights="pissa_niter_4",  # initialize PiSSA with fast SVD, which completes in just a few seconds
)
peft_model = get_peft_model(model, lora_config)  # .to(device)
peft_model.print_trainable_parameters()

# Load your local JSON file
train_file_path = "/media/data/xgp/scripts/output_format2.json"
dataset = load_dataset("json", data_files=train_file_path, split="train")

training_args = SFTConfig(
    # dataset_text_field="input",  # input_format2
    max_seq_length=400,
    output_dir="/media/data/llm/0.5b-ft-bi",
    # use_liger=True,
    packing=False,  # True
    # model_init_kwargs={"torch_dtype": "bfloat16"},
    # peft_config=peft_config
)
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    # dataset_text_field="input",  # input_format2
    # max_seq_length=8192,
    tokenizer=tokenizer,
    # label_col_name="output"
    args=training_args,
)
print("-------------start training-------------")
trainer.train()
peft_model.save_pretrained("pissa-Qwen2.5-0.5B-Instruct")
outputs:
(base) ubuntu@localhost:/media/data/xgp/scripts$ python psa_bi.py
trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
/media/data/xgp/scripts/psa_bi.py:69: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
trainer = SFTTrainer(
[2025-01-10 17:28:52,265] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-------------start training-------------
0%| | 0/78 [00:00<?, ?it/s]/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
78%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 61/78 [01:50<00:12, 1.37it/s]Traceback (most recent call last):
File "/media/data/xgp/scripts/psa_bi.py", line 80, in <module>
trainer.train()
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2122, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3572, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3625, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 186, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 201, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
output.reraise()
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
raise exception
torch.OutOfMemoryError: Caught OutOfMemoryError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
output = module(*input, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/peft/peft_model.py", line 812, in forward
return self.get_base_model()(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 46, in ForCausalLMLoss
loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 26, in fixed_cross_entropy
loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/functional.py", line 3104, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.31 GiB. GPU 1 has a total capacity of 79.25 GiB of which 1.95 GiB is free. Process 1522057 has 0 bytes memory in use. Process 1522056 has 0 bytes memory in use. Process 1615472 has 3.83 GiB memory in use. Including non-PyTorch memory, this process has 70.24 GiB memory in use. Of the allocated memory 67.70 GiB is allocated by PyTorch, and 1.91 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
Expected behavior
This should not happen: it is a very small model, and I used PiSSA, which is similar to LoRA. Once max_seq_length goes above 400 (for example 512), you get an OOM error, even though I have previously trained a 7B Qwen2.5 model with max_seq_length 8192 using full fine-tuning and DeepSpeed.
As max_seq_length decreases, training gets further before failing: the progress bar reaches 56%, 62%, 72%, 78%, or 100% before the OOM error is reported and training stops. I really do not understand why max_seq_length should have anything to do with how far training progresses; it is the first time I have ever seen this.
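For reference, a rough back-of-the-envelope for the cross-entropy/logits buffer the traceback points at. The numbers below are assumptions, not values read from the run: per-replica batch size 8 (the transformers default), a Qwen2.5 vocabulary of roughly 151,936 tokens, and logits upcast to float32 for the loss:

# Rough, hypothetical estimate of the logits buffer materialized per DataParallel replica.
# All values are assumptions; none are read from the actual run above.
vocab_size = 151_936       # approximate Qwen2.5 vocabulary size
per_replica_batch = 8      # transformers default per_device_train_batch_size
bytes_per_element = 4      # assuming logits are upcast to float32 for the loss

for seq_len in (400, 512):
    logits_bytes = per_replica_batch * seq_len * vocab_size * bytes_per_element
    print(f"seq_len={seq_len}: ~{logits_bytes / 2**30:.2f} GiB per replica")
# seq_len=400: ~1.81 GiB per replica
# seq_len=512: ~2.32 GiB per replica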
Secondly, I do not understand the step count: why do 403 examples give 78 steps, while on 2 GPUs you see 153 steps?
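A minimal sketch of the arithmetic that could produce these step counts, assuming the transformers defaults of per_device_train_batch_size=8 and num_train_epochs=3 (neither is set explicitly in the script above) and that DataParallel multiplies the effective batch size by the number of visible GPUs:

import math

# Assumed defaults; the script above does not override them.
num_examples = 403
per_device_batch = 8
num_epochs = 3

for num_gpus in (1, 2):
    effective_batch = per_device_batch * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    print(f"{num_gpus} GPU(s): {steps_per_epoch * num_epochs} total steps")
# 1 GPU(s): 153 total steps
# 2 GPU(s): 78 total steps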
Checklist
I have checked that my issue isn't already filed (see open issues)
I have included my system information
Any code provided is minimal, complete, and reproducible (more on MREs)
Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
Any traceback provided is complete
System Info
Copy-paste the following information when reporting an issue:
- CUDA device(s): NVIDIA A800 80GB PCIe, NVIDIA A800 80GB PCIe