An officially supported task in the examples folder
My own task or dataset (give details below)
Reproduction
import torch
import json
from datasets import load_dataset, Dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTTrainer, SFTConfig

model = AutoModelForCausalLM.from_pretrained(
    "/media/data/llm/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
    # device_map={"": "cuda:1"}
    # device_map={"0": "cuda:1"}
)  # .to(device)
tokenizer = AutoTokenizer.from_pretrained("/media/data/llm/Qwen2.5-0.5B-Instruct")
tokenizer.pad_token_id = tokenizer.eos_token_id

lora_config = LoraConfig(
    # init_lora_weights="pissa",
    init_lora_weights="pissa_niter_4",  # initialize PiSSA with fast SVD, which completes in just a few seconds
)
peft_model = get_peft_model(model, lora_config)  # .to(device)
peft_model.print_trainable_parameters()

# Load your local JSON file
train_file_path = "/media/data/xgp/scripts/output_format2.json"
dataset = load_dataset("json", data_files=train_file_path, split="train")

training_args = SFTConfig(
    # dataset_text_field="input",  # input_format2
    max_seq_length=400,
    output_dir="/media/data/llm/0.5b-ft-bi",
    # use_liger=True,
    packing=False,  # True
    # model_init_kwargs={"torch_dtype": "bfloat16"},
    # peft_config=peft_config
)
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=dataset,
    # dataset_text_field="input",  # input_format2
    # max_seq_length=8192,
    tokenizer=tokenizer,
    # label_col_name="output"
    args=training_args,
)
print("-------------start training-------------")
trainer.train()
peft_model.save_pretrained("pissa-Qwen2.5-0.5B-Instruct")
outputs:
(base) ubuntu@localhost:/media/data/xgp/scripts$ python psa_bi.py
trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
/media/data/xgp/scripts/psa_bi.py:69: FutureWarning: `tokenizer` is deprecated and removed starting from version 0.16.0 for `SFTTrainer.__init__`. Use `processing_class` instead.
trainer = SFTTrainer(
[2025-01-10 17:28:52,265] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
-------------start training-------------
0%| | 0/78 [00:00<?, ?it/s]/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):
/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
78%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 61/78 [01:50<00:12, 1.37it/s]Traceback (most recent call last):
File "/media/data/xgp/scripts/psa_bi.py", line 80, in <module>
trainer.train()
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2122, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 2474, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3572, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/trainer.py", line 3625, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 186, in forward
outputs = self.parallel_apply(replicas, inputs, module_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/data_parallel.py", line 201, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 108, in parallel_apply
output.reraise()
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
raise exception
torch.OutOfMemoryError: Caught OutOfMemoryError in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
output = module(*input, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/peft/peft_model.py", line 812, in forward
return self.get_base_model()(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 1183, in forward
loss = self.loss_function(logits, labels, self.vocab_size, **loss_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 46, in ForCausalLMLoss
loss = fixed_cross_entropy(shift_logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/transformers/loss/loss_utils.py", line 26, in fixed_cross_entropy
loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/torch/nn/functional.py", line 3104, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.31 GiB. GPU 1 has a total capacity of 79.25 GiB of which 1.95 GiB is free. Process 1522057 has 0 bytes memory in use. Process 1522056 has 0 bytes memory in use. Process 1615472 has 3.83 GiB memory in use. Including non-PyTorch memory, this process has 70.24 GiB memory in use. Of the allocated memory 67.70 GiB is allocated by PyTorch, and 1.91 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
Expected behavior
This should not happen: it is a very small model, and I used PiSSA, which is similar to LoRA. Once max_seq_length goes above 400 (for example 512), you get an OOM error, even though I have previously trained a 7B Qwen2.5 model with max_seq_length 8192 using full fine-tuning and DeepSpeed.
As max_seq_length decreases, training gets further before failing: the progress bar reaches 56%, 62%, 72%, 78%, or 100% before the OOM error is reported and training stops. I really do not understand why max_seq_length should have anything to do with how far training progresses; it is the first time I have ever seen this.
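For reference, a rough back-of-the-envelope for the cross-entropy/logits buffer the traceback points at. The numbers below are assumptions, not values read from the run: per-replica batch size 8 (the transformers default), a Qwen2.5 vocabulary of roughly 151,936 tokens, and logits upcast to float32 for the loss:

# Rough, hypothetical estimate of the logits buffer materialized per DataParallel replica.
# All values are assumptions; none are read from the actual run above.
vocab_size = 151_936       # approximate Qwen2.5 vocabulary size
per_replica_batch = 8      # transformers default per_device_train_batch_size
bytes_per_element = 4      # assuming logits are upcast to float32 for the loss

for seq_len in (400, 512):
    logits_bytes = per_replica_batch * seq_len * vocab_size * bytes_per_element
    print(f"seq_len={seq_len}: ~{logits_bytes / 2**30:.2f} GiB per replica")
# seq_len=400: ~1.81 GiB per replica
# seq_len=512: ~2.32 GiB per replica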
Secondly, I do not understand the step count: why do 403 examples give 78 steps, while on 2 GPUs you see 153 steps?
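A minimal sketch of the arithmetic that could produce these step counts, assuming the transformers defaults of per_device_train_batch_size=8 and num_train_epochs=3 (neither is set explicitly in the script above) and that DataParallel multiplies the effective batch size by the number of visible GPUs:

import math

# Assumed defaults; the script above does not override them.
num_examples = 403
per_device_batch = 8
num_epochs = 3

for num_gpus in (1, 2):
    effective_batch = per_device_batch * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    print(f"{num_gpus} GPU(s): {steps_per_epoch * num_epochs} total steps")
# 1 GPU(s): 153 total steps
# 2 GPU(s): 78 total steps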
Checklist
I have checked that my issue isn't already filed (see open issues)
I have included my system information
Any code provided is minimal, complete, and reproducible (more on MREs)
Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
Any traceback provided is complete
System Info
Copy-paste the following information when reporting an issue:
- CUDA device(s): NVIDIA A800 80GB PCIe, NVIDIA A800 80GB PCIe