Replies: 1 comment
@HarideP There are a few changes you can make to prevent the cache overflow. You could readjust your memory allocation settings. One option is to enable expandable segments, which avoids fragmentation, by setting `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`. Alternatively, you could experiment with the `max_split_size_mb` option to control the size of memory allocations and reduce fragmentation: `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`.
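As a minimal sketch of the same idea from the Python side: the variable has to be set before the first CUDA allocation, because the caching allocator only reads `PYTORCH_CUDA_ALLOC_CONF` when it initializes. The tensor and sizes below are hypothetical, just to illustrate inspecting allocator state.

```python
import os

# Must be set before the first CUDA allocation; once the caching
# allocator has initialized, changing this has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
# Alternative: cap the allocator's split block size to reduce fragmentation.
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Hypothetical workload, just to have something on the GPU.
x = torch.randn(4096, 4096, device="cuda")

# Memory held by live tensors vs. memory reserved by the caching
# allocator; a large gap between the two usually means fragmentation.
print(f"allocated: {torch.cuda.memory_allocated() / 1e6:.1f} MB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e6:.1f} MB")

# Returns cached-but-unused blocks to the driver (does not free live
# tensors, but can help other processes sharing the GPU).
torch.cuda.empty_cache()
```

The reason `expandable_segments` tends to help is that it lets the allocator grow existing memory segments instead of requesting new fixed-size blocks, so workloads with varying tensor shapes fragment less.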
CUDA out of memory
When I reproduce this code, a GPU cache overflow always occurs. How do I fix this problem?
sft config_full.yaml:
accelerate_config zero3.yaml: