I am fine-tuning my model with LoRA and FSDP (ShardingStrategy.FULL_SHARD) on two 40G A100 GPUs and one 80G GPU. I launch training with CUDA_VISIBLE_DEVICES=5,3,4 torchrun --standalone --nnodes=1 --nproc-per-node=3 finetuning.py, but I still get OOM errors on the two 40G A100s. Watching GPU memory, I can see that every GPU loads the full model weights while FullyShardedDataParallel initializes the model. I am confused by this behavior and do not know how to fix it.
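For context, the wrapping call that triggers the OOM (finetuning.py line 281) looks roughly like the sketch below. This is a minimal illustration, not my exact code; the sync_module_states and device_id arguments and the placeholder module are shown only to make the setup reproducible under torchrun.

```python
# Minimal sketch of FULL_SHARD FSDP wrapping (illustrative; the real script
# wraps a LoRA-augmented model instead of the toy Linear used here).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)

def wrap_model(model: torch.nn.Module) -> FSDP:
    """Wrap an already-built model with fully sharded data parallelism."""
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=torch.cuda.current_device(),
        sync_module_states=True,  # broadcast rank-0 weights after sharding
    )

if __name__ == "__main__":
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Placeholder module standing in for the real LoRA model.
    toy = torch.nn.Linear(1024, 1024)
    sharded = wrap_model(toy)

    dist.destroy_process_group()
```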
Bug logs
[rank2]: Traceback (most recent call last):
[rank2]: File "/data0/home/ening/NICA/cogmllm/src/cogmllm/tools/finetuning.py", line 438, in <module>
[rank2]: fire.Fire(main)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank2]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank2]: component, remaining_args = _CallAndUpdateTrace(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]: component = fn(*varargs, **kwargs)
[rank2]: File "/data0/home/ening/NICA/cogmllm/src/cogmllm/tools/finetuning.py", line 281, in main
[rank2]: model = FSDP(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank2]: _init_param_handle_from_module(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 636, in _init_param_handle_from_module
[rank2]: _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 648, in _init_param_handle_from_params
[rank2]: handle = FlatParamHandle(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 584, in __init__
[rank2]: self._init_flat_param_and_metadata(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 739, in _init_flat_param_and_metadata
[rank2]: self.flat_param: FlatParameter = self.flatten_tensors_into_flat_param(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 852, in flatten_tensors_into_flat_param
[rank2]: flat_param_data = self.flatten_tensors(tensors, aligned_numel)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 844, in flatten_tensors
[rank2]: return torch.cat(flat_tensors, dim=0)
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.88 GiB. GPU 2 has a total capacity of 39.38 GiB of which 18.80 GiB is free. Including non-PyTorch memory, this process has 20.57 GiB memory in use. Of the allocated memory 19.89 GiB is allocated by PyTorch, and 208.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)