I am fine-tuning my model with LoRA and FSDP (ShardingStrategy.FULL_SHARD) on two 40G A100 GPUs and one 80G GPU. I launch training with CUDA_VISIBLE_DEVICES=5,3,4 torchrun --standalone --nnodes=1 --nproc-per-node=3 finetuning.py, but I still get OOM errors on the two 40G A100s. Watching GPU memory, I can see that every GPU loads the full model weights while FullyShardedDataParallel initializes the model. I am confused by this behavior and do not know how to fix it.
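For context, the wrapping call that triggers the OOM (finetuning.py line 281) looks roughly like the sketch below. This is a minimal illustration, not my exact code; the sync_module_states and device_id arguments and the placeholder module are shown only to make the setup reproducible under torchrun.

```python
# Minimal sketch of FULL_SHARD FSDP wrapping (illustrative; the real script
# wraps a LoRA-augmented model instead of the toy Linear used here).
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
)

def wrap_model(model: torch.nn.Module) -> FSDP:
    """Wrap an already-built model with fully sharded data parallelism."""
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        device_id=torch.cuda.current_device(),
        sync_module_states=True,  # broadcast rank-0 weights after sharding
    )

if __name__ == "__main__":
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Placeholder module standing in for the real LoRA model.
    toy = torch.nn.Linear(1024, 1024)
    sharded = wrap_model(toy)

    dist.destroy_process_group()
```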
Bug logs
[rank2]: Traceback (most recent call last):
[rank2]: File "/data0/home/ening/NICA/cogmllm/src/cogmllm/tools/finetuning.py", line 438, in <module>
[rank2]: fire.Fire(main)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank2]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank2]: component, remaining_args = _CallAndUpdateTrace(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]: component = fn(*varargs, **kwargs)
[rank2]: File "/data0/home/ening/NICA/cogmllm/src/cogmllm/tools/finetuning.py", line 281, in main
[rank2]: model = FSDP(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank2]: _init_param_handle_from_module(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 636, in _init_param_handle_from_module
[rank2]: _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 648, in _init_param_handle_from_params
[rank2]: handle = FlatParamHandle(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 584, in __init__
[rank2]: self._init_flat_param_and_metadata(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 739, in _init_flat_param_and_metadata
[rank2]: self.flat_param: FlatParameter = self.flatten_tensors_into_flat_param(
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 852, in flatten_tensors_into_flat_param
[rank2]: flat_param_data = self.flatten_tensors(tensors, aligned_numel)
[rank2]: File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 844, in flatten_tensors
[rank2]: return torch.cat(flat_tensors, dim=0)
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.88 GiB. GPU 2 has a total capacity of 39.38 GiB of which 18.80 GiB is free. Including non-PyTorch memory, this process has 20.57 GiB memory in use. Of the allocated memory 19.89 GiB is allocated by PyTorch, and 208.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)