Although enabling vramfs, cuda-oom happens #39

Open
leemgs opened this issue Jun 18, 2024 · 1 comment

leemgs commented Jun 18, 2024

Hello. I want to use vramfs as swap space for Nvidia GPU memory.
After reading the README.md file, I mounted a 20 GB vramfs.
When I ran nvidia-smi, I was happy to see that vramfs had indeed claimed about 20 GB, as shown below.

# vramfs /tmp/vram 20G
# nvidia-smi
Tue Jun 18 13:22:00 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:21:00.0 Off |                    0 |
| N/A   40C    P0              65W / 300W |  76773MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2867      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A   1856687      C   bin/vramfs                                20892MiB | <--- 20GB for OpenCL
|    0   N/A  N/A   1906793      C   /opt/conda/bin/python3.10                 51754MiB |
|    0   N/A  N/A   1988805      C   /usr/bin/python                            2670MiB |
|    0   N/A  N/A   3729345      C   /usr/bin/python                            1418MiB |
+---------------------------------------------------------------------------------------+
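
For reference, the mount can also be double-checked from the filesystem side. This is only a generic FUSE sanity check (not vramfs-specific), and the exact output depends on the distribution:

# mount | grep /tmp/vram   # the FUSE mount should be listed
# df -h /tmp/vram          # the reported size should roughly match the 20G passed to vramfs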

I then created a 10 GB swapfile at /tmp/vram/swapfile as follows.

# cd /tmp/vram
# LOOPDEV=$(losetup -f)
# truncate -s 10G swapfile # replace 10G with target swapspace size, has to be smaller than the allocated vramfs (e.g. 20G)
# losetup $LOOPDEV swapfile
# mkswap $LOOPDEV
# swapon $LOOPDEV
# cat /proc/swaps
   Filename                                Type            Size            Used            Priority
   /dev/loop7                              partition       10485756        0               -3

# vi /etc/security/limits.conf
leemgs hard memlock unlimited
leemgs soft memlock unlimited
leemgs hard rtprio unlimited
leemgs soft rtprio unlimited
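
For completeness, the new limits and the swap device can be re-checked after logging in again (pam_limits only applies limits.conf on a fresh login session). These are just standard sanity checks:

$ ulimit -l       # should print "unlimited" in the new session
$ swapon --show   # the loop device backed by /tmp/vram/swapfile should be listed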

However, when I used the open source project axolotl (https://github.com/OpenAccess-AI-Collective/axolotl) to run model training as shown below, I got a CUDA OOM error (torch.cuda.OutOfMemoryError: CUDA out of memory).

$ accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml
  • log messages:
 ........... Omission ....................
[2024-06-18 13:11:25,615] [DEBUG] [axolotl.load_tokenizer:216] [PID:3288778] [RANK:0] EOS: 2 / </s>
[2024-06-18 13:11:25,615] [DEBUG] [axolotl.load_tokenizer:217] [PID:3288778] [RANK:0] BOS: 1 / <s>
[2024-06-18 13:11:25,616] [DEBUG] [axolotl.load_tokenizer:218] [PID:3288778] [RANK:0] PAD: 2 / </s>
[2024-06-18 13:11:25,616] [DEBUG] [axolotl.load_tokenizer:219] [PID:3288778] [RANK:0] UNK: 0 / <unk>
[2024-06-18 13:11:25,616] [INFO] [axolotl.load_tokenizer:224] [PID:3288778] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-06-18 13:11:25,616] [DEBUG] [axolotl.train.log:61] [PID:3288778] [RANK:0] loading model and peft_config...
[2024-06-18 13:11:25,862] [INFO] [axolotl.load_model:280] [PID:3288778] [RANK:0] patching with flash attention for sample packing
[2024-06-18 13:11:25,862] [INFO] [axolotl.load_model:366] [PID:3288778] [RANK:0] patching _expand_mask
/home/guest/.local/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
[2024-06-18 13:11:32,028] [ERROR] [axolotl.load_model:591] [PID:3288778] [RANK:0] CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 79.15 GiB total capacity; 3.20 GiB already allocated; 153.94 MiB free; 3.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/utils/models.py", line 480, in load_model
    model = LlamaForCausalLM.from_pretrained(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 841, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 79.15 GiB total capacity; 3.20 GiB already allocated; 153.94 MiB free; 3.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/cli/train.py", line 49, in <module>
    fire.Fire(do_cli)
  File "/home/guest/.local/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/guest/.local/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/guest/.local/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/cli/train.py", line 33, in do_cli
    return do_train(parsed_cfg, parsed_cli_args)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/cli/train.py", line 45, in do_train
    return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/train.py", line 65, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/utils/models.py", line 592, in load_model
    raise err
  File "/data/home/guest/fine-tuning-axolotl/src/axolotl/utils/models.py", line 480, in load_model
    model = LlamaForCausalLM.from_pretrained(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 841, in _load_state_dict_into_meta_model
    set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
  File "/home/guest/.local/lib/python3.10/site-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
    new_value = value.to(device)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 79.15 GiB total capacity; 3.20 GiB already allocated; 153.94 MiB free; 3.53 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/guest/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/guest/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/guest/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
    simple_launcher(args)
  File "/home/guest/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'axolotl.cli.train', 'examples/openllama-3b/lora.yml']' returned non-zero exit status 1.
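
As an aside, the error message itself suggests tuning the PyTorch caching allocator. I don't know whether that would help here, but the suggested knob would be set like this (128 is just an example value):

$ PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 accelerate launch -m axolotl.cli.train examples/openllama-3b/lora.yml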

# cat /proc/swaps
Filename                                Type            Size            Used            Priority
/dev/loop7                              partition       10485756        0               -2

As you can see, the used swap space of /dev/loop7 is still 0, which is strange.
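
To see whether the swap is ever touched while the training job is running, swap activity can also be watched live with standard tools (nothing vramfs-specific):

$ watch -n 1 free -h   # the "Swap" used value should grow if the kernel swaps to /dev/loop7
$ vmstat 1             # the si/so columns show swap-in / swap-out traffic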

So I was wondering: is it possible to use vramfs as swap space for an Nvidia GPU's memory? Any hints or clues are welcome.

leemgs changed the title from "Using as swap for Nvidia GPU Memory causes cuda-oom" to "Although enabling vramfs, cuda-oom happens" on Jun 18, 2024
Overv (Owner) commented Jun 30, 2024

Am I understanding correctly that you want to use GPU memory as swap space for GPU memory?
