Lower Hunyuan Video LoRA memory requirements #135
What are the memory requirements for Hunyuan currently? I'm OOMing with 48 GB.
Could you give #129 a try? I believe with FP8 it should fit in 24 GB based on rough calculations, but I will continue to try and improve it.
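The back-of-envelope arithmetic behind the 24 GB estimate can be made explicit. The parameter count below is approximate, and this only covers the transformer weights, not activations, optimizer state, the text encoders, or the VAE:

```python
# Rough weight-memory estimate for the HunyuanVideo transformer
# (~12.7B parameters is an approximation; activations, text encoders,
# and the VAE add more on top of this).
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1024**3

params = 12.7e9
bf16 = weight_memory_gb(params, 2)  # 2 bytes/param in BF16
fp8 = weight_memory_gb(params, 1)   # 1 byte/param in FP8
print(f"BF16 weights: {bf16:.1f} GB, FP8 weights: {fp8:.1f} GB")
```

Halving the weight footprint is what makes a 24 GB card plausible once conditions and latents are precomputed.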
Sadly I still OOM even after precomputing the conditions and latents.
Just to confirm, are you using the bash script from the README or a custom launch script? And are you sure …
Yeah, using the bash script from the README. --gradient_checkpointing and --precompute_conditions are both being passed.
Hi, I also tried the bash script from your README.md and loaded the checkpoint you provide at https://huggingface.co/hunyuanvideo-community/HunyuanVideo, but I get OOM even on an 80 GiB H800 when loading the HunyuanVideo transformer, before training starts. My training device is 1/2 H800.
I have the same OOM problem with --precompute_conditions and --gradient_checkpointing from the README script on an A100.
I'm unable to replicate, unfortunately. I just verified once again that I can run training in about 42 GB of memory when precomputation and gradient checkpointing are enabled with …
Can you also try running training with the resolution buckets set as …
Setting the bucket size to 1x512x12 still OOMs.
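For sizing intuition, a resolution bucket `FxHxW` maps to a much smaller latent tensor. The sketch below assumes HunyuanVideo's causal VAE with 4x temporal and 8x spatial compression and 16 latent channels (the frame count must be of the form 4k+1):

```python
# Latent-tensor shape for an FxHxW resolution bucket, assuming HunyuanVideo's
# causal VAE: 4x temporal compression, 8x spatial compression, 16 channels.
def latent_shape(frames: int, height: int, width: int):
    assert (frames - 1) % 4 == 0, "frame count must be of the form 4k + 1"
    return (16, (frames - 1) // 4 + 1, height // 8, width // 8)

print(latent_shape(81, 512, 768))  # 81-frame bucket -> (16, 21, 64, 96)
print(latent_shape(1, 512, 768))   # single-frame bucket -> (16, 1, 64, 96)
```

This is why a single-frame bucket is a useful OOM probe: if even the tiny latent still OOMs, the weights and activations, not the video length, dominate memory.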
I see. I'll give PyTorch 2.4 a try and profile it tomorrow. Could you try upgrading to PyTorch 2.5.1 and see if the issue goes away, or to the nightly 2.6.0?
Also, on 2.4, could you first check whether inference produces a normal video or a black video with the example code here: https://huggingface.co/docs/diffusers/en/api/pipelines/hunyuan_video. There have been reports of it not working, and I suspect it's something to do with the torch version. If inference is not working, there's a slim chance training would work well. The example doesn't mention it, but if you're facing OOM for inference, …
I was able to run training with the accelerate_configs/uncompiled_1.yaml config, but during training the loss was NaN. The output of this LoRA model after training is a black screen. Can you explain the difference between these configs, please? PyTorch 2.4.0.
@generalsvr It seems like 2.4 might be a problematic torch version for some operations. I'm going through the relevant commits in PyTorch to try and see what exactly causes this, but I believe upgrading to 2.5.1 will fix the NaN loss. Could you give that a try?
The configs are simply some rules that tell …
After updating PyTorch to 2.5.1 on the same machine, the original Hunyuan model started to generate videos. But training is still a problem: the loss appeared once on step 1, but then was NaN again. Video generation with the LoRA results in a black screen.

Training run 1 log:
Training steps: 0%| | 0/20 [00:00<?, ?it/s]12/24/2024 16:47:37 - DEBUG - finetrainers - Starting epoch (1/1)

Training run 2 log:
Training steps: 0%| | 0/20 [00:00<?, ?it/s]12/24/2024 17:10:49 - DEBUG - finetrainers - Starting epoch (1/1)
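A generic way to pin down the behavior above is to record the first step at which the loss stops being finite, so the offending batch can be inspected. This is a plain-Python sketch of the watchdog idea, not a finetrainers API (in a real loop you would check the torch loss tensor and gradients with `torch.isfinite` as well):

```python
import math

# Minimal NaN watchdog: given a sequence of per-step loss values, return the
# first step (1-indexed) at which the loss is non-finite, or None if all are fine.
def first_bad_step(losses):
    for step, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            return step
    return None

# Mirrors the reported behavior: loss is fine on step 1, NaN afterwards.
print(first_bad_step([0.42, float("nan"), float("nan")]))  # 2
```

Knowing whether the NaN appears on the very first backward pass or only after a specific batch helps distinguish a bad kernel (torch-version issue) from bad data.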
I tested it 2 days back and it seemed fine. Below is my command:

```shell
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"
DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/hunyuan_video/hunyuan_disney"

# Model arguments
model_cmd="--model_name hunyuan_video \
  --pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token afkx \
  --video_resolution_buckets 81x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 2 \
  --train_steps 50 \
  --rank 4 \
  --lora_alpha 4 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 2e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS --main_process_port 29501 train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
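The `--validation_prompts` string above packs several prompts into one argument: entries are separated by `:::`, and each may carry a `frames x height x width` suffix after `@@@`. A small parser makes the format explicit (a sketch of the format; finetrainers' own parsing may differ in details):

```python
# Parse finetrainers-style validation prompts: entries separated by ":::",
# each optionally suffixed with "@@@FxHxW" (frames x height x width).
def parse_validation_prompts(raw: str):
    out = []
    for entry in raw.split(":::"):
        prompt, _, res = entry.partition("@@@")
        if res:
            f, h, w = (int(x) for x in res.split("x"))
            out.append((prompt, (f, h, w)))
        else:
            out.append((prompt, None))  # no resolution suffix given
    return out

parsed = parse_validation_prompts("afkx a cake@@@49x512x768:::afkx a box@@@61x512x768")
print(parsed)  # [('afkx a cake', (49, 512, 768)), ('afkx a box', (61, 512, 768))]
```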
Naive FP8 weight-casting training has been merged with #184. The loss curves for FP8 vs. BF16 match almost exactly throughout training, thanks to sensible defaults that keep certain layers from being cast to FP8 precision. TorchAO is still on the to-do list, but it's relatively low priority, since the memory gains would be small compared to other optimizations that can be made at the moment, and because torchao quantizations usually also require … Next up on my priority list will be:
It should be possible to leverage FP8-cast models, or torchao quantization, to support training in under 24 GB up to a reasonable resolution. Or at least that's the hope when combined with precomputation from #129. Will take a look soon 🤗
TorchAO docs: https://huggingface.co/docs/diffusers/main/en/quantization/torchao
FP8 casting: huggingface/diffusers#10347
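The "ignore certain layers" idea behind the matching FP8/BF16 loss curves can be sketched as a name-based casting policy: precision-sensitive modules (normalization layers, embeddings, the final projection) stay in BF16 while the bulk of the linear layers are stored in FP8. The pattern list here is illustrative, not finetrainers' actual defaults:

```python
import re

# Illustrative layerwise FP8 casting policy: modules whose names match a
# skip pattern keep BF16 storage; everything else is stored in FP8.
# (Hypothetical pattern list, not the defaults shipped in #184.)
SKIP_PATTERNS = [r"norm", r"embed", r"proj_out"]

def storage_dtype(module_name: str) -> str:
    if any(re.search(p, module_name) for p in SKIP_PATTERNS):
        return "bfloat16"
    return "float8_e4m3fn"

print(storage_dtype("transformer_blocks.0.attn.to_q"))  # float8_e4m3fn
print(storage_dtype("transformer_blocks.0.norm1"))      # bfloat16
```

Norm and embedding layers contribute little to total memory but are disproportionately sensitive to quantization, which is why skipping them costs almost nothing while keeping the loss curve on track.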