Lower Hunyuan Video LoRA memory requirements #135
What are the memory requirements for Hunyuan currently? I'm OOMing with 48 GB.
Could you give #129 a try? I believe with FP8 it should fit in 24 GB based on rough calculations, but I will continue to try and improve it.
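The back-of-envelope arithmetic behind the 24 GB estimate can be made explicit. The parameter count below is approximate, and this only covers the transformer weights, not activations, optimizer state, the text encoders, or the VAE:

```python
# Rough weight-memory estimate for the HunyuanVideo transformer
# (~12.7B parameters is an approximation; activations, text encoders,
# and the VAE add more on top of this).
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1024**3

params = 12.7e9
bf16 = weight_memory_gb(params, 2)  # 2 bytes/param in BF16
fp8 = weight_memory_gb(params, 1)   # 1 byte/param in FP8
print(f"BF16 weights: {bf16:.1f} GB, FP8 weights: {fp8:.1f} GB")
```

Halving the weight footprint is what makes a 24 GB card plausible once conditions and latents are precomputed.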
Sadly I still OOM even after precomputing the conditions and latents.
Just to confirm, are you using the bash script from the README or a custom launch script? And are you sure …
Yeah, using the bash script from the README. --gradient_checkpointing and --precompute_conditions are both being passed.
Hi, I also tried the bash script from your README.md and loaded the checkpoint you provide at https://huggingface.co/hunyuanvideo-community/HunyuanVideo, but I get OOM even on an 80 GiB H800 when loading the HunyuanVideo transformer, before training starts. My training device is 1/2 H800.
I have the same OOM problem with --precompute_conditions and --gradient_checkpointing from the README script on an A100.
I'm unable to replicate, unfortunately. I just verified once again that I can run training in about 42 GB of memory when precomputation and gradient checkpointing are enabled with …
Can you also try running training with the resolution buckets set as …
Setting the bucket size to 1x512x12 still OOMs.
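For sizing intuition, a resolution bucket `FxHxW` maps to a much smaller latent tensor. The sketch below assumes HunyuanVideo's causal VAE with 4x temporal and 8x spatial compression and 16 latent channels (the frame count must be of the form 4k+1):

```python
# Latent-tensor shape for an FxHxW resolution bucket, assuming HunyuanVideo's
# causal VAE: 4x temporal compression, 8x spatial compression, 16 channels.
def latent_shape(frames: int, height: int, width: int):
    assert (frames - 1) % 4 == 0, "frame count must be of the form 4k + 1"
    return (16, (frames - 1) // 4 + 1, height // 8, width // 8)

print(latent_shape(81, 512, 768))  # 81-frame bucket -> (16, 21, 64, 96)
print(latent_shape(1, 512, 768))   # single-frame bucket -> (16, 1, 64, 96)
```

This is why a single-frame bucket is a useful OOM probe: if even the tiny latent still OOMs, the weights and activations, not the video length, dominate memory.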
I see. I'll give PyTorch 2.4 a try and profile it tomorrow. Could you try upgrading to PyTorch 2.5.1 and see if the issue goes away, or to the nightly 2.6.0?
Also, on 2.4, could you first check whether inference produces a normal video or a black video with the example code here: https://huggingface.co/docs/diffusers/en/api/pipelines/hunyuan_video. There have been reports of it not working, and I suspect it's something to do with the torch version. If inference is not working, there's a slim chance training would work well. The example doesn't mention it, but if you're facing OOM for inference, …
I was able to run training with the accelerate_configs/uncompiled_1.yaml config, but during training the loss was NaN. The output of this LoRA model after training is a black screen. Can you explain the difference between these configs, please? PyTorch 2.4.0.
@generalsvr It seems like 2.4 might be a problematic torch version for some operations. I'm going through the relevant commits in PyTorch to try and see what exactly causes this, but I believe upgrading to 2.5.1 will fix the NaN loss. Could you give that a try?
The configs are simply some rules that tell …
After updating PyTorch to 2.5.1 on the same machine, the original Hunyuan model started to generate videos. But training is still a problem: the loss appeared once on step 1, but then was NaN again. Video generation with the LoRA results in a black screen.

Training run 1 log:
Training steps: 0%| | 0/20 [00:00<?, ?it/s]12/24/2024 16:47:37 - DEBUG - finetrainers - Starting epoch (1/1)

Training run 2 log:
Training steps: 0%| | 0/20 [00:00<?, ?it/s]12/24/2024 17:10:49 - DEBUG - finetrainers - Starting epoch (1/1)
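A generic way to pin down the behavior above is to record the first step at which the loss stops being finite, so the offending batch can be inspected. This is a plain-Python sketch of the watchdog idea, not a finetrainers API (in a real loop you would check the torch loss tensor and gradients with `torch.isfinite` as well):

```python
import math

# Minimal NaN watchdog: given a sequence of per-step loss values, return the
# first step (1-indexed) at which the loss is non-finite, or None if all are fine.
def first_bad_step(losses):
    for step, loss in enumerate(losses, start=1):
        if not math.isfinite(loss):
            return step
    return None

# Mirrors the reported behavior: loss is fine on step 1, NaN afterwards.
print(first_bad_step([0.42, float("nan"), float("nan")]))  # 2
```

Knowing whether the NaN appears on the very first backward pass or only after a specific batch helps distinguish a bad kernel (torch-version issue) from bad data.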
I tested it 2 days back and it seemed fine. Below is my command:

```shell
export NCCL_P2P_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
export FINETRAINERS_LOG_LEVEL=DEBUG

GPU_IDS="0,1"
DATA_ROOT="/home/sayak/finetrainers/video-dataset-disney"
CAPTION_COLUMN="prompt.txt"
VIDEO_COLUMN="videos.txt"
OUTPUT_DIR="/raid/.cache/huggingface/sayak/hunyuan_video/hunyuan_disney"

# Model arguments
model_cmd="--model_name hunyuan_video \
  --pretrained_model_name_or_path hunyuanvideo-community/HunyuanVideo"

# Dataset arguments
dataset_cmd="--data_root $DATA_ROOT \
  --video_column $VIDEO_COLUMN \
  --caption_column $CAPTION_COLUMN \
  --id_token afkx \
  --video_resolution_buckets 81x512x768 \
  --caption_dropout_p 0.05"

# Dataloader arguments
dataloader_cmd="--dataloader_num_workers 4"

# Diffusion arguments
diffusion_cmd=""

# Training arguments
training_cmd="--training_type lora \
  --seed 42 \
  --mixed_precision bf16 \
  --batch_size 2 \
  --train_steps 50 \
  --rank 4 \
  --lora_alpha 4 \
  --target_modules to_q to_k to_v to_out.0 \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --checkpointing_steps 5 \
  --checkpointing_limit 2 \
  --enable_slicing \
  --enable_tiling"

# Optimizer arguments
optimizer_cmd="--optimizer adamw \
  --lr 2e-5 \
  --lr_scheduler constant_with_warmup \
  --lr_warmup_steps 100 \
  --lr_num_cycles 1 \
  --beta1 0.9 \
  --beta2 0.95 \
  --weight_decay 1e-4 \
  --epsilon 1e-8 \
  --max_grad_norm 1.0"

# Validation arguments
validation_cmd="--validation_prompts \"afkx A baker carefully cuts a green bell pepper cake on a white plate against a bright yellow background, followed by a strawberry cake with a similar slice of cake being cut before the interior of the bell pepper cake is revealed with the surrounding cake-to-object sequence.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@49x512x768:::afkx A cake shaped like a Nutella container is carefully sliced, revealing a light interior, amidst a Nutella-themed setup, showcasing deliberate cutting and preserved details for an appetizing dessert presentation on a white base with accompanying jello and cutlery, highlighting culinary skills and creative cake designs.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@61x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@97x512x768:::afkx A vibrant orange cake disguised as a Nike packaging box sits on a dark surface, meticulous in its detail and design, complete with a white swoosh and 'NIKE' logo. A person's hands, holding a knife, hover over the cake, ready to make a precise cut, amidst a simple and clean background.@@@129x512x768:::A person with gloved hands carefully cuts a cake shaped like a Skittles bottle, beginning with a precise incision at the lid, followed by careful sequential cuts around the neck, eventually detaching the lid from the body, revealing the chocolate interior of the cake while showcasing the layered design's detail.@@@61x512x768:::afkx A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage@@@61x512x768\" \
  --num_validation_videos 1 \
  --validation_steps 100"

# Miscellaneous arguments
miscellaneous_cmd="--tracker_name finetrainers-hunyuan-video \
  --output_dir $OUTPUT_DIR \
  --nccl_timeout 1800 \
  --report_to wandb"

cmd="accelerate launch --config_file accelerate_configs/deepspeed.yaml --gpu_ids $GPU_IDS --main_process_port 29501 train.py \
  $model_cmd \
  $dataset_cmd \
  $dataloader_cmd \
  $diffusion_cmd \
  $training_cmd \
  $optimizer_cmd \
  $validation_cmd \
  $miscellaneous_cmd"

echo "Running command: $cmd"
eval $cmd
echo -ne "-------------------- Finished executing script --------------------\n\n"
```
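The `--validation_prompts` string above packs several prompts into one argument: entries are separated by `:::`, and each may carry a `frames x height x width` suffix after `@@@`. A small parser makes the format explicit (a sketch of the format; finetrainers' own parsing may differ in details):

```python
# Parse finetrainers-style validation prompts: entries separated by ":::",
# each optionally suffixed with "@@@FxHxW" (frames x height x width).
def parse_validation_prompts(raw: str):
    out = []
    for entry in raw.split(":::"):
        prompt, _, res = entry.partition("@@@")
        if res:
            f, h, w = (int(x) for x in res.split("x"))
            out.append((prompt, (f, h, w)))
        else:
            out.append((prompt, None))  # no resolution suffix given
    return out

parsed = parse_validation_prompts("afkx a cake@@@49x512x768:::afkx a box@@@61x512x768")
print(parsed)  # [('afkx a cake', (49, 512, 768)), ('afkx a box', (61, 512, 768))]
```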
Naive FP8 weight-casting training has been merged with #184. The loss curves for FP8 vs. BF16 match almost exactly throughout training, thanks to sensible defaults that keep certain layers from being cast to FP8 precision. TorchAO is still on the to-do list, but it's relatively low priority, since the memory gains would be small compared to other optimizations that can be made at the moment, and because torchao quantizations usually also require … Next up on my priority list will be:
It should be possible to leverage FP8-cast models, or torchao quantization, to support training in under 24 GB up to a reasonable resolution. Or at least that's the hope when combined with precomputation from #129. Will take a look soon 🤗
TorchAO docs: https://huggingface.co/docs/diffusers/main/en/quantization/torchao
FP8 casting: huggingface/diffusers#10347
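The "ignore certain layers" idea behind the matching FP8/BF16 loss curves can be sketched as a name-based casting policy: precision-sensitive modules (normalization layers, embeddings, the final projection) stay in BF16 while the bulk of the linear layers are stored in FP8. The pattern list here is illustrative, not finetrainers' actual defaults:

```python
import re

# Illustrative layerwise FP8 casting policy: modules whose names match a
# skip pattern keep BF16 storage; everything else is stored in FP8.
# (Hypothetical pattern list, not the defaults shipped in #184.)
SKIP_PATTERNS = [r"norm", r"embed", r"proj_out"]

def storage_dtype(module_name: str) -> str:
    if any(re.search(p, module_name) for p in SKIP_PATTERNS):
        return "bfloat16"
    return "float8_e4m3fn"

print(storage_dtype("transformer_blocks.0.attn.to_q"))  # float8_e4m3fn
print(storage_dtype("transformer_blocks.0.norm1"))      # bfloat16
```

Norm and embedding layers contribute little to total memory but are disproportionately sensitive to quantization, which is why skipping them costs almost nothing while keeping the loss curve on track.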