Lower VRAM usage in CPU offload for Flux ControlNet Pipeline #10790
@NielsPichon We recently shipped "group offloading" in #10503. Would you like to give it a try and see if it helps? I was able to run the model in ~1 GB VRAM without much hit to generation time, and we're working on further improvements that will lower this even more. You can configure the number of internal transformer blocks to load at a time per model, or simply perform offloading at the lowest possible leaf-module level. What you describe as "slow offload mode" is the default behaviour with group offloading, i.e., NEED LAYER X -> LOAD LAYER X on GPU -> PERFORM FORWARD -> OFFLOAD LAYER X. As you mention a 24 GB GPU, it's safe to assume it is a modern GPU that supports CUDA streams -- so you can benefit from offloading without taking much of a performance hit if you enable that option. LMK if you need any help with examples apart from the ones in the PR :)
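A minimal sketch of block-level group offloading with a CUDA stream, assuming the `apply_group_offloading` helper introduced in #10503 (the checkpoint name and group size below are illustrative, and parameter names may differ slightly in newer releases):

```python
import torch
from diffusers import FluxTransformer2DModel
from diffusers.hooks import apply_group_offloading

# Illustrative checkpoint; substitute whatever Flux transformer you are using.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", subfolder="transformer", torch_dtype=torch.bfloat16
)

# Keep only a couple of transformer blocks on the GPU at a time and prefetch the
# next group on a separate CUDA stream to hide most of the transfer latency.
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=2,
    use_stream=True,
)
```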
If this works (and I have no doubt it will ;)), this is absolutely brilliant! I will give it a try and keep you posted on how it goes. In the meantime you may consider this issue as resolved. Thanks for your help 🙂
Hey! I am also interested. I tried applying group offloading to the ControlNet itself and to the individual components like the transformer, text encoders, etc., but couldn't figure it out. Thank you in advance :)
@BenjaminE98 Could you let me know what errors you're facing and share a code example? I'll be better able to help debug that way. I also just gave it a spin on SDXL ControlNet and it worked without problems, and our tests on dummy models cover Flux ControlNet as well.
Hey! Thank you very much for the fast response. I can't use my machine right now, as I had stability issues with new RAM, and can hopefully access my machine again on Friday (fingers crossed). I tried something like what was mentioned here: I applied group offloading to the text_encoder, text_encoder_2, vae and transformer. VRAM usage caused my GPU to freeze (7900XTX, over 24 GB it seemed). I will try again and get back to you as soon as possible.
See if this helps
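(The linked script itself isn't reproduced in this thread; the sketch below is just one way the same idea could be applied to every large component of a Flux ControlNet pipeline, assuming the `apply_group_offloading` helper and placeholder checkpoints.)

```python
import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.hooks import apply_group_offloading

# Placeholder checkpoints; swap in the base model and ControlNet you actually use.
controlnet = FluxControlNetModel.from_pretrained(
    "InstantX/FLUX.1-dev-Controlnet-Canny", torch_dtype=torch.bfloat16
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", controlnet=controlnet, torch_dtype=torch.bfloat16
)

offload_kwargs = dict(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
)

# Apply group offloading to each large component, including the ControlNet, so
# nothing has to stay resident on the GPU for the whole denoising loop. Do not
# call pipe.to("cuda") afterwards -- the hooks manage device placement.
for module in (pipe.transformer, pipe.controlnet, pipe.text_encoder, pipe.text_encoder_2, pipe.vae):
    apply_group_offloading(module, **offload_kwargs)
```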
Thank you very much! Your script worked, but other problems did arise. Really appreciate the help!
Hey there! Just wanting to report back! The feature did drastically reduce VRAM usage. However, I have another problem, which might be related to AMD or similar. I don't know if I should open a new issue, but I will do some more research first. The 7900XTX keeps stuttering and freezing after quite some time when using the upscaling pipeline. I don't know what it is, but it also never seems to actually run the inference: it loads the pipeline etc. but never completes any inference steps. Thank you so much for the help :)
How much shared RAM do you have?
I would have to run it to see it again. I have 128 GB RAM and 24 GB VRAM. Earlier I had only 32 GB and thought that was the reason; it looked like a typical freeze/stutter caused by running out of RAM. I think it happens only after it says 0/... inference steps. EDIT: The command rocm-smi --showmeminfo gtt says 64 GB. I haven't owned the 7900XTX for long, so please bear with me if this is not the shared memory. :) When enabling VAE tiling it works, however it takes about 55 minutes for a 4x upscale of a 128 x 128 image.
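For reference, VAE tiling can be toggled on the pipeline's autoencoder roughly like this (a minimal sketch; `pipe` is assumed to be the upscaling pipeline discussed above):

```python
# Decode latents in tiles so the VAE never holds the full-resolution activations
# in VRAM at once; slower, but it avoids the out-of-memory freeze described above.
pipe.vae.enable_tiling()
```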
I have tested group offloading with leaf-level offloading and it works really well 👍. Thanks for the help
I tested it on 3 platforms. On ROCm and an Nvidia RTX 5070 Ti it is not working (maybe due to drivers?). On 4090s it runs really well :) Thank you :)
What are the CPU memory specs on the ROCm and RTX 5070 Ti machines? The current implementation of group offloading requires a lot of CPU memory, so it might be failing due to that -- we'll work on improving this soon. In any case, would you be able to share the errors you're facing?
The specs are the following: PC2 (not working): PC3 (working): Unfortunately I can't provide any errors, as the PC just freezes after some time. I would try to get further debug information, but I don't know where to start as the PCs just keep hanging. I could try to switch to another TTY (maybe), but so far the PCs have just kept freezing. This is maybe some sort of driver problem (as the AMD and 5070 Ti drivers may not be mature enough?). Thanks for the support! :)
Is your feature request related to a problem? Please describe.
I have a 24 GB VRAM GPU. When running a diffusion model like Flux.1, I can barely fit the model in memory during inference with batch size 1. Enabling CPU offload does not help, because the offload does not occur between the ControlNet forward pass and the transformer forward pass (which makes sense performance-wise).
It would be great to enable offloading between the ControlNet call and the transformer denoising steps (or for any other auxiliary model that does not currently get offloaded in the middle of the denoising process) to further reduce VRAM requirements.
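For context, the CPU offload referred to above is presumably the standard model-level offload, which could be enabled like this (a sketch assuming `pipe` is a Flux ControlNet pipeline already constructed in bfloat16):

```python
# Model-level CPU offload: each component is moved onto the GPU only when it is
# called, and back to CPU once the next component starts. Because the ControlNet
# and the transformer are both called inside every denoising step, neither gets
# offloaded between the two forward passes -- hence the VRAM peak described above.
pipe.enable_model_cpu_offload()
```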
Describe the solution you'd like.
What I would suggest is having a "slow" offload mode where these models do get offloaded to CPU between their calls within the denoising loop, even if it is really slow.
For instance, in the image-to-image pipeline on line 927:
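(The snippet originally referenced at that line isn't reproduced here. The toy sketch below only illustrates the suggested "slow offload" pattern between the ControlNet call and the transformer forward; the module names and shapes are stand-ins, not the pipeline's actual code.)

```python
import torch
from torch import nn

# Tiny stand-ins for the ControlNet and transformer; the real modules are far
# larger, which is exactly why moving them between devices would help.
controlnet = nn.Linear(64, 64)
transformer = nn.Linear(64, 64)
latents = torch.randn(1, 64)

for _ in range(4):  # denoising loop
    # ControlNet forward on GPU, then push it back to CPU ("slow" offload).
    controlnet.to("cuda")
    residual = controlnet(latents.to("cuda"))
    controlnet.to("cpu")

    # Transformer forward on GPU, with the ControlNet's VRAM now freed.
    transformer.to("cuda")
    latents = transformer(residual).cpu()
    transformer.to("cpu")
```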
Describe alternatives you've considered.
I am not sure there are alternatives if these models are to be used at the desired floating-point precision (in my case bfloat16).