
Lower VRAM usage in CPU offload for Flux ControlNet Pipeline #10790

Open
NielsPichon opened this issue Feb 14, 2025 · 14 comments

@NielsPichon

Is your feature request related to a problem? Please describe.

I have a 24 GB VRAM GPU. When running a diffusion model like Flux1, I can barely fit the model in memory during inference with batch size 1. Enabling CPU offload does not help, because the offload does not occur between the controlnet forward pass and the transformer forward pass (which makes sense performance-wise).

It would be great to enable offloading between the controlnet call and the transformer denoising step (or for any other auxiliary model that does not currently get offloaded in the middle of the denoising process) to further reduce VRAM requirements.

Describe the solution you'd like.

What I would suggest is having a "slow" offload mode where the models do get offloaded to CPU, even if it is really slow.

def enable_sequential_cpu_offload(self, gpu_id: Optional[int] = None, device: Union[torch.device, str] = "cuda", enable_slow_mode: bool = False):
    ...

For instance, in the image-to-image pipeline on line 927:

controlnet_block_samples, controlnet_single_block_samples = self.controlnet(
    ...
)

if self._enable_slow_cpu_offload:
    self.maybe_free_model_hooks()

...

noise_pred = self.transformer(
    ...
)[0]

if self._enable_slow_cpu_offload:
    self.maybe_free_model_hooks()

Describe alternatives you've considered.

I am not sure there are alternatives if these models are to be usable at the desired floating-point precision (in my case bfloat16).

@a-r-r-o-w
Member

@NielsPichon We recently shipped "group offloading" in #10503. Would you like to give it a try and see if it helps? I was able to run the model in ~1 GB VRAM without much of a hit to generation time. We're further working on some improvements that will allow lowering this even more.

You can configure the number of internal transformer blocks to load at a time per model, or perform offloading at the lowest possible leaf-module level. What you describe as "slow offload mode" is the default behaviour with group offloading, i.e., NEED LAYER X -> LOAD LAYER X on GPU -> PERFORM FORWARD -> OFFLOAD LAYER X. As you mention a 24 GB GPU, it's safe to assume it is a modern GPU that supports CUDA streams -- so you can benefit from offloading without taking much of a performance hit if you specify to use that option. LMK if you need any help with examples apart from the ones in the PR :)
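
For concreteness, a minimal sketch of what that configuration could look like (the import path and argument names follow PR #10503 and may differ in the released API; `pipe` is assumed to be an already-loaded FluxControlNetPipeline in bfloat16):

import torch
from diffusers.hooks import apply_group_offloading

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Block-level offloading: keep only `num_blocks_per_group` transformer blocks
# on the GPU at a time. With use_stream=True the next group is prefetched on a
# separate CUDA stream, which hides most of the transfer latency.
for module in (pipe.transformer, pipe.controlnet):
    apply_group_offloading(
        module,
        onload_device=onload_device,
        offload_device=offload_device,
        offload_type="block_level",
        num_blocks_per_group=2,
        use_stream=True,
    )

Passing offload_type="leaf_level" instead offloads at the individual leaf-module level for the smallest possible footprint.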

@NielsPichon
Author

If this works (and I have no doubt it will ;) ) this is absolutely brilliant! I will give it a try and keep you posted on how it goes. In the meantime you may consider this issue as resolved.

Thanks for your help 🙂

@BenjaminE98

> If this works (and I have no doubt it will ;) ) this is absolutely brilliant! I will give it a try and keep you posted on how it goes. In the meantime you may consider this issue as resolved.
>
> Thanks for your help 🙂

Hey!

I am also interested.

I tried applying group offloading to the controlnet itself and to the individual components like the transformer, text encoders, etc., but couldn't figure it out.

Thank you in advance :)

@a-r-r-o-w
Member

a-r-r-o-w commented Feb 18, 2025

@BenjaminE98 Could you let me know what errors you're facing and share a code example? I'll be better able to help debug that way. I also just gave it a spin on SDXL ControlNet and it worked without problems. Our tests on dummy models for Flux ControlNet, or <any_model> + ControlNet, don't seem to fail either:

diffusers/tests/pipelines/controlnet_flux/test_controlnet_flux.py

Line 54 in 924f880

    test_group_offloading = True

@BenjaminE98

> @BenjaminE98 Could you let me know what errors you're facing and share a code example? I'll be better able to help debug that way. I also just gave it a spin on SDXL ControlNet and it worked without problems. Our tests on dummy models for Flux ControlNet, or <any_model> + ControlNet, don't seem to fail either:
>
> diffusers/tests/pipelines/controlnet_flux/test_controlnet_flux.py
> Line 54 in 924f880
>
>     test_group_offloading = True

Hey!

Thank you very much for the fast response.

I can't use my machine right now, as I had RAM stability issues with new RAM; I can hopefully access my machine again on Friday (fingers crossed).

I tried something like what is mentioned here:

#10797

I tried applying group offloading to the text_encoder, text_encoder_2, vae and transformer.

VRAM usage caused my GPU to freeze (7900 XTX; it seemed to exceed 24 GB).

I will try again and get back to you as soon as possible.

@nitinmukesh

@BenjaminE98

See if this helps

#10840

@BenjaminE98

> @BenjaminE98
>
> See if this helps
>
> #10840

Thank you very much!

Your script was working, but other problems arose.

Really appreciated the help!

@BenjaminE98

> @BenjaminE98 Could you let me know what errors you're facing and share a code example? I'll be better able to help debug that way. I also just gave it a spin on SDXL ControlNet and it worked without problems. Our tests on dummy models for Flux ControlNet, or <any_model> + ControlNet, don't seem to fail either:
>
> diffusers/tests/pipelines/controlnet_flux/test_controlnet_flux.py
> Line 54 in 924f880
>
>     test_group_offloading = True

Hey there!

Just wanted to report back!

The feature did drastically reduce VRAM usage.

However, I have another problem, which might be related to AMD or something similar.

I don't know if I should open a new issue, but I will do some more research first.

The 7900 XTX keeps stuttering and freezing after quite some time when using the upscaling pipeline.

I don't know what it is, but it also never seems to actually run the inference.

It only loads the pipeline etc., but it never completes any inference.

Thank you so much for the help :)

@nitinmukesh

How much shared RAM do you have?

@BenjaminE98

BenjaminE98 commented Feb 22, 2025

> How much shared RAM do you have?

I would have to run it to see it again.

I have 128 GB RAM and 24 GB VRAM.

Earlier I had only 32 GB and thought that was the reason.

It looked like a typical freeze/stutter caused by RAM.

I think it happens only after it already shows 0/... inference steps.

EDIT: The command rocm-smi --showmeminfo gtt says 64 GB. I haven't owned the 7900 XTX for very long... please bear with me if this is not the shared memory. :)

When enabling VAE tiling it works, but it takes about 55 minutes for a 4x upscale of a 128 x 128 image.
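
For reference, VAE tiling (and slicing) is enabled directly on the pipeline's autoencoder. A minimal sketch, assuming `pipe` is an already-constructed diffusers pipeline whose `pipe.vae` is an AutoencoderKL:

# Decode latents tile by tile (and one image at a time) to cut peak VRAM
# during the VAE decode, at the cost of extra runtime.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()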

@NielsPichon
Author

I have tested group offloading with leaf-level offloading and it works really well 👍. Thanks for the help!
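
For anyone else landing here, a full leaf-level setup for the Flux ControlNet pipeline might look roughly like the sketch below. It follows the API from PR #10503 (names may have changed since), and the checkpoint identifiers are placeholders to swap for your own:

import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.hooks import apply_group_offloading

controlnet = FluxControlNetModel.from_pretrained(
    "InstantX/FLUX.1-dev-Controlnet-Canny",  # placeholder controlnet checkpoint
    torch_dtype=torch.bfloat16,
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # placeholder base checkpoint
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
)

# Offload every leaf module (linear, conv, norm, ...) to CPU and bring it back
# to the GPU only while it is needed. use_stream=True overlaps transfers with
# compute, so the slowdown stays small on GPUs that support CUDA streams.
for module in (pipe.transformer, pipe.controlnet, pipe.text_encoder, pipe.text_encoder_2, pipe.vae):
    apply_group_offloading(
        module,
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="leaf_level",
        use_stream=True,
    )

After this the pipeline is called exactly as usual; the hooks move each module on and off the GPU around its forward pass.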

@BenjaminE98

I tested it on 3 platforms.

On ROCm and an Nvidia RTX 5070 Ti it is not working (maybe due to drivers?).

On 4090s it runs really well :) Thank you :)

@a-r-r-o-w
Member

What are the CPU memory specs on the ROCm and RTX 5070 Ti machines? The current implementation of group offloading requires a lot of CPU memory, so it might be failing due to that -- we'll work on improving this soon. In any case, would you be able to share the errors you're facing?

@BenjaminE98

> What are the CPU memory specs on the ROCm and RTX 5070 Ti machines? The current implementation of group offloading requires a lot of CPU memory, so it might be failing due to that -- we'll work on improving this soon. In any case, would you be able to share the errors you're facing?

The specs are the following:

PC 1 (not working):
AMD 5950X
128 GB DDR4
AMD RX 7900 XTX

PC 2 (not working):
Intel 13700KF
64 GB DDR5
Nvidia 5070 Ti

PC 3 (working):
AMD 9950X
64 GB RAM
2x 4090

Unfortunately I can't provide any errors, as the PC just freezes after some time.

I would try to get further debug information, but I don't know where to start, as the PCs just keep hanging.

I could try to switch to another TTY (maybe), but so far the PCs have just kept freezing.

This is maybe some sort of driver problem (the AMD driver and the 5070 Ti driver may not be mature enough?).

Thanks for the support! :)
