
Lower VRAM usage in CPU offload for Flux ControlNet Pipeline #10790

Open
NielsPichon opened this issue Feb 14, 2025 · 14 comments

@NielsPichon

Is your feature request related to a problem? Please describe.

I have a 24 GB VRAM GPU. When running a diffusion model like Flux1, I can barely fit the model in memory during inference with batch size 1. Enabling CPU offload does not help, because the offload does not occur between the controlnet forward pass and the transformer forward pass (which makes sense performance-wise).

It would be great to enable offloading between the controlnet call and the transformer denoising step (or for any other auxiliary model that does not currently get offloaded in the middle of the denoising process) to further reduce VRAM requirements.

Describe the solution you'd like.

What I would suggest is having a "slow" offload mode where the models do get offloaded to CPU, even if it is really slow.

def enable_sequential_cpu_offload(self, gpu_id: Optional[int] = None, device: Union[torch.device, str] = "cuda", enable_slow_mode: bool = False):
    ...

For instance, in the image-to-image pipeline on line 927:

controlnet_block_samples, controlnet_single_block_samples = self.controlnet(
    ...
)

if self._enable_slow_cpu_offload:
    self.maybe_free_model_hooks()

...

noise_pred = self.transformer(
    ...
)[0]

if self._enable_slow_cpu_offload:
    self.maybe_free_model_hooks()

Describe alternatives you've considered.

I am not sure there are alternatives if these models are to be usable at the desired floating-point precision (in my case bfloat16).

@a-r-r-o-w
Member

@NielsPichon We recently shipped "group offloading" in #10503. Would you like to give it a try and see if it helps? I was able to run the model in ~1 GB VRAM without much of a hit to generation time. We're further working on some improvements that will allow lowering this even more.

You can configure the number of internal transformer blocks to load at a time per model, or perform offloading at the lowest possible leaf-module level. What you describe as "slow offload mode" is the default behaviour with group offloading, i.e., NEED LAYER X -> LOAD LAYER X on GPU -> PERFORM FORWARD -> OFFLOAD LAYER X. As you mention a 24 GB GPU, it's safe to assume it is a modern GPU that supports CUDA streams -- so you can benefit from offloading without taking much of a performance hit if you specify to use that option. LMK if you need any help with examples apart from the ones in the PR :)
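
For concreteness, a minimal sketch of what that configuration could look like (the import path and argument names follow PR #10503 and may differ in the released API; `pipe` is assumed to be an already-loaded FluxControlNetPipeline in bfloat16):

import torch
from diffusers.hooks import apply_group_offloading

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")

# Block-level offloading: keep only `num_blocks_per_group` transformer blocks
# on the GPU at a time. With use_stream=True the next group is prefetched on a
# separate CUDA stream, which hides most of the transfer latency.
for module in (pipe.transformer, pipe.controlnet):
    apply_group_offloading(
        module,
        onload_device=onload_device,
        offload_device=offload_device,
        offload_type="block_level",
        num_blocks_per_group=2,
        use_stream=True,
    )

Passing offload_type="leaf_level" instead offloads at the individual leaf-module level for the smallest possible footprint.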

@NielsPichon
Author

If this works (and I have no doubt it will ;) ) this is absolutely brilliant! I will give it a try and keep you posted on how it goes. In the meantime you may consider this issue as resolved.

Thanks for your help 🙂

@BenjaminE98

> If this works (and I have no doubt it will ;) ) this is absolutely brilliant! I will give it a try and keep you posted on how it goes. In the meantime you may consider this issue as resolved.
>
> Thanks for your help 🙂

Hey!

I am also interested.

I tried applying group offloading to the controlnet itself and to the individual components like the transformer, text encoders, etc., but couldn't figure it out.

Thank you in advance :)

@a-r-r-o-w
Member

a-r-r-o-w commented Feb 18, 2025

@BenjaminE98 Could you let me know what errors you're facing and share a code example? I'll be better able to help debug that way. I also just gave it a spin on SDXL ControlNet and it worked without problems. Our tests on dummy models for Flux ControlNet, or <any_model> + ControlNet, don't seem to fail either:

diffusers/tests/pipelines/controlnet_flux/test_controlnet_flux.py

Line 54 in 924f880

    test_group_offloading = True

@BenjaminE98

> @BenjaminE98 Could you let me know what errors you're facing and share a code example? I'll be better able to help debug that way. I also just gave it a spin on SDXL ControlNet and it worked without problems. Our tests on dummy models for Flux ControlNet, or <any_model> + ControlNet, don't seem to fail either:
>
> diffusers/tests/pipelines/controlnet_flux/test_controlnet_flux.py
> Line 54 in 924f880
>
>     test_group_offloading = True

Hey!

Thank you very much for the fast response.

I can't use my machine right now, as I had RAM stability issues with new RAM; I can hopefully access my machine again on Friday (fingers crossed).

I tried something like what is mentioned here:

#10797

I tried applying group offloading to the text_encoder, text_encoder_2, vae and transformer.

VRAM usage caused my GPU to freeze (7900 XTX; it seemed to exceed 24 GB).

I will try again and get back to you as soon as possible.

@nitinmukesh

@BenjaminE98

See if this helps

#10840

@BenjaminE98

> @BenjaminE98
>
> See if this helps
>
> #10840

Thank you very much!

Your script was working, but other problems arose.

Really appreciated the help!

@BenjaminE98

> @BenjaminE98 Could you let me know what errors you're facing and share a code example? I'll be better able to help debug that way. I also just gave it a spin on SDXL ControlNet and it worked without problems. Our tests on dummy models for Flux ControlNet, or <any_model> + ControlNet, don't seem to fail either:
>
> diffusers/tests/pipelines/controlnet_flux/test_controlnet_flux.py
> Line 54 in 924f880
>
>     test_group_offloading = True

Hey there!

Just wanted to report back!

The feature did drastically reduce VRAM usage.

However, I have another problem, which might be related to AMD or something similar.

I don't know if I should open a new issue, but I will do some more research first.

The 7900 XTX keeps stuttering and freezing after quite some time when using the upscaling pipeline.

I don't know what it is, but it also never seems to actually run the inference.

It only loads the pipeline etc., but it never completes any inference.

Thank you so much for the help :)

@nitinmukesh

How much shared RAM do you have?

@BenjaminE98

BenjaminE98 commented Feb 22, 2025

> How much shared RAM do you have?

I would have to run it to see it again.

I have 128 GB RAM and 24 GB VRAM.

Earlier I had only 32 GB and thought that was the reason.

It looked like a typical freeze/stutter caused by RAM.

I think it happens only after it already shows 0/... inference steps.

EDIT: The command rocm-smi --showmeminfo gtt says 64 GB. I haven't owned the 7900 XTX for very long... please bear with me if this is not the shared memory. :)

When enabling VAE tiling it works, but it takes about 55 minutes for a 4x upscale of a 128 x 128 image.
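
For reference, VAE tiling (and slicing) is enabled directly on the pipeline's autoencoder. A minimal sketch, assuming `pipe` is an already-constructed diffusers pipeline whose `pipe.vae` is an AutoencoderKL:

# Decode latents tile by tile (and one image at a time) to cut peak VRAM
# during the VAE decode, at the cost of extra runtime.
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()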

@NielsPichon
Author

I have tested group offloading with leaf-level offloading and it works really well 👍. Thanks for the help!
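
For anyone else landing here, a full leaf-level setup for the Flux ControlNet pipeline might look roughly like the sketch below. It follows the API from PR #10503 (names may have changed since), and the checkpoint identifiers are placeholders to swap for your own:

import torch
from diffusers import FluxControlNetModel, FluxControlNetPipeline
from diffusers.hooks import apply_group_offloading

controlnet = FluxControlNetModel.from_pretrained(
    "InstantX/FLUX.1-dev-Controlnet-Canny",  # placeholder controlnet checkpoint
    torch_dtype=torch.bfloat16,
)
pipe = FluxControlNetPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # placeholder base checkpoint
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
)

# Offload every leaf module (linear, conv, norm, ...) to CPU and bring it back
# to the GPU only while it is needed. use_stream=True overlaps transfers with
# compute, so the slowdown stays small on GPUs that support CUDA streams.
for module in (pipe.transformer, pipe.controlnet, pipe.text_encoder, pipe.text_encoder_2, pipe.vae):
    apply_group_offloading(
        module,
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="leaf_level",
        use_stream=True,
    )

After this the pipeline is called exactly as usual; the hooks move each module on and off the GPU around its forward pass.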

@BenjaminE98

I tested it on 3 platforms.

On ROCm and an Nvidia RTX 5070 Ti it is not working (maybe due to drivers?).

On 4090s it runs really well :) Thank you :)

@a-r-r-o-w
Member

What are the CPU memory specs on the ROCm and RTX 5070 Ti machines? The current implementation of group offloading requires a lot of CPU memory, so it might be failing due to that -- we'll work on improving this soon. In any case, would you be able to share the errors you're facing?

@BenjaminE98

> What are the CPU memory specs on the ROCm and RTX 5070 Ti machines? The current implementation of group offloading requires a lot of CPU memory, so it might be failing due to that -- we'll work on improving this soon. In any case, would you be able to share the errors you're facing?

The specs are the following:

PC 1 (not working):
AMD 5950X
128 GB DDR4
AMD RX 7900 XTX

PC 2 (not working):
Intel 13700KF
64 GB DDR5
Nvidia 5070 Ti

PC 3 (working):
AMD 9950X
64 GB RAM
2x 4090

Unfortunately I can't provide any errors, as the PC just freezes after some time.

I would try to get further debug information, but I don't know where to start, as the PCs just keep hanging.

I could try to switch to another TTY (maybe), but so far the PCs have just kept freezing.

This is maybe some sort of driver problem (the AMD driver and the 5070 Ti driver may not be mature enough?).

Thanks for the support! :)
