
Implement GPU memory optimization #1

Open

siriux opened this issue Nov 6, 2022 · 11 comments

@siriux
Contributor

siriux commented Nov 6, 2022

When Stable Diffusion was released, the most requested features were always about reducing GPU memory requirements, as this makes it available to more users with cheaper GPUs. I think this is an important feature to implement in diffusers-rs.

Here is a list of the diffusers library optimizations: https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx

The main memory reduction feature, I think, would be allowing half-precision models (fp16), which AFAIK is not yet implemented (but I might have missed the conversion somewhere).

I don't know if there is a better option to set the fp16 requirement in advance, but otherwise just loading the VarStore on a CPU device, calling half() on it, and moving it to the GPU using set_device should do the trick. I think this can be done in the example without touching the library.
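
A minimal sketch of that idea, assuming tch's VarStore exposes half() and set_device() as described (the weight path is just a placeholder):

use tch::{nn, Device};

fn main() -> anyhow::Result<()> {
    // Load the weights on the CPU first so the full-precision tensors never
    // have to fit in GPU memory all at once.
    let mut vs = nn::VarStore::new(Device::Cpu);
    // ... build the model against vs.root() here, then load its weights ...
    vs.load("data/unet.ot")?;
    // Convert every variable in the store to half precision (fp16).
    vs.half();
    // Move the (now fp16) weights onto the GPU.
    vs.set_device(Device::cuda_if_available());
    Ok(())
}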

Then, the next important thing would be sliced attention. This should be really straightforward; as we can see here, it's just a matter of splitting the normal attention into slices and computing one slice at a time in a loop: https://github.com/huggingface/diffusers/blob/08a6dc8a5840e0cc09e65e71e9647321ab9bb254/src/diffusers/models/attention.py#L526

Then it's just a matter of exposing a slice_size configuration element on CrossAttention, and adding an attention_slice_size option to everything that uses it.
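
As a rough illustration, a minimal sketch of the slicing loop in tch (the function name and signature here are made up, not the actual diffusers-rs API):

use tch::Tensor;

/// Sliced attention sketch: split the (batch * heads) dimension into chunks of
/// `slice_size` and run standard attention on one chunk at a time, trading a
/// bit of compute for a much smaller peak memory footprint.
fn sliced_attention(query: &Tensor, key: &Tensor, value: &Tensor, slice_size: i64) -> Tensor {
    let (batch_heads, _seq_len, dim) = query.size3().unwrap();
    let scale = (dim as f64).powf(-0.5);
    let mut outputs = Vec::new();
    let mut start = 0;
    while start < batch_heads {
        let len = slice_size.min(batch_heads - start);
        let q = query.narrow(0, start, len);
        let k = key.narrow(0, start, len);
        let v = value.narrow(0, start, len);
        // Plain scaled dot-product attention, but only on this slice.
        let attn = (q.matmul(&k.transpose(-1, -2)) * scale).softmax(-1, q.kind());
        outputs.push(attn.matmul(&v));
        start += len;
    }
    Tensor::cat(&outputs, 0)
}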

Supporting Flash Attention from https://github.com/HazyResearch/flash-attention would be nice, but it's much more complicated, as it needs to be compiled for CUDA and only works with Nvidia cards. But this optimization is the main attention-related one used by the xformers library, and it provides a very good speedup in many cases.

Finally, the last important thing missing would be allowing some models to be moved to the CPU when not in use (see huggingface/diffusers#850), or even running them on the CPU as needed, leaving only the unet on the GPU (see huggingface/diffusers#537).

They use the accelerate library, but I think this can be implemented directly in tch. Just providing a set_device method on all the models should be enough, as everything else can be handled directly in the example or in user code.
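
As a hypothetical sketch of that (Vae here just stands in for the real model types, and it relies on the same VarStore set_device idea as above):

use tch::{nn, Device};

// Keeping the VarStore next to the model makes it easy to shuttle the whole
// thing between devices, e.g. parking the vae on the CPU while the unet runs
// the denoising loop on the GPU.
struct Vae {
    vs: nn::VarStore,
    // ... layers built from vs.root() ...
}

impl Vae {
    fn set_device(&mut self, device: Device) {
        self.vs.set_device(device);
    }
}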

What do you think? I'm planning to play a little bit with the diffusers-rs library and stable diffusion in the next few weeks, and I can try to implement a few optimizations, but my knowledge of ML is still very basic.

@LaurentMazare
Owner

Thanks for all the details, that's very interesting, and it would be great to support GPUs with less memory. I have an 8GB GPU and was only able to run the code on the CPU.
Based on your details and links, I was able to get the code to run using the fp16 weights for the unet (available here) and making a couple of changes to run only the unet on the GPU:

PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 \
    cargo run --example stable-diffusion --features clap -- --cpu-for-vae --unet-weights data/unet-fp16.ot

Fwiw, autocast is also available in tch, but somehow it did not seem to help much; I probably messed something up and will have to dig further.

@geocine

geocine commented Nov 7, 2022

8-bit optimization as well: https://github.com/TimDettmers/bitsandbytes

@siriux
Contributor Author

siriux commented Nov 7, 2022

Thanks, that's great. I got it working yesterday in fp16, but my solution had a few hacks; yours is much cleaner.

I can also confirm that it works for me with the vae on the GPU (in its fp16 version), and that clip works in fp16 too, also on the GPU. I had to force clip onto the GPU because you didn't include the option; is there a reason for only allowing it on the CPU?

So it works for me with all fp16 and on my GPU (2070 Super with 8GB).

I'm using everything from v1.5 of stable diffusion (https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/fp16), including the clip model from text_encoder. It's probably better to suggest sourcing the clip model from there too in the example.

@LaurentMazare
Owner

The main reason I've put the clip model on the CPU by default is that I find it runs very quickly compared to the rest of the pipeline, so I didn't see much of an advantage to running it on the GPU, where it would use a bit of memory. It also seemed a bit painful for users wanting to run everything on the CPU to have to specify 3 flags. Anyway, I've tweaked the cpu flag a bit so that users can set it to -cpu all, or -cpu vae -cpu clip, etc.

@siriux
Contributor Author

siriux commented Nov 8, 2022

I've started to test the generation speed and realized that even though we are loading fp16 weights, all the computations are in fp32. This results in a ~2x slowdown with respect to the python implementation I'm testing against (Automatic1111), which is how I noticed.

I've added my fp16 hacks back and now the speed is comparable to the python one. Basically, what I've done is replace all Kind::Float with Kind::Half in attention.rs, embeddings.rs, unet_2d.rs and clip.rs. But in clip.rs I need to keep the last Float, and I use mask.fill_(f32::MIN as f64).triu_(1).unsqueeze(1).to_kind(Kind::Half) instead.

In the pipeline I call vs.half(); after loading the weights (and the equivalent for vs_ae and vs_unet). And finally, when generating the random latents I also use Kind::Half.
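
Concretely, the pipeline-side part of the hack looks roughly like this, inside the example's main where vs, vs_ae and vs_unet already exist (bsize, height, width and cuda_device are placeholders):

// Force every var store to half precision right after loading the weights.
vs.half();
vs_ae.half();
vs_unet.half();

// Generate the initial random latents in half precision as well, so the unet
// never sees an fp32 input.
let latents = Tensor::randn(
    &[bsize, 4, height / 8, width / 8],
    (Kind::Half, cuda_device),
);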

This is really just a hack that forces Kind::Half everywhere, and we should do it in a configurable way; that's why I'm just explaining the hack instead of creating a pull request. Any preferences on how to implement this the right way?

With the hack in place and using sliced attention of size 1, I've compared it to Automatic1111 with the xformers attention. In this case we only lose ~25% of performance, or maybe even less. We also lose a little bit of max image size, due to the higher memory needs. But in any case, I'm very satisfied with the sliced attention performance, even if supporting xformers would be a very welcome optimization.

@LaurentMazare
Owner

Right, it would be better to use fp16 all the way. I actually mentioned earlier giving autocast a try, which is probably what the python version does too, but somehow the generated images look bad there. We should probably try to get to the bottom of this, as autocast is fairly nice and should ensure that fp16 is used where appropriate (and fp32 is still used on some small bits that require more precision).
You can see an example of how to use autocast in this test snippet.
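
Roughly, it amounts to wrapping the forward pass in a closure, assuming the autocast(enabled, closure) wrapper from that test (the unet call and variable names below are just placeholders for the example's own code):

// Inside the closure, matmuls and convolutions get dispatched in fp16 while
// precision-sensitive ops stay in fp32.
let noise_pred = tch::autocast(true, || {
    unet.forward(&latents, timestep, &text_embeddings)
});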

@siriux
Contributor Author

siriux commented Nov 8, 2022

I just saw your comment; autocast seems great, it's probably what we need. I'll have a look tomorrow.

In the meantime, I've created this draft PR, which includes my changes for Kind::Half as well as other things for img2img, just in case anyone wants to try before we get autocast working: #6

@Emulator000

Based on your details and links, I was able to get the code to run using the fp16 weights for unet (available here) and making a couple changes to run the unet only on the GPU..

Unfortunately those weights are for Stable Diffusion v1.5, and I think they are not compatible with v2.x. I just tried (I have an RTX 2070 Super too) and I get this error at runtime:

Building the Clip transformer.
Building the autoencoder.
Building the unet.
Error: cannot find the tensor named up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias in data/unet_v1.5_fp16.ot

Any hint on how to get it running, or can I sadly only use the CPU version? 😢

@LaurentMazare
Owner

Have you tried the weights for the 2.1 version? I would guess they are here on huggingface, though I haven't tried.

@Emulator000

Emulator000 commented Apr 19, 2023

Thanks @LaurentMazare!

Tried just now; I converted from the original weights, but now I get an out-of-memory issue:

CUDA out of memory. Tried to allocate 3.16 GiB (GPU 0; 7.78 GiB total capacity; 3.52 GiB already allocated; 2.62 GiB free; 3.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is this the fp16 version?

@LaurentMazare
Owner

I think so, as it's in the fp16 branch of the huggingface repo. You can probably check by loading it in python and looking at the dtype reported by torch.
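
If you'd rather stay in Rust, a quick sketch for checking the dtype with tch (the path is just a placeholder):

use tch::Tensor;

fn main() -> anyhow::Result<()> {
    // Load the named tensors from the converted .ot file and print their dtypes;
    // an fp16 export should report Kind::Half for the weights.
    let named_tensors = Tensor::load_multi("data/unet_v2.1_fp16.ot")?;
    for (name, tensor) in named_tensors.iter().take(5) {
        println!("{name}: {:?}", tensor.kind());
    }
    Ok(())
}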
