-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement GPU memory optimization #1
Comments
Thanks for all the details, that's very interesting and it would be great to support GPUs with less memory. I have a 8GB GPU and was only able to run the code on cpu. PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 \
cargo run --example stable-diffusion --features clap -- --cpu-for-vae --unet-weights data/unet-fp16.ot Fwiw, |
8 bit optimization as well https://github.com/TimDettmers/bitsandbytes |
Thanks, that's great. I got it working yesterday in fp16 but my solution had a few hacks, your solution is much cleaner. I can also confirm that it works for me using the vae on the GPU (with its fp16 version). And that clip works on fp16 too, also on the GPU. I had to force clip on the GPU because you didn't include the option, is there a reason for only allowing it on the CPU? So it works for me with all fp16 and on my GPU (2070 Super with 8GB). I'm using everything from v1.5 of stable diffusion (https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/fp16) , including the clip model from text_encoder. Probably it's better suggest to source the clip model from there too in the example. |
The main reason why I've put clip model on the cpu by default was that I find it to run very quickly compared to the rest of the pipeline so didn't find much of an advantage to running it on the gpu and it's using a bit of memory there, also it seemed a bit painful for users willing to run everything on the cpu to have to specify 3 flags. Anyway, I've tweaked the cpu flag a bit so that users can set it to |
I've started to test the generation speed and I've realized that even though we are loading fp16 weights, all the computations are in fp32. This results in a ~2x slowdown with respect to the python implementation I'm testing against (Automatic1111), that's why I've noticed. I've added my fp16 hacks back and now the speed is comparable to the python one. Basically, what I've done is replace all Kind::Float with Kind::Half in attention.rs, embeddings.rs, unet_2d.rs and clip.rs But in clip.rs I need to keep the last Float and I use mask.fill_(f32::MIN as f64).triu_(1).unsqueeze(1).to_kind(Kind::Half) instead. On the pipeline I do vs.half(); after loading the weights (and the equivalent for vs_ae and vs_unet). And finally, when generating the random latents I also use Kind::Half. This is really just a hack that forces to always use Kind::Half, and we should do it in a configurable way, that's why I'm just explaining the hack instead of creating a pull request. Any preferences on how to implement this the right way? Once we have the hack in place and using sliced attention of size 1 I've compared it to Automatic1111 with the xformers attention. In this case we only lose ~25% of performance or maybe even less. We also lose a little bit of max image size, due to the higher memory needs. But in any case, I'm very satisfied with the sliced attention performance even if supporting xformers would be a really welcomed optimization. |
Right, it would be better to use fp16 all the way, I actually mentioned earlier giving a try to |
I just saw you comment, autocast seems great, it's probably what we need, I'll have a look tomorrow. In the mean time, I've created this draft PR that includes my changes for Kind::Half as well as other things for img2img just in case anyone wants to try before we get autocast working: #6 |
Unfortunately those weight are for Stable Diffusion v1.5 and I think is not compatible with v2.x, just tried (I have a RTX 2070 Super too) and I get this error at runtime:
Any hint here to get it running or I can use sadly only the CPU version? 😢 |
Have you tried the weights for the 2.1 version? I would guess they are here on huggingface though I haven't tried. |
Thanks @LaurentMazare! Tried just now, converted from the original weight but now I get the out of memory issue:
Is this the fp16 version? |
I think so as it's in the |
When stable diffusion was released, the most requested features were always about reducing GPU requirements as this makes it available to more users with cheaper GPUs. I think this is one important feature to implement in diffusers-rs.
Here it's a list of the diffusers library optimizations: https://github.com/huggingface/diffusers/blob/main/docs/source/optimization/fp16.mdx
The main memory reduction feature I think would be allowing half precision models (fp16), that AFAIK is not yet implemented (but I might have missed the conversion somewhere).
I don't know if there is a better option to set the fp16 requirement in advance, but otherwise just loading the VarStore in a CPU device, calling half() on it, and moving it to the GPU using set_device should do the trick. This can be done in the example without touching the library I think.
Then, the next important thing would be Sliced Attention. This should be really straight forward, as we can see here it's just splitting the normal attention into slices and computing each slice at a time in a loop: https://github.com/huggingface/diffusers/blob/08a6dc8a5840e0cc09e65e71e9647321ab9bb254/src/diffusers/models/attention.py#L526
Then, is just a mater of exposing an slice_size configuration element on CrossAttention, and add an attention_slice_size to anyone that uses it.
Supporting Flash Attention from https://github.com/HazyResearch/flash-attention would be nice, but it's much more complicated as needs to be compiled for Cuda and it only works with Nvidia Cards. But this optimization is the main one used by the xformers library related to attention and provides a very good speedup in many cases.
Finally, the last important thing missing would be to allow move some models to the CPU when not in use (see huggingface/diffusers#850), or even run them on the CPU as needed only leaving the unet on the GPU (see huggingface/diffusers#537).
They use the accelerate library, but I think this can be implemented directly on tch. I think just providing a set_device method on all the models would be enough, as everything else can be handled directly on the example or the user code.
What do you think? I'm planning to play a little bit with the diffusers-rs library and stable diffusion the next weeks and I can try to implement a few optimizations, but my knowledge on ML is still very basic.
The text was updated successfully, but these errors were encountered: