Difference between latents requires_grad=True and torch.no_grad() #34

Open

gvalvano opened this issue May 4, 2023 · 2 comments

gvalvano commented May 4, 2023

Thanks for sharing such amazing work :)

In the last section of the notebook Stable Diffusion Deep Dive.ipynb, you mention:

NB: We should set latents requires_grad=True before we do the forward pass of the unet (removing the with torch.no_grad()) if we want more accurate gradients. BUT this requires a lot of extra memory. You'll see both approaches used depending on whose implementation you're looking at.

Can you please clarify the difference between the two approaches? For example, if I had to code this, I would have used torch.no_grad(), but apparently you preferred another approach. What changes computationally and results-wise?

I think adding this as extra info to the notebook would be useful to others, too :)

johnowhitaker (Collaborator) commented

If we set requires_grad=True AFTER getting the noise prediction from the unet (the example shown), then the gradient of the loss function w.r.t. the latents tells us "how do I change these latents such that when I remove this noise it looks good".
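
A minimal sketch of what one guided denoising step looks like in that "shortcut" form (this isn't the notebook's exact code; `unet`, `scheduler`, the per-step `sigma`, `text_embeddings` and the guidance `loss_fn` are assumed to come from a diffusers-style setup like the one in the notebook):

```python
import torch

def guided_step_shortcut(unet, scheduler, latents, t, sigma,
                         text_embeddings, loss_fn, guidance_scale=100):
    # UNet forward pass under no_grad: the noise prediction is a constant
    # as far as autograd is concerned, so very little extra memory is used.
    with torch.no_grad():
        latent_model_input = scheduler.scale_model_input(latents, t)
        noise_pred = unet(latent_model_input, t,
                          encoder_hidden_states=text_embeddings).sample

    # Only AFTER the prediction do we ask autograd to track the latents
    latents = latents.detach().requires_grad_()

    # x0 estimate: differentiable w.r.t. the latents, but noise_pred is fixed
    denoised_latents = latents - sigma * noise_pred

    loss = loss_fn(denoised_latents) * guidance_scale
    cond_grad = torch.autograd.grad(loss, latents)[0]

    # Nudge the latents downhill, then take the normal scheduler step
    latents = latents.detach() - cond_grad * sigma ** 2
    return scheduler.step(noise_pred, t, latents).prev_sample
```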

If we set requires_grad=True BEFORE getting the noise prediction from the unet, then the noise prediction depends on the latents and the gradients can be traced back through the unet. So they tell us "how do I change these latents such that WHEN I FEED THEM THROUGH THE UNET AND THEN REMOVE THE PREDICTED NOISE it looks good".
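
The full-gradient version only moves the `requires_grad_()` call above the UNet forward pass, so the backward pass runs through the whole UNet (same hypothetical names and assumptions as the sketch above):

```python
import torch

def guided_step_full(unet, scheduler, latents, t, sigma,
                     text_embeddings, loss_fn, guidance_scale=100):
    # Ask autograd to track the latents BEFORE the UNet forward pass
    latents = latents.detach().requires_grad_()

    # UNet forward WITH gradient tracking: all activations are kept around
    latent_model_input = scheduler.scale_model_input(latents, t)
    noise_pred = unet(latent_model_input, t,
                      encoder_hidden_states=text_embeddings).sample

    # x0 estimate now depends on the latents both directly and via noise_pred
    denoised_latents = latents - sigma * noise_pred

    loss = loss_fn(denoised_latents) * guidance_scale
    # This backward pass goes through the UNet -> much more memory and compute
    cond_grad = torch.autograd.grad(loss, latents)[0]

    latents = latents.detach() - cond_grad * sigma ** 2
    return scheduler.step(noise_pred.detach(), t, latents).prev_sample
```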

The second case reflects what actually happens during sampling. We want to tweak the latents such that the final result (based on a prediction made with those modified latents) minimizes our loss. The first case tweaks the latents such that a prediction based on the unmodified latents minimizes the loss. It's a subtle difference, but especially in more complicated cases than the demo it does make a difference. For example, with CLIP guidance (where you try to minimize the difference between the generated image and a text or image prompt in CLIP embedding space) the results with the non-shortcut method seem to be better. The downside is much higher memory and compute usage, since the gradients need to be traced back through the UNet.
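
For concreteness, the CLIP guidance loss itself is usually just a cosine distance in embedding space, something like the sketch below (assuming you've already computed CLIP embeddings for the decoded/denoised images and for the prompt; making the image-embedding path differentiable is where the extra memory goes):

```python
import torch
import torch.nn.functional as F

def clip_guidance_loss(image_embeds: torch.Tensor,
                       prompt_embeds: torch.Tensor) -> torch.Tensor:
    # Cosine distance between denoised-image embeddings and the prompt embedding
    image_embeds = F.normalize(image_embeds, dim=-1)
    prompt_embeds = F.normalize(prompt_embeds, dim=-1)
    return (1.0 - (image_embeds * prompt_embeds).sum(dim=-1)).mean()
```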

If you're interested in guidance, there are other improvements worth looking at - https://arxiv.org/abs/2301.11558 shows an example of CLIP guidance in their repo (https://github.com/sWizad/split-diffusion) that uses some clever maths to make the guidance more stable. I tend to use one of their approaches in all my guidance stuff these days.

gvalvano (Author) commented May 4, 2023

I see. This is a very interesting topic and it looks much clearer to me now 😁
I will check the references, too. Thank you!
