Fix layer_normalize gradients #3001
Conversation
It still doesn't fix the discrepancy between the GPU and CPU implementations, but it fixes a bug in the implementation.
I'm looking at this and I'm not sure what the code is supposed to be doing. Go through the contracts and make sure they are right. For instance, the existing contract says beta and gamma are the same shape and all the dimensions are 1 except one. That's totally at the root of the problem here. Everything starts with having contracts that are right. Then put DLIB_ASSERT statements that check all the requires statements, so you know for sure they are not being violated. That will chase down the problem. Although you've got to decide what the arguments are first; not sure what you want them to be for this layer.
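(A minimal sketch of the kind of shape contract being discussed, written in numpy for brevity rather than dlib's C++; the function name and exact shape rule here are assumptions, not dlib's actual contract:)

```python
import numpy as np

# Hypothetical illustration (not dlib's code) of the contract above:
# gamma and beta have the same shape, every dimension is 1 except one,
# and that dimension must match the corresponding input dimension.
def check_layer_norm_contract(x, gamma, beta):
    assert gamma.shape == beta.shape, "gamma and beta must have the same shape"
    non_singleton = [d for d, size in enumerate(gamma.shape) if size != 1]
    assert len(non_singleton) <= 1, "all dimensions of gamma/beta must be 1 except one"
    for d in non_singleton:
        assert gamma.shape[d] == x.shape[d], "the non-singleton dimension must match the input"

x = np.zeros((20, 5, 10, 10))  # N, C, H, W
check_layer_norm_contract(x, np.ones((1, 5, 1, 1)), np.zeros((1, 5, 1, 1)))  # per-channel params
```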
Sorry, I never really looked at this. You write such good PRs that I just kinda skimmed this one and was like "yeah, another Adria PR, going to be great and looks great 👍 :D" without really reading it all.
Oh, right, what was I thinking. It looks like I got confused half-way through the code: I should normalize each channel independently, but I ended up trying to normalize across the whole sample instead. Hopefully, I will fix that and the dangling pointer tonight, if life allows.
I was just checking this again: https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html

It seems like it does normalize along C, H, W, and there's one beta and one gamma for each normalized item.

```python
import torch

N, C, H, W = 20, 5, 10, 10
x = torch.randn(N, C, H, W)
layer_norm = torch.nn.LayerNorm([C, H, W])
y = layer_norm(x)
sum(p.numel() for p in layer_norm.parameters() if p.requires_grad)  # 1000 = 2 * (5 * 10 * 10)
```

So, maybe the issue is just the dangling pointers? I will make sure the contracts are correct and respected, though.

EDIT: after necrobumping ConvNeXt, each LayerNorm only has 2 * C learnable parameters (beta and gamma), so the implementation here is wrong. You're right about the dimensions of beta and gamma: they should only have C elements each.
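For comparison, a ConvNeXt-style per-channel LayerNorm in PyTorch carries only 2 * C parameters; this snippet (mine, not from the PR) demonstrates that by applying the normalization channels-last:

```python
import torch

N, C, H, W = 20, 5, 10, 10
x = torch.randn(N, C, H, W)

# ConvNeXt-style LayerNorm: normalized_shape is just C, applied channels-last,
# so gamma and beta each have C elements.
layer_norm = torch.nn.LayerNorm(C)
y = layer_norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(sum(p.numel() for p in layer_norm.parameters() if p.requires_grad))  # 10 = 2 * 5
```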
Yeah, might be just the dangling pointer and everything else is fine.
I am now confident about the CPU implementation; however, the CUDA version still fails.
I honestly don't know what else to do. If you run the checks, all of them are within the error tolerance of 1e-5 or 1e-4.
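(A gradient comparison of that kind can be sketched like this; the function name and data are hypothetical, not the PR's test code:)

```python
import numpy as np

# Hypothetical sketch of a CPU-vs-CUDA gradient comparison: every element
# must agree within an absolute tolerance, e.g. 1e-5 as mentioned above.
def assert_grads_close(cpu_grad, cuda_grad, atol=1e-5):
    max_diff = np.max(np.abs(cpu_grad - cuda_grad))
    assert max_diff <= atol, f"gradients differ by {max_diff:.3g} (tolerance {atol:g})"
```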
Ok, I fixed a race condition. EDIT: after running a clean build, it's working!
Nice. I'm away from my computer. I'll look in a bit. Seems like you got it 🥳
It took an awful lot of time...
dlib/cuda/cuda_dlib.cu (outdated)
```cuda
// For each (sample n, channel k) pair, accumulate the gradients of beta and
// gamma, then combine the per-thread partial sums with a warp reduction.
for (auto nk : grid_stride_range_y(0, ns * ks))
{
    const auto n = nk / ks;
    const auto k = nk % ks;
    const auto ps = s + (n * ks + k) * num;    // input values for this (n, k)
    const auto pgi = gi + (n * ks + k) * num;  // incoming gradient for this (n, k)
    float temp_bg = 0;
    float temp_gg = 0;
    for (auto i : grid_stride_range(0, num))
    {
        const float x_hat = (ps[i] - m[n]) * v[n];  // normalized input
        temp_bg += pgi[i];
        temp_gg += pgi[i] * x_hat;
    }
    warp_reduce_atomic_add(bg[k], temp_bg);
    warp_reduce_atomic_add(gg[k], temp_gg);
}
__syncthreads();
```
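For readers following along, here is one reading of what this loop accumulates, as a numpy sketch; the array names mirror the kernel's, and the memory layout is my assumption:

```python
import numpy as np

# s:  layer input, shape (ns, ks, num)  -> samples, channels, pixels per channel
# gi: incoming gradient, same shape
# m:  per-sample mean; v: per-sample inverse std, as indexed by the kernel above
ns, ks, num = 4, 3, 8
s = np.random.randn(ns, ks, num).astype(np.float32)
gi = np.random.randn(ns, ks, num).astype(np.float32)
m = s.reshape(ns, -1).mean(axis=1)
v = 1.0 / s.reshape(ns, -1).std(axis=1)

x_hat = (s - m[:, None, None]) * v[:, None, None]  # normalized input
bg = gi.sum(axis=(0, 2))             # beta gradient: one value per channel k
gg = (gi * x_hat).sum(axis=(0, 2))   # gamma gradient: one value per channel k
```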
Yeah that kind of warp reduction loop is the best way I know to do it too.
After the previous two commits, training went from 320 img/s to 2450 img/s (close to the official CUDA/cuDNN batch norm at 2560 img/s).
Yeah that's awesome. All the tests are passing for me too on a GPU machine. Passing for you too now? Anything else you want to change before I merge it? :D
Nothing else to add, I think it's done now. FINALLY. And yes, tests are passing now :D
Yeah nice, thanks for all the good work. Looks perfect :D
Closes #2902