Batch vs layer normalization description (page 560) #40
Thanks for the note! Phew, this is a tricky one. Originally:
Yeah, reading this again, it does sound a bit weird. Here is an attempt to clarify it:
Maybe I should also swap the figure for a better one. This is the figure from the original layer norm paper: [figure] I just found this helpful one here for transformer contexts: [figure] Do you think that showing it like this would make it clearer? We can potentially update the book, and I could swap it out. Do you think it's a helpful change?
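For what it's worth, here is a rough sketch (my own, not from the book or either figure) of the transformer-context layer norm being discussed: for an input of shape (batch, seq_len, d_model), the statistics are computed over the embedding dimension only, so every token in every example is normalized independently. This assumes PyTorch's built-in nn.LayerNorm.

```python
import torch

x = torch.randn(2, 5, 8)  # (batch, seq_len, d_model)

# LayerNorm over the last dimension: each token's embedding vector
# gets its own mean/variance, independent of the batch and of other tokens.
ln = torch.nn.LayerNorm(8)
out = ln(x)

# With the default affine parameters (weight=1, bias=0), every token is
# normalized to roughly zero mean and unit variance:
print(out.mean(dim=-1)[0, 0].item())                 # ~0.0
print(out.var(dim=-1, unbiased=False)[0, 0].item())  # ~1.0
```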
I don't have much practical experience in DL, but I suppose the following:
Hi Sebastian,
There is a description of batch and layer normalization (including a figure) on page 560:
"While layer normalization is traditionally performed across all elements in a given feature for each feature independently, the layer normalization used in transformers extends this concept and computes the normalization statistics across all feature values independently for each training example."
Is layer normalization described correctly in the first case? It seems that when we calculate statistics for each feature independently (i.e., across the batch), we are performing batch normalization.
Thank you.
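For concreteness, here is a minimal sketch (not from the book) of the distinction being asked about, assuming a 2D input of shape (batch_size, num_features): batch norm takes its statistics per feature across the batch, while layer norm takes them per example across the features.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)  # (batch_size, num_features)

# Batch norm: one mean/variance per feature, computed across the batch (dim=0).
bn_mean = x.mean(dim=0, keepdim=True)                # shape (1, 3)
bn_var = x.var(dim=0, unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + 1e-5)

# Layer norm: one mean/variance per example, computed across the features (dim=1).
ln_mean = x.mean(dim=1, keepdim=True)                # shape (4, 1)
ln_var = x.var(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / torch.sqrt(ln_var + 1e-5)

# The built-in modules agree (BatchNorm1d in training mode, no affine params):
print(torch.allclose(x_bn, torch.nn.BatchNorm1d(3, affine=False)(x), atol=1e-4))
print(torch.allclose(x_ln, torch.nn.functional.layer_norm(x, (3,)), atol=1e-4))
```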