
Batch vs layer normalization description (page 560) #40

Open
labdmitriy opened this issue Mar 23, 2022 · 2 comments

@labdmitriy

Hi Sebastian,

There is a description of batch and layer normalization (including a figure) on page 560:

"While layer normalization is traditionally performed across all elements in a given feature for each feature independently, the layer normalization used in transformers extends this concept and computes the normalization statistics across all feature values independently for each training example."

Is layer normalization described correctly in the first case? It seems that when we calculate the statistics for each feature independently, we are performing batch normalization rather than layer normalization.
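
To illustrate what I mean, here is a small toy sketch (my own example, not from the book) of where the statistics are computed in each case:

```python
import torch

# Toy batch: 4 training examples, 3 features
x = torch.randn(4, 3)

# Batch normalization: one mean/std per feature, computed across the batch dimension
bn_mean = x.mean(dim=0)                      # shape (3,)
bn_std = x.std(dim=0, unbiased=False)
x_bn = (x - bn_mean) / bn_std

# Layer normalization: one mean/std per training example, computed across the features
ln_mean = x.mean(dim=1, keepdim=True)        # shape (4, 1)
ln_std = x.std(dim=1, unbiased=False, keepdim=True)
x_ln = (x - ln_mean) / ln_std
```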

Thank you.

@rasbt
Owner

rasbt commented Apr 15, 2022

Thanks for the note! Phew, this is a tricky one.

Originally:

[Screenshot of the original passage on page 560]

While layer normalization is traditionally performed across all elements in a given feature for each feature independently, the layer normalization used in transformers extends this concept and computes the normalization statistics across all feature values independently for each training example.

Yeah, reading this again, it does sound a bit weird. Here is an attempt to clarify it:

While layer normalization is traditionally performed across all feature values in a given layer for each training example independently, the layer normalization used in transformers extends this concept and computes the normalization statistics across all feature values for a given sentence token position independently for each training example.
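
In code, the statistics I have in mind for the transformer case would look roughly like this (just a quick sketch, ignoring the learnable scale and shift parameters):

```python
import torch

batch_size, seq_len, embed_dim = 2, 5, 8
x = torch.randn(batch_size, seq_len, embed_dim)

# Normalize across the embedding (feature) dimension for each token position,
# independently for each training example
mean = x.mean(dim=-1, keepdim=True)                       # shape (2, 5, 1)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_norm = (x - mean) / torch.sqrt(var + 1e-5)

# PyTorch's nn.LayerNorm over the embedding dimension computes the same statistics
layer_norm = torch.nn.LayerNorm(embed_dim, elementwise_affine=False)
print(torch.allclose(x_norm, layer_norm(x), atol=1e-5))   # True
```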

Maybe I should also swap the figure for a better one. This is the figure from the original layer norm paper:

[Figure from the original layer normalization paper]

I just found this helpful one for transformer contexts:
[Figure illustrating layer normalization in the transformer context]

Do you think showing it like this would make it clearer? We can potentially update the book, and I could swap it out. Do you think it's a helpful change?

@labdmitriy
Author

labdmitriy commented Apr 15, 2022

I don't have much practical experience in DL, but I would note the following:

  • The first picture has dimension annotations that are more specific to CNNs than to sequences.
    This can also be confusing when reading the PyTorch documentation, where, for example, a convolutional layer uses N, C, L (for 1d) or N, C, H, W (for 2d) dimensions, while an embedding produces the swapped ordering (N, L, C), and other layers can use (L, N, C) dimensions:
    https://discuss.pytorch.org/t/inconsistent-dimension-ordering-for-1d-networks-ncl-vs-nlc-vs-lnc/14807
  • The transformer implementations I have seen, including "The Annotated Transformer", use LayerNorm with normalization only over the last (embedding) dimension, so it seems that classical layer normalization is used (see the sketch after this list).
  • I suppose the second picture is from this article; if you mean the left case in that picture, then it is probably better for understanding your phrasing. As I understand it, this type of layer normalization is not described in detail in the article, only mentioned for reference.
  • Unfortunately, I couldn't find a description of this extended type of normalization either in articles (for example, here) or in implementations, and it would be great if you could point to any additional information about it.
  • In any case, it seems strange to me that this article, with its new picture, uses the same name (layer normalization) for an algorithm that differs from classical layer normalization. For example, if "we just remove the sum over N in the previous equation compared to BN" (quoted from here), the result has its own name, instance normalization, but for our case I couldn't find a separate name.
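
To make the first two points more concrete, here is a small sketch of my own (just an illustration, not taken from any of the implementations above) of the dimension ordering and of LayerNorm normalizing only over the last dimension:

```python
import torch
import torch.nn as nn

batch_size, seq_len, embed_dim = 2, 5, 8

# BatchNorm1d expects (N, C, L), i.e., the feature/channel dimension second;
# statistics are computed per channel, across the batch and sequence positions
x_ncl = torch.randn(batch_size, embed_dim, seq_len)
batch_norm = nn.BatchNorm1d(embed_dim)
out_bn = batch_norm(x_ncl)

# nn.Embedding and typical transformer blocks instead produce (N, L, C),
# and LayerNorm normalizes only over the last (embedding) dimension,
# i.e., per token position and per training example
x_nlc = torch.randn(batch_size, seq_len, embed_dim)
layer_norm = nn.LayerNorm(embed_dim)
out_ln = layer_norm(x_nlc)

print(out_bn.shape, out_ln.shape)  # torch.Size([2, 8, 5]) torch.Size([2, 5, 8])
```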
