This is the paper presentation repo for "Attention Is All You Need".
- The paper proposes a new network architecture, the Transformer, based solely on attention mechanisms. The Transformer is more parallelizable and requires less time to train, demonstrated on the WMT machine translation benchmarks.
- The Transformer contains an Encoder and a Decoder. Each of them contains a stack of 6 identical layers.
- For the Encoder, each layer has 2 sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network (see the layer sketch after this list).
- The Decoder is also composed of 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
- An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The Transformer model in this paper uses "Scaled Dot-Product Attention" (a minimal sketch follows below).
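The paper's Scaled Dot-Product Attention is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal PyTorch sketch of that formula; the tensor shapes and the `mask` argument are illustration choices of mine, not code from the paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5              # similarity of each query with each key
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # blocked positions get ~zero weight
    weights = F.softmax(scores, dim=-1)                        # attention distribution over the keys
    return weights @ v, weights                                # weighted sum of values, plus the weights
```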
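And a rough sketch of the layer structure described above (two sub-layers in the encoder, three in the decoder), using PyTorch's built-in `nn.MultiheadAttention` for brevity. The residual-plus-LayerNorm wrapping and the defaults (d_model=512, 8 heads, d_ff=2048) follow the paper's base configuration; everything else (no dropout, no positional encoding) is a simplification, not the authors' code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: (1) multi-head self-attention, (2) position-wise feed-forward,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                # x: (batch, src_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)            # sub-layer 1: self-attention
        x = self.norm1(x + attn_out)                     # residual + LayerNorm
        x = self.norm2(x + self.ffn(x))                  # sub-layer 2: feed-forward
        return x

class DecoderLayer(nn.Module):
    """One decoder layer adds a third sub-layer: multi-head attention over the
    encoder stack's output (queries from the decoder, keys/values from the encoder)."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, tgt_mask=None):
        # tgt_mask: optional mask (nn.MultiheadAttention treats True entries as blocked)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=tgt_mask)   # masked self-attention
        x = self.norm1(x + attn_out)
        attn_out, _ = self.cross_attn(x, enc_out, enc_out)          # attention over encoder output
        x = self.norm2(x + attn_out)
        x = self.norm3(x + self.ffn(x))
        return x
```

For example, `EncoderLayer()(torch.randn(2, 10, 512))` returns a tensor of the same shape, and stacking 6 of these layers gives the encoder described above.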
Why does the Masked Multi-Head Attention need a mask? (A small mask sketch follows the questions below.)
Why is there Multi-Head Attention?
Why use attention at all?
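On the first question: the decoder's self-attention is "masked" so that position i can only attend to positions up to i, which preserves the auto-regressive property during training. Below is a tiny sketch of such a causal (look-ahead) mask; building it with `torch.tril` is my own illustration. Passed as `mask` to the `scaled_dot_product_attention` sketch above, the `False` entries are set to -inf before the softmax.

```python
import torch

# Causal mask for a 4-token target sequence: row i is the query position,
# and only columns j <= i are True (allowed), so no token can "see" the future.
seq_len = 4
causal = torch.tril(torch.ones(seq_len, seq_len)).bool()
print(causal)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```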
- The paper proposed a new model architecture, but it is only evaluated on translation tasks.
- I think every part of the Transformer could be explained in more detail.
- The Transformer in this paper contains 2 parts, the encoder and the decoder. They can also be used separately, as in BERT (encoder-only) and GPT (decoder-only).
- I think the attention mechanism could make deep learning models more explainable.
- Original Article: https://arxiv.org/abs/1706.03762v5
- A TensorFlow tutorial with Transformer code: https://www.tensorflow.org/text/tutorials/transformer
- The Illustrated Transformer, a very intuitive explanation: http://jalammar.github.io/illustrated-transformer/
Link to the notebook: https://colab.research.google.com/github/bentrevett/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb#scrollTo=8hOLjW7rJKJL
Link to video recording: https://drive.google.com/file/d/1td5ZYbuOeuZ8Hta0rR_TuCtCNS9CQTWB/view?usp=sharing