
BPT – Ben's Pretrained Transformer ✨

This repository contains my implementation of an autoregressive transformer language model in PyTorch, which I am working on as a personal project to familiarize myself with recent advances in deep learning. Because I'm building the model from scratch, I got to pick all my favorite transformer bells and whistles, which are detailed below. So far, I've been experimenting by training a 1.3B-parameter variant of this model on Jean Kaddour's MiniPile, which contains ~1.5B tokens from The Pile. Rumor has it this dataset is not the best, but it's a convenient size for experiments! Once I've figured out a training setup I'm happy with, I'll train on something bigger, like C4 or OpenWebText.

Model Architecture

  • Position embeddings: Right now, you can choose rotary embeddings (used in PaLM), rotary embeddings with XPos (which helps with length extrapolation), or no position embeddings at all. (Yes, believe it or not, autoregressive language models can figure out positions without them, thanks to the causal attention mask! See https://arxiv.org/abs/2210.15424.) I'll probably add ALiBi at some point too, since it seems pretty neat. Note that rotary embeddings are noticeably slower than using no position embeddings at all; I'm working on optimizing their performance, and for now I'm using Phil Wang's implementation from PaLM-rlhf-pytorch. A minimal sketch of rotary embeddings appears after this list.
  • Parallel Transformer Block: Following GPT-J and PaLM, the attention and feed-forward layers in each transformer block run in parallel instead of in series. This lets you fuse the matrix multiplications for $X \to QKV$ and the feed-forward input projection into a single operation! (This is supposed to provide performance benefits; see the parallel-block sketch after this list.) It should also reduce communication costs when sharding the model across devices, which I'm not doing just yet, but... y'know. Something to look forward to.
  • SwiGLU: I use the SiLU variant of Gated Linear Units for the feed-forward blocks, where $\mathrm{FFN}(X) = W_{out}(W_a X \odot \mathrm{SiLU}(W_b X))$, as proposed by Noam Shazeer in GLU Variants Improve Transformer (sketched below). SiLU is a non-linear activation function also known as "Swish".
  • RMSNorm: The consensus seems to be that RMSNorm (where you just divide a tensor by its root-mean-square, then rescale) doesn't really do anything that LayerNorm doesn't, but since it's a little faster, might as well use it! (Sketched below.)
  • No Biases: Following PaLM, linear layers are all just projections with no added bias, and normalization layers have a rescale factor but no bias. This was found to speed up the model (fewer operations!) without hurting performance (high-dimensional space is weird, man...); in fact, the PaLM authors found that removing biases stabilized training at larger model sizes.
  • Multi-Query Attention: Not implemented yet, but I'll eventually add it, as it greatly improves decoding efficiency without hurting performance.
  • Tying Embeddings: I provide the option to tie the token embeddings to the output head (the projection from the hidden state back into the vocabulary); see the weight-tying sketch below. This is a pretty common practice, and it doesn't seem to really help or hurt: tying saves some parameters, while untied embeddings are supposedly easier to interpret.
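
For reference, here's a minimal sketch of rotary position embeddings in the standard "rotate-half" formulation. This is not the code from this repo (that currently comes from Phil Wang's PaLM-rlhf-pytorch); the function names are made up for illustration.

```python
import torch

def build_rope_cache(seq_len, head_dim, base=10000.0):
    # One rotation frequency per pair of channels in each attention head.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)      # (seq_len, head_dim // 2)
    angles = torch.cat((angles, angles), dim=-1)   # (seq_len, head_dim)
    return angles.cos(), angles.sin()

def rotate_half(x):
    # Split the last dimension in half, then rotate each (x1, x2) pair.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin):
    # x: (batch, n_heads, seq_len, head_dim); applied to queries and keys only.
    return (x * cos) + (rotate_half(x) * sin)
```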
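
The SwiGLU feed-forward block is a direct translation of the formula above. A minimal sketch (illustrative names, not the repo's actual module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_a = nn.Linear(dim, hidden_dim, bias=False)    # gated branch
        self.w_b = nn.Linear(dim, hidden_dim, bias=False)    # gating (SiLU) branch
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x):
        # FFN(x) = W_out (W_a x ⊙ SiLU(W_b x))
        return self.w_out(self.w_a(x) * F.silu(self.w_b(x)))
```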
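
RMSNorm is just a few lines (same imports as above; again, an illustrative sketch rather than the repo's exact code):

```python
class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-8):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))  # rescale only, no bias

    def forward(self, x):
        # Divide by the root-mean-square over the last dimension, then rescale.
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.scale
```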
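
And here's roughly what the parallel block looks like when the pieces come together: one fused input projection produces Q, K, V and both SwiGLU branches, and the attention and feed-forward outputs are added to the residual stream together. This is a simplified sketch, not the block in this repo: it reuses the RMSNorm sketch above, relies on PyTorch 2.x's F.scaled_dot_product_attention, and omits rotary embeddings.

```python
class ParallelBlock(nn.Module):
    def __init__(self, dim, n_heads, ffn_hidden):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.ffn_hidden = ffn_hidden
        self.norm = RMSNorm(dim)  # single pre-norm shared by both branches
        # Fused input projection: Q, K, V plus the two SwiGLU branches.
        self.fused_in = nn.Linear(dim, 3 * dim + 2 * ffn_hidden, bias=False)
        self.attn_out = nn.Linear(dim, dim, bias=False)
        self.ffn_out = nn.Linear(ffn_hidden, dim, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        h = self.norm(x)
        q, k, v, ff_a, ff_b = self.fused_in(h).split(
            [d, d, d, self.ffn_hidden, self.ffn_hidden], dim=-1)
        # Attention branch (rotary embeddings omitted here for brevity).
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = self.attn_out(attn.transpose(1, 2).reshape(b, t, d))
        # Feed-forward branch (SwiGLU) computed from the same normalized input.
        ffn = self.ffn_out(ff_a * F.silu(ff_b))
        # Parallel residual: both branch outputs are added to the input.
        return x + attn + ffn
```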
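
Weight tying itself is a one-liner in PyTorch: point the output head's weight at the embedding table. A sketch (class and argument names are made up for illustration):

```python
class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim, tie_embeddings=True):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)
        if tie_embeddings:
            # Both matrices are (vocab_size, dim), so the head can share
            # the embedding table's parameters directly.
            self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx):
        h = self.tok_emb(idx)
        # ...the stack of transformer blocks would run here...
        return self.lm_head(h)  # logits over the vocabulary
```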

References

Relevant Papers

Code References

  • Andrej Karpathy's minGPT: Referenced for some tricks related to the implementation of multi-head attention. Also, for BPE, I borrowed the pre-tokenization regex and the byte-to-character mapping (which are in turn borrowed from the OpenAI implementation).
  • Einops Documentation: Referenced for more tricks related to multi-head attention, namely Einstein notation for tensor rearrangement.
  • Phil Wang's ViT repository: Referenced for more attention tricks (wrapping the attention and FFN blocks in a "PreNorm" layer, which results in a much cleaner transformer block implementation).
