Transformers are RNNs (link: https://arxiv.org/pdf/2006.16236.pdf) describes constant-memory gradient computation, which should be entirely realizable for this project (the math is reasonably similar in structure, at least).

Right now memory usage does seem to scale with sequence length, though that may just be because longer inputs need more memory to store all the embeddings. If so, constant-memory gradients wouldn't help the training path that runs in parallel.

They will matter for infinite context in the RNN formulation, though! (see #14)
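For reference, here is a minimal sketch of the recurrent (constant-memory) formulation of linear attention from the linked paper, not this project's actual code. The feature map `elu(x) + 1` follows the paper; the function and variable names here are just illustrative. The key point is that the running state `(S, z)` has a fixed size regardless of how many tokens have been processed.

```python
# Rough sketch of linear attention in its RNN form (Katharopoulos et al., 2020).
# The state (S, z) is O(d_k * d_v), independent of sequence length, which is
# what makes constant-memory inference over an unbounded context possible.
import torch

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, the positive feature map suggested in the paper.
    return torch.nn.functional.elu(x) + 1

def linear_attention_rnn(queries, keys, values, eps=1e-6):
    # queries, keys: (seq_len, d_k); values: (seq_len, d_v)
    seq_len, d_k = queries.shape
    d_v = values.shape[1]
    S = torch.zeros(d_k, d_v)   # running sum of phi(k_i) v_i^T
    z = torch.zeros(d_k)        # running sum of phi(k_i), for normalization
    outputs = []
    for i in range(seq_len):
        phi_q = elu_feature_map(queries[i])
        phi_k = elu_feature_map(keys[i])
        S = S + torch.outer(phi_k, values[i])
        z = z + phi_k
        # Output at step i depends only on the fixed-size state, not the full history.
        outputs.append((phi_q @ S) / (phi_q @ z + eps))
    return torch.stack(outputs)

# Example usage with arbitrary sizes:
q, k, v = torch.randn(16, 32), torch.randn(16, 32), torch.randn(16, 64)
print(linear_attention_rnn(q, k, v).shape)  # torch.Size([16, 64])
```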