Training NLP...and trying to create a chatbot
The Transformer is the base architecture for NLP tasks, more effective than RNNs, yet somehow more temperamental than them.
To begin with, the Transformer model receives not only the input sentence but also the target sentence, a bit like RNNs with Teacher Forcing. However, Teacher Forcing has a fundamental issue: everything goes fine in the training phase, since the model trains already knowing the answers. But what about the evaluation phase?
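Just to make the Teacher Forcing part concrete, here's a minimal PyTorch-style sketch of a training step, assuming a hypothetical seq2seq Transformer that takes `(src, decoder_input)` batches of token ids; `model` and `criterion` are placeholders:

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Additive mask so position i can only attend to positions <= i.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

def training_step(model, criterion, src, tgt):
    # Teacher Forcing: the decoder input is the target shifted right,
    # so the model predicts token t+1 while seeing the ground-truth tokens up to t.
    decoder_input = tgt[:, :-1]                    # <sos> w1 ... w(n-1)
    labels        = tgt[:, 1:]                     # w1 ... wn <eos>
    tgt_mask = causal_mask(decoder_input.size(1))
    logits = model(src, decoder_input, tgt_mask=tgt_mask)   # (batch, seq_len, vocab)
    return criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```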
The default approach is to simply make the model receive, as the target sentence, only a `<sos>` token, generate a single word by taking the most likely output, append this new word to the target sentence and repeat, until it produces an `<eos>` token. However, in (my) practical scenario, this tends to take too much time and the model is prone to producing nonsense, or even simply repeating the same token over and over. To avoid this issue, a common procedure is to run parallel iterations, generating multiple outputs from the same target sentence and selecting the one with the best score (perplexity/BLEU). However, such an approach demands too much computation power and is still prone to a certain randomness.
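For reference, this is roughly what that greedy, token-by-token decoding loop looks like, assuming the same hypothetical model interface (`sos_id`/`eos_id` are the ids of the start/end tokens, batch size 1):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, sos_id, eos_id, max_len=50):
    # Start the target sentence with only the <sos> token and extend it
    # one word at a time with the most likely prediction, until <eos>.
    generated = torch.tensor([[sos_id]], device=src.device)
    for _ in range(max_len):
        logits = model(src, generated)                             # (1, cur_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True) # most likely word
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == eos_id:
            break
    return generated.squeeze(0)
```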
An interesting alternative is to use Scheduled Sampling, where the target sentence is gradually replaced by the model's own output. This can be done during training, and it lets the model properly learn how to generate sentences in evaluation mode: https://arxiv.org/pdf/1906.07651.pdf
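A rough sketch of the two-pass idea from the paper, under the same assumed interface; `sampling_prob` is the probability of swapping a gold token for the model's own prediction, which you would increase over the epochs:

```python
import torch

def scheduled_sampling_step(model, criterion, src, tgt, sampling_prob):
    decoder_input = tgt[:, :-1]
    labels        = tgt[:, 1:]

    # 1st pass: plain Teacher Forcing, just to get the model's own predictions.
    with torch.no_grad():
        first_pass = model(src, decoder_input)          # (batch, seq, vocab)
        predictions = first_pass.argmax(dim=-1)         # the model's own tokens

    # Align predictions with the decoder input positions (keep <sos> at position 0).
    shifted_preds = torch.cat([decoder_input[:, :1], predictions[:, :-1]], dim=1)

    # 2nd pass: each gold token is replaced by the model's prediction
    # with probability `sampling_prob`.
    replace = torch.rand_like(decoder_input, dtype=torch.float) < sampling_prob
    replace[:, 0] = False                               # never replace <sos>
    mixed_input = torch.where(replace, shifted_preds, decoder_input)

    logits = model(src, mixed_input)
    return criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```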
Yet, the most effective (and coolest) approach is to use Reinforcement Learning, like what was done to train ChatGPT (https://arxiv.org/pdf/2203.02155.pdf). This method, however, will stay with Hakisa for now.
Another problem with the Transformer is that the model is quite unstable. It tends to produce gradients that are too small, which is what justified the inclusion of warmup steps before training, in order to properly align the weights and gradients of each layer.
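The usual warmup recipe is the one from the original "Attention Is All You Need" paper: increase the learning rate linearly for the first few thousand steps, then decay it with the inverse square root of the step. A minimal version, assuming a standard PyTorch optimizer:

```python
import torch

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first `warmup_steps`, then inverse-sqrt decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Usage with a hypothetical model/optimizer (base lr of 1.0 so the lambda
# gives the actual learning rate); call scheduler.step() after every update:
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)
```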
However, the instability doesn't come from the gradients themselves, but from the residual connections. They tend to get amplified through training, drowning out the outputs produced by the attention and feed-forward layers, even with Layer Normalization. To address this, a new parameter for scaling those outputs, and for initializing the model, has been suggested: https://aclanthology.org/2020.emnlp-main.463.pdf
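Very roughly, the idea boils down to something like this (a simplified sketch of a scaled residual connection, not the paper's exact initialization procedure):

```python
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    # Wraps a sublayer (attention or feed-forward) with a learnable
    # per-dimension scale on the skip connection, so the residual branch
    # cannot drown out the sublayer's output.
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.omega = nn.Parameter(torch.ones(d_model))  # the paper re-initializes this after a profiling pass

    def forward(self, x):
        return self.norm(x * self.omega + self.sublayer(x))
```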
Bonus) some tips for training Transformers: https://arxiv.org/pdf/1804.00247.pdf
PS: Besides RNNs and Transformers, there's also a possible Text GAN I'm currently testing. But I suppose that, considering how unstable GANs are, the other two architectures will be preferable.