Training NLP...and trying to create a chatbot
The Transformer is the base architecture for NLP tasks, more effective than RNNs, yet somehow more temperamental than them.
To begin with, the Transformer model receives not only the input sentence but also the target sentence, a bit like RNNs with Teacher Forcing. However, Teacher Forcing has a fundamental issue: everything goes fine in the training phase, since the model trains already knowing the answers. But what about the evaluation phase?
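Just to make the Teacher Forcing part concrete, here's a minimal PyTorch-style sketch of a training step, assuming a hypothetical seq2seq Transformer that takes `(src, decoder_input)` batches of token ids; `model` and `criterion` are placeholders:

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Additive mask so position i can only attend to positions <= i.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

def training_step(model, criterion, src, tgt):
    # Teacher Forcing: the decoder input is the target shifted right,
    # so the model predicts token t+1 while seeing the ground-truth tokens up to t.
    decoder_input = tgt[:, :-1]                    # <sos> w1 ... w(n-1)
    labels        = tgt[:, 1:]                     # w1 ... wn <eos>
    tgt_mask = causal_mask(decoder_input.size(1))
    logits = model(src, decoder_input, tgt_mask=tgt_mask)   # (batch, seq_len, vocab)
    return criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```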
The default approach is to simply make the model receive, as the target sentence, only a `<sos>` token, generate a single word by taking the most likely output, append this new word to the target sentence and repeat, until it produces an `<eos>` token. However, in (my) practical scenario, this tends to take too much time and the model is prone to producing nonsense, or even simply repeating the same token over and over. To avoid this issue, a common procedure is to run parallel iterations, generating multiple outputs from the same target sentence and selecting the one with the best score (perplexity/BLEU). However, such an approach demands too much computation power and is still prone to a certain randomness.
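For reference, this is roughly what that greedy, token-by-token decoding loop looks like, assuming the same hypothetical model interface (`sos_id`/`eos_id` are the ids of the start/end tokens, batch size 1):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, sos_id, eos_id, max_len=50):
    # Start the target sentence with only the <sos> token and extend it
    # one word at a time with the most likely prediction, until <eos>.
    generated = torch.tensor([[sos_id]], device=src.device)
    for _ in range(max_len):
        logits = model(src, generated)                             # (1, cur_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True) # most likely word
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == eos_id:
            break
    return generated.squeeze(0)
```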
An interesting alternative is to use Scheduled Sampling, where the target sentence is gradually replaced by the model's own output. This can be done during training, and it lets the model properly learn how to generate sentences in evaluation mode: https://arxiv.org/pdf/1906.07651.pdf
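A rough sketch of the two-pass idea from the paper, under the same assumed interface; `sampling_prob` is the probability of swapping a gold token for the model's own prediction, which you would increase over the epochs:

```python
import torch

def scheduled_sampling_step(model, criterion, src, tgt, sampling_prob):
    decoder_input = tgt[:, :-1]
    labels        = tgt[:, 1:]

    # 1st pass: plain Teacher Forcing, just to get the model's own predictions.
    with torch.no_grad():
        first_pass = model(src, decoder_input)          # (batch, seq, vocab)
        predictions = first_pass.argmax(dim=-1)         # the model's own tokens

    # Align predictions with the decoder input positions (keep <sos> at position 0).
    shifted_preds = torch.cat([decoder_input[:, :1], predictions[:, :-1]], dim=1)

    # 2nd pass: each gold token is replaced by the model's prediction
    # with probability `sampling_prob`.
    replace = torch.rand_like(decoder_input, dtype=torch.float) < sampling_prob
    replace[:, 0] = False                               # never replace <sos>
    mixed_input = torch.where(replace, shifted_preds, decoder_input)

    logits = model(src, mixed_input)
    return criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```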
Yet, the most effective (and coolest) approach is to use Reinforcement Learning, like what was done to train ChatGPT (https://arxiv.org/pdf/2203.02155.pdf). This method, however, will stay with Hakisa for now.
Another problem with the Transformer is that the model is quite unstable. It tends to produce gradients that are too small, which is what justified the inclusion of warmup steps before training, in order to properly align the weights and gradients of each layer.
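The usual warmup recipe is the one from the original "Attention Is All You Need" paper: increase the learning rate linearly for the first few thousand steps, then decay it with the inverse square root of the step. A minimal version, assuming a standard PyTorch optimizer:

```python
import torch

def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first `warmup_steps`, then inverse-sqrt decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Usage with a hypothetical model/optimizer (base lr of 1.0 so the lambda
# gives the actual learning rate); call scheduler.step() after every update:
# optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=transformer_lr)
```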
However, the instability doesn't come from the gradients themselves, but from the residual connections. They tend to get amplified through training, drowning out the outputs produced by the attention and feed-forward layers, even with Layer Normalization. To address this, a new parameter for scaling those outputs, and for initializing the model, has been suggested: https://aclanthology.org/2020.emnlp-main.463.pdf
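Very roughly, the idea boils down to something like this (a simplified sketch of a scaled residual connection, not the paper's exact initialization procedure):

```python
import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    # Wraps a sublayer (attention or feed-forward) with a learnable
    # per-dimension scale on the skip connection, so the residual branch
    # cannot drown out the sublayer's output.
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.omega = nn.Parameter(torch.ones(d_model))  # the paper re-initializes this after a profiling pass

    def forward(self, x):
        return self.norm(x * self.omega + self.sublayer(x))
```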
Bonus) some tips for training Transformers: https://arxiv.org/pdf/1804.00247.pdf
PS: Besides RNNs and Transformers, there's also a possible Text GAN I'm currently testing. But I suppose that, considering how unstable GANs are, the other two architectures will be preferable.