movie-chatbot

Jupyter Notebook:

This notebook can be run entirely on Google Colab, try it!

Useful links:

Vocabulary list

Create the vocabulary list from all the word stems found in the training set.

Algorithm
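As a minimal sketch of this step, the snippet below collects the unique stems of an already pre-processed training set into a sorted list. The function name `build_vocabulary`, the `word_index` lookup and the sample sentences are assumptions made for illustration; they are not taken from the notebook.

```python
def build_vocabulary(tokenized_texts):
    """Return a sorted list of the unique stems seen in the training set."""
    vocabulary = set()
    for tokens in tokenized_texts:
        vocabulary.update(tokens)
    return sorted(vocabulary)


# Hypothetical training set, already lower-cased and stemmed.
training_set = [
    ["recommend", "comedi", "movi"],
    ["who", "direct", "movi", "questionmark"],
]

vocabulary = build_vocabulary(training_set)

# Position of each stem in the vocabulary, handy when filling feature vectors.
word_index = {word: i for i, word in enumerate(vocabulary)}
```

Sorting keeps the position of each stem stable between runs, so a feature vector built today lines up with one built tomorrow.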

Text processing

Can be done with:

spaCy
NLTK

Lower case
Standardizing numbers (e.g. '12' -> 'number')
Transforming question marks ('?' -> 'questionmark')
Word stemming (e.g. 'discount', 'discounts', 'discounted', 'discounting' -> 'discount')
Removal of non-useful characters/words (e.g. stop words, punctuation)
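The steps above could be chained roughly as follows. This is only a sketch using the standard library: `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as the ones shipped with NLTK), and the `STOP_WORDS` set is a tiny assumed list, not the one used in the notebook.

```python
import re

# Assumed, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "for", "of", "is", "to"}

# Crude stand-in for a real stemmer: strip a few common English suffixes.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = text.lower()                         # lower case
    text = re.sub(r"\d+", " number ", text)     # '12' -> 'number'
    text = text.replace("?", " questionmark ")  # '?' -> 'questionmark'
    text = re.sub(r"[^a-z\s]", " ", text)       # drop remaining punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


tokens = preprocess("Is the ticket discounted for 12 movies?")
```

Replacing numbers and question marks before stripping punctuation matters here: done in the other order, the '?' would be deleted before it could be turned into a token.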

Features

Word to vectors:

For the input text, fill a list of the same size as the vocabulary list with the score of each word. The following scoring methods can be used for n-grams:
- Binary, i.e. whether the word is present in the text or not
- Count, i.e. the number of times the word appears in the text
- Frequency, i.e. Count / Total number of words in the text
- TF-IDF (Term Frequency – Inverse Document Frequency), i.e. the score increases with the word frequency, but a penalty is applied if the word is widely used in the training set (like 'for', 'a', 'the'). The scores have the effect of highlighting words that are distinctive (contain useful information) in a given text.
If the dataset is big and the sentences are short, we can use word embeddings.
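The four scoring methods listed above can be sketched in plain Python as shown below. The function names and the sample data are hypothetical, and the TF-IDF formula uses one common smoothed variant of the inverse document frequency, `log((1 + N) / (1 + df)) + 1`; other variants exist and the notebook may use a different one.

```python
import math
from collections import Counter


def binary_vector(tokens, vocabulary):
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def count_vector(tokens, vocabulary):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def frequency_vector(tokens, vocabulary):
    total = len(tokens)
    return [c / total if total else 0.0 for c in count_vector(tokens, vocabulary)]


def tfidf_vector(tokens, vocabulary, training_set):
    n_docs = len(training_set)
    tf = frequency_vector(tokens, vocabulary)
    scores = []
    for word, tf_w in zip(vocabulary, tf):
        # Number of training texts that contain the word.
        doc_freq = sum(1 for doc in training_set if word in doc)
        # Smoothed IDF (assumed variant): words found everywhere get idf = 1.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores.append(tf_w * idf)
    return scores


# Hypothetical stemmed training set: 'movi' appears in every text.
training_set = [["good", "movi"], ["bad", "movi"], ["long", "movi"]]
vocabulary = ["bad", "good", "long", "movi"]
tokens = ["good", "movi"]

binary = binary_vector(tokens, vocabulary)
counts = count_vector(tokens, vocabulary)
freqs = frequency_vector(tokens, vocabulary)
tfidf = tfidf_vector(tokens, vocabulary, training_set)
```

In this example 'good' and 'movi' have the same frequency in the input, but 'movi' occurs in every training text, so its TF-IDF score ends up lower than that of 'good'.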

Run time on Colab: 25 min; run time locally: 22 min

Classification

SVM (support vector machine) is a machine learning algorithm; MLP (multilayer perceptron) is a deep learning algorithm.

Notes

The number of features must be the same for any text
The structure and order of the words are lost
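Both notes can be checked with a small count-based example. The vocabulary and sentences here are made up for illustration: every text maps to a vector of the same length as the vocabulary, and two sentences with the same words in a different order produce identical vectors.

```python
from collections import Counter

# Hypothetical stemmed vocabulary.
vocabulary = ["bite", "dog", "man"]


def count_vector(tokens, vocabulary):
    counts = Counter(tokens)
    # Words outside the vocabulary are simply ignored.
    return [counts[word] for word in vocabulary]


dog_bites_man = count_vector(["dog", "bite", "man"], vocabulary)
man_bites_dog = count_vector(["man", "bite", "dog"], vocabulary)
longer_text = count_vector(["dog", "bite", "man", "cat", "again"], vocabulary)
```

A classifier fed with these vectors cannot tell "dog bites man" from "man bites dog", which is the price paid for a fixed-size input.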