movie-chatbot

Jupyter notebook:

This notebook can be run entirely on Google Colab, try it! --> Open In Colab

Useful links:

Vocabulary list

  • Create the vocabulary list from all the word stems found in the training set
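
As a rough illustration of this step, the sketch below builds a vocabulary from texts that are assumed to be already tokenized and stemmed. The function name `build_vocabulary` and the sample data are hypothetical and are not taken from the notebook.

```python
def build_vocabulary(tokenized_texts):
    """Return the sorted unique tokens and a token -> index lookup."""
    # Collect every distinct stem seen anywhere in the training set.
    vocabulary = sorted({token for tokens in tokenized_texts for token in tokens})
    # The index gives each stem a fixed position in the feature vector.
    index = {token: i for i, token in enumerate(vocabulary)}
    return vocabulary, index


# Hypothetical training set, already lower-cased and stemmed.
training_set = [["movie", "discount"], ["discount", "day"]]
vocabulary, index = build_vocabulary(training_set)
```

Sorting is only there to make the positions reproducible between runs; any fixed ordering would work.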

Algorithm

Text processing

Can be done with:

  1. Lower case
  2. Standardizing numbers (ex. '12' -> 'number')
  3. Transform question mark ('?' -> 'questionmark')
  4. Word Stemming (ex. 'discount', 'discounts', 'discounted', 'discounting' -> 'discount')
  5. Removal of non-useful characters/words (ex. stop words, punctuation)
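
The five steps above can be sketched in plain Python as follows. This is a minimal sketch, not the notebook's implementation: `toy_stem` is a deliberately crude suffix stripper standing in for a real stemmer, and `STOP_WORDS` is a tiny made-up list used only for illustration.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "for", "of", "to", "is", "in"}


def toy_stem(word):
    """Crude suffix stripper used as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = text.lower()                                   # 1. lower case
    text = re.sub(r"\d+", " number ", text)               # 2. standardize numbers
    text = text.replace("?", " questionmark ")            # 3. question marks
    # 5. drop the remaining punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 5. drop stop words
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens]                  # 4. stemming
```

For example, `preprocess("Is the movie discounted for 12 days?")` returns `['movie', 'discount', 'number', 'day', 'questionmark']`. The question mark is replaced before punctuation is stripped so that it survives as a token.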

Features

Word to vectors:

  1. For the input text, fill a list the size of the vocabulary list with the score of each word. The following scoring methods can be used for n-grams:
    • Binary, i.e. whether the word is present in the text or not
    • Count, i.e. the number of times the word appears in the text
    • Frequency, i.e. Count / Total number of words in the text
    • TF-IDF (Term Frequency – Inverse Document Frequency), i.e. the score increases with the word's frequency in the text, but a penalty is applied if the word is widely used across the training set (like 'for', 'a', 'the'). The scores highlight words that are distinctive (carry useful information) in a given text.
  2. If the dataset is big and the sentences are short, we can use word embeddings.
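
The four scoring methods can be sketched with the standard library alone. The function names are hypothetical, and the IDF formula used here, `log((1 + N) / (1 + df)) + 1`, is one common smoothed variant chosen as an assumption; the notebook may use a different one.

```python
import math
from collections import Counter


def vectorize(tokens, vocabulary, method="count"):
    """Score each vocabulary word for one tokenized text."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    if method == "binary":
        return [1 if counts[word] else 0 for word in vocabulary]
    if method == "count":
        return [counts[word] for word in vocabulary]
    if method == "frequency":
        return [counts[word] / total for word in vocabulary]
    raise ValueError(f"unknown method: {method}")


def idf_weights(corpus, vocabulary):
    """Smoothed inverse document frequency of each vocabulary word."""
    n_docs = len(corpus)
    # Number of texts in which each word appears at least once.
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    return [math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocabulary]


def tfidf(tokens, vocabulary, idf):
    """Term frequency weighted by the precomputed IDF values."""
    tf = vectorize(tokens, vocabulary, "frequency")
    return [t * w for t, w in zip(tf, idf)]
```

Words that occur in every training text get the lowest IDF weight, which is the penalty described above for widely used words.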

Time on Colab: 25 min; time locally: 22 min

Classification

  1. SVM (support vector machine) is a machine learning algorithm; MLP (multilayer perceptron) is a deep learning algorithm

Notes

  • The number of features must be the same for any text
  • The structure and order of the words are lost
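
Both notes can be seen in a small example. The sketch below uses a hypothetical three-word vocabulary and a simple count vector: every text maps to a vector of the same length, and two sentences with opposite meanings end up with identical vectors.

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary, for illustration only.
vocabulary = ["bites", "dog", "man"]

a = count_vector("dog bites man".split(), vocabulary)
b = count_vector("man bites dog".split(), vocabulary)
c = count_vector("dog".split(), vocabulary)

same_length = len(a) == len(b) == len(c)  # fixed number of features
order_lost = a == b                       # word order is not encoded
```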