This repository contains several models that analyze the opinions left by travelers on Twitter. The data comes from a Kaggle competition run by a Spanish airline. The data has been preprocessed, and the following techniques have been tried for classifying it:
- Bag of Words (TF-IDF): a baseline pipeline sketch follows this list.
- Random Forest
- GaussianNB
- XGBoost
- Word embeddings (GloVe)
- CNN with kernel size = 1: there is a video explaining this technique (see also the PyTorch sketch after this list).
- FastText: a simple and efficient model for text classification.
- BETO: a BERT model pretrained on Spanish.
- GRUs: gated recurrent units.
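To make the classical approach concrete, here is a minimal sketch of the Bag-of-Words pipeline: TF-IDF features fed into one of the classifiers listed above. The example tweets and hyperparameters are illustrative assumptions, not values taken from the repository.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy tweets and sentiment labels (1 = positive, 0 = negative); the real data
# comes from the Kaggle competition and is preprocessed beforehand.
tweets = [
    "El vuelo salió a tiempo, muy buena atención",
    "Perdieron mi maleta, pésimo servicio",
]
labels = [1, 0]

# TF-IDF bag of words followed by a classifier; GaussianNB or XGBoost can be
# swapped in for the Random Forest (GaussianNB needs dense features).
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=20000)),
    ("clf", RandomForestClassifier(n_estimators=300, random_state=42)),
])
model.fit(tweets, labels)
print(model.predict(["Excelente experiencia con la aerolínea"]))
```

The CNN with kernel size = 1 can be sketched in PyTorch as a learned per-token projection followed by global max pooling. The layer sizes below are assumptions, not the configuration actually used in the repository.

```python
import torch
import torch.nn as nn

class KernelOneCNN(nn.Module):
    """Sentiment classifier: embeddings -> Conv1d(kernel_size=1) -> max pool -> linear."""

    def __init__(self, vocab_size, embed_dim=100, num_filters=128, num_classes=2):
        super().__init__()
        # The embedding layer could be initialized with pretrained GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # A kernel of size 1 acts as a per-token projection shared across positions.
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=1)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)              # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))       # (batch, num_filters, seq_len)
        x = x.max(dim=2).values            # global max pooling over the sequence
        return self.fc(x)                  # class logits

# Example forward pass with random token ids
model = KernelOneCNN(vocab_size=20000)
logits = model(torch.randint(1, 20000, (8, 40)))  # batch of 8 tweets, 40 tokens each
```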
All models have undergone a fine-tuning process to get the best performance out of them.
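As an illustration of what such a tuning step can look like for the classical models, here is a sketch using GridSearchCV over a TF-IDF + XGBoost pipeline; the search space below is an assumption, not the grid actually used in the repository.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", XGBClassifier()),
])

# Hypothetical search space; the grids used in the repository may differ.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__max_depth": [3, 6],
    "clf__n_estimators": [200, 400],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
# search.fit(tweets, labels)   # tweets/labels: the preprocessed texts and sentiment labels
# print(search.best_params_, search.best_score_)
```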
Figure 1: Results of the experiment for a balanced Dataset
Figure 2: Results of the experiment for an unbalanced Dataset
As can be seen in the figures, the connectionist approach (deep learning) gives better results on both datasets, and with balanced data the models reach 80% accuracy. CNN and FastText are fast and effective methods; despite being more powerful, the transformers fall below these two. I suspect the reason is that transformers are designed for large volumes of data, so with as few samples as here (only 7,000), they still give good results, but not as impressive as in other applications.
- Python 🐍
- Sklearn 🧮
- PyTorch ❤️
- Hugging Face 🤖