Application for detecting and identifying Arabic dialects using machine learning and NLP.
Arabic is a macro-language with rich morphology and many varieties, which makes it one of the more challenging languages for natural language processing and language identification. This project uses machine learning and NLP to build an Arabic dialect identification model trained on 450k tweets across 18 classes (Arabic countries).
Training three classical machine learning models
- Multinomial Naive Bayes Classifier
- Random Forest Classifier
- Linear Model with SGD learning
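The three classical models above can all share a single scikit-learn pipeline; the vectorizer settings and country codes below are illustrative assumptions, not taken from the repository:

```python
# Sketch of the classical training setup; TF-IDF settings and labels are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

MODELS = {
    "Multinomial NB": MultinomialNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SGD": SGDClassifier(loss="hinge", random_state=0),  # linear SVM trained with SGD
}

def build_pipeline(name):
    """TF-IDF over character n-grams, then the chosen classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
        ("clf", MODELS[name]),
    ])

# toy usage with a handful of dialect snippets
texts = ["ازيك عامل ايه", "شلونك شخبارك", "كيفاش راك", "ايه الاخبار"]
labels = ["EG", "IQ", "DZ", "EG"]
model = build_pipeline("Multinomial NB").fit(texts, labels)
pred = model.predict(["عامل ايه يا صاحبي"])
```

Character n-grams tend to work well for dialect identification because dialects differ in short affixes and spelling patterns rather than whole words.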
Training two deep neural network architectures
- Multi-layer Perceptron
- Long short-term memory (LSTM)
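The multi-layer perceptron can be sketched with scikit-learn's MLPClassifier; the repository presumably uses a deep learning framework, and the layer sizes here are illustrative assumptions (the LSTM additionally needs a sequence model and is not sketched):

```python
# Sketch of the dense (MLP) model; hidden layer sizes are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

mlp = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("net", MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300,
                          random_state=0)),
])

# toy usage; the real model trains on the full preprocessed dataset
texts = ["ازيك عامل ايه", "شلونك شخبارك", "كيفاش راك", "ايه الاخبار يا باشا"]
labels = ["EG", "IQ", "DZ", "EG"]
mlp.fit(texts, labels)
pred = mlp.predict(["عامل ايه النهارده"])
```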
Before you get your hands dirty with the code, here is the project flow: first, fetch the data from the internet. Using HTTP requests and the dialect_tweets.csv file, you can fetch the whole dataset and store it in another CSV file for data preprocessing and model training.
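The fetching step might look like the sketch below; the API URL, batch size, and CSV column names are all hypothetical placeholders, not details taken from the project:

```python
# Hypothetical sketch of the data-fetching step; API_URL, the batch size,
# and the "id"/"dialect" column names are assumptions.
import csv

import requests  # assumed dependency

API_URL = "https://example.com/api/tweets"  # hypothetical endpoint

def chunks(ids, size=1000):
    """Yield batches of ids; APIs typically cap how many ids one request may carry."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fetch_dataset(id_csv="dialect_tweets.csv", out_csv="full_dataset.csv"):
    with open(id_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    id_to_text = {}
    for batch in chunks([r["id"] for r in rows]):
        resp = requests.post(API_URL, json=batch)  # assumed to return {id: text}
        resp.raise_for_status()
        id_to_text.update(resp.json())
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "dialect", "text"])
        for r in rows:
            writer.writerow([r["id"], r["dialect"], id_to_text.get(r["id"], "")])
```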
[Note] An IPython notebook version of the code is available here
pip install -r requirements.txt
python data_fetching.py
python data_preprocessing.py
python machine_learning_model_training.py [-m [model_type]]
--model_type Choose the model you want to train: Multinomial NB, RF, or SGD
python dense_layers_model_training.py
python lstm_model_training.py
python text_inference.py [-m [model_type] -t [text]]
--model_type Choose the model you want to use for inference: LSTM, Dense_NN, SVM, RF, or SGD
--text The text you want to classify
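The inference CLI above can be wired up with argparse roughly as follows; the exact flag handling in text_inference.py may differ:

```python
# Sketch of the text_inference.py argument parsing; details are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Classify Arabic text by dialect")
parser.add_argument("-m", "--model_type", default="SGD",
                    choices=["LSTM", "Dense_NN", "SVM", "RF", "SGD"],
                    help="Model used for inference")
parser.add_argument("-t", "--text", required=True,
                    help="The text you want to classify")

# parse an example command line instead of sys.argv so the sketch is self-contained
args = parser.parse_args(["-m", "SGD", "-t", "ازيك عامل ايه"])
```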
FLASK_APP=model_deploy.py flask run
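A minimal shape for model_deploy.py could look like this; the route name, payload format, and the fixed label standing in for a real model call are assumptions:

```python
# Minimal sketch of a Flask deployment; /predict and the JSON shape are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    # the real app would call model.predict([text]); a fixed label keeps
    # this sketch self-contained
    return jsonify({"text": text, "dialect": "EG"})
```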
- The dataset was fairly large, which made it quite interesting.
- The dataset suffers from class imbalance, and the text needed plenty of cleaning.
- The dataset can be clustered into a smaller number of classes (Egyptian, Levantine, Iraqi, Arabian Peninsula, and Maghreb dialects); this step can lead to better results for both classical machine learning and deep learning models.
- One way to address class imbalance is SMOTE, the most widely used approach to synthesizing new examples.
- The models are evaluated using the F1-score because the dataset is quite imbalanced.
- Reaching 49% accuracy with a linear classifier (SVM) trained via SGD is quite good given that we trained on all 18 classes.
- Training one epoch of the LSTM model takes over 2.5 hours without access to GPU power; even so, after 5 epochs the model reaches 50% accuracy on the test set. With continued training and an early-stopping criterion, the model would very likely reach better accuracy.
- The LSTM model outperforms the rest of the models even though it wasn't trained to convergence on the dataset.
- Techniques for increasing accuracy include pre-trained embedding models such as word2vec trained on Arabic data, and state-of-the-art pre-trained BERT models for Arabic language understanding.
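Grouping the 18 country labels into the 5 regional dialect families mentioned above is a simple lookup; the country codes below are hypothetical and may not match the dataset's actual labels:

```python
# Hypothetical mapping from 18 country labels to 5 regional dialect groups;
# the exact country codes used by the dataset are assumptions.
REGION = {
    "EG": "Egyptian", "SD": "Egyptian",
    "SY": "Levantine", "LB": "Levantine", "JO": "Levantine", "PL": "Levantine",
    "IQ": "Iraqi",
    "SA": "Arabian Peninsula", "KW": "Arabian Peninsula",
    "AE": "Arabian Peninsula", "QA": "Arabian Peninsula",
    "BH": "Arabian Peninsula", "OM": "Arabian Peninsula",
    "YE": "Arabian Peninsula",
    "MA": "Maghreb", "DZ": "Maghreb", "TN": "Maghreb", "LY": "Maghreb",
}

def to_region(label):
    """Collapse a country label into its regional dialect group."""
    return REGION[label]
```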
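SMOTE synthesizes new minority-class examples by interpolating between a sample and one of its nearest neighbors. In practice you would reach for imbalanced-learn's ready-made SMOTE, but the core idea fits in a short sketch:

```python
# Minimal sketch of SMOTE's core idea using numpy and scikit-learn neighbors;
# a production pipeline would use imbalanced-learn's SMOTE instead.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)  # idx[:, 0] is each point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # a random true neighbor
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two real minority samples, the new examples stay inside the minority class's feature region rather than being bare duplicates.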
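Macro-averaged F1 is the right lens for an imbalanced dataset because plain accuracy can look good while a minority class is ignored entirely, as this toy example shows:

```python
# Why F1 over accuracy on imbalanced data: a majority-class-only predictor.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["EG", "EG", "EG", "EG", "IQ"]
y_pred = ["EG", "EG", "EG", "EG", "EG"]  # always predicts the majority class

acc = accuracy_score(y_true, y_pred)                             # 0.8, looks fine
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)  # ~0.44, exposes the miss
```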