📙 `Language-Identifier with SVM in python` 🐍

🎯 In this project, I developed a script that identifies the language of a given text.
🛠️ The script currently supports the following languages: Swahili, Wolof, French, English, Arabic and Dyula.
⚠️ For accurate results, the input text should be reasonably long (at least 4-5 words). To add or change the supported languages, add a training dataset for the desired language and retrain the model; suitable datasets can be found, for example, on Hugging Face Datasets.
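
The training code is not shown in this README, so the following is only a minimal sketch of what adding a language could look like. It assumes a scikit-learn setup (a `TfidfVectorizer` over character n-grams feeding a `LinearSVC`); the toy sentences, the Spanish label and the feature settings are illustrative assumptions, not the project's actual configuration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy sentences invented for this sketch; a real dataset needs far more text per language.
samples = [
    ("the weather is very nice today", "english"),
    ("where are you going this evening", "english"),
    ("i would like a cup of tea please", "english"),
    ("il fait très beau aujourd'hui", "french"),
    ("où est-ce que tu vas ce soir", "french"),
    ("je voudrais une tasse de thé", "french"),
    # Rows for the language being added (Spanish is only an example).
    ("hoy hace muy buen tiempo", "spanish"),
    ("adónde vas a ir esta noche", "spanish"),
    ("quisiera una taza de té por favor", "spanish"),
]
texts = [text for text, _ in samples]
labels = [label for _, label in samples]

# Character n-grams are a common feature choice for language identification.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
features = vectorizer.fit_transform(texts)

model = LinearSVC()
model.fit(features, labels)

# Persist both objects: the vectorizer must be the one that was fitted with the model.
model_bytes = pickle.dumps(model)
vectorizer_bytes = pickle.dumps(vectorizer)

# Reload and predict, mirroring how the saved files are used.
restored_model = pickle.loads(model_bytes)
restored_vectorizer = pickle.loads(vectorizer_bytes)
query = restored_vectorizer.transform(["me gustaría ir a la playa mañana"])
print(restored_model.predict(query)[0])
```

With so few sentences the prediction is not reliable; the point is only that a new language enters the model as additional labelled rows, after which both the model and the vectorizer have to be refitted and saved again.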

You can find the model and the vectorizer in the /model directory, along with the Python script used in method 2.
Here are two ways to use the trained model in a notebook. Install the requirements first:

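    # pickle and sys are part of the Python standard library and cannot be pip-installed.
    # scikit-learn is an assumption here: unpickling a model requires the library that created it.
    !pip install pandas scikit-learn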

Method 1

Load the model and the vectorizer directly:

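    import pickle
    import pandas as pd

    # Load the trained classifier and its fitted vectorizer.
    # Only unpickle files you trust: pickle can execute arbitrary code on load.
    with open("model/SVM_model_language_identifier.pkl", "rb") as f:
        SVM_model = pickle.load(f)
    with open("model/SVM_vectorizer.pk", "rb") as f:
        SVM_vectorizer = pickle.load(f)


    def predict_language(text):
        """Return the predicted language label for a single string."""
        serie = pd.Series(text)  # wrap the string so transform() receives a collection of documents
        vector = SVM_vectorizer.transform(serie)
        return str(SVM_model.predict(vector)[0])

    text = "Na nga def ?"
    print(predict_language(text))

    # Output: wolof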
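
Because short inputs are less reliable (see the 4-5 word guideline above), a thin wrapper can flag them before predicting. This is a hypothetical helper, not part of the repository: `predict_with_length_check` and `MIN_WORDS` are names invented here, and `predict_fn` stands for any function that maps a string to a label, such as `predict_language`.

```python
import warnings

MIN_WORDS = 4  # lower bound of the "at least 4-5 words" guideline


def predict_with_length_check(text, predict_fn, min_words=MIN_WORDS):
    """Call predict_fn(text), warning first if the text is shorter than min_words."""
    n_words = len(text.split())  # whitespace-separated tokens, a rough word count
    if n_words < min_words:
        warnings.warn(
            f"Input has only {n_words} word(s); the prediction may be unreliable.",
            stacklevel=2,
        )
    return predict_fn(text)


# Usage with a stand-in predictor; in the notebook, pass predict_language instead.
print(predict_with_length_check("Na nga def ?", lambda text: "wolof"))
```

The prediction is still returned for short inputs; the warning only signals that it deserves less trust.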

Method 2

Call the script, which handles loading and prediction for us:

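    text = "I'm not really into the birthday thing honestly but I admit this was a really chill"
    # Quote $text so the shell passes it as one argument; unquoted, the apostrophe in "I'm" can break the command.
    var = !python model/language_identifier.py "$text"
    print(var[-1])  # the last line of the script's output is the predicted language

    # Output: english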
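
Outside a notebook, where the `!` syntax is not available, the same call can be sketched with `subprocess`. This assumes the script reads the text from its first command-line argument and prints the predicted language as its last output line, which matches the `var[-1]` usage above but is not otherwise confirmed here; `run_identifier` is a hypothetical helper name.

```python
import subprocess
import sys


def run_identifier(text, script_path="model/language_identifier.py"):
    """Run the identifier script on text and return the last line it prints."""
    result = subprocess.run(
        [sys.executable, script_path, text],  # list form: no shell, so no quoting issues
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    lines = result.stdout.strip().splitlines()
    return lines[-1] if lines else ""


# Example call (requires the repository's model/ directory to be present):
# print(run_identifier("I'm not really into the birthday thing honestly"))
```

Passing the arguments as a list keeps apostrophes and spaces in the text intact without any manual escaping.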

💪 Model performance: Here are the results obtained after training the model

| Language | Precision | Recall | F1-score | Support |
|----------|-----------|--------|----------|---------|
| wolof    | 0.9956    | 0.9884 | 0.9920   | 687     |
| french   | 0.9971    | 0.9788 | 0.9879   | 709     |
| swahili  | 1.0000    | 0.9849 | 0.9924   | 729     |
| english  | 0.9683    | 0.9737 | 0.9710   | 722     |
| arabic   | 0.9363    | 0.9742 | 0.9549   | 619     |
| dyula    | 1.0000    | 1.0000 | 1.0000   | 691     |

Overall, the SVM model performs very well on all six languages: F1-scores range from about 0.955 (Arabic) to 1.0 (Dyula).
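
To make the metrics concrete, here is a small worked example for Wolof. The three counts are not stated in this README; they are inferred from the reported precision, recall and support, so treat them as a reconstruction.

```python
# Counts for Wolof, inferred from the reported figures (not stated in the README).
true_positives = 679   # Wolof texts predicted as Wolof
false_positives = 3    # texts in other languages predicted as Wolof
false_negatives = 8    # Wolof texts predicted as another language

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1_score = 2 * precision * recall / (precision + recall)
support = true_positives + false_negatives  # number of true Wolof samples

print(precision, recall, f1_score, support)
```

Precision is the share of "wolof" predictions that were correct, recall is the share of actual Wolof texts that were found, and the F1-score is their harmonic mean.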

📫 Feel free to contact me if anything is wrong or if anything needs to be changed 😎! [email protected]

🙌 Notebook made by @labriji_saad

🔗 LinkedIn @labriji_saad
