
Test basic AI model #17

Open · wants to merge 1 commit into main
Conversation

marco-c
Contributor

@marco-c marco-c commented Nov 23, 2023

@mbanon what do you think of an approach like this?
Basically, building a ML model where the input features are the top three predictions of fastText and the number of words in dictionaries for these top three predictions.

It might also be worth exploring a more complex model, e.g. increasing the number of "top" predictions to consider (though the more we add, the more expensive the model becomes to run, as it will have to check many dictionaries).
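As a rough sketch of the feature extraction described above (hypothetical helper names; the fastText model and the hunspell loader are assumed to come from the surrounding fastspell code):

```python
def extract_features(sentence, ft_model, get_hunspell, k=3):
    """Build the input features described above: the fastText probabilities
    of the top-k predictions plus, for each of those languages, the number
    of tokens found in its hunspell dictionary."""
    labels, probs = ft_model.predict(sentence, k=k)
    tokens = sentence.split()
    features = {}
    for i, (label, prob) in enumerate(zip(labels, probs)):
        lang = label.replace("__label__", "")
        features[f"fastText_lang{i}"] = prob
        # Count how many tokens the dictionary for this language accepts.
        hobj = get_hunspell(lang)
        features[f"lang{i}_dic_tokens"] = sum(1 for t in tokens if hobj.spell(t))
    return features
```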

Instead of the number of words, we could also experiment with other features:

I tested this model on the dataset on https://github.com/pemistahl/lingua-py for Italian and it fared pretty well compared to all other alternatives, especially for shorter sentences. Plain fastspell also fared pretty well (pemistahl/lingua-py#190).

My exploration was done in the context of mozilla/firefox-translations-training#248 (comment).

Comment on lines +160 to +165
"fastText_lang0": lang0_prob,
"fastText_lang1": lang1_prob,
"fastText_lang2": lang2_prob,
"lang0_dic_tokens": lang0_dic_tokens,
"lang1_dic_tokens": lang1_dic_tokens,
"lang2_dic_tokens": lang2_dic_tokens,
Contributor Author

These are the input features for the model.

"lang1_dic_tokens": lang1_dic_tokens,
"lang2_dic_tokens": lang2_dic_tokens,
})
labels.append(label)
Contributor Author

The label is 1 if the language to choose is the first among the three, 2 if it's the second, 3 if it's the third.

Collaborator

Oh sorry, I did not see this. Maybe this changes my proposal a little bit.


clf = GridSearchCV(
xgb_model,
{"max_depth": [1, 2, 4, 6], "n_estimators": [25, 50, 100, 200]},
Contributor Author

We could presumably introduce another hyperparameter, a "threshold" above which we don't need to go look at the dictionary (e.g. if fastText is 99% sure it's Italian, we don't need to check further). This could lower the cost of the approach.
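A minimal sketch of that short-circuit (hypothetical names; the threshold value itself would be tuned alongside the other hyperparameters):

```python
CONFIDENCE_THRESHOLD = 0.99  # hypothetical value; would itself be tuned

def predict_language(sentence, ft_model, classify_with_dictionaries, k=3):
    labels, probs = ft_model.predict(sentence, k=k)
    if probs[0] >= CONFIDENCE_THRESHOLD:
        # fastText alone is confident enough; skip the dictionary
        # checks and the classifier entirely.
        return labels[0].replace("__label__", "")
    # Otherwise fall back to the dictionary-based model.
    return classify_with_dictionaries(sentence, labels, probs)
```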

Collaborator

That's right. I've always wondered whether there is a correlation between confidence and precision, but I never had time to check whether the number of false positives or false negatives is very small when confidence is high. If so, we could add this exception and speed things up.

raise RuntimeError(f"No valid dictionary directory exists "
                   f"for {lang_code} in the paths {self.hunspell_paths}.")

if hunspell_obj is None:
Contributor Author

This change is unrelated, I needed it to load as many dictionaries as possible from my system (as I was testing with Italian, and it is not available in fastspell-dictionaries).

Contributor Author

I guess this change could be useful in its own right, so users of fastspell can more easily load more dictionaries.

Contributor Author

Let me know if you want to land this change (after cleaning it up of course) and I'll open a PR for it.

Collaborator

I'm not sure I'm understanding this change correctly, but there's already a whole path search for possible dictionary candidates beyond fastspell-dictionaries if a user wants to use the system's. It's explained in the documentation, and here is the code.

Contributor Author

This was to allow loading a dictionary more easily: e.g. if "it" is the language and "it-IT.dic" exists on the system, it will assume that's the one to load (even though there might be others, like "it_CH.dic").
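A sketch of that lookup behavior (illustrative only; the function name and glob patterns are assumptions, not fastspell's actual code):

```python
import glob
import os

def find_dictionary(lang_code, search_paths):
    """Return the first .dic file matching the bare language code or any
    regional variant of it (e.g. "it" -> "it.dic", "it-IT.dic", "it_IT.dic")."""
    for path in search_paths:
        for pattern in (f"{lang_code}.dic",
                        f"{lang_code}-*.dic",
                        f"{lang_code}_*.dic"):
            # Exact match wins; otherwise take the first variant found,
            # in sorted order so the choice is deterministic.
            matches = sorted(glob.glob(os.path.join(path, pattern)))
            if matches:
                return matches[0]
    return None
```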

@mbanon mbanon requested review from ZJaume and mbanon November 24, 2023 08:50
@mbanon mbanon self-assigned this Nov 24, 2023
Collaborator

@ZJaume ZJaume left a comment

Hi Marco! I'm glad FastSpell is useful to you and that you're offering to contribute. This machine learning approach seems pretty interesting, and I would like to see some numbers. These are only preliminary comments, as we will need to see more before deciding whether to approve this. But it looks promising!

Regarding the features:

  • is there at least one word in language X?

  • are there only words in language X?

I like these

Also, I would like to suggest a more flexible set of features than this:

"fastText_lang0": lang0_prob,
"fastText_lang1": lang1_prob,
"fastText_lang2": lang2_prob,
"lang0_dic_tokens": lang0_dic_tokens,
"lang1_dic_tokens": lang1_dic_tokens,
"lang2_dic_tokens": lang2_dic_tokens,

In our general experience, checking 4 dictionaries for errors starts to be expensive, and 5 or more is generally prohibitive.
Of course, this always depends on whether the language we are trying to improve is low-resource or not. For example, Icelandic is low-resource, we are interested in preserving as much data as possible, and the amount of data we usually process for Icelandic is low. So using 5 dictionaries for Icelandic is not a big deal, but it may be for Spanish or Italian.

Currently the number of similar languages is different for every set of languages, and we'd like to keep it that way for the reasons I explained before, so I think the features should allow this. I have also found many times that the correct language is something like 7th or 8th in the top predictions, so it could be necessary to have a number of *fasttext_prob features different from (and bigger than) the number of *dic_tokens features.

After thinking about this, the idea that comes to mind would be to use a feature array where each position represents one of all the possible labels fastText has (in the model we use, that would be 176). So position 0 represents the Afrikaans probability, position 1 Abkhazian, and so on. Then, if we extract, for example, the top 10 languages, we set the corresponding probability value for those languages in the array, and 0 for the rest. For the number of correct tokens in each language it would be more or less the same: another 176 features, with the corresponding positions set to the values obtained after hunspell checking.

This could be easily achieved with https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html

The idea behind this is that the model could establish the relationship between each feature and label, so in the end you could give it any arbitrary number of probabilities and spellcheck corrections.
Of course, this could be overkill and maybe too expensive to run on each segment to classify. But I think the features are still cheap to compute this way; maybe the classifier with that many features is not.
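The fixed-width layout described above could be sketched like this (illustrative only; a plain index mapping stands in for FeatureHasher, and the language list is a toy subset of the real model's 176 labels):

```python
# Toy stand-in for the 176 labels of the real fastText model.
LANGS = ["af", "en", "es", "is", "it", "uz"]
LANG_INDEX = {lang: i for i, lang in enumerate(LANGS)}

def build_feature_vector(topk, dic_tokens):
    """topk: {lang: prob} for the top-k fastText predictions;
    dic_tokens: {lang: n_valid_tokens} for the languages actually checked.
    Returns a fixed-width vector: one probability slot per label,
    followed by one dictionary-token slot per label (unset slots are 0)."""
    n = len(LANGS)
    vec = [0.0] * (2 * n)
    for lang, prob in topk.items():
        vec[LANG_INDEX[lang]] = prob
    for lang, count in dic_tokens.items():
        vec[n + LANG_INDEX[lang]] = float(count)
    return vec
```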

A simpler approach could be to just have this:

            "fastText_lang0": lang0_prob,
            "fastText_lang1": lang1_prob,
            ...
            "fastText_lang9": lang1_prob,
            "lang0_dic_tokens": lang0_dic_tokens,
            "lang1_dic_tokens": lang1_dic_tokens,
            "lang2_dic_tokens": lang2_dic_tokens,
            "lang2_dic_tokens": lang2_dic_tokens,

and always giving the top 10 and setting the dic_tokens to -1 when you haven't checked that language? But in this case the classifier doesn't know which probability corresponds to which of the spellchecked languages.

EDIT: I just saw that your approach is to predict which of those 3 candidates is the correct language, not to predict the language label. So maybe my idea is not useful anymore, but I still think more flexibility could improve the model.

src/fastspell/fastspell.py

@@ -18,6 +18,7 @@ hunspell_codes:
da: da_DK
de: de_DE
en: en_GB
el: el_GR
Collaborator

This will be cleaned up, right? We prefer to add new languages to the default config when they come along with the corresponding fastspell-dictionaries update that adds them.

Contributor Author

Yeah this was just for testing.



@ZJaume
Collaborator

ZJaume commented Nov 24, 2023

I also wonder what happens when the language we are targeting is not supported by fastText (and therefore will never appear in the top-k), or in other extreme cases, like Uzbek (#14): fastText probably never had any Cyrillic training data for Uzbek, so Uzbek written in Cyrillic would also not appear in the top-k.

@marco-c
Contributor Author

marco-c commented Dec 21, 2023

EDIT: I just saw that your approach is to predict which of those 3 candidates is the correct language, not to predict the language label. So maybe my idea is not useful anymore, but I still think more flexibility could improve the model.

Your proposal is very similar to what I wanted to try too:

It might also be worth exploring a more complex model, e.g. increasing the number of "top" predictions to consider (though the more we add, the more expensive the model becomes to run, as it will have to check many dictionaries).
with the addition of the confidence threshold.

I think the approach is promising; unfortunately, I won't have much time to devote to the exploration soon. It might be an interesting research project for a student or an intern :)

@marco-c
Contributor Author

marco-c commented Dec 21, 2023

Yet another option, which would be more costly to train but likely provide better results, is training a fastText model with the addition of dictionary features, so everything is learned together instead of building a model on top of another one.

I guess we could use https://github.com/laurieburchell/open-lid-dataset to train and validate the model (either the simple one on top of fastText or the more complex one which would fully replace fastText).
