
Test basic AI model #17

Open · wants to merge 1 commit into main
Conversation

marco-c
Contributor

@marco-c marco-c commented Nov 23, 2023

@mbanon what do you think of an approach like this?
Basically, building a ML model where the input features are the top three predictions of fastText and the number of words in dictionaries for these top three predictions.

It might also be worth exploring a more complex model, e.g. increasing the number of "top" predictions to consider (though the more we add, the more expensive the model becomes to run, as it will have to check many dictionaries).
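As a rough sketch of the feature extraction described above (hypothetical helper names; the fastText model and the hunspell loader are assumed to come from the surrounding fastspell code):

```python
def extract_features(sentence, ft_model, get_hunspell, k=3):
    """Build the input features described above: the fastText probabilities
    of the top-k predictions plus, for each of those languages, the number
    of tokens found in its hunspell dictionary."""
    labels, probs = ft_model.predict(sentence, k=k)
    tokens = sentence.split()
    features = {}
    for i, (label, prob) in enumerate(zip(labels, probs)):
        lang = label.replace("__label__", "")
        features[f"fastText_lang{i}"] = prob
        # Count how many tokens the dictionary for this language accepts.
        hobj = get_hunspell(lang)
        features[f"lang{i}_dic_tokens"] = sum(1 for t in tokens if hobj.spell(t))
    return features
```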

Instead of the number of words, we could also experiment with other features:

I tested this model on the dataset on https://github.com/pemistahl/lingua-py for Italian and it fared pretty well compared to all other alternatives, especially for shorter sentences. Plain fastspell also fared pretty well (pemistahl/lingua-py#190).

My exploration was done in the context of mozilla/firefox-translations-training#248 (comment).

Comment on lines +160 to +165
"fastText_lang0": lang0_prob,
"fastText_lang1": lang1_prob,
"fastText_lang2": lang2_prob,
"lang0_dic_tokens": lang0_dic_tokens,
"lang1_dic_tokens": lang1_dic_tokens,
"lang2_dic_tokens": lang2_dic_tokens,
Contributor Author

These are the input features for the model.

"lang1_dic_tokens": lang1_dic_tokens,
"lang2_dic_tokens": lang2_dic_tokens,
})
labels.append(label)
Contributor Author

The label is 1 if the language to choose is the first among the three, 2 if it's the second, 3 if it's the third.

Collaborator

Oh sorry, I did not see this. Maybe this changes my proposal a little bit.


clf = GridSearchCV(
xgb_model,
{"max_depth": [1, 2, 4, 6], "n_estimators": [25, 50, 100, 200]},
Contributor Author

We could presumably introduce another hyperparameter, a "threshold" above which we don't need to go look at the dictionary (e.g. if fastText is 99% sure it's Italian, we don't need to check further). This could lower the cost of the approach.
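A minimal sketch of that short-circuit (hypothetical names; the threshold value itself would be tuned alongside the other hyperparameters):

```python
CONFIDENCE_THRESHOLD = 0.99  # hypothetical value; would itself be tuned

def predict_language(sentence, ft_model, classify_with_dictionaries, k=3):
    labels, probs = ft_model.predict(sentence, k=k)
    if probs[0] >= CONFIDENCE_THRESHOLD:
        # fastText alone is confident enough; skip the dictionary
        # checks and the classifier entirely.
        return labels[0].replace("__label__", "")
    # Otherwise fall back to the dictionary-based model.
    return classify_with_dictionaries(sentence, labels, probs)
```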

Collaborator

That's right. I've always wondered whether there is a correlation between confidence and precision, but I never had time to check whether the number of false positives or false negatives is very small when confidence is high. If so, we could add this exception and speed things up.

raise RuntimeError(f"No valid dictionary directory exists "
                   f"for {lang_code} in the paths {self.hunspell_paths}.")

if hunspell_obj is None:
Contributor Author

This change is unrelated, I needed it to load as many dictionaries as possible from my system (as I was testing with Italian, and it is not available in fastspell-dictionaries).

Contributor Author

I guess this change could be useful in its own right, so users of fastspell can more easily load more dictionaries.

Contributor Author

Let me know if you want to land this change (after cleaning it up of course) and I'll open a PR for it.

Collaborator

I'm not sure I'm understanding this change correctly, but there's already a whole path search for possible dictionary candidates beyond fastspell-dictionaries if a user wants to use the system's. It's explained in the documentation, and here is the code.

Contributor Author

This was to allow loading a dictionary more easily: e.g. if "it" is the language and "it-IT.dic" exists on the system, it will assume that's the one to load (even though there might be others, like "it_CH.dic").
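A sketch of that lookup behavior (illustrative only; the function name and glob patterns are assumptions, not fastspell's actual code):

```python
import glob
import os

def find_dictionary(lang_code, search_paths):
    """Return the first .dic file matching the bare language code or any
    regional variant of it (e.g. "it" -> "it.dic", "it-IT.dic", "it_IT.dic")."""
    for path in search_paths:
        for pattern in (f"{lang_code}.dic",
                        f"{lang_code}-*.dic",
                        f"{lang_code}_*.dic"):
            # Exact match wins; otherwise take the first variant found,
            # in sorted order so the choice is deterministic.
            matches = sorted(glob.glob(os.path.join(path, pattern)))
            if matches:
                return matches[0]
    return None
```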

@mbanon mbanon requested review from ZJaume and mbanon November 24, 2023 08:50
@mbanon mbanon self-assigned this Nov 24, 2023
Collaborator

@ZJaume ZJaume left a comment

Hi Marco! I'm glad FastSpell is useful to you and that you're offering to contribute. This machine learning approach seems pretty interesting, and I would like to see some numbers. These are only preliminary comments, as we will need to see more before deciding whether to approve this. But it looks promising!

Regarding the features:

  • is there at least one word in language X?

  • are there only words in language X?

I like these

Also, I would like to suggest a more flexible set of features than this:

"fastText_lang0": lang0_prob,
"fastText_lang1": lang1_prob,
"fastText_lang2": lang2_prob,
"lang0_dic_tokens": lang0_dic_tokens,
"lang1_dic_tokens": lang1_dic_tokens,
"lang2_dic_tokens": lang2_dic_tokens,

In our general experience, checking 4 dictionaries for errors starts to be expensive, and 5 or more is generally prohibitive.
Of course, this always depends on whether the language we are trying to improve is low-resource or not. For example, Icelandic is low-resource, we are interested in preserving as much data as possible, and the amount of data we usually process for Icelandic is low. So using 5 dictionaries for Icelandic is not a big deal, but it may be for Spanish or Italian.

Currently the number of similar languages is different for every set of languages, and we'd like to keep it that way for the reasons I explained before, so I think the features should allow this. I have also found many times that the correct language is something like 7th or 8th in the top predictions, so it could be necessary to have a number of *fasttext_prob features different from (and bigger than) the number of *dic_tokens features.

After thinking about this, the idea that comes to mind would be to use a feature array where each position represents one of all the possible labels fastText has (in the model we use, that would be 176). So position 0 represents the Afrikaans probability, position 1 Abkhazian, and so on. Then, if we extract, for example, the top 10 languages, we set the corresponding probability value for those languages in the array, and 0 for the rest. For the number of correct tokens in each language it would be more or less the same: another 176 features, with the corresponding positions set to the values obtained after hunspell checking.

This could be easily achieved with https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html

The idea behind this is that the model could establish the relationship between each feature and label, so in the end you could give it any arbitrary number of probabilities and spellcheck corrections.
Of course, this could be overkill and maybe too expensive to run on each segment to classify. But I think the features are still cheap to compute this way; maybe the classifier with that many features is not.
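The fixed-width layout described above could be sketched like this (illustrative only; a plain index mapping stands in for FeatureHasher, and the language list is a toy subset of the real model's 176 labels):

```python
# Toy stand-in for the 176 labels of the real fastText model.
LANGS = ["af", "en", "es", "is", "it", "uz"]
LANG_INDEX = {lang: i for i, lang in enumerate(LANGS)}

def build_feature_vector(topk, dic_tokens):
    """topk: {lang: prob} for the top-k fastText predictions;
    dic_tokens: {lang: n_valid_tokens} for the languages actually checked.
    Returns a fixed-width vector: one probability slot per label,
    followed by one dictionary-token slot per label (unset slots are 0)."""
    n = len(LANGS)
    vec = [0.0] * (2 * n)
    for lang, prob in topk.items():
        vec[LANG_INDEX[lang]] = prob
    for lang, count in dic_tokens.items():
        vec[n + LANG_INDEX[lang]] = float(count)
    return vec
```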

A simpler approach could be to just have this:

            "fastText_lang0": lang0_prob,
            "fastText_lang1": lang1_prob,
            ...
            "fastText_lang9": lang1_prob,
            "lang0_dic_tokens": lang0_dic_tokens,
            "lang1_dic_tokens": lang1_dic_tokens,
            "lang2_dic_tokens": lang2_dic_tokens,
            "lang2_dic_tokens": lang2_dic_tokens,

and always giving the top 10 and setting the dic_tokens to -1 when you haven't checked that language? But in this case the classifier doesn't know which probability corresponds to which of the spellchecked languages.

EDIT: I just saw that your approach is to predict which of those 3 candidates is the correct language, not to predict the language label. So maybe my idea is not useful anymore, but I still think more flexibility could improve the model.

src/fastspell/fastspell.py

@@ -18,6 +18,7 @@ hunspell_codes:
da: da_DK
de: de_DE
en: en_GB
el: el_GR
Collaborator

This will be cleaned up, right? We prefer to add new languages to the default config when they come along with the corresponding fastspell-dictionaries update that adds them.

Contributor Author

Yeah this was just for testing.



@ZJaume
Collaborator

ZJaume commented Nov 24, 2023

I also wonder what happens when the language we are targeting is not supported by fastText (and therefore will never appear in the top-k), or in other extreme cases, like Uzbek (#14): fastText probably never had any Cyrillic training data for Uzbek, so Uzbek written in Cyrillic would also not appear in the top-k.

@marco-c
Contributor Author

marco-c commented Dec 21, 2023

EDIT: I just saw that your approach is to predict which of those 3 candidates is the correct language, not to predict the language label. So maybe my idea is not useful anymore, but I still think more flexibility could improve the model.

Your proposal is very similar to what I wanted to try too:

It might also be worth exploring a more complex model, e.g. increasing the number of "top" predictions to consider (though the more we add, the more expensive the model becomes to run, as it will have to check many dictionaries).
with the addition of the confidence threshold.

I think the approach is promising; unfortunately, I won't have much time to devote to the exploration soon. It might be an interesting research project for a student or an intern :)

@marco-c
Contributor Author

marco-c commented Dec 21, 2023

Yet another option, which would be more costly to train but likely provide better results, is training a fastText model with the addition of dictionary features, so everything is learned together instead of building a model on top of another one.

I guess we could use https://github.com/laurieburchell/open-lid-dataset to train and validate the model (either the simple one on top of fastText or the more complex one which would fully replace fastText).
