Add fastspell to the comparison #188

Closed
marco-c opened this issue Nov 11, 2023 · 6 comments · Fixed by #190

Comments

@marco-c
Contributor

marco-c commented Nov 11, 2023

fastspell uses a combination of fastText and dictionaries to identify the language. It would be interesting to see how it compares to lingua-py.

@pemistahl
Owner

Thank you for the suggestion. I will include it in the comparison.

@marco-c
Contributor Author

marco-c commented Nov 11, 2023

I have a local WIP patch; I could submit a PR in a few days.

@pemistahl
Owner

Sounds good, PRs are always welcome. :)

@pemistahl
Owner

@marco-c I've now updated the accuracy reports and plots to include FastSpell. It's more accurate than pure FastText, especially in aggressive mode. It also beats Lingua in low accuracy mode most of the time, even though the difference is not that big. Lingua in high accuracy mode is still unchallenged. Phew, lucky me. ;-)

In terms of runtime performance, FastSpell is significantly slower than FastText but on par with single-threaded Lingua in high accuracy mode.

Thanks again for your contribution. Let me know if you plan to use Lingua in one of your projects. I'm always curious.


[Plot: Average Detection Performance]

@marco-c
Contributor Author

marco-c commented Nov 28, 2023

@pemistahl the accuracy gets much higher as you add more dictionaries. For Italian, for example, you can look at the results in my comment here: mozilla/firefox-translations-training#248 (comment). The difference is huge, especially for single words and word pairs. Most of the errors I saw were actually labelling errors in the Wortschatz corpora.

Italian is not yet in the default fastspell configuration though, so the analysis in the lingua-py repo doesn't show these improvements yet.

I'm not sure yet which language identification library we will use for Firefox translations. Lingua seems to be very good and high-performance with the Rust backend, while fastspell seems to be very good for very short sentences. We might consider a mix of them. You could also implement dictionary lookup as a feature in lingua, just as fastspell does on top of fasttext.
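
To make the dictionary lookup idea concrete, here is a rough sketch of how such a step could sit on top of any base detector. It is not FastSpell's or Lingua's actual code: `refine_with_dictionaries`, `DICTIONARIES`, `SIMILAR`, and the tiny Spanish/Galician word lists are all hypothetical and only illustrate the general approach of re-ranking easily confused languages by counting dictionary hits.

```python
from typing import Callable, Dict, List, Set

# Hypothetical toy word lists. A real setup would load full dictionaries
# per language instead of these hand-picked words.
DICTIONARIES: Dict[str, Set[str]] = {
    "es": {"el", "gato", "come", "pescado", "casa"},
    "gl": {"o", "gato", "come", "peixe", "casa"},
}

# Hypothetical groups of languages that a statistical model tends to confuse.
SIMILAR: Dict[str, List[str]] = {"es": ["gl"], "gl": ["es"]}


def refine_with_dictionaries(text: str, base_detect: Callable[[str], str]) -> str:
    """Re-rank the base prediction among similar languages by dictionary hits."""
    predicted = base_detect(text)
    # The base prediction goes first so that it wins any tie below.
    candidates = [predicted] + SIMILAR.get(predicted, [])

    # Naive tokenization: split on whitespace, strip punctuation, lowercase.
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens or len(candidates) == 1:
        return predicted

    def hits(lang: str) -> int:
        # Number of tokens found in the dictionary of the given language.
        words = DICTIONARIES.get(lang, set())
        return sum(1 for t in tokens if t in words)

    # max() returns the first candidate with the highest hit count.
    return max(candidates, key=hits)
```

With a base detector that always answers `"es"` (for example `lambda text: "es"`), the Galician toy sentence "O gato come peixe" is relabelled as `"gl"` because more of its words appear in the `gl` list, while an ambiguous input such as "gato come" keeps the base prediction.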

A couple of links that might be interesting for you:
Helsinki-NLP/OpusFilter#65
mbanon/fastspell#17

@pemistahl
Owner

Thank you for considering Lingua for Firefox. That would be amazing. :)

If dictionaries add that much to accuracy, I will think about adding them to Lingua, but without the performance penalty seen in FastSpell. With Rust, that should be doable.
