Lowercase before langid #42

Open
ZJaume opened this issue Oct 5, 2023 · 2 comments

Comments

@ZJaume
Member

ZJaume commented Oct 5, 2023

If we are going to use fastText, we should lowercase the text before language identification. At least with the official lid.176 model, uppercased text completely breaks identification for mid- and low-resource languages: they are always identified as the highest-resource language of the same script (Russian for Cyrillic, English/Spanish/French for Latin).
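A minimal Python sketch of the idea, not warc2text's actual code (warc2text is C++). The helper names `normalize_for_langid`, `identify`, and `fasttext_predictor` are made up for illustration, and the model path is whatever fastText model you have on disk. The only fastText calls used are `fasttext.load_model` and `model.predict` from the official Python bindings.

```python
from typing import Callable, Tuple

# A predictor takes one line of text and returns (language, probability).
Predictor = Callable[[str], Tuple[str, float]]


def normalize_for_langid(text: str) -> str:
    """Lowercase the text and collapse all whitespace to single spaces.

    Collapsing whitespace also removes newlines, which fastText's
    predict() refuses to accept.
    """
    return " ".join(text.lower().split())


def identify(text: str, predict: Predictor) -> Tuple[str, float]:
    """Run language identification on the lowercased text (hypothetical helper)."""
    return predict(normalize_for_langid(text))


def fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText model as a Predictor (assumes `pip install fasttext`)."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        labels, probs = model.predict(text, k=1)
        return labels[0].replace("__label__", ""), float(probs[0])

    return predict
```

With a model file available, `identify(some_text, fasttext_predictor(path_to_model))` classifies the lowercased text. Passing the predictor in as an argument keeps the normalization testable without a model.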

@jelmervdl
Member

@laurieburchell do you have an opinion on this?

Ideally, if this is the case, lowercasing would be part of the model rather than an option inside warc2text, because it would be really hard to keep track of which models benefit from it and which don't.

On the other hand, I can also understand that the web is kind of garbage and there's a lot of ALL UPPER CASE text out there that isn't in the training data, so it doesn't match any n-grams in the model.

Maybe we should train a model explicitly on all-lowercase text and see whether that degrades performance a lot. If it doesn't, we could simply always classify on lowercased text.
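A rough sketch of how that experiment could be set up, assuming the training data is in fastText's supervised format (one example per line, labels as `__label__…` tokens). `lowercase_training_line` and `accuracy` are hypothetical helpers, not part of any existing tool. The labels themselves must be left untouched, since they can contain uppercase characters.

```python
from typing import Callable, Iterable, Tuple

LABEL_PREFIX = "__label__"


def lowercase_training_line(line: str) -> str:
    """Lowercase the text of one training line, keeping label tokens as-is."""
    tokens = line.split()
    return " ".join(
        tok if tok.startswith(LABEL_PREFIX) else tok.lower() for tok in tokens
    )


def accuracy(
    predict: Callable[[str], str],
    examples: Iterable[Tuple[str, str]],
    transform: Callable[[str], str] = lambda s: s,
) -> float:
    """Fraction of (text, label) pairs classified correctly after `transform`."""
    examples = list(examples)
    hits = sum(predict(transform(text)) == label for text, label in examples)
    return hits / len(examples)
```

Comparing `accuracy(predict, dev_set)` for the existing model against `accuracy(lowercased_predict, dev_set, transform=str.lower)` for a model trained on the lowercased file, and repeating both with `transform=str.upper` on the input, would show how much is lost on normal text and gained on all-caps text.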

@laurieburchell

I would suggest building the lowercasing into the LID model if possible - apart from anything else, it helps deal with feature sparsity.
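To illustrate the sparsity point, here is a toy sketch that counts distinct character trigrams with and without lowercasing. It is a simplification: fastText extracts n-grams per word with boundary markers and hashes them into buckets, but the effect of casing on the number of distinct features is the same in spirit. The sample strings and the `char_ngrams` helper are made up for the example.

```python
from typing import List, Set


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return the set of distinct character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def distinct_ngrams(samples: List[str], n: int, lowercase: bool) -> Set[str]:
    """Collect distinct n-grams over all samples, optionally lowercasing first."""
    grams: Set[str] = set()
    for sample in samples:
        grams |= char_ngrams(sample.lower() if lowercase else sample, n)
    return grams


# The same phrase in three casings, as it might appear on the web.
samples = ["Hello world", "HELLO WORLD", "hello world"]
cased = distinct_ngrams(samples, 3, lowercase=False)
lowered = distinct_ngrams(samples, 3, lowercase=True)
print(len(cased), len(lowered))
```

Without lowercasing, each casing variant contributes its own n-grams, so the same evidence is spread over more features, each seen less often during training.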
