Lowercase before langid #42

ZJaume · 2023-10-05T14:11:03Z

If we are going to use fastText, we should lowercase the text before language identification. At least with the official lid.176 model, uppercase text completely breaks identification for mid- and low-resource languages: they are always identified as the highest-resource language of the script (Russian for Cyrillic; English, Spanish, or French for Latin).
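
A minimal sketch of the proposed preprocessing, assuming the identifier is available as a callable that takes one line of text and returns a label. The names `normalize_for_langid`, `identify_language`, and `predict` are hypothetical and not part of warc2text. With the fastText Python bindings, `predict` could wrap `model.predict(text, k=1)` on a model loaded with `fasttext.load_model`. Whitespace is collapsed too, because fastText's `predict` rejects input that contains a newline.

```python
from typing import Callable


def normalize_for_langid(text: str) -> str:
    """Lowercase the text and collapse whitespace runs (including newlines) to single spaces."""
    return " ".join(text.lower().split())


def identify_language(text: str, predict: Callable[[str], str]) -> str:
    """Run a language identifier on the lowercased form of `text`.

    `predict` is a hypothetical stand-in for the real model call.
    """
    return predict(normalize_for_langid(text))
```

Keeping the normalization in one function means the same transformation can be applied at training time and at prediction time.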

jelmervdl · 2023-10-06T14:17:24Z

@laurieburchell do you have an opinion on this?

Ideally, if this is the case, lowercasing would be part of the model rather than an option inside warc2text, as it would be really hard to keep track of which models benefit from it and which don't.

On the other hand, I can also understand that the web is kind of garbage and there's a lot of ALL UPPER CASE text out there that's not in the training data, and that therefore doesn't match any n-grams in the model.
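
To illustrate the n-gram mismatch, here is a rough sketch that extracts character n-grams from a word padded with `<` and `>` boundary markers, as fastText does for subwords, and intersects the sets for two spellings. The function names and the n-gram range of 2 to 4 are choices made for this illustration, and the sketch ignores fastText's hashing of n-grams into buckets.

```python
def char_ngrams(word: str, min_n: int = 2, max_n: int = 4) -> set[str]:
    """Return the character n-grams of `word`, padded with < and > as word boundaries."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    }


def shared_ngrams(a: str, b: str) -> set[str]:
    """Return the n-grams that two words have in common."""
    return char_ngrams(a) & char_ngrams(b)
```

For example, `shared_ngrams("привет", "ПРИВЕТ")` is empty: every n-gram contains at least one letter, and no letter has the same code point in both cases. A model trained mostly on lowercase text therefore has little to go on for the uppercase form.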

Maybe we should train a model explicitly on all-lowercase text and see whether that degrades performance a lot. If it doesn't, we could indeed just always classify on lowercased text?
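
One possible shape for that experiment, sketched under the assumption that the training data uses fastText's supervised format with `__label__` prefixes: lowercase only the text of each line, train a second model on the result, and compare accuracy on the same test set. Both helper names are hypothetical, and `predict` again stands in for whichever model is being evaluated.

```python
from typing import Callable, Iterable

LABEL_PREFIX = "__label__"


def lowercase_training_line(line: str) -> str:
    """Lowercase the text of one training line, leaving `__label__` tokens untouched."""
    return " ".join(
        tok if tok.startswith(LABEL_PREFIX) else tok.lower()
        for tok in line.split()
    )


def accuracy(predict: Callable[[str], str], examples: Iterable[tuple[str, str]]) -> float:
    """Fraction of (text, gold_label) pairs for which `predict(text)` equals the gold label."""
    pairs = list(examples)
    if not pairs:
        return 0.0  # no examples: report 0.0 rather than divide by zero
    correct = sum(1 for text, gold in pairs if predict(text) == gold)
    return correct / len(pairs)
```

Running `accuracy` for both models on the original test set and on an uppercased copy of it would show both the cost of lowercasing on normal text and the gain on all-caps text.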

laurieburchell · 2023-10-06T15:44:49Z

I would suggest building the lowercasing into the LID model if possible - apart from anything else, it helps deal with feature sparsity.
