fun-langid: The code does lowercase, but that is a locale sensitive operation. #9

Open
mihnita opened this issue Sep 25, 2023 · 2 comments

Comments

@mihnita

mihnita commented Sep 25, 2023

This method:

def _normalize(self, line:str):
    return line.lower().replace('"', "'").replace("/", " ")

invokes .lower() on the input text, but lowercasing is a locale-sensitive operation.
The uppercase I (U+0049) lowercases to i (U+0069) in all languages except Turkish and Azeri, where it should lowercase to the dotless ı (U+0131).
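
To make the mismatch concrete, here is a small sketch of what Python's built-in `str.lower()` does with this input. It relies only on the standard library; the variable names are illustrative and not taken from the project.

```python
# str.lower() applies the default Unicode case mapping and ignores the locale.
text = "LARI"
lowered = text.lower()

# Code point of the last character after lowercasing: U+0069 (dotted i).
last_code_point = f"U+{ord(lowered[-1]):04X}"

# The form a Turkish or Azeri reader would expect ends in dotless ı (U+0131).
expected_turkic = "lar\u0131"

print(lowered, last_code_point)    # lari U+0069
print(lowered == expected_turkic)  # False
```
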

So with the current code, the lowercase of "LARI" will be "lari", which does not exist among the Turkish n-grams, instead of "ları", which does.

This means that the recognition of Turkish and Azeri uppercase text will be problematic.
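
For illustration only, a Turkic-aware variant of the normalization could look roughly like the sketch below. `turkic_lower` and the `turkic` flag are hypothetical names, not part of the project, and the helper assumes precomposed (NFC) input. It also assumes the caller already knows the text is Turkish or Azeri, which a language identifier generally does not, so this is not a drop-in fix.

```python
def turkic_lower(text: str) -> str:
    """Lowercase with Turkish/Azeri rules for I and İ (hypothetical helper)."""
    # Map I -> ı (U+0131) and İ (U+0130) -> i first, then let the default
    # mapping handle every other character.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


def normalize(line: str, turkic: bool = False) -> str:
    """Same steps as _normalize above, with an optional Turkic lowercase."""
    lowered = turkic_lower(line) if turkic else line.lower()
    return lowered.replace('"', "'").replace("/", " ")


print(normalize("LARI"))               # lari
print(normalize("LARI", turkic=True))  # ları
```

One conceivable use would be to score both normalized variants against the n-grams and keep the better match, at the cost of extra work per input.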

mihnita changed the title from "The code does lowercase, but that is a locale sensitive operation." to "fun-langid: The code does lowercase, but that is a locale sensitive operation." on Sep 25, 2023
@icaswell

Ah, this is indeed an annoying issue :) Thanks for pointing it out! I'm inclined not to worry about it, since this will probably not have a large effect on the accuracy of the model, and the goal of this model is simplicity over accuracy in any case.

@mihnita
Author

mihnita commented Oct 5, 2023

I can't think of a good fix.
But I agree, big chunks of uppercase text are pretty rare.
