fun-langid: The code does lowercase, but that is a locale sensitive operation. #9

mihnita · 2023-09-25T19:56:13Z

This method:

def _normalize(self, line:str):
    return line.lower().replace('"', "'").replace("/", " ")

invokes .lower() on the input text, but that is a locale-sensitive operation.
The uppercase I (U+0049) converts to i (U+0069) in all languages except Turkish and Azeri, where it should convert to the dotless lowercase i (ı, U+0131).

So with the current code, the lowercase of "LARI" will be "lari", which does not exist in the Turkish n-grams, instead of "ları", which does exist.

This means that the recognition of Turkish and Azeri uppercase text will be problematic.
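To make the mismatch concrete, here is a minimal sketch comparing Python's built-in `.lower()` with a Turkish/Azeri-aware variant. The helper name `turkic_lower` is hypothetical and is not part of fun-langid; it only illustrates the two letter mappings involved.

```python
def turkic_lower(text: str) -> str:
    """Lowercase text using the Turkish/Azeri rules for dotted and dotless i.

    Hypothetical helper for illustration only; not part of fun-langid.
    """
    # Map the two locale-specific capitals first:
    #   I (U+0049) -> ı (U+0131), İ (U+0130) -> i (U+0069)
    # and let str.lower() handle every other character.
    return text.replace("I", "ı").replace("İ", "i").lower()


# Built-in lowercasing uses the default Unicode mapping (I -> i).
print("LARI".lower())
# The Turkic-aware variant keeps the dotless i.
print(turkic_lower("LARI"))
```

Note that a helper like this is only correct when the text is already known to be Turkish or Azeri, which is exactly what a language identifier does not know in advance, so it shows the problem rather than fixing it.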


icaswell · 2023-09-26T01:00:31Z

Ah, this is indeed an annoying issue :) Thanks for pointing it out! I'm inclined not to worry about it, since this will probably not have a large effect on the accuracy of the model, and the goal of this model is simplicity over accuracy in any case.

mihnita · 2023-10-05T02:05:26Z

I can't think of a good fix.
But I agree, big chunks of uppercase text are pretty rare.

mihnita changed the title ~~The code does lowercase, but that is a locale sensitive operation.~~ fun-langid: The code does lowercase, but that is a locale sensitive operation. Sep 25, 2023
