The code invokes `.lower()` on the input text, but lowercasing is a locale-sensitive operation.
The uppercase I (U+0049) lowercases to i (U+0069) in all languages except Turkish and Azeri, where it should lowercase to dotless i (ı, U+0131).
So with the current code, lowercasing "LARI" yields "lari", which does not exist in the Turkish n-grams, instead of "ları", which does.
This means that the recognition of Turkish and Azeri uppercase text will be problematic.
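To make the mismatch concrete, here is a minimal Python sketch. The `lower_turkic` helper is hypothetical and is not part of the project's code; it assumes the caller already knows the text is Turkish or Azeri, which a language identifier normally does not know in advance.

```python
def lower_turkic(text: str) -> str:
    """Hypothetical helper: lowercase using Turkish/Azeri casing rules.

    Python's str.lower() is locale-independent, so the two dotted/dotless
    pairs are mapped by hand before the generic lowercasing runs.
    """
    return (
        text.replace("\u0130", "i")   # İ (U+0130) -> i (U+0069)
            .replace("I", "\u0131")   # I (U+0049) -> ı (U+0131)
            .lower()
    )


default_result = "LARI".lower()       # locale-independent lowercasing
turkic_result = lower_turkic("LARI")  # Turkish/Azeri-aware lowercasing

print(default_result, turkic_result)
```

The default result ends in a dotted i, while the Turkic-aware result ends in the dotless ı that the Turkish n-grams would contain.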
mihnita changed the title from "The code does lowercase, but that is a locale sensitive operation." to "fun-langid: The code does lowercase, but that is a locale sensitive operation." on Sep 25, 2023
Ah, this is indeed an annoying issue :) Thanks for pointing it out! I'm inclined not to worry about it, since this will probably not have a large effect on the accuracy of the model, and the goal of this model is simplicity over accuracy in any case.