CHINESE detect error #202
Comments
This text is also misdetected as JAPANESE; the real language is CHINESE.
Hi @simplew2011, thanks for reaching out to me. Yes, I know about this problem. The library's rule engine is a bit too sensitive and tends to classify Chinese text as Japanese too often. These two languages are the most difficult ones to distinguish correctly. Do you know any heuristics which can reliably tell whether a text is Chinese or Japanese?
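One commonly cited heuristic, sketched here purely as an illustration and not as the library's actual rule engine: hiragana and katakana letters are used to write Japanese and not Chinese, so their presence is a strong signal for Japanese. Text made up only of Han characters stays ambiguous, which matters most for very short inputs. The function name `guess_cjk_language` and the labels it returns are hypothetical.

```python
import re

# Hiragana letters (U+3041-U+3096) and katakana letters (U+30A1-U+30FA).
# Punctuation-like code points in those blocks (e.g. the middle dot U+30FB)
# are deliberately left out, since they can also show up in Chinese text.
KANA = re.compile(r"[\u3041-\u3096\u30a1-\u30fa]")

# Basic CJK Unified Ideographs block, shared by Chinese and Japanese.
HAN = re.compile(r"[\u4e00-\u9fff]")


def guess_cjk_language(text: str) -> str:
    """Hypothetical helper: coarse Chinese/Japanese guess from the script alone."""
    if KANA.search(text):
        # Any kana letter points to Japanese.
        return "JAPANESE"
    if HAN.search(text):
        # Han characters only: could be Chinese or kanji-only Japanese.
        return "AMBIGUOUS"
    return "UNKNOWN"
```

Under this sketch, a kanji-only string is reported as ambiguous rather than forced into either language, so the check could only serve as a pre-filter in front of a statistical model, not as a replacement for one.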
Conversely, I have found plenty of sample texts that were identified as Chinese instead of Japanese.
Building on @romiras's comment above, would it help if we provided an annotated set of texts that the library gets wrong between Chinese and Japanese, in either direction? If so, how many items would make such a set useful?
I often deal with a lot of very short translations and try to verify whether they are valid for a particular language. For example, "即時" could be Japanese or Chinese; both are valid (with maybe slightly different meanings). But it is detected as Chinese with a confidence of 1.0, while Japanese gets 0.
output:
real language is CHINESE