detect_multiple_languages_of() does not work at all for mixed English, Chinese and Japanese #219

Open
xuancong84 opened this issue Jan 27, 2024 · 3 comments


xuancong84 commented Jan 27, 2024

Sentences like the following reproduce the problem:

The name of that celebrity is 王菲
Everything is classified as English. You can try any Chinese name after any English prefix sentence; the result is always wrong.

这首歌的名字叫 ロミオとシンデレラ
The name of the song is ロミオとシンデレラ
Everything is classified as Japanese. FYI, the Japanese character set does include almost the entire traditional Chinese character set; however, a person or a language model can easily tell the boundary between the two chunks, especially when the first chunk is in English.

@ryanheise

I've found that adding some combination of superfluous languages to the language detector can randomly get it to work. In my case, adding French and German helped it detect Japanese and English better (though not when I tested your examples).

In your first example, the Chinese name might be too short, but in theory I think this ought to be a solvable problem because the character sets of English and Chinese are completely different. That alone ought to give high confidence even on the shortest substrings.
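
A minimal sketch of that idea, using only Python's standard library and not lingua's API. The names `script_of` and `split_by_script` are hypothetical, and the script labels are a coarse approximation derived from Unicode character names. It only finds script boundaries; it does not decide which language a run is in.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Return a coarse script label for one character ("" for neutral ones)."""
    if not ch.isalpha():
        return ""  # spaces, digits, punctuation: attach to the current run
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")):
        return "HAN"
    if name.startswith(("HIRAGANA", "KATAKANA")):
        return "KANA"
    if name.startswith("LATIN"):
        return "LATIN"
    return "OTHER"


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (script, substring) runs at every change of script."""
    runs: list[list[str]] = []  # each item is [script, characters]
    for ch in text:
        script = script_of(ch)
        if runs and (script == "" or runs[-1][0] in ("", script)):
            if runs[-1][0] == "":
                runs[-1][0] = script  # first scripted character names the run
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chars.strip()) for script, chars in runs]


for sentence in (
    "The name of that celebrity is 王菲",
    "这首歌的名字叫 ロミオとシンデレラ",
    "The name of the song is ロミオとシンデレラ",
):
    print(split_by_script(sentence))
```

Under these assumptions, the first sentence splits into a LATIN run and a HAN run even though the name is only two characters long. Real Japanese text interleaves kanji and kana, so a HAN run on its own still needs a statistical detector to choose between Chinese and Japanese; the script boundary only narrows down where to apply it.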

@juntaosun

You could try LangSegment:

https://huggingface.co/spaces/sunnyboxs/LangSegment

@ryanheise

@juntaosun That looks interesting. Have you also considered implementing your algorithm on top of lingua?

By the way, you might want to update your example to use "." at the end of the sentence rather than "。" since that is what's used in practice for horizontal text (although your algorithm works fine with both).
