[quality] Long words in zh-hans model (20198 suggested changes) #874

Open
peterburk opened this issue Jan 10, 2025 · 0 comments

Thank you for making budoux! I've been using it actively for Chinese (Traditional), Chinese (Simplified), Japanese, and Thai. It's very fast, and I really appreciate your work on it!

Input: UNv1.0.en-zh.zh

Process:
cat "/Users/peter/Downloads/budoux-main/UNv1.0.en-zh.zh" | python3 budoux/main.py -m 'budoux/models/zh-hans.json' > "/Users/peter/Downloads/budoux-main/UNv1.0.en-zh.zhSpaced.txt"

Expected output (sample):

基 皮亚 克 土著马 赛 群体 争取 生存 计划
释放 利比亚国民 阿卜杜勒 巴塞特
波斯尼亚 - 克罗地亚 - 塞尔维亚
( 阿波 斯托 洛斯安 德 列 亚斯 角)

Expected output (full):
UNv1.0.en-zh.zhSpacedWordsOver5CharactersSpaced.txt

Expected output is built using a development copy of https://pingtype.github.io/

Actual output (sample):

基皮亚克土著马赛群体争取生存计划
释放利比亚国民阿卜杜勒巴塞特
波斯尼亚-克罗地亚-塞尔维亚
(阿波斯托洛斯安德列亚斯角)

Actual output (full):
UNv1.0.en-zh.zhSpacedWordsOver5Characters.txt
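
For anyone who wants to reproduce the "words over 5 characters" filter on their own output, here is a rough sketch. It assumes chunks are separated by whitespace in the segmented text and counts only characters in the basic CJK Unified Ideographs range. `long_chunks` is a hypothetical helper written for this issue, not part of budoux.

```python
import re

# Matches one character in the basic CJK Unified Ideographs block.
# Assumption: this range is sufficient for the UN corpus sample shown above.
HAN = re.compile(r"[\u4e00-\u9fff]")


def long_chunks(line, limit=5):
    """Return whitespace-separated chunks containing more than `limit` Han characters.

    Hypothetical helper for illustration; punctuation such as '-' is not counted.
    """
    return [chunk for chunk in line.split() if len(HAN.findall(chunk)) > limit]


# Samples copied from this issue.
actual = "基皮亚克土著马赛群体争取生存计划"
expected = "基 皮亚 克 土著马 赛 群体 争取 生存 计划"

print(long_chunks(actual))    # the whole unsegmented line is reported as one long chunk
print(long_chunks(expected))  # no chunk exceeds the limit once the line is segmented
```

Running this over each line of the model output and keeping the lines where the result is non-empty should give a list comparable to the attached "WordsOver5Characters" file, though the exact filtering used for that file may differ.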

Please message me if you have any more questions, and I'd be happy to advise. I also have more data for long words (over 5 characters) in Japanese, Chinese (Traditional), and Thai - please comment here or email me when you're working on this issue, and I can collaborate with you more :)
