Some languages, in particular East Asian ones, don't (just) use spaces to separate words, so the standard NLTK tokeniser doesn't work for them. This is likely an issue for many languages, but the East Asian ones are probably the most pressing because they represent a large number of people online.
Support for Chinese tokenisation has been added using jieba. There are other languages to consider; here is a nice overview. However, the libraries listed there all have dependencies that make them difficult to install, so more work is needed to figure out how best to make them installable alongside 4CAT.
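For reference, a minimal sketch of why whitespace-based tokenisation fails here and what jieba-based segmentation looks like (the example sentence is illustrative; jieba must be installed separately, e.g. via pip):

```python
import jieba

# Illustrative sentence: "I came to Tsinghua University in Beijing"
text = "我来到北京清华大学"

# A whitespace-based tokeniser finds no word boundaries at all,
# so the whole sentence comes back as a single "token":
print(text.split())      # ['我来到北京清华大学']

# jieba segments the sentence into words using its built-in dictionary:
print(jieba.lcut(text))  # ['我', '来到', '北京', '清华大学']
```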