Some languages, in particular East Asian ones, don't (just) use spaces to separate words, so the standard NLTK tokeniser doesn't work for them. This is likely an issue for many languages, but the East Asian ones are probably the most pressing because they represent a large number of people online.
Support for Chinese tokenisation has been added using jieba. There are other languages to consider; here is a nice overview. However, the libraries listed there all have dependencies that make them difficult to install, so more work is needed to figure out how best to make them installable alongside 4CAT.
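For reference, a minimal sketch of why whitespace-based tokenisation fails here and what jieba-based segmentation looks like (the example sentence is illustrative; jieba must be installed separately, e.g. via pip):

```python
import jieba

# Illustrative sentence: "I came to Tsinghua University in Beijing"
text = "我来到北京清华大学"

# A whitespace-based tokeniser finds no word boundaries at all,
# so the whole sentence comes back as a single "token":
print(text.split())      # ['我来到北京清华大学']

# jieba segments the sentence into words using its built-in dictionary:
print(jieba.lcut(text))  # ['我', '来到', '北京', '清华大学']
```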