Support empty alphabet, for simple CJK word segmentation #75
Comments
Solves #45 (“Consider alphanumeric characters to be part of the vocabulary”).
Surely this is as trivial as adding a …
What if someone wants only some chars to be unknown-tokenizable?
I guess. I say this should be an opt-out, then. Default should be to have as much as possible in the alphabet, and people can then opt out with something like …
Definitely opt-out, which is why I suggested …
The last binary break prepared for this eventuality: https://github.com/apertium/lttoolbox/blob/master/lttoolbox/compression.h#L29 – we can add features without breaking existing files. But yeah, a command-line flag for now would work.
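(For illustration only, a minimal sketch of that kind of forward-compatibility scheme, with invented names rather than the actual contents of compression.h: the binary header carries a bitmask of optional features, so an old reader rejects exactly those files that use a feature it doesn't understand.)

```cpp
// Invented names, for illustration only (not the real compression.h):
// a binary header with a feature bitmask, so that a reader rejects a
// file only if it uses a feature bit the reader does not know about.
#include <cstdint>
#include <istream>
#include <stdexcept>

enum Feature : uint64_t {
  F_EMPTY_ALPHABET   = 1ull << 0,  // hypothetical future feature
  F_WEIGHTED_ENTRIES = 1ull << 1,  // hypothetical future feature
};

constexpr uint64_t KNOWN_FEATURES = F_EMPTY_ALPHABET | F_WEIGHTED_ENTRIES;

uint64_t readFeatures(std::istream& in) {
  uint64_t features = 0;
  in.read(reinterpret_cast<char*>(&features), sizeof(features));
  if (features & ~KNOWN_FEATURES) {
    throw std::runtime_error("file uses features unknown to this version");
  }
  return features;  // known feature bits; absent bits keep old behaviour
}
```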
Regarding #52, isn't this what the …?
oh yeah :) @Fred-Git-Hub ↑ would this cover your use-case? With … I get … (See http://wiki.apertium.org/wiki/Inconditional#inconditional for more info.)
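(For readers following along, a minimal sketch of that approach, with invented entries rather than @Fred-Git-Hub's actual data: an `inconditional` section matches its entries regardless of surrounding context, and the alphabet is left empty. Whether this still tokenizes unlisted characters cleanly after #52 is exactly what this issue is about.)

```xml
<!-- sketch.dix: invented two-entry example; empty alphabet, and an
     inconditional section so matches don't depend on word boundaries -->
<dictionary>
  <alphabet></alphabet>
  <sdefs>
    <sdef n="n" c="noun"/>
  </sdefs>
  <section id="main" type="inconditional">
    <e><p><l>山</l><r>山<s n="n"/></r></p></e>
    <e><p><l>山水</l><r>山水<s n="n"/></r></p></e>
  </section>
</dictionary>
```

```console
$ lt-comp lr sketch.dix sketch.bin
$ echo "山水山" | lt-proc sketch.bin   # should pick 山水 (longest match), then 山
```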
well, the problem is that anything without an analysis in … so then you'd have to make sure to put every symbol you might expect to appear before other symbols into …
Aha, got it @unhammer, that makes sense. In general I think that in order to deal with this properly we need (1) weights in the lexicon, and (2) a special function of lttoolbox that does segmentation... maybe something like the compounding functionality.
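(On (1): if I remember right, newer lt-comp already accepts per-entry weights via a `w` attribute on `<e>` – treat that as an assumption to verify. Weights would at least give a segmenter a unigram preference between competing splits:)

```xml
<!-- Assumes lt-comp's entry-weight support (w attribute on <e>);
     invented entries, lower weight = preferred analysis -->
<section id="main" type="standard">
  <e w="1.0"><p><l>中国</l><r>中国<s n="np"/></r></p></e>
  <e w="3.0"><p><l>中</l><r>中<s n="n"/></r></p></e>
  <e w="3.0"><p><l>国</l><r>国<s n="n"/></r></p></e>
</section>
```

But per-entry weights like these are context-free, which is exactly the limitation raised in the next comments.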
Yeah, I do have the feeling plain LRLM (left-to-right longest match) should eventually hit something it can't handle, but I wonder how far you can get with what @Fred-Git-Hub had going (if the language was mostly single-character words, it should be possible without any new features). Languages like Thai would need something more, but the current weights and compounding features don't look at context – wouldn't context be needed? Even the simple Norwegian case of …
Yeah, either you'd be stuck with a unigram model or you'd need to incorporate n-gram information somehow.
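(To make the unigram case concrete, here's a standalone sketch, in no way lttoolbox code: given per-word costs such as negative log probabilities, dynamic programming picks the cheapest split. Nothing here can express the context-dependent preferences discussed above – that's where n-grams would come in.)

```cpp
// Standalone sketch (not lttoolbox): unigram segmentation by dynamic
// programming over per-word costs (e.g. negative log probabilities).
#include <iostream>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Returns the cheapest segmentation of `text` into lexicon words.
// Operates on raw code units for brevity; real CJK input would want
// proper code-point handling, but exact UTF-8 key matches still work.
std::vector<std::string> segment(const std::string& text,
                                 const std::map<std::string, double>& lexicon) {
  const double INF = std::numeric_limits<double>::infinity();
  const size_t n = text.size();
  std::vector<double> cost(n + 1, INF);  // cost[i]: best cost of text[0..i)
  std::vector<size_t> back(n + 1, 0);    // back[i]: start of the last word
  cost[0] = 0.0;
  for (size_t i = 1; i <= n; ++i) {
    for (size_t j = 0; j < i; ++j) {
      auto it = lexicon.find(text.substr(j, i - j));
      if (it != lexicon.end() && cost[j] + it->second < cost[i]) {
        cost[i] = cost[j] + it->second;
        back[i] = j;
      }
    }
  }
  // If no full segmentation exists, this falls back to one big chunk.
  std::vector<std::string> words;
  for (size_t i = n; i > 0; i = back[i]) {
    words.insert(words.begin(), text.substr(back[i], i - back[i]));
  }
  return words;
}

int main() {
  // Invented toy lexicon: the two-character word is cheaper than
  // analysing its characters separately.
  std::map<std::string, double> lexicon = {{"ab", 1.0}, {"a", 2.0}, {"b", 2.0}};
  for (const auto& w : segment("aab", lexicon)) std::cout << w << ' ';
  std::cout << '\n';  // prints: a ab
}
```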
Before 944ed25 / #52 it was possible to use monodix files with an empty `<alphabet>` in order to segment into all known analyses (presumably symbols without analyses were output as blanks). But after the change, this is no longer possible. See 944ed25#commitcomment-35679780 for test cases for Chinese/Japanese/Korean. Maybe the `iswalnum` test could be turned off by a flag, e.g. `lt-proc --no-implicit-alphabet`?
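(A sketch of what such a flag might gate – hypothetical names, and the flag doesn't exist yet; the real check sits somewhere in lt-proc's handling of input characters:)

```cpp
// Hypothetical sketch of --no-implicit-alphabet (invented names; not
// actual lttoolbox code). Since #52, alphanumeric characters count as
// alphabetic even when <alphabet> doesn't declare them; the flag would
// restore the declared-characters-only behaviour.
#include <cwctype>
#include <set>

struct Options {
  bool no_implicit_alphabet = false;  // set by the proposed flag
};

bool isAlphabetic(wchar_t c, const std::set<wchar_t>& declared,
                  const Options& opt) {
  if (declared.count(c) > 0) return true;        // declared in <alphabet>
  if (opt.no_implicit_alphabet) return false;    // pre-#52 behaviour
  return iswalnum(static_cast<wint_t>(c)) != 0;  // #52's implicit alphabet
}
```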