Expand (some) ICU character classes in regex_compiler? #160

Open
unhammer opened this issue Sep 22, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@unhammer
Member

At the very least, it would be nice to support the lowercase and uppercase classes, so that we could write entries like

<e> <re>\\p{Lu}\\p{Ll}+</re><par n="guess_np"/> </e>

and whatnot.

If we do the "simple" thing and just expand the classes the way ranges are expanded in https://github.com/apertium/lttoolbox/blob/acx-spaces/lttoolbox/regexp_compiler.cc, we get quite a lot of transitions: https://www.compart.com/en/unicode/category/Ll (probably an unreliable source) claims 2155 lowercase letters. But maybe that's OK if we keep regexes in their own <section>; more research needed.
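To get a number that does not depend on a third-party site, the size of each class can be counted directly. This is only a rough sketch: it uses Python's standard `unicodedata` module as a stand-in for ICU, so the figures reflect whatever Unicode version the Python build ships, and `chars_in_category` and `to_ranges` are names made up for this example. It prints both the number of characters and the number of contiguous ranges, since that is roughly what an expansion in the compiler would have to deal with.

```python
import sys
import unicodedata


def chars_in_category(cat):
    """Return every code point whose General_Category equals `cat`."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == cat]


def to_ranges(code_points):
    """Collapse a sorted list of code points into inclusive (start, end) ranges."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the current range by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Start a new range.
            ranges.append((cp, cp))
    return ranges


lower = chars_in_category("Ll")
upper = chars_in_category("Lu")
print("Unicode", unicodedata.unidata_version)
print("Ll:", len(lower), "characters in", len(to_ranges(lower)), "ranges")
print("Lu:", len(upper), "characters in", len(to_ranges(upper)), "ranges")
```

If every character becomes its own transition, the character count is the relevant cost; the range count only matters if the FST format could represent ranges directly, which is an assumption here.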

Alternatively, could/should we insert a special symbol and have fst_processor treat it specially? (Any tools operating on the compiled FSTs, like lt-trim or lt-print|hfst-txt2fst|hfst-stuff, would just have to treat it opaquely.)
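A toy illustration of the special-symbol idea, to make the trade-off concrete. Everything here is hypothetical: the `<p:Lu>` / `<p:Ll>` labels, the dict-based automaton and the function names are invented for this sketch and do not correspond to anything in lttoolbox; Python's `unicodedata` again stands in for ICU. The point is only that the class test moves from compile time to match time, so the automaton stays tiny.

```python
import unicodedata

# Hypothetical class symbols; nothing like this exists in lttoolbox today.
CLASS_SYMBOLS = {"<p:Lu>": "Lu", "<p:Ll>": "Ll"}


def symbol_matches(symbol, ch):
    """True if transition label `symbol` accepts input character `ch`."""
    cat = CLASS_SYMBOLS.get(symbol)
    if cat is not None:
        return unicodedata.category(ch) == cat  # class test at match time
    return symbol == ch  # ordinary literal symbol


def accepts(transitions, start, finals, text):
    """Run a small NFA given as {state: [(symbol, target), ...]}."""
    states = {start}
    for ch in text:
        states = {dst for s in states
                  for sym, dst in transitions.get(s, [])
                  if symbol_matches(sym, ch)}
        if not states:
            return False
    return bool(states & finals)


# \p{Lu}\p{Ll}+ as three states and three transitions
GUESS_NP = {0: [("<p:Lu>", 1)], 1: [("<p:Ll>", 2)], 2: [("<p:Ll>", 2)]}
print(accepts(GUESS_NP, 0, {2}, "Oslo"))
```

Tools that only copy or print the automaton never call `symbol_matches`, which is what treating the symbol opaquely would mean in this sketch.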

@TinoDidriksen
Member

Code can ask ICU for the list of characters, but then the finished FST will change depending on which version of ICU (and thus Unicode) it was built with. It could encode the version in the file and redo the expansion at runtime if the version differs.
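A minimal sketch of the version check, assuming a made-up header layout: the `UVER` tag and the three-byte encoding are invented for illustration and are not part of the real lttoolbox binary format. Python's `unicodedata.unidata_version` plays the role of the version that ICU would report.

```python
import struct
import unicodedata

MAGIC = b"UVER"  # hypothetical tag, not part of the real lttoolbox format


def pack_version(version):
    """Encode 'major.minor.patch' as the tag followed by three bytes."""
    major, minor, patch = (int(p) for p in version.split("."))
    return MAGIC + struct.pack("BBB", major, minor, patch)


def unpack_version(blob):
    """Decode the header written by pack_version back into a string."""
    if blob[:4] != MAGIC:
        raise ValueError("no Unicode version header")
    return ".".join(str(b) for b in struct.unpack("BBB", blob[4:7]))


def needs_reexpansion(blob, runtime_version=unicodedata.unidata_version):
    """True if the file was built against a different Unicode version."""
    return unpack_version(blob) != runtime_version


header = pack_version(unicodedata.unidata_version)
print(unpack_version(header), needs_reexpansion(header))
```

A loader could call something like `needs_reexpansion` once when opening the file and only rebuild the class transitions when it returns true.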

@unhammer
Member Author

Classes typically only get wider, so that sounds fine by me. I don't see a need for FSTs to be perfectly reproducible when built against differing library versions, though encoding the ICU version in the file sounds like a good idea anyway.

@mr-martian
Contributor

The current binary format for alphabets makes some assumptions about alphabet symbols (see apertium/apertium-yid#3 (comment)) that I think would make having non-expanded class symbols almost certainly require a file version bump (though I suppose you'd get that from including the ICU version anyway...).

@unhammer added the enhancement label on Sep 26, 2022