Expand (some) ICU character classes in regex_compiler? #160

unhammer · 2022-09-22T09:28:39Z

At the very least, getting Lower and Upper ranges would be nice, so we could write entries like

<e> <re>\p{Lu}\p{Ll}+</re><par n="guess_np"/> </e>

and whatnot.

If we do the "simple" thing and expand the classes the same way ranges are expanded in https://github.com/apertium/lttoolbox/blob/acx-spaces/lttoolbox/regexp_compiler.cc, we get quite a lot of transitions: https://www.compart.com/en/unicode/category/Ll (probably unreliable source) claims 2155 lowercase letters. But maybe it's OK if we keep regexes in their own <section>; more research needed.
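To get a feel for the size of a naive expansion without trusting a third-party count, here is a rough sketch. It uses Python's stdlib `unicodedata` as a stand-in for ICU, so the numbers depend on the Unicode version bundled with the interpreter, and `category_ranges` is a made-up helper name, not anything in lttoolbox.

```python
import sys
import unicodedata


def category_ranges(category):
    """Return inclusive (start, end) code point ranges whose
    General_Category equals `category` (e.g. "Ll" or "Lu")."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp  # open a new run
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current run
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


ll = category_ranges("Ll")
total = sum(end - start + 1 for start, end in ll)
print(f"Unicode {unicodedata.unidata_version}: "
      f"{total} Ll code points in {len(ll)} contiguous ranges")
```

If every code point becomes its own transition, `total` is the figure that matters; the range count only shows how fragmented the class is.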

Alternatively, could/should we insert a special symbol and have fst_processor treat it specially? Any tools operating on the compiled FSTs (lt-trim, or lt-print|hfst-txt2fst|hfst-stuff) would just have to treat it opaquely.
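A toy illustration of the special-symbol idea, purely as a sketch: the symbol names `<Lu>` and `<Ll>`, the transition table layout, and the use of Python's `unicodedata` instead of ICU are all assumptions for illustration and do not reflect how fst_processor or the binary format actually work.

```python
import unicodedata

# Hypothetical opaque class symbols stored in the transducer alphabet.
CLASS_SYMBOLS = {"<Lu>": "Lu", "<Ll>": "Ll"}


def symbol_matches(symbol, ch):
    """A class symbol matches by General_Category at runtime;
    any other symbol matches only the identical character."""
    category = CLASS_SYMBOLS.get(symbol)
    if category is not None:
        return unicodedata.category(ch) == category
    return symbol == ch


# Hand-built automaton for \p{Lu}\p{Ll}+ : three transitions in total,
# instead of one transition per expanded code point.
TRANSITIONS = {
    0: [("<Lu>", 1)],
    1: [("<Ll>", 2)],
    2: [("<Ll>", 2)],
}
FINAL = {2}


def accepts(text):
    states = {0}
    for ch in text:
        states = {
            target
            for state in states
            for symbol, target in TRANSITIONS.get(state, [])
            if symbol_matches(symbol, ch)
        }
        if not states:
            return False
    return bool(states & FINAL)
```

With this table, `accepts("Oslo")` holds while `accepts("oslo")` and `accepts("O")` do not. The trade-off is that the category lookup moves from compile time to match time.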

TinoDidriksen · 2022-09-22T09:32:20Z

Code can ask ICU for the list of characters, but then the finished FST will change depending on which version of ICU (and thus Unicode) it was built with. It could encode the version in the file and redo the expansion at runtime if the version differs.
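Roughly what I mean, as a sketch only: `CompiledClass`, `compile_class` and `load_class` are hypothetical names, the record is not the real lttoolbox file format, and Python's `unicodedata.unidata_version` stands in for asking ICU which Unicode version it implements.

```python
import sys
import unicodedata
from dataclasses import dataclass


def expand(category):
    """All code points whose General_Category equals `category`."""
    return frozenset(
        cp for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    )


@dataclass(frozen=True)
class CompiledClass:
    category: str          # e.g. "Lu"
    unicode_version: str   # version the expansion was built with
    members: frozenset     # expanded code points


def compile_class(category):
    """Build time: expand the class and record the Unicode version used."""
    return CompiledClass(category, unicodedata.unidata_version, expand(category))


def load_class(stored):
    """Load time: reuse the stored expansion if versions match,
    otherwise redo the expansion against the running library."""
    if stored.unicode_version == unicodedata.unidata_version:
        return stored
    return compile_class(stored.category)
```

Here a file built against an older Unicode version still loads, it just pays for one re-expansion per class at startup.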

unhammer · 2022-09-22T09:36:40Z

Classes typically only get wider, so that sounds fine by me. I don't see a need for FSTs to be perfectly reproducible when built against differing libraries, though encoding the ICU version in the file sounds like a good idea anyway.

mr-martian · 2022-09-24T01:05:42Z

The current binary format for alphabets makes some assumptions about alphabet symbols (see apertium/apertium-yid#3 (comment)) that I think would make having non-expanded class symbols almost certainly require a file version bump (though I suppose you'd get that from including the ICU version anyway...).

unhammer added the enhancement New feature or request label Sep 26, 2022
