Unicode troubles? #349

sixtyfive · 2021-12-01T13:21:03Z

$ ack '[ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω]' file.xml

74:         <String CONTENT="(2) AL-MUKHTAṢAR FĪ ʿILM AL-ISTIBDĀL, by AL-KĀ-"
91:         <String CONTENT="FIYĀJĪ."
108:        <String CONTENT="[Another tract on the same subject; foll. 9—16.]"
142:        <String CONTENT="Foll. 16. 17·7 × 13·3 cm. Clear scholar’s naskh."
159:        <String CONTENT="Copyist, Yaḥyā b. ʿAbd al-Ghanī b. ʿAlī al-Imām."
176:        <String CONTENT="Dated 13 Jumādā II 870 (30 January 1466)."
210:        <String CONTENT="(1) TAʾSĪS AL-NAẒĀʾIR, by Abū Zaid ʿAbd (ʿUbaid) Allāh b."
227:        <String CONTENT="ʿUmar b. ʿĪsā AL-DABŪSĪ (d. 430/1039)."
312:        <String CONTENT="(2) AL-IḤKĀM FĪ MAʿRIFAT AL-AIMĀN WAʾL-"
329:        <String CONTENT="AḤKĀM, by AL-KĀFIYĀJĪ (d. 879/1474)."
380:        <String CONTENT="Dated 4 Ramaḍān 866 (2 June 1462)."
414:        <String CONTENT="(3) IJĀRAT AL-IQṬĀʿ, by Zain al-Dīn Abu ʾl-Faḍl al-Qāsim b."
431:        <String CONTENT="ʿAbd Allāh B. QUṬLŪBUGHĀ al-Ḥanafī al-Sūdūnī (d. 879/1474)."
499:        <String CONTENT="Foll. 87. 17·7 × 13 cm. Clear scholar’s naskh."

I can't figure out what the output has to do with the regular expression here. The highlighted characters are all letters with diacritics, whereas my intention was to search for all Greek letters (I also tried \p{Greek}, which returned no results at all). What am I missing here?


petdance · 2021-12-01T17:39:00Z

I'm sorry. ack does not handle Unicode very well at all. I'm tagging this issue as Unicode and maybe some day we can somehow address it.

n1vux · 2021-12-01T21:59:08Z

While Perl and most other modern programming languages allow subroutine and variable names to be Unicode, and thus in the natural language of the coder, usage seen is nearly uniformly Latin 1 or ASCII, matching the alphabet and often the language of the keywords.

Ack is unapologetically defined as a coder's search tool for searching collections of code files, even though some folks (including myself, one of Andy Petdance's associate devs) use it off-label to search data, both structured and unstructured. (I have a large collection of OCR text, indexed with swish-e but searched internally with ack; but a paragrep mode with a scalable would surely be useful.) This is why it has --perl filetype shortcuts. As such, support of Unicode has not had the priority it would have were this a tool hypothetically defined as for searching data and only incidentally good for searching code.

If one has an "older" perl (5.28 or earlier), there is a workaround that tricks Ack into using Perl's native Unicode support and sort of mostly works.
#222 (comment)
(But alas sysread on a Unicode filehandle was deprecated in Perl 5.24-5.28 and is fatal in 5.30.)

n1vux · 2021-12-01T22:10:37Z

The other issue with Greek letters in particular is that they appear at multiple Unicode codepoints with different semantics: there's Math Greek, with Bold etc. variants; there's Greek Greek (upper and lower); there are Cyrillic and Armenian letters that look the same as Greek letters; and maybe more. https://codepoints.net/U+03C0 lists 20 related characters for π, and more than thrice as many "confusables".

(And there's no guarantee (unless you have wonderful provenance!) that a document/file uses the Nu ν or Omicron Ο codepoint from the semantically correct sequence (Unicode Script, Category, & Block). I have no clue which ν my ⏹*n X-compose sequence inserted here, or if GH will swap it! Documents may even sloppily use Latin O where ο should have been used.)

sixtyfive · 2021-12-02T09:32:55Z

there's no guarantee (unless you have wonderful provenance!) that a document/file uses the Nu ν or Omicron Ο codepoint from the semantically correct sequence (Unicode Script, Category, & Block)

Except for when your OCR engine takes a whitelist of allowed codepoints :-)

Ack is unapologetically defined as a coder's search tool for searching collections of code files

For what it's worth, even though the example was also of an OCR file, I do have collections of code files with non-ASCII characters in them, both in the comments and in the code itself. Such is the nature of working with natural language. I'm aware that digital humanities is somewhat of a niche phenomenon, but we're still coders nonetheless.

But hey, this is your tool, I just happen to love using it, and have (yesterday for the first time, by the way) stumbled upon something unexpected.

n1vux · 2021-12-02T17:31:17Z

Except for when your OCR engine takes a whitelist of allowed codepoints

That could count as "wonderful provenance" 😄 .

digital humanities

indeed.
(One of the committee that spun XML off of SGML was a Digital Humanities academic. A dear friend.)

Comments outside the Latin-1 alphabet will be more common than identifiers, whether they come from digital humanities or from ^regular^ developers who just write their comments in the language they're most expressive in, writing for themselves and not for far-away customers.
If only for comments, it would be good to support Unicode.

If you look at the linked Unicode tickets, you'll see that one of the requirements for doing Unicode right will be multiplying test cases and test data. Digital Humanities / Modern Languages talent might be useful when (I say when, not if, hopefully) we get to it. Since the commandline hack mostly worked (up to 5.28), I expect redoing the testing N times is most of the work, but there's some architectural choice to make on how to handle mix-and-match files.

hftf · 2021-12-03T04:02:23Z

$ ack '[ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω]' file.xml
I can't figure out what the output has to do with the regular expression here. The highlighted characters are all letters with diacritics, whereas my intention was to search for all Greek letters (I also tried \p{Greek}, which returned no results at all). What am I missing here?

While I know this is not technically a support forum, I will try to directly reply to the filer's question with a simple to understand explanation and an easy practical workaround solution, since I've been in the exact same boat before and for a while. I leave it to others to handle this thread in terms of being a bug report for Ack.

The common way Unicode data is stored is via UTF-8 encoding, in which every character outside ASCII is represented by a sequence of multiple bytes. For example, the character α (U+03B1 GREEK SMALL LETTER ALPHA) is represented by the two bytes CE B1 in the UTF-8 encoding.
Regular expression engines have very different implementations. For example, some engines support a mechanism like \p{Greek} for matching any character in a particular Unicode class, while other engines do not understand it at all, or even use \p to mean something completely different. Ack likely doesn't support \p unfortunately.
Currently in Ack, a multibyte Unicode character inside of a character class seems to behave as a character class over the character's individual bytes. So think of [αβ] as [\xCE\xB1\xCE\xB2], a character class over four half-characters (three unique); then it's no wonder this pattern would match (the first byte/the first half of) γ!
Therefore, an easy workaround is replacing the character class [αβ...] with a disjunction of sequences (α|β|...).
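
To make the byte-level explanation above concrete, here is a minimal sketch in Python (not Perl, and not Ack's actual code). It assumes that Python's re module applied to bytes is a fair stand-in for a tool that reads both the pattern and the input as raw bytes. The variable names are made up for illustration.

```python
import re

# Stand-in for a tool that treats both pattern and input as raw bytes.
byte_class = re.compile("[αβ]".encode("utf-8"))  # same as b"[\xce\xb1\xce\xb2]"
byte_alt = re.compile("(α|β)".encode("utf-8"))   # alternation of whole 2-byte sequences
gamma = "γ".encode("utf-8")                       # b"\xce\xb3"

class_hit = byte_class.search(gamma)  # hits the lone lead byte 0xCE of γ
alt_hit = byte_alt.search(gamma)      # no hit: neither full sequence occurs in γ
str_hit = re.search("[αβ]", "γ")      # no hit: decoded text is compared per codepoint

print(class_hit, alt_hit, str_hit)
```

The same effect would explain the output in the original report: the Latin letters with diacritics are multi-byte in UTF-8 too, and some of their bytes coincide with bytes of the Greek letters in the class.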

n1vux · 2021-12-03T18:14:55Z

FWIW, there is a support forum - ack-users mailing list.

Minor correction: Ack uses the Perl RE engine, in which \p is supported.
(Ack's only differences from Perl RE are just prohibitions on unsafe features, or failures in our input processing. One should be able to use the full documented RE features of whichever Perl you invoke Ack with, including (?xism: ) provided you managed the shell escapes.)
\p isn't specifically unsupported in Ack.
But without Unicode input handling, \p{Greek} will not be useful; as noted in #222, to enable \p{Han} or \p{Greek} with Ack, one needs to force filehandles to UTF-8, which isn't yet available as an Ack command flag option. The following hack workaround warns in 5.24-5.28 and fails with Perl 5.30+, so is NOT a longterm workaround, but if you have PerlBrew or an older Perl, you can use it:

$ perlbrew exec --with perl-5.24.2@class-std perl  -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Han}'  bugs/han.txt
sysread() is deprecated on :utf8 handles at /home/wdr/bin/ack line 4894.
hello 世界

$ perlbrew exec --with perl-5.24.2@class-std perl  -C '-Mopen IO=>":encoding(UTF-8)"' ~/bin/ack --noenv '\p{Greek}'  bugs/greek.txt
sysread() is deprecated on :utf8 handles at /home/wdr/bin/ack line 4894.
Ἄλκηστις
Ἄδμηθ', ὁρᾷς γὰρ τἀμὰ πράγμαθ' ὡς ἔχει,
λέξαι θέλω σοι πρὶν θανεῖν ἃ βούλομαι.
...

You are correct that [αβ...] considered as non-Unicode is going to do the wrong thing. Were that pattern inline in a Perl program, use utf8; at the top of the file would have it properly understood as UTF-8. But we're reading it from the shell commandline. So to use [αβ...] correctly, Ack would need to handle the commandline regex argument as Unicode (if any high bits marking extension bytes are present? or always?) as well as interpreting the input files as Unicode. That requires an additional code patch or workaround beyond the one that enables \p{Han}. Quite possibly utf8::upgrade($re); ... utf8::upgrade($buffer); .
Whether this can be always or automatically done as needed or whether it requires a --do-Unicode commandflag needs exploration.
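
As an illustration of the "decode both sides" idea only, here is a rough Python sketch. It is not a proposed Ack patch, and the helper name search_utf8 is hypothetical. It assumes both the commandline pattern and the input arrive as UTF-8 bytes and are decoded before matching.

```python
import re

def search_utf8(pattern_arg: bytes, line: bytes):
    """Decode the pattern argument and the input line as UTF-8, then match."""
    pattern = re.compile(pattern_arg.decode("utf-8"))
    return pattern.search(line.decode("utf-8"))

raw_pattern = "[αβ]".encode("utf-8")  # what the shell hands over: plain bytes

no_hit = search_utf8(raw_pattern, "γ".encode("utf-8"))   # γ is not in the class
hit = search_utf8(raw_pattern, "x β y".encode("utf-8"))  # β is in the class

print(no_hit, hit.group())
```

Once both sides are decoded, the character class works per codepoint rather than per byte. The open question from the comment above (always, or only behind a flag) is untouched by this sketch.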

(On a current Ubuntu, grep does correctly handle RE [αβ] and an implicitly UTF-8 Greek test file, and since in many modern programming languages, having identifiers, string data, and comments containing UTF-8 Greek, Han, etc is perfectly legal, this is presumed desirable behavior.)

n1vux · 2021-12-03T18:31:28Z

Additional note, the sysread incompatibility with the Unicode inputs workaround is only in our pre-check optimization which may be turned off with --passthru (which will result in non-highlighted lines printing).

n1vux · 2021-12-03T20:42:47Z

Additional aside re Perl RE engine and Unicode:

One cannot expect the RE [ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω] or the equivalent [Α-Ωα-ω] to match the "pre-composed" accented Unicode codepoints in a text such as ὑμῖν δέ, παῖδες, μητρὸς ἐκπεφυκέναι. (which is figuratively as well as literally Greek to me, found test data!). It will only match the unaccented characters (including those trailed by combining accents, but not the "pre-composed" ones with the accent built into the codepoint), so it matters which form your OCR or whatever is generating.
(Same problem in Latin-1, actually. [A-Za-z] will match the a in an a followed by a combining accent, but not á as a single Unicode codepoint. \w and \p{Letter} are your friends.)

Ref wikipedia Greek diacritics#Unicode

The simple RE will usually be enough to find lines containing at least one Greek letter. But if it is expanded to [Α-Ωα-ω]+ to find words, it won't match whole words with accented characters, which might be desired with -o or --output; it will only match the runs of unaccented characters. In that case, the accented pre-composed codepoints (NFC) are recognized as words nicely by \p{Greek}+, but a combining accent in de-composed text (NFD) breaks the word. Ugh: one has to move to ((?x: \p{Greek} | \p{Diacritic} )+)/ or ((?x: \p{Greek} | \p{gc:Mn} )+)/ to capture words of normalized-form-decomposed Greek.

(If the file is all Greek, one can just trust -o '\w+' to isolate the words but that won't reject English, French words. A lookahead to require the first word char to be Greek would heuristically make that mostly work, (?=\p{Greek})(\w+) , but would accept mixed alphabetic ΦW )

To handle these subtleties in, e.g., a Perl program, I would normalize the input to NFD or NFC, depending on which behavior is desired.
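
A small sketch of that normalization step, written in Python with the standard unicodedata module rather than Perl. The variable names are illustrative, and stripping the accents after decomposing is just one possible choice, assumed here for the sake of the example.

```python
import re
import unicodedata

word = "ὑμῖν"  # first word of the test text quoted above
nfc = unicodedata.normalize("NFC", word)  # accents built into the codepoints
nfd = unicodedata.normalize("NFD", word)  # base letters + combining accents

simple_class = re.compile("[Α-Ωα-ω]+")

# NFC: the accented letters fall outside the range, so only fragments match.
nfc_runs = simple_class.findall(nfc)
# NFD: base letters match, but each combining accent breaks the run.
nfd_runs = simple_class.findall(nfd)

# One option: decompose, then drop the combining marks before matching.
stripped = "".join(ch for ch in nfd if not unicodedata.combining(ch))
whole_word = simple_class.fullmatch(stripped)

print(nfc_runs, nfd_runs, stripped, whole_word is not None)
```

Whether dropping accents is acceptable depends on the search; for highlighting or -o style output one would still need to map matches back to the original text.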

Input on how Ack should handle Unicode is welcome.
(More such input may move it up the queue.)
( How soon we get to it will depend on having the right volunteer able to work the testing ... )

Should ack assume all files are NFD or NFC? I doubt it. Or trust that the input files selected are already in whichever of NFD or NFC makes sense for the given RE? Maybe. I do not expect Ack to ever guess correctly based on detecting sequences in the RE pattern and input files. Routinely converting all inputs to NFD (or NFC) whether needed or not is a non-starter: that makes it slower for all users to benefit a few. A --unicode=NFD|NFC option to request a specific normalization (that digital humanities users can put in .ackrc) might be possible, but at what cost?

I'm guessing we'll only ever support UTF-8. I've experimented a bit with 16- and 32-bit UTF BOMs, and while it's sometimes possible to detect a file format if it properly starts with a BOM, BOMs are hardly universally provided; and while I even provided a workaround to allow a collection of UCS-2/UTF-16 files to be searched, it isn't always practical.
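
For reference, BOM sniffing itself is simple; the hard part is everything without a BOM. Here is a minimal Python sketch using the BOM constants from the standard codecs module. The helper name sniff_bom is hypothetical and this is not how Ack does (or would do) it.

```python
import codecs

# Longer BOMs go first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(head: bytes):
    """Return an encoding name if head starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None  # no BOM: the encoding has to be guessed or assumed

utf16_sample = codecs.BOM_UTF16_LE + "α".encode("utf-16-le")
utf32_sample = codecs.BOM_UTF32_LE + "α".encode("utf-32-le")
print(sniff_bom(utf16_sample), sniff_bom(utf32_sample), sniff_bom(b"plain ascii"))
```

A file with no BOM simply yields None here, which is the common case the comment above is worried about.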

DabeDotCom · 2022-06-07T02:13:42Z

Input on how Ack should handle Unicode is welcome. (More such input may move it up the queue.)

Sorry to bump a six-month-old thread, but I arrived here because I was astonished to discover that this didn't DWIM:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | ack 'w\Xrd'

To be fair, neither did:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | pcre2grep 'w\Xrd'

However, pcre2grep -u worked, both for NFC and NFD forms:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | pcre2grep -u 'w\Xrd'
wôrd

perl -CSA -E 'say "wo\N{COMBINING CIRCUMFLEX ACCENT}rd"' | pcre2grep -u 'w\Xrd'
wôrd

PS: As an honorable mention, ack 'w\X+rd' did manage to back into the right answer(s) also — although it would obviously return a lot of false positives, as well:

perl -CSA -E 'say "w\N{LATIN SMALL LETTER O WITH CIRCUMFLEX}rd"' | ack 'w\X+rd'
wôrd

perl -CSA -E 'say "wo\N{COMBINING CIRCUMFLEX ACCENT}rd"' | ack 'w\X+rd'
wôrd

perl -CSA -E 'say "wayward"' | ack 'w\X+rd'
wayward

Vis-a-vis "how Ack should handle Unicode", I would point to the old axiom: "Good artists imitate; great artists steal!" 😎

pcre2grep's -u | --utf and/or -U | --utf-allow-invalid options seem like excellent candidates/precedent for ~~plagarism~~— er, I mean "inspiration!"

petdance added the unicode label Dec 1, 2021

n1vux mentioned this issue Jun 7, 2022

Add pcre2grep and a category for "does it match across multiple lines?" beyondgrep/website#74

Open
