Unicode troubles? #349
Comments
I'm sorry. ack does not handle Unicode very well at all. I'm tagging this issue as Unicode and maybe some day we can somehow address it.
While Perl and most other modern programming languages allow subroutine and variable names to be Unicode, and thus in the natural language of the coder, the usage one actually sees is nearly uniformly Latin-1 or ASCII, matching the alphabet and often the language of the keywords. Ack is unapologetically defined as a coder's search tool for searching collections of code files, even though some folks (including myself, one of Andy Petdance's associate devs) use it off-label to search data, both structured and unstructured. (I have a large collection of OCR text, indexed with […]) If one has an "older" perl (5.28 or earlier), there is a workaround to trick Ack into using Perl's native Unicode support that sort of mostly works.
The other issue with Greek letters in particular is that they appear at multiple Unicode codepoints with different semantics: there's Math Greek, with Bold etc. variants; there's Greek Greek (upper and lower); there are Cyrillic and Armenian letters that look the same as Greek letters; and maybe more. https://codepoints.net/U+03C0 lists 20 related characters for π, and more than three times as many "confusables". (And there's no guarantee (unless you have wonderful provenance!) that a document/file uses the Nu […]
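To make the "multiple codepoints" point concrete, here is a minimal Python sketch (not ack or Perl code) that prints the Unicode names of a few π-like characters and shows which ones compatibility normalization (NFKC) folds back to plain π. The sample list and the helper names `describe` and `compat_fold` are my own illustrative choices, not anything from ack.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def compat_fold(ch: str) -> str:
    """Fold compatibility variants (math bold, symbol forms) via NFKC."""
    return unicodedata.normalize("NFKC", ch)


# Hypothetical sample: a few characters a reader might all call "pi".
PI_LIKE = [
    "\u03c0",      # the ordinary Greek letter
    "\u03d6",      # the "pi symbol" variant
    "\U0001d6d1",  # a mathematical bold pi
    "\u043f",      # Cyrillic pe, a visual lookalike only
]

for ch in PI_LIKE:
    folded = compat_fold(ch)
    status = "folds to U+03C0" if folded == "\u03c0" else "stays distinct"
    print(f"{describe(ch)}: {status}")
```

NFKC only helps with the compatibility variants; lookalikes from other scripts stay distinct, so a search for Greek π would still miss them.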
Except for when your OCR engine takes a whitelist of allowed codepoints :-)
For what it's worth, even though the example was also of an OCR file, I do have collections of code files with non-ASCII characters in them, both in the comments and in the code itself. Such is the nature of working with natural language. I'm aware that digital humanities is somewhat of a niche phenomenon, but we're still coders nonetheless. But hey, this is your tool, I just happen to love using it, and have (yesterday for the first time, by the way) stumbled upon something unexpected.
That could count as "wonderful provenance" 😄 .
Indeed. Comments outside the Latin-1 alphabet will be more common than identifiers, whether from digital humanities or from ^regular^ developers just writing their comments in the language they're most expressive in when writing for themselves and not for faraway customers. If you look at the linked Unicode tickets, you'll see that one of the requirements for doing Unicode right will be multiplying test cases and test data. Digital Humanities / Modern Languages talent might be useful when (I say when, not if, hopefully) we get to it. Since the command-line hack mostly worked (up to 5.28), I expect redoing the testing N times is most of the work, but there's some architectural choice on how to handle mix-and-match files.
While I know this is not technically a support forum, I will try to directly reply to the filer's question with a simple to understand explanation and an easy practical workaround solution, since I've been in the exact same boat before and for a while. I leave it to others to handle this thread in terms of being a bug report for Ack.
FWIW, there is a support forum - Minor correction: Ack uses the Perl RE engine, in which
You are correct that (On a current Ubuntu,
Additional note, the
Additional aside re Perl RE engine and Unicode: One cannot expect the RE […] (Ref: Wikipedia, Greek diacritics#Unicode.)
The simple RE will usually be enough to find lines containing at least one letter of Greek, but if expanded to […] (If the file is all Greek, one can just trust […])
To handle these subtleties in, e.g., a Perl program, I would normalize the input to […]
Input on how Ack should handle Unicode is welcome. Should ack assume all files are NFD or NFC? I doubt it. Or trust that the input files selected are already in whichever of NFC or NFD makes sense for the given RE? Maybe. I do not expect Ack to ever guess correctly based on detecting sequences in the RE pattern and input files. Routinely converting all inputs to NFD (or NFC), whether needed or not, is a non-starter: that makes it slower for all users to benefit a few. A […]
I'm guessing we'll only ever support UTF-8. I've experimented a bit with 16- and 32-bit UTF BOMs, and while it's sometimes possible to detect a file format if it properly starts with a BOM, BOMs are hardly universally provided; and while I even provided a workaround to allow a collection of UCS-2/UTF-16 files to be searched, it isn't always practical.
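The NFC/NFD question can be illustrated with a small Python sketch (again, not ack or Perl code). It assumes a literal search term and a line of text that happen to be in different normalization forms; the helper `contains_literal` and the sample word are hypothetical. It shows that the match fails until both sides are brought to the same form, whichever form that is.

```python
import re
import unicodedata


def contains_literal(needle: str, haystack: str, form=None) -> bool:
    """Search for a literal string, optionally normalizing both sides first.

    `form` is a Unicode normalization form name such as "NFC" or "NFD".
    Only literals are handled here; normalizing a full regex (character
    classes, quantifiers after combining marks) is a harder problem.
    """
    if form is not None:
        needle = unicodedata.normalize(form, needle)
        haystack = unicodedata.normalize(form, haystack)
    return re.search(re.escape(needle), haystack) is not None


# Precomposed alpha with tonos (one codepoint) ...
composed = "\u03ac"
# ... and its canonical decomposition: alpha followed by a combining acute.
decomposed = unicodedata.normalize("NFD", composed)

# A line of text stored in decomposed form, as some tools produce.
line = "λ" + decomposed + "θος"

print(contains_literal(composed, line))         # forms differ
print(contains_literal(composed, line, "NFC"))  # both composed
print(contains_literal(composed, line, "NFD"))  # both decomposed
```

The design point is that either form works as long as pattern and input agree; the cost is an extra normalization pass over every input line, which is the slowdown mentioned above.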
Sorry to bump a six-month-old thread, but I arrived here because I was astonished to discover that this didn't DWIM:
To be fair, neither did:
However,
PS: As an honorable mention,
Vis-a-vis "how Ack should handle Unicode", I would point to the old axiom: "Good artists imitate; great artists steal!" 😎
I can't figure out what the output has to do with the regular expression here. What gets highlighted is every letter with a diacritic; my intention was to search for all Greek letters (I also tried \p{Greek}, which had no results at all). What am I missing here?
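One plausible explanation for output like this, sketched in Python rather than Perl, is that pattern and file are being compared as raw UTF-8 bytes instead of decoded characters. That is an assumption on my part, not something confirmed here, and the pattern `[α-ω]` is a stand-in because the original command is not shown. Under that assumption, the character class turns into a class of bytes that also covers the lead byte of Latin letters with diacritics.

```python
import re

# Stand-in pattern: a range meant to cover lowercase Greek letters.
GREEK_LOWER = "[α-ω]"


def byte_hits(pattern: str, text: str) -> list:
    """Match after encoding both sides to UTF-8, i.e. byte by byte."""
    return re.findall(pattern.encode("utf-8"), text.encode("utf-8"))


def char_hits(pattern: str, text: str) -> list:
    """Match on decoded text, i.e. codepoint by codepoint."""
    return re.findall(pattern, text)


# As bytes the class is [\xCE \xB1-\xCF \x89]; the lead byte of "é"
# (0xC3) falls inside the \xB1-\xCF range, so a Latin word "matches".
print(byte_hits(GREEK_LOWER, "café"))

# On decoded text the same class behaves as intended.
print(char_hits(GREEK_LOWER, "café"))
print(char_hits(GREEK_LOWER, "λόγος"))
```

Note that even on decoded text the accented ό is outside the α-ω range, which is one reason a script property such as \p{Greek} is the more robust way to ask for "any Greek letter" once the engine is actually working on characters.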