collate gives different results than applying compare on sortKey #91

Open
ChickenProp opened this issue Jul 20, 2023 · 0 comments

ghci> import qualified Data.Text.ICU as ICU
ghci> let testCompare c a b = (ICU.collate c a b, compare (ICU.sortKey c a) (ICU.sortKey c b))

According to the docs, testCompare c a b should always return a pair of two equal values (i.e. (EQ, EQ), (LT, LT) or (GT, GT)). But this isn't the case. For example:

ghci> let c = ICU.collator ICU.Root
ghci> testCompare c "" "\EOT"
(EQ,LT)
ghci> testCompare c "" "\ETX"
(EQ,LT)
ghci> testCompare c "" "\NUL"
(EQ,LT)
ghci> testCompare c "" "\2205"
(EQ,LT)
ghci> testCompare c "" "\2250"
(EQ,LT)
ghci> testCompare c "" "\2250\ETX\2205"
(EQ,LT)

As far as I can tell, there are a handful of characters (including all of those above) such that Data.ByteString.unpack $ ICU.sortKey c "(char)" gives [1, 1, 0]. The problem manifests when we compare a string made up of any number of these characters (such a string also has sort key [1, 1, 0]) to the empty string (sort key []). I haven't seen it in any other situation.
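
To illustrate why the sort-key side of the pair comes out as LT, here is a minimal sketch using only base. It assumes the two byte sequences are exactly the ones reported above, and it uses plain lists of Word8 as a stand-in for ByteString (whose Ord instance also compares bytes lexicographically). The names emptyKey, ignorableKey and keyOrdering are made up for this example.

```haskell
import Data.Word (Word8)

-- Sort key reported above for the empty string.
emptyKey :: [Word8]
emptyKey = []

-- Sort key reported above for a string of the affected characters.
ignorableKey :: [Word8]
ignorableKey = [1, 1, 0]

-- Lexicographic comparison: the empty sequence is a proper prefix of
-- any non-empty one, so it sorts first.
keyOrdering :: Ordering
keyOrdering = compare emptyKey ignorableKey
```

So any non-empty sort key compares greater than an empty one, regardless of what collate says about the underlying strings.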

(\2250 is U+08CA "arabic small high farsi yeh" and \2205 is U+089D "arabic superscript alef mokhassas". I found these essentially at random. A few others in the vicinity have the same property, like \2251 but not \2206. I haven't looked to see if there's any pattern here.)
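
For anyone who wants to look for a pattern, a scan along these lines might help. This is only a sketch: disagreements is a hypothetical helper, not part of text-icu, and it is written against plain functions so that it does not depend on any particular collator.

```haskell
-- Hypothetical helper: given a direct comparison and a key function,
-- return the pairs on which the two orderings disagree, together with
-- both results (direct comparison first, key comparison second).
disagreements
  :: Ord k
  => (a -> a -> Ordering)  -- direct comparison, e.g. ICU.collate c
  -> (a -> k)              -- key function, e.g. ICU.sortKey c
  -> [(a, a)]              -- candidate pairs to check
  -> [(a, a, Ordering, Ordering)]
disagreements cmp key pairs =
  [ (x, y, direct, viaKey)
  | (x, y) <- pairs
  , let direct = cmp x y
  , let viaKey = compare (key x) (key y)
  , direct /= viaKey
  ]
```

In ghci, with c as above and Data.Text imported qualified as T (an assumption, it isn't in the session above), something like disagreements (ICU.collate c) (ICU.sortKey c) [(T.empty, T.singleton ch) | ch <- ['\0' .. '\3000']] should list the affected characters in that range.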

I tried a few other collators. collatorWith _ [Strength Secondary] makes the sort key of the non-empty strings [1, 0] instead of [1, 1, 0], but testCompare gives the same results. Changing the base to Locale "en" or adding Numeric True doesn't obviously make a difference.

This is with text-icu-0.8.0.2. I can't rule out that this is a bug in ICU itself. I'm not familiar enough with C to be able to test that easily, though I expect I could figure it out. I'm using a version provided by nix. Based on the output of lsof, it seems to be version 72.1: my running GHC has these files open:

/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicudata.so.72.1
/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicui18n.so.72.1
/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicuuc.so.72.1