ghci> import qualified Data.Text.ICU as ICU
ghci> let testCompare c a b = (ICU.collate c a b, compare (ICU.sortKey c a) (ICU.sortKey c b))
According to the docs, testCompare c a b should always return a pair of two equal values, i.e. (EQ, EQ), (LT, LT) or (GT, GT). But this isn't the case. For example:
ghci> let c = ICU.collator ICU.Root
ghci> testCompare c "" "\EOT"
(EQ,LT)
ghci> testCompare c "" "\ETX"
(EQ,LT)
ghci> testCompare c "" "\NUL"
(EQ,LT)
ghci> testCompare c "" "\2205"
(EQ,LT)
ghci> testCompare c "" "\2250"
(EQ,LT)
ghci> testCompare c "" "\2250\ETX\2205"
(EQ,LT)
As far as I can tell, there are a handful of characters (including all of those above) such that Data.ByteString.unpack $ ICU.sortKey c "(char)" gives [1, 1, 0]. The problem manifests when we compare a string of any number of these characters (such a string also has sort key [1, 1, 0]) to the empty string (sort key []). I haven't seen this in any other situation.
(\2250 is U+08CA "arabic small high farsi yeh" and \2205 is U+089D "arabic superscript alef mokhassas". I found these essentially at random. A few others in the vicinity have the same property, like \2251 but not \2206. I haven't looked to see if there's any pattern here.)
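To look for a pattern, something like the following could scan a block of code points and list those whose single-character string disagrees with the empty string. This is only a sketch: the helper name disagrees and the range U+0880 to U+08FF are my own choices for illustration (the range is just the neighbourhood of the characters above), and it assumes the same text-icu functions used in the ghci session.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU
import Numeric (showHex)

-- Hypothetical helper: the same check as testCompare, but returning True
-- when collate and the sort-key comparison give different orderings.
disagrees :: ICU.Collator -> T.Text -> T.Text -> Bool
disagrees c a b =
  ICU.collate c a b /= compare (ICU.sortKey c a) (ICU.sortKey c b)

main :: IO ()
main = do
  let c = ICU.collator ICU.Root
      -- Assumed range: chosen arbitrarily around \2205 and \2250.
      candidates = ['\x0880' .. '\x08ff']
      hits = [ch | ch <- candidates, disagrees c T.empty (T.singleton ch)]
      -- Print the code point in hex followed by the bytes of its sort key.
      report ch =
        putStrLn $
          "U+" ++ showHex (fromEnum ch) ""
            ++ " "
            ++ show (BS.unpack (ICU.sortKey c (T.singleton ch)))
  mapM_ report hits
```

Widening candidates to a larger range should show whether the affected characters cluster in particular blocks or share a property such as being ignorable at the chosen strength.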
I tried a few other collators. collatorWith _ [Strength Secondary] makes the sort key of the non-empty strings [1, 0] instead of [1, 1, 0], but testCompare gives the same results. Changing the base to Locale "en" or adding Numeric True doesn't obviously make a difference.
This is with text-icu-0.8.0.2. I can't rule out that this is a bug in ICU itself. I'm not familiar enough with C to test that easily, though I expect I could figure it out. I'm using a version provided by nix. Based on the output of lsof, it seems to be version 72.1: my running GHC has these files open: