-
Notifications
You must be signed in to change notification settings - Fork 52
How Does Natsort Work? (2 ‐ Advanced)
This page is a continuation of the first how-it-works page the page describing the basics of natsort. It is assumed you have read that page first.
If what I described in TL;DR 1 were all that natsort needed to do then there probably wouldn't be much need for a third-party module, right? Probably. But it turns out that in real-world data there are a lot of special cases that need to be handled, and in true 80%/20% fashion, the majority of the code in natsort is devoted to handling special cases like those described below.
The first major special case I encountered was sorting filesystem paths
(if you go to the link, you will see I didn't handle it well for a year...
this was before I fully realized how much functionality I could really add
to natsort). Let's apply the natsort_key
from above to some
filesystem paths that you might see being auto-generated from your operating
system:
>>> paths = ['Folder (10)/file.tar.gz',
... 'Folder/file.tar.gz',
... 'Folder (1)/file (1).tar.gz',
... 'Folder (1)/file.tar.gz']
>>> sorted(paths, key=natsort_key)
['Folder (1)/file (1).tar.gz', 'Folder (1)/file.tar.gz', 'Folder (10)/file.tar.gz', 'Folder/file.tar.gz']
Well that's not right! What is 'Folder/file.tar.gz'
doing at the end?
It has to do with the numerical ASCII code assigned to the space and
/
characters in the ASCII table. According to the ASCII table, the
space character (number 32) comes before the /
character (number 47). If
we remove the common prefix in all of the above strings ('Folder'
), we
can see why this happens:
>>> ' (1)/file.tar.gz' < '/file.tar.gz'
True
>>> ' ' < '/'
True
This isn't very convenient... how do we solve it? We can split the path across the path separators and then sort. A convenient way do to this is with the Path.parts property from pathlib:
>>> import pathlib
>>> sorted(paths, key=lambda x: tuple(natsort_key(s) for s in pathlib.Path(x).parts))
['Folder/file.tar.gz', 'Folder (1)/file (1).tar.gz', 'Folder (1)/file.tar.gz', 'Folder (10)/file.tar.gz']
Almost! It seems like there is some funny business going on in the final filename component as well. We can solve that nicely and quickly with Path.suffixes and Path.stem.
>>> def decompose_path_into_components(x):
... path_split = list(pathlib.Path(x).parts)
... # Remove the final filename component from the path.
... final_component = pathlib.Path(path_split.pop())
... # Split off all the extensions.
... suffixes = final_component.suffixes
... stem = final_component.name.replace(''.join(suffixes), '')
... # Remove the '.' prefix of each extension, and make that
... # final component a list of the stem and each suffix.
... final_component = [stem] + [x[1:] for x in suffixes]
... # Replace the split final filename component.
... path_split.extend(final_component)
... return path_split
...
>>> def natsort_key_with_path_support(x):
... return tuple(natsort_key(s) for s in decompose_path_into_components(x))
...
>>> sorted(paths, key=natsort_key_with_path_support)
['Folder/file.tar.gz', 'Folder (1)/file.tar.gz', 'Folder (1)/file (1).tar.gz', 'Folder (10)/file.tar.gz']
This works because in addition to breaking the input by path separators, the final filename component is separated from its extensions as well. Then, each of these separated components is sent to the natsort algorithm, so the result is a tuple of tuples. Once that is done, we can see how comparisons can be done in the expected manner.
>>> a = natsort_key_with_path_support('Folder (1)/file (1).tar.gz')
>>> a
(('Folder (', 1, ')'), ('file (', 1, ')'), ('tar',), ('gz',))
>>>
>>> b = natsort_key_with_path_support('Folder/file.tar.gz')
>>> b
(('Folder',), ('file',), ('tar',), ('gz',))
>>>
>>> a > b
True
NOTE: The actual decompose_path_into_components()
-equivalent function in
natsort actually has a few more heuristics than shown here so that
it is not over-zealous in what it defines as a path suffix, but this has
been omitted in this how-to for clarity.
The second major special case I encountered was sorting of different types.
On Python 2 (i.e. legacy Python), this mostly didnt't matter too
much since it uses an arbitrary heuristic to allow traditionally un-comparable
types to be compared (such as comparing 'a'
to 1
). However, on Python 3
(i.e. Python) it simply won't let you perform such nonsense, raising a
TypeError instead.
You can imagine that a module that breaks strings into tuples of numbers and strings is walking a dangerous line if it does not have special handling for comparing numbers and strings. My imagination was not so great at first. Let's take a look at all the ways this can fail with real-world data.
>>> def natsort_key_with_poor_real_number_support(x):
... split_input = re.split(signed_float, x)
... split_input = filter(None, split_input) # removes null strings
... return tuple(coerce_to_float(s) for s in split_input)
>>>
>>> sorted([5, '4'], key=natsort_key_with_poor_real_number_support)
Traceback (most recent call last):
...
TypeError: ...
>>>
>>> sorted(['12 apples', 'apples'], key=natsort_key_with_poor_real_number_support)
Traceback (most recent call last):
...
TypeError: ...
>>>
>>> sorted(['version5.3.0', 'version5.3rc1'], key=natsort_key_with_poor_real_number_support)
Traceback (most recent call last):
...
TypeError: ...
Let's break these down.
- The integer
5
is sent tore.split
which expects only strings or bytes, which is a no-no. -
natsort_key_with_poor_real_number_support('12 apples') < natsort_key_with_poor_real_number_support('apples')
is the same as(12.0, ' apples') < ('apples',)
, and thus a number gets compared to a string [1] which also is a no-no. - This one scores big on the astonishment scale, especially if one
accidentally uses signed integers or real numbers when they mean
to use unsigned integers.
natsort_key_with_poor_real_number_support('version5.3.0') < natsort_key_with_poor_real_number_support('version5.3rc1')
is the same as('version', 5.3, 0.0) < ('version', 5.3, 'rc', 1.0)
, so in the third element a number gets compared to a string, once again the same old no-no. (The same would happen with'version5-3'
and'version5-a'
, which would become('version', 5, -3)
and('version', 5, '-a')
).
As you might expect, the solution to the first issue is to wrap the
re.split
call in a try: except:
block and handle the number specially
if a TypeError is raised. The second and third cases could be handled
in a "special case" manner, meaning only respond and do something different
if these problems are detected. But a less error-prone method is to ensure
that the data is correct-by-construction, and this can be done by ensuring
that the returned tuples always start with a string, and then alternate
in a string-number-string-number-string pattern; this can be achieved by
adding an empty string wherever the pattern is not followed [2]. This ends
up working out pretty nicely because empty strings are always "less" than
any non-empty string, and we typically want numbers to come before strings.
Let's take a look at how this works out.
>>> from natsort import sep_inserter
>>> list(sep_inserter(iter(['apples']), ''))
['apples']
>>> list(sep_inserter(iter([12, ' apples']), ''))
['', 12, ' apples']
>>> list(sep_inserter(iter(['version', 5, -3]), ''))
['version', 5, '', -3]
>>>
>>> from natsort import natsort_keygen, ns
>>> natsort_key_with_good_real_number_support = natsort_keygen(alg=ns.REAL)
>>>
>>> sorted([5, '4'], key=natsort_key_with_good_real_number_support)
['4', 5]
>>>
>>> sorted(['12 apples', 'apples'], key=natsort_key_with_good_real_number_support)
['12 apples', 'apples']
>>>
>>> sorted(['version5.3.0', 'version5.3rc1'], key=natsort_key_with_good_real_number_support)
['version5.3.0', 'version5.3rc1']
How the "good" version works will be given in TL;DR 2 - Handling Crappy, Real-World Input.
A rather unexpected special case I encountered was sorting collections containing NaN. Let's see what happens when you try to sort a plain old list of numbers when there is a NaN floating around in there.
>>> danger = [7, float('nan'), 22.7, 19, -14, 59.123, 4]
>>> sorted(danger)
[7, nan, -14, 4, 19, 22.7, 59.123]
Clearly that isn't correct, and for once it isn't my fault! It's hard to compare floating point numbers. By definition, NaN is unorderable to any other number, and is never equal to any other number, including itself.
>>> nan = float('nan')
>>> 5 > nan
False
>>> 5 < nan
False
>>> 5 == nan
False
>>> 5 != nan
True
>>> nan == nan
False
>>> nan != nan
True
The implication of all this for us is that if there is an NaN in the
data-set we are trying to sort, the data-set will end up being sorted in
two separate yet individually sorted sequences - the one before the NaN,
and the one after. This is because the <
operation that is used
to sort always returns False with NaN.
Because natsort aims to sort sequences in a way that does not surprise the user, keeping this behavior is not acceptable (I don't require my users to know how NaN will behave in a sorting algorithm). The simplest way to satisfy the "least astonishment" principle is to substitute NaN with some other value. But what value is least astonishing? I chose to replace NaN with negative infinity so that these poorly behaved elements always end up at the front where the users will most likely be alerted to their presence.
>>> def fix_nan(x):
... if x != x: # only true for NaN
... return float('-inf')
... else:
... return x
...
Let's check out TL;DR 2 to see how this can be incorporated into the simple key function from TL;DR 1.
Let's see how our elegant key function from TL;DR 1 has become bastardized in order to support handling mixed real-world data and user customizations.
>>> def natsort_key(x, as_float=False, signed=False, as_path=False):
... if as_float:
... regex = signed_float if signed else unsigned_float
... else:
... regex = signed_int if signed else unsigned_int
... try:
... if as_path:
... x = decompose_path_into_components(x) # Decomposes into list of strings
... # If this raises a TypeError, input is not a string.
... split_input = re.split(regex, x)
... except TypeError:
... try:
... # Does this need to be applied recursively (list-of-list)?
... return tuple(map(natsort_key, x))
... except TypeError:
... # Must be a number
... ret = ('', fix_nan(x)) # Maintain string-number-string pattern
... return (ret,) if as_path else ret # as_path returns tuple-of-tuples
... else:
... split_input = filter(None, split_input) # removes null strings
... # Note that the coerce_to_int/coerce_to_float functions
... # are also modified to use the fix_nan function.
... if as_float:
... coerced_input = (coerce_to_float(s) for s in split_input)
... else:
... coerced_input = (coerce_to_int(s) for s in split_input)
... return tuple(sep_inserter(coerced_input, ''))
...
And this doesn't even show handling bytes type! Notice that we have
to do non-obvious things like modify the return form of numbers when as_path
is given, just to avoid comparing strings and numbers for the case in which a
user provides input like ['/home/me', 42]
.
Let's take it out for a spin!
>>> danger = [7, float('nan'), 22.7, '19', '-14', '59.123', 4]
>>> sorted(danger, key=lambda x: natsort_key(x, as_float=True, signed=True))
[nan, '-14', 4, 7, '19', 22.7, '59.123']
>>>
>>> paths = ['Folder (1)/file.tar.gz',
... 'Folder/file.tar.gz',
... 123456]
>>> sorted(paths, key=lambda x: natsort_key(x, as_path=True))
[123456, 'Folder/file.tar.gz', 'Folder (1)/file.tar.gz']
Probably the most challenging special case I had to handle was getting natsort to handle sorting the non-numerical parts of input correctly, and also allowing it to sort the numerical bits in different locales. This was in no way what I originally set out to do with this library, so I was caught a bit off guard when the request was initially made. I discovered the locale library, and assumed that if it's part of Python's StdLib there can't be too many dragons, right?
INCOMPLETE LIST OF DRAGONS
- https://github.com/SethMMorton/natsort/issues/21
- https://github.com/SethMMorton/natsort/issues/22
- https://github.com/SethMMorton/natsort/issues/23
- https://github.com/SethMMorton/natsort/issues/36
- https://github.com/SethMMorton/natsort/issues/44
- https://bugs.python.org/issue2481
- https://bugs.python.org/issue23195
- https://stackoverflow.com/questions/3412933/python-not-sorting-unicode-properly-strcoll-doesnt-help
- https://stackoverflow.com/questions/22203550/sort-dictionary-by-key-using-locale-collation
- https://stackoverflow.com/questions/33459384/unicode-character-not-in-range-when-calling-locale-strxfrm
- https://stackoverflow.com/questions/36431810/sort-numeric-lines-with-thousand-separators
- https://stackoverflow.com/questions/45734562/how-can-i-get-a-reasonable-string-sorting-with-python
These can be summed up as follows:
- locale is a thin wrapper over your operating system's locale library, so if that is broken (like it is on OSX) then locale is broken in Python.
- Because of a bug in legacy Python (i.e. Python 2), there was no uniform way to use the locale sorting functionality between legacy Python and Python (luckily this is no longer an issue now that Python 2 is EOL).
- People have differing opinions of how capitalization should affect word order.
- There is no built-in way to handle locale-dependent thousands separators and decimal points robustly.
- Proper handling of Unicode is complicated.
- Proper handling of locale is complicated.
Easily over half of the code in natsort is in some way dealing with some aspect of locale or basic case handling. It would have been impossible to get right without a really good testing strategy.
Don't expect any more TL;DR's... if you want to see how all this is fully incorporated into the natsort algorithm then please take a look at the code. However, I will hint at how specific steps are taken in each section.
Let's see how we can handle some of the dragons, one-by-one.
Without even thinking about the mess that is adding locale support, natsort can introduce support for controlling how case is interpreted.
First, let's take a look at how it is sorted by default (due to where characters lie on the ASCII table).
>>> a = ['Apple', 'corn', 'Corn', 'Banana', 'apple', 'banana']
>>> sorted(a)
['Apple', 'Banana', 'Corn', 'apple', 'banana', 'corn']
All uppercase letters come before lowercase letters in the ASCII table,
so all capitalized words appear first. Not everyone agrees that this
is the correct order. Some believe that the capitalized words should
be last (['apple', 'banana', 'corn', 'Apple', 'Banana', 'Corn']
).
Some believe that both the lowercase and uppercase versions
should appear together
(['Apple', 'apple', 'Banana', 'banana', 'Corn', 'corn']
).
Some believe that both should be true ☹. Some people don't care at all [3].
Solving the first case (I call it LOWERCASEFIRST) is actually pretty easy... just call the str.swapcase() method on the input.
>>> sorted(a, key=lambda x: x.swapcase())
['apple', 'banana', 'corn', 'Apple', 'Banana', 'Corn']
The last (i call it IGNORECASE) is pretty easy. Simply call str.casefold() on the input (it's like str.lowercase() but does a better job on non-latin character sets).
>>> sorted(a, key=lambda x: x.casefold())
['Apple', 'apple', 'Banana', 'banana', 'corn', 'Corn']
The middle case (I call it GROUPLETTERS) is less straightforward. The most efficient way to handle this is to duplicate each character with its lowercase version and then the original character.
>>> import itertools
>>> def groupletters(x):
... return ''.join(itertools.chain.from_iterable((y.casefold(), y) for y in x))
...
>>> groupletters('Apple')
'aAppppllee'
>>> groupletters('apple')
'aappppllee'
>>> sorted(a, key=groupletters)
['Apple', 'apple', 'Banana', 'banana', 'Corn', 'corn']
The effect of this is that both 'Apple'
and 'apple'
are
placed adjacent to each other because their transformations both begin
with 'a'
, and then the second character can be used to order them
appropriately with respect to each other.
There's a problem with this, though. Within the context of natsort we are trying to correctly sort numbers and those should be left alone.
>>> a = ['Apple5', 'apple', 'Apple4E10', 'Banana']
>>> sorted(a, key=lambda x: natsort_key(x, as_float=True))
['Apple5', 'Apple4E10', 'Banana', 'apple']
>>> sorted(a, key=lambda x: natsort_key(groupletters(x), as_float=True))
['Apple4E10', 'Apple5', 'apple', 'Banana']
>>> groupletters('Apple4E10')
'aAppppllee44eE1100'
We messed up the numbers! Looks like groupletters()
needs to be applied
after the strings are broken into their components. I'm not going to show
how this is done here, but basically it requires applying the function in
the else:
block of coerce_to_int()
/ coerce_to_float()
.
>>> better_groupletters = natsort_keygen(alg=ns.GROUPLETTERS | ns.REAL)
>>> better_groupletters('Apple4E10')
('aAppppllee', 40000000000.0)
>>> sorted(a, key=better_groupletters)
['Apple5', 'Apple4E10', 'apple', 'Banana']
Of course, applying both LOWERCASEFIRST and GROUPLETTERS is just a matter of turning on both functions.
Unicode is hard and complicated. Here's an example.
>>> b = [b'\x66', b'\x65', b'\xc3\xa9', b'\x65\xcc\x81', b'\x61', b'\x7a']
>>> a = [x.decode('utf8') for x in b]
>>> a # doctest: +SKIP
['f', 'e', 'é', 'é', 'a', 'z']
>>> sorted(a) # doctest: +SKIP
['a', 'e', 'é', 'f', 'z', 'é']
There are more than one way to represent the character 'é' in Unicode.
In fact, many characters have multiple representations. This is a challenge
because comparing the two representations would return False
even though
they look the same.
>>> a[2] == a[3]
False
Alas, since characters are compared based on the numerical value of their representation, sorting Unicode often gives unexpected results (like seeing 'é' come both before and after 'z').
The original approach that natsort took with respect to non-ASCII Unicode characters was to say "just use the locale or PyICU library" and then cross it's fingers and hope those libraries take care of it. As you will find in the following sections, that comes with its own baggage, and turned out to not always work anyway (see https://stackoverflow.com/q/45734562/1399279). A more robust approach is to handle the Unicode out-of-the-box without invoking a heavy-handed library like locale or PyICU. To do this, we must use normalization.
To fully understand Unicode normalization, check out some official Unicode documentation. Just kidding... that's too much text. The following StackOverflow answers do a good job at explaining Unicode normalization in simple terms: https://stackoverflow.com/a/7934397/1399279 and https://stackoverflow.com/a/7931547/1399279. Put simply, normalization ensures that Unicode characters with multiple representations are in some canonical and consistent representation so that (for example) comparisons of the characters can be performed in a sane way. The following discussion assumes you at least read the StackOverflow answers.
Looking back at our 'é' example, we can see that the two versions were
constructed with the byte strings b'\xc3\xa9'
and b'\x65\xcc\x81'
.
The former representation is actually
LATIN SMALL LETTER E WITH ACUTE
and is a single character in the Unicode standard. This is known as the
compressed form and corresponds to the 'NFC' normalization scheme.
The latter representation is actually the letter 'e' followed by
COMBINING ACUTE ACCENT
and so is two characters in the Unicode standard. This is known as the
decompressed form and corresponds to the 'NFD' normalization scheme.
Since the first character in the decompressed form is actually the letter 'e',
when compared to other ASCII characters it fits where you might expect.
Unfortunately, all Unicode compressed form characters come after the
ASCII characters and so they always will be placed after 'z' when sorting.
It seems that most Unicode data is stored and shared in the compressed form which makes it challenging to sort. This can be solved by normalizing all incoming Unicode data to the decompressed form ('NFD') and then sorting.
>>> import unicodedata
>>> c = [unicodedata.normalize('NFD', x) for x in a]
>>> c # doctest: +SKIP
['f', 'e', 'é', 'é', 'a', 'z']
>>> sorted(c) # doctest: +SKIP
['a', 'e', 'é', 'é', 'f', 'z']
Huzzah! Sane sorting without having to resort to :moocale`+_
The locale module is actually pretty cool, and provides lowly spare-time programmers like myself a way to handle the daunting task of proper locale-dependent support of their libraries and utilities. Having said that, it can be a bit of a bear to get right, although they do point out in the documentation that it will be painful to use. Aside from the caveats spelled out in that link, it turns out that just comparing strings with locale in a cross-platform and cross-python-version manner is not as straightforward as one might hope.
First, how to use locale to compare strings? It's actually pretty straightforward. Simply run the input through the locale transformation function locale.strxfrm().
>>> import locale, sys
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> a = ['a', 'b', 'ä']
>>> sorted(a)
['a', 'b', 'ä']
>>> # The below fails on OSX, so don't run doctest on darwin.
>>> is_osx = sys.platform == 'darwin'
>>> sorted(a, key=locale.strxfrm) if not is_osx else ['a', 'ä', 'b']
['a', 'ä', 'b']
>>>
>>> a = ['apple', 'Banana', 'banana', 'Apple']
>>> sorted(a, key=locale.strxfrm) if not is_osx else ['apple', 'Apple', 'banana', 'Banana']
['apple', 'Apple', 'banana', 'Banana']
It turns out that locale-aware sorting groups numbers in the same
way as turning on GROUPLETTERS and LOWERCASEFIRST.
The trick is that you have to apply locale.strxfrm() only to non-numeric
characters; otherwise, numbers won't be parsed properly. Therefore, it must
be applied as part of the coerce_to_int()
/ coerce_to_float()
functions in a manner similar to groupletters()
.
Remember how in the Basic Unicode Support section I mentioned that we use the "decompressed" Unicode normalization form (e.g. NFD) on all inputs to ensure the order is as expected?
If you have been following along so far, you probably expect that it is not that easy. You would be correct.
It turns out that some locales (but not all) expect the input to be in "compressed form" (e.g. NFC) or the ordering is not as you might expect. Check out this issue for a real-world example. Here's a relevant snippet of code
In [1]: import locale, unicodedata
In [2]: a = ['Aš', 'Cheb', 'Česko', 'Cibulov', 'Znojmo', 'Žilina']
In [3]: locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
Out[3]: 'en_US.UTF-8'
In [4]: sorted(a, key=locale.strxfrm)
Out[4]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [5]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
Out[5]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [6]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
Out[6]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [7]: locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
Out[7]: 'de_DE.UTF-8'
In [8]: sorted(a, key=locale.strxfrm)
Out[8]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [9]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
Out[9]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [10]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
Out[10]: ['Aš', 'Česko', 'Cheb', 'Cibulov', 'Žilina', 'Znojmo']
In [11]: locale.setlocale(locale.LC_ALL, 'cs_CZ.UTF-8')
Out[11]: 'cs_CZ.UTF-8'
In [12]: sorted(a, key=locale.strxfrm)
Out[12]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
In [13]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFD", x)))
Out[13]: ['Aš', 'Česko', 'Cibulov', 'Cheb', 'Žilina', 'Znojmo']
In [14]: sorted(a, key=lambda x: locale.strxfrm(unicodedata.normalize("NFC", x)))
Out[14]: ['Aš', 'Cibulov', 'Česko', 'Cheb', 'Znojmo', 'Žilina']
Two out of three locales sort the same data in the same order no matter how the unicode input was normalized, but Czech seems to care how the input is formatted!
So, everthing mentioned in Basic Unicode Support is conditional on whether or not the user wants to use the locale library or not. If not, then "NFD" normalization is used. If they do, "NFC" normalization is used.
But what if the underlying locale implementation that locale relies upon is simply broken? It turns out that the locale library on OSX (and possibly some other BSD systems) is broken (and for some reason has never been fixed?), and so locale does not work as expected.
How do I define doesn't work as expected?
>>> a = ['apple', 'Banana', 'banana', 'Apple']
>>> sorted(a)
['Apple', 'Banana', 'apple', 'banana']
>>>
>>> sorted(a, key=locale.strxfrm) if is_osx else sorted(a)
['Apple', 'Banana', 'apple', 'banana']
IT'S SORTING AS IF locale.strxfrm() WAS NEVER USED!! (and it's worse once non-ASCII characters get thrown into the mix.) I'm really not sure why this is considered OK for the OSX maintainers to not fix, but it's more than frustrating for poor developers who have been dragged into the locale game kicking and screaming. <deep breath>.
So, how to deal with this situation? There are two ways to do so.
-
Detect if locale is sorting incorrectly (i.e.
dumb
) by seeing if'A'
is sorted before'a'
(incorrect) or not.>>> # This is genuinely the name of this function. >>> # See natsort.compat.locale.py >>> def dumb_sort(): ... return locale.strxfrm('A') < locale.strxfrm('a') ...
If a
dumb
locale implementation is found, then automatically turn on LOWERCASEFIRST and GROUPLETTERS. -
Use an alternate library if installed. ICU is a great and powerful library that has a pretty decent Python port called (you guessed it) PyICU. If a user has this library installed on their computer, natsort chooses to use that instead of locale With a little bit of planning, one can write a set of wrapper functions that call the correct library under the hood such that the business logic never has to know what library is being used (see natsort.compat.locale.py).
Let me tell you, this little complication really makes a challenge of testing the code, since one must set up different environments on different operating systems in order to test all possible code paths. Not to mention that certain checks will fail for certain operating systems and environments so one must be diligent in either writing the tests not to fail, or ignoring those tests when on offending environments.
Thousands separator support is a problem that I knew would someday be requested but had decided to push off until a rainy day. One day it finally rained, and I decided to tackle the problem.
So what is the problem? Consider the number 1,234,567
(assuming the
','
is the thousands separator). Try to run that through int()
and you will get a ValueError. To handle this properly the thousands
separators must be removed.
>>> float('1,234,567'.replace(',', ''))
1234567.0
What if, in our current locale, the thousands separator is '.'
and
the ','
is the decimal separator (like for the German locale de_DE)?
>>> float('1.234.567'.replace('.', '').replace(',', '.'))
1234567.0
>>> float('1.234.567,89'.replace('.', '').replace(',', '.'))
1234567.89
This is pretty much what `locale.atoi`_ and `locale.atof`_ do under the hood. So what's the problem? Why doesn't natsort just use this method under its hood? Well, let's take a look at what would happen if we send some possible natsort input through our the above function:
>>> natsort_key('1,234 apples, please.'.replace(',', ''))
('', 1234, ' apples please.')
>>> natsort_key('Sir, €1.234,50 please.'.replace('.', '').replace(',', '.'), as_float=True)
('Sir. €', 1234.5, ' please')
Any character matching the thousands separator was dropped, and anything
matching the decimal separator was changed to '.'
! If these characters
were critical to how your data was ordered, this would break natsort.
The first solution one might consider would be to first decompose the input into sub-components (like we did for the GROUPLETTERS method above) and then only apply these transformations on the number components. This is a chicken-and-egg problem, though, because we cannot appropriately separate out the numbers because of the thousands separators and non-'.' decimal separators (well, at least not without making multiple passes over the data which I do not consider to be a valid option).
Regular expressions to the rescue! With regular expressions, we can remove the thousands separators and change the decimal separator only when they are actually within a number. Once the input has been pre-processed with this regular expression, all the infrastructure shown previously will work.
Beware, these regular expressions will make your eyes bleed.
>>> decimal = ',' # Assume German locale, so decimal separator is ','
>>> # Look-behind assertions cannot accept range modifiers, so instead of i.e.
>>> # (?<!\.[0-9]{1,3}) I have to repeat the look-behind for 1, 2, and 3.
>>> nodecimal = r'(?<!{dec}[0-9])(?<!{dec}[0-9]{{2}})(?<!{dec}[0-9]{{3}})'.format(dec=decimal)
>>> strip_thousands = r'''
... (?<=[0-9]{{1}}) # At least 1 number
... (?<![0-9]{{4}}) # No more than 3 numbers
... {nodecimal} # Cannot follow decimal
... {thou} # The thousands separator
... (?=[0-9]{{3}} # Three numbers must follow
... ([^0-9]|$) # But a non-number after that
... )
... '''.format(nodecimal=nodecimal, thou=re.escape('.')) # Thousands separator is '.' in German locale.
...
>>> re.sub(strip_thousands, '', 'Sir, €1.234,50 please.', flags=re.X)
'Sir, €1234,50 please.'
>>>
>>> # The decimal point must be preceded by a number or after
>>> # a number. This option only needs to be performed in the
>>> # case when the decimal separator for the locale is not '.'.
>>> switch_decimal = r'(?<=[0-9]){decimal}|{decimal}(?=[0-9])'
>>> switch_decimal = switch_decimal.format(decimal=decimal)
>>> re.sub(switch_decimal, '.', 'Sir, €1234,50 please.', flags=re.X)
'Sir, €1234.50 please.'
>>>
>>> natsort_key('Sir, €1234.50 please.', alg=ns.FLOAT)
('Sir, €', 1234.5, ' please.')
My hope is that users of natsort never have to think about or worry about all the bookkeeping or any of the details described above, and that using natsort seems to magically "just work". For those of you who took the time to read this engineering description, I hope it has enlightened you to some of the issues that can be encountered when code is released into the wild and has to accept "real-world data", or to what happens to developers who naïvely make bold assumptions that are counter to what the rest of the world assumes.
Footnotes
[1] | "But if you hadn't removed the leading empty string from re.split this wouldn't have happened!!" I can hear you saying. Well, that's true. I don't have a great reason for having done that except that in an earlier non-optimal incarnation of the algorithm I needed to it, and it kind of stuck, and it made other parts of the code easier if the assumption that there were no empty strings was valid. |
[2] | I'm not going to show how this is implemented in this document,
but if you are interested you can look at the code to
sep_inserter() in util.py. |
[3] | Handling each of these is straightforward, but coupled with the rapidly fracturing execution paths presented in TL;DR 2 one can imagine this will get out of hand quickly. If you take a look at natsort.py and util.py you can observe that to avoid this I take a more functional approach to constructing the natsort algorithm as opposed to the procedural approach illustrated in TL;DR 1 and TL;DR 2. |