Character based sorting #174
Comments
You can already enable this functionality by creating a simple key out of regular expressions.
In [1]: import re
In [2]: data = ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']
In [3]: import natsort
In [4]: natsort.natsorted(data)
Out[4]:
['a0-item-1',
'a1-item-2',
'a-category-1',
'b0-item-1',
'b1-item-2',
'b-category-2']
In [5]: character_splitter = re.compile(r"-")
In [6]: natsort.natsorted(data, key=character_splitter.split)
Out[6]:
['a-category-1',
'a0-item-1',
'a1-item-2',
'b-category-2',
'b0-item-1',
'b1-item-2']
You can customize this regex to be as flexible as you need: |
I think a character approach would be more flexible, as you can still replicate the old existing behaviour. |
I'm not sure I follow - can you please demonstrate how it is more flexible? Also, your suggestion would require a complete re-write of |
Or, are the class-based code snippets just examples of behavior and not actual examples of how to implement? If so, it would not require a re-write, but would likely be backwards incompatible if not made a mode that can be toggled. I would still like to hear an explanation or demonstration of what more flexible means. In owning this library for 10+ years, I have found that one person's "obvious" way to sort a collection of strings is another person's "incorrect" way to sort a collection of strings. I have also found that there is no universal heuristic (hence the huge number of specifiers for the |
Sure. We can still sort the same as before, but now we can also control the sorting on a character level (see the updated comment above for the helper functions):
import re
from re import Pattern

# try_int, BaseNatChar and NatChar are the helpers from the updated comment above.

def natsorted1(lst: list[str]) -> list[str]:
    # Existing behaviour: split only around runs of digits.
    regex: Pattern = re.compile(r'(\d+)')
    def key_func(string) -> tuple[int | str, ...]:
        return tuple(map(try_int, regex.split(string)))
    return sorted(lst, key=key_func)

def natsorted2(lst: list[str]) -> list[str]:
    # Character-based: one token per non-digit character, one per run of digits.
    regex: Pattern = re.compile(r'\D|\d+')
    def key_func(string) -> tuple[BaseNatChar, ...]:
        return tuple(map(BaseNatChar, map(try_int, regex.findall(string))))
    return sorted(lst, key=key_func)

def natsorted3(lst: list[str]) -> list[str]:
    # Same tokenization as natsorted2, wrapped in NatChar instead of BaseNatChar.
    regex: Pattern = re.compile(r'\D|\d+')
    def key_func(string) -> tuple[NatChar, ...]:
        return tuple(map(NatChar, map(try_int, regex.findall(string))))
    return sorted(lst, key=key_func)

print(natsorted1(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))
print(natsorted2(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))
print(natsorted3(['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']))
Output:
That seems to be a problem. Maybe I'm doing something wrong, but there's a lot of overhead:
::test.bat
@echo off
echo 3 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 1)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 1)"
echo 30 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 10)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 10)"
echo 300 items && python -m timeit -s "import test" "test.natsorted1(['foo', 'bar', 'baz'] * 100)" && python -m timeit -s "import test" "test.natsorted2(['foo', 'bar', 'baz'] * 100)"
|
I have tried to update the example to closer match the actual implementation. The main difference is that we split on every character, but group numbers. However, then we need to use a wrapper to make them comparable (which is what
Well, the main reason is very simple: matching the behaviour of macOS sorting on other platforms. This implementation gets really close. |
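To make the idea above concrete (split on every character, group digit runs, then wrap the tokens so they can be compared), here is a minimal stdlib-only sketch. The `Token` class and `char_key` function are hypothetical names made up for this illustration; they are not the `BaseNatChar`/`NatChar` helpers from the earlier comment and not part of natsort. The ordering rule, where a number sorts at the position the character "0" would occupy, is an assumption chosen for the demo and is not a claim about how macOS collates.

```python
import re
from functools import total_ordering

# One token per non-digit character, or one token per whole run of digits.
TOKEN_RE = re.compile(r"\D|\d+")


@total_ordering
class Token:
    """Hypothetical wrapper that makes int and single-character tokens comparable."""

    def __init__(self, text: str) -> None:
        # Digit runs become ints so that 9 sorts before 10.
        self.value = int(text) if text.isdecimal() else text

    def _sort_key(self) -> tuple[int, int]:
        if isinstance(self.value, int):
            # Assumption: numbers sit where the character "0" would, ordered by value.
            return (ord("0"), self.value)
        return (ord(self.value), 0)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Token) and self._sort_key() == other._sort_key()

    def __lt__(self, other: "Token") -> bool:
        return self._sort_key() < other._sort_key()


def char_key(s: str) -> tuple[Token, ...]:
    """Build a sort key from character-level tokens."""
    return tuple(Token(t) for t in TOKEN_RE.findall(s))


data = ["a0-item-1", "b-category-2", "a1-item-2", "a-category-1", "b1-item-2", "b0-item-1"]
print(sorted(data, key=char_key))
```

With this rule, "-" compares below any number, so 'a-category-1' lands before 'a0-item-1'. Creating one wrapper object per character is also where overhead like that seen in the timing comparison above could plausibly come from.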
Sorry, I think you misunderstood my question. I had proposed a solution using existing behavior (the
I was looking to understand why the solution that uses existing functionality is insufficient and why |
OK, the solution proposed wouldn't work for me as the regex needs to be 136110 characters long to include a blacklist of all unicode letters. |
Would not
In [1]: import re, natsort
In [2]: data = ['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']
In [3]: character_splitter = re.compile(r"(\D|\W)")
In [4]: natsort.natsorted(data, key=character_splitter.split)
Out[4]:
['a-category-1',
'a0-item-1',
'a1-item-2',
'b-category-2',
'b0-item-1',
'b1-item-2'] |
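For anyone following along, this small stdlib-only sketch shows what that pre-split key hands over for a single string. It also checks one detail I am confident of in Python's `re`: every character matched by `\W` is also matched by `\D`, so on this input the pattern behaves the same as a plain `(\D)`. The variable names are only for illustration.

```python
import re

splitter = re.compile(r"(\D|\W)")
plain_non_digit = re.compile(r"(\D)")

# The capture group keeps each separator character in the result.
tokens = splitter.split("a0-item-1")
print(tokens)

# \W is a subset of \D, so dropping it does not change the split on this input.
print(plain_non_digit.split("a0-item-1") == tokens)
```

The empty strings between adjacent non-digit characters are a normal side effect of `re.split` with a capture group.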
Let's start over. It sounds to me like there is a set of requirements you have that were not present in the original problem statement:
As a result, I cannot clearly see the problems that you are having with the solutions I am providing you. The solution I provided (pre-split on non-alphanumeric) is functionally identical to your solution (post-split on non-alphanumeric), except that I used the regex
I'm afraid I cannot help any further until I get a complete and unambiguous specification of the behavior you are looking for. "Sorts like macOS", unfortunately, is too ambiguous because to my knowledge Apple's collation is not public. |
Sadly it's not, |
This is only true if the
So, the definitions of |
Thanks for correcting me. It's a bit confusing as it doesn't match other regex parsers. |
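For reference, and assuming the point of confusion is how Python's `re` character classes treat non-ASCII text: for `str` patterns, `\w` and `\d` are Unicode-aware by default and only fall back to ASCII-only matching when the `re.ASCII` flag is set. A minimal sketch, with sample characters picked purely for illustration:

```python
import re

e_acute = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE

# Default (Unicode) behaviour for str patterns.
word_default = re.fullmatch(r"\w", e_acute) is not None
digit_default = re.fullmatch(r"\d", arabic_three) is not None

# ASCII-only behaviour when re.ASCII is passed.
word_ascii = re.fullmatch(r"\w", e_acute, re.ASCII) is not None
digit_ascii = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

print(word_default, digit_default)  # True True
print(word_ascii, digit_ascii)      # False False
```

This is one place where Python differs from regex engines in which `\w` and `\d` are ASCII-only unless a Unicode mode is requested.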
Describe the feature or enhancement
Currently natsort doesn't sort strings with optional numbers intuitively because we only split around numbers. It would be better to use a character-based approach (like macOS):
Provide a concrete example of how the feature or enhancement will improve natsort
New behaviour:
['a-category-1', 'a0-item-1', 'a1-item-2', 'b-category-2', 'b0-item-1', 'b1-item-2']
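To illustrate why splitting only around numbers produces the current order rather than the one above, here is a minimal stdlib-only sketch. It approximates the number-splitting idea with a hypothetical `number_split_key` function and is not natsort's actual implementation.

```python
import re

NUMBER_RE = re.compile(r"(\d+)")


def number_split_key(s: str) -> tuple:
    """Split only around digit runs; digit runs become ints, everything else stays text."""
    return tuple(int(p) if p.isdecimal() else p for p in NUMBER_RE.split(s))


data = ["a-category-1", "a0-item-1", "a1-item-2", "b-category-2", "b0-item-1", "b1-item-2"]

# 'a-category-1' starts with the text chunk 'a-category-', while 'a0-item-1'
# starts with the shorter chunk 'a', which compares lower.
print(number_split_key("a-category-1"))
print(number_split_key("a0-item-1"))

current_order = sorted(data, key=number_split_key)
print(current_order)
```

Because the first text chunk of 'a0-item-1' is just 'a', it sorts ahead of 'a-category-1'. A character-based key compares '-' against the number directly instead.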
Would you be willing to submit a Pull Request for this feature?
Yes, but I'm not sure how to integrate this in the project.