classification
Title: sort by partially reversed key tuple
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: cykerway, rhettinger, serhiy.storchaka, steven.daprano, tim.peters, xtreak
Priority: normal Keywords: patch

Created on 2018-10-17 10:10 by cykerway, last changed 2018-10-18 13:36 by cykerway. This issue is now closed.

Files
File name Uploaded Description Edit
example.py cykerway, 2018-10-17 10:10 An example worth 1k words.
bpo35010_1.py xtreak, 2018-10-17 13:20
performance.py cykerway, 2018-10-17 13:50
example-1.py cykerway, 2018-10-17 15:46 Fix example.py
performance-1.py cykerway, 2018-10-17 22:33
performance-2.py cykerway, 2018-10-18 13:36 Performance test of 3 different sort algorithms.
Pull Requests
URL Status Linked Edit
PR 9931 closed xtreak, 2018-10-17 15:34
Messages (15)
msg327886 - (view) Author: Cyker Way (cykerway) * Date: 2018-10-17 10:10
The current `sorted` function is somewhat limited and doesn't cover a use case that frequently occurs in real applications: sort by a tuple of keys where each key can be in asc or desc order.

For example, you may have a list of site configs where each of specify which user on which site gets how much quota. This data may be loaded from a SQL table where there are 3 columns: url, user, quota. Often you may want to sort by url, but when you want to check which users have been allocated most quota, you probably want to sort by quota. Even more likely, when you are sorting by quota, you still want to sort by url for those having the same quota.

Unfortunately, current `sorted` function doesn't allow you to sort by desc quota and asc url at the same time, because the `reverse` parameter acts on the final result, regardless of columns. For numeric columns, there is a trick of using minus sign on the data. But this approach usually doesn't apply to other types of data.

The general solution is to enhance the key function, allowing users to specify each column to be considered, with an asc/desc flag:

    keyspec = [
        (itemgetter('url'), False),
        (itemgetter('user'), True),
    ]

An example is worth 1k words. The full example is provided in the attachment.

It's not a lot of code to write but making this feature builtin can save a lot of work since sorting a SQL table or other sheet data is quite common in real applications.
msg327888 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-10-17 10:34
Since sort is guaranteed to be stable, can't you sort in two runs?


py> values = ['bc', 'da', 'ba', 'abx', 'ac', 'ce', 'dc', 'ca', 'aby']
py> values.sort(key=itemgetter(1), reverse=True)
py> values.sort(key=itemgetter(0))
py> values
['ac', 'abx', 'aby', 'bc', 'ba', 'ce', 'ca', 'dc', 'da']


Its not as efficient as doing a single sort, but its easier to understand than a complex API to specify each item's sort direction individually, and therefore probably less likely to mess it up and get it wrong.
msg327889 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-10-17 10:36
There was some discussion about it : https://lists.gt.net/python/python/539896#539896 . As suggested by Raymond in the thread the below can be used to get the desired output

items.sort(key=lambda r: r['user'], reverse=True)
items.sort(key=lambda r: r['url'])
msg327890 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-17 10:40
Would be nice to add an example in the documentation.
msg327896 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-10-17 13:20
@Serhiy I just checked the docs to add an example and it's explained with an example at [0] :)


```

Sorts are guaranteed to be stable. That means that when multiple records have the same key, their original order is preserved.

This wonderful property lets you build complex sorts in a series of sorting steps. For example, to sort the student data by descending grade and then ascending age, do the age sort first and then sort again using grade:

>>> s = sorted(student_objects, key=attrgetter('age'))     # sort on secondary key
>>> sorted(s, key=attrgetter('grade'), reverse=True)       # now sort on primary key, descending
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

The Timsort algorithm used in Python does multiple sorts efficiently because it can take advantage of any ordering already present in a dataset.

```

As noted TimSort makes this efficient by taking advantage of the sorting in the first run though it was not done as a single pass. If I am bench marking correctly in the attached file then for a list with 400 items using multiple pass sort is 4-5x faster than the suggested sorting in Python 3.7 and also more readable (though my lambda call is messy :)

Multiple pass sort :
3.871509724
Suggested sort :
17.237952677


[0] https://docs.python.org/3/howto/sorting.html#sort-stability-and-complex-sorts
msg327897 - (view) Author: Cyker Way (cykerway) * Date: 2018-10-17 13:50
Multi-pass stable sorts should produce the correct result. But as the number of columns grow the code gets messy. For brevity this example only has 2 columns but it may be 10 or more in a real application. Furthermore, in some cases the application may need to remember which columns are sorted asc/desc so you have to keep that piece of data (termed `keyspec` but can be any meaningful name) anyway. It would be ideal if a builtin function can directly operate on this data instead of manually writing a multi-pass sorting logic;

The proposed prototype (`c.py`) is aimed at minimizing the amount of code needed to illustrate the problem and is not optimized for performance. There is a performance hit where it wraps one object into another, but this should be able to be avoided. If you want to compare real performance between single- and multi-pass sorts, you can run this script `performance.py` instead. My test result is that multi-pass sorting takes 75% more time, YMMV.
msg327901 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-10-17 14:59
The ``sorted`` docs links to the Sorting HOWTO, the ``list.sort`` docs should do the same.

https://docs.python.org/3/library/functions.html#sorted

https://docs.python.org/3/library/stdtypes.html#list.sort
msg327907 - (view) Author: Cyker Way (cykerway) * Date: 2018-10-17 15:46
Previous upload `example.py` was missing `__eq__`. Updated in `example-1.py`.
msg327914 - (view) Author: Cyker Way (cykerway) * Date: 2018-10-17 22:33
As for performance, I think both single-pass and multi-pass sorts have worst-case time complexity `m * n * log(n)`, assuming the number of items is `n` and each item has dimension `m`. Whichever is faster seems to be data-dependent. So I made a more comprehensive performance evaluation in `performance-1.py`. It turns out either single-pass or multi-pass can be 80-100% faster than the other, on different inputs.

Since single-pass and multi-pass sorts are good at different inputs and there is currently no statistics on real application data supporting either, both look OK to me. But I hope you can add something in the interface of sorting functions (mainly `sorted` and `list.sort`) so that users don't have to write that multi-pass sort again and again. If the `keyspec` format is deemed too complicated, `keys` and `reverses` also look good to me:

    sorted(items, keys=(key0, key1, key2), reverses=(True, False, True))

And you are free to use whatever sorting algorithms in its implementation for this kind of task.
msg327917 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-10-17 23:07
> And you are free to use whatever sorting algorithms in its implementation for this kind of task.

That's very kind of you *wink*

At this point, I don't think there's much more point to discussing this further until Tim Peters weighs in and lets us know what he thinks. If he loves the idea, and is able to implement it, it may happen; if he is luke-warm or hates it, probably not.

(My own *personal* view is that unless there is an obvious and consistent performance win from adding this, adding the extra complexity to both the sort implementation and the interface isn't worth the effort. Not every simple helper function needs to be a builtin.)
msg327923 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2018-10-18 01:01
Cyker, thank you for the suggestion, but we're going to decline.   The sorting HOWTO docs show how to exploit sort stability with multiple passes to handle a mix of ascending and descending steps.  It would complicate the API to have an array of key functions, each with their own ascending and descending flag. Likewise, it would also complicate the implementation.  

The added complexity just isn't worth it when we already have a reasonable solution and when the use case itself isn't very common.
msg327925 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-10-18 01:17
This comes up every few years, but that's about it.  Here's the iteration from 2 years ago:

https://mail.python.org/pipermail/python-ideas/2016-October/043039.html

Follow the thread.  It contains easy-to-use wrappers for both "do it in multiple simple passes" and "do it in one messy pass" approaches.  It's impossible for an implementation to guess in advance which will be faster - it depends on the data, which only the user can know about in advance.  There's nothing the implementation could do to improve O() behavior regardless.

If there were "real" demand for this, someone by now would have packaged those wrappers and made them available on PyPI.  Since that seems not to have happened, I agree with Raymond rejecting this idea for the core at this time.  There would be a higher-than-usual bar for that anyway, because the sorting code is already highly complex, and building in the wrappers would be a tedious, long-winded exercise in recoding in C what's _easily_ coded in Python already.
msg327926 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-10-18 01:50
Sorry to comment on the closed issue. I think the multisort recipe [0] is pretty neat can be added to the sorting howto page instead of being lost in the mailing list thread. It was suggested for addition later in the thread but never got added.

I guess I will open up a separate issue for the open PR and possibly add the recipe with the PR if it's okay. Thoughts?

Thanks much for the explanation Tim :)

[0] https://mail.python.org/pipermail/python-ideas/2016-October/043045.html
msg327933 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-10-18 04:40
Would be worth to add a wrapper in functools which revert the sorting order?

class reverted_order:
    def __init__(self, value):
        self.value = value
    def __lt__(self, other):
        if isinstance(other, reverted_order):
            other = other.value
        return self.value.__ge__(other)
    # __le__, __gt__, __ge__, __eq__, __ne__, __hash__

Then you could use key=lambda x: (x['url'], reverted_order(x['user'])).
msg327980 - (view) Author: Cyker Way (cykerway) * Date: 2018-10-18 13:36
Thank you very much for the great discussion here, especially Tim's great threads in *python-ideas* that give neat and insightful answers to this problem in different ways:

-   <https://mail.python.org/pipermail/python-ideas/2016-October/043045.html>

-   <https://mail.python.org/pipermail/python-ideas/2016-October/043126.html>

Since this topic is closed, future discussions probably should go to other python forums. But it might be good to draw some conclusions here for future reference.

First of all, either single-pass sort with a vector key or multi-pass sort with a scalar key may work better, depending on the input. However, in most cases, using multi-pass sort for such problem is the right way to go in the current python implementation. The multi-pass sort algorithm typically runs 2x faster or so than a single-pass sort algorithm. This is likely due to constants rather than asymptotic complexity. But when measured in real time, multi-pass sort algorithm clearly wins in most cases.

If your input is so special that it aligns much better with single-pass sort algorithms (overwhelming the constant advantage of multi-pass sort algorithm), you may use a single-pass sort algorithm. But there are actually different ways of implementing so. The algorithm posted in Tim's second thread on python-ideas is in fact different from mine in this bug thread, where Tim used a wrapper class for the keys and I used a wrapper class for the scalars. Since there are `n` keys but can be as many as `n * m` scalars, my method would be using more wrapper objects. So I expected it to run slower than Tim's. To my surprise, sometimes it works better. The reason is later found to be easy to understand: Wrapper objects are created only when a column needs to be reversed. The number of columns to be reversed is actually a factor controlling the running time. To give a fair evaluation, I created 500000 random rows with 8 columns and control the number of columns to be reversed. The evaluation shows my algorithm could have running time 80%-120% that of Tim's (excluding the special case where no column needs to be reversed). This shows a new direction of optimizing such algorithms.

Finally, this issue was rejected because the added benefits were deemed not enough to complicate the implementation. Considering the current multi-pass sort algorithm has a huge advantage and is easy to implement in python, this decision is fair. Users who care less about performance may write a key adapter in their own code if they want to stick with builtin sort functions. Users who do care about performance can use single-pass sort techniques mentioned in this issue in case multi-pass sort doesn't work well with their data.
History
Date User Action Args
2018-10-18 13:36:44cykerwaysetfiles: + performance-2.py

messages: + msg327980
2018-10-18 04:40:37serhiy.storchakasetmessages: + msg327933
2018-10-18 01:50:36xtreaksetmessages: + msg327926
2018-10-18 01:17:32tim.peterssetmessages: + msg327925
2018-10-18 01:02:03rhettingersetstatus: open -> closed
stage: patch review -> resolved
2018-10-18 01:01:49rhettingersetnosy: + rhettinger
messages: + msg327923

assignee: rhettinger
resolution: rejected
2018-10-17 23:07:48steven.dapranosetmessages: + msg327917
2018-10-17 22:34:00cykerwaysetfiles: + performance-1.py

messages: + msg327914
2018-10-17 15:46:57cykerwaysetfiles: + example-1.py

messages: + msg327907
2018-10-17 15:34:58xtreaksetkeywords: + patch
stage: patch review
pull_requests: + pull_request9282
2018-10-17 14:59:04steven.dapranosetmessages: + msg327901
2018-10-17 13:50:23cykerwaysetfiles: + performance.py

messages: + msg327897
2018-10-17 13:20:55xtreaksetfiles: + bpo35010_1.py

messages: + msg327896
2018-10-17 10:40:11serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg327890
2018-10-17 10:36:52xtreaksetnosy: + xtreak
messages: + msg327889
2018-10-17 10:34:16steven.dapranosetnosy: + tim.peters, steven.daprano
messages: + msg327888
2018-10-17 10:10:40cykerwaycreate