Message 404849 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	Stefan Pochmann, rhettinger, tim.peters
Date	2021-10-23.00:52:57
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1634950377.79.0.552987318387.issue45530@roundup.psfhosted.org>
In-reply-to

Content
Stefan, thanks - I think I understand what you're driving at now. You're (re)discovering that sorting by lexicographic ordering rules is _inherently_ suboptimal if list elements can only be compared "starting at index 0" each time. Not just tuples, but also, e.g., lists, byte arrays, and even strings. Any sequence type whose ordering derives from lexicographic generalization of its elements' ordering. This is quite independent of whether or not CPython tries to exploit type homogeneity. The usual() solution is indeed to exploit a kind of bucket sort instead, first sorting at elements' index 0 alone, saying all those elements with the same value at index 0 constitute "a bucket", and then applying the same idea to each bucket but sorting on index 1 instead. Where those in turn are partitioned into buckets equal at index 1, and those buckets are similarly each sorted on index 2. And so on, and so on. No doubt that can be a big win, and even in the absence of the type homogeneity tricks. It's far beyond the scope of _this_ PR, though, and is really an entirely different approach. I've thought about it at times, but it's "a real project" to do it right. In effect, it would implement an optimized generalization of the SO OP's "automagically convert multikey to multisort" wish. Fine by me if someone pursues that, but I don't have the energy for it, nor much interest either in doing a half-hearted job of it. I always expected that if it came up at all, it would be in the context of sorting lists of strings. For example, consider: xs = ['x' 10_000 + str(i) for i in range(1_000_000)] random.shuffle(xs) Even with type homogeneity tricks, Python's sorting of the result is very far from optimal. Every single comparison starts by burning time skipping over the (at least) ten thousands equal characters at the start. The "bucket sort" (or it could be viewed as a form of MSD radix sort) quickly discovers that all the characters at index 0 are equal, and also all those at index 1, ..., and all those at index 9999. The "real work" of sorting is then reduced to working on the (at most) 6 characters remaining. (*) But I'm lying ;-) The actual usual solution is to use "multikey quicksort", because it's easy to code and works "in place". But it's not a stable sort. String sorting doesn't care about that, but sorting other kinds of sequences can care a lot.

Stefan, thanks - I think I understand what you're driving at now.

You're (re)discovering that sorting by lexicographic ordering rules is _inherently_ suboptimal if list elements can only be compared "starting at index 0" each time. Not just tuples, but also, e.g., lists, byte arrays, and even strings. Any sequence type whose ordering derives from lexicographic generalization of its elements' ordering.

This is quite independent of whether or not CPython tries to exploit type homogeneity.

The usual(*) solution is indeed to exploit a kind of bucket sort instead, first sorting at elements' index 0 alone, saying all those elements with the same value at index 0 constitute "a bucket", and then applying the same idea to each bucket but sorting on index 1 instead. Where those in turn are partitioned into buckets equal at index 1, and those buckets are similarly each sorted on index 2. And so on, and so on.

No doubt that can be a big win, and even in the absence of the type homogeneity tricks. It's far beyond the scope of _this_ PR, though, and is really an entirely different approach.

I've thought about it at times, but it's "a real project" to do it right. In effect, it would implement an optimized generalization of the SO OP's "automagically convert multikey to multisort" wish.

Fine by me if someone pursues that, but I don't have the energy for it, nor much interest either in doing a half-hearted job of it.

I always expected that if it came up at all, it would be in the context of sorting lists of strings. For example, consider:

xs = ['x' * 10_000 + str(i) for i in range(1_000_000)]
random.shuffle(xs)

Even with type homogeneity tricks, Python's sorting of the result is very far from optimal. Every single comparison starts by burning time skipping over the (at least) ten thousands equal characters at the start.

The "bucket sort" (or it could be viewed as a form of MSD radix sort) quickly discovers that all the characters at index 0 are equal, and also all those at index 1, ..., and all those at index 9999. The "real work" of sorting is then reduced to working on the (at most) 6 characters remaining.

(*) But I'm lying ;-) The actual usual solution is to use "multikey quicksort", because it's easy to code and works "in place". But it's not a stable sort. String sorting doesn't care about that, but sorting other kinds of sequences can care a lot.

History
Date	User	Action	Args
2021-10-23 00:52:57	tim.peters	set	recipients: + tim.peters, rhettinger, Stefan Pochmann
2021-10-23 00:52:57	tim.peters	set	messageid: <1634950377.79.0.552987318387.issue45530@roundup.psfhosted.org>
2021-10-23 00:52:57	tim.peters	link	issue45530 messages
2021-10-23 00:52:57	tim.peters	create