Title: Replace list sorting merge_collapse()?
Type: performance Stage: needs patch
Components: Versions: Python 3.8
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: koos.zevenhoven, tim.peters, vincent.juge, xtreak
Priority: normal Keywords:

Created on 2018-09-01 01:22 by tim.peters, last changed 2018-10-02 09:51 by vincent.juge.

File name Uploaded Description Edit tim.peters, 2018-09-01 21:37 tim.peters, 2018-09-05 05:55 tim.peters, 2018-09-16 00:01 tim.peters, 2018-09-16 19:07 tim.peters, 2018-09-21 21:55 vincent.juge, 2018-10-02 09:51
Messages (18)
msg324455 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-01 01:22
The invariants on the run-length stack are uncomfortably subtle.  There was a flap a while back when an attempt at a formal correctness proof uncovered that the _intended_ invariants weren't always maintained.  That was easily repaired (as the researchers suggested), but it remains hard to grasp why it's always correct now.  The Java variant didn't take the suggested fix, instead boosting its stack size.  But it turns out that's still not enough, because the original researchers made a different mistake in their own paper ;-)

Buss & Knop recently published results on different ways of managing the stack, and I think they're worth investigating for "real life" use:

Offhand, their "2-merge" looks quite appealing to me.  The only invariant it maintains is that each run on the stack is at least twice the length of the run one above it on the stack, which can be maintained just by checking the topmost two stack entries in the loop.  So, e.g., where merge_collapse() is currently happy with runs of lengths 150 and 100 ending the stack, 2-merge would force a merge (it's not the case that 150 >= 2*100).  Both are happy if the stack ends with runs of lengths 200 and 100.

This is much easier to reason about, simpler to code, and would allow reducing the size of the stack (worst-case size is proportional to the log (of the largest possible array) to the base 2 instead of to the current base phi (~1.62)).  They developed some formal measures showing that its "merge cost" (an overall measure abstracting some from real-life behavior) is significantly better than timsort's current strategy.  And a little thought convinced me it preserves that, for random data, it would continue to strongly tend towared triggering perfectly balanced merges (merging runs of the same size).

When I was developing this code, virtually all the time was spent making merging itself as fast as possible; the stack-management code was just the first thing I tried that "worked well".  But turns out that's the only part researchers care about ;-) I'd be pleased if it could be replaced with something both better and simpler, or even just simpler but no worse.  But the devil's in the details ...
msg324468 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-01 21:37
The attached models the relevant parts of timsort's current merge_collapse and the proposed 2-merge.  Barring conceptual or coding errors, they appear to behave much the same with respect to "total cost", with no clear overall winner.  Of course cases can be constructed to make either look better.  As expected, twomerge requires fewer stack levels.  And they behave identically when all runs have the same length.

Note that the notion of "cost" here is simply that merging runs of lengths A and B always has cost A+B.  That should be viewed as worst case, where the rest of timsort finds nothing to exploit.
msg324532 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-03 19:08
Looks like all sorts of academics are exercised over the run-merging order now.  Here's a paper that's unhappy because timsort's strategy, and 2-merge too, aren't always near-optimal with respect to the entropy of the distribution of natural run lengths:

"Nearly-Optimal Mergesorts:  Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs"
J. Ian Munro
Sebastian Wild

Their alternatives are worth looking at too.  Offhand, their "powersort" ordering looks more promising to me than their "peeksort".  It was very deliberate that timsort looks at whether to merge a run immediately after identifying one, for the "freshness in memory hierarchy" reason.  That peeksort does not is, I expect, more significant in real life than the "one little blemish" they call it.

In any case, don't flip out over this.  On randomly ordered input, timsort already strongly tends toward perfectly balanced merges, and there's no significant improvement to be had over that.  On the other hand, on inputs with significant order that _can_ be exploited by this approach, the gains are far more due to "galloping" than to the order in which runs are merged.  All these papers ignore galloping entirely.  Example:  take range(1_000_000), cut it in half, and riffle shuffle it.  So you get

    0 500000 1 500001 ... 499999 999999 or
    500000 0 500001 1 ... 999999 499999
Either way, there are half a million natural runs, each of length 2.  Any balanced way of merging them is as good as it gets.  It's galloping alone that buys huge improvements in cases like this.

Especially in the context of Python, where object comparisons are (in general) far more expensive than moving pointers around, some excess in worst-case memory copying is, well, "one little blemish" ;-)

But still worth addressing if feasible.  Now that sorting often also adapts to avoid the relatively massive expense of PyObject_RichCompareBool, memory-movement costs become more important.  Approaches like 2-merge also give simpler merge_collapse() code that's easier to reason about, and reduces the worst-case stack size (which becomes very easy to compute in advance instead of the current puzzle).
msg324596 - (view) Author: Koos Zevenhoven (koos.zevenhoven) * Date: 2018-09-04 18:27
It doesn’t seem like there’s a real problem here, but you seem to suggest there would be some useful improvements possible here. I don’t know much about sorting algorithms per se. What do you mean by galloping?
msg324597 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-04 18:46
"Galloping" is the heart & soul of Python's sorting algorithm.  It's explained in detail here:

The Java fork of the sorting code has had repeated bugs due to reducing the size of "the stack" used to hold merge states.  Read the papers already referenced for details about that.

There is no "bug" here in the Python version.  It's that the complex code Java keeps tripping over ;-) , _could_ (possibly) be replaced by simpler code where the max stack size needed is obvious; that (e.g.) 2-merge moves around less memory overall in some cases where the current scheme is particularly wasteful in this respect; and that Munro & Wild present more ambitious merge-ordering schemes that are said to be provably near-optimal in a broader sense than 2-merge achieves.
msg324607 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-05 05:55
A new version of the file models a version of the `powersort` merge ordering too.  It clearly dominates timsort and 2-merge in all cases tried, for this notion of "cost".

Against it, its code is much more complex, and the algorithm is very far from obvious.  The paper "motivates" it but doesn't really explain it in enough detail to make most of it clear (but refers to other papers where pieces of it are developed).

Curious:  unless a run is the last in an array, powersort never merges it immediately upon finding it.  Instead its start and length are used to compute a "power", in turn used to decide whether to merge some previous number of runs.  A dollar to anyone who can figure out what the heck the "power" computation is really doing in less than a full day without reading the paper ;-)  Then two dollars if you can prove that the "never equal!" assert holds.

It would require timing lots of C code on various cases to see whether the possible delay in merging new runs really matters.  It's unclear to me a priori, because it buys something too (less memory copying overall).

In any case, just switching to 2-merge would be easy, and that alone is enough to squash the contrived "bad cases" for the current approach.  `powersort` would too, but may also actually help a bit in ordinary cases.  Or not.
msg324651 - (view) Author: Koos Zevenhoven (koos.zevenhoven) * Date: 2018-09-05 20:13
Haha ok. Cache optimization makes it pretty complicated to reason about true costs. Anyway, it’s not obvious to me even that the run lengths would need to be descending for best performace. I’m not even starting to think about how the merging order might affect galloping boosts on realistic data ;-). I didn’t get to that power thing at this point, but it looks like it’s unrelated to this consideration, except perhaps by chance. I might have time tonight to have a look at that. Surely we don’t want an algorithm that’s optimized for what they call ”Timsort drag” sequences in the literature ;-).
msg324708 - (view) Author: Koos Zevenhoven (koos.zevenhoven) * Date: 2018-09-06 18:39
So it looks like we're working with a logarithmic measure of the "cost". 

I'm operating largely based on your description of Timsort in the link in msg324597, which the paper also refers to. But since the paper is sorting an array of Java ints (not Integers), I assume the performance comparisons of the code they timed is not really representative of Python equivalents. Probably galloping boosts are much larger in the Python case.

I haven't tried running the attached code yet, because I mostly carry a tablet with me now. The never-equal assert probably doesn't get any more obvious by running the code, though, anyway?-).
msg324710 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-06 19:02
The notion of cost is that merging runs of lengths A and B has "cost" A+B, period.  Nothing to do with logarithms.  Merge runs of lengths 1 and 1000, and it has cost 1001.

They don't care about galloping, only about how the order in which merges are performed affect that notion of cost.  Example:  suppose we have runs of lengths 1, 2, and 3.  Merge 1 and 2 first for a cost of 1+2 = 3, and we're left with runs of length 3 and 3.  Merge those for a cost of 3+3 = 6, and add to the earlier cost of 3 for a total cost of 9.  But if 2 and 3 are merged first that has cost 5, then merging runs of length 1 and 5 has a cost of 6, added to the earlier 5 gives a total cost of 11.  So it's cheaper to merge 1 and 2 first.

Galloping has nothing to do with this measure, nor does the binary insertion sort portion of timsort.  And Java itself doesn't use timsort for sorting integers anyway (the overheads are too high when comparisons are dirt cheap).  They're trying to quantify the effects of their merge-ordering strategies, relative to timsort's merge-ordering strategy.

Which could have been done more clearly by ignoring Java's timsort implementation entirely and just changing the parts of their code that implement _their_ merge-ordering strategy.  timsort's strategy is hard to analyze, but is trivially easy to _implement_.  As is, they're inadvertently timing all sorts of stuff that's orthogonal to the merge-ordering strategy.
msg324712 - (view) Author: Koos Zevenhoven (koos.zevenhoven) * Date: 2018-09-06 20:27
Sorry, I meant that the funny code in the "power" computation in powersort is a logarithmic measure. 

But I don't know where that came from. I just looked at Timsort and now figuring out how powersort is different.
msg324715 - (view) Author: Koos Zevenhoven (koos.zevenhoven) * Date: 2018-09-06 22:06
And by "looking at" Timsort, I mean reading your explanation. The motivation for merge ordering and so on was already quite clear from there. But that motivation does not imply that the stack has to be monotonous in run length, although memory considerations might.
msg324716 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-06 22:27
No, there's no requirement that run lengths on the stack be ordered in any way by magnitude.  That's simply one rule timsort uses, as well as 2-merge and various other schemes discussed in papers.  powersort has no such rule, and that's fine.

Regardless, rules must be such that the max stack size is at worst logarithmic in the number of array elements.  It's also nice if it's obvious that a sequence of equal-length runs will lead to perfectly balanced merges, which is "obvious enough" with the timsort and 2-merge rules.  It also happens to be true of the powersort rules, but harder to see from staring at code because it's hard to compute "node powers" in one's head.
msg325464 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-16 00:01
New version of

- Reworked code to reflect that Python's sort uses (start_offset, run_length) pairs to record runs.

- Two unbounded-integer power implementations, one using a loop and the other division.  The loop version implies that, in Python's C implementation, size_t arithmetic would always suffice.  The division version shows that 2*d-1 bit unsigned division suffices if the # of array elements fits in d bits (so 64-bit ints in C would suffice for arrays up to 2**32 elements).

- Another power implementation using frexp - unpromising.

- And another that specializes the division method by rounding the array size up to a power of 2, removing the need for division.  Maybe worth looking at more, but offhand the results were significantly poorer.

- Added a "bad case" for powersort - surprising!  timsort and 2-merge are optimal in this case.  powersort "merge cost" is up to a third larger.
msg325498 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-16 19:07
Another adds a bad case for 2-merge, and an even worse (percentage-wise) bad case for timsort.  powersort happens to be optimal for both.

So they all have contrived bad cases now.  powersort's bad cases are the least bad.  So far ;-)  But I expect that to remain so.
msg325988 - (view) Author: Vincent Jugé (vincent.juge) Date: 2018-09-21 14:44
Dear all,

After me and my colleagues worked on the first paper you mention, I recently created another merge-based sorting algorithm, which I called "Adaptive Shivers Sort". This is a close variant of the Augmented Shivers Sort presented by Buss & Knop in their article

Like Munro & Wild's Powersort and Peeksort, this algorithm has an optimal worst-case running time (up to a linear additive constant) when measured with the merge cost model that is used in all the articles you cite in this discussion. It also maintains a stack of size log_2(n).
I could not prove such an optimality result for Buss & Knop Augmented Shivers Sort.

Custom tests that I had performed suggest that this algorithm is, in practice, as efficient as Munro & Wild's algorithms, and also does not suffer from critically bad cases.

Moreover, like Buss & Knop's 2-merge, and unlike Munro & Wild's Powersort and Peeksort, this algorithm has the advantage of having a structure that is very similar to that of Timsort, which may be an advantage in your situation.

That is why, upon reading your discussion, I refurbished my notes about Adaptive Shivers Sort, which you may find here:

These notes include a description of the algorithm and a worst-time complexity analysis. If extending my analysis of this algorithm or investigating tuning details were of interest for you, I would be happy to help.

Best regards,

Vincent Jugé
msg326048 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2018-09-21 21:55
Thank you, Vincent!  I very much enjoyed - and appreciated - your paper I referenced at the start.  Way back when, I thought I had a proof of O(N log N), but never wrote it up because some details weren't convincing - even to me ;-) .  Then I had to move on to other things, and never got back to it.  I was probably mistaken.  So it was delightful to see your elegant proof of that, and even more.

I'm attaching a new that includes code modeling your new adaptive Shivers strategy.  Like all these approximations, it has its own "bad cases" on smallish inputs.  Based on the specific cases this code runs, powersort remains in a category of its own, with timsort, twomerge, and shivers roughly tied on _average_ merge cost.

Part of the reason, I suspect:  powersort is the only strategy here that's always optimal for 3-run cases.  Which it buys at the cost of never merging the run just discovered (unless it's the last run in the array).  For example, in

    31, 16, 1

twomerge and shivers merge 31 and 16 first, before seeing 1, which is "far from" optimal.  timsort and powersort merge 16 and 1 first, although that's by design in powersort and by luck in timsort.  In

    16, 31, 1
only powersort does the best thing (note: doesn't model peeksort; I'm sure it would merge 31 and 1 first on that too).

I care about small-number-of-runs because real-life Python lists aren't asymptotic in nature ;-)  People deliberately construct lists with a small number of runs now before sorting, because they know Python's sort can exploit that.  For many other cases, all runs are artificially forced to length `minrun`, and then any scheme at all that approximates balanced merging is about as good as any other.

What I can't know before we time things is whether order-of-merging makes a measurable difference in real life, and whether powersort's possible delay in merging the most recent run has bad cache effects that overwhelm its smaller "merge cost".

In any case, I'm keen to replace timsort's merge-order strategy with _something_ that's much easier to analyze.  The makes your Shivers variant and powersort the only two real contenders now.  It seems quite remarkable that the Shivers variant has such good asymptotics!  It's certainly the easiest to modify Python's C code to implement (straightforward edits in a single function).
msg326084 - (view) Author: Vincent Jugé (vincent.juge) Date: 2018-09-22 08:23
I see... Indeed, my only goal when adapting Shivers Sort was to maintain some invariant that would make the analysis easy, while mimicking the arguments developed by Buss & Knop for their analysis of (plain) Shivers Sort. It is, however, sure that the n(H+4) upper bound I get should be refined when H is small, which more or less amounts to tackling the 3-run case you mention.

I am reasonably convinced that the current version of Adaptive Shivers Sort could be adapted to take this problem into account, while keeping its overall simple stricture and complexity analysis.

If this first step turns out to be feasible, I will also look at the latest version of to investigate other possible improvements on Adaptive Shivers Sort.
msg326871 - (view) Author: Vincent Jugé (vincent.juge) Date: 2018-10-02 09:51
After having worked a little bit on improving AdaptiveShiversSort on few-run cases, I designed a new version of the algorithm, called shivers2 in the file joined to this message

It looks more complicated than the original AdaptiveShiversSort but the spirit, inspired by your comments, is as follows:
- we allow the second (bottom-most) element of the stack to be moderately larger than the bottom-most one; if that 2nd element were way larger than the 1st one, merging them, even if not optimal, would not be too painful anyways;
- we may force a merge between these 1st and 2nd elements only when the 2nd element is about to be merged with the 3rd one;
- similarly, we allow the top-most element of the stack to be moderately larger than the second top-most one;
- otherwise, we stick to the behaviour of AdaptiveShiversSort.

This variant's performance seems comparable with Powersort's.
Reasonable follow-up work that I plan to do in the coming weeks (when I have time to do so) is:
- prove that the new algorithm still runs in n H + O(n),
- investigate whether we can have guarantees such as "this new sort's merge cost is at most XXX times larger than the optimal merge cost",
- investigate improvements for Powersort.

Given its excellent overall performance, I think that trying to slightly tweak Powersort in cases it underperforms other sorts might be worth some effort.


Date User Action Args
2018-10-02 09:51:27vincent.jugesetfiles: +

messages: + msg326871
2018-09-22 08:23:08vincent.jugesetmessages: + msg326084
2018-09-21 21:55:02tim.peterssetfiles: +

messages: + msg326048
2018-09-21 14:44:08vincent.jugesetnosy: + vincent.juge
messages: + msg325988
2018-09-16 19:07:22tim.peterssetfiles: +

messages: + msg325498
2018-09-16 00:01:10tim.peterssetfiles: +

messages: + msg325464
2018-09-06 22:27:09tim.peterssetmessages: + msg324716
2018-09-06 22:06:20koos.zevenhovensetmessages: + msg324715
2018-09-06 20:27:14koos.zevenhovensetmessages: + msg324712
2018-09-06 19:02:30tim.peterssetmessages: + msg324710
2018-09-06 18:39:16koos.zevenhovensetmessages: + msg324708
2018-09-05 20:13:19koos.zevenhovensetmessages: + msg324651
2018-09-05 07:07:07xtreaksetnosy: + xtreak
2018-09-05 05:55:10tim.peterssetfiles: +

messages: + msg324607
2018-09-04 18:46:31tim.peterssetmessages: + msg324597
2018-09-04 18:27:11koos.zevenhovensetnosy: + koos.zevenhoven
messages: + msg324596
2018-09-03 19:08:12tim.peterssetmessages: + msg324532
2018-09-01 21:37:41tim.peterssetfiles: +

messages: + msg324468
2018-09-01 01:22:04tim.peterscreate