Replace list sorting merge_collapse()? #78742

Closed
tim-one opened this issue Sep 1, 2018 · 27 comments
Labels
3.8 (only security fixes), performance (Performance or resource usage)

Comments


tim-one commented Sep 1, 2018

BPO 34561
Nosy @tim-one, @cfbolz, @k7hoven, @tirkarthi, @brandtbucher, @erlend-aasland, @LLyaudet
PRs
  • bpo-34561: Switch to Munro & Wild "powersort" merge strategy. #28108
  • Files
  • runstack.py
  • runstack.py
  • runstack.py
  • runstack.py
  • runstack.py
  • runstack.py
  • runstack.py
  • runstack.py
  • runstack.py

    tim-one commented Sep 1, 2018

    The invariants on the run-length stack are uncomfortably subtle. There was a flap a while back when an attempt at a formal correctness proof uncovered that the _intended_ invariants weren't always maintained. That was easily repaired (as the researchers suggested), but it remains hard to grasp why it's always correct now. The Java variant didn't take the suggested fix, instead boosting its stack size. But it turns out that's still not enough, because the original researchers made a different mistake in their own paper ;-)

    http://drops.dagstuhl.de/opus/volltexte/2018/9467/

    Buss & Knop recently published results on different ways of managing the stack, and I think they're worth investigating for "real life" use:

    https://arxiv.org/abs/1801.04641

    Offhand, their "2-merge" looks quite appealing to me. The only invariant it maintains is that each run on the stack is at least twice the length of the run one above it on the stack, which can be maintained just by checking the topmost two stack entries in the loop. So, e.g., where merge_collapse() is currently happy with runs of lengths 150 and 100 ending the stack, 2-merge would force a merge (it's not the case that 150 >= 2*100). Both are happy if the stack ends with runs of lengths 200 and 100.
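
    A minimal Python sketch of that rule (merge_at() here is a hypothetical helper that merges runs[i] and runs[i+1] in place on a stack of run lengths):

        def two_merge_collapse(runs, merge_at):
            # The single 2-merge invariant: each run on the stack is at
            # least twice the length of the run above it, so only the
            # topmost two entries ever need checking.
            while len(runs) > 1 and runs[-2] < 2 * runs[-1]:
                merge_at(len(runs) - 2)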

    This is much easier to reason about, simpler to code, and would allow reducing the size of the stack (the worst-case size is proportional to the log of the largest possible array, but log to the base 2 instead of to the current base phi (~1.62)). They developed some formal measures showing that its "merge cost" (an overall measure abstracting somewhat from real-life behavior) is significantly better than that of timsort's current strategy. And a little thought convinced me that, for random data, it would continue to strongly tend toward triggering perfectly balanced merges (merging runs of the same size).
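
    As a back-of-the-envelope illustration of those stack bounds (rough numbers only; the exact bounds also depend on the minimum run length):

        import math

        n = 2**64  # an upper bound on array length on a 64-bit build
        print(math.ceil(math.log(n, 2)))                 # 64 levels at base 2
        print(math.ceil(math.log(n, (1 + 5**0.5) / 2)))  # ~93 levels at base phi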

    When I was developing this code, virtually all the time was spent making merging itself as fast as possible; the stack-management code was just the first thing I tried that "worked well". But it turns out that's the only part researchers care about ;-) I'd be pleased if it could be replaced with something both better and simpler, or even just simpler but no worse. But the devil's in the details ...

    tim-one added the 3.8 and performance labels Sep 1, 2018

    tim-one commented Sep 1, 2018

    The attached runstack.py models the relevant parts of timsort's current merge_collapse and the proposed 2-merge. Barring conceptual or coding errors, they appear to behave much the same with respect to "total cost", with no clear overall winner. Of course cases can be constructed to make either look better. As expected, twomerge requires fewer stack levels. And they behave identically when all runs have the same length.

    Note that the notion of "cost" here is simply that merging runs of lengths A and B always has cost A+B. That should be viewed as worst case, where the rest of timsort finds nothing to exploit.


    tim-one commented Sep 3, 2018

    Looks like all sorts of academics are exercised over the run-merging order now. Here's a paper that's unhappy because timsort's strategy, and 2-merge too, aren't always near-optimal with respect to the entropy of the distribution of natural run lengths:

    "Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods That Optimally Adapt to Existing Runs"
    J. Ian Munro
    Sebastian Wild
    https://arxiv.org/abs/1805.04154

    Their alternatives are worth looking at too. Offhand, their "powersort" ordering looks more promising to me than their "peeksort". It was very deliberate that timsort looks at whether to merge a run immediately after identifying one, for the "freshness in memory hierarchy" reason. That peeksort does not is, I expect, more significant in real life than the "one little blemish" they call it.

    In any case, don't flip out over this. On randomly ordered input, timsort already strongly tends toward perfectly balanced merges, and there's no significant improvement to be had over that. On the other hand, on inputs with significant order that _can_ be exploited by this approach, the gains are far more due to "galloping" than to the order in which runs are merged. All these papers ignore galloping entirely. Example: take range(1_000_000), cut it in half, and riffle shuffle it. So you get

    0 500000 1 500001 ... 499999 999999

    or

    500000 0 500001 1 ... 999999 499999

    Either way, there are half a million natural runs, each of length 2. Any balanced way of merging them is as good as it gets. It's galloping alone that buys huge improvements in cases like this.
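
    For concreteness, one way to build the first of those orderings (a sketch):

        half = 500_000
        a = list(range(2 * half))
        riffled = [x for pair in zip(a[:half], a[half:]) for x in pair]
        # riffled == [0, 500000, 1, 500001, ...]: half a million natural
        # runs, each of length 2.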

    Especially in the context of Python, where object comparisons are (in general) far more expensive than moving pointers around, some excess in worst-case memory copying is, well, "one little blemish" ;-)

    But still worth addressing if feasible. Now that sorting often also adapts to avoid the relatively massive expense of PyObject_RichCompareBool, memory-movement costs become more important. Approaches like 2-merge also give simpler merge_collapse() code that's easier to reason about, and reduce the worst-case stack size (which becomes very easy to compute in advance instead of the current puzzle).


    k7hoven mannequin commented Sep 4, 2018

    It doesn’t seem like there’s a real problem here, but you seem to suggest some useful improvements are possible. I don’t know much about sorting algorithms per se. What do you mean by galloping?


    tim-one commented Sep 4, 2018

    "Galloping" is the heart & soul of Python's sorting algorithm. It's explained in detail here:

    https://github.com/python/cpython/blob/master/Objects/listsort.txt

    The Java fork of the sorting code has had repeated bugs due to reducing the size of "the stack" used to hold merge states. Read the papers already referenced for details about that.

    There is no "bug" here in the Python version. It's that the complex code Java keeps tripping over ;-) _could_ (possibly) be replaced by simpler code where the max stack size needed is obvious; that (e.g.) 2-merge moves around less memory overall in some cases where the current scheme is particularly wasteful in this respect; and that Munro & Wild present more ambitious merge-ordering schemes that are said to be provably near-optimal in a broader sense than 2-merge achieves.


    tim-one commented Sep 5, 2018

    A new version of the file models a version of the powersort merge ordering too. It clearly dominates timsort and 2-merge in all cases tried, for this notion of "cost".

    Against it, its code is much more complex, and the algorithm is very far from obvious. The paper "motivates" it but doesn't really explain it in enough detail to make most of it clear (but refers to other papers where pieces of it are developed).

    Curious: unless a run is the last in an array, powersort never merges it immediately upon finding it. Instead its start and length are used to compute a "power", in turn used to decide whether to merge some previous number of runs. A dollar to anyone who can figure out what the heck the "power" computation is really doing in less than a full day without reading the paper ;-) Then two dollars if you can prove that the "never equal!" assert holds.
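
    (A partial spoiler: here's a Python sketch of one way such a "power" computation can go. Runs are (start, length) pairs in an array of n elements, and the power works out to be the index of the first bit at which the binary expansions of the two runs' midpoints, taken as fractions of n, differ.)

        def power(s1, n1, n2, n):
            # Doubled midpoints of the two adjacent runs, kept as exact
            # ints: a/2 = s1 + n1/2 and b/2 = s1 + n1 + n2/2.
            a = 2 * s1 + n1
            b = a + n1 + n2
            k = 0
            # Generate bits of (a/2)/n and (b/2)/n until they differ.
            while True:
                k += 1
                if a >= n:      # both next bits are 1
                    a -= n
                    b -= n
                elif b >= n:    # the bits differ: 0 for a, 1 for b
                    return k
                a *= 2
                b *= 2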

    It would require timing lots of C code on various cases to see whether the possible delay in merging new runs really matters. It's unclear to me a priori, because it buys something too (less memory copying overall).

    In any case, just switching to 2-merge would be easy, and that alone is enough to squash the contrived "bad cases" for the current approach. powersort would too, but may also actually help a bit in ordinary cases. Or not.


    k7hoven mannequin commented Sep 5, 2018

    Haha, ok. Cache optimization makes it pretty complicated to reason about true costs. Anyway, it’s not obvious to me even that the run lengths would need to be descending for best performance. I’m not even starting to think about how the merging order might affect galloping boosts on realistic data ;-). I didn’t get to that power thing at this point, but it looks like it’s unrelated to this consideration, except perhaps by chance. I might have time tonight to have a look at that. Surely we don’t want an algorithm that’s optimized for what they call "Timsort drag" sequences in the literature ;-).


    k7hoven mannequin commented Sep 6, 2018

    So it looks like we're working with a logarithmic measure of the "cost".

    I'm operating largely based on your description of Timsort in the link in msg324597, which the paper also refers to. But since the paper is sorting an array of Java ints (not Integers), I assume the performance comparisons of the code they timed are not really representative of Python equivalents. Probably galloping boosts are much larger in the Python case.

    I haven't tried running the attached code yet, because I mostly carry a tablet with me now. The never-equal assert probably doesn't get any more obvious by running the code, though, anyway?-).


    tim-one commented Sep 6, 2018

    The notion of cost is that merging runs of lengths A and B has "cost" A+B, period. Nothing to do with logarithms. Merge runs of lengths 1 and 1000, and it has cost 1001.

    They don't care about galloping, only about how the order in which merges are performed affect that notion of cost. Example: suppose we have runs of lengths 1, 2, and 3. Merge 1 and 2 first for a cost of 1+2 = 3, and we're left with runs of length 3 and 3. Merge those for a cost of 3+3 = 6, and add to the earlier cost of 3 for a total cost of 9. But if 2 and 3 are merged first that has cost 5, then merging runs of length 1 and 5 has a cost of 6, added to the earlier 5 gives a total cost of 11. So it's cheaper to merge 1 and 2 first.
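
    The same arithmetic as a few lines of Python (total_cost() is purely an illustration of the cost model, not part of any sort):

        def total_cost(runs, merge_order):
            # merge_order gives, for each merge, the stack index i at which
            # to merge adjacent runs; merging lengths A and B costs A + B.
            runs = list(runs)
            cost = 0
            for i in merge_order:
                merged = runs[i] + runs[i + 1]
                cost += merged
                runs[i:i + 2] = [merged]
            return cost

        print(total_cost([1, 2, 3], [0, 0]))  # merge 1 and 2 first: 9
        print(total_cost([1, 2, 3], [1, 0]))  # merge 2 and 3 first: 11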

    Galloping has nothing to do with this measure, nor does the binary insertion sort portion of timsort. And Java itself doesn't use timsort for sorting integers anyway (the overheads are too high when comparisons are dirt cheap). They're trying to quantify the effects of their merge-ordering strategies, relative to timsort's merge-ordering strategy.

    Which could have been done more clearly by ignoring Java's timsort implementation entirely and just changing the parts of their code that implement _their_ merge-ordering strategy. timsort's strategy is hard to analyze, but is trivially easy to _implement_. As is, they're inadvertently timing all sorts of stuff that's orthogonal to the merge-ordering strategy.


    k7hoven mannequin commented Sep 6, 2018

    Sorry, I meant that the funny code in the "power" computation in powersort is a logarithmic measure.

    But I don't know where that came from. I just looked at Timsort and am now figuring out how powersort is different.


    k7hoven mannequin commented Sep 6, 2018

    And by "looking at" Timsort, I mean reading your explanation. The motivation for merge ordering and so on was already quite clear from there. But that motivation does not imply that the stack has to be monotonic in run length, although memory considerations might.


    tim-one commented Sep 6, 2018

    No, there's no requirement that run lengths on the stack be ordered in any way by magnitude. That's simply one rule timsort uses, as well as 2-merge and various other schemes discussed in papers. powersort has no such rule, and that's fine.

    Regardless, rules must be such that the max stack size is at worst logarithmic in the number of array elements. It's also nice if it's obvious that a sequence of equal-length runs will lead to perfectly balanced merges, which is "obvious enough" with the timsort and 2-merge rules. It also happens to be true of the powersort rules, but harder to see from staring at code because it's hard to compute "node powers" in one's head.


    tim-one commented Sep 16, 2018

    New version of runstack.py.

    • Reworked code to reflect that Python's sort uses (start_offset, run_length) pairs to record runs.

    • Two unbounded-integer power implementations, one using a loop and the other division. The loop version implies that, in Python's C implementation, size_t arithmetic would always suffice. The division version shows that 2*d-1 bit unsigned division suffices if the # of array elements fits in d bits (so 64-bit ints in C would suffice for arrays up to 2**32 elements). A sketch of the division approach follows this list.

    • Another power implementation using frexp - unpromising.

    • And another that specializes the division method by rounding the array size up to a power of 2, removing the need for division. Maybe worth looking at more, but offhand the results were significantly poorer.

    • Added a "bad case" for powersort - surprising! timsort and 2-merge are optimal in this case. powersort "merge cost" is up to a third larger.
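
    The division approach, sketched with unbounded Python ints (illustrative names; d is the number of bits needed to hold the array length n):

        def power_div(s1, n1, n2, n, d):
            # Doubled midpoints of the two adjacent runs.
            a = 2 * s1 + n1
            b = a + n1 + n2
            # d-bit truncations of the midpoints' binary expansions as
            # fractions of n -- one division each.
            qa = (a << d) // (2 * n)
            qb = (b << d) // (2 * n)
            # The power is the position, counted from the most significant
            # of the d bits, of the first bit where the expansions differ.
            return d - (qa ^ qb).bit_length() + 1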


    tim-one commented Sep 16, 2018

    Another runstack.py adds a bad case for 2-merge, and an even worse (percentage-wise) bad case for timsort. powersort happens to be optimal for both.

    So they all have contrived bad cases now. powersort's bad cases are the least bad. So far ;-) But I expect that to remain so.


    vincentjuge mannequin commented Sep 21, 2018

    Dear all,

    After my colleagues and I worked on the first paper you mention, I recently created another merge-based sorting algorithm, which I called "Adaptive Shivers Sort". This is a close variant of the Augmented Shivers Sort presented by Buss & Knop in their article
    https://arxiv.org/abs/1801.04641

    Like Munro & Wild's Powersort and Peeksort, this algorithm has an optimal worst-case running time (up to a linear additive constant) when measured with the merge cost model that is used in all the articles you cite in this discussion. It also maintains a stack of size log_2(n).
    I could not prove such an optimality result for Buss & Knop's Augmented Shivers Sort.

    Custom tests that I had performed suggest that this algorithm is, in practice, as efficient as Munro & Wild's algorithms, and also does not suffer from critically bad cases.

    Moreover, like Buss & Knop's 2-merge, and unlike Munro & Wild's Powersort and Peeksort, this algorithm has the advantage of having a structure that is very similar to that of Timsort, which may be an advantage in your situation.

    That is why, upon reading your discussion, I refurbished my notes about Adaptive Shivers Sort, which you may find here:
    http://igm.univ-mlv.fr/~juge/papers/shivers_arxiv.pdf

    These notes include a description of the algorithm and a worst-time complexity analysis. If extending my analysis of this algorithm or investigating tuning details were of interest for you, I would be happy to help.

    Best regards,

    Vincent Jugé


    tim-one commented Sep 21, 2018

    Thank you, Vincent! I very much enjoyed - and appreciated - your paper I referenced at the start. Way back when, I thought I had a proof of O(N log N), but never wrote it up because some details weren't convincing - even to me ;-) . Then I had to move on to other things, and never got back to it. I was probably mistaken. So it was delightful to see your elegant proof of that, and even more.

    I'm attaching a new runstack.py that includes code modeling your new adaptive Shivers strategy. Like all these approximations, it has its own "bad cases" on smallish inputs. Based on the specific cases this code runs, powersort remains in a category of its own, with timsort, twomerge, and shivers roughly tied on _average_ merge cost.

    Part of the reason, I suspect: powersort is the only strategy here that's always optimal for 3-run cases. Which it buys at the cost of never merging the run just discovered (unless it's the last run in the array). For example, in

    31, 16, 1
    

    twomerge and shivers merge 31 and 16 first, before seeing 1, which is "far from" optimal. timsort and powersort merge 16 and 1 first, although that's by design in powersort and by luck in timsort. In

    16, 31, 1
    

    only powersort does the best thing (note: runstack.py doesn't model peeksort; I'm sure it would merge 31 and 1 first on that too).

    I care about small-number-of-runs because real-life Python lists aren't asymptotic in nature ;-) People deliberately construct lists with a small number of runs now before sorting, because they know Python's sort can exploit that. For many other cases, all runs are artificially forced to length minrun, and then any scheme at all that approximates balanced merging is about as good as any other.

    What I can't know before we time things is whether order-of-merging makes a measurable difference in real life, and whether powersort's possible delay in merging the most recent run has bad cache effects that overwhelm its smaller "merge cost".

    In any case, I'm keen to replace timsort's merge-order strategy with _something_ that's much easier to analyze. That makes your Shivers variant and powersort the only two real contenders now. It seems quite remarkable that the Shivers variant has such good asymptotics! It's certainly the easiest to modify Python's C code to implement (straightforward edits in a single function).


    vincentjuge mannequin commented Sep 22, 2018

    I see... Indeed, my only goal when adapting Shivers Sort was to maintain some invariant that would make the analysis easy, while mimicking the arguments developed by Buss & Knop for their analysis of (plain) Shivers Sort. It is clear, however, that the n(H+4) upper bound I get should be refined when H is small, which more or less amounts to tackling the 3-run case you mention.

    I am reasonably convinced that the current version of Adaptive Shivers Sort could be adapted to take this problem into account, while keeping its overall simple structure and complexity analysis.

    If this first step turns out to be feasible, I will also look at the latest version of runstack.py to investigate other possible improvements on Adaptive Shivers Sort.


    vincentjuge mannequin commented Oct 2, 2018

    After having worked a little on improving AdaptiveShiversSort in few-run cases, I designed a new version of the algorithm, called shivers2 in the runstack.py attached to this message.

    It looks more complicated than the original AdaptiveShiversSort but the spirit, inspired by your comments, is as follows:

    • we allow the second bottom-most element of the stack to be moderately larger than the bottom-most one; if that 2nd element were way larger than the 1st one, merging them, even if not optimal, would not be too painful anyway;
    • we may force a merge between these 1st and 2nd elements only when the 2nd element is about to be merged with the 3rd one;
    • similarly, we allow the top-most element of the stack to be moderately larger than the second top-most one;
    • otherwise, we stick to the behaviour of AdaptiveShiversSort.

    This variant's performance seems comparable with Powersort's.
    Reasonable follow-up work that I plan to do in the coming weeks (when I have time to do so) is:

    • prove that the new algorithm still runs in n H + O(n),
    • investigate whether we can have guarantees such as "this new sort's merge cost is at most XXX times larger than the optimal merge cost",
    • investigate improvements for Powersort.

    Given its excellent overall performance, I think that trying to slightly tweak Powersort in cases it underperforms other sorts might be worth some effort.

    Best,

    Vincent


    tim-one commented Aug 29, 2021

    The merge order was mentioned on python-dev today, and a quick web search turned up a revision of Vincent Jugé's "Adaptive Shivers Sort: An Alternative Sorting Algorithm" paper I hadn't seen before:

    https://arxiv.org/pdf/1809.08411.pdf

    Its "length-adaptive ShiversSort" variation sure _sounds_ like it was intended to address the issues we discussed here (some "bad" cases when very few runs are in play).

    The analyses in that paper show that length-adaptive ShiversSort, and powersort, have the same asymptotic best and worst-case behaviors. Average case? Who knows? Hard to believe it could really be an issue, because even the worst cases are pretty darn good.

    So length-adaptive ShiversSort is worth looking into. But, if someone has the energy to pursue it, I'd be happy, at this point, just to give plain old "adaptive ShiversSort" a try. The version of the paper linked to above even lists the precise changes needed to CPython's code to implement it (although a code review would ask for changes, most obviously that its "int x" is too narrow an integer type).


    LLyaudet mannequin commented Aug 29, 2021

    My benchmarks could be improved, but I found that Shivers' sort and adaptive Shivers' sort (aka Jugé's sort) perform better than Tim's sort.

    It can be done easily by switching the main natural merge strategy, like I did here:
    https://github.com/LLyaudet/TSODLULS/commit/2968c4b4ca58ae794157dc9636ed87017e5f7a17

    The relevant code is not very long:
    /**
     * This strategy is from ShiversSort:
     * see the research report by Olin Shivers, Georgia Institute of Technology, 2002.
     * It uses bitwise tricks to check a condition on the floors of the base-2
     * logarithms of the run lengths.
     * When I benchmark it, it is slightly faster than Tim's sort strategy.
     */
    #define TSODLULS_natural_merge_main_strategy__Shivers_sort \
    while(merge_state.i_run_instances_count > 1){\
      size_t i_merge_at = merge_state.i_run_instances_count - 2;\
      size_t i_order_of_magnitude = merge_state.arr_run_instances[i_merge_at + 1].i_length;\
      /* Stop merging when the run below has a set bit above every bit of the top run, i.e. floor(log2(below)) > floor(log2(top)). */\
      if(i_order_of_magnitude < ((~i_order_of_magnitude) & merge_state.arr_run_instances[i_merge_at].i_length)){\
        break;\
      }\
      i_compare_result = TSODLULS_MERGE_TWO_RUNS(&merge_state, i_merge_at);\
      if(i_compare_result != 0){\
        goto clean_and_return_error;\
      }\
    }\
    
    
    /**
     * This strategy is from AdaptiveShiversSort:
     * see the articles by Vincent Jugé, for example 1024 Bulletin de la SIF,
     * avril 2020, in French, or the preprint on arXiv, or the SODA 2020 proceedings.
     * It uses bitwise tricks to check a condition on the floors of the base-2
     * logarithms of the run lengths.
     * When I benchmark it, it is slightly faster than Tim's sort strategy.
     * Despite its resemblance to Shivers's sort, the distinct properties of this
     * strategy mean that JugéSort would probably be a better name than
     * AdaptiveShiversSort, or an even better name for the whole algorithm should be
     * TimShiversJugéSort, and I must have missed many names ;)
     * With AdaptiveShiversSort we avoid 'é' and diacritics in function names ;P
     */
    #define TSODLULS_natural_merge_main_strategy__adaptive_Shivers_sort \
    while(merge_state.i_run_instances_count > 2){\
      size_t i_merge_at = merge_state.i_run_instances_count - 3;\
      size_t i_order_of_magnitude = merge_state.arr_run_instances[i_merge_at + 1].i_length\
                                  | merge_state.arr_run_instances[i_merge_at + 2].i_length;\
      /* Merge the 3rd and 2nd runs from the top while floor(log2(3rd run length)) <= floor(log2(top run length | 2nd run length)). */\
      if(i_order_of_magnitude < ((~i_order_of_magnitude) & merge_state.arr_run_instances[i_merge_at].i_length)){\
        break;\
      }\
      i_compare_result = TSODLULS_MERGE_TWO_RUNS(&merge_state, i_merge_at);\
      if(i_compare_result != 0){\
        goto clean_and_return_error;\
      }\
    }\

    I will try to provide a patch before the end of the week.


    tim-one commented Aug 29, 2021

    Added new runstack.py.

    New shivers2() adds the implementation of adaptive ShiversSort from Vincent's later paper. While the code is simpler, it appears to behave identically.

    New shivers3() adds, from the same paper, the new "length-adaptive ShiversSort". Wow! This is the only thing we've seen yet that's in the same universe as powersort. In fact, it usually does a tiny bit better (on the randomized cases). But it's not obvious how to rework its "if" test to be truly efficient (as-is, it needs two divisions, two calls to log2(), and two calls to floor()).

    Worth some thought, though. From the results here, length-adaptive ShiversSort is a bigger improvement over (plain-old) adaptive ShiversSort than the latter is over timsort.


    tim-one commented Aug 30, 2021

    And another runstack.py adds shivers4(), which reworks shivers3() (length-adaptive ShiversSort) so that division, log2(), and floor() aren't used anymore. It does need a loop, though, which goes around k times, where k is the smallest integer for which

        2**k * max(run_length_1, run_length_2, run_length_3) >= 1 + len(original_list)
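
    In other words (a sketch; the names are illustrative):

        def smallest_k(run_length_1, run_length_2, run_length_3, n):
            # Smallest k with 2**k * max(run lengths) >= 1 + n, found with
            # only shifts and comparisons.
            biggest = max(run_length_1, run_length_2, run_length_3)
            k = 0
            while (biggest << k) < n + 1:
                k += 1
            return k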


    tim-one commented Aug 30, 2021

    New runstack.py mostly adds comments about a surprise: the idea that length-adaptive ShiversSort ekes out better results than powersort appears nearly unique to the specific "0.80" cutoff used in the random-case generation code to pick between two uniform distributions. Change that cutoff, and powersort almost always does better.

    So powersort remains the best known overall, although shivers4 remains competitive with it.


    tim-one commented Sep 1, 2021

    I created a PR that implements the powersort merge strategy:

    #28108

    Across all the time this issue report has been open, that strategy continues to be the top contender. Enough already ;-) It's indeed a more difficult change to make to the code, but that's in relative terms. In absolute terms, it's not at all a hard change.

    Laurent, if you find that some variant of ShiversSort actually runs faster than that, let us know here! I'm a big fan of Vincent's innovations too, but powersort seems to do somewhat better "on average" than even his length-adaptive ShiversSort (and implementing that too would require changing code outside of merge_collapse()).


    LLyaudet mannequin commented Sep 2, 2021

    Thanks for the patch Tim.
    If I find a rival that stands against power sort, I'll tell you.


    tim-one commented Sep 6, 2021

    New changeset 5cb4c67 by Tim Peters in branch 'main':
    bpo-34561: Switch to Munro & Wild "powersort" merge strategy. (GH-28108)

    tim-one closed this as completed Sep 6, 2021

    cfbolz mannequin commented Sep 6, 2021

    Thanks for your work Tim, just adapted the changes to PyPy's Timsort, using bits of runstack.py!
