Title: Possible performance improvement for heapq.merge()
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.10
Status: open Resolution:
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Dennis Sweeney, Jim Fasarakis-Hilliard, bbayles, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2019-11-29 02:44 by Dennis Sweeney, last changed 2020-05-31 14:08 by Jim Fasarakis-Hilliard.

File name Uploaded Description Edit rhettinger, 2020-05-17 03:57 Iterative version for comparison Dennis Sweeney, 2020-05-17 12:05 Using a heap that stores each item only once, items move from leaves to root. Dennis Sweeney, 2020-05-17 12:07 Recursive 2-way-merging generators Dennis Sweeney, 2020-05-17 12:26 Using a "loser tree" Dennis Sweeney, 2020-05-18 09:25 like new_merge, except jump straight to leaf instead of traversing down to it Dennis Sweeney, 2020-05-22 07:27 Similar to but handles key and reverse; also uses lists as node structs. rhettinger, 2020-05-22 23:13 Tree formation is iterative and inlined Dennis Sweeney, 2020-05-28 08:27
Pull Requests
URL Status Linked Edit
PR 17422 closed Dennis Sweeney, 2019-12-01 03:05
PR 17729 closed Dennis Sweeney, 2019-12-28 10:18
PR 18427 closed Dennis Sweeney, 2020-02-10 01:28
PR 20550 open Dennis Sweeney, 2020-05-31 05:08
Messages (23)
msg357631 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2019-11-29 02:44
Although the implementation of the heapq.merge function uses an underlying heap structure, its behavior centers on iterators. For this reason, I believe there should either be an alias to this function in the itertools module or at least a recipe in the itertools docs describing the use of heapq.merge.

Furthermore, I have an implementation (attached) of heapq.merge that is twice as fast for two iterables (as used in mergesort and other common cases), and is faster for up to 6 or 7 iterables. I think it would be nice to special-case this for small numbers of iterables for this significant speedup.
msg357632 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2019-11-29 03:06
The following seems like it is a short, readable recipe for itertools.
msg357664 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2019-11-30 21:57
Disregard it would skip over a value that had already been retrieved from the iterator when the loop finished.
msg357665 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-11-30 22:10
Several thoughts:

* heapq.merge() could have plausibly been in the itertools module instead of the heapq module.  However, since it already has a home, there is no reason to move it or to create duplication with an intermodule dependency.  We certainly don't have a rule that the itertools module must house everything accepts an iterator as an input and returns an iterator in the output.

* The iter_merge() recipe is a bit sprawling and IMHO not a useful teaching aid (a primary goal of the itertools recipe), so it doesn't have much documentation value.  Instead, consider submitting this to the "more-itertools" project which is referenced by the docs.  There it would likely be used directly rather than as a recipe.

* Over the next few days, I'll take a closer look at the binary tree approach versus a heap approach.  I like how it intrinsically avoids the overhead of the 4-tuple while maintaining order, but I want to test how well it does with slow comparison keys (i.e. does the total number of comparisons get worse).  Am not sure whether the tree needs to be rebalanced as some of the input streams become exhausted.  I have a vague recollection that Knuth's TAOCP deemed the heap approach to be algorithmically superior, but need to dig out my books to be verify that.

* We could just build-out the current heapq.merge() code to include a special case for only two input iterables.  That would speed-up a common case by avoiding the overhead of tracking a 4-tuple for each element.  If you want to submit a PR for that, I would be happy to take a close look and verify the timings.

* It might be interesting to time the pure python variants against "sorted(chain(a, b, c, d, e))".  It isn't quite an apples-to-apples comparison because the latter pulls all the data in memory and it doesn't the runs pre-segregated, but it might give a sense of how well merge() could do if we decided to go gonzo on optimizations.  Until now, we've been satisfied with letting the heap structure minimize the number of comparisons and there hasn't been a focus on reducing the constant factor overhead.
msg358968 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2019-12-28 22:09
PR 17729 is a C implementation of a non-recursive "flattening" of the the recursive-lazy-mergesort algorithm into a tournament whose state is a tree of losers of comparisons.
msg364160 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-03-14 13:27
This is a large change. And making heaqq.merge a class looks unrelated to the declared goal. I afraid it may add significant overhead.

Since this is an optimization, could you please provide any benchmarks which we can use to check the performance boost?
msg364198 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-03-14 21:24
First, as I posted at, there is a theoretical advantage of fewer comparisons in all cases, and the new algorithm would be especially dominant when one iterable keeps winning. (I'm given to understand that "lists very often do have exploitable partial order in real life" ;-)

> making heaqq.merge a class looks unrelated to the declared goal.

Is there a better way to implement a C accelerator for a Python function that returns a generator? I figured it would be better to have the pure-python implementation match the C implementation.

As for the timing, I'm running Windows and pyperf gives a "WARNING: unable to increase process priority", and a "WARNING: no operation available for your platform" when I run "pyperf system tune", but I'm still able to get the following results:

.\python.bat -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Master: Mean +- std dev: 88.0 ms +- 10.3 ms
    This PR: Mean +- std dev: 28.2 ms +- 1.0 ms

The above I'll call "merging 20 of size 10_000." 
For "merging 2 of size 100_000", I get:
    Master: Mean +- std dev: 34.4 ms +- 3.2 ms
    This PR: Mean +- std dev: 10.6 ms +- 0.7 ms

Merging 10_000 of size 100:

    Master: Mean +- std dev: 1.56 sec +- 0.11 sec
    This PR: Mean +- std dev: 628 ms +- 22 ms
msg364201 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-03-14 22:02
Sorry, I did not notice that there is a C implementation in PR 18427. Changes in the Python implementations are so larger that I though this is the goal of the PR.

Often the most clear and efficient way to implement an iterator in Python is to write a generator function. In C you need to write a class with the __next__ method, but Python has better way.

I have tested your first example with the Python implementation and got 93.9 msec on master vs 314 msec with PR 18427 applied. It is expected that the C implementation is faster than the Python implementation, but was there a need to make the Python implementation 3 times slower?
msg364205 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-03-14 22:31
The existing Python implementation is benefiting from the C accelerators for heapify and heapreplace. When forcing pure python using, I get these results:

.\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(1_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Master: Mean +- std dev: 73.1 ms +- 8.4 ms
    PR: Mean +- std dev: 63.7 ms +- 7.8 ms

I think I remember getting similarly slightly-better results out of PyPy, but I will double-check.

So while the existing "mixed-c-and-Python" implementation would be deleted, the "true pure-Python" implementation would get faster and the pure-c implementation would be created much faster. Is that an acceptable trade-off?

Regarding using a generator for the Python implementation, is it okay if type(heapq.merge) gives <class 'function'> for the Python implementation but <class 'type'> for the c implementation? How much seemingly-irrelevant variation like this is acceptable between the APIs?
msg364260 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-03-15 20:53
My suspicion was confirmed about PyPy (My PyPy here is Python 3.6.1 (784b254d6699, Apr 16 2019, 12:10:48) [PyPy 7.1.1-beta0 with MSC v.1910 32 bit] on win32). In what follows, "" had exactly the `class merge` Python implementation from PR 18427.

pypy -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(random() for j in ra
nge(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 191 ms +- 3 ms

pypy -m pyperf timeit -s "from random import random; from heapq2 import merge; from collections import deque; iters = [sorted(random() for j in r
ange(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 69.3 ms +- 2.2 ms

Another test: PyPy merging 2 iterables of size 100_000:

    heapq: Mean +- std dev: 91.4 ms +- 2.4 ms
    heapq2: Mean +- std dev: 52.4 ms +- 2.1 ms

PyPy merging 10_000 iterables of size 100:

    heapq: Mean +- std dev: 2.74 sec +- 0.07 sec
    heapq2: Mean +- std dev: 781 ms +- 19 ms
msg364279 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-03-16 03:12
Serhiy, I've got this one.  It will still take a little to get to it, but I've already invested work in it.  The code idea is plausible but I want to think it through for both CPython and PyPy.
msg364297 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-03-16 08:20
Are there any algorithmic optimizations in the Python version or it was just rewritten from a generator function to a class? If yes, how hard to keep the functional design?
msg368934 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-05-15 10:21
As Serhiy suggested, keeping the algorithm but moving the Python implementation to be a generator again (as I recently changed in PR 18427) gives another performance boost (although this unrolling is many lines of code).

Timing the C implementation:
    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from heapq import merge;  iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

        On Master: Mean +- std dev: 66.9 ms +- 0.6 ms
        With PR:   Mean +- std dev: 17.3 ms +- 0.9 ms
Timing the Python Implementation in CPython:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

        On Master: Mean +- std dev: 385 ms +- 3 ms
        With PR:   Mean +- std dev: 250 ms +- 2 ms

Timing PyPy:

    pypy -m pyperf timeit -s "... from heapq import merge; ..." ...
    pypy -m pyperf timeit -s "... from heapq2 import merge; ..." ...

        Testing on PyPy 7.1.1: Mean +- std dev: 142 ms +- 2 ms
        This implementation:   Mean +- std dev: 38.0 ms +- 1.2 ms


Another positive for this proposal I just discovered is that object.__eq__ no longer needs to be called at each comparison:

    from heapq import merge
    from collections import deque

    class Int(int):
        lt = eq = 0
        def __lt__(self, other):
   += 1
            return int.__lt__(self, other)

        def __eq__(self, other):
            __class__.eq += 1
            return int.__eq__(self, other)

    def comparisons(iterables): = Int.eq = 0
        deque(merge(*iterables), maxlen=0)
        return, Int.eq

    no_overlap = comparisons(
        # (0..999), (1_000..1_999), (2_000..2_999), ...
        map(Int, range(x, x+1_000))
        for x in range(0, 16_000, 1_000)

    interleaved = comparisons(
        # (0,16,32,...), (1,17,33,...), (2,18,34,...), ...
        map(Int, range(x, 16_000, 16))
        for x in range(16)

    print("No overlap: {:,} lt; {:,} eq".format(*no_overlap))
    print("Interleaved: {:,} lt; {:,} eq".format(*interleaved))


    No overlap: 65,004 lt; 65,004 eq
    Interleaved: 64,004 lt; 64,004 eq


    No overlap: 32,000 lt; 0 eq
    Interleaved: 63,968 lt; 0 eq

This comes from the way that tuples are compared by scanning item-wise for equality before comparing the first discrepancy. Using the positional information in the tree with the logic 

    yield (right if right < left else left)

requires only one rich comparison, while breaking ties with the stored index and the logic 
    yield (b if (b, b_index) < (a, a_index) else a)

requires two arbitrary rich comparisons (a != b, then a < b). This can be somewhat fixed with the logic

    if a_index < b_index:
        yield (b if b < a else a)
        yield (a if a < b else b)

...but this is another branch in the innermost loop.


I also played with a "Tree of Losers" method[1] here[2], and on the same benchmarks I got:

    CPython (C): Mean +- std dev: 22.7 ms +- 0.2 ms
        (slower than in PR 18427)
    CPython (Pure Python): Mean +- std dev: 197 ms +- 9 ms
        (faster -- likely because of fewer int operations)
    PyPy: Mean +- std dev: 38.8 ms +- 0.8 ms
        (about the same)

Maybe some more optimizations could be made to that code.

The two approaches should be "isomorphic" in some sense: both should do the same number of comparisons on the same data. But I'll emphasize the differences:

Tree of Losers:
    - Data structure: Loser Tree (does not satisfy heap invariant)
    - Nodes: (item, index) pairs
    - Tree has n nodes
    - To get the next item: do exchanges from leaf to root, swapping as you find better items
    - Comparisons: branching logic based on index comparison

PR 18427 "Tree of winners":
    - Data structure: binary heap (always satisfies heap invariant)
    - Nodes: just the items
    - Tree has 2n-1 nodes (more eagerly evaluated)
    - To get the next item: replace parent with better child, root to leaf
    - Comparisons: right < left (simple)

Personally, I find the Tree of Winners approach more intuitive, but the Tree of Losers approach seems to be the one that comes up more in the literature.

msg368989 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-05-16 00:06
The nested generators approach looks promising. I hope you keep working on that :-)

I'm not too keen on PR 18427 because it is a massive explosion in code volume and complexity.

With some continued effort, I expect we'll get to something much simpler and more maintainable.  The existing code already performs well for typical applications, so there is no need to "go gonzo" and throw a massive wall of code at the problem.  The ensuing maintenance costs would be too high.

To save time and effort, please hold-off on a C implementation until we've agreed to a pure python version of the algorithm.

For timings, try using strings, tuples, or dataclass instances.  Timing integer inputs isn't representative of how merge() is used and it tends to exaggerate improvements.  For interesting merges, the dominant cost is the comparison step.

By switching to a binary tree arrangement, the payoff we're aiming for is to avoid the double comparison in the tuple inequality logic — "('a',1)<('b',2)" tests both 'a'=='b' and 'a'<='b'.  I suggest aiming for the simplest possible code that avoids the double test.
msg369012 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-05-16 05:36
The attached should be much less ugly and still somewhat performant.

It should be the same algorithm as that PR, just written recursively rather than iteratively.

I got some text files from and tried merging them line-by-line:

py -3.9 -m pyperf timeit -s "from heapq import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    Mean +- std dev: 391 ms +- 9 ms

py -3.9 -m pyperf timeit -s "from recursive_merge import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    Mean +- std dev: 262 ms +- 9 ms

Perhaps that's a more real-world benchmark.
msg369064 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-05-16 19:59
For comparison, I'm attaching an iterative version that manipulates the tree nodes with all local variables.  Try it out and see what you think.
msg369115 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-05-17 12:15
It seems to me that the code sprawl mostly comes from the separate handling of the four keyed/unkeyed and forward/reverse cases, which as far as I can tell requires a branch in the innermost loop if not unrolled into separate cases. I think is about as concise as possible while still efficiently handling all four cases.

Is there any issue with recursion in this application? Even if there are 2**64-1 iterables, this would only mean 64 nested next() calls, which should be within the stack on most machines, no?

I did the following benchmark:

py -3.9 -m pyperf timeit -s "from [FILENAME] import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"          Mean +- std dev: 773 ms +- 16 ms    Mean +- std dev: 665 ms +- 42 ms             Mean +- std dev: 470 ms +- 31 ms
    Existing heapq:        Mean +- std dev: 313 ms +- 2 ms    Mean +- std dev: 266 ms +- 3 ms

I can look at some more diverse benchmarks (types, iterable length imbalance, lengths of an iterator's win-streak, etc.) soon.
msg369145 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-05-17 18:49
Am leaning toward the iterative approach in because it most directly implements the core concept of storing the data in a binary tree.

Merits: no recursion, no nested generators, avoids the clutter of left-empty and right-empty checks, easy to understand, simple and fast loops, all local variables or instance variables, no calls to memory allocator after then initial tree construction, runs well on PyPy, and easily Cythonized.
msg369198 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-05-18 09:25
I mostly like too, especially the dynamic reduction of the tree.

However, it looks like ``list(merge([2],[1],[1]))`` currently fails, and I think what's missing is the following in the sibling-promotion:

+             if sibling.left is not None:
+                 sibling.left.parent = sibling.right.parent = parent

Also for what it's worth, I think both appearances of "c1 if c1.value < c2.value else c2" should become "c2 if c2.value < c1.value else c1" for stability. 

I've attached, which is similar to, but replaces the "while node.left is not None: node = node.source" traversal with a single "node = node.leaf" call and adds a fast yield-from for the last iterator. I don't think it's much more complex either.
msg369559 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-05-22 07:27 employs the same strategy as, but uses lists as the nodes of the tree rather than using Node instances. It also eliminates the recursion of treeify, and adds (with neither much of a performance hit nor much code duplication) support for ``key=some_func`` and/or ``reverse=True`` keyword arguments, so it now passes the test_heapq suite.
msg369667 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-05-22 23:13
Thanks.  I'll look at this shortly.  We're getting much closer.

Just so that I don't lose other work I've done, am uploading with in-line iterative tree construction of the initial tree.
msg370176 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-05-28 08:27 is my favorite so far. It still handles key and reverse,
but using instance attributes instead of the list indices I tried before.
It does this by only advancing the "key" and "leaf" attributes up toward
the root (where the key is just the value if key is None), while the value
is stored only in the leaf. Since the "value" attribute is no longer used
except for at the leaves, we can pack a leaf's value into its left slot
and reduce the number of slots.

This seems to run very well on pypy and it looks like it would be pretty
receptive to a c implementation, which I would be happy to work on.
msg370413 - (view) Author: Dennis Sweeney (Dennis Sweeney) * Date: 2020-05-31 05:17
PR 20550 uses a linked structure like what we've been talking about.
Date User Action Args
2020-05-31 14:08:54Jim Fasarakis-Hilliardsetnosy: + Jim Fasarakis-Hilliard
2020-05-31 05:17:02Dennis Sweeneysetmessages: + msg370413
2020-05-31 05:08:43Dennis Sweeneysetpull_requests: + pull_request19794
2020-05-28 08:27:16Dennis Sweeneysetfiles: +

versions: + Python 3.10, - Python 3.9
messages: + msg370176
title: Possible performance improvement for heaqq.merge() -> Possible performance improvement for heapq.merge()
2020-05-22 23:13:41rhettingersetfiles: +

messages: + msg369667
2020-05-22 07:27:01Dennis Sweeneysetfiles: +

messages: + msg369559
2020-05-18 09:25:04Dennis Sweeneysetfiles: +

messages: + msg369198
2020-05-17 18:49:03rhettingersetmessages: + msg369145
2020-05-17 12:26:51Dennis Sweeneysetfiles: +
2020-05-17 12:26:25Dennis Sweeneysetfiles: -
2020-05-17 12:15:45Dennis Sweeneysetmessages: + msg369115
2020-05-17 12:07:59Dennis Sweeneysetfiles: +
2020-05-17 12:06:49Dennis Sweeneysetfiles: +
2020-05-17 12:05:47Dennis Sweeneysetfiles: +
2020-05-17 12:04:15Dennis Sweeneysetfiles: -
2020-05-17 12:04:09Dennis Sweeneysetfiles: -
2020-05-17 12:04:02Dennis Sweeneysetfiles: -
2020-05-17 03:57:27rhettingersetfiles: +
2020-05-17 03:57:05rhettingersetfiles: -
2020-05-16 20:36:16rhettingersetfiles: +
2020-05-16 20:36:00rhettingersetfiles: -
2020-05-16 19:59:49rhettingersetfiles: +

messages: + msg369064
2020-05-16 05:36:09Dennis Sweeneysetfiles: +

messages: + msg369012
2020-05-16 00:07:00rhettingersetmessages: + msg368989
2020-05-15 10:21:37Dennis Sweeneysetmessages: + msg368934
2020-03-16 08:20:35serhiy.storchakasetmessages: + msg364297
2020-03-16 03:12:04rhettingersetmessages: + msg364279
2020-03-15 20:53:42Dennis Sweeneysetmessages: + msg364260
2020-03-14 22:31:23Dennis Sweeneysetmessages: + msg364205
2020-03-14 22:02:45serhiy.storchakasetmessages: + msg364201
2020-03-14 21:24:32Dennis Sweeneysetmessages: + msg364198
2020-03-14 13:27:41serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg364160
2020-02-10 01:28:45Dennis Sweeneysetpull_requests: + pull_request17803
2019-12-28 22:09:05Dennis Sweeneysetmessages: + msg358968
2019-12-28 10:18:50Dennis Sweeneysetpull_requests: + pull_request17174
2019-12-01 03:59:27bbaylessetnosy: + bbayles
2019-12-01 03:05:25Dennis Sweeneysetkeywords: + patch
stage: patch review
pull_requests: + pull_request16903
2019-11-30 22:13:46rhettingersetnosy: + tim.peters
2019-11-30 22:10:36rhettingersettitle: Iterable merge performance and inclusion in itertools -> Possible performance improvement for heaqq.merge()
2019-11-30 22:10:16rhettingersetmessages: + msg357665
2019-11-30 21:57:30Dennis Sweeneysetmessages: + msg357664
2019-11-30 21:07:09rhettingersetassignee: rhettinger
versions: - Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8
2019-11-29 03:06:21Dennis Sweeneysetfiles: +

messages: + msg357632
2019-11-29 03:01:55xtreaksetnosy: + rhettinger
2019-11-29 02:44:21Dennis Sweeneycreate