Author Dennis Sweeney
Recipients Dennis Sweeney, bbayles, rhettinger, serhiy.storchaka, tim.peters
Date 2020-05-15.10:21:36
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
As Serhiy suggested, keeping the algorithm but moving the Python implementation to be a generator again (as I recently changed in PR 18427) gives another performance boost (although this unrolling is many lines of code).

Timing the C implementation:
    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from heapq import merge;  iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

        On Master: Mean +- std dev: 66.9 ms +- 0.6 ms
        With PR:   Mean +- std dev: 17.3 ms +- 0.9 ms
Timing the Python Implementation in CPython:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

        On Master: Mean +- std dev: 385 ms +- 3 ms
        With PR:   Mean +- std dev: 250 ms +- 2 ms

Timing PyPy:

    pypy -m pyperf timeit -s "... from heapq import merge; ..." ...
    pypy -m pyperf timeit -s "... from heapq2 import merge; ..." ...

        Testing on PyPy 7.1.1: Mean +- std dev: 142 ms +- 2 ms
        This implementation:   Mean +- std dev: 38.0 ms +- 1.2 ms


Another positive for this proposal I just discovered is that object.__eq__ no longer needs to be called at each comparison:

    from heapq import merge
    from collections import deque

    class Int(int):
        lt = eq = 0
        def __lt__(self, other):
   += 1
            return int.__lt__(self, other)

        def __eq__(self, other):
            __class__.eq += 1
            return int.__eq__(self, other)

    def comparisons(iterables): = Int.eq = 0
        deque(merge(*iterables), maxlen=0)
        return, Int.eq

    no_overlap = comparisons(
        # (0..999), (1_000..1_999), (2_000..2_999), ...
        map(Int, range(x, x+1_000))
        for x in range(0, 16_000, 1_000)

    interleaved = comparisons(
        # (0,16,32,...), (1,17,33,...), (2,18,34,...), ...
        map(Int, range(x, 16_000, 16))
        for x in range(16)

    print("No overlap: {:,} lt; {:,} eq".format(*no_overlap))
    print("Interleaved: {:,} lt; {:,} eq".format(*interleaved))


    No overlap: 65,004 lt; 65,004 eq
    Interleaved: 64,004 lt; 64,004 eq


    No overlap: 32,000 lt; 0 eq
    Interleaved: 63,968 lt; 0 eq

This comes from the way that tuples are compared by scanning item-wise for equality before comparing the first discrepancy. Using the positional information in the tree with the logic 

    yield (right if right < left else left)

requires only one rich comparison, while breaking ties with the stored index and the logic 
    yield (b if (b, b_index) < (a, a_index) else a)

requires two arbitrary rich comparisons (a != b, then a < b). This can be somewhat fixed with the logic

    if a_index < b_index:
        yield (b if b < a else a)
        yield (a if a < b else b)

...but this is another branch in the innermost loop.


I also played with a "Tree of Losers" method[1] here[2], and on the same benchmarks I got:

    CPython (C): Mean +- std dev: 22.7 ms +- 0.2 ms
        (slower than in PR 18427)
    CPython (Pure Python): Mean +- std dev: 197 ms +- 9 ms
        (faster -- likely because of fewer int operations)
    PyPy: Mean +- std dev: 38.8 ms +- 0.8 ms
        (about the same)

Maybe some more optimizations could be made to that code.

The two approaches should be "isomorphic" in some sense: both should do the same number of comparisons on the same data. But I'll emphasize the differences:

Tree of Losers:
    - Data structure: Loser Tree (does not satisfy heap invariant)
    - Nodes: (item, index) pairs
    - Tree has n nodes
    - To get the next item: do exchanges from leaf to root, swapping as you find better items
    - Comparisons: branching logic based on index comparison

PR 18427 "Tree of winners":
    - Data structure: binary heap (always satisfies heap invariant)
    - Nodes: just the items
    - Tree has 2n-1 nodes (more eagerly evaluated)
    - To get the next item: replace parent with better child, root to leaf
    - Comparisons: right < left (simple)

Personally, I find the Tree of Winners approach more intuitive, but the Tree of Losers approach seems to be the one that comes up more in the literature.

Date User Action Args
2020-05-15 10:21:37Dennis Sweeneysetrecipients: + Dennis Sweeney, tim.peters, rhettinger, serhiy.storchaka, bbayles
2020-05-15 10:21:37Dennis Sweeneysetmessageid: <>
2020-05-15 10:21:37Dennis Sweeneylinkissue38938 messages
2020-05-15 10:21:36Dennis Sweeneycreate