As Serhiy suggested, keeping the algorithm but restoring the Python implementation to a generator (as I recently did in PR 18427) gives another performance boost, though the unrolling takes many lines of code.
Timing the C implementation:
.\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from heapq import merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"
On Master: Mean +- std dev: 66.9 ms +- 0.6 ms
With PR: Mean +- std dev: 17.3 ms +- 0.9 ms
Timing the Python implementation in CPython:
.\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"
On Master: Mean +- std dev: 385 ms +- 3 ms
With PR: Mean +- std dev: 250 ms +- 2 ms
Timing PyPy:
pypy -m pyperf timeit -s "... from heapq import merge; ..." ...
versus
pypy -m pyperf timeit -s "... from heapq2 import merge; ..." ...
Testing on PyPy 7.1.1: Mean +- std dev: 142 ms +- 2 ms
This implementation: Mean +- std dev: 38.0 ms +- 1.2 ms
=============================================================================
Another point in favor of this proposal, which I just discovered: object.__eq__ no longer needs to be called at each comparison:
from heapq import merge
from collections import deque

class Int(int):
    lt = eq = 0
    def __lt__(self, other):
        __class__.lt += 1
        return int.__lt__(self, other)
    def __eq__(self, other):
        __class__.eq += 1
        return int.__eq__(self, other)

def comparisons(iterables):
    Int.lt = Int.eq = 0
    deque(merge(*iterables), maxlen=0)
    return Int.lt, Int.eq

no_overlap = comparisons(
    # (0..999), (1_000..1_999), (2_000..2_999), ...
    map(Int, range(x, x+1_000))
    for x in range(0, 16_000, 1_000)
)

interleaved = comparisons(
    # (0,16,32,...), (1,17,33,...), (2,18,34,...), ...
    map(Int, range(x, 16_000, 16))
    for x in range(16)
)

print("No overlap: {:,} lt; {:,} eq".format(*no_overlap))
print("Interleaved: {:,} lt; {:,} eq".format(*interleaved))
Before:
No overlap: 65,004 lt; 65,004 eq
Interleaved: 64,004 lt; 64,004 eq
After:
No overlap: 32,000 lt; 0 eq
Interleaved: 63,968 lt; 0 eq
This comes from the way tuples are compared: items are scanned pairwise with == to find the first discrepancy, and then that pair is compared with <. Using the positional information in the tree with the logic
yield (right if right < left else left)
requires only one rich comparison, while breaking ties with the stored index and the logic
yield (b if (b, b_index) < (a, a_index) else a)
requires two arbitrary rich comparisons (first a == b, then a < b). This can be partially mitigated with the logic
if a_index < b_index:
    yield (b if b < a else a)
else:
    yield (a if a < b else b)
...but this is another branch in the innermost loop.
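The extra __eq__ calls are easy to demonstrate directly. Here is a minimal sketch (the Tracked class is hypothetical, analogous to Int above) showing that a bare b < a costs one rich comparison, while the tuple form costs an equality test plus a less-than test:

```python
# Hypothetical counting int, like the Int class above.
class Tracked(int):
    lt = eq = 0
    def __lt__(self, other):
        Tracked.lt += 1
        return int.__lt__(self, other)
    def __eq__(self, other):
        Tracked.eq += 1
        return int.__eq__(self, other)

a, b = Tracked(1), Tracked(2)

b < a                      # direct: one __lt__ call, no __eq__
assert (Tracked.lt, Tracked.eq) == (1, 0)

Tracked.lt = Tracked.eq = 0
# Tuple comparison scans item-wise with == to find the first
# discrepancy, then compares that pair with <.
(b, 1) < (a, 0)            # __eq__ first, then __lt__
assert (Tracked.lt, Tracked.eq) == (1, 1)
```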
==============================================================================
I also played with a "Tree of Losers" method[1] here[2], and on the same benchmarks I got:
CPython (C): Mean +- std dev: 22.7 ms +- 0.2 ms
(slower than in PR 18427)
CPython (Pure Python): Mean +- std dev: 197 ms +- 9 ms
(faster -- likely because of fewer int operations)
PyPy: Mean +- std dev: 38.8 ms +- 0.8 ms
(about the same)
Maybe some more optimizations could be made to that code.
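For reference, a compact loser-tree merge can be sketched as follows. This is only an illustration, not the code in [2]: it uses (0, item) tuple keys purely as an exhaustion sentinel, which reintroduces the tuple-comparison cost discussed above, so a real implementation would handle exhaustion differently. Leaves sit at array indices n..2n-1, internal nodes 1..n-1 each store the loser of their match, and producing each item replays one leaf-to-root path:

```python
def kmerge(*iterables):
    """K-way merge via a tree of losers (illustrative sketch)."""
    streams = [iter(it) for it in iterables]
    n = len(streams)
    if n == 0:
        return
    if n == 1:
        yield from streams[0]
        return
    EXHAUSTED = (1,)              # sorts after every live key (0, item)
    keys = [None] * n
    def refill(j):
        try:
            keys[j] = (0, next(streams[j]))
        except StopIteration:
            keys[j] = EXHAUSTED
    for j in range(n):
        refill(j)
    # Leaf j lives at array index n + j; internal node i has
    # children 2i and 2i+1.  First compute winners bottom-up,
    # recording the loser of each match in tree[i].
    size = 2 * n
    winners = [0] * size
    tree = [0] * n                # tree[0] unused
    for i in range(size - 1, n - 1, -1):
        winners[i] = i - n        # leaves "win" trivially
    for i in range(n - 1, 0, -1):
        a, b = winners[2 * i], winners[2 * i + 1]
        if keys[b] < keys[a]:
            winners[i], tree[i] = b, a
        else:
            winners[i], tree[i] = a, b
    w = winners[1]                # overall winner's leaf index
    while keys[w] is not EXHAUSTED:
        yield keys[w][1]
        refill(w)
        # Replay the matches on the path from leaf w to the root,
        # swapping whenever a stored loser beats the current winner.
        i = (n + w) // 2
        while i >= 1:
            if keys[tree[i]] < keys[w]:
                tree[i], w = w, tree[i]
            i //= 2
```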
The two approaches should be "isomorphic" in some sense: both should do the same number of comparisons on the same data. But I'll emphasize the differences:
Tree of Losers:
- Data structure: Loser Tree (does not satisfy heap invariant)
- Nodes: (item, index) pairs
- Tree has n nodes
- To get the next item: do exchanges from leaf to root, swapping as you find better items
- Comparisons: branching logic based on index comparison
PR 18427 "Tree of winners":
- Data structure: binary heap (always satisfies heap invariant)
- Nodes: just the items
- Tree has 2n-1 nodes (more eagerly evaluated)
- To get the next item: replace parent with better child, root to leaf
- Comparisons: right < left (simple)
Personally, I find the Tree of Winners approach more intuitive, but the Tree of Losers approach seems to be the one that comes up more in the literature.
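To make the structural contrast concrete, here is one small way to get tree-of-winners behavior in pure Python: a balanced tree of two-way merge generators, so each output item costs one right-versus-left comparison per level. The names merge2 and treemerge are hypothetical, and this is a sketch of the idea rather than the unrolled code from PR 18427:

```python
def merge2(left, right):
    """Two-way merge of two sorted iterables, stable and lazy."""
    left, right = iter(left), iter(right)
    try:
        x = next(left)
    except StopIteration:
        yield from right
        return
    try:
        y = next(right)
    except StopIteration:
        yield x
        yield from left
        return
    while True:
        if y < x:                 # one rich comparison per output item
            yield y
            try:
                y = next(right)
            except StopIteration:
                yield x
                yield from left
                return
        else:                     # ties go left, keeping the merge stable
            yield x
            try:
                x = next(left)
            except StopIteration:
                yield y
                yield from right
                return

def treemerge(*iterables):
    """Pair streams into two-way merges until one stream remains."""
    streams = [iter(it) for it in iterables]
    if not streams:
        return iter(())
    while len(streams) > 1:
        streams = [merge2(streams[i], streams[i + 1])
                   if i + 1 < len(streams) else streams[i]
                   for i in range(0, len(streams), 2)]
    return streams[0]
```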
[1] https://en.wikipedia.org/wiki/K-way_merge_algorithm
[2] https://github.com/sweeneyde/cpython/tree/replacement-selection