Notes in C code
===============

A list can be heapified in-place in O(n) time with a series of siftup()
operations that combine many small heaps into progressively larger heaps.

The nodes positioned at and after n//2 start out as heaps because they have
no children.  Starting with that pool, we build bigger heaps using the rule
that siftup() puts any node in the heap condition as long as its children
are already heaps.

Visiting all nodes with children that are already heaps is traditionally
done by sifting all nodes positioned before n//2 in reverse order:

    m = len(x) // 2         # first childless node
    for i in reversed(range(m)):
        siftup(x, i)

However, the traditional traversal order has poor cache performance because
siftup(x, i) occurs long after the two children (at 2*i+1 and 2*i+2) have
been heapified.

A more cache friendly technique is to heapify a node immediately after its
children have been heapified:

    def heapify(x):
        m = len(x) // 2                        # first childless node
        leftmost = _keep_top_bit(m + 1) - 1    # leftmost node in row of m
        for i in reversed(range(m >> 1, leftmost)):
            siftup(x, i)
            while i & 1:
                i >>= 1
                siftup(x, i)
        for i in reversed(range(leftmost, m)):
            siftup(x, i)
            while i & 1:
                i >>= 1
                siftup(x, i)

The for-loops traverse right-to-left over the nodes spanning arr[m//2 : m].
Those nodes are situated immediately above the childless nodes located at
arr[m:].  When arriving at a node i, the children are already heaps and the
nodes to its right on the same row are "piled-up", meaning that all parent
nodes above and to the right are also heaps.

Before visiting i:

                       p
                      / \
                    p     p
                   / \   / \
            i  p  p   p p   p
           / \
          h   h

After visiting a node i, it too becomes heapified and piled-up:

                       p
                      / \
            p*      p     p
           /  \    / \   / \
          p*   p  p   p p   p
         / \
        h   h

The order of sifts can be visualized as starting at the lower right node
and making successively larger lower-right triangles of heapified nodes:

                      .
                     / \
                    /   \
                   /     \
                  / piled \   <-- i
                 /  .....
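The heapify() shown above relies on siftup() and _keep_top_bit(), which are
defined elsewhere in the module.  For reference, here is a self-contained
sketch of the same traversal; the two helpers are illustrative textbook
stand-ins (a min-heap ordering is assumed, as in heapq), not the module's
actual implementations:

```python
def _siftup(x, i):
    # Textbook sift-down for a min-heap: swap x[i] with its smaller child
    # until the subtree rooted at i satisfies the heap condition.
    # Assumes both child subtrees are already heaps.
    n = len(x)
    while True:
        child = 2 * i + 1
        if child >= n:
            break
        if child + 1 < n and x[child + 1] < x[child]:
            child += 1                      # pick the smaller child
        if x[i] <= x[child]:
            break
        x[i], x[child] = x[child], x[i]
        i = child

def _keep_top_bit(n):
    # Clear all bits below the highest set bit: 13 (0b1101) -> 8 (0b1000).
    return 1 << (n.bit_length() - 1)

def heapify(x):
    m = len(x) // 2                         # first childless node
    if m == 0:
        return                              # 0- or 1-element list: already a heap
    leftmost = _keep_top_bit(m + 1) - 1     # leftmost node in the row of m
    # Sweep right-to-left over the frontier arr[m//2 : m].  Whenever the
    # node just sifted is a left child (odd index), its parent's children
    # are now both heaps, so sift the parent too ("pile up").
    for i in reversed(range(m >> 1, leftmost)):
        _siftup(x, i)
        while i & 1:
            i >>= 1
            _siftup(x, i)
    for i in reversed(range(leftmost, m)):
        _siftup(x, i)
        while i & 1:
            i >>= 1
            _siftup(x, i)
```

After heapify(a), a[0] is the smallest element and every parent compares
less than or equal to both of its children.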
Both the traditional traversal order and the new pile-up order do the same
number of comparisons and produce exactly the same heap.  The only
difference is in choosing to heapify a node when its child nodes are most
likely to be in cache.

Accordingly, the performance benefits only begin at heap sizes that
overflow the various layers of cache (for example, a 32 KB L1 data cache
cannot hold 10,000 28-byte integers, much less 10,000 tuples of length
one).

For example, here are comparative performance measurements taken on one
machine:

    Case                      n    Heap Size   Percent Improvement
    ---------------   ---------   ----------   -------------------
    L1 < size < L2       10,000       196 KB                  5.6%
    L2 < size < L3      100,000      1.96 MB                 14.0%
    size > L3 cache   1,000,000        28 MB                 41.8%

(Test machine was a 2.6 GHz Haswell with a 32 KB L1 data cache, a 256 KB L2
cache, and a 6 MB L3 cache.  Timings were made with a 64-bit build of
Python on OS X 10.9.  The input was a list of tuples, each containing a
single random integer.)
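The working-set sizes in the table can be sanity-checked with
sys.getsizeof.  The rough estimate below counts only the list's pointer
array plus the tuple objects (ignoring the integers inside, so it
undercounts), yet already shows that 10,000 one-element tuples overflow a
32 KB L1 data cache:

```python
import sys

n = 10_000
data = [(i,) for i in range(n)]   # one-element tuples, as in the benchmark

# Rough working-set estimate: the list's pointer array plus each
# tuple object.  The integer payloads are not counted.
size = sys.getsizeof(data) + sum(map(sys.getsizeof, data))
print(f"~{size // 1024} KB for {n:,} elements")
```

Exact numbers vary by Python version and build, but on any 64-bit CPython
the total is hundreds of kilobytes, an order of magnitude beyond L1.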