classification
Title: Optimize heapify for better cache utililzation
Type: performance Stage:
Components: Extension Modules Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: python-dev, rhettinger
Priority: normal Keywords: patch

Created on 2015-05-10 04:52 by rhettinger, last changed 2015-05-11 17:19 by rhettinger. This issue is now closed.

Files
File name Uploaded Description Edit
wordy_explanation.txt rhettinger, 2015-05-10 04:58 More detailed write-up with ASCII diagrams
better_heapify.diff rhettinger, 2015-05-10 05:04 Patch for cache friendly heapify() review
Messages (2)
msg242850 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-10 04:52
Heapify() is implemented with a series of siftup() operations that aggregate small heaps into bigger heaps.  

The current algorithm builds all the smallest heaps first, then makes all of the next largest size, and so on.  This is not cache friendly because the aggregation step operates on smaller heaps built long before.  The new algorithm performs the aggregation step immediately after its child heaps are built (while they are still likely to be in cache).

The overall algorithm is the same (children built before parents).  The number of comparisons is the same.   And the resulting heap is identical.  The only difference is performing work while the inputs are still in cache rather than waiting until all heaps at the current size have been built.

For small heaps that already fit entirely in L1 cache, there is no benefit, so we stick with the current version which has less branching. For larger heaps, we switch to the new order.

The timings and benefits depend on the number and sizes of the objects being heapified as well as the relative speed of L1 to L2 to L3 to DRAM.

------------------------------------------------------------------

For those who are interested, here timings for heapifying shuffled lists of various sizes.  The elements are tuples of length 1 that contain distinct integers.

The first row has the time to copy to the data.  It should be substracted from the timings on the next two rows which time the new algorithm versus the old algorithm.

The benefits don't start to show up until after N is over 1000 (depending on the input type, the breakeven point seems to fall somewhere between 1200 and 2500 on my machine).

N = 100
[2.9262970201671124e-05, 2.9265997000038624e-05, 2.9325950890779495e-05] tupledata[:]
[0.0006274560000747442, 0.0006340609397739172, 0.0006361680570989847] heapify(tupledata[:])
[0.0006139189936220646, 0.0006186790997162461, 0.000632670009508729] heapify_old(tupledata[:])

[2.8867041692137718e-05, 2.8883921913802624e-05, 2.896797377616167e-05] tupledata[:]
[0.000608008005656302, 0.0006171419518068433, 0.0006187589606270194] heapify(tupledata[:])
[0.0006224410608410835, 0.000638791942037642, 0.0006388520123437047] heapify_old(tupledata[:])

[2.89019662886858e-05, 2.8969021514058113e-05, 2.8973910957574844e-05] tupledata[:]
[0.0006031119264662266, 0.0006048450013622642, 0.0006136660231277347] heapify(tupledata[:])
[0.000612352043390274, 0.0006144039798527956, 0.0006217029877007008] heapify_old(tupledata[:])


N = 1000
[0.0002854769118130207, 0.0002856890205293894, 0.00028590403962880373] tupledata[:]
[0.006836145068518817, 0.006866019102744758, 0.006885501905344427] heapify(tupledata[:])
[0.0067316299537196755, 0.006792359985411167, 0.0067987809889018536] heapify_old(tupledata[:])

[0.00028532894793897867, 0.0002853329060599208, 0.00028538203332573175] tupledata[:]
[0.006822419003583491, 0.0068415619898587465, 0.006888034055009484] heapify(tupledata[:])
[0.006734892027452588, 0.006814536056481302, 0.0068227669689804316] heapify_old(tupledata[:])

[0.00028527993708848953, 0.0002854960039258003, 0.0002858199877664447] tupledata[:]
[0.006787727936170995, 0.0067988099763169885, 0.006827510078437626] heapify(tupledata[:])
[0.0067258820636197925, 0.006815586006268859, 0.006871008081361651] heapify_old(tupledata[:])


N = 10000
[0.004415847011841834, 0.004417525022290647, 0.0044295149855315685] tupledata[:]
[0.07748138904571533, 0.07753941905684769, 0.07756883592810482] heapify(tupledata[:])
[0.08400217199232429, 0.08420385408680886, 0.08428021904546767] heapify_old(tupledata[:])

[0.004418709082528949, 0.004422315978445113, 0.004425868042744696] tupledata[:]
[0.07753065403085202, 0.0775474050315097, 0.07755298691336066] heapify(tupledata[:])
[0.08406145800836384, 0.08412359503563493, 0.08419332408811897] heapify_old(tupledata[:])

[0.0044234748929739, 0.0044267530320212245, 0.0044296300038695335] tupledata[:]
[0.07729987089987844, 0.07750388595741242, 0.07770221296232194] heapify(tupledata[:])
[0.08401058206800371, 0.0840839499142021, 0.08423375408165157] heapify_old(tupledata[:])


N = 100000
[0.055330604896880686, 0.05594596697483212, 0.056045167963020504] tupledata[:]
[1.2075877389870584, 1.207723677973263, 1.2084980909712613] heapify(tupledata[:])
[1.56127171497792, 1.5691186729818583, 1.575164051959291] heapify_old(tupledata[:])

[0.0558202009415254, 0.05597207904793322, 0.0560223578941077] tupledata[:]
[1.2101711059221998, 1.211772706010379, 1.2120026310440153] heapify(tupledata[:])
[1.5360360990744084, 1.5435883220052347, 1.5501357419416308] heapify_old(tupledata[:])

[0.05999936908483505, 0.06000674597453326, 0.06018067698460072] tupledata[:]
[1.209613809012808, 1.2116600699955598, 1.2144729839637876] heapify(tupledata[:])
[1.5371010650414973, 1.5499007020844147, 1.5706949040759355] heapify_old(tupledata[:])


N = 1000000
[0.8224946830887347, 0.8234598189592361, 0.8247971039963886] tupledata[:]
[18.152570085017942, 18.340327466023155, 18.413799613015726] heapify(tupledata[:])
[19.786154965986498, 19.91440916794818, 19.952165015041828] heapify_old(tupledata[:])

[0.8147928019752726, 0.8154206149047241, 0.8169217950198799] tupledata[:]
[18.227028850931674, 18.265947047038935, 18.36190685792826] heapify(tupledata[:])
[19.587209751014598, 19.62119024794083, 19.85366743709892] heapify_old(tupledata[:])

[0.8098425359930843, 0.8100302360253409, 0.8104055189760402] tupledata[:]
[18.16859135404229, 18.207948053022847, 18.350174001068808] heapify(tupledata[:])
[19.6270396419568, 19.85634774295613, 20.017801710986532] heapify_old(tupledata[:])


N = 10000000
[16.074914730968885, 16.084022041060962, 16.150474293972366] tupledata[:]
[205.83624146296643, 205.94496312504634, 205.96075649105478] heapify(tupledata[:])
[349.2653319039382, 349.8653641429264, 351.06795906298794] heapify_old(tupledata[:])

[16.07795425108634, 16.091183452052064, 16.10310253489297] tupledata[:]
[205.1686675120145, 206.0234369279351, 206.10121345799416] heapify(tupledata[:])
[348.308155576, 348.6673470499227, 348.7512100059539] heapify_old(tupledata[:])

[16.074230056954548, 16.098330755950883, 16.106685669976287] tupledata[:]
[205.18946332705673, 205.4753301080782, 205.5993042109767] heapify(tupledata[:])
[349.5718760070158, 349.6067797210999, 351.29123334004544] heapify_old(tupledata[:])
msg242903 - (view) Author: Roundup Robot (python-dev) Date: 2015-05-11 17:19
New changeset db87591fce01 by Raymond Hettinger in branch 'default':
Issue #24155: Optimize heapify for better cache utililzation.
https://hg.python.org/cpython/rev/db87591fce01
History
Date User Action Args
2015-05-11 17:19:28rhettingersetstatus: open -> closed
resolution: fixed
2015-05-11 17:19:15python-devsetnosy: + python-dev
messages: + msg242903
2015-05-10 05:04:45rhettingersetfiles: + better_heapify.diff
2015-05-10 05:03:54rhettingersetfiles: - draft_heapq.diff
2015-05-10 04:58:21rhettingersetfiles: + wordy_explanation.txt
2015-05-10 04:52:28rhettingercreate