classification
Title: Reduce the cost of hash collisions for set objects
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: christian.heimes, haypo, jcea, python-dev, rhettinger, serhiy.storchaka, taleinat
Priority: low Keywords: patch

Created on 2013-08-17 18:57 by rhettinger, last changed 2013-09-15 21:57 by python-dev. This issue is now closed.

Files
File name                Uploaded                      Description
so.diff                  rhettinger, 2013-08-17 20:57  First draft patch for paired adjacent insertions
set_bench.py             rhettinger, 2013-08-18 01:49  Simple set benchmark
insert_clean.s           rhettinger, 2013-08-18 02:46  Annotated disassembly of set_insert_clean()
grind.txt                rhettinger, 2013-08-18 05:47  Cachegrind analysis detail
so2.diff                 rhettinger, 2013-08-18 17:24  Second draft patch, paired adjacent insertions
set_bench_smartcache.py  taleinat, 2013-08-19 10:42    Revised benchmark for processors with Intel SmartCache
Messages (26)
msg195506 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-17 18:57
I'm working on a patch for the lookkey() functions in Object/setobject.c.  The core idea is to follow the probe sequence as usual but to also check an adjacent entry for each probe (i.e. check &table[i & mask] as usual and then check &table[(i & mask) ^ 1] before going on to the next probes, which are scattered around memory).

On modern processors (anything in the last decade), this is likely to reduce the cost of hash collisions and reduce the number of cache lines fetched from memory.

+ Cache line benefit:  improves the odds that the adjacent probe will be on the same cache line, thereby reducing the total number of cache lines fetched.  This benefit will work for set insertions as well as set lookups (a partial write to a cache line requires that a full cache line be read prior to writing, so it is helpful if we've just examined another slot on the same cache line).

+ Parallel execution:  because the second lookup has no data dependency on the first lookup, some of its instructions can be executed in parallel by the out-of-order engine.

+ Reduced loop overhead:  the second lookup doesn't require a new computation of the index *i* (we just do a XOR 1), a new perturb shift, a new masking of *i*, or a jump to the top of the loop.  In other words, it has all the usual benefits of loop unrolling.

- On the downside, even this partial unrolling will increase the total code size, which has the negative effect of pushing other code out of the I-cache (typically 32 KB).

? I'm unsure whether the additional if-statements will have a net positive or net negative effect on branch prediction rates (positive because each branch can be tracked separately or negative because additional branches use up more space in the branch prediction tables).  Once the patch is ready, I'll run CacheGrind to get a better sense of what is happening.
msg195511 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-17 19:33
How will it affect the performance of set(range(n))?
msg195520 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-17 21:11
Serhiy, I've posted a patch so you can examine the effects in detail.

For something like set(range(n)), I would expect no change because the dataset is collision-free due to Tim's design where hash(someint) == someint.  That said, I'm trying to optimize the general case and don't really mind if rare cases are modestly slowed.

The new code only kicks in when there is a collision.  The logic for the initial probe is unchanged.  The idea is to pick up some of the cache locality benefits of collision resolution by separate chaining, without incurring the per-entry malloc typically associated with chaining.

When adjacent entries are in the same cache-line, the cost of the extra adjacent probe is much cheaper (nearly zero) than a probe somewhere else in memory.
msg195524 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-08-17 22:04
Do you expect a visible difference on a microbenchmark? Do you have
benchmarks showing the speedup, and showing that cases with many
collisions are not made much slower?
msg195526 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-17 22:41
I just made the patch a few minutes ago.  Am just starting to work on benchmarks.  I posted here so that you guys could help find the strengths and weaknesses of the approach.  

The theory is sound, but it is going to take a good deal of effort to isolate the effects (either good or bad) in realistic benchmarks.  The short, tight loops in timeit tend to put everything in cache and hide the effects of good memory access patterns.

I would love to have you all team up with me on this one.  I could use help on timings, examining disassembly, and runs with CacheGrind.

This is just in the experimental phase, so it is premature to be throwing up obstacles with "show me your data" or "what does it do in situation x".  At this point, you all could help out by running some timings designed to isolate the effects on high-collision data big enough to not fit in L2 cache.  I hope you'll assist in evaluating the idea with an open mind.
msg195529 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-08-17 22:52
> The theory is sound, but it is going to take a good deal of effort to isolate the effects (either good or bad) in realistic benchmarks.

Oh, I was just asking for a microbenchmark on set(). I don't need a
"real" benchmark (realistic use case) for an issue like this :-)

I asked because I don't understand all these complex CPU details
(cache size, data dependency, prefetch, ...), whereas I can compare
the numbers from a microbenchmark :-) And I can try to reproduce the
result on a different CPU and a different dataset. Because I don't
understand low-level CPU behavior, I always fear that a patch works
well on a specific CPU but behaves badly on a completely different
CPU architecture (ex: RISC vs CISC).

All your arguments sound good and I'm always happy to see people
involved to optimize Python!
msg195531 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-17 23:08
Here's an excerpt from the patch that gives a pretty good idea of what is being changed (the three lines of new logic are marked):

static void
set_insert_clean(PySetObject *so, PyObject *key, Py_hash_t hash)
{
    setentry *table = so->table;
    setentry *entry;
    size_t perturb = hash;
    size_t mask = (size_t)so->mask;
    size_t i, j;                        // the j variable is new
    i = j = (size_t)hash & mask;
    while(1) {
        entry = &table[j];
        if (entry->key == NULL)
            break;
        entry = &table[j ^ 1];          // this line is new
        if (entry->key == NULL)         // this line is new
            break;                      // this line is new
        i = i * 5 + perturb + 1;
        j = i & mask;
        perturb >>= PERTURB_SHIFT;
    }
    so->fill++;
    entry->key = key;
    entry->hash = hash;
    so->used++;
}
msg195532 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-08-17 23:28
"+ Cache line benefit:  improves the odds that the adjacent probe will
be on the same cache line (...)
+ Reduced loop overhead:  the second lookup doesn't require a new
computation of the index *i* (we just do a XOR 1) (...)"

With j=11, is the lookup at index 10 in the same cache line? Does a
cache line start at the address of the first lookup, or is it
"rounded" to the size of a cache line?

I've read that CPUs like to prefetch data forward, but I didn't know
that they also prefetch backward.
msg195533 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-18 00:04
> With j=11, is the lookup at index 10 in the same cache line?

It might or it might not.  The cache lines are typically 64 bytes which holds 4 key/hash pairs.  So, the odds are 75% in favor of an adjacent entry being in the same cache line.

If malloc were guaranteed to return 64-byte-aligned memory blocks, the odds would be 100%, but my understanding is that 16-byte alignment is the norm.

> I read that CPU likes to prefetch data forward, 
> but I didn't know that they like prefetching backward.

The integrated memory controller can also prefetch backwards, but that won't help us.  The prefetching doesn't start until you've done a series of monotonically increasing or decreasing accesses.  There is no prefetch on the first probe regardless of whether you're going up or down.

The whole problem with large sets is that the data is scattered all over memory and is unlikely to be in cache or to benefit from prefetching in the same way that lists and deques do (their memory access patterns are consecutive).   

The core concept for the patch is that a cache line fetch may be expensive so you might as well check more than one slot once a line is fetched.  Working against us is that we don't have control over where malloc starts a memory block (possibly starting in the middle of a cache line).
msg195539 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-18 01:49
Here's a simple benchmark to start with.

On my machine (2.4Ghz Core2 Duo with 2MB L3 cache) and compiled with GCC-4.8, the benchmark shows a 6% speedup for sets that spill over L2 cache and 11% for sets that spill over the L3 cache.

The theoretically expected result is 9.75%:

   26% collision rate (for tables that are 64% full)
 x 75% chance of adjacent entry in same cache line
 x 50% savings (two lookups for the price of one)
 -----
 9.75%
 =====

Most programs will see less benefit (they may have smaller tables that fit in L1 cache, or tables that are less densely filled, resulting in fewer collisions).  My quick timings show either no change or inconsequential improvement in the performance of those programs.
msg195543 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-18 05:47
Attaching the detailed results of a CacheGrind analysis.  Here are the high-points (all of which were expected):

* The last level data cache miss-rate improved 12% (from 4.9% to 4.3%).
* The first level data cache miss-rate improved 14% (from 5.5% to 4.7%).
* The additional code caused the instruction cache miss-rate to get slightly worse (from .32% to .40%).

The last level cache misses are the most important.  Per the cachegrind docs, "an L1 miss will typically cost around 10 cycles, [and] an LL miss can cost as much as 200 cycles."
msg195544 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-18 06:27
Also, cachegrind shows a 3% improvement in the branch misprediction rate (from 9.9% to 9.6%).

==82814== Mispred rate:             9.9% (          9.4%     +          15.5%   )
==82812== Mispred rate:             9.6% (          9.1%     +          14.1%   )
msg195551 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-08-18 10:23
Instead of only a second lookup, could you try, for example, 4 lookups
and align j to fit in a cache line? Or do you expect more collisions
with 4 contiguous lookups?
msg195571 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-18 16:54
> Instead of only a second lookup, could you try for example 4 lookup
> and align j to fit in a cache line? 

Accessing 4 entries per probe is a tempting line of development, but will be subject to diminishing returns (second and third collisions aren't frequent).

I like the idea of aligning j to fit in a cache line, but the computation would need to be cheap and portable (standard C with no pointer tricks that rely on non-guaranteed behavior).

Have you had a chance to run the benchmarks on your machine?  I'm curious how this works out on other processors and with other compilers.
msg195572 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-18 17:24
Attaching an updated patch (same algorithm but with a little more factoring to make the patch smaller).
msg195619 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2013-08-19 10:42
The following benchmarks were run with the second version of the patch (so2.diff), once before and once after.

OSX 10.8 on a Macbook Pro, 2.5 GHz Core i5, two cores each with 256kb L2 cache, 3MB L3 cache:

1.32% less time for first benchmark (7.07 before, 6.98 after)
1.31% less time for second benchmark (9.81 before, 9.68 after)

Ubuntu 12.10 64-bit, AMD Athlon II X2 250 Processor (3.0 GHz, 2MB L2 cache, 256KB total L1 per processor):

5% less time for first benchmark (13.6 before, 12.9 after)
11% less time for second benchmark (18.3 before, 16.3 after)

Win7 64-bit, Intel Xeon E3-1230 V2 (quad-core, 3.30GHz, 256kb L2 cache per core, 8MB L3 cache):

1.8% less time for first benchmark (3.90 before, 3.83 after)
3.5% less time for second benchmark (7.27 before, 7.02 after)

However, this processor has "Intel SmartCache", which as far as I can tell means one core can use the other cores' caches when they are inactive. So I ran the benchmarks again with 4x larger data sets (revised benchmark script attached):

3.8% less time for first benchmark (19.3 before, 18.6 after)
7.5% less time for second benchmark (33.1 before, 30.6 after)
msg195620 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2013-08-19 10:56
Because of the unusually low differences, I tried running the benchmarks again on the MacBook (OSX 10.8). The first time was actually with so.diff, this time was with so2.diff.

0.69% less time for first benchmark (7.30 before, 7.25 after)
8.08% less time for second benchmark (10.85 before, 9.97 after)

Strange...
msg195621 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-08-19 11:49
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz on Ubuntu 13.04 X86_64 with GCC 4.7.3

1MB L2 cache test: 11.33sec / 10.83sec (4.4% faster)
8MB L3 cache test: 22.16sec / 19.67sec (11% faster)
msg195622 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2013-08-19 12:01
Sorry to be pedantic, but while 19.67sec is about 11.2% less than 22.16sec, it is actually about 12.6% faster (22.16/19.67 - 1).

Regardless, that's a good result! And I'm happy that the modified benchmark is useful.
msg195623 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-08-19 12:10
Oh *curse*, you are right. Why do I always confuse the order? :|

It looks like the performance boost increases with CPU cache size as well as with the CPU's generation. The i7-4770K box at work is less than a week old; the CPU is currently the fastest desktop CPU from Intel.
msg195633 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-19 14:17
Wow, thanks for running these tests and posting the results :-)
msg195635 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-19 14:39
New changeset a887596b841f by Raymond Hettinger in branch 'default':
Issue18771:  Reduce the cost of hash collisions for set objects.
http://hg.python.org/cpython/rev/a887596b841f
msg195636 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-08-19 14:51
How about a Misc/NEWS and a Doc/whatsnew/3.4 entry, too?
msg195637 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-08-19 14:53
This optimization cannot be applied to the dict type?
msg195644 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-19 16:17
Victor, this should work for dictionaries as well.  Fundamentally, the concepts of loop unrolling and exploiting cache locality should apply to any hash table implementation.

That said, the dicts are more complex than sets due to 1) storing the hash/key/value 2) key-sharing, and 3) different access patterns.  So, it will require more effort than just porting over the patch.  Feel free to open a new tracker item for that one.  It will take more work but will potentially have a much greater pay-off.
msg197837 - (view) Author: Roundup Robot (python-dev) Date: 2013-09-15 21:57
New changeset 9353c611f897 by Raymond Hettinger in branch 'default':
Issue 18771: Make it possible to set the number linear probes at compile-time.
http://hg.python.org/cpython/rev/9353c611f897
History
Date User Action Args
2013-09-15 21:57:35  python-dev        set     messages: + msg197837
2013-08-19 16:17:29  rhettinger        set     messages: + msg195644
2013-08-19 14:53:08  haypo             set     messages: + msg195637
2013-08-19 14:51:58  christian.heimes  set     messages: + msg195636
2013-08-19 14:41:13  rhettinger        set     status: open -> closed; resolution: fixed
2013-08-19 14:39:09  python-dev        set     nosy: + python-dev; messages: + msg195635
2013-08-19 14:17:49  rhettinger        set     messages: + msg195633
2013-08-19 12:10:51  christian.heimes  set     messages: + msg195623
2013-08-19 12:01:22  taleinat          set     messages: + msg195622
2013-08-19 11:49:34  christian.heimes  set     messages: + msg195621
2013-08-19 10:56:56  taleinat          set     messages: + msg195620
2013-08-19 10:42:56  taleinat          set     files: + set_bench_smartcache.py; nosy: + taleinat; messages: + msg195619
2013-08-18 17:24:30  rhettinger        set     files: + so2.diff; messages: + msg195572
2013-08-18 17:12:12  jcea              set     nosy: + jcea
2013-08-18 16:54:55  rhettinger        set     messages: + msg195571
2013-08-18 10:23:17  haypo             set     messages: + msg195551
2013-08-18 06:27:40  rhettinger        set     messages: + msg195544
2013-08-18 05:47:34  rhettinger        set     files: + grind.txt; messages: + msg195543
2013-08-18 02:46:35  rhettinger        set     files: + insert_clean.s
2013-08-18 01:49:58  rhettinger        set     files: + set_bench.py; messages: + msg195539
2013-08-18 00:04:58  rhettinger        set     messages: + msg195533
2013-08-17 23:28:55  haypo             set     messages: + msg195532
2013-08-17 23:08:40  rhettinger        set     messages: + msg195531
2013-08-17 23:07:44  rhettinger        set     messages: - msg195530
2013-08-17 23:07:08  rhettinger        set     messages: + msg195530
2013-08-17 22:52:25  haypo             set     messages: + msg195529
2013-08-17 22:44:09  rhettinger        set     priority: normal -> low; assignee: rhettinger
2013-08-17 22:41:01  rhettinger        set     messages: + msg195526
2013-08-17 22:04:29  haypo             set     messages: + msg195524
2013-08-17 21:11:13  rhettinger        set     messages: + msg195520
2013-08-17 20:57:49  rhettinger        set     files: + so.diff; keywords: + patch
2013-08-17 19:33:34  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg195511
2013-08-17 19:17:23  christian.heimes  set     nosy: + christian.heimes
2013-08-17 18:57:58  rhettinger        create