Reduce the cost of hash collisions for set objects #62971

Closed
rhettinger opened this issue Aug 17, 2013 · 26 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@rhettinger
Contributor

BPO 18771
Nosy @rhettinger, @jcea, @vstinner, @taleinat, @tiran, @serhiy-storchaka
Files
  • so.diff: First draft patch for paired adjacent insertions
  • set_bench.py: Simple set benchmark
  • insert_clean.s: Annotated disassembly of set_insert_clean()
  • grind.txt: Cachegrind analysis detail
  • so2.diff: Second draft patch for paired adjacent insertions
  • set_bench_smartcache.py: Revised benchmark for processors with Intel SmartCache
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/rhettinger'
    closed_at = <Date 2013-08-19.14:41:13.508>
    created_at = <Date 2013-08-17.18:57:58.292>
    labels = ['interpreter-core', 'performance']
    title = 'Reduce the cost of hash collisions for set objects'
    updated_at = <Date 2013-09-15.21:57:35.577>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2013-09-15.21:57:35.577>
    actor = 'python-dev'
    assignee = 'rhettinger'
    closed = True
    closed_date = <Date 2013-08-19.14:41:13.508>
    closer = 'rhettinger'
    components = ['Interpreter Core']
    creation = <Date 2013-08-17.18:57:58.292>
    creator = 'rhettinger'
    dependencies = []
    files = ['31345', '31349', '31350', '31352', '31361', '31372']
    hgrepos = []
    issue_num = 18771
    keywords = ['patch']
    message_count = 26.0
    messages = ['195506', '195511', '195520', '195524', '195526', '195529', '195531', '195532', '195533', '195539', '195543', '195544', '195551', '195571', '195572', '195619', '195620', '195621', '195622', '195623', '195633', '195635', '195636', '195637', '195644', '197837']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'jcea', 'vstinner', 'taleinat', 'christian.heimes', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue18771'
    versions = ['Python 3.4']

    @rhettinger
    Contributor Author

    I'm working on a patch for the lookkey() functions in Objects/setobject.c. The core idea is to follow the probe sequence as usual but to also check an adjacent entry for each probe (i.e. check &table[i & mask] as usual and then check &table[(i & mask) ^ 1] before going on to the next probes, which are scattered around memory); a rough sketch of the lookup side appears after the list below.

    On modern processors (anything in the last decade), this is likely to reduce the cost of hash collisions and reduce the number of cache lines fetched from memory.

    + Cache line benefit: improves the odds that the adjacent probe will be on the same cache line, thereby reducing the total number of cache lines fetched. This benefit will work for set insertions as well as set lookups (a partial write to a cache line requires that a full cache line be read prior to writing, so it is helpful if we've just examined another slot on the same cache line).

    + Parallel execution: because the second lookup has no data dependency on the first lookup, some of its instructions can be executed in parallel by the out-of-order engine.

    + Reduced loop overhead: the second lookup doesn't require a new computation of the index *i* (we just XOR the index with 1), a new perturb shift, a new masking of *i*, or a jump to the top of the loop. In other words, it has all the usual benefits of loop unrolling.

    • On the downside, even this partial unrolling will increase the total code size, which has the negative effect of blowing other code out of the I-cache (typically 32kb).

    ? I'm unsure whether the additional if-statements will have a net positive or net negative effect on branch prediction rates (positive because each branch can be tracked separately or negative because additional branches use up more space in the branch prediction tables). Once the patch is ready, I'll run CacheGrind to get a better sense of what is happening.
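
    For reference, here is a rough sketch of the lookup side of the idea. This is illustrative only, not the attached patch: entry_t and lookkey_sketch are made-up stand-ins for CPython's setentry and set_lookkey(), and the key comparison is reduced to pointer and hash equality (the real lookkey() also has to handle dummy entries and rich comparison).

    #include <stddef.h>

    typedef struct { void *key; size_t hash; } entry_t;   /* stand-in for setentry */

    #define PERTURB_SHIFT 5

    static entry_t *
    lookkey_sketch(entry_t *table, size_t mask, void *key, size_t hash)
    {
        size_t perturb = hash;
        size_t i = hash & mask;
        size_t j = i;
        while (1) {
            /* primary probe, exactly as before */
            entry_t *entry = &table[j];
            if (entry->key == NULL || (entry->hash == hash && entry->key == key))
                return entry;
            /* adjacent probe: likely on the same cache line as the primary one */
            entry = &table[j ^ 1];
            if (entry->key == NULL || (entry->hash == hash && entry->key == key))
                return entry;
            /* otherwise continue with the usual scattered probe sequence */
            i = i * 5 + perturb + 1;
            j = i & mask;
            perturb >>= PERTURB_SHIFT;
        }
    }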

    @rhettinger rhettinger added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Aug 17, 2013
    @serhiy-storchaka
    Member

    How will it affect the performance of set(range(n))?

    @rhettinger
    Contributor Author

    Serhiy, I've posted a patch so you can examine the effects in detail.

    For something like set(range(n)), I would expect no change because the dataset is collision-free due to Tim's design where hash(someint)==someint. That said, I'm trying to optimize the general case and don't really mind if rare cases are modestly slowed.

    The new code only kicks in when there is a collision. The logic for the initial probe is unchanged. The idea is to pick up some of the cache-locality benefits of collision resolution by separate chaining, without incurring the per-entry malloc typically associated with chaining.

    When adjacent entries are in the same cache-line, the cost of the extra adjacent probe is much cheaper (nearly zero) than a probe somewhere else in memory.

    @vstinner
    Member

    Do you expect a visible difference on a microbenchmark? Do you have
    benchmarks showing the speedup and showing that many collisions are not
    too much slower?

    @rhettinger
    Contributor Author

    I just made the patch a few minutes ago. Am just starting to work on benchmarks. I posted here so that you guys could help find the strengths and weaknesses of the approach.

    The theory is sound, but it is going to take a good deal of effort to isolate the effects (either good or bad) in realistic benchmarks. The short, tight loops in timeit tend to put everything in cache and hide the effects of good memory access patterns.

    I would love to have you all team up with me on this one. I could use help on timings, examining disassembly, and runs with Cachegrind.

    This is just in the experimental phase, so it is premature to be throwing up obstacles with "show me your data" or "what does it do in situation x". At this point, you all could help out by running some timings designed to isolate the effects on high-collision data big enough to not fit in L2 cache. I hope you'll assist in evaluating the idea with an open mind.

    @rhettinger rhettinger self-assigned this Aug 17, 2013
    @vstinner
    Member

    The theory is sound, but it is going to take a good deal of effort to isolate the effects (either good or bad) in realistic benchmarks.

    Oh, I was just asking for a microbenchmark on set(). I don't care about
    a "real" benchmark (realistic use case) for such an issue :-)

    I asked because I don't understand all these complex things of the CPU
    (cache size, data dependency, prefetch, ...), whereas I can compare the
    numbers of a microbenchmark :-) And I can try to reproduce it on a
    different CPU and a different dataset. Because I don't understand
    low-level CPU things, I always fear that a patch works well on a
    specific CPU whereas it would behave badly on a completely different
    CPU architecture (ex: RISC vs CISC).

    All your arguments sound good and I'm always happy to see people
    involved in optimizing Python!

    @rhettinger
    Contributor Author

    Here's an excerpt from the patch that gives a pretty good idea of what is being changed (the three lines of new logic are marked):

    static void
    set_insert_clean(PySetObject *so, PyObject *key, Py_hash_t hash)
    {
        setentry *table = so->table;
        setentry *entry;
        size_t perturb = hash;
        size_t mask = (size_t)so->mask;
        size_t i, j;                        // the j variable is new
        i = j = (size_t)hash & mask;
        while(1) {
            entry = &table[j];
            if (entry->key == NULL)
                break;
            entry = &table[j ^ 1];          // this line is new
            if (entry->key == NULL)         // this line is new
                break;                      // this line is new
            i = i * 5 + perturb + 1;
            j = i & mask;
            perturb >>= PERTURB_SHIFT;
        }
        so->fill++;
        entry->key = key;
        entry->hash = hash;
        so->used++;
    }

    @vstinner
    Member

    "+ Cache line benefit: improves the odds that the adjacent probe will
    be on the same cache line (...)
    + Reduced loop overhead: the second lookup doesn't require a new
    computation of the index *i* (we just do a XOR 1) (...)"

    With j=11, is the lookup at index 10 in the same cache line? Does a
    cache line start at the address of the first lookup, or is it
    "rounded" to the size of a cache line?

    I read that CPUs like to prefetch data forward, but I didn't know that
    they also prefetch backward.

    @rhettinger
    Contributor Author

    With j=11, is the lookup at index 10 in the same cache line?

    It might or it might not. The cache lines are typically 64 bytes which holds 4 key/hash pairs. So, the odds are 75% in favor of an adjacent entry being in the same cache line.

    If malloc were guaranteed to return 64-byte aligned memory blocks, the odds would be 100%, but my understanding is that 16-byte alignment is the norm.

    I read that CPU likes to prefetch data forward,
    but I didn't know that they like prefetching backward.

    The integrated memory controller can also prefetch backwards, but that won't help us. The prefetching doesn't start until you've done a series of monotonically increasing or decreasing accesses. There is no prefetch on the first probe regardless of whether you're going up or down.

    The whole problem with large sets is that the data is scattered all over memory and is unlikely to be in cache or to benefit from prefetching in the same way that lists and deques do (their memory access patterns are consecutive).

    The core concept for the patch is that a cache line fetch may be expensive so you might as well check more than one slot once a line is fetched. Working against us is that we don't have control over where malloc starts a memory block (possibly starting in the middle of a cache line).
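
    To make the 75% figure concrete, here is a small stand-alone sketch (not part of the patch; ENTRY_SIZE, LINE_SIZE, and SLOTS are just illustrative names). It tries each possible 16-byte starting offset of the table within a 64-byte line and counts how often slot j and slot j ^ 1 land on the same line:

    #include <stdio.h>

    #define ENTRY_SIZE  16      /* sizeof(setentry) on 64-bit: key pointer + hash */
    #define LINE_SIZE   64      /* typical cache line size in bytes */
    #define SLOTS       1024

    int main(void)
    {
        /* with 16-byte alignment from malloc, the table can start at
           line offset 0, 16, 32, or 48 */
        for (int start = 0; start < LINE_SIZE; start += ENTRY_SIZE) {
            int same = 0, pairs = 0;
            for (int j = 0; j < SLOTS; j += 2) {
                long a = start + (long)j * ENTRY_SIZE;        /* offset of table[j]     */
                long b = start + (long)(j ^ 1) * ENTRY_SIZE;  /* offset of table[j ^ 1] */
                same += (a / LINE_SIZE) == (b / LINE_SIZE);
                pairs++;
            }
            printf("line offset %2d: %3d%% of adjacent pairs share a line\n",
                   start, 100 * same / pairs);
        }
        /* offsets 0 and 32 give 100%, offsets 16 and 48 give 50%,
           so the average over possible malloc alignments is 75% */
        return 0;
    }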

    @rhettinger
    Contributor Author

    Here's a simple benchmark to start with.

    On my machine (2.4Ghz Core2 Duo with 2MB L3 cache) and compiled with GCC-4.8, the benchmark shows a 6% speedup for sets that spill over L2 cache and 11% for sets that spill over the L3 cache.

    The theoretically expected result is 9.75%:

    26% collision rate (for tables that are 64% full)
    x 75% chance of adjacent entry in same cache line
    x 50% savings (two lookups for the price of one)
    -----
    9.75%
    =====

    Most programs will see less benefit (they may have smaller tables that fit in L1 cache, or tables that are less densely filled, resulting in fewer collisions). My quick timings show either no change or an inconsequential improvement in the performance of those programs.

    @rhettinger
    Contributor Author

    Attaching the detailed results of a CacheGrind analysis. Here are the high-points (all of which were expected):

    • The last level data cache miss-rate improved 12% (from 4.9% to 4.3%).
    • The first level data cache miss-rate improved 14% (from 5.5% to 4.7%).
    • The additional code caused the instruction cache miss-rate to get slightly worse (from .32% to .40%).

    The last level cache misses are the most important. Per the cachegrind docs, "an L1 miss will typically cost around 10 cycles, [and] an LL miss can cost as much as 200 cycles."

    @rhettinger
    Contributor Author

    Also, cachegrind shows a 3% improvement in the branch misprediction rate (from 9.9% to 9.6%).

    ==82814== Mispred rate: 9.9% ( 9.4% + 15.5% )
    ==82812== Mispred rate: 9.6% ( 9.1% + 14.1% )

    @vstinner
    Member

    Instead of only a second lookup, could you try, for example, 4 lookups and
    align j to fit in a cache line? Or do you expect more collisions with 4
    contiguous lookups?

    @rhettinger
    Contributor Author

    Instead of only a second lookup, could you try, for example, 4 lookups
    and align j to fit in a cache line?

    Accessing 4 entries per probe is a tempting line of development, but will be subject to diminishing returns (second and third collisions aren't frequent).

    I like the idea of aligning j to fit in a cache line, but the computation would need to be cheap and portable (standard C with no pointer tricks that rely on non-guaranteed behavior).
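
    One hypothetical way to do that with index arithmetic only (a sketch, not anything in the attached patches; probe_group, GROUP, and entry_t are made-up names): round the slot index down to a group of four consecutive entries and scan the group. Note that the group only coincides with a real cache line when the table base happens to be 64-byte aligned, which echoes the earlier point about malloc alignment.

    #include <stddef.h>

    typedef struct { void *key; size_t hash; } entry_t;   /* stand-in for setentry */

    #define GROUP 4     /* 4 x 16-byte entries == one 64-byte cache line */

    /* Return the first empty slot in the 4-entry group containing index j,
       or NULL if the group is full and the caller should keep probing. */
    static entry_t *
    probe_group(entry_t *table, size_t mask, size_t j)
    {
        size_t base = j & ~(size_t)(GROUP - 1);    /* round j down to a multiple of 4 */
        for (size_t k = 0; k < GROUP; k++) {
            entry_t *entry = &table[(base + k) & mask];
            if (entry->key == NULL)
                return entry;
        }
        return NULL;
    }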

    Have you had a chance to run the benchmarks on your machine? I'm curious how this works out on other processors and with other compilers.

    @rhettinger
    Contributor Author

    Attaching an updated patch (same algorithm but with a little more factoring to make the patch smaller).

    @taleinat
    Contributor

    The following benchmarks were run with the second version of the patch (so2.diff), once before and once after.

    OSX 10.8 on a Macbook Pro, 2.5 GHz Core i5, two cores each with 256kb L2 cache, 3MB L3 cache:

    1.32% less time for first benchmark (7.07 before, 6.98 after)
    1.31% less time for second benchmark (9.81 before, 9.68 after)

    Ubuntu 12.10 64-bit, AMD Athlon II X2 250 Processor (3.0 GHz, 2MB L2 cache, 256KB total L1 per processor):

    5% less time for first benchmark (13.6 before, 12.9 after)
    11% less time for second benchmark (18.3 before, 16.3 after)

    Win7 64-bit, Intel Xeon E3-1230 V2 (quad-core, 3.30GHz, 256kb L2 cache per core, 8MB L3 cache):

    1.8% less time for first benchmark (3.90 before, 3.83 after)
    3.5% less time for second benchmark (7.27 before, 7.02 after)

    However, this processor has "Intel SmartCache", which as far as I can tell means one processor can use other processors' caches if they are inactive. So I ran the benchmarks again with 4x larger data sets (revised benchmark script attached):

    3.8% less time for first benchmark (19.3 before, 18.6 after)
    7.5% less time for second benchmark (33.1 before, 30.6 after)

    @taleinat
    Contributor

    Because of the unusually low differences, I tried running the benchmarks again on the MacBook (OSX 10.8). The first time was actually with so.diff; this time was with so2.diff.

    0.69% less time for first benchmark (7.30 before, 7.25 after)
    8.08% less time for second benchmark (10.85 before, 9.97 after)

    Strange...

    @tiran
    Member

    tiran commented Aug 19, 2013

    Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz on Ubuntu 13.04 X86_64 with GCC 4.7.3

    1MB L2 cache test: 11.33sec / 10.83sec (4.4% faster)
    8MB L3 cache test: 22.16sec / 19.67sec (11% faster)

    @taleinat
    Contributor

    Sorry to be pedantic, but while 19.67sec is about 11.2% less than 22.16sec, it is actually about 12.6% faster (22.16/19.67 - 1).

    Regardless, that's a good result! And I'm happy that the modified benchmark is useful.

    @tiran
    Member

    tiran commented Aug 19, 2013

    Oh *curse*, you are right. Why do I always confuse the order? :|

    It looks like the performance boost increases with CPU cache size as well as the CPU's generation. The i7-4770K box at work is less than a week old. The CPU is currently the fastest desktop CPU from Intel.

    @rhettinger
    Contributor Author

    Wow, thanks for running these tests and posting the results :-)

    @python-dev
    Mannequin

    python-dev mannequin commented Aug 19, 2013

    New changeset a887596b841f by Raymond Hettinger in branch 'default':
    bpo-18771: Reduce the cost of hash collisions for set objects.
    http://hg.python.org/cpython/rev/a887596b841f

    @tiran
    Member

    tiran commented Aug 19, 2013

    How about a Misc/NEWS and a Doc/whatsnew/3.4 entry, too?

    @vstinner
    Member

    Can this optimization also be applied to the dict type?

    @rhettinger
    Contributor Author

    Victor, this should work for dictionaries as well. Fundamentally, the concepts of loop unrolling and exploiting cache locality should apply to any hash table implementation.

    That said, dicts are more complex than sets due to 1) storing the hash/key/value triple, 2) key-sharing, and 3) different access patterns. So, it will require more effort than just porting over the patch. Feel free to open a new tracker item for that one. It will take more work but will potentially have a much greater pay-off.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 15, 2013

    New changeset 9353c611f897 by Raymond Hettinger in branch 'default':
    bpo-18771: Make it possible to set the number of linear probes at compile-time.
    http://hg.python.org/cpython/rev/9353c611f897

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022