
Faster hash implementation #60631

Closed

serhiy-storchaka opened this issue Nov 7, 2012 · 36 comments
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

@serhiy-storchaka
Member

BPO 16427
Nosy @tim-one, @rhettinger, @gpshead, @jcea, @mdickinson, @pitrou, @vstinner, @tiran, @asvetlov, @skrah, @serhiy-storchaka
Superseder
  • bpo-19183: PEP 456 Secure and interchangeable hash algorithm

Files
  • test_hash_align.patch
  • fast_hash_3.patch
  • bench_hash.txt
  • bench_hash.py
  • cityhash.txt
  • cityhash_arm.txt
  • cityhash_fashhash3_arm.txt
  • cityhash_fasthast3.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/tiran'
    closed_at = <Date 2013-10-28.11:58:03.901>
    created_at = <Date 2012-11-07.12:38:49.143>
    labels = ['interpreter-core', 'performance']
    title = 'Faster hash implementation'
    updated_at = <Date 2013-10-28.11:58:03.900>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2013-10-28.11:58:03.900>
    actor = 'serhiy.storchaka'
    assignee = 'christian.heimes'
    closed = True
    closed_date = <Date 2013-10-28.11:58:03.901>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2012-11-07.12:38:49.143>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['27948', '27956', '29742', '29743', '30446', '30447', '30474', '30475']
    hgrepos = []
    issue_num = 16427
    keywords = ['patch']
    message_count = 36.0
    messages = ['175093', '175298', '175300', '175302', '175303', '175330', '175332', '175333', '175349', '175351', '175353', '175361', '175367', '175385', '186352', '186353', '186403', '186405', '186406', '186421', '186439', '186441', '186442', '186443', '186481', '190475', '190476', '190478', '190488', '190489', '190491', '190676', '190704', '201493', '201525', '201529']
    nosy_count = 15.0
    nosy_names = ['tim.peters', 'rhettinger', 'gregory.p.smith', 'jcea', 'mark.dickinson', 'pitrou', 'vstinner', 'christian.heimes', 'ebfe', 'asvetlov', 'skrah', 'neologix', 'python-dev', 'serhiy.storchaka', 'isoschiz']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '19183'
    type = 'performance'
    url = 'https://bugs.python.org/issue16427'
    versions = ['Python 3.4']

    @serhiy-storchaka
    Member Author

    In the discussion of bpo-14621 it was noted that much more complex hash algorithms can overtake the current one because they process more data at a time. Here is a patch that implements this idea for the current algorithm. It also removes some code duplication.
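
    As an editorial illustration of the idea (a minimal sketch, not the actual patch): the classic byte-at-a-time multiply/XOR loop is widened to consume one machine word per iteration. The function name, constants, and tail handling are simplified assumptions; the real code also has to process the remaining tail bytes and mix in the randomized prefix/suffix. Note that the cast below assumes aligned input, which is exactly the problem raised later in this thread.

    /* Sketch: consume sizeof(size_t) bytes per iteration instead of one.
       Assumes p is word-aligned and ignores the tail for brevity. */
    static size_t
    word_hash(const unsigned char *p, size_t len)
    {
        size_t x = 0;
        const size_t *wp = (const size_t *)p;
        size_t i, nwords = len / sizeof(size_t);
        for (i = 0; i < nwords; i++)
            x = (1000003UL * x) ^ wp[i];  /* 1000003 is the multiplier CPython used */
        x ^= len;
        return x;
    }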

    Microbenchmarks:

    $ ./python -m timeit -n 1 -s "t = b'a' * 10**8"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = 'a' * 10**8"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = '\u0100' * 10**8"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = '\U00010000' * 10**8"  "hash(t)"

    Results on 32-bit Linux on AMD Athlon 64 X2 4600+:

              original  patched    speedup
    bytes     181 msec  45.7 msec  4x
    UCS1      429 msec  45.7 msec  9.4x
    UCS2      179 msec  92 msec    1.9x
    UCS4      183 msec  183 msec   1x

    If the idea is acceptable, I will create benchmarks for short strings.

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Nov 7, 2012
    @gpshead
    Member

    gpshead commented Nov 10, 2012

    yes please!

    Any reason you're using an unsigned int in your loop instead of a Py_uhash_t?

    @serhiy-storchaka
    Member Author

    Any reason you're using an unsigned int in your loop instead of a Py_uhash_t?

    In fact, there is no serious reason. It should be a type whose alignment matches the minimal alignment of void*, size_t, and Py_hash_t. Since Py_uhash_t is de facto size_t, we can use size_t.
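
    For reference, CPython defines these types in Include/pyport.h, which is why Py_uhash_t and size_t coincide in practice:

    typedef Py_ssize_t Py_hash_t;
    typedef size_t Py_uhash_t;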

    @pitrou
    Member

    pitrou commented Nov 10, 2012

    The patch is too optimistic; it gives different results depending on the alignment of the memory buffer:

    >>> b = b"abcd"*100
    >>> hash(b[1:])
    7264687738704559842
    >>> hash(memoryview(b)[1:])
    9054791208347464792
    >>> memoryview(b)[1:] == b[1:]
    True

    @pitrou
    Member

    pitrou commented Nov 10, 2012

    Here is a test case for the hash/alignment issue.

    @serhiy-storchaka
    Member Author

    The patch is too optimistic; it gives different results depending on the alignment of the memory buffer:

    So this method is not applicable to bytes. Here is a patch for strings only.

    If a fast hash for bytes/memoryview is desirable, I can write a fast robust implementation for nonaligned data. But this will be more cumbersome and a bit slower.

    @serhiy-storchaka
    Member Author

    Here is a test case for the hash/alignment issue.

    I think there should be a test for shifted data. Something like hash(b'abcd...') == hash(memoryview(b'xabcd...')[1:]).

    @serhiy-storchaka
    Member Author

    Oh, I see, it's already here.

    @pitrou
    Member

    pitrou commented Nov 11, 2012

    If a fast hash for bytes/memoryview is desirable, I can write a fast
    robust implementation for nonaligned data. But this will be more
    cumbersome and a bit slower.

    Unaligned accesses are not a problem on x86(-64), but they will segfault
    (bus error, IIRC) on other architectures such as SPARC, unfortunately.

    (blame RISC for being really too "simplified")

    @skrah
    Mannequin

    skrah mannequin commented Nov 11, 2012

    FWIW, on x86/x64 gcc often generates identical code for x = y and
    memcpy(x, y, 8). See e.g. the PACK_SINGLE and UNPACK_SINGLE macros in
    Objects/memoryobject.c.

    I didn't look at the patch yet, just an idea.
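
    A minimal editorial illustration of the point (the function names here are made up): with optimization enabled, GCC typically lowers a fixed-size memcpy into a single load on x86/x86-64, so both forms below compile to the same instruction there, but only the memcpy form is well-defined for unaligned pointers.

    #include <stddef.h>
    #include <string.h>

    /* Alignment- and aliasing-safe word load; usually a single mov
       on x86-64 at -O2. */
    static size_t
    load_word_safe(const unsigned char *p)
    {
        size_t w;
        memcpy(&w, p, sizeof w);
        return w;
    }

    /* Direct dereference: same machine code on x86-64, but undefined
       behavior if p is not suitably aligned (bus error on e.g. SPARC). */
    static size_t
    load_word_unsafe(const unsigned char *p)
    {
        return *(const size_t *)p;
    }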

    @serhiy-storchaka
    Member Author

    Unaligned accesses are not a problem on x86(-64), but they will segfault
    (bus error, IIRC) on other architectures such as SPARC, unfortunately.

    On x86(-64) this kills performance and makes the optimization senseless.

    FWIW, on x86/x64 gcc often generates identical code for x = y and
    memcpy(x, y, 8).

    The code can be identical, but the time will differ significantly for aligned and non-aligned data.

    @skrah
    Mannequin

    skrah mannequin commented Nov 11, 2012

    The code can be identical, but the time will differ significantly for
    aligned and non-aligned data.

    Of course, but in most cases the data *is* aligned, so only code that does
    something quite special pays the performance penalty.

    @serhiy-storchaka
    Member Author

    Stefan, thank you for the suggestion. The test showed that, in fact, at least on some x86 machines there is no performance decrease when using memcpy on unaligned data. This is good news. The code can be left simple, and even some doubtful, potentially undefined behavior was removed.

    Additional microbenchmarks:

    $ ./python -m timeit -n 1 -s "t = memoryview(b'a' * 10**8)"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = memoryview(b'a' * 10**8)[1:]"  "hash(t)"
    $ ./python -m timeit -n 1 -s "t = memoryview(b'a' * 10**8)[8:]"  "hash(t)"
                      original  patched    speedup
    bytes             181 msec  46 msec    3.9x
    UCS1              429 msec  46.2 msec  9.3x
    UCS2              179 msec  91.9 msec  1.9x
    UCS4              183 msec  184 msec   1x
    memoryview()      362 msec  91.7 msec  3.9x
    memoryview()[1:]  362 msec  93.2 msec  3.9x
    memoryview()[8:]  362 msec  92.4 msec  3.9x

    I don't know how it will be on other platforms.

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 11, 2012

    New changeset 797de1864fd9 by Antoine Pitrou in branch '3.3':
    Add a test for hashing of unaligned memory buffers (from issue bpo-16427).
    http://hg.python.org/cpython/rev/797de1864fd9

    New changeset 9cb1366b251b by Antoine Pitrou in branch 'default':
    Add a test for hashing of unaligned memory buffers (from issue bpo-16427).
    http://hg.python.org/cpython/rev/9cb1366b251b

    @vstinner
    Member

    vstinner commented Apr 8, 2013

    fast_hash_3.patch is a little bit (6%) slower for Unicode strings shorter than 10 characters, but much faster for strings equal to or longer than 100 characters (up to 10x faster).

    I used the str type and disabled its cache ("_PyUnicode_HASH(self) = x;" in unicode_hash()) to run my benchmark.

    Summary        |   original |        patched
    ---------------+------------+---------------
    Length 1       | 231 ns (*) |   244 ns (+6%)
    Length 3       | 238 ns (*) |   253 ns (+6%)
    Length 10      | 254 ns (*) |         251 ns
    Length 20      | 280 ns (*) |   256 ns (-8%)
    Length 100     | 528 ns (*) |  321 ns (-39%)
    Length 10 ** 4 |  32 us (*) | 9.49 us (-70%)
    Length 10 ** 8 | 329 ms (*) |  104 ms (-68%)
    ---------------+------------+---------------
    Total          | 329 ms (*) |  104 ms (-68%)

    @vstinner
    Member

    vstinner commented Apr 8, 2013

    Does anyone know if fast_hash_3.patch may reduce the quality of the hash function? (May the patched hash function produce more collisions? The "Avalanche effect" thing.)

    @pitrou
    Member

    pitrou commented Apr 9, 2013

    Well, the quality of the hash function is clearly reduced:

    >>> hash("abcdefgh") & 0xff
    206
    >>> hash("abcdefgi") & 0xff
    206
    >>> hash("abcdefgj") & 0xff
    206
    >>> hash("abxxxxxx") & 0xff
    206
    >>> hash("aaaaaa11") & 0xff
    206
    >>> hash("aaaaaa12") & 0xff
    206

    Now to know if that may produce slowdowns in some situations... (dicts and sets have a sophisticated probing algorithm which takes into account the whole hash value, not the masked one).
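
    For context, here is a simplified editorial sketch of that probing (modeled on Objects/dictobject.c; slot_is_occupied() is a hypothetical placeholder): the full hash value seeds the perturb variable, so bits above the mask still separate keys whose masked hashes collide.

    #define PERTURB_SHIFT 5

    size_t mask = table_size - 1;          /* table size is a power of two */
    size_t perturb = (size_t)hash;
    size_t i = (size_t)hash & mask;
    while (slot_is_occupied(i)) {
        /* high bits of the hash flow in through perturb, splitting the
           probe chains of keys that collide in the low bits */
        perturb >>= PERTURB_SHIFT;
        i = (i * 5 + perturb + 1) & mask;
    }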

    @vstinner
    Member

    vstinner commented Apr 9, 2013

    (dicts and sets have a sophisticated probing algorithm which takes into account the whole hash value, not the masked one).

    Correct, so your specific example should not be a problem since the
    whole hash value is different for the 6 hash values.

    Now to know if that may produce slowdowns in some situations...

    pybench and perf.py can be used to measure the performance of the patch.
    The speedup may not be detected, but at least a slowdown would be.



    @neologix
    Mannequin

    neologix mannequin commented Apr 9, 2013

    Note that the patch uses type punning through a union: while GCC allows this, it's not allowed by ANSI (although since we're using a char [], it's somewhat of a grey area). An aggressive compiler could optimize the read/write away.

    @vstinner
    Member

    vstinner commented Apr 9, 2013

    Note that the patch uses type punning through a union

    What is the standard and portable way to cast an array of bytes to size_t?


    @gpshead
    Member

    gpshead commented Apr 9, 2013

    > Note that the patch uses type punning through a union

    What is the standard and portable way to cast an array of bytes to size_t?

    I'd expect just casting the pointer type before dereferencing:

    unsigned char *p;
    ...
    hash = (multiplier * hash) ^ *((Py_uhash_t *)p);

    (don't use size_t, use Py_uhash_t)

    @pitrou
    Member

    pitrou commented Apr 9, 2013

    pybench and perf.py can be used to measure the performance of the patch.
    The speedup may not be detected, but at least a slowdown would be.

    The slowdown would only occur for specific, well-chosen patterns. Also it may make DoS attacks easier.
    (remember the reason we randomized hashes is so that it's hard for attackers to find collisions)

    @neologix
    Mannequin

    neologix mannequin commented Apr 9, 2013

    I'd expect just casting the pointer type before dereferencing:

    unsigned char *p;
    ...
    hash = (multiplier * hash) ^ *((Py_uhash_t *)p);

    (don't use size_t, use Py_uhash_t)

    Is p guaranteed to be size_t aligned?
    If not, unaligned access can segfault (e.g. on Sparc IIRC).

    Also it may make DoS attacks easier.

    Indeed.
    And the increase in collisions you demonstrated in your previous
    message worries me (both security- and performance-wise).

    @neologix
    Mannequin

    neologix mannequin commented Apr 9, 2013

    Is p guaranteed to be size_t aligned?
    If not, unaligned access can segfault (e.g. on Sparc IIRC).

    Apparently yes.

    @mdickinson
    Member

    Note that the patch uses type punning through a union: while GCC allows this, it's not allowed by ANSI.

    I believe it's legal under C99 + TC3.

    @ebfe
    Mannequin

    ebfe mannequin commented Jun 2, 2013

    I was investigating a callgrind dump of my code, showing how badly unicode_hash() was affecting my performance. Using Google's CityHash instead of the builtin algorithm to hash unicode objects improves overall performance by about 15 to 20 percent for my case - that is quite a thing.
    Valgrind shows that the share of instructions spent in unicode_hash() drops from ~20% to ~11%. Amdahl's law crunches the two-fold performance increase down to the mentioned 15 percent.

    CityHash was chosen because of its MIT license and its advertised performance on short strings.

    I've now found this bug and attached a log for haypo's benchmark which compares native vs. CityHash. Caching was disabled during the test. CityHash was compiled using -O3 -msse4.2 (CityHash uses CPU-native CRC instructions). CPython's unittests fail due to known_hash and gdb output; besides that, everything else seems to work fine.

    CityHash is advertised for its performance with short strings, which does not seem to show in the benchmark. However, longer strings perform *much* better.

    If people are interested, I can repeat the test on an armv7l.

    @pitrou
    Member

    pitrou commented Jun 2, 2013

    I was investigating a callgrind dump of my code, showing how badly
    unicode_hash() was affecting my performance.

    Can you tell us about your use case?
    There are several CityHash variants, which one did you use? CityHash64?

    @ebfe
    Mannequin

    ebfe mannequin commented Jun 2, 2013

    It's a cache sitting between an Informix DB and an internal web service. Stuff comes out of the DB, gets processed, JSON'ified, cached, and put on the wire. 10**6s of strings pass through this process per request if uncached...

    I use CityHash64WithSeed, the seed being CPython's hash prefix (which I don't care about but found reassuring to put in anyway).

    @ebfe
    Mannequin

    ebfe mannequin commented Jun 2, 2013

    Here are some benchmarks for an armv7l on an RK30 board. CityHash was compiled with -mcpu=native -O3.

    CityHash is around half as fast as the native algorithm for small strings and way, way slower on larger ones. My guess would be that the complex arithmetic in CityHash outweighs the gains of better scheduling.

    The results are somewhat inconclusive, as the performance increases again for very long strings.

    @pitrou
    Member

    pitrou commented Jun 2, 2013

    Here are some benchmarks for an armv7l on an RK30 board. CityHash was
    compiled with -mcpu=native -O3.

    The results look unbelievable. If you take "Length 10 ** 4", it means
    the armv7l is able to hash 20 GB/s using the default unicode hash function.

    (did you disable caching?)

    @ebfe
    Mannequin

    ebfe mannequin commented Jun 2, 2013

    The 10**4 case is an error (see the insane %) which I've never been able to reproduce. Having done more tests with the CPU frequency fixed and other daemons' process priorities reduced, CityHash always comes out much slower on armv7l.

    @ebfe
    Mannequin

    ebfe mannequin commented Jun 5, 2013

    Here are more benchmarks of vanilla 3.4 vs. CityHash vs. fast_hash_3 on both armv7l and x86-64. The patch was applied verbatim, only caching was disabled. On armv7l, the CPU was fixed to its maximum frequency (it seems to take ages to switch frequencies; at least there is a lot of jitter with ondemand). The CityHash implementation was compiled with -O3 on both platforms and -msse4.2 on x86-64.

    CityHash and fh3 come out much better than vanilla on x86-64, with CityHash being slightly faster (which is surprising). On armv7l, CityHash performs much worse than vanilla while fh3 does significantly better.

    @vstinner
    Member

    vstinner commented Jun 6, 2013

    I'm not sure that micro-benchmarks are relevant for this issue. Hash collisions may have a higher cost than the gain of a faster hash function. Some people also feel concerned about the dict denial-of-service issue.

    It would be interesting to compare performance using the benchmark suite:
    http://hg.python.org/benchmarks/

    @tiran
    Member

    tiran commented Oct 27, 2013

    Antoine, I have addressed your concern "Well, the quality of the hash function is clearly reduced" in patch http://hg.python.org/features/pep-456/rev/765930d944a5

    >>> s = set()
    >>> for i in range(256):
    ...     s.add(hash("abcdfeg" + chr(i)) & 0xff)
    ... 
    >>> len(s)
    256
    >>> s = set()
    >>> for i in range(256):
    ...     s.add(hash("abcdfeghabcdefg" + chr(i)) & 0xff)
    ... 
    >>> len(s)
    256

    @tiran tiran self-assigned this Oct 27, 2013
    @pitrou
    Member

    pitrou commented Oct 28, 2013

    Shouldn't this issue be obsoleted?

    @serhiy-storchaka
    Member Author

    Since bpo-19183 supersedes it, yes.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022