PEP 456 Secure and interchangeable hash algorithm #63382
Comments
The patch implements the current state of PEP 456 plus a configure option to select the hash algorithm. I have tested it only on 64-bit Linux so far. |
Here is a simple benchmark (Linux, gcc 4.7.3):

$ ./python -m timeit -s "words=[w for line in open('LICENSE') for w in line.split()]; import collections" "c = collections.Counter(words); c.most_common(10)"
|
Microbenchmarking hash computation (Linux, gcc 4.7.3):
Overall, performance looks fine to me. |
Your benchmark is a bit unrealistic because it mostly times the hash cache: str and bytes objects cache their hash after the first computation, so wrapping each word in a fresh memoryview forces the hash to be recomputed. Here is a better benchmark (but bytes-only):

$ ./python -m timeit -s "words=[w.encode('utf-8') for line in open('../LICENSE') for w in line.split()]; import collections" -- "c = collections.Counter(memoryview(w) for w in words); c.most_common(10)"
1000 loops, best of 3: 1.63 msec per loop

This increases the number of hash calculations from about 28k to over 8.4 million. I also added a little statistic function to see how large typical strings are. The artificial benchmark:

hash 1: 115185

And here is the statistic of a full test run:

hash 1: 18935

About 50% of the hashed elements are between 1 and 5 characters long. A realistic hash collision attack on 32-bit needs at least 7 or 8 characters. I see the chance of a micro-optimization! :) |
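The micro-optimization being hinted at is a cheaper hash for strings below a collision-safe cutoff. A minimal sketch of the idea, assuming a hypothetical siphash24_hash() helper and an illustrative cutoff constant (the real patch may differ):

/* Strings shorter than the cutoff are too short for a practical
 * collision attack, so a cheap DJBX33A-style hash suffices. */
#define PY_HASH_CUTOFF 7  /* assumed threshold, for illustration only */

Py_hash_t hash_bytes(const unsigned char *p, Py_ssize_t len)
{
    if (0 < len && len < PY_HASH_CUTOFF) {
        Py_hash_t h = 5381;  /* DJB magic start value */
        for (Py_ssize_t i = 0; i < len; i++)
            h = ((h << 5) + h) ^ p[i];  /* h = h*33 XOR c */
        return h;
    }
    return siphash24_hash(p, len);  /* hypothetical call into SipHash24 */
}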
Good point. Can you also post all benchmark results? |
unmodified Python: 1000 loops, best of 3: 307 usec per loop (unicode)
SipHash: 1000 loops, best of 3: 300 usec per loop (unicode) |
New changeset c960bed22bf6 by Christian Heimes in branch 'default': |
I propose extracting all hash-related stuff from Include/object.h into a separate file, Include/pyhash.h. And perhaps moving Objects/hash.c to Python/pyhash.c. |
Since the hash algorithm is determined at compile time, the _Py_HashSecret_t structure and the _Py_HashSecret function are redundant. We only need to define the _Py_HashBytes function. Currently the SipHash algorithm doesn't work with unaligned data. |
Sure it does. The test for unaligned hashing passes without an error or a segfault. |
And note that the quality of the FNV hash function is reduced (msg186403). We need to "shuffle" the result's bits. |
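A common way to "shuffle" the bits is a finalization (avalanche) step; a sketch using MurmurHash3's fmix64 constants, not anything from the patch itself:

#include <stdint.h>

/* Spreads entropy from the high bits into the low bits, improving the
 * dispersion of the raw FNV value. */
static uint64_t fnv_finalize(uint64_t h)
{
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return h;
}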
On some platforms it just happens to work without a segfault. |
Added some structural comments to the patch. I'll defer to Serhiy when it comes to assessing the algorithm details :) |
I have created a clone for PEP-456 and applied your suggestions. I'm still looking for a nice API to handle the hash definition. Do you have some suggestions? |
Nick, can you do another review? All tests should pass on common boxes. The latest code hides the struct with the hash function. I have added a configure block that detects platforms that don't support unaligned memory access. It works correctly on the SPARC Solaris 10 box. I'm still looking for a 64-bit big-endian box and a 32-bit big-endian box that support unaligned memory. |
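For context, such a configure check usually compiles and runs a tiny probe program; a sketch of the general approach, assumed here, not the actual block from the patch:

#include <string.h>

/* Exits 0 where unaligned 8-byte loads work; platforms that require
 * alignment crash with SIGBUS instead, which configure records as
 * "aligned access required". */
int main(void)
{
    char buf[16];
    memset(buf, 0x55, sizeof buf);
    /* volatile keeps the compiler from optimizing the odd-address load away */
    volatile unsigned long long *p = (volatile unsigned long long *)(buf + 1);
    return *p == 0;  /* never true for the 0x55 pattern */
}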
Have you tried the PPC64 PowerLinux box? It's in the stable buildbots for a reason :-) |
I can no longer find the configuration for a custom repo path. It's still documented, but there is no field for "repo path". http://buildbot.python.org/all/buildslaves/edelsohn-powerlinux-ppc64 |
http://buildbot.python.org/all/builders/PPC64%20PowerLinux%20custom (usually, just replace "3.x" with "custom" in the URL) |
Nick, please review the latest patch. I have addressed Antoine's review in 257597d20fa8.diff. I'll update the PEP as soon as you are happy with the patch. |
I suggest moving only _Py_HashBytes() and the constants related to the string hash algorithm to Include/pyhash.h and Python/pyhash.c, and not touching PyObject_Hash(), _Py_HashDouble(), etc. Then anybody who wants to change the string hashing algorithm only needs to replace these two minimal files. |
PyObject_Hash() and PyObject_HashNotImplemented() should not have been moved to pyhash.h. But the other internal helper methods should be kept together. |
On 27 Oct 2013 23:46, "Christian Heimes" <report@bugs.python.org> wrote:
Comments from the others sound promising; I should be able to take a look |
Christian, why is PY_HASH_EXTERNAL here? Do you plan to use it in any official build? I think that in a custom build of Python the whole files pyhash.c and pyhash.h can be replaced. Once you get rid of PY_HASH_EXTERNAL, you can also get rid of PyHash_FuncDef, PyHash_Func, etc.

Why are _Py_HashDouble() and _Py_HashPointer() moved to pyhash.c? They are hash-algorithm agnostic, and it is unlikely they will be redefined in a custom build.

You don't need the HAVE_ALIGNED_REQUIRED macro if you use PY_UHASH_CPY (or something like it for exactly 64 bits) in siphash24. On platforms where aligned access is required you would use a per-byte copy, otherwise a fast 64-bit copy. |
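That last suggestion amounts to selecting the copy strategy at compile time, so siphash24 never dereferences a potentially unaligned 64-bit pointer directly. A sketch using the macro names from the comment; the real patch may spell this differently:

#include <string.h>

#ifdef HAVE_ALIGNED_REQUIRED
/* per-byte copy: safe on strict-alignment platforms (SPARC, some MIPS) */
#define PY_UHASH_CPY(dst, src) do {                                           \
        for (size_t i_ = 0; i_ < 8; i_++)                                     \
            ((unsigned char *)(dst))[i_] = ((const unsigned char *)(src))[i_]; \
    } while (0)
#else
/* fast path: compilers turn a fixed-size memcpy into one 64-bit load */
#define PY_UHASH_CPY(dst, src) memcpy((dst), (src), 8)
#endif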
On 28.10.2013 16:18, Serhiy Storchaka wrote:
Because you can't simply replace the files. It also contains
I don't understand why you want me to get rid of the struct. What's your
I have moved the functions to pyhash.c in order to keep all related
I'm not going to make siphash24 compatible with platforms that require aligned memory access. Seriously, nobody gives a ... about SPARC and MIPS. :) It's nice that |
Serhiy, I would like to land my patch before beta 1 hits the fan. We can always improve the code during the beta phase. Right now I don't want to mess around with the SipHash24 code. That includes non-64-bit platforms as well as architectures that enforce aligned memory access for integers. |
Well, unaligned memory access is usually slower on all architectures :-) |
On 28.10.2013 16:59, Charles-François Natali wrote:
On modern computers it's either not slower or just a tiny bit slower. Python's str and bytes datatypes are always aligned properly. The

You might see a 10% slowdown for very long and unaligned byte arrays on

Oh, and ARM has unaligned memory access: |
To support 32-bit Windows, the following code in PC/pyconfig.h can be modified to use __int64 or _W64 (see the ssize_t definition below in the same file):

#ifndef PY_UINT64_T
#if SIZEOF_LONG_LONG == 8
#define HAVE_UINT64_T 1
#define PY_UINT64_T unsigned PY_LONG_LONG
#endif
#endif |
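A sketch of the suggested modification; the _MSC_VER branch is an assumption about what the change could look like, not the committed code:

#ifndef PY_UINT64_T
#if SIZEOF_LONG_LONG == 8
#define HAVE_UINT64_T 1
#define PY_UINT64_T unsigned PY_LONG_LONG
#elif defined(_MSC_VER)
/* 32-bit MSVC: __int64 is available even where long long is not 8 bytes */
#define HAVE_UINT64_T 1
#define PY_UINT64_T unsigned __int64
#endif
#endif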
Victor: Nick: I'll update my PEP shortly and address the memory layout of _Py_HashSecret_t, the small string hashing optimization and performance/memory alignment. |
About memcpy(). Here is a sample file. Compile it to assembler:

gcc -O2 -S -masm=intel fnv.c

With memcpy() the main loop is compiled to:

.L3:

With per-byte copy it is compiled to:

.L3: |
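The point being made: gcc collapses a fixed-size memcpy() into a single 64-bit load, while a naive per-byte loop survives as eight byte loads plus shifts. A reconstruction of the two variants (the attached sample file itself was not preserved):

#include <stdint.h>
#include <string.h>

/* memcpy variant: at -O2 this becomes one mov, no function call */
uint64_t load64_memcpy(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* per-byte variant: kept as individual byte loads unless the compiler
 * recognizes the idiom */
uint64_t load64_bytes(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}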
I had to add the conversion from little-endian to host endianness. The missing conversion was degrading hash value dispersion. |
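The conversion in question byte-swaps each 8-byte input block on big-endian hosts so that the same bytes hash identically everywhere. A sketch, assuming a helper along these lines (CPython's pyhash.c spells this _le64toh):

#include <stdint.h>

static inline uint64_t le64_to_host(uint64_t x)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    /* reverse the byte order of the little-endian input word */
    return ((x & 0xffULL)       << 56) | ((x & 0xff00ULL)     << 40) |
           ((x & 0xff0000ULL)   << 24) | ((x & 0xff000000ULL) <<  8) |
           ((x >>  8) & 0xff000000ULL) | ((x >> 24) & 0xff0000ULL)   |
           ((x >> 40) & 0xff00ULL)     |  (x >> 56);
#else
    return x;  /* little-endian host: bytes are already in place */
#endif
}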
Hi Nick, I have updated the patch and the PEP text. The new version has small string hash optimization disabled. The other changes are mostly cleanup, reformatting and simplifications. Can you please do a review so I can get the patch into 3.4 before beta1 is released? |
I reviewed the latest PEP text at http://www.python.org/dev/peps/pep-0456/

I'm almost prepared to accept the current version of the implementation, but there's one technical decision to be clarified and a few placeholders in the PEP that need to be cleaned up prior to formal acceptance:
|
Here are benchmarks on two Linux machines. It looks like SipHash24 takes advantage of newer CPUs. I'm a bit puzzled about the results. Or maybe my super simple and naive analyzer doesn't give sensible results... |
Thanks - are those numbers with the current feature branch, and hence no small string optimization? To be completely clear, I'm happy to accept a performance penalty to fix the hash algorithm. I'd just like to know exactly how big a penalty I'm accepting, and whether taking advantage of the small string optimization makes it measurably smaller. |
For the record, it's better to use a geometric mean when aggregating benchmark results into a single score. |
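A sketch of that aggregation, assuming the inputs are per-benchmark speedup ratios:

#include <math.h>
#include <stddef.h>

/* exp(mean(log r_i)): with a geometric mean a 2x speedup on one
 * benchmark and a 2x slowdown on another cancel out exactly, which an
 * arithmetic mean gets wrong. */
double geometric_mean(const double *ratios, size_t n)
{
    double log_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        log_sum += log(ratios[i]);
    return exp(log_sum / (double)n);
}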
So, amusingly, Christian's patch seems to be 4-5% faster than vanilla on many benchmarks here (Sandy Bridge Core i5, 64-bit, gcc 4.8.1). A couple of benchmarks are a couple of percent slower, but nothing severe. This is without the small strings optimization. On top of that, the small strings optimization seems to have varying effects, some positive, some negative, so it doesn't seem to be desirable (on this platform anyway). |
Benchmark report (without the small strings optimization): |
The numbers compare cpython default tip with my feature branch. I pulled and merged all upstream changes into my feature branch yesterday. The results with "sso" in the file name are with small string optimization. Performance depends heavily on the compiler and CPU. In general SipHash24 is faster with modern compilers on modern CPUs. MSVC generates slower code for SipHash24, but I haven't optimized the code for MSVC yet. Perhaps Serhiy is able to do his magic. :)

PS: I have uploaded more benchmarks. The directories contain the raw output and CSV files. |
The PEP should be ready now. I have addressed your input in http://hg.python.org/peps/rev/fbe779221a7a |
New changeset adb471b9cba1 by Christian Heimes in branch 'default': |
New changeset 422ed27b62ce by Christian Heimes in branch 'default': |
New changeset 11cb1c8faf11 by Victor Stinner in branch 'default': |
test_gdb is not the only test that relies on exact repr() values; test_functools does too: http://buildbot.python.org/all/builders/AMD64%20OpenIndiana%203.x/builds/6875/steps/test/logs/stdio

======================================================================
Traceback (most recent call last):
File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_functools.py", line 174, in test_repr
repr(f))
AssertionError: 'func[51 chars]88>, b=<object object at 0xffffdd7fff790440>, [36 chars]00>)' != 'func[51 chars]88>, a=<object object at 0xffffdd7fff790400>, [36 chars]40>)'
- functools.partial(<function capture at 0xffffdd7ffe64d788>, b=<object object at 0xffffdd7fff790440>, a=<object object at 0xffffdd7fff790400>)
+ functools.partial(<function capture at 0xffffdd7ffe64d788>, a=<object object at 0xffffdd7fff790400>, b=<object object at 0xffffdd7fff790440>)

======================================================================
Traceback (most recent call last):
File "/export/home/buildbot/64bits/3.x.cea-indiana-amd64/build/Lib/test/test_functools.py", line 174, in test_repr
repr(f))
AssertionError: 'Part[49 chars]88>, b=<object object at 0xffffdd7fff790540>, [36 chars]00>)' != 'Part[49 chars]88>, a=<object object at 0xffffdd7fff790500>, [36 chars]40>)'
- PartialSubclass(<function capture at 0xffffdd7ffe64d788>, b=<object object at 0xffffdd7fff790540>, a=<object object at 0xffffdd7fff790500>)
+ PartialSubclass(<function capture at 0xffffdd7ffe64d788>, a=<object object at 0xffffdd7fff790500>, b=<object object at 0xffffdd7fff790540>) |
New changeset 961d832d8734 by Christian Heimes in branch 'default': |
The problems have been resolved. |
New changeset eec4758e3a45 by Victor Stinner in branch 'default': |