Message 286609 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	lemburg
Recipients	christian.heimes, lemburg, methane, rhettinger, serhiy.storchaka
Date	2017-02-01.10:13:33
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<4662a5f9-ebbc-bf12-730d-d4d514dae563@egenix.com>
In-reply-to	<1485942644.07.0.463511504547.issue29410@psf.upfronthosting.co.za>

Content
On 01.02.2017 10:50, INADA Naoki wrote: > >> it seems as if it would make sense to not use a fixed >> hash algorithm for all strings lengths, but instead a >> hybrid one to increase performance for short strings >> (which are used a lot in Python). >> >> Is there a good hash algorithm with provides better >> performance for short strings than siphash ? > > There is undocumented option "Py_HASH_CUTOFF" to use DJBX33A for short string. > > See https://github.com/python/cpython/blob/master/Include/pyhash.h#L98 Thanks. I found a reference in the PEP 456: """ ... However a fast function like DJBX33A is not as secure as SipHash24. A cutoff at about 5 to 7 bytes should provide a decent safety margin and speed up at the same time. The PEP's reference implementation provides such a cutoff with Py_HASH_CUTOFF . The optimization is disabled by default for several reasons. For one the security implications are unclear yet and should be thoroughly studied before the optimization is enabled by default. Secondly the performance benefits vary. On 64 bit Linux system with Intel Core i7 multiple runs of Python's benchmark suite [pybench] show an average speedups between 3% and 5% for benchmarks such as django_v2, mako and etree with a cutoff of 7. Benchmarks with X86 binaries and Windows X86_64 builds on the same machine are a bit slower with small string optimization. The state of small string optimization will be assessed during the beta phase of Python 3.4. The feature will either be enabled with appropriate values or the code will be removed before beta 2 is released. """ The mentioned speedup sounds like enabling this by default would make a lot of sense (keeping the compile time option of setting Py_HASH_CUTOFF to 0). Aside: I haven't been following this since the days I proposed the collision counting method as solution to the vulnerability, which was rejected at the time. It's interesting how much effort went into trying to fix the string hashing in dicts over the years. Yet the much easier to use integer key hash collisions have still not been addressed.

On 01.02.2017 10:50, INADA Naoki wrote:
> 
>> it seems as if it would make sense to not use a fixed
>> hash algorithm for all strings lengths, but instead a
>> hybrid one to increase performance for short strings
>> (which are used a lot in Python).
>>
>> Is there a good hash algorithm with provides better
>> performance for short strings than siphash ?
> 
> There is undocumented option "Py_HASH_CUTOFF" to use DJBX33A for short string.
> 
> See https://github.com/python/cpython/blob/master/Include/pyhash.h#L98

Thanks. I found a reference in the PEP 456:

"""
...
However a fast function like DJBX33A is not as secure as SipHash24. A
cutoff at about 5 to 7 bytes should provide a decent safety margin and
speed up at the same time. The PEP's reference implementation provides
such a cutoff with Py_HASH_CUTOFF . The optimization is disabled by
default for several reasons. For one the security implications are
unclear yet and should be thoroughly studied before the optimization is
enabled by default. Secondly the performance benefits vary. On 64 bit
Linux system with Intel Core i7 multiple runs of Python's benchmark
suite [pybench] show an average speedups between 3% and 5% for
benchmarks such as django_v2, mako and etree with a cutoff of 7.
Benchmarks with X86 binaries and Windows X86_64 builds on the same
machine are a bit slower with small string optimization.

The state of small string optimization will be assessed during the beta
phase of Python 3.4. The feature will either be enabled with appropriate
values or the code will be removed before beta 2 is released.
"""

The mentioned speedup sounds like enabling this by default
would make a lot of sense (keeping the compile time option of
setting Py_HASH_CUTOFF to 0).

Aside: I haven't been following this since the days I proposed the
collision counting method as solution to the vulnerability, which was
rejected at the time. It's interesting how much effort went into
trying to fix the string hashing in dicts over the years. Yet the
much easier to use integer key hash collisions have still not
been addressed.

History
Date	User	Action	Args
2017-02-01 10:13:33	lemburg	set	recipients: + lemburg, rhettinger, christian.heimes, methane, serhiy.storchaka
2017-02-01 10:13:33	lemburg	link	issue29410 messages
2017-02-01 10:13:33	lemburg	create