This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author lemburg
Recipients christian.heimes, lemburg, methane, rhettinger, serhiy.storchaka
Date 2017-02-01.10:13:33
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <4662a5f9-ebbc-bf12-730d-d4d514dae563@egenix.com>
In-reply-to <1485942644.07.0.463511504547.issue29410@psf.upfronthosting.co.za>
Content
On 01.02.2017 10:50, INADA Naoki wrote:
> 
>> it seems as if it would make sense to not use a fixed
>> hash algorithm for all strings lengths, but instead a
>> hybrid one to increase performance for short strings
>> (which are used a lot in Python).
>>
>> Is there a good hash algorithm with provides better
>> performance for short strings than siphash ?
> 
> There is undocumented option "Py_HASH_CUTOFF" to use DJBX33A for short string.
> 
> See https://github.com/python/cpython/blob/master/Include/pyhash.h#L98

Thanks. I found a reference in the PEP 456:

"""
...
However a fast function like DJBX33A is not as secure as SipHash24. A
cutoff at about 5 to 7 bytes should provide a decent safety margin and
speed up at the same time. The PEP's reference implementation provides
such a cutoff with Py_HASH_CUTOFF . The optimization is disabled by
default for several reasons. For one the security implications are
unclear yet and should be thoroughly studied before the optimization is
enabled by default. Secondly the performance benefits vary. On 64 bit
Linux system with Intel Core i7 multiple runs of Python's benchmark
suite [pybench] show an average speedups between 3% and 5% for
benchmarks such as django_v2, mako and etree with a cutoff of 7.
Benchmarks with X86 binaries and Windows X86_64 builds on the same
machine are a bit slower with small string optimization.

The state of small string optimization will be assessed during the beta
phase of Python 3.4. The feature will either be enabled with appropriate
values or the code will be removed before beta 2 is released.
"""

The mentioned speedup sounds like enabling this by default
would make a lot of sense (keeping the compile time option of
setting Py_HASH_CUTOFF to 0).

Aside: I haven't been following this since the days I proposed the
collision counting method as solution to the vulnerability, which was
rejected at the time. It's interesting how much effort went into
trying to fix the string hashing in dicts over the years. Yet the
much easier to use integer key hash collisions have still not
been addressed.
History
Date User Action Args
2017-02-01 10:13:33lemburgsetrecipients: + lemburg, rhettinger, christian.heimes, methane, serhiy.storchaka
2017-02-01 10:13:33lemburglinkissue29410 messages
2017-02-01 10:13:33lemburgcreate