This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author lemburg
Recipients Mark.Shannon, christian.heimes, erlendaasland, gvanrossum, lemburg, methane, rhettinger, serhiy.storchaka, vstinner
Date 2021-10-07.13:58:27
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <44b84420-6b63-940c-0d01-f10e5613217e@egenix.com>
In-reply-to <1633613394.26.0.149303567264.issue29410@roundup.psfhosted.org>
Content
On 07.10.2021 15:29, Christian Heimes wrote:
> 
> Christian Heimes <lists@cheimes.de> added the comment:
> 
> JP got back to me
> 
> On 07/10/2021 14.34, Jean-Philippe Aumasson wrote:
>> xxHash is much faster indeed, but collisions seem trivial to find, which 
>> might allow hash-flood DoS again (see for example 
>> https://github.com/Cyan4973/xxHash/issues/180 
>> <https://github.com/Cyan4973/xxHash/issues/180>). It's however unclear 
>> whether exploitable multicollisions can also be trivially found.
>>
>> If collisions don't matter and if the ~10x speed-up makes a difference, 
>> then probably a good option, but guess you'll need to keep SipHash (or 
>> some other safe hash) when DoS resistance is needed?
> 
> This information disqualifies xxHash for our use case.

The quoted issue was for an early version of XXH3.

Please see
https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison
as reference for collision analysis of the current xxHash versions.

The numbers are close to the expected case, meaning that
collisions are not more frequent than statistically to be
expected given the hash and the test sample size.

Looking at this older comparison, it would also make sense to
revisit the hybrid approach, e.g. use FNV for strings
up to 16 bytes, XXH128 for longer strings:

https://cglab.ca/~abeinges/blah/hash-rs/

Given that dictionaries often use relatively short string keys,
this should have a significant effect on applications where the
dictionaries don't just use interned strings.

It would also have an effect on Python startup time, since all
those interned strings need to have their hash calculated
during startup.
History
Date User Action Args
2021-10-07 13:58:27lemburgsetrecipients: + lemburg, gvanrossum, rhettinger, vstinner, christian.heimes, methane, Mark.Shannon, serhiy.storchaka, erlendaasland
2021-10-07 13:58:27lemburglinkissue29410 messages
2021-10-07 13:58:27lemburgcreate