Message 325891 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	jdemeyer
Recipients	jdemeyer, rhettinger
Date	2018-09-20.15:45:11
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1537458311.74.0.956365154283.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
> Nor is it hard to come up with collisions for most simple hash functions. The Bernstein hash algorithm is a simple algorithm for which it can be proven mathematically that collisions cannot be generated if the multiplier is unknown. That is an objective improvement over the current tuple hash. Of course, in practice the multiplier is not secret, but it suffices that it is random enough. > Addition also has a relationship to multiplication. Of course, but not in a way that can be used to generate hash collisions. In fact, the simple relation between addition and multiplication makes this an actual provable mathematical statement. > That is more suited to character data and small hash ranges (as opposed to 64-bit). Maybe you are thinking of the multiplier 33 which is classically used for character data? If you replace the multiplier 33 by a larger number like _PyHASH_MULTIPLIER == 1000003 (ideally it would be a truly random number), it works fine for arbitrary sequences and arbitrary bit lengths. > The important thing is that it has passed our testing That's a silly argument. If it passed testing, it is only because the testing is insufficient. But really: what I am proposing is a better hash without obvious collisions which won't be slower than the existing hash. Why would you be against that? Why should we "roll our own" hash instead of using a known good algorithm?

> Nor is it hard to come up with collisions for most simple hash functions.

The Bernstein hash algorithm is a simple algorithm for which it can be proven mathematically that collisions cannot be generated if the multiplier is unknown. That is an objective improvement over the current tuple hash. Of course, in practice the multiplier is not secret, but it suffices that it is random enough.

> Addition also has a relationship to multiplication.

Of course, but not in a way that can be used to generate hash collisions. In fact, the simple relation between addition and multiplication makes this an actual provable mathematical statement.

> That is more suited to character data and small hash ranges (as opposed to 64-bit).

Maybe you are thinking of the multiplier 33 which is classically used for character data? If you replace the multiplier 33 by a larger number like _PyHASH_MULTIPLIER == 1000003 (ideally it would be a truly random number), it works fine for arbitrary sequences and arbitrary bit lengths.

> The important thing is that it has passed our testing

That's a silly argument. If it passed testing, it is only because the testing is insufficient.

But really: what I am proposing is a better hash without obvious collisions which won't be slower than the existing hash. Why would you be against that? Why should we "roll our own" hash instead of using a known good algorithm?

History
Date	User	Action	Args
2018-09-20 15:45:11	jdemeyer	set	recipients: + jdemeyer, rhettinger
2018-09-20 15:45:11	jdemeyer	set	messageid: <1537458311.74.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-20 15:45:11	jdemeyer	link	issue34751 messages
2018-09-20 15:45:11	jdemeyer	create