Message 326032 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	jdemeyer
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-09-21.20:27:56
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1537561676.26.0.956365154283.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
Concerning the MULTIPLIER: > Why do you claim the original was "too small"? Too small for what purpose? If the multiplier is too small, then the resulting hash values are small too. This causes collisions to appear for smaller numbers: BEFORE: >>> L = [(a,b) for a in range(2) for b in range(221)] >>> len(L), len(set(hash(x) for x in L)) (4194304, 2097152) AFTER: >>> L = [(a,b) for a in range(2) for b in range(221)] >>> len(L), len(set(hash(x) for x in L)) (4194304, 4194304) Again, not a huge problem, but at least there is room for improvement. > Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat? What was _the cause_ of the "sporadic failure" (whatever that means) With "sporadic failure", I mean a hash collision not due to the algorithm but due to two numbers which happen to be the same. This might not even affect actual usage, but it did cause the testsuite to fail on 32 bit machines (48 collisions if I recall correctly while only 20 are allowed). > and why did adding 2 fix it? Because it's really just random chance. Some constants just happen to produce more collisions on the specific testsuite input. It should also be said that, due to the structure of that input, collisions are not independent. For example, if hash((a,b,c)) == hash((d,e,f)), then also hash((a,b,c,x)) == hash((d,e,f,x)) for example. > Why isn't there a single word in _the code_ about where the mystery numbers came from? Ideally, it should be just a random number. I can fix that by using a "Nothing up my sleeve number".

Concerning the MULTIPLIER:

> Why do you claim the original was "too small"?  Too small for what purpose?

If the multiplier is too small, then the resulting hash values are small too. This causes collisions to appear for smaller numbers:

BEFORE:
>>> L = [(a,b) for a in range(2) for b in range(2**21)]
>>> len(L), len(set(hash(x) for x in L))
(4194304, 2097152)

AFTER:
>>> L = [(a,b) for a in range(2) for b in range(2**21)]
>>> len(L), len(set(hash(x) for x in L))
(4194304, 4194304)

Again, not a huge problem, but at least there is room for improvement.

> Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat?  What was _the cause_ of the "sporadic failure" (whatever that means)

With "sporadic failure", I mean a hash collision not due to the algorithm but due to two numbers which happen to be the same. This might not even affect actual usage, but it did cause the testsuite to fail on 32 bit machines (48 collisions if I recall correctly while only 20 are allowed).

> and why did adding 2 fix it?

Because it's really just random chance. Some constants just happen to produce more collisions on the specific testsuite input. It should also be said that, due to the structure of that input, collisions are not independent. For example, if hash((a,b,c)) == hash((d,e,f)), then also hash((a,b,c,x)) == hash((d,e,f,x)) for example.

> Why isn't there a single word in _the code_ about where the mystery numbers came from?

Ideally, it should be just a random number. I can fix that by using a "Nothing up my sleeve number".

History
Date	User	Action	Args
2018-09-21 20:27:56	jdemeyer	set	recipients: + jdemeyer, tim.peters, rhettinger, mark.dickinson, eric.smith, sir-sigurd
2018-09-21 20:27:56	jdemeyer	set	messageid: <1537561676.26.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-21 20:27:56	jdemeyer	link	issue34751 messages
2018-09-21 20:27:56	jdemeyer	create