Message 326067 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-09-22.03:47:32
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1537588052.88.0.956365154283.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
>> Why do you claim the original was "too small"? Too small for >> what purpose? > If the multiplier is too small, then the resulting hash values are > small too. This causes collisions to appear for smaller numbers: All right! An actual reason ;-) So there are two things so far I clearly agree with, but they're independent of "the problem" this report was opened about: 1. On a 64-bit box life would be better if we picked a bigger multiplier. This should be done via preprocessor #if selection, though, not via the inscrutable rules C uses to cut back an integer literal out of range for the type of the variable it's assigned to. 2. A raw integer hash of -1 should be changed to something other than -2. It was always a Poor Idea to pick an integer of small magnitude for the substitute. Some newer types didn't repeat this mistake. For example, frozensets replace a raw hash of -1 with 590923713. Alas, this is hard to change now, because hashable objects of any type that compare equal to -1 also need to return the same substitute value (e.g., hash(-1.0) == hash(decimal.Decimal("-100e-2")) == hash(-1+0j) == -2). So I'm afraid it's been hard-coded into user-defined numeric classes too :-( There would be no obvious problem in changing the _tuple_ hash to use a different substitute value, although it's not clear that would buy anything either. hash(-1) == -2 was the original sin. [about multipliers] > Because it's really just random chance. > ... > Ideally, it should be just a random number. How do you know this? I'm not aware of any research papers about picking multipliers in this context, but would love to see one. I am aware of much research about multipliers in the somewhat related contexts of linear congruential pseudo-random number generators, and of "multiplicative hashing", and "any random multiplier is fine" couldn't be further from the truth _in_ those contexts. Guido picked a prime to begin with (actually, at first it was 3(!)), and comments in the current code sure say that whoever added this part was keen on primes too: """ /* The addend 82520, was selected from the range(0, 1000000) for generating the greatest number of prime multipliers for tuples up to length eight: """ I don't know that primes are important here, but neither do I know that they're _not_ important here. Widely used production code isn't the place to conduct experiments, so the status quo is favored in the absence of truly compelling reasons. Just repeating the raw assertion "random is fine" isn't compelling. If it were true, we could, e.g., multiply by a large power of 2 via shifting instead, and save a few cycles. For my best guess at what "the problem" here actually is, in the >>> cands = list(range(-10, -1)) + list(range(9)) >>> len(set(hash(t) for t in product(cands, repeat=4))) example, which currently (on my 64-bit box) generates 35,380 distinct hash codes from the 104,976 inputs, ALL the collisions go away merely by changing the character "^" to "+" in the current code. Which was your original suggestion. Which you appear to be opposed to now? I'm unclear about why, if so. That's a change I could probably live with, although I have no strong reason to believe it couldn't cause problems for applications that currently work fine. I don't have more time to think about that now, though.

>> Why do you claim the original was "too small"?  Too small for
>> what purpose?

> If the multiplier is too small, then the resulting hash values are
> small too. This causes collisions to appear for smaller numbers:

All right!  An actual reason ;-)  So there are two things so far I clearly agree with, but they're independent of "the problem" this report was opened about:

1. On a 64-bit box life would be better if we picked a bigger multiplier.  This should be done via preprocessor #if selection, though, not via the inscrutable rules C uses to cut back an integer literal out of range for the type of the variable it's assigned to.

2. A raw integer hash of -1 should be changed to something other than -2.  It was always a Poor Idea to pick an integer of small magnitude for the substitute.  Some newer types didn't repeat this mistake.  For example, frozensets replace a raw hash of -1 with 
590923713.  Alas, this is hard to change now, because hashable objects of any type that compare equal to -1 also need to return the same substitute value (e.g., hash(-1.0) == hash(decimal.Decimal("-100e-2")) == hash(-1+0j) == -2).  So I'm afraid it's been hard-coded into user-defined numeric classes too :-(  There would be no obvious problem in changing the _tuple_ hash to use a different substitute value, although it's not clear that would buy anything either.  hash(-1) == -2 was the original sin.


[about multipliers]
> Because it's really just random chance.
> ...
> Ideally, it should be just a random number.

How do you know this?  I'm not aware of any research papers about picking multipliers in this context, but would love to see one.

I am aware of much research about multipliers in the somewhat related contexts of linear congruential pseudo-random number generators, and of "multiplicative hashing", and "any random multiplier is fine" couldn't be further from the truth _in_ those contexts.

Guido picked a prime to begin with (actually, at first it was 3(!)), and comments in the current code sure say that whoever added this part was keen on primes too:

"""
/* The addend 82520, was selected from the range(0, 1000000) for
   generating the greatest number of prime multipliers for tuples
   up to length eight:
"""

I don't know that primes are important here, but neither do I know that they're _not_ important here.  Widely used production code isn't the place to conduct experiments, so the status quo is favored in the absence of truly compelling reasons.  Just repeating the raw assertion "random is fine" isn't compelling.  If it were true, we could, e.g., multiply by a large power of 2 via shifting instead, and save a few cycles.

For my best guess at what "the problem" here actually is, in the

>>> cands = list(range(-10, -1)) + list(range(9))
>>> len(set(hash(t) for t in product(cands, repeat=4)))

example, which currently (on my 64-bit box) generates 35,380 distinct hash codes from the 104,976 inputs, ALL the collisions go away merely by changing the character "^" to "+" in the current code.

Which was your original suggestion.  Which you appear to be opposed to now?  I'm unclear about why, if so.

That's a change I could probably live with, although I have no strong reason to believe it couldn't cause problems for applications that currently work fine.  I don't have more time to think about that now, though.

History
Date	User	Action	Args
2018-09-22 03:47:32	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-22 03:47:32	tim.peters	set	messageid: <1537588052.88.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-22 03:47:32	tim.peters	link	issue34751 messages
2018-09-22 03:47:32	tim.peters	create