Message 326302 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-09-25.00:13:21
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1537834402.61.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
> advantage of my approach is that high-order bits become more > important: I don't much care about high-order bits, beyond that we don't systematically _lose_ them. The dict and set lookup routines have their own strategies for incorporating high-order bits into the probe sequence, and those routines are the primary reason hash() exists in Python. So there's scant reason to try to let high bits influence low bits (but good reason _not_ to, because - as explained before - the low-order bits are already in great shape for the important tuple component types). Something else appears to be going on in this specific example, which has a very touchy set of hash codes: > BEFORE: > >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T)) > 500000 Which is what I get too on (at least) a 64-bit build. > AFTER: > >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T)) > 1000000 Ditto. However, I get exactly the same from the code I posted before, which has no right shifts at all. More, if I take the code exactly as it exists today, and merely comment out the mult += (Py_hash_t)(82520UL + len + len); line, I also get a million distinct hash codes. Half of the hash codes have bit 260 set, and apart from those all the hash codes have no bits set higher than 26. Any number of accidents could cause the 2**60 bits to wipe each other out, and merely changing the sequence of multipliers was demonstrably enough to stop that. Indeed, merely changing 82520 to 82524 in the current code is enough to get a million unique hash codes in this case.

> advantage of my approach is that high-order bits become more
> important:

I don't much care about high-order bits, beyond that we don't systematically _lose_ them.  The dict and set lookup routines have their own strategies for incorporating high-order bits into the probe sequence, and those routines are the primary reason hash() exists in Python.

So there's scant reason to try to let high bits influence low bits (but good reason _not_ to, because - as explained before - the low-order bits are already in great shape for the important tuple component types).

Something else appears to be going on in this specific example, which has a very touchy set of hash codes:

> BEFORE:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
> 500000

Which is what I get too on (at least) a 64-bit build.

> AFTER:
> >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
> 1000000

Ditto.  However, I get exactly the same from the code I posted before, which has no right shifts at all.

More, if I take the code exactly as it exists today, and merely comment out the

    mult += (Py_hash_t)(82520UL + len + len);

line, I also get a million distinct hash codes.

Half of the hash codes have bit 2**60 set, and apart from those all the hash codes have no bits set higher than 2**6.  Any number of accidents could cause the 2**60 bits to wipe each other out, and merely changing the sequence of multipliers was demonstrably enough to stop that.

Indeed, merely changing 82520 to 82524 in the current code is enough to get a million unique hash codes in this case.

History
Date	User	Action	Args
2018-09-25 00:13:22	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-25 00:13:22	tim.peters	set	messageid: <1537834402.61.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-09-25 00:13:22	tim.peters	link	issue34751 messages
2018-09-25 00:13:21	tim.peters	create