Message 326278 - Python tracker


Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-09-24.18:05:12
Message-id	<1537812312.33.0.956365154283.issue34751@psf.upfronthosting.co.za>

Content

> when you do t ^= t << 7, then you are not changing
> the lower 7 bits at all.

I want to leave low-order hash bits alone.  That's deliberate.
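The quoted observation is easy to confirm: `t << 7` has zeros in its 7 lowest bits, so XORing it into `t` cannot change them. A minimal sketch, assuming a 64-bit unsigned hash word (the mask and the helper name `xorshift_left` are illustrative, not taken from any patch under discussion):

```python
MASK64 = (1 << 64) - 1  # assumption: emulate a 64-bit unsigned hash word


def xorshift_left(t, shift=7):
    """Return t ^ (t << shift), truncated to 64 bits."""
    return (t ^ (t << shift)) & MASK64


low_mask = (1 << 7) - 1  # selects the 7 lowest bits
samples = [0, 1, 0x7F, 0x80, 12345, 0xDEADBEEF, MASK64]

# The shifted copy contributes nothing below bit 7, so the low bits survive.
unchanged = all(
    (xorshift_left(t) & low_mask) == (t & low_mask) for t in samples
)
print(unchanged)  # True
```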

The most important tuple component types, for tuples that are hashable, are strings and contiguous ranges of "not huge" ints.  The current string hash works hard to "act randomly" in every bit position - there's no point in trying to "improve" anything about the hashes it already produces.

In contrast, the hashes of any contiguous range of N not-huge integers (excluding -1) are already, by construction, guaranteed to have no collisions at all in their low-order (roughly) log2(N) bits.  They can't be improved in this respect, because they're already optimal in this respect.

So, if anything, I'd look at increasing the left shift rather than reducing it.  The low bits are already un-improvable in the most important cases.
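The no-collision claim for contiguous ints can be checked directly. The sketch below relies on CPython's behavior that `hash(i) == i` for ints of small magnitude, except that `hash(-1)` is `-2`; the helper name `low_bits_collide` is hypothetical.

```python
def low_bits_collide(start, n):
    """True if two ints in range(start, start + n) share their low k hash
    bits, where k = ceil(log2(n)) (at least 1)."""
    k = max(1, (n - 1).bit_length())
    mask = (1 << k) - 1
    seen = {hash(i) & mask for i in range(start, start + n)}
    return len(seen) != n


print(low_bits_collide(0, 1000))      # False
print(low_bits_collide(-5000, 1000))  # False: the range does not include -1
print(low_bits_collide(-10, 20))      # True: hash(-1) == hash(-2) == -2
```

The last call shows why -1 has to be excluded: it is the one small int whose hash is not its own value.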


> So applications using hash(x) % 128 will still see
> all the problems that we are trying to fix.

?  The only "problem" I've seen identified was in mixing positive and negative integers.  Here on a 32-bit build with the FNV-1a 32-bit multiplier:

>>> from itertools import product
>>> from collections import Counter
>>> cands = list(range(-50, 51))
>>> cands.remove(-1)
>>> c = Counter()
>>> for t in product(cands, repeat=4):
...     c[hash(t) & 0x7f] += 1
>>> len(c)
128

So all 128 lower 7-bit patterns showed up.  And they're quite evenly distributed (a perfectly uniform spread of the 100**4 tuples would put 781250 in each bucket):

>>> c2 = Counter(c.values())
>>> for k in sorted(c2):
...     print(k, c2[k])
...
781202 1
781207 1
781209 2
781212 1
781213 2
781214 1
781215 4
781221 3
781222 1
781225 3
781226 1
781227 3
781228 2
781229 2
781230 1
781231 1
781232 2
781233 3
781234 1
781235 4
781236 2
781237 1
781238 2
781240 5
781241 6
781242 1
781243 1
781244 1
781245 1
781247 1
781248 2
781251 2
781252 4
781253 3
781254 5
781255 2
781256 2
781257 3
781259 2
781260 1
781261 1
781262 1
781263 2
781264 4
781265 2
781266 1
781267 1
781268 4
781269 1
781270 1
781271 2
781272 1
781274 2
781275 1
781276 1
781278 1
781280 1
781281 2
781282 2
781285 1
781286 2
781288 1
781295 1
781297 2
781301 1
781302 1
781304 1
781307 1
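The numbers above came from a modified interpreter build, so they cannot be reproduced on a stock Python. As a rough pure-Python approximation, the sketch below applies an FNV-1a-style loop to whole item hashes with the standard 32-bit FNV prime and offset basis. How the xor-shift and the multiply were actually combined in the tested build is not stated in this message; the function `fnv1a_tuple_hash` and its structure are assumptions for illustration only. It uses `repeat=2` rather than 4 to keep the pure-Python run short.

```python
from collections import Counter
from itertools import product

FNV32_PRIME = 16777619      # standard 32-bit FNV prime
FNV32_OFFSET = 2166136261   # standard 32-bit FNV offset basis
MASK32 = 0xFFFFFFFF


def fnv1a_tuple_hash(items):
    """Hypothetical stand-in: FNV-1a over item hashes instead of bytes."""
    h = FNV32_OFFSET
    for item in items:
        t = hash(item) & MASK32     # item hash as an unsigned 32-bit word
        t ^= (t << 7) & MASK32      # assumed step; low 7 bits of t unchanged
        h = ((h ^ t) * FNV32_PRIME) & MASK32
    return h


cands = list(range(-50, 51))
cands.remove(-1)

# Count how often each low 7-bit pattern occurs over all 100**2 pairs.
c = Counter(fnv1a_tuple_hash(t) & 0x7F for t in product(cands, repeat=2))
print(len(c))  # 128
```

Because the multiplier is odd, multiplication is a bijection on the low 7 bits, which is why every pattern is reachable here even with only two components.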

> With the standard FNV multiplier on 64 bits, I did
> get collisions while testing.

When testing what, specifically?  And the standard 32-bit FNV multiplier, or the standard 64-bit FNV multiplier?

> Instead, I chose 3**41 as multiplier. But of course,
> there are still plenty of bikeshedding opportunities
> for the multiplier...

Not for me.  If the universally used 64-bit FNV multiplier can't be used in the context of Python's tuple hash fiddled to use an almost-pure form of FNV-1a, then that approach is dead to me.  Needing to pick multipliers out of thin air instead implies the theory underlying FNV-1a doesn't transfer to this context.

About which I have no current opinion.  It may or may not.  Since we're flying blind, I'm just willing to _assume_ it does until proven otherwise by testing.

History
Date	User	Action	Args
2018-09-24 18:05:12	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-24 18:05:12	tim.peters	set	messageid: <1537812312.33.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-24 18:05:12	tim.peters	link	issue34751 messages
2018-09-24 18:05:12	tim.peters	create