Message 326117 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	jdemeyer
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-09-22.21:38:06
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1537652286.94.0.956365154283.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
> the made-up hacks Python used to worm around a class of gross flaws its prior DJBX33X approach suffered, taking DJBX33X out of its original context and applying it in an area it wasn't designed for. But we know why the DJBX33A hash didn't work (nested tuples), so we can think of the best way to solve that. Python messes with the multiplier, which makes it quite a different hash. Surely, if you believe that the precise choice of multiplier matters a lot, then you should also agree that arbitrarily changing the multiplier in the loop is a bad idea. My proposal instead is to keep the structure of the DJBX33A hash but change the hash of the individual items to be hashed. That's a much less invasive change to the known algorithm. Finally, something that I haven't mentioned here: an additional advantage of my approach is that high-order bits become more important: BEFORE: >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T)) 500000 AFTER: >>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T)) 1000000 Again, I'm not claiming that this is a major issue. Just additional evidence that maybe my new hash might actually be slightly better than the existing hash.

> the made-up hacks Python used to worm around a class of gross flaws its prior DJBX33X approach suffered, taking DJBX33X out of its original context and applying it in an area it wasn't designed for.

But we know why the DJBX33A hash didn't work (nested tuples), so we can think of the best way to solve that. Python messes with the multiplier, which makes it quite a different hash. Surely, if you believe that the precise choice of multiplier matters a lot, then you should also agree that arbitrarily changing the multiplier in the loop is a bad idea.

My proposal instead is to keep the structure of the DJBX33A hash but change the hash of the individual items to be hashed. That's a much less invasive change to the known algorithm.

Finally, something that I haven't mentioned here: an additional advantage of my approach is that high-order bits become more important:

BEFORE:
>>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
500000

AFTER:
>>> L = [n << 60 for n in range(100)]; T = [(a,b,c) for a in L for b in L for c in L]; len(set(hash(x) for x in T))
1000000

Again, I'm not claiming that this is a major issue. Just additional evidence that maybe my new hash might actually be slightly better than the existing hash.

History
Date	User	Action	Args
2018-09-22 21:38:06	jdemeyer	set	recipients: + jdemeyer, tim.peters, rhettinger, mark.dickinson, eric.smith, sir-sigurd
2018-09-22 21:38:06	jdemeyer	set	messageid: <1537652286.94.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-22 21:38:06	jdemeyer	link	issue34751 messages
2018-09-22 21:38:06	jdemeyer	create