Message 326024 - Python tracker


Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-09-21.19:44:23
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1537559064.0.0.956365154283.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content

For me, it's largely because you make raw assertions with extreme confidence that the first thing you think of off the top of your head can't possibly make anything else worse.  When it turns out it does make some things worse, you're equally confident that the second thing you think of is (also) perfect.

So far we don't even have a real-world test case, let alone a coherent characterization of "the problem":

"""
all involving negative numbers) due to the same underlying mathematical reason.
"""

is a raw assertion, not a characterization.  The only "mathematical reason" you gave before is that "j odd implies j^(-2) == -j, so that m*(j^(-2)) == -m".  Which is true, and a good observation, but doesn't generalize as-is beyond -2.
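A quick check of that identity, as a minimal sketch (here `^` is Python's bitwise XOR, and the helper name `xor_minus_two` is made up for illustration). It also shows what happens for even j, where the result is not a plain negation:

```python
def xor_minus_two(j):
    # -2 has every bit set except the lowest, so XOR with it
    # flips every bit of j except the lowest one.
    return j ^ -2

# Odd j: the result is exactly -j.
odd_ok = all(xor_minus_two(j) == -j for j in range(-99, 100, 2))

# Even j: the result is -j - 2 instead, so the "negation" reading
# only applies to odd inputs.
even_ok = all(xor_minus_two(j) == -j - 2 for j in range(-100, 101, 2))

print(odd_ok, even_ok)
```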

Stuff like this from the PR doesn't inspire confidence either:

"""
MULTIPLIER = (1000003)**3 + 2 = 1000009000027000029: the multiplier should be big enough and the original 20-bit number is too small for a 64-bit hash. So I took the third power. Unfortunately, this caused a sporadic failure of the testsuite on 32-bit systems. So I added 2 which solved that problem.
"""

Why do you claim the original was "too small"?  Too small for what purpose?  As before, we don't care whether Python hashes "act random".  Why, when raising it to the third power apparently didn't work, did you pull "2" out of a hat?  What was _the cause_ of the "sporadic failure" (whatever that means), and why did adding 2 fix it?  Why isn't there a single word in _the code_ about where the mystery numbers came from?

You're creating at least as many mysteries as you're claiming to solve.

We're not going to change such heavily used code on a whim.

That said, you could have easily enough _demonstrated_ that there's potentially a real problem with a mix of small integers of both signs:

>>> from itertools import product
>>> cands = list(range(-10, -1)) + list(range(9))
>>> len(cands)
18
>>> _ ** 4
104976
>>> len(set(hash(t) for t in product(cands, repeat=4)))
35380

And that this isn't limited to -2 being in the mix (and noting that -1 wasn't in the mix to begin with):

>>> cands.remove(-2)
>>> len(cands) ** 4
83521
>>> len(set(hash(t) for t in product(cands, repeat=4)))
33323

If that's "the problem", then - sure - it _may_ be worth addressing.  Which we would normally do by looking for a minimal change to code that's been working well for over a decade, not by replacing the whole thing "just because".

BTW, continuing the last example:

>>> from collections import Counter
>>> c1 = Counter(hash(t) for t in product(cands, repeat=4))
>>> Counter(c1.values())
Counter({1: 11539, 2: 10964, 4: 5332, 3: 2370, 8: 1576, 6: 1298, 5: 244})

So despite that there were many collisions, the max number of times any single hash code appeared was 8.  That's unfortunate, but not catastrophic.

Still, if a small change could repair that, fine by me.
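The counts above come from the built-in `hash()` on the interpreter used at the time, so they will not reproduce if your interpreter's tuple hash differs from the one under discussion. As a rough aid, here is a pure-Python sketch of the tuple hash loop as I understand it. Everything in it is an assumption to check against the C source: the name `old_tuple_hash` is hypothetical, the constants are quoted from memory, and a 64-bit hash width is assumed.

```python
from itertools import product

MASK = (1 << 64) - 1  # assumption: hashes are 64 bits wide


def old_tuple_hash(t):
    """Hypothetical port of the multiply-and-XOR tuple hash; not the C code."""
    x = 0x345678
    mult = 1000003
    remaining = len(t)
    for item in t:
        remaining -= 1
        y = hash(item)
        # XOR in the item's hash, then multiply, wrapping at 64 bits.
        x = ((x ^ y) * mult) & MASK
        # The multiplier itself changes on every step.
        mult = (mult + 82520 + 2 * remaining) & MASK
    x = (x + 97531) & MASK
    if x == MASK:       # this bit pattern is -1, which is reserved for errors
        x = MASK - 1    # use -2 instead
    # Reinterpret the unsigned 64-bit value as a signed one.
    return x - (1 << 64) if x >> 63 else x


cands = list(range(-10, -1)) + list(range(9))
distinct = len({old_tuple_hash(t) for t in product(cands, repeat=4)})
print(len(cands) ** 4, distinct)
```

If the port is faithful, the second number printed should be well below the first, in line with the collision counts shown earlier. Treat any disagreement as a bug in the sketch before drawing conclusions from it.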

History
Date	User	Action	Args
2018-09-21 19:44:24	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-21 19:44:24	tim.peters	set	messageid: <1537559064.0.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-21 19:44:23	tim.peters	link	issue34751 messages
2018-09-21 19:44:23	tim.peters	create