Author tim.peters
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-09-24.00:33:44
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1537749226.73.0.956365154283.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
Has anyone figured out the real source of the degeneration when mixing in negative integers?  I have not.  XOR always permutes the hash range - it's one-to-one.  No possible outputs are lost, and XOR with a negative int isn't "obviously degenerate" in general:

>>> for i in range(-16, 17):
...    print(i, i ^ -5)
-16 11
-15 10
-14 9
-13 8
-12 15
-11 14
-10 13
-9 12
-8 3
-7 2
-6 1
-5 0
-4 7
-3 6
-2 5
-1 4
0 -5
1 -6
2 -7
3 -8
4 -1
5 -2
6 -3
7 -4
8 -13
9 -14
10 -15
11 -16
12 -9
13 -10
14 -11
15 -12
16 -21

Clear enough? "Distinct integers in, distinct integers out", but with the twist that the sign bit flips.  That's _seemingly_ minor to me.  Yet it has massive effects on tuple hash distribution.  Here's a function to show that:

    def doit(cands, repeat=4):
        from collections import Counter
        from itertools import product

        print(len(cands), "cands **", repeat, "=", len(cands)**repeat)
        c1 = Counter(map(hash, product(cands, repeat=repeat)))
        print(len(c1), "distinct hash codes")
        c2 = Counter(c1.values())
        for dups in sorted(c2):
            print(dups, c2[dups])

Then an example we're proud of:

>>> doit(range(100))
100 cands ** 4 = 100000000
100000000 distinct hash codes
1 100000000

Much the same, mixing in negative ints, but _excluding_ the problematic -1 and -2 inputs:

>>> cands = list(range(-50, 51))
>>> cands.remove(-1) # hashes to -2
>>> cands.remove(-2) # for j odd, j ^ -2 == -j
>>> doit(cands)
99 cands ** 4 = 96059601
15736247 distinct hash codes
1 70827
2 1005882
3 228578
4 5000424
5 19728
6 1047762
8 8363046

What accounts for such a massive difference?  It's not merely that we're using negative ints:

>>> doit(range(-101, -1)) # leave -2 in, for a change
100 cands ** 4 = 100000000
100000000 distinct hash codes
1 100000000

So, on their own, negative ints are as spectacularly well-behaved in this range as positive ints, and including -2 creates no problems.

I suspect it's related to that x ^ -(x + 1) == -1, but not in a way obvious enough to be ... well, obvious ;-)
History
Date User Action Args
2018-09-24 00:33:46tim.peterssetrecipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-24 00:33:46tim.peterssetmessageid: <1537749226.73.0.956365154283.issue34751@psf.upfronthosting.co.za>
2018-09-24 00:33:46tim.peterslinkissue34751 messages
2018-09-24 00:33:44tim.peterscreate