Author tim.peters
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-10-08.03:39:43
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1538969985.27.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to
Content
Here's htest.py output from

git pull https://github.com/jdemeyer/cpython.git bpo34751

and a variation.  The number of collisions in the variation appear in square brackets at the end of each line.

32-bit build:

range(100) by 3; 32-bit hash codes; mean 116.42 got 0 [0]
-10 .. 8 by 4; 32-bit hash codes; mean 1.28 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
0..99 << 60 by 3; 32-bit hash codes; mean 116.42 got 0 [0]
[-3, 3] by 20; 32-bit hash codes; mean 128.00 got 140 [108]
[0.5, 0.25] by 20; 32-bit hash codes; mean 128.00 got 99 [154]
old tuple test; 32-bit hash codes; mean 7.43 got 4 [2]
new tuple test; 32-bit hash codes; mean 13.87 got 19 [21]

64-bit build:

range(100) by 3; 64-bit hash codes; mean 0.00 got 0 [0]
range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]
-10 .. 8 by 4; 64-bit hash codes; mean 0.00 got 0 [0]
-10 .. 8 by 4; 32-bit lower hash codes; mean 1.28 got 1 [0]
-10 .. 8 by 4; 32-bit upper hash codes; mean 1.28 got 1 [0]
-50 .. 50 less -1 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
-50 .. 50 less -1 by 3; 32-bit lower hash codes; mean 116.42 got 128 [0]
-50 .. 50 less -1 by 3; 32-bit upper hash codes; mean 116.42 got 116 [8]
0..99 << 60 by 3; 64-bit hash codes; mean 0.00 got 0 [0]
0..99 << 60 by 3; 32-bit lower hash codes; mean 116.42 got 123 [258]
0..99 << 60 by 3; 32-bit upper hash codes; mean 116.42 got 121 [0]
[-3, 3] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[-3, 3] by 20; 32-bit lower hash codes; mean 128.00 got 129 [117]
[-3, 3] by 20; 32-bit upper hash codes; mean 128.00 got 137 [115]
[0.5, 0.25] by 20; 64-bit hash codes; mean 0.00 got 0 [0]
[0.5, 0.25] by 20; 32-bit lower hash codes; mean 128.00 got 126 [131]
[0.5, 0.25] by 20; 32-bit upper hash codes; mean 128.00 got 137 [130]
old tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
old tuple test; 32-bit lower hash codes; mean 7.43 got 12 [5]
old tuple test; 32-bit upper hash codes; mean 7.43 got 54 [52]
new tuple test; 64-bit hash codes; mean 0.00 got 0 [0]
new tuple test; 32-bit lower hash codes; mean 13.87 got 10 [6]
new tuple test; 32-bit upper hash codes; mean 13.87 got 20 [30]

They're all fine by me.  This is what the variation does:

1. Removes all "final mix" code after the loop ends.
2. Changes initialization to add in the length:

    Py_uhash_t acc = _PyHASH_XXPRIME_5 + (Py_uhash_t)len;

The vast bulk of the "final mix" code applies a chain of tuple-agnostic permutations.  The only exception is adding in the length, which #2 moves to the start instead.

Applying permutations after the loop ends changes nothing about the number of full-width hash collisions, which is Python's _primary_ concern.  Two full-width hash codes are the same after the final mix if and only if they're the same before the final mix starts.

In xxHash they're concerned about getting close to "avalanche" perfection (each input bit affects all final output bits with probability about 0.5 for each).  There aren't enough permutations _inside_ the loop to achieve that for the last input or two, so they pile up more permutations after the loop.

While we do care about "propagate left" and "propagate right" to mix up very regular and/or sparse hash codes, "avalanche perfection" is of no particular value to us.  To the contrary, in _typical_ cases like `product(range(100), repeat=3)` it's better for us _not_ to destroy all the regularity:

range(100) by 3; 32-bit lower hash codes; mean 116.42 got 130 [0]
range(100) by 3; 32-bit upper hash codes; mean 116.42 got 94 [6]

"Avalanche perfection" is what drives the number of collisions up with the final mix code intact.  Without it, we get "waaaaaaay better than random" numbers of collisions.

The other reason it's attractive to drop it:  the final mix code is one long critical path (no instruction-level parallelism), adding 8 serialized machine instructions including two multiplies.  That's a real expense for typically-short tuples used as dict keys or set elements.

Nothing is anywhere near disaster territory either way, although "worse than random" remains quite possible, especially when cutting a 64-bit hash in half (do note that, either way, there are no collisions in any test when retaining the full 64-bit hash code).
History
Date User Action Args
2018-10-08 03:39:45tim.peterssetrecipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-10-08 03:39:45tim.peterssetmessageid: <1538969985.27.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-10-08 03:39:45tim.peterslinkissue34751 messages
2018-10-08 03:39:43tim.peterscreate