
Author tim.peters
Recipients eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date 2018-09-26.20:43:33
Message-id <1537994613.33.0.545547206417.issue34751@psf.upfronthosting.co.za>
Content
High-order bit:  please restore the original tuple hash test.  You have the worst case of "but I didn't write it" I've ever encountered ;-)  Your new test is valuable, but I've seen several cases now where it detects no problems at all while the original test fails catastrophically.  Tests are written to ensure known problems don't recur, so there's _never_ "a reason" to throw one away unless the test itself is in error.  The test suite only grows over time, and that's how it should be.

Example:  when I replace the computational guts of the tuple hash with the single line (64-bit build, and this is the only code remaining in the loop apart from setting y to the next tuple component's hash):

    x = (x * mult) + y;

both tests fail spectacularly.
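
For concreteness, here's a minimal sketch of the stripped-down loop being described.  It is not CPython's actual tuplehash (the real one also varies the multiplier per iteration); `toy_tuplehash` and its parameters are invented for illustration:

    #include <stdint.h>
    #include <stddef.h>

    /* Toy model of the loop: item_hashes[] stands in for the component
       hashes, mult for the multiplier (held fixed here for simplicity). */
    static uint64_t
    toy_tuplehash(const uint64_t *item_hashes, size_t len, uint64_t mult)
    {
        uint64_t x = 0x345678UL;          /* CPython's traditional seed */
        for (size_t i = 0; i < len; i++) {
            uint64_t y = item_hashes[i];  /* next component's hash */
            x = (x * mult) + y;           /* the single-line combiner */
        }
        return x;
    }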

But if I add one more line:

    x = (x * mult) + y;
    x += x >> 16;
    
your new test falls to 0 collisions, while the original test still suffers 83447 collisions (an improvement - down from 120050 - but still a spectacular failure).

We're flying blind enough here without deliberately plucking out one of our eyes ;-)  Indeed, the original tuple test is the only one in my collection that showed even a single collision after the two-line replacement just above.  "My collection" means your test, the original tuple hash test, and every example given in all the messages in this bug report.

Another "surprise":  replace addition with xor in the second line above:

    x = (x * mult) + y;
    x ^= x >> 16;

and then the original tuple test drops from 83447 to 0(!) collisions, while your new one jumps from 0 to a measly 12.
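
(Collision counts throughout mean: hash a batch of distinct inputs and count how many hashes duplicate an earlier one.  Neither test is reproduced here; a generic counter is just a sketch like this, with `count_collisions` invented for illustration:)

    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    /* n minus the number of distinct values; 0 means collision-free. */
    static size_t count_collisions(uint64_t *hashes, size_t n)
    {
        size_t dups = 0;
        qsort(hashes, n, sizeof *hashes, cmp_u64);
        for (size_t i = 1; i < n; i++)
            dups += (hashes[i] == hashes[i - 1]);
        return dups;
    }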


> The collision hash((3,3)) == hash((-3,-3)) is due to a specific
> structure in the input to the FNV algorithm.

This is all so hypothetical it's impossible to know.  Perhaps the tuple hash function simply subtracts the hash of the first component from the hash of the second.  Then it's impossible to make any deterministic change to the component hashes that can stop either of the tuples from hashing to 0.  The weakness is entirely in the hash combining function then.
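
To spell out that hypothetical: with such a combiner, any tuple of the form (a, a) hashes to 0 no matter how the component hash is (deterministically) changed, because identical component hashes cancel.  A sketch, with `bad_pair_hash` invented for illustration:

    #include <stdint.h>

    /* Hypothetical pathological combiner: second hash minus first. */
    static uint64_t bad_pair_hash(uint64_t h_first, uint64_t h_second)
    {
        return h_second - h_first;  /* (3,3) -> 0 and (-3,-3) -> 0, always */
    }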

> The operation t ^= t << 7 that you suggested (and which I approve, except
> for the shift amount) is meant precisely to destroy that structure.

Except I have no reason to believe that destroying the input structure is of unique value.  As shown above, it's easy to change the tuple hash in a simple way that passes all known tests without touching `y` (aka `t`) at all.

>> It's x you care about directly, not t.

> That would not be a good solution, because that destroys the structure
> of the hash algorithm.  For Bernstein, for example, each loop iteration
> should do something like
>
>    x = (x * m) + t
>
> for *some* value t depending on the input. If you mess with x, then it
> really becomes a different hash algorithm.

The two-liner above with the xor in the second line is exactly Bernstein 33A, followed by a permutation of 33A's _output_ space.  It works fine in tests.  Why should I care merely that it's "different"?  33A on its own fails spectacularly - it damned well _better_ be "different" ;-)

There is no "Bernstein theory" to appeal to here - just raw assertions.  I'm not attached to any of these approaches.  The strongest I can say from trying many things is that "chaining permutations works best, but no detail appears to matter much, except that multiplication is the most valuable of all the permutations I've tried".

x*M is a permutation of the word-size space for any odd M, and any "big enough" odd M I tried appears to work about as well as any other.  Adding t is another.  The shift-xor in the second line above is a third (but the "+=" version was not a permutation, and it failed).
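
Both of those claims are easy to check mechanically.  A sketch (the 16-bit model with shift 4 is a scaled-down stand-in for the 64-bit case):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* x ^= x >> 16 is a permutation: here is its closed-form inverse. */
    static uint64_t fwd(uint64_t x) { return x ^ (x >> 16); }
    static uint64_t inv(uint64_t y)
    {
        return y ^ (y >> 16) ^ (y >> 32) ^ (y >> 48);
    }

    int main(void)
    {
        for (uint64_t x = 0; x < 1000000; x += 997)  /* spot check */
            assert(inv(fwd(x)) == x);

        /* The "+=" form is not even injective: count distinct images of
           x + (x >> 4) over all 16-bit inputs; fewer than 65536 come out. */
        static unsigned char seen[1 << 16];
        unsigned distinct = 0;
        for (uint32_t x = 0; x < (1 << 16); x++) {
            uint16_t y = (uint16_t)(x + (x >> 4));
            if (!seen[y]++)
                distinct++;
        }
        printf("distinct images: %u of 65536\n", distinct);
        return 0;
    }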

> That is far more dangerous than simply applying a permutation on t.

Why?

> Which "desirable properties" of t does the operation t ^= (t << 1) damage?

On second thought, none!  Across any contiguous range of integers, that preserves that the last k bits (for all k >= 1) remain as evenly distributed as in the inputs.  That wasn't obvious to me at first; it holds because the last k bits of t ^ (t << 1) depend only on the last k bits of t, and the induced map on k-bit values is itself a permutation.  It's not true if the left shift is replaced by a right shift, though.
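
A quick brute-force check of that (the range, its length, and k below are arbitrary choices): tally the last-k-bit patterns over a contiguous range before and after each map.  The left-shift tallies always match the input tallies up to relabeling the patterns; the right-shift tallies generally don't:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        enum { K = 3, B = 1 << K };      /* tally the last K bits */
        unsigned in[B] = {0}, left[B] = {0}, right[B] = {0};
        for (uint64_t t = 12345; t < 12345 + 1000; t++) {
            in[t & (B - 1)]++;
            left[(t ^ (t << 1)) & (B - 1)]++;
            right[(t ^ (t >> 1)) & (B - 1)]++;
        }
        for (int i = 0; i < B; i++)
            printf("%d: in=%u left=%u right=%u\n",
                   i, in[i], left[i], right[i]);
        return 0;
    }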

However, with a shift count of 1 I've generally found that the _original_ tuple hash test fails (not spectacularly, but fails all the same).  "Generally" means across a substantial number of runs varying the permutations used in the other places.