Message 326375 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	eric.smith, jdemeyer, mark.dickinson, rhettinger, sir-sigurd, tim.peters
Date	2018-09-25.16:54:28
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1537894468.79.0.545547206417.issue34751@psf.upfronthosting.co.za>
In-reply-to

Content
> Suppose that there is a hash collision, say hash((3, 3)) == > hash((-3, -3)) and you change the hashing algorithm to fix > this collision. There are _two_ hash functions at play in that collision: the tuple hash function, and the integer hash function. Regardless of whether the _tuple_ hash function does t ^= t << shift; or t += t >> shift; or anything else involving just `t`, that only directly affects the result of the _int_ hash function. Whose low order bits cannot be improved in general - they're already optimally spread out in the most important cases. > If that change does NOT affect the lower N bits, > then you still have a collision hash((3, 3)) % 2N == > hash((-3, -3)) % 2N. This is relevant because the dict > implementation starts by taking the lower bits first. So > ideally we want to avoid hash collisions in the lower > bits too. Which the int hash on its own already does a superb job of doing in the most important cases. If you feel you just have to mess with low-order bits, do it instead on the _tuple_ hash intermediate results, not on its inputs. Like x += x >> 16; after t is folded in to x. It's x you care about directly, not t. Damaging desirable properties in t to _indirectly_ gain something in x is too clever by half. Mucking with t to avoid the nested-tuple catastrophes is worth it, but I prefer to do that with as little damage to t's desirable properties as reasonably possible. > This may also be a reason to avoid the standard FNV > multipliers: the 64-bit FNV multiplier was chosen with > the full 64-bit hash range in mind. It was never meant > to work well when truncated to some lower N bits. Do you believe any other multiplier would work better toward that end? For any odd multiplier M, the last k bits of iM are determined solely by the last k bits of i, and [(last k bits of iM) for i in range(anything, anything + 2k)] is a permutation of range(2k). The high order bits of i have nothing to do with it, and any odd multiplier permutes the lower k bits over any stretch of 2**k consecutive multiplicands. Extracting low-order bits is intended to be done by "xor folding" in FNV, and I _expect_ the same would be prudent for any other multiplier: https://tools.ietf.org/html/draft-eastlake-fnv-15#page-7 The dict and set lookup functions already do something stronger than xor folding to bring high bits into play, but do so incrementally as the probe sequence unfolds.

> Suppose that there is a hash collision, say hash((3, 3)) ==
> hash((-3, -3)) and you change the hashing algorithm to fix
> this collision.

There are _two_ hash functions at play in that collision:  the tuple hash function, and the integer hash function.  Regardless of whether the _tuple_ hash function does

    t ^= t << shift;
or
    t += t >> shift;

or anything else involving just `t`, that only directly affects the result of the _int_ hash function.  Whose low order bits cannot be improved in general - they're already optimally spread out in the most important cases.

> If that change does NOT affect the lower N bits,
> then you still have a collision hash((3, 3)) % 2**N ==
> hash((-3, -3)) % 2**N. This is relevant because the dict
> implementation starts by taking the lower bits first. So
> ideally we want to avoid hash collisions in the lower
> bits too.

Which the int hash on its own already does a superb job of doing in the most important cases.  If you feel you just have to mess with low-order bits, do it instead on the _tuple_ hash intermediate results, not on its inputs.  Like

    x += x >> 16;

after t is folded in to x.  It's x you care about directly, not t.  Damaging desirable properties in t to _indirectly_ gain something in x is too clever by half.

Mucking with t to avoid the nested-tuple catastrophes is worth it, but I prefer to do that with as little damage to t's desirable properties as reasonably possible.

> This may also be a reason to avoid the standard FNV
> multipliers: the 64-bit FNV multiplier was chosen with
> the full 64-bit hash range in mind. It was never meant
> to work well when truncated to some lower N bits.

Do you believe any other multiplier would work better toward that end?  For any odd multiplier M, the last k bits of i*M are determined solely by the last k bits of i, and

    [(last k bits of i*M) for i in range(anything, anything + 2**k)]

is a permutation of range(2**k).  The high order bits of i have nothing to do with it, and any odd multiplier permutes the lower k bits over any stretch of 2**k consecutive multiplicands.

Extracting low-order bits is intended to be done by "xor folding" in FNV, and I _expect_ the same would be prudent for any other multiplier:

https://tools.ietf.org/html/draft-eastlake-fnv-15#page-7

The dict and set lookup functions already do something stronger than xor folding to bring high bits into play, but do so incrementally as the probe sequence unfolds.

History
Date	User	Action	Args
2018-09-25 16:54:28	tim.peters	set	recipients: + tim.peters, rhettinger, mark.dickinson, eric.smith, jdemeyer, sir-sigurd
2018-09-25 16:54:28	tim.peters	set	messageid: <1537894468.79.0.545547206417.issue34751@psf.upfronthosting.co.za>
2018-09-25 16:54:28	tim.peters	link	issue34751 messages
2018-09-25 16:54:28	tim.peters	create