
Author: tim.peters
Recipients: Dmitry Rubanovich, methane, rhettinger, serhiy.storchaka, tim.peters, xiang.zhang
Date: 2017-06-19.18:19:14
BTW, I should have spelled this out earlier:  recall that hash codes x and y collide on the first probe if and only if

    x = y (mod 2**k)   [1]

The second probe never happens unless they do collide on the first probe, so when looking at the second probe we can assume that [1] is true.  Since [1] is true,

    5*(x % 2**k) + 1 = 5*(y % 2**k) + 1

is also true (and there's nothing special here about "5" and "1" - it would hold for any multiplier and any addend).  The second probe just adds (x >> 5) % 2**k to the left side of that and (y >> 5) % 2**k to the right side of that.  Therefore the second probes collide if and only if (in addition to [1])

    x >> 5 = y >> 5 (mod 2**k)  [2]

The primary point is that any analysis that ignores the higher-order bits is missing the main thing that matters ;-)
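
To make [1] and [2] concrete, here's a minimal sketch of that recurrence in Python (probes() and the demo values are mine, for illustration - the real dict code interleaves this with table lookups):

    from itertools import islice

    PERTURB_SHIFT = 5

    def probes(h, k):
        """Yield the slot indices probed for hash code h in a table
        with 2**k slots: h % 2**k first, then repeatedly
        j = 5*j + 1 + perturb, where perturb is h shifted right
        another PERTURB_SHIFT bits before each use."""
        mask = (1 << k) - 1
        perturb = h
        j = h & mask
        yield j
        while True:
            perturb >>= PERTURB_SHIFT
            j = (5 * j + 1 + perturb) & mask
            yield j

    # Hash codes equal mod 2**k collide on the first probe ([1]); their
    # second probes collide exactly when x >> 5 = y >> 5 (mod 2**k) ([2]).
    k = 10
    x = 0x12345
    y = x ^ (1 << (k + 3))   # equal in the low k bits, different above
    px, py = (list(islice(probes(h, k), 2)) for h in (x, y))
    assert px[0] == py[0]                                            # [1]
    assert (px[1] == py[1]) == ((x >> 5) % 2**k == (y >> 5) % 2**k)  # [2]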

[2] does reveal some potential weaknesses in the scheme.  For example, if k < 5, some bits of the hash code have no effect on the probe sequence (e.g., if k==1, the first probe depends only on the hash code's last bit, and if there's a collision then a second-probe collision depends only on bit 2**5 -- bits 2**1 through 2**4 are effectively ignored -- but the table only has 2 slots in the k==1 case).  Decreasing PERTURB_SHIFT works against that.
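
Reusing the probes() sketch from above, the k==1 case can be checked directly - flipping any of bits 2**1 through 2**4 never changes the first two probes:

    # With k == 1 the first probe sees only bit 0, and the second just
    # adds (h >> 5) % 2, so bits 1..4 can't influence either.
    k = 1
    h = 0b1100111            # arbitrary example hash code
    for b in range(1, 5):
        h2 = h ^ (1 << b)    # flip one of bits 1..4
        assert (list(islice(probes(h, k), 2))
                == list(islice(probes(h2, k), 2)))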

OTOH, if k > 5, some of the bits that contributed to [1] being true also feed into whether [2] is true.  Increasing PERTURB_SHIFT works against that.

Setting PERTURB_SHIFT=k (i.e., making it depend on the number of slots) avoids both, but doesn't appear to actually perform better than always using 5.
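
For anyone who wants to poke at that themselves, here's a rough harness in the spirit of hashsim.py (my own sketch, not that script): fill simulated tables with random hash codes and count probes, so a fixed shift of 5 can be compared against shift=k:

    import random

    def avg_probes(k, shift, load=2/3, tables=20):
        """Average probes to insert random 64-bit hash codes into
        open-addressed tables with 2**k slots, filled to `load`,
        using the 5*j + 1 + perturb recurrence with this shift."""
        size = 1 << k
        mask = size - 1
        total = inserts = 0
        for _ in range(tables):
            slots = [False] * size
            for _ in range(int(size * load)):
                h = random.getrandbits(64)
                perturb, j, n = h, h & mask, 1
                while slots[j]:
                    perturb >>= shift
                    j = (5 * j + 1 + perturb) & mask
                    n += 1
                slots[j] = True
                total += n
                inserts += 1
        return total / inserts

    # e.g., compare avg_probes(10, 5) against avg_probes(10, 10)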

The rub is that collisions just aren't a real problem under the current scheme - and hashsim.py shows it's doing about as well on that count as the theoretical gold standard ("uniform hashing").

If people have time to spend on dicts, I'd rather see them pursue entirely different methods that guarantee to slash the _worst_ case number of probes.  hashsim.py also shows that while long collision chains are rare, they do occur (and they also occur under "uniform hashing").  Then, e.g., we could drop the annoying "hash randomization" cruft.  For example, dig into "cuckoo hashing" and "Robin Hood hashing".  Just be warned that, regardless of scheme, the devil's in the details :-(