Message 151869 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Jim.Jewett
Recipients	Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date	2012-01-24.00:42:45
SpamBayes Score	1.2730184e-10
Marked as misclassified	No
Message-id	<CA+OGgf42Di+mL_GaH6HxVEX4xpOTAQTp0mVEVtLDjEv4sS4-Wg@mail.gmail.com>
In-reply-to	<4F1DD38E.7000605@egenix.com>

Content
On Mon, Jan 23, 2012 at 4:39 PM, Marc-Andre Lemburg <report@bugs.python.org> wrote: > Running (part of ()) the test suite with debugging enabled on a 64-bit > machine shows that slot collisions are much more frequent than > hash collisions, which only account for less than 0.01% of all > collisions. Even 1 in 10,000 seems pretty high, though I suppose it is a result of non-random input. (For a smalldict with 8 == 2^3 slots, on a 64-bit machine, true hash collisions "should" only account for 1 in 2^61 slot collisions.) > It also shows that slot collisions in the low 1-10 range are > most frequent, with very few instances of a dict lookup > reaching 20 slot collisions (less than 0.0002% of all > collisions). Thus the argument that collisions > N implies (possibly malicious) data that really needs a different hash -- and that this dict instance in particular should take the hit to use an alternative hash. (Do note that this alternative hash could be stored in the hash member of the PyDictEntry; if anything actually equal* to the key comes along, it will have gone through just as many collisions, and therefore also have been rehashed.) > The great number of cases with 1 or 2 slot collisions surprised > me. It seems that there's potential for improvement of > the perturbation formula left. In retrospect, this makes sense. for (perturb = hash; ; perturb >>= PERTURB_SHIFT) { i = (i << 2) + i + perturb + 1; If two objects collided then they have the same last few last few bits in their hashes -- which means they also have the same last few bits in their initial perturb. And since the first probe is to slot 6i+1, it funnels down to only even consider half the slots until the second probe. Also note that this explains why Randomization could make the Django tests fail, even though 64-bit users haven't complained. The initial hash(&mask) is the same, and the first probe is the same, and (for a small enough dict) so are the next several. In a dict with 2^12 slots, the first 6 tries will be the same ... so I doubt the test cases have sufficiently large amounts of sufficiently unlucky data to notice very often -- unless the hash itself is changed, as in the patch.

On Mon, Jan 23, 2012 at 4:39 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

> Running (part of (*)) the test suite with debugging enabled on a 64-bit
> machine shows that slot collisions are much more frequent than
> hash collisions, which only account for less than 0.01% of all
> collisions.

Even 1 in 10,000 seems pretty high, though I suppose it is a result of
non-random input.  (For a smalldict with 8 == 2^3 slots, on a 64-bit
machine, true hash collisions "should" only account for 1 in 2^61 slot
collisions.)

> It also shows that slot collisions in the low 1-10 range are
> most frequent, with very few instances of a dict lookup
> reaching 20 slot collisions (less than 0.0002% of all
> collisions).

Thus the argument that collisions > N implies (possibly malicious)
data that really needs a different hash -- and that this dict instance
in particular should take the hit to use an alternative hash.  (Do
note that this alternative hash could be stored in the hash member of
the PyDictEntry; if anything actually *equal* to the key comes along,
it will have gone through just as many collisions, and therefore also
have been rehashed.)

> The great number of cases with 1 or 2 slot collisions surprised
> me. It seems that there's potential for improvement of
> the perturbation formula left.

In retrospect, this makes sense.

    for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
        i = (i << 2) + i + perturb + 1;

If two objects collided then they have the same last few last few bits
in their hashes -- which means they also have the same last few bits
in their initial perturb.  And since the first probe is to slot 6i+1,
it funnels down to only even consider half the slots until the second
probe.

Also note that this explains why Randomization could make the Django
tests fail, even though 64-bit users haven't complained.  The initial
hash(&mask) is the same, and the first probe is the same, and (for a
small enough dict) so are the next several.  In a dict with 2^12
slots, the first 6 tries will be the same ... so I doubt the test
cases have sufficiently large amounts of sufficiently unlucky data to
notice very often -- unless the hash itself is changed, as in the
patch.

History
Date	User	Action	Args
2012-01-24 00:42:46	Jim.Jewett	set	recipients: + Jim.Jewett, lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, grahamd, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, PaulMcMillan, fx5
2012-01-24 00:42:45	Jim.Jewett	link	issue13703 messages
2012-01-24 00:42:45	Jim.Jewett	create