Message 296147 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Dmitry Rubanovich
Recipients	Dmitry Rubanovich, methane, rhettinger, serhiy.storchaka, xiang.zhang
Date	2017-06-16.02:44:20
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1497581061.49.0.312495210898.issue30671@psf.upfronthosting.co.za>
In-reply-to

Content
Yes, I do understand it. But the statement in lines 166, 167: "For any initial j in range(2i), repeating that 2i times generates each int in range(2i) exactly once" does not hold when the perturb is added. 2i times will not be enough to generate all elements of the ring when some of multipliers are zero-divisors. Specifically, if you use j=((5j) + P + 1) mod 2i, you are effectively multiplying by a zero divisors every time P = j mod 2. And I don't mean "mod 2i." I do mean "mod 2." Which means anytime P (which changes a few times and eventually becomes 0), has the same parity as j, you are multiplying by a zero-divisor. Because P is eventually 0, you will go through all the values eventually. But for all values of P for which there is a P' such that P=/=P' and ((5j) + P + 1) = ((5j) + P' + 1) mod 2i, the number of times you'll need to apply the function will be higher than 2i. And that's in the worst case scenario. In the best case scenario, the collision probability can be gathered from looking at the values of my script printed by print_collision_counts(py3_6_1_lookdict_perturb). It can be as low as ~1/20 and as high as ~1/3 depending on the value of i. The main speed up we are seeking is to avoid a collision early on. And we are less concerned about collisions later on. Anecdotally, if your dict has 256 buckets, then the chance of collision is 1 in ~3.5. Which is an improvement over 1 in 2, but still pretty high. Ok, how about this: to avoid the edge cases, unroll the 1st secondary hash key to use j = 2j + P + 1. So try to test for it before the loop. But leave 5j + P + 1 in the loop as is. Although, to be honest if PERTURB_SHIFT is changed to 1 and we have a dict with a key that causes 64 collisions, this may be a good time to resize even if we are still sparse. On the other hand, this might create an attack vector with some magic value which causes resizes of near empty dict's. So maybe not... Certainly not without further analysis. BTW, and it's mostly stylistic, but except for the "hashpos = i & mask;" line in "find_empty_slot()", applying of the mask can be moved to "dk_get_index()". Again, I am looking at the released 3.6.1 code, so if this is already done, then never mind. As another note, changing PERTURB_SHIFT to 1 causes a near-universal reduction in collisions (for all strategies). So if nothing else, that's probably an improvement.

Yes, I do understand it.  

But the statement in lines 166, 167: "For any initial j in range(2**i), repeating that 2**i times generates each int in range(2**i) exactly once" does not hold when the perturb is added.  2**i times will not be enough to generate all elements of the ring when some of multipliers are zero-divisors.  

Specifically, if you use j=((5*j) + P + 1) mod 2**i, you are effectively multiplying by a zero divisors every time P = j mod 2.  And I don't mean "mod 2**i."  I do mean "mod 2."  Which means anytime P (which changes a few times and eventually becomes 0), has the same parity as j, you are multiplying by a zero-divisor.  Because P is eventually 0, you will go through all the values **eventually**.  But for all values of P for which there is a P' such that P=/=P' and ((5*j) + P + 1) = ((5*j) + P' + 1) mod 2**i, the number of times you'll need to apply the function will be higher than 2**i.  And that's in the worst case scenario.

In the best case scenario, the collision probability can be gathered from looking at the values of my script printed by print_collision_counts(py3_6_1_lookdict_perturb).  It can be as low as ~1/20 and as high as ~1/3 depending on the value of i.

The main speed up we are seeking is to avoid a collision early on.  And we are less concerned about collisions later on.  Anecdotally, if your dict has 256 buckets, then the chance of collision is 1 in ~3.5.  Which is an improvement over 1 in 2, but still pretty high.  

Ok, how about this: to avoid the edge cases, unroll the 1st secondary hash key to use j = 2*j + P + 1.  So try to test for it before the loop.  But leave 5*j + P + 1 in the loop as is.  

Although, to be honest if PERTURB_SHIFT is changed to 1 and we have a dict with a key that causes 64 collisions, this may be a good time to resize even if we are still sparse.  On the other hand, this might create an attack vector with some magic value which causes resizes of near empty dict's.  So maybe not...  Certainly not without further analysis.

BTW, and it's mostly stylistic, but except for the "*hashpos = i & mask;" line in "find_empty_slot()", applying of the mask can be moved to "dk_get_index()".  Again, I am looking at the released 3.6.1 code, so if this is already done, then never mind.

As another note, changing PERTURB_SHIFT to 1 causes a near-universal reduction in collisions (for all strategies).  So if nothing else, that's probably an improvement.

History
Date	User	Action	Args
2017-06-16 02:44:21	Dmitry Rubanovich	set	recipients: + Dmitry Rubanovich, rhettinger, methane, serhiy.storchaka, xiang.zhang
2017-06-16 02:44:21	Dmitry Rubanovich	set	messageid: <1497581061.49.0.312495210898.issue30671@psf.upfronthosting.co.za>
2017-06-16 02:44:21	Dmitry Rubanovich	link	issue30671 messages
2017-06-16 02:44:20	Dmitry Rubanovich	create