Message 296138 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Dmitry Rubanovich
Recipients	Dmitry Rubanovich, methane, rhettinger, serhiy.storchaka, xiang.zhang
Date	2017-06-15.23:28:39
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1497569319.98.0.458656958782.issue30671@psf.upfronthosting.co.za>
In-reply-to

Content
I am looking at the 3.6.1 code. I am not trying to simulate the index lookup as the lookup function would do it. There is no point in that. The point of the script is to detect collisions among perturbed secondary indexes rather than between primary hashes and secondary. To that end, it applies the perturbed secondary hash function to all primary indexes and stores the results. This allows to judge the quality of the secondary hash function. The "old" one was bad because it automatically created collisions by multiplying by zero-divisors of the 2^k ring. The one used in 3.6.1 is better because it doesn't always do that, but it still multiplies by zero-divisors of the ring when h mod 2^k is equal to (h >> 5) mod 2^k. This multiplication by zero-divisors of the 2^k ring is what produces collisions. The script simply counts them. The later iterations of the loop are not very relevant. They will almost certainly produce further collisions, but that's unavoidable. By the way, just as a test, I just changed the value of PERTURB_SHIFT to 1 and, in my proposed solution, it produced lower collision counts, after 3 runs of the loop, in virtually all rings (ie, mod all values of 2^k for k from 8 to 20). Intuitively it makes sense because there is less information loss. But I don't have a formal proof as to why that should be.

I am looking at the 3.6.1 code.  I am not trying to simulate the index lookup as the lookup function would do it.  

There is no point in that.  The point of the script is to detect collisions among perturbed secondary indexes rather than between primary hashes and secondary.  To that end, it applies the perturbed secondary hash function to **all** primary indexes and stores the results.  This allows to judge the quality of the secondary hash function.  The "old" one was bad because it automatically created collisions by multiplying by zero-divisors of the 2^k ring.  The one used in 3.6.1 is better because it  doesn't always do that, but it still multiplies by zero-divisors of the ring when h mod 2^k is equal to (h >> 5) mod 2^k.  This multiplication by zero-divisors of the 2^k ring is what produces collisions.  The script simply counts them.  

The later iterations of the loop are not very relevant.  They will almost certainly produce further collisions, but that's unavoidable.  

By the way, just as a test, I just changed the value of PERTURB_SHIFT to 1 and, in my proposed solution, it produced lower collision counts, after 3 runs of the loop, in virtually all rings (ie, mod all values of 2^k for k from 8 to 20).  Intuitively it makes sense because there is less information loss.  But I don't have a formal proof as to why that should be.

History
Date	User	Action	Args
2017-06-15 23:28:40	Dmitry Rubanovich	set	recipients: + Dmitry Rubanovich, rhettinger, methane, serhiy.storchaka, xiang.zhang
2017-06-15 23:28:39	Dmitry Rubanovich	set	messageid: <1497569319.98.0.458656958782.issue30671@psf.upfronthosting.co.za>
2017-06-15 23:28:39	Dmitry Rubanovich	link	issue30671 messages
2017-06-15 23:28:39	Dmitry Rubanovich	create