
Author Dmitry Rubanovich
Recipients Dmitry Rubanovich, methane, rhettinger, serhiy.storchaka, tim.peters, xiang.zhang
Date 2017-06-19.03:27:39
Content
A few notes on my motivation and on other comments I made (not necessarily in order of significance):

1. I started looking at this because of the fix in Issue28201.  Its effect is exactly what the difference between

print_collision_counts(previous_lookdict_perturb)

and

print_collision_counts(py3_6_1_lookdict_perturb)

measures: the pre-28201 code was, in effect, multiplying by 6 instead of 5, and an even multiplier throws away a bit, so secondary hashes always had a 1/2 chance of colliding.
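For context, here is a minimal sketch of the two probe steps those calls compare.  The function names are from my script; the bodies below are a reconstruction of the relevant lookdict logic, not the script itself:

    PERTURB_SHIFT = 5

    def previous_lookdict_perturb(i, perturb, mask):
        # pre-28201: perturb still equals the full hash on the first probe,
        # and the hash's low bits equal i, so the step is effectively
        # (i*6 + 1) & mask -- an even multiplier, which drops a bit
        i = (i * 5 + perturb + 1) & mask
        return i, perturb >> PERTURB_SHIFT

    def py3_6_1_lookdict_perturb(i, perturb, mask):
        # post-28201: perturb is shifted before it is used, so the first
        # probe mixes in the hash bits just above the mask
        perturb >>= PERTURB_SHIFT
        return (i * 5 + perturb + 1) & mask, perturb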

2. What I demonstrated was that the fix, as it stands, doesn't do as much magic as one would hope: it produces a scheme in which the **first secondary** hashes still collide at a rate as high as 1/3.
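To illustrate the per-size calculation (an illustrative reconstruction, not the posted script), the following applies the post-28201 first probe to every primary hash that fits under the mask and reports what fraction of them share their first secondary slot with a different hash:

    from collections import Counter

    PERTURB_SHIFT = 5

    def first_secondary(h, mask):
        # the post-28201 first probe: shift perturb, then take one step
        return ((h & mask) * 5 + (h >> PERTURB_SHIFT) + 1) & mask

    for nbits in range(3, 21):                  # table sizes 8 .. 2**20
        mask = (1 << nbits) - 1
        counts = Counter(first_secondary(h, mask) for h in range(mask + 1))
        colliding = sum(c for c in counts.values() if c > 1)
        print("size 2**%d: %.1f%% of primary hashes share a slot"
              % (nbits, 100.0 * colliding / (mask + 1)))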

3. I somewhat regret extending the script to demonstrate what happens to the 2nd and 3rd secondary hashes, because it created the perception that I was trying to demonstrate a simulation.  I wasn't.  I was trying to demonstrate a calculation; or rather, one calculation for each dictionary size up to 2**20.

4. I detect apathy towards the fix, but I get the sense that it's mostly because I initially failed to limit the issue squarely to the 1st secondary hash.  And after the initial error in my calculation (as pointed out by Inada Naoki), which would have made the calculation of the 2nd, 3rd, 4th, etc. secondary hashes unstable (as pointed out by Tim), there has been a lot of pushback.

But the fix I propose DOES improve the ***1st secondary hash***, because it eliminates any possibility of collision there: no two 1st secondary hashes will be the same for any two different primary hashes.

And in the latest iteration of this proposal I did limit the scope of the enhancement to the 1st secondary hash only.  As it happens, such collisions cannot occur at all for dictionaries of sizes 8, 16, and 32; they only become possible at sizes 64 and greater.  The latest script demonstrates this.
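(To see why the small sizes are safe, assuming the post-28201 first probe (i*5 + (h >> 5) + 1) & mask: for any hash h under a mask of at most 31, h >> 5 == 0, so the map reduces to (h*5 + 1) & mask, and multiplying by the odd constant 5 is invertible modulo a power of two, hence collision-free.  From size 64 on, the h >> 5 term overlaps the masked bits and distinct hashes can meet: with mask = 63, h = 0 and h = 51 both probe slot 1 next.)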

5.  Sorry, Tim, but I should have confined my statements to the calculation I was certain about.  My further comments about what happens to the later secondary hashes were speculation, and this calculation does not reveal much about them.  If you want to consider what happens with the later secondary indexes (as you do in hashsim.py), that's a different problem.  To reiterate, my intention was to further improve on the fix in 28201.


On the size of perturb:

6.  Since hashes of strings and other objects are pointers, and can safely be assumed to be pointers into the heap, they are aligned on a 2 * sizeof(void*) boundary.  That's 16 in 64-bit builds, so the last 4 bits are 0.  And (depending on how much heap is used) the higher bits of the pointers are also similar across objects in applications that don't use a lot of memory.

For example, in an application using 16MiB of heap, only bits 35 (= 63 - 4 - 24) through 59 (= 63 - 4), counting bit positions from the most significant end, are truly random.  The rest of the bits can be assumed not to change, but they are not reached until the 5th iteration of the probing loop, so for this use case (one in which most dictionary keys are strings) the situation is already mildly degenerate.
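To make the alignment point concrete (CPython on a 64-bit build; the addresses below are hypothetical and vary per run):

    # Allocations are 16-byte aligned, so the low 4 bits of id(obj) -- and
    # of any hash derived from the pointer -- are always 0, while in a
    # small process the high bits barely differ between objects.
    objs = [object() for _ in range(4)]
    for o in objs:
        print(hex(id(o)))
    # e.g. 0x7f68c01b4e20, 0x7f68c01b4e30, ...: only a narrow band of
    # middle bits actually varies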

I haven't given much thought to which bits will have the most entropy in general, but that would be going too far off topic.