Message 352232 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tim.peters
Recipients	crusaderky, hongweipeng, rhettinger, serhiy.storchaka, steven.daprano, tim.peters
Date	2019-09-13.00:47:57
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1568335678.18.0.0410200191576.issue38105@roundup.psfhosted.org>
In-reply-to

Content
Following up, at least under Visual Studio for x86 "it's free" to change the code to add in `perturb` shifted left. The C source: perturb >>= PERTURB_SHIFT; i = (i5 + (perturb << 1) + 1) & mask; compiles to this, where I added comments relating the instructions to the source code: lea rax,[r14+r144] # rax = i5 shr r12,5 # r12(perturb) >>= 5 lea r14,[r122+1] # r14(i) = (perturb << 1) + 1 add r14,rax # r14(i) += i*5 and r14,rbx # r14(i) &= mask Good job! There are actually fewer machine instructions than C operators. As noted, this change would eliminate the possibility of fixed points from the start, so would eliminate the symptoms in this specific bug report. Which I don't much care about ;-) But I _do_ care about that, in general, the early stages of collision resolution for small tables can "frequently" visit slots more than once. This is true even if the hashes aren't identical. And it would remain true even with the change (cycles can still exist - that there are no fixed points just means there are no more cycles of the minimum possible length 1). Improving that is worth some effort, but not much. In general, hash codes aren't identical, and in small tables redundant slot visits just cost the time to read the hash code from cache and re-deduce that it's not equal to what we're looking for (__eq__ isn't typically called). In larger tables redundant slot visits in the early stages are too rare to care about. In small tables, it's worth something to guarantee that the first probe after a collision can't hit exactly the same slot again (in a random model, there's a 12.5% chance of that happening now in the smallest tables - 12.5% is 1 in 8 - and this bug report contrived cases where the chance is 100% across a dozen iterations). Anyone up a for a world of tedious benchmarking? ;-)

Following up, at least under Visual Studio for x86 "it's free" to change the code to add in `perturb` shifted left.  The C source:

    perturb >>= PERTURB_SHIFT;
    i = (i*5 + (perturb << 1) + 1) & mask;

compiles to this, where I added comments relating the instructions to the source code:

    lea         rax,[r14+r14*4]  # rax = i*5
    shr         r12,5            # r12(perturb) >>= 5
    lea         r14,[r12*2+1]    # r14(i) = (perturb << 1) + 1
    add         r14,rax          # r14(i) += i*5
    and         r14,rbx          # r14(i) &= mask

Good job!  There are actually fewer machine instructions than C operators.

As noted, this change would eliminate the possibility of fixed points from the start, so would eliminate the symptoms in this specific bug report.

Which I don't much care about ;-)  But I _do_ care about that, in general, the early stages of collision resolution for small tables can "frequently" visit slots more than once.  This is true even if the hashes aren't identical.  And it would remain true even with the change (cycles can still exist - that there are no fixed points just means there are no more cycles of the minimum possible length 1).

Improving that is worth some effort, but not much.  In general, hash codes aren't identical, and in small tables redundant slot visits just cost the time to read the hash code from cache and re-deduce that it's not equal to what we're looking for (__eq__ isn't typically called).

In larger tables redundant slot visits in the early stages are too rare to care about.

In small tables, it's worth something to guarantee that the first probe after a collision can't hit exactly the same slot again (in a random model, there's a 12.5% chance of that happening now in the smallest tables - 12.5% is 1 in 8 - and this bug report contrived cases where the chance is 100% across a dozen iterations).

Anyone up a for a world of tedious benchmarking? ;-)

History
Date	User	Action	Args
2019-09-13 00:47:58	tim.peters	set	recipients: + tim.peters, rhettinger, steven.daprano, serhiy.storchaka, hongweipeng, crusaderky
2019-09-13 00:47:58	tim.peters	set	messageid: <1568335678.18.0.0410200191576.issue38105@roundup.psfhosted.org>
2019-09-13 00:47:58	tim.peters	link	issue38105 messages
2019-09-13 00:47:57	tim.peters	create