
Author rhettinger
Recipients methane, rhettinger, serhiy.storchaka, tim.peters, vstinner
Date 2017-02-09.16:22:06
Message-id <1486657327.45.0.243712223162.issue29476@psf.upfronthosting.co.za>
Content
The problem with posting an idea here on the tracker while it is still in the research phase is that it gets prematurely downvoted before it has been fully thought through.

What I'm working on now is the opposite question.  Was it ever worthwhile to add this optimization in the first place?  In particular, does it strongly favor a single case over the normal cases for no real benefit?

Suppose we found a line of code in the calendar module that tested for one special case (is it March 3, 1978?) and, if so, used a precomputed result, but for all other dates computed normally.  The right thing to do would be to remove that code, but then one commenter would say, "it makes that one case many times slower", when in fact it merely brings that case into line with the other normal cases.  And another respondent would say, "removing the almost-never-used special case provides too small an improvement to the rest of the cases".  The group would then arrive at the collectively incorrect conclusion that yes, it is a great idea to keep a special case for March 3, 1978.
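
Roughly, the hypothetical branch would look like this (the calendar module has no such code; the function name and cached value below are made up for illustration):

import datetime

_SPECIAL = datetime.date(1978, 3, 3)
_PRECOMPUTED = 4                      # cached answer: that date was a Friday

def day_of_week(d):
    # Hypothetical special case for exactly one date out of millions ...
    if d == _SPECIAL:
        return _PRECOMPUTED
    # ... while every other date takes the normal path anyway.
    return d.weekday()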

The test bench I'm working on now is examining whether a repeated add/discard of the same element in an otherwise mostly full table is like the calendar module scenario described above.  In particular, I want to know whether applying the patch makes the new performance for that case about the same as other typical cases of add/discard.
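
A rough sketch of that kind of micro-benchmark (the table size, element values, and iteration counts are my own choices here, not the actual test bench):

import timeit

# Mostly full table with one dummy slot already present.
setup = "s = set(range(1000)); s.discard(0)"

# Exotic case: add and discard the exact same element over and over.
same = "s.add(0); s.discard(0)"

# More typical case: the element changes from one add/discard pair to the next.
varied = "; ".join("s.add(%d); s.discard(%d)" % (i, i)
                   for i in range(1000, 1010))

print("same element  :", timeit.timeit(same, setup, number=1000000))
print("varied element:", timeit.timeit(varied, setup, number=100000))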

Here are some thoughts on timings.  How you time, what you time, and where you time it matter a great deal when evaluating whether to keep code that is deep in the set insertion logic.  In a complex function that already has register spills, different compilers may produce very different results (i.e. eliminating the freeslot lookup may allow another important variable to use a register rather than spilling and reloading on every iteration).  We can also expect different results on 32-bit vs 64-bit builds (the former having many fewer registers to work with) and on ARM vs x86 (the latter having greater instruction-level parallelism that lets you get away with having blocks of useless code running in parallel with the important code).
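
For anyone who hasn't read the C code, here is a simplified, self-contained Python model of the probe loop (the real logic is in Objects/setobject.c and also perturbs the probe sequence; that detail is omitted).  The lines marked "extra" are the freeslot bookkeeping whose removal is being discussed:

EMPTY, DUMMY = object(), object()        # sentinel slot states

def find_slot(table, mask, key, h):
    """Return the index where (h, key) lives or should be inserted."""
    i = h & mask
    freeslot = None                      # extra: first dummy seen so far
    while True:
        entry = table[i]
        if entry is EMPTY:
            # Prefer reusing an earlier dummy slot over a fresh empty one.
            return i if freeslot is None else freeslot    # extra branch
        if entry is DUMMY:
            if freeslot is None:         # extra: test and store every time
                freeslot = i             # a dummy slot is crossed
        elif entry[0] == h and entry[1] == key:
            return i                     # key already present
        i = (i + 1) & mask               # simplified probe sequence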

The test dataset matters as well.  If set insertion timings are dominated by hashing or equality tests, then all improvements to the probing logic will look insignificant.  Likewise, if the dataset has few hash collisions, then the affected code doesn't get run at all, leading to the false conclusion that simplifying the code doesn't have any benefit.
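
For example, assuming CPython's usual power-of-two tables, where the initial probe slot comes from the low bits of the hash, datasets for each regime might look like this (sizes are arbitrary):

import timeit

# Cheap hash/eq but identical low bits, so the collision-handling code always runs.
colliding = [i << 20 for i in range(1000)]

# Cheap hash/eq and few collisions, so the affected code is rarely reached.
scattered = list(range(1000))

# Hashing and equality dominate (tuple hashes are recomputed on every insertion).
hash_bound = [tuple(range(i, i + 20)) for i in range(1000)]

for name, data in [("colliding", colliding),
                   ("scattered", scattered),
                   ("hash-bound", hash_bound)]:
    print(name, timeit.timeit("set(data)", globals={"data": data}, number=1000))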

For a timing that properly isolates and exercises the affected code, we should expect some measurable (and likely non-trivial) benefit.  The GCC-6 x86_64 disassembly gives us reason to expect a favorable result: the new inner loop has a third fewer instructions, two fewer compare/jumps, and half the number of memory accesses (because, in the current code, the freeslot is reloaded from the stack on every iteration).
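
For reference, the patched loop corresponds to the earlier sketch with the freeslot bookkeeping stripped out (again a model rather than the actual C, reusing the EMPTY/DUMMY sentinels from above):

def find_slot_no_freeslot(table, mask, key, h):
    """Same model as above, minus the freeslot bookkeeping."""
    i = h & mask
    while True:
        entry = table[i]
        if entry is EMPTY:
            # Dummies are never reused here; they only go away at the next resize.
            return i
        if entry is not DUMMY and entry[0] == h and entry[1] == key:
            return i                     # key already present
        i = (i + 1) & mask               # simplified probe sequence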

Also, there hasn't been any exploration of the potential indirect benefits in cases that I expect to be far more common.  In those cases where we're going to have to resize anyway, is it better to do the work for partial early reclamation of dummies and defer resizing, or is it better to skip the early-reclamation work so we can resize sooner and eliminate all of the dummies?  I don't know the answer to this question yet, but it is far more important and far more common than a contrived case of repeatedly adding and discarding the exact same element while never hitting a resize.
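
A workload along those lines interleaves growth with discards so that resizes happen while dummies are present (the shape and sizes here are my own guess at the common case, not a measurement from this issue):

import timeit

def churn():
    # Steady growth forces periodic resizes; the interleaved discards leave
    # dummies behind in between, so both strategies described above come into play.
    s = set()
    for i in range(100000):
        s.add(i)
        if i % 3 == 0:
            s.discard(i - 1)

print(timeit.timeit(churn, number=10))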

On the other hand, perhaps the early dummy reclamation is worth it.  I don't know the answer yet and was hoping that you all would help me find out.  Downvoting on the basis of two quick-and-dirty timings on a single machine, single compiler, and single dataset isn't helpful.

I've already put a lot of time into looking at these issues and still don't know the right thing to do.  So, if other commenters have only put a few minutes into thinking about it and have already drawn a conclusion, then their conclusions are suspect.

IMO, there are still open research questions:  Is early freeslot reclamation worth it, or is it a net negative in all but the most exotic and unlikely cases?  Without the existing optimization, would the exotic case perform about the same as the other cases, or is it catastrophically worse, so that it warrants special handling even to the detriment of the other cases?