Author kristjan.jonsson
Recipients beazley, dabeaz, flox, kristjan.jonsson, loewis, pitrou, torsten
Date 2010-04-06.23:53:08
Message-id <1270597990.96.0.191599968356.issue8299@psf.upfronthosting.co.za>
In-reply-to
Content
The counter is "stall cycles".
During the 10-second run on my 2.4 GHz CPU, we had instruction cache miss stalls for 2 billion cycles (2000 samples at 1,000,000 cycles per sample).  That accounts for around 10% of the available CPU.
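The arithmetic behind that estimate can be checked directly; the clock speed, run length, and cycles-per-sample figures below are taken from the run described above:

```python
# Stall-cycle accounting for the 10-second profiling run above.
clock_hz = 2.4e9              # 2.4 GHz CPU
run_seconds = 10
cycles_per_sample = 1_000_000
stall_samples = 2000          # samples reported by the profiler

stall_cycles = stall_samples * cycles_per_sample  # 2e9 cycles
total_cycles = clock_hz * run_seconds             # 24e9 cycles available
stall_fraction = stall_cycles / total_cycles

print(f"stalled: {stall_fraction:.1%} of available CPU")  # stalled: 8.3% of available CPU
```

So the exact figure is 8.3%, which the text rounds up to "around 10%".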

I'm observing something like 20% slowdown, though, so there are probably other causes.

Profiling another counter, "instruction fetches", I see this, for a "fast run":
Functions Causing Most Work
Name              Samples  %
Unknown Frame(s)  10,733   99.49

and for a slow run:
Functions Causing Most Work
Name              Samples  %
Unknown Frame(s)  8,056    99.48

This shows a 20% drop in fetched instructions over the interval (five seconds this time).  Ideally we would see 12,000 samples in the fast case (2.4 GHz, 5 s), but we see about 10,000 because of the cache misses present even in this case.  The cache misses in the "slow" case cause effective instruction fetches to drop by a further 20% on top of that.
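These numbers can also be checked against the tables above (clock speed and sample size as before):

```python
# Instruction-fetch accounting for the five-second runs above.
clock_hz = 2.4e9                    # 2.4 GHz CPU
run_seconds = 5
instructions_per_sample = 1_000_000

# Ideal sample count if one instruction were fetched every cycle.
ideal_samples = clock_hz * run_seconds / instructions_per_sample
print(ideal_samples)                # 12000.0

fast_samples = 10733                # "fast" run from the table above
slow_samples = 8056                 # "slow" run

print(f"fast run: {fast_samples / ideal_samples:.0%} of ideal")  # fast run: 89% of ideal
print(f"slow run: {slow_samples / ideal_samples:.0%} of ideal")  # slow run: 67% of ideal
```

The fast run's shortfall from ideal is the baseline cache-miss cost; the slow run loses a further chunk on top of that, which is the effect attributed to cross-core instruction cache misses.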

I think this is proof positive that the slowdown is due to instruction cache misses, at least on the dual-core Intel machine I am using.

As for "the OS should handle this": I agree, but it doesn't.  We are doing something unusual: convoying two (or more) threads so that only one runs at a time.  The OS scheduler isn't built for that.  It assumes there will be some parallel execution, so it thinks it best to put the two sequential threads on different CPUs.  But it is wrong, and the cost of the resulting cache misses outweighs the benefit of running on another core (zero, in our case).

So, the OS won't handle it, no matter how hard we wish it would.  We are the ones who know how these gridlocked threads behave, and we know it much better than any OS scheduler can guess.  So, rather than beat our heads against the rock, I'm going to try to come up with a useful heuristic for when to switch cores and when not to.  It would be useful as a diagnostic tool, if nothing more.
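One blunt way to apply that knowledge ourselves, rather than waiting for the scheduler, is to pin the convoyed threads onto a single core.  A minimal sketch using the Linux-only os.sched_setaffinity API (the choice of core 0 here is purely illustrative, and the real fix would of course be in C inside the interpreter):

```python
import os

def pin_to_one_core(cpu=0):
    """Restrict this process (and with it, its GIL-convoyed threads)
    to a single core, so they stop ping-ponging between the
    instruction caches of different cores.  Linux-only API."""
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process

if hasattr(os, "sched_setaffinity"):  # not available on all platforms
    pin_to_one_core(0)
    print(os.sched_getaffinity(0))    # {0}
```

Since only one of the convoyed threads can run at a time anyway, confining them to one core costs no parallelism but keeps the instruction cache warm.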

Ok, so we have established two things, I think:
1) The poor response of IO threads in the presence of CPU threads in the thread_pthread.h implementation (on multicore) is caused by the greedy GIL wait semantics of the current GIL.  It is easily fixable using the ROUNDROBIN_GIL implementation I've shown.
2) The poor performance of competing CPU threads on multicore machines is due to the instruction cache behaviour of non-overlapping thread execution on different cores.
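The hand-off semantics in point 1 can be illustrated in miniature.  This is not the actual ROUNDROBIN_GIL patch, just a toy FIFO lock in Python showing the idea: release() hands ownership directly to the longest-waiting thread, so a releasing CPU-bound thread cannot greedily reacquire the lock and starve an IO thread that is already waiting.

```python
import threading
from collections import deque

class RoundRobinLock:
    """Toy FIFO lock: release() hands ownership to the longest-waiting
    thread instead of letting the releaser immediately reacquire (the
    "greedy" behaviour described above)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._waiters = deque()   # FIFO queue of waiting threads' events
        self._held = False

    def acquire(self):
        with self._mutex:
            if not self._held:
                self._held = True
                return
            ev = threading.Event()
            self._waiters.append(ev)
        ev.wait()                 # woken by a releaser, in FIFO order

    def release(self):
        with self._mutex:
            if self._waiters:
                self._waiters.popleft().set()  # direct hand-off; stays held
            else:
                self._held = False

# Demo: main holds the lock, a worker queues up, release() hands over.
lock = RoundRobinLock()
order = []
lock.acquire()

def worker():
    lock.acquire()
    order.append("worker")
    lock.release()

t = threading.Thread(target=worker)
t.start()
while not lock._waiters:          # spin until the worker is queued
    pass
order.append("main")
lock.release()                    # hands the lock to the worker
t.join()
print(order)                      # ['main', 'worker']
```

The direct hand-off is what gives waiting IO threads a bounded wait; the trade-off, as discussed above, is more forced context switches between cores unless affinity is also managed.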

We can fix 1) easily, even with a much less invasive patch than the ones I have submitted here.  I'm a bit surprised at the apparent disinterest in such an obvious bug and fix.

As for 2), well, see above.  There is nothing we can do, really, except identify the cases where we release the GIL just to yield (one case, actually, in ceval.c) and try to instruct the OS not to switch cores there.  I'll see what I can come up with.

Cheers.