Author kristjan.jonsson
Recipients beazley, dabeaz, flox, kristjan.jonsson, loewis, pitrou, torsten
Date 2010-04-06.22:22:14
SpamBayes Score 0.000134092
Marked as misclassified No
Message-id <1270592538.55.0.790564762224.issue8299@psf.upfronthosting.co.za>
In-reply-to
Content
I just did some profiling.  I'm using Visual Studio Team Edition, which has some fancy built-in profiling.  I decided to compare the performance of the iotest.py script with two CPU threads, running for 10 seconds, with processor affinity enabled and disabled.  I added this code to the script:
if affinity:
    import ctypes
    # Pin the whole process to CPU 0.  -1 is the Windows pseudo-handle
    # for the current process; 1 is the affinity mask (bit 0 = CPU 0).
    ctypes.windll.kernel32.SetProcessAffinityMask(-1, 1)

Regular instruction counter sampling showed no differences.  There were no indications of excessive time being spent in the GIL or any strangeness with the locking primitives.  So, I decided to sample on CPU performance counters.  Following up on my conjecture from yesterday, that this was due to inefficiencies in switching between CPUs, I settled on sampling the instruction fetch stall cycles from the instruction fetch unit.  I sampled every 1,000,000 stalls.  The results are interesting.

With affinity:
Functions Causing Most Work
Name	Samples	%
_PyObject_Call	403	99.02
_PyEval_EvalFrameEx	402	98.77
_PyEval_EvalCodeEx	402	98.77
_PyEval_CallObjectWithKeywords	400	98.28
call_function	395	97.05

Affinity off:
Functions Causing Most Work
Name	Samples	%
_PyEval_EvalFrameEx	1,937	99.28
_PyEval_EvalCodeEx	1,937	99.28
_PyEval_CallObjectWithKeywords	1,936	99.23
_PyObject_Call	1,936	99.23
_threadstartex	1,934	99.13

When we run on both cores, we get almost five times as many instruction fetch stall samples, i.e. L1 instruction cache misses!  So, what appears to be happening is that each time a switch occurs, the L1 instruction cache of the core must be repopulated with the Python evaluation loop, it having been evicted from that core's cache while the thread ran elsewhere.

Note that for this effect to kick in, we need a large piece of code exercising the cache, such as the evaluation loop.  Earlier today, I wrote a simple (Python-free) C program to do similar testing, using a GIL, and found no performance degradation due to multi-core, but that program only had a very simple "work" function.
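For reference, the shape of that experiment can be loosely mirrored in Python itself; this is only a rough analogue of the C program, not the original.  An explicit lock stands in for the GIL, the work function is deliberately tiny (so little instruction-cache effect is expected), and os.sched_setaffinity is the Linux counterpart of the Windows affinity call above (the names N, worker and run are mine):

```python
import os
import threading
import time

N = 200_000

# A stand-in "GIL": one lock that every worker must hold while working.
gil = threading.Lock()
counts = [0, 0]

def worker(idx):
    # Tiny work function, like the one in the C test: too small to
    # stress the instruction cache, so little multi-core penalty.
    for _ in range(N):
        with gil:
            counts[idx] += 1

def run(pin_to_one_cpu):
    # os.sched_setaffinity is Linux-only; on Windows the equivalent is
    # the SetProcessAffinityMask call shown earlier.
    if pin_to_one_cpu and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {0})
    counts[0] = counts[1] = 0
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

# Run the unpinned case first, since pinning persists for the process.
elapsed_multi = run(pin_to_one_cpu=False)
elapsed_single = run(pin_to_one_cpu=True)
print(counts, elapsed_multi, elapsed_single)
```

With such a small work function, both timings should be close, which is consistent with the C result above.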

So, this confirms my hypothesis:  The performance degradation of CPU-bound Python threads on multi-core machines stems from the Python evaluation loop being shuttled back and forth between the instruction caches of the individual cores.

How best to combat this?  I'll do some experiments on Windows.  Perhaps we can identify CPU-bound threads and group them on a single core.
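As a starting point for such experiments, per-thread affinity (rather than the per-process mask used above) might look something like this sketch.  The helpers cpu_mask and pin_current_thread are hypothetical names of mine; SetThreadAffinityMask and GetCurrentThread are real kernel32 calls, and the non-Windows branch is just a placeholder:

```python
import ctypes
import sys

def cpu_mask(cpus):
    """Build an affinity bit mask from an iterable of CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

def pin_current_thread(cpus):
    # On Windows, restrict the *calling thread* (not the whole process)
    # to the given CPUs.  SetThreadAffinityMask returns the previous
    # mask, or 0 on failure.
    if sys.platform == "win32":
        kernel32 = ctypes.windll.kernel32
        return kernel32.SetThreadAffinityMask(
            kernel32.GetCurrentThread(), cpu_mask(cpus))
    return 0  # elsewhere one would use os.sched_setaffinity or similar

print(cpu_mask([0]), cpu_mask([0, 1]), cpu_mask([3]))
```

A CPU-bound thread could call pin_current_thread([0]) on entry to its work loop, grouping all such threads on one core so the evaluation loop stays resident in that core's L1 cache.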
History
Date User Action Args
2010-04-06 22:22:19kristjan.jonssonsetrecipients: + kristjan.jonsson, loewis, beazley, pitrou, flox, dabeaz, torsten
2010-04-06 22:22:18kristjan.jonssonsetmessageid: <1270592538.55.0.790564762224.issue8299@psf.upfronthosting.co.za>
2010-04-06 22:22:16kristjan.jonssonlinkissue8299 messages
2010-04-06 22:22:14kristjan.jonssoncreate