Author kristjan.jonsson
Recipients beazley, dabeaz, flox, kristjan.jonsson, loewis, pitrou, torsten
Date 2010-04-06.22:22:14
SpamBayes Score 0.000134092
Marked as misclassified No
Message-id <1270592538.55.0.790564762224.issue8299@psf.upfronthosting.co.za>
In-reply-to
Content
I just did some profiling.  I'm using Visual Studio Team Edition, which has some fancy built-in profiling.  I decided to compare the performance of the iotest.py script with two CPU threads, running for 10 seconds, with processor affinity enabled and disabled.  I added this code to the script:
if affinity:
    import ctypes
    # Pin the whole process to CPU 0.  -1 is the Windows pseudo-handle
    # for the current process; 1 is the affinity mask (bit 0 = CPU 0).
    ctypes.windll.kernel32.SetProcessAffinityMask(-1, 1)

Regular instruction counter sampling showed no differences.  There were no indications of excessive time being spent in the GIL or any strangeness with the locking primitives.  So, I decided to sample on CPU performance counters.  Following up on my conjecture from yesterday, that this was due to inefficiencies in switching between CPUs, I settled on sampling the instruction fetch stall cycles from the instruction fetch unit.  I sampled every 1,000,000 stalls.  The results are interesting.

With affinity:
Functions Causing Most Work
Name	Samples	%
_PyObject_Call	403	99.02
_PyEval_EvalFrameEx	402	98.77
_PyEval_EvalCodeEx	402	98.77
_PyEval_CallObjectWithKeywords	400	98.28
call_function	395	97.05

Affinity off:
Functions Causing Most Work
Name	Samples	%
_PyEval_EvalFrameEx	1,937	99.28
_PyEval_EvalCodeEx	1,937	99.28
_PyEval_CallObjectWithKeywords	1,936	99.23
_PyObject_Call	1,936	99.23
_threadstartex	1,934	99.13

When we run on both cores, we get almost five times as many instruction fetch stall samples, i.e. L1 instruction cache misses!  So, what appears to be happening is that each time a switch occurs, the L1 instruction cache of the core must be repopulated with the Python evaluation loop, it having been evicted from that core's cache while the thread ran elsewhere.

Note that for this effect to kick in, we need a large piece of code exercising the cache, such as the evaluation loop.  Earlier today, I wrote a simple (Python-free) C program to do similar testing, using a GIL, and found no performance degradation due to multi-core, but that program only had a very simple "work" function.
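For reference, the shape of that experiment can be loosely mirrored in Python itself; this is only a rough analogue of the C program, not the original.  An explicit lock stands in for the GIL, the work function is deliberately tiny (so little instruction-cache effect is expected), and os.sched_setaffinity is the Linux counterpart of the Windows affinity call above (the names N, worker and run are mine):

```python
import os
import threading
import time

N = 200_000

# A stand-in "GIL": one lock that every worker must hold while working.
gil = threading.Lock()
counts = [0, 0]

def worker(idx):
    # Tiny work function, like the one in the C test: too small to
    # stress the instruction cache, so little multi-core penalty.
    for _ in range(N):
        with gil:
            counts[idx] += 1

def run(pin_to_one_cpu):
    # os.sched_setaffinity is Linux-only; on Windows the equivalent is
    # the SetProcessAffinityMask call shown earlier.
    if pin_to_one_cpu and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {0})
    counts[0] = counts[1] = 0
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

# Run the unpinned case first, since pinning persists for the process.
elapsed_multi = run(pin_to_one_cpu=False)
elapsed_single = run(pin_to_one_cpu=True)
print(counts, elapsed_multi, elapsed_single)
```

With such a small work function, both timings should be close, which is consistent with the C result above.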

So, this confirms my hypothesis:  The performance degradation of CPU-bound Python threads on multi-core machines stems from the Python evaluation loop being shuttled back and forth between the instruction caches of the individual cores.

How best to combat this?  I'll do some experiments on Windows.  Perhaps we can identify CPU-bound threads and group them on a single core.
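As a starting point for such experiments, per-thread affinity (rather than the per-process mask used above) might look something like this sketch.  The helpers cpu_mask and pin_current_thread are hypothetical names of mine; SetThreadAffinityMask and GetCurrentThread are real kernel32 calls, and the non-Windows branch is just a placeholder:

```python
import ctypes
import sys

def cpu_mask(cpus):
    """Build an affinity bit mask from an iterable of CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

def pin_current_thread(cpus):
    # On Windows, restrict the *calling thread* (not the whole process)
    # to the given CPUs.  SetThreadAffinityMask returns the previous
    # mask, or 0 on failure.
    if sys.platform == "win32":
        kernel32 = ctypes.windll.kernel32
        return kernel32.SetThreadAffinityMask(
            kernel32.GetCurrentThread(), cpu_mask(cpus))
    return 0  # elsewhere one would use os.sched_setaffinity or similar

print(cpu_mask([0]), cpu_mask([0, 1]), cpu_mask([3]))
```

A CPU-bound thread could call pin_current_thread([0]) on entry to its work loop, grouping all such threads on one core so the evaluation loop stays resident in that core's L1 cache.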
History
Date User Action Args
2010-04-06 22:22:19kristjan.jonssonsetrecipients: + kristjan.jonsson, loewis, beazley, pitrou, flox, dabeaz, torsten
2010-04-06 22:22:18kristjan.jonssonsetmessageid: <1270592538.55.0.790564762224.issue8299@psf.upfronthosting.co.za>
2010-04-06 22:22:16kristjan.jonssonlinkissue8299 messages
2010-04-06 22:22:14kristjan.jonssoncreate