Author kristjan.jonsson
Recipients beazley, dabeaz, kristjan.jonsson, loewis, pitrou, torsten
Date 2010-04-05.21:33:02
SpamBayes Score 9.99201e-16
Marked as misclassified No
Message-id <1270503191.08.0.129234495336.issue8299@psf.upfronthosting.co.za>
In-reply-to
Content
Sorry, what I meant by the "original problem" was the phenomenon observed by Antoine (IIRC) that the same CPU-bound thread tends to hog the GIL, even when releasing it in ceval.c.
What I have been looking at up to now is chiefly IO performance, using David's iotest.py, and improving the poor performance of IO.  IO will not suffer as badly on Windows because the IO thread will get its fair slice of execution time.  Prompted by you, I added this bit of code to iotest.py:
import time  # needed for time.clock()

spins = 0
laststat = 0
def spin():
    global spins, laststat
    task, args = task_pidigits()  # task factory from iotest.py
    while True:
        r = task(*args)
        spins += 1
        t = time.clock()
        if t - laststat > 1:
            print(spins / (t - laststat))  # completed tasks per second
            spins = 0
            laststat = t

You are right, however, that CPU throughput of multiple CPU-bound threads suffers.  In fact, on Windows it appears to suffer the least with the LEGACY_GIL implementation.  This is, I conjecture, because there are far fewer context switches (because relinquishing the GIL usually fails).  My conjecture is that context switches between threads on two cores are expensive enough to dramatically affect performance.  Normal multithreaded programs don't suffer from this because the threads are kept busy, but in our case we are stopping one thread on one core and starting another on a separate core, and this incurs latency.
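The throughput drop is easy to reproduce without the patch at all.  Here is a minimal, hedged sketch (my own illustration, not code from the patch or from iotest.py): the same total amount of pure-Python CPU work is timed on one thread and then split across two, where the GIL forces serialization plus the extra cross-core switch cost described above.  cpu_task is a stand-in for the pidigits task.

```python
import threading
import time

def cpu_task(n=100_000):
    # Pure-Python CPU-bound work; holds the GIL except at switch points.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_threads(num_threads, reps=8):
    """Run `reps` units of cpu_task split across num_threads threads
    and return the elapsed wall-clock time."""
    def worker(count):
        for _ in range(count):
            cpu_task()
    threads = [threading.Thread(target=worker, args=(reps // num_threads,))
               for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    print("1 thread : %.3fs" % run_threads(1))
    print("2 threads: %.3fs" % run_threads(2))
```

On a multicore box the two-thread run is typically no faster, and often slower, than the single-thread run, even though the total work is identical.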

Now, I've improved my patch somewhat.  First, I fixed some minor errors in the PRIORITY_GIL implementation.  More importantly, I added something called FIFOCOND: a condition variable that guarantees the FIFO property.  This was prompted by my observation that even Windows' semaphore doesn't provide it; rather, the Windows scheduler may allow the currently executing thread to jump ahead in the semaphore queue.  The FIFOCOND condition variable fixes that using explicit scheduling, and is intended as a diagnostic tool.
(Antoine, regarding your comment from 13:04 about "roundrobin": you are right insofar as we don't know anything about the condition variable's wakeup order.  I was assuming FIFO behaviour for the sake of argument, and I thought I'd put it in the comments that we assume a general 'fairness' there.  Put in the FIFOCOND and that fairness is guaranteed.)
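To make the FIFO idea concrete, here is a toy sketch in Python of explicit scheduling in the spirit of FIFOCOND (the class and names are illustrative, not the patch's actual C code): each waiter enqueues a private event, and notify wakes waiters strictly in arrival order, so the OS scheduler cannot let a late arrival jump the queue.

```python
import threading
from collections import deque

class FifoCond:
    """Toy FIFO wakeup primitive: waiters are woken strictly in the
    order they arrived, via explicit per-waiter events."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = deque()

    def wait(self):
        ev = threading.Event()
        with self._lock:
            self._waiters.append(ev)  # enqueue in arrival order
        ev.wait()                     # block until it is our turn

    def notify(self):
        with self._lock:
            if self._waiters:
                self._waiters.popleft().set()  # wake oldest waiter only
```

A real implementation must also handle the usual condition-variable subtleties (predicate re-checking, timeouts, interaction with an outer mutex); this only demonstrates the ordering guarantee itself.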


At any rate, I believe my patch provides a useful platform for further experimentation.
1) Factoring out the GIL as a separate type of lock (which it must be)
2) Allowing for different implementations of the GIL
3) shoring up the Condition variable implementation on Windows
4) Providing a FIFOCOND_T type to enforce a particular scheduling order, and demonstrating how we can be explicit about thread scheduling.

I have already demonstrated that the PRIORITY_GIL method fixes the problem with IO threads in the presence of CPU-bound threads.  Your iotest.py script is perfect for this, using 2 worker threads.  On Windows the IO problem wasn't so grave, as I have explained (Windows by default behaves like the ROUNDROBIN_GIL implementation, not the LEGACY_GIL mode used on pthreads).  The PRIORITY_GIL solution is particularly effective with multicore on Windows, but it also improves IO throughput if the CPU affinity of the server is fixed to one CPU, i.e. on singlecore.
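For readers who haven't opened the patch, here is my reading of the priority mechanism as a toy Python model (the class, method names, and flag are illustrative assumptions, not the patch's actual code): a high-priority waiter, such as a thread that has just completed IO, sets a request flag, and the GIL holder honours it at its next periodic check instead of waiting out the full checkinterval.

```python
import threading

class PriorityGil:
    """Toy model of a priority-request GIL: a priority waiter sets a
    flag asking the current holder to drop the lock immediately at its
    next check point."""

    def __init__(self):
        self._lock = threading.Lock()
        self.drop_request = False

    def acquire(self, priority=False):
        if priority:
            self.drop_request = True  # ask the holder to yield now
        self._lock.acquire()
        self.drop_request = False     # request is consumed on acquire

    def release(self):
        self._lock.release()

    def checkinterval_point(self):
        # Called by a CPU-bound thread at each periodic check: yield
        # only if some thread urgently wants the GIL.
        if self.drop_request:
            self.release()
            self.acquire()
```

The point is that an ordinary waiter still pays the full checkinterval latency, while a priority waiter gets the GIL at the very next check, which is why IO latency improves without forcing frequent switches on everyone.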

I have no fix for CPU-bound threads, and I honestly don't think such a fix exists, except by causing switches to happen far less frequently, e.g. by raising the checkinterval, and thus mitigating the problem (which is what the new GIL in py3k does with its timeout implementation).  The IO fix for pthreads, however, stands regardless.

To summarise, then:
1) The GIL has two problems on multicore machines:
 a) performance of CPU-bound threads goes down
 b) performance of IO in the presence of CPU-bound threads is abysmal (though not on Windows)
2) We can fix problem b) on pthreads with the ROUNDROBIN_GIL implementation.
3) We can improve IO performance in the presence of CPU-bound threads on both pthreads and Windows using the PRIORITY_GIL implementation, even to the point of becoming faster than on a single core.
4) We cannot do anything about the decreased performance of co-operatively switching CPU-bound threads on multicore except switch less frequently.  But this is now quite feasible with the PRIORITY_GIL implementation, because it can request an immediate GIL drop when IO is ready, so raising the checkinterval will not negatively affect IO performance.
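Point 4 corresponds to a knob that already exists today.  At the time of this discussion it was the bytecode-count-based sys.setcheckinterval(); the new GIL in py3k replaces it with a time-based switch interval.  A small sketch of raising it (using the py3k-era sys.setswitchinterval API):

```python
import sys

# Raising the switch interval trades IO/switch latency for CPU-bound
# throughput; with a priority-drop mechanism, IO need not pay that cost.
default = sys.getswitchinterval()   # 0.005 s in a default build
sys.setswitchinterval(0.05)         # switch roughly 10x less often
print("switch interval raised to", sys.getswitchinterval())
sys.setswitchinterval(default)      # restore the original value
```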


Please have a look at the latest patch with IO-thread performance in mind.  It is currently configured to enable the PRIORITY_GIL implementation, without the FIFOCOND, on both Windows and pthreads.
History
Date User Action Args
2010-04-05 21:33:11kristjan.jonssonsetrecipients: + kristjan.jonsson, loewis, beazley, pitrou, dabeaz, torsten
2010-04-05 21:33:11kristjan.jonssonsetmessageid: <1270503191.08.0.129234495336.issue8299@psf.upfronthosting.co.za>
2010-04-05 21:33:09kristjan.jonssonlinkissue8299 messages
2010-04-05 21:33:08kristjan.jonssoncreate