Issue 8299: Improve GIL in 2.7

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/52546

classification

Title:	Improve GIL in 2.7
Type:	performance	Stage:
Components:	Interpreter Core	Versions:	Python 2.7

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	beazley, dabeaz, flox, kristjan.jonsson, loewis, pitrou, r.david.murray, rhettinger, techtonik, terry.reedy, torsten
Priority:	normal	Keywords:	patch

Created on 2010-04-03 10:27 by kristjan.jonsson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
gil.patch	kristjan.jonsson, 2010-04-03 10:27
gil2.patch	kristjan.jonsson, 2010-04-05 21:33
ccbench.patch	kristjan.jonsson, 2010-04-11 11:09	improved ccbench with -y option and 'balance'
fair.py	dabeaz, 2010-04-16 11:21	Fair CPU use test
fair.py	kristjan.jonsson, 2010-04-16 19:57	A modified version of David's "fair.py"
thread_pthread.h	dabeaz, 2010-04-17 12:21	Modified cond var GIL with fairness.
evalsrv.rar	kristjan.jonsson, 2010-04-21 23:22

Messages (64)
msg102234 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-03 10:27
This patch does several things: 1) Creates a separate lock type PyThread_type_gil and locking functions for that. This allows tweaking of the GIL without affecting regular lock behaviour. 2) Creates a uniform implementation of the GIL on windows/pthreads using macros, with emulated condition variables on windows (Lifted Antoine's code from py3k, adding own improvements to the slightly problematic windows implementation). This makes the GIL behave the same on windows and pthreads platforms, if we so choose, and allows cross-platform development. 3) provide three GIL implementations: a) legacy gil, which is the same as the one used on pthreads b) a roundrobin gil, which fixes the multicore problem on pthreads and exhibits the same behaviour as the legacy GIL on windows did (no jumping the gil queue) c) a priority based gil, with n given priority levels, and optionally, the ability to request immediate GIL drop by the ceval.c loop. See thread_gil.h for details of the three modes. In my experiments using David Beazley's scripts from http://www.dabeaz.com/blog/dablog.html, implementation "b" fixed the performance problems encountered on multicore machines. This is, I believe, the original impetus for Antoine Pitrou's work on the new GIL. Implementation "c" improved data transfer still, by allowing faster wakeup of completed IO. Please note that I was not able to test this patch on a pthreads machine, I can only hope that it compiles :)
msg102235 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-04-03 10:58
I think this is too late for 2.7. 2.7b1 will be released RSN, and we should not implement such a change after the first beta release.
msg102248 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-03 12:32
Without looking at this patch, I think it would wise to proceed with caution on incorporating any kind of GIL patch into 2.X. If there is anything to be taken away from my own working studying the GIL, it's that the problem is far more tricky than it looks and that any kind of fix has the potential to adversely affect something that you might not expect. That said, I would only suggest that any kind of "new gil" incorporated in 2.X be disabled by default and enabled with special configuration options for those who want to try it out.
msg102250 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-03 12:33
Martin: Well, this patch was originally conceived more as a demonstration of the GIL problem and an alternative fix proposal. However, it is possible to configure it so that there is no change from existing functionality, simply by not including thread_gil.h in thread_pthread.h and thread_nt.h. The only change would then be the presence of the new PyThread_type_gil and associating locking functions which delegate directly to the old PyThread_type_lock functions.
msg102252 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-03 12:37
Antoine: Please take a look, the change is really simple, particularly the ROUNDROBIN_GIL variant which fixes the originally observed problem. the GIL is still a lock, implemented using a mutex and a semaphore. It is modified to work exactly as the lock always has done on windows (which is why the original problem isn't present on that platform). The simplicity of the change stems from the fact that the gil is still just a mutex-type object, which is aqcuired and released just as it has always been. The change is in the internal rules of the mutex, making sure that threads queue up properly and (optionally) that they are released in a priority based order.
msg102254 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-03 12:45
I'm not sure where you're getting your information, but the original GIL problem DEFINITELY exists on multicore Windows machines. I've had numerous participants try it in training classes and workshops they've all observed severely degraded performance for CPU-bound threads on Windows (XP, Vista, and Windows 7).
msg102255 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-03 12:55
Just ran the CPU-bound GIL test on my wife's dual core Windows Vista machine. The code runs twice as slow using two threads as it does using no threads (original observed behavior in my GIL talk).
msg102256 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-03 13:04
Kristjan, I agree with Martin, it's probably too late to make such changes for 2.7. Additionally, your "round-robin" scheme only seems round-robin when there are two threads competing. Otherwise, you could have three threads A, B and C, and the GIL bouncing between A and B. I would advocate opening a separate issue to improve the Windows condition variable code under 3.x.
msg102290 - (view)	Author: Torsten Landschoff (torsten) *	Date: 2010-04-03 20:01
Silly question, I know, but why isn't the GIL just implemented as a lock of the host operating system? After all, we want mutual exclusion, I don't see why condition variables are required for this. I have to admin that I did not look at the source, so the reason might be documented there.
msg102291 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-03 20:11
It's not a simple mutex because if you did that, you would have performance problems much worse than those described in issue 7946. http://bugs.python.org/issue7946
msg102416 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-05 21:33
Sorry, what I meant with the "original problem" was the phenomenon observed by Antoine (IIRC) that the same CPU thread tends to hog the gil, even when releaseing it in ceval.c. What I have been looking at up to now is chiefly IO performance using David's iotest.py, and improving the poor performance of IO. IO will not suffer as badly on windows because the IO thread will get its fair slice of execution time. Promted by you, I added this bit of code to the iotest.py: spins = 0 laststat = 0 def spin(): global spins, laststat task,args = task_pidigits() while True: r= task(*args) spins += 1 t = time.clock() if t-laststat > 1: print spins/(t-laststat) spins = 0 laststat = t You are right, however that cpu throughput of multiple cpu bound thread suffers. And in fact, on windows, it appears to suffer the least using the LEGACY_GIL implementation. This is, I conjecture, because there are far fewer context switches (because relinqushing the GIL fails). My conjecture is that context switches between threads on two cores are so expensive as to dramatically affect performance. Normal multithreaded programs don't suffer from this because the threads are kept busy. But in our case, we are stopping one thread on one core, and starting another on a separate core, and this causes latency. Now, I've improved my patch somewhat. First off, I fixed some minor errors in the PRIORITY_GIL implementation. But more importantly, I added something called FIFOCOND. It is a condition variable that guarantees the FIFO property. This was prompted by my observation that even Windows' Semaphore doesn't do that, rather the windows scheduler may allow the currently executing thread to jump ahead in the semaphore queue. The FIFOCOND condition variable fixes that using explicit scheduling, and is intended as a diagnostic tool. (Antoine, your comment from 13:04 about "roundrobin" inasfar as that we don't know anything about the condition variable behaviour. I was assuming FIFO behaviour for the sake of argument, and I thought I´ put it in to the comments that we assume a general 'fairness' there. Put in the FIFOCOND and you will have that fairness guaranteed.) At any rate, I believe my patch provides a useful platform for further experimentation. 1) Factoring out the gil as a separate type of lock (which it must be) 2) allowing for different implementation of the GIL 3) shoring up the Condition variable implementation on Windows 4) Providing a FIFOCOND_T type to enforce a particular scheduling order, and demonstrating how we can be explicit about thread scheduling. I have already demonstrated that using the PRIORITY_GIL method fixes the problem with IO threads in the presence of CPU bound threads. Your iotest.py script is perfect for this, using 2 worker threads. On windows, the problem with IO wasn't so grave as I have explained (windows by default works as the ROUNDROBIN_GIL implementation, not the LEGACY_GIL mode used on pthreads). The PRIORITY_GIL solution is particularly effective with multicore on Windows, but it also improves IO throughput if cpu affinity of the server is fixed to one CPU, i.e. on singlecore. I have no fix for CPU bound threads, and I honestly don't think such a fix exists, except by causing switches to happen far less frequently, e.g. by raising the checkinterval, and so mitigating the problem (which is what the new gil in py3k does with its timeout implementation) But the IO fix for pthreads To summarise then: 1) The GIL has two problems on multicore machines a) performance of CPU threads goes down b) performance of IO in the presence of CPU threads is abysmal, but not on Windows 2) We can fix problem b) on pthreads with the ROUNDROBIN_GIL implementation. 3) We can improve IO performance in the presence of CPU threads on pthreads and Windows using the PRIORITY_GIL implementation, even to become faster than on a single core. 4) We cannot do anything about decreased performance of co-operatively switching CPU threads on multicore except switching less frequently. But this is quite feasible now with the PRIORITY_GIL implementation because it can request an immediate gil drop when IO is ready. So raising the checkinterval will not affect IO performance in a negative way. Please have a look at the latest patch with IO thread performance in mind. It is currently configured to enable the PRIORITY_GIL implementation without the FIFOCOND on windows and pthreads.
msg102500 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-06 22:22
I just did some profiling. I´m using visual studio team edition which has some fancy built in profiling. I decided to compare the performance of the iotest.py script with two cpu threads, running for 10 seconds with processor affinity enabled and disabled. I added this code to the script: if affinity: import ctypes i = ctypes.c_int() i.value = 1 ctypes.windll.kernel32.SetProcessAffinityMask(-1, 1) Regular instruction counter sampling showed no differences. There were no indications of excessive time being used in the GIL or any strangeness with the locking primitives. So, I decided to sample on cpu performance counters. Following up on my conjecture from yesterday, that this was due to inefficiencies in switching between cpus, I settled on sampling the instruction fetch stall cycles from the instruction fetch unit. I sample every 1000000 stalls. I get interesting results. With affinity: Functions Causing Most Work Name Samples % _PyObject_Call 403 99,02 _PyEval_EvalFrameEx 402 98,77 _PyEval_EvalCodeEx 402 98,77 _PyEval_CallObjectWithKeywords 400 98,28 call_function 395 97,05 affinity off: Functions Causing Most Work Name Samples % _PyEval_EvalFrameEx 1.937 99,28 _PyEval_EvalCodeEx 1.937 99,28 _PyEval_CallObjectWithKeywords 1.936 99,23 _PyObject_Call 1.936 99,23 _threadstartex 1.934 99,13 When we run on both cores, we get four times as many L1 instruction cache hits! So, what appears to be happening is that each time that a switch occurs the L1 instruction cache for each core must be repopulated with the python evaluation loop, it having been evacuated on that core during the hiatus. Note that for this effect to kick in we need a large piece of code excercising the cache, such as the evaluation loop. Earlier today, I wrote a simple (python free) C program to do similar testing, using a GIL, and found no performance degradation due to multi core, but that program only had a very simple "work" function. So, this confirms my hypothesis: The downgrading of the performance of python cpu bound threads on multicore machines stems from the shuttling about of the python evaluation loop between the instruction caches of the individual cores. How best to combat this? I'll do some experiments on Windows. Perhaps we can identify cpu-bound threads and group them on a single core.
msg102502 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-06 22:35
[...] > _PyObject_Call 403 99,02 [...] > affinity off: > Functions Causing Most Work > Name Samples % [...] > _PyObject_Call 1.936 99,23 [...] > _threadstartex 1.934 99,13 > > When we run on both cores, we get four times as many L1 instruction cache hits! You mean we get 4x the number of cache /misses/, right? This analysis is gratuitous if you can't evaluate/measure/calculate the actual cost (in proportion of total elapsed or CPU time) of the instruction cache misses. Perhaps it is actually negligible and the slowdown is caused by something else. > How best to combat this? I'll do some experiments on Windows. > Perhaps we can identify cpu-bound threads and group them on a single > core. IMHO, the OS should handle this. I don't think ad-hoc platform-specific CPU affinity tweaks belong in the Python core.
msg102506 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-06 23:53
The counter is "stall cycles". During the 10 second run on my 2.4Ghz cpu, we had instruction cache miss stalls for 2 billion cycles (2000 samples of 1000000 cycles per sample). That does account for around 10% of the availible cpu. I'm observing something like 20% slowdown, though, so there are probably other causes. Profiling another counter, "instruction fetches", I see this, for a "fast run": Functions Causing Most Work Name Samples % Unknown Frame(s) 10.733 99,49 and for a slow run: Functions Causing Most Work Name Samples % Unknown Frame(s) 8.056 99,48 This shows a 20% drop in fetched instructions in the interval (five seconds this time). Ideally, we should see 12000 samples in the fast case (2.4 ghz, 5s) but we see 10000 due to what cache misses there are in this case. The cache misses in the "slow" case causes effective instruction fetches to drop by 20% on top of that. I think that this is proof positive that the slowdown is due to instruction cache misses, at least on this dual core intel machine that I am using. As for "the OS should handle this", I agree. But it doesn't. We are doing something unusual: Convoying two (or more) threads allowing only one to run at a time. The OS scheduler isn't built for that. It can only assume that there will be some parallel execution and so it thinks that it is best to put the two sequential threads on different cpus. But it is wrong, so the cost associated with cache misses outweighs the benefit of running on another core (zero, in our case). So, the OS won't handle it, no matter how hard we wish that it would. It is us that know how these gridlocked threads behave, and we do so much better than any OS scheduler can guess. So, rather than beat our heads against the rock, I'm going to try to come up with a useful heuristic as to when to switch cores, and when not. It would be useful as a diagnostic tool, if nothing more. Ok, so we have established two things, I think: 1) the poor response of IO threads in the presence of CPU threads on thread_pthreads.h implementations (on multicore) is because of greedy gil wait semantics in the current gil. It's easily fixable by using the implementation ROUNDROBIN_GIL implementation I've shown. 2) The poor performance of competing CPU threads on multicore machines is due to the instruction cache behaviour of non-overlapping thread execution on different cores. We can fix 1) easily, even with a much less invasive patch than the ones I have put in here. I'm a bit surprised at the apparent disinterest in such an obvious bug / fix. As for 2), well, see above. Nothing we can do, really, except identify those cases where we are releasing GIL just to yield (one case, actually, ceval.c) and try to instruct the OS not to switch cores in that case. I'll see what I can come up with. Cheers.
msg102507 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-07 00:08
The analysis of instruction cache behavior is interesting---I could definitely see that coming into play given the heavy penalty that one sees going to multiple cores (it's a side effect in addition everything else that goes wrong such as a huge increase in the number of system calls). I will only point out that messing around with processor affinities is going to be problematic. There are C/C++ extensions to Python that intentionally release the GIL and want to run fully multithreaded across as many cores as might be available. Setting a processor affinities is going to be the exact opposite of what you want for code like that.
msg102508 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-07 00:11
> The counter is "stall cycles". > During the 10 second run on my 2.4Ghz cpu, we had instruction cache > miss stalls for 2 billion cycles (2000 samples of 1000000 cycles per > sample). That does account for around 10% of the availible cpu. Ok, thanks. > 2) The poor performance of competing CPU threads on multicore machines > is due to the instruction cache behaviour of non-overlapping thread > execution on different cores. Have you tried your measurement approach with ccbench? > We can fix 1) easily, even with a much less invasive patch than the > ones I have put in here. I'm a bit surprised at the apparent > disinterest in such an obvious bug / fix. As already said, it's too late for 2.7. And the fix in 3.2 is most probably better.
msg102758 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-09 23:34
David, yes messing about with processor affinities is certainly not nice. Especially since the issue is cross-platform. The pthreads api doesn't offer much. There is pthreadd_setschedparam(), and pthreads_setconcurrency(). Unfortunately I don't have a pthreads machine to test that with. On windows, one possibility would be to switch to fibers, in the case of a yielding thread. I don't know if that would change anything, or if the thread-to-fiber and vice versa conversion is lightweight enough to be used dynamically. Antoine: I'm not familiar with ccbench. I´ll look into it. As for my FIFO fix, py3k is trying to do more, namely get rid of the checkinterval. It is most certainly a more complex solution and with it its own set of problems. The only thing that needs fixing is to add "fairness" to the GIL. I know that this is coming a bit late for 2.7 and I'm not pushing it as such for 2.7. But after 2.7 comes 2.8 (and so on ad infinitum) But I'm also pointing out the obvious problem and an obvious simple fix which doesn't involve inventing a whole new system. I would have thought that this should at least spark some enthusiasm. It's unfortunate, maybe, that I only realized so late that the pythread GIL was implemented using a homebrew condition variable mechanism. I always thougth (being a windows guy) that it were simply using the pthread_mutex() and thus the greedy behaviour of the GIL could be ascribed to that. Anyway, I´ll continue giving this patch some love. I wouldn't be surprised if it, and especially the "priority" variant, would be appealing to people doing e.g. webservers with 2.x technology. Another thing that the "priority" patch has done is convince me that I really need to implement this scheduling mode in stackless, since it does appear to help network latency when using FIFO scheduling of threads / tasklets. Cheers!
msg102759 - (view)	Author: anatoly techtonik (techtonik)	Date: 2010-04-09 23:47
If it really improves multicore performance and none of our test fail (even in memory/resource/time survival tests) then I'd give it a try even after a beta. 2.x is still the best practical version out there.
msg102823 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-11 11:09
I looked at ccbench. It's a great tool. I've added two features to it (see the attached patch) -y option to turn off the "do_yield" option in throughput, and so measure thread scheduling without assistance, and the throughput option now also computes "balance", which is the standard deviation of the throughput of each thread normalized by the average. I give you three results for throughput, to demonstrate the ROUNDROBIN_GIL implementation: 1) LEGACY_GIL, no forced switching C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -y -t == CPython 2.7a4+.0 (trunk) == == AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' == --- Throughput --- Pi calculation (Python) threads= 1: 672 iterations/s. balance threads= 2: 597 ( 88%) 0.4243 threads= 3: 603 ( 89%) 0.2475 threads= 4: 596 ( 88%) 0.4776 regular expression (C) threads= 1: 571 iterations/s. balance threads= 2: 565 ( 98%) 0.6203 threads= 3: 567 ( 99%) 1.6867 threads= 4: 570 ( 99%) 1.1670 SHA1 hashing (C) threads= 1: 1269 iterations/s. balance threads= 2: 1268 ( 99%) 1.1470 threads= 3: 1270 (100%) 0.6024 threads= 4: 1263 ( 99%) 0.7419 LEGACY_GIL, with forced switching C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t == CPython 2.7a4+.0 (trunk) == == AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' == --- Throughput --- Pi calculation (Python) threads= 1: 663 iterations/s. balance threads= 2: 605 ( 91%) 0.0232 threads= 3: 599 ( 90%) 0.1988 threads= 4: 601 ( 90%) 0.4648 regular expression (C) threads= 1: 568 iterations/s. balance threads= 2: 562 ( 99%) 0.1737 threads= 3: 571 (100%) 0.3950 threads= 4: 566 ( 99%) 0.3158 SHA1 hashing (C) threads= 1: 1275 iterations/s. balance threads= 2: 1267 ( 99%) 0.7238 threads= 3: 1271 ( 99%) 0.2405 threads= 4: 1270 ( 99%) 0.1508 Using the forced "do_yield" helps balance things, but not much. We still have a .7 balance in SHA1 hashing for two threads. Now, for ROUNDROBIN_GIL, and no forced switching: C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t -y == CPython 2.7a4+.0 (trunk) == == AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' == --- Throughput --- Pi calculation (Python) threads= 1: 672 iterations/s. balance threads= 2: 485 ( 72%) 0.0289 threads= 3: 448 ( 66%) 0.0737 threads= 4: 476 ( 70%) 0.0408 regular expression (C) threads= 1: 569 iterations/s. balance threads= 2: 551 ( 96%) 0.0505 threads= 3: 551 ( 96%) 0.1637 threads= 4: 551 ( 96%) 0.2020 SHA1 hashing (C) threads= 1: 1271 iterations/s. balance threads= 2: 1262 ( 99%) 0.0111 threads= 3: 1207 ( 94%) 0.0143 threads= 4: 1202 ( 94%) 0.0317 Notice the much better balance value, and this is without the forced sleep. Also note a lower througput when computing pi with threads. This is because yielding every 100 opcodes now actually works, and the aforementioned instruction cache problem kicks in. Increasing the checkinterval to 1000 solves this: C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t -y -i100 0 == CPython 2.7a4+.0 (trunk) == == AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' == --- Throughput --- Pi calculation (Python) threads= 1: 673 iterations/s. balance threads= 2: 628 ( 93%) 0.0000 threads= 3: 603 ( 89%) 0.0284 threads= 4: 606 ( 90%) 0.0328 regular expression (C) threads= 1: 570 iterations/s. balance threads= 2: 569 ( 99%) 0.2729 threads= 3: 562 ( 98%) 0.6595 threads= 4: 560 ( 98%) 1.2440 SHA1 hashing (C) threads= 1: 1265 iterations/s. balance threads= 2: 1256 ( 99%) 0.0000 threads= 3: 1264 ( 99%) 0.0759 threads= 4: 1255 ( 99%) 0.1309 If no one objects, I'd like to submit this changed ccbench.py to the trunk.
msg102827 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-11 12:21
Fyi, here is the output using the unmodified Windows GIL, i.e. without my patch being active: C:\pydev\python\trunk\PCbuild>python.exe ..\Tools\ccbench\ccbench.py -t -y == CPython 2.7a4+.0 (trunk) == == AMD64 Windows on 'Intel64 Family 6 Model 23 Stepping 6, GenuineIntel' == --- Throughput --- Pi calculation (Python) threads= 1: 623 iterations/s. balance threads= 2: 489 ( 78%) 0.0289 threads= 3: 461 ( 74%) 0.0369 threads= 4: 460 ( 73%) 0.0426 regular expression (C) threads= 1: 515 iterations/s. balance threads= 2: 548 (106%) 0.0771 threads= 3: 532 (103%) 0.0556 threads= 4: 523 (101%) 0.1132 SHA1 hashing (C) threads= 1: 1188 iterations/s. balance threads= 2: 1212 (102%) 0.0232 threads= 3: 1198 (100%) 0.0250 threads= 4: 1215 (102%) 0.0163 You see results virtually identical to the ROUNDROBIN_GIL implementation. This is just do demonstrate that Windows has had the ROUNDROBIN_GIL behaviour all along.
msg102828 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-11 12:22
I must be missing something, but why, exactly would you want multiple CPU-bound threads to yield every 100 ticks? Frankly, that sounds like a horrible idea that is going to hammer your system with excessive context switching overhead and cache performance problems---an effect that you, yourself have actually observed. The results of ccbench also show worse performance for the round-robin GIL because of this. Although the legacy GIL signals every 100 ticks, threads do not context switch that rapidly. In fact, on single CPU systems, they context switch at about the same rate as the system time-slice (5-10 milliseconds on most systems). The new GIL implemented by Antoine also does not rapidly switch CPU-bound threads. Again, I must be missing something, but I don't see how this round-robin GIL and all of this forced thread switching is anything that you would ever want--especially for CPU-bound threads. It seems to go against just about every design goal that people usually have for schedulers (especially the goal of minimizing context switching overhead). Again, maybe I'm just being dense and missing something.
msg102830 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-11 12:33
Sorry, but I don't see how you can say that the round-robin GIL and the legacy GIL have the same behavior based solely on the result of a performance benchmark. Do you have any kind of thread scheduling trace that proves they are scheduling threads in exactly the same manner? Maybe they both have lousy performance, but for different reasons.
msg102831 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-11 12:38
> SHA1 hashing (C) > > threads= 1: 1275 iterations/s. balance > threads= 2: 1267 ( 99%) 0.7238 > threads= 3: 1271 ( 99%) 0.2405 > threads= 4: 1270 ( 99%) 0.1508 > > Using the forced "do_yield" helps balance things, but not much. We > still have a .7 balance in SHA1 hashing for two threads. Which is not unreasonable, since SHA1 releases the GIL. The unbalance would be produced by the Windows scheduler, not by Python. Note: "do_yield" is not meant to "balance" things as much as to make measurements meaningful at all. Without switching at all during say 2 seconds, the numbers become totally worthless. > If no one objects, I'd like to submit this changed ccbench.py to the trunk. Please let me take a look.
msg102838 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-11 14:46
David, I don't necessarily think it is reasonable to yield every 100 opcodes, but that is the _intent_ of the current code base. Checkinterval is set to 100. If you don't want that, then set it higher. Your statement is like saying: "Why would you want to have your windows fit tightly, it sounds like a horrible thing for the air quality indoors" (I actually got this when living in Germany). The answer is, of course, that a snugly fitting window can still be opened if you want, but more importantly, you _can_ close it properly. And because the condition variable isn't strictly FIFO, it actually doesn't switch every time (an observation. The scheduler may decide o do its own things inside the condition variable / semaphore). What the ROUNDROBIN_GIL ensures, however, is that the condition variable is _entered_ every checkinterval. What I'm trying to demonsrate to you is the brokenness of the legacy GIL (as observed by Antoine long ago) and how it is not broken on windows. It is broken because the currently running thread is biased to reaquire the GIL immediately in an unpredictable fashion that is not being managed by the (OS) thread scheduler. Because it doesn't enter the condition variable wait when others are competing for it, the scheduler has no means of providing "fairness" to the application. So, to summarise this: I'm not proposing that we context switch every 100 opcodes, but I am proposing that we context switch consistently according to whatever checkinterval is put in place. Antoine, in case you misunderstood: I´m saying that the ROUNDROBIN_GIL and the Windows GIL are the same. If you don't believe me, take a look at the NonRecursiveLock implementation for windows. I'm also starting to think that you didn't actually bother to look at the patch. Please compare PyLock_gil_acquire() for LEGACY_GIL and ROUNDROBIN_GIL and see if you can spot the difference. Really, it's just two lines of code. Maybe it needs restating. The bug is this (python pseudocode) with gil.cond: while not gil.locked: #this line is the bug gil.cond.wait() gil.locked = True vs. with gil.cond: if gil.n_waiting or gil.locked: gil.n_waiting += 1 while True: gil.cond.wait() #always wait at least once if not gil.locked: break gil.n_waiting -= 1 gil.locked = True The cond.wait() is where fairness ensues, where the OS can decide to serve threads roughly on a first come, first serve basis. If you are biased towards not entering it at all (when yielding the GIL), then you have taken away the OS' chance of scheduling. Antoine (2): The need to have do_yield is a symptom of the brokenness of the GIL. You have a checkinterval of 100, which elapses some 1000 times per second, and yet you have to put in place special fudge code to ensure that we do get switches every few seconds? The whole point of the checkinterval is for you _not_ to have to dot the code with sleep() calls. Surely you don't expect the average application developer to do that if he wants his two cpu bound threads to compete fairly for the GIL? This is why I added the -y switch: To emulate normal application code. Also, the 0.7 imbalance observed in the SHA1 disappears on windows, (and using ROUNDROBIN_GIL). It is not due to the windows scheduler, it is due to the broken legacy_gil. This last slew of comments has been about the ROUNDROBIN_GIL only. I haven't dazzled you yet with PRIORITY_GIL, but that solves both problems because it is _fair_, and it allows us to increase the checkinterval to 10000, thus elimintating the rapid switching overhead, and yet gives fast response to IO.
msg102840 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-11 14:59
> Antoine (2): The need to have do_yield is a symptom of the brokenness > of the GIL. Of course it is. But the point of the benchmark is to give valid results even with the old broken GIL. I could remove do_yield and still have it give valid results, but that would mean running each step for 30 seconds instead of 2. I don't like having to wait several minutes for benchmark numbers :-)
msg102845 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-11 15:20
I'm sorry, I still don't get the supposed benefits of this round-robin patch over the legacy GIL. Given that using interpreter ticks as a basis for thread scheduling is problematic to begin with (mostly due to the fact that ticks have totally unpredictable execution times), I'd much rather see further GIL work continue to build upon the time-based scheduler that's been implemented in Python 3.2. For instance, I think being able to specify a thread-switching interval in seconds (sys.setswitchinternal) makes much more sense than continuing to fool around with check intervals and all of this tick business. The new GIL implementation is by no means perfect, but people are working on it. I'd much rather know if anything that you've worked out with this patch can be applied to that version of the GIL.
msg103087 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-13 22:48
Maybe the state of this discussion is my fault for not being clear enough. Let's abandon terms such as "broken" and "roundrobin." CS theory has the perfectly useful terms "fair" and "unfair." The fact of the matter is this: the pthread GIL (implemented as LEGACY gil) is an "unfair" syncronization primitve. The GIL on windows, (and the poorly named ROUNDROBIN_GIL) is a "fair" synchronization primitive. Unfair mutexes have their place, and such is the behaviour of the windows condition variable (and the pthreads mutex, I suspect). But they are not useful if you want to provide fair access to a resource that is held all the time. Until after that GIL business at PyCon I wasn't aware of this fundamental difference between the GIL on windows and pthreads platforms in 2.x. It is astonishing to me that no one appears to have noticed the difference, or made much of it. CPU threads are scheduled fairly on windows, and incredibly unfairly on pthreads. with the ROUNDROBIN_GIL I'm not proposing anything radical, I'm just suggesting that we adopt the superior behaviour that has been on windows all along. Yes, people actually do use windows. Antoine, I understand that your point about do_yield, yet the results for 3 seconds without it are telling on their own, and worthy of being studied, which is why I suggested disabling it. Also, I think you will find that he imbalance in the throughput of the threads won't go away even after 30 seconds. Unfortunately, the unfairness is such that it may actually diverge. I've improved my patch some more. I'll upload it soon. In particular, I've addea a PyThread_gil_yield() method to enable whatever underlying gil there is to possible deal with this particular locking case differently, if possible, perhaps suggesting to the OS not to switch cores. I´ve also created a simple program in visual studio to examine a GIL outside the context of python. I'll put it here tomorrow too, for those interested. It allows for simpler experimentation, although because the loop is small, you won't see the effect of the instruction cache problems. David, I actually think that the checkinterval is a perfectly good mechanism, especially if augmented with an interrupt mechanism What does it matter if some opcodes are slower than others? when we are checking every 100 or 1000 (or 10000 as I am proposing) that hardly matters. We just need to have an order of magnitude thing there. But there are other ways to do it. You can use a timer on windows, and on pthreads too, I think. But the whole point of this patch is to take a step back, and to see if there is a way to fix the "gil problem" in a simpler way by first trying to understand it fully, and then apply minimal changes to solve it.
msg103088 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-13 23:01
Kristjan, > Maybe the state of this discussion is my fault for not being clear enough. It's quite a bit simpler. The first 2.7 beta has been released and there's IMO no way such patches will be accepted. It doesn't seem to be a pressing enough issue to be considered a real bug. As you said yourself, most people actually aren't really affected, or not enough. > CPU threads are scheduled fairly on windows, and incredibly unfairly > on pthreads. pthreads doesn't schedule anything. The kernel does. I'm sure that on non-tiny periods (>= 5s) they are scheduled quite fairly. It's just that switching occurs less often and less regularly than you'd might hope. > Antoine, I understand that your point about do_yield, yet the results > for 3 seconds without it are telling on their own, and worthy of being > studied, which is why I suggested disabling it. As I said, they will render 2.x results completely wrong (at least under Linux). > Also, I think you will find that he imbalance in the throughput of the > threads won't go away even after 30 seconds. I'm actually not really interested in confirming this, but as I said there's no reason to think that the Linux kernel does a bad job. (the one reputed to do a bad job at scheduling, especially for desktop environments, is the Windows kernel) > I've improved my patch some more. I'll upload it soon. If you are interested in taking it further, I would recommend publishing your patch (and prebuilt binaries, if you care) somewhere else as well, because as I said there's probably no way it gets integrated during what remains of the 2.x timeline. Of course, other developers might disagree with me, in which case your patch /can/ be integrated. But I don't see a lot of interest showing honestly. > We just need to have an order of magnitude thing there. Duration of opcodes can vary by more than an order of magnitude. ccbench includes such testing by the way (different CPU-bound workloads)
msg103093 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-04-13 23:22
> Maybe the state of this discussion is my fault for not being clear > enough. Let's abandon terms such as "broken" and "roundrobin." CS > theory has the perfectly useful terms "fair" and "unfair." The fact > of the matter is this: the pthread GIL (implemented as LEGACY gil) is > an "unfair" syncronization primitve. That's not really true. The Linux condition variable (from glibc linuxthreads), for example, implements "fair" synchronization. Other implementations may do the same.
msg103098 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-13 23:41
What bothers me most about this discussion is that the Windows implementation (legacy GIL) is being held up as an example of what we should be doing on posix. Yet, if I go run the same thread tests that I presented in my GIL talks on a multicore Windows machine, the performance is every bit as bad, if not worse, than what I reported in my talk. Therefore, why would we want that? I just don't get it.
msg103215 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-15 14:09
Here is yet another point: if _POSIX_SEMAPHORES is defined, thread_pthread.h is designed to use the (fair) semaphore. If it is not present, or HAVE_BROKEN_POSIX_SEMAPHORES defined, the semaphore is supposed to be emulated using a condition variable. Now, I don't have access to a mac or linux machine, but does a modern python build perhaps actually have USE_SEMAPHORES defined? if so, then this entire rant about a broken lock on pthreads is nonsense. Please note that the "emulated" semaphore is unfair, as I've pointed out, whereas a posix_sem object strives to be fair. So this "emulation" is not working.. Martin, you are right that some mutexes are indeed fair. There has been a move towards using unfair mutexes, particularly on multicore machines. This is because they reduce the "lock convoying" problem. A fair mutex hands off the lock to a waiting thread. That thread is then made runnable. But on a busy system, it may take a while for that thread to actually start running and use the locked resource. The reesult is that the locked resource is unavailable for a longer time. An unfair mutex will wake up a waiting thread, yet have that thread compete for the mutex with any interloper that might arrive and claim it. See e.g. http://www.bluebytesoftware.com/blog/PermaLink,guid,e40c2675-43a3-410f-8f85-616ef7b031aa.aspx and http://developer.amd.com/documentation/articles/Pages/282007123.aspx
msg103219 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-15 14:28
> if _POSIX_SEMAPHORES is defined, thread_pthread.h is designed to use > the (fair) semaphore. If it is not present, or > HAVE_BROKEN_POSIX_SEMAPHORES defined, the semaphore is supposed to be > emulated using a condition variable. > Now, I don't have access to a mac or linux machine, but does a modern > python build perhaps actually have USE_SEMAPHORES defined? Yes, it does. Actually, I find it unlikely that any modern Unix would fall back on the non-semaphore version. All this code is (mostly) very old.
msg103224 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-15 15:39
Oh dear. I was assuming that the mutex+condition variable were the actual implementation mostly in use on pthreads. This is because of David's GIL open talk at pycon, where we were looking at the source and bickering about the placement of "pthread_cond_signal()" being after the "pthread_mutex_unlock()" call. In which case, more than half of this thread is invalid. I could, perhaps, start a new defect: "semaphore emulation using condition variable is broken". However, I just asked a colleague with a os X to compile python 2.7 and _POSIX_SEMAPHORES isn't defined, and so, it is running using the emulation. Why, I wonder? Isn't it defined in unistd.h? Martin, I don't know if you were suggesting that a "fair" mutex would make the emulated semaphore fair too. You probably weren't, but just in case, the fairness of the mutex is immaterial because it is only held for a short time to guard the internal state of the "semaphore". You won't see threads queing up on it, but they will queue on the Contition variable.
msg103225 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-15 15:42
> However, I just asked a colleague with a os X to compile python 2.7 > and _POSIX_SEMAPHORES isn't defined, and so, it is running using the > emulation. Why, I wonder? Isn't it defined in unistd.h? Perhaps a bad combination of defines. Has he checked that the semaphore path isn't used at all? (just put a #error in the other path) If so, opening an issue would be good. I would hope we can drop the emulation path one day.
msg103228 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-15 16:34
Yes, we put #error in both places (defining and undefining USE_SEMAPHORES). The colleague in question is Christian Tismer, he is unlikely to have gotten it wrong. I am also curious why David Beazley kept talking about the "binary semaphore" when it is apparent that that is supposed to be a "hack" to use on platforms that don't have posix semaphores. This gets curiouser and curiouser.
msg103230 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-15 16:48
> Yes, we put #error in both places (defining and undefining > USE_SEMAPHORES). The colleague in question is Christian Tismer, he is > unlikely to have gotten it wrong. Ok, so can you or Christian open an issue about it? We should try to fix it. > I am also curious why David Beazley kept talking about the "binary > semaphore" when it is apparent that that is supposed to be a "hack" to > use on platforms that don't have posix semaphores. I think David often uses technical terms (such as "semaphore" or "signal") in more generic meanings than what you might expect :)
msg103232 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-15 16:54
You do realize, that if we enable the USE_SEMAPHORE, we get the GIL behaviour as seen on windows and with my ROUNDROBIN_GIL implementation, right? Also, at the GIL open space talk on PyCon, David did show us the "emulation" source code as if it were _the_ gil. Well, maybe he can explain it.
msg103233 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-15 17:01
> You do realize, that if we enable the USE_SEMAPHORE, we get the GIL > behaviour as seen on windows and with my ROUNDROBIN_GIL > implementation, right? I haven't studied this argument, but I don't see how that contradicts anything. The main issue witnessed with the 2.x GIL -- and the point of Dave Beazley's original talk -- is CPU inefficiency (due to far too many lock operations).
msg103234 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-04-15 17:04
My understanding is that David noticed the problem originally on MacOS. If the emulation is indeed being used on that platform (and a little googling indicates the MacOS posix semaphore implementation is considered at least slightly broken, and FreeBSD didn't support it until 7.2), then perhaps that is why he was looking at that code.
msg103235 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-04-15 17:05
Also note that his results were much worse on MacOS than anyone was seeing on Linux, which may support this theory :)
msg103236 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-15 17:09
I hope everyone realizes that all of this bike-shedding about emulated semaphores versus "real" semaphores is mostly a non-issue. For one thing, go look at how a "real" semaphore is implemented by reading the source code to pthreads or some other thread library. You'll find that semaphores are implemented using the exact same mechanisms that underly condition variables and in some cases, are actually implemented using a mutex lock and a condition variable exactly as Python is doing. Second, the performance of using "real" semaphores still sucks. So, all of the arguing about "fairness" and whatnot seems to be a total waste of time in my opinion because it doesn't address the underlying problem.
msg103238 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-15 17:23
David, I urge you to reconsider: The "emulated" semaphore is broken because it is unfair. It is clearly a programming error, born out of naivete about how to implement such primitives. Proper semaphores therefore cannot be implemented using the "exact same mechanism" because proper semaphores are fair, this one isn't. You do understand why exactly it is unfair, don't you? Second, with a fair GIL you still get poor performance on multicore with low values of "tickinterval" but at least you get predictable scheduling. The emulated semaphore is bad in two ways: Unpredictable scheduling with thread starvation _and_ poor multicore performance. I don't understand why you prefer having two problems to one. I also think it is worth investigating when exactly the "emulaton" semaphore became the "standard". Did something break in the config script at some point?
msg103239 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-15 17:32
Googling a bit gave me this: http://lists.apple.com/archives/darwin-kernel/2005/Dec/msg00022.html It would appear that mac os X was at least lacking full posix semaphore support in 2005.
msg103258 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-04-15 21:09
> Googling a bit gave me this: > http://lists.apple.com/archives/darwin-kernel/2005/Dec/msg00022.html > It would appear that mac os X was at least lacking full posix semaphore support in 2005. Hmm. OS X really sucks.
msg103310 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-16 10:46
I'm sorry, but even in the presence of fair locking, I still don't like this patch. The main problem is that it confuses fair locking with fair CPU use---something that this patch does not and can not achieve on any platform. The main problem is that everything is still based on the execution of interpreter ticks. However, interpreter ticks have wildly varying execution times dependent upon the code that's running. Thus, executing 1000 ticks might take significantly longer in one thread than another. Under a FIFO scheduler based on "fair" locking, the thread with the longer-running ticks is going to unfairly hog the GIL and the CPU. For example, if thread 1 takes 95 usec to execute 1000 ticks and thread 2 takes 5 usec to execute 1000 ticks, then thread 1 is going to end up hogging about 95% of the CPU cycles, starving thread 2. To me, that doesn't sound especially "fair." It would be much better to have fairness where threads are guaranteed to get an equal time slice of CPU cycles regardless of how many ticks they're executing. In other words, it would be much better if the two threads above each got 50% of the CPU cycles. The only way you're ever going to be able to do that is to base thread scheduling on timing. The new GIL in Python 3 makes an effort to do this even though some issues are still being worked out with it. On a slightly unrelated note, I just tried some experiments on Linux with the GIL implemented as condition variables and with semaphores. I honestly didn't see any noticeable performance difference between the two versions. I also didn't see any kind of purported "fair" scheduling of threads using the semaphore version. Both versions exhibit the same performance problems as described in my GIL talk (albeit not to the same extreme as on OS-X). Based on my own reading of the pthreads source code (yes, I have looked), I can't really draw any conclusion about the fairness of semaphores. Under the covers, it's all based on futex locks (the "f" in futex referring to "fast", not "fair" by the way). I know that the original paper on futexes has some experiments with fair lock scheduling, but I honestly don't know if that is being used in the Linux kernel or by pthreads. My understanding is that by default, futexes do not guarantee fairness. To know for certain with semaphores, much more low-level investigation would be required.
msg103311 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-16 11:21
I've attached a test "fair.py" that gives an example of the fair CPU scheduling issue. In this test, there are two threads, one of which has fast-running ticks, one of which has slow-running ticks. Here is their sequential performance (OS-X, Python 2.6): slow: 5.71 fast: 0.32 Here is their threaded performance (OS-X, Python 2.6.4): slow : 5.99 fast : 6.04 (Notice : Huge jump in execution, unfair CPU) Here is their threaded performance using the Py3K New GIL: slow : 5.96 fast : 0.67 (Notice : Fair CPU use--time only doubled) Using Linux with semaphores gives no benefit here. The fast code is stalled in the same way. For example: here are my Linux results (Ubuntu 8.10, Python-2.6.4, dual-core, using semaphores): Sequential: slow : 6.24 fast : 0.59 Threaded: slow : 6.40 fast : 6.69 (even slower than the slow code!)
msg103355 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-16 19:57
What your fair.py is doing is demonstrating the superior behaviour of a time-based GIL interrupt to a bytecode based one. I have no quibbles with that and I agree that it is superior. But I also think that your example is a very artificial one. On average the duration of the bytecodes evens out between threads and so they should give a fair first approximation. But this is not what I was talking about when considering fairness. Your test only measures each thread end to end, and it demonstrates how a bytecode based gil yilelding system wil let the two threads work in lockstep, even though each loop in one thread is cheaper than the other. Fair enough. But fair scheduling between threads doesn't show up here. To demonstrate fair / unfair, I've modified fair.py to add two more runs, where three threads of the "fast" variety are run for identical number of rounds. This is when you will se a difference between the linux and the mac based GIL implementations. For your info, on my windows dual core office box, with regular windows gil: D:\pydev\python\trunk\PCbuild>python.exe d:\pyscript\fair.py Sequential execution slow: 3.384000 (0 left) fast: 0.177000 (0 left) Threaded execution slow: 3.435000 (0 left) fast: 3.568000 (0 left) Treaded, balanced execution: fast A: 0.973000 (0 left) fast C: 0.992000 (0 left) fast B: 1.013000 (0 left) Treaded, balanced execution, with quickstop: fast A: 0.977000 (0 left) fast C: 0.976000 (252 left) fast B: 0.978000 (17601 left) And now, same box, with the unfair GIL: D:\pydev\python\trunk\PCbuild>python.exe d:\pyscript\fair.py Sequential execution slow: 3.338000 (0 left) fast: 0.177000 (0 left) Threaded execution fast: 0.382000 (0 left) slow: 3.539000 (0 left) Treaded, balanced execution: fast A: 0.362000 (0 left) fast B: 0.464000 (0 left) fast C: 0.549000 (0 left) Treaded, balanced execution, with quickstop: fast B: 0.389000 (0 left) fast A: 0.447000 (240480 left) fast C: 0.360000 (613098 left) The two last cases are the interesting ones. With unfair scheduling, one thread takes almost twice as long to complete its 1000000 inserts than another. And if they are all stopped when the quickest one finishes, one thread has more than 600000 iterations to go. This is what I mean by fair/unfair scheduling. Cheers, Kristján p.s. Yes, I agree that time based GIL yielding is better. I intentionally didn't want to confuse the matter with that in 2.x. I wanted to address the other issues that are wrong.
msg103384 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-17 03:18
I'm not trying to be a pain here, but do you have any explanation as to why, with fair scheduling, the observed execution time of multiple CPU-bound threads is substantially worse than with unfair scheduling? From your own benchmarks, consider this result (Fair scheduling) Treaded, balanced execution: fast A: 0.973000 (0 left) fast C: 0.992000 (0 left) fast B: 1.013000 (0 left) Versus this result with unfair scheduling: Treaded, balanced execution: fast A: 0.362000 (0 left) fast B: 0.464000 (0 left) fast C: 0.549000 (0 left) If I'm reading this right, it takes the three threads with fair locking almost twice as long to complete (1.01s) as the three threads with unfair locking (0.55s) . If so, why would I want fair locking? Wouldn't I want the solution that offers the fastest overall execution time?
msg103385 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-17 03:22
One other comment. Running the modified fair.py file on my Linux system using Python compiled with semaphores shows they they are definitely not fair. Here's the relevant part of your test: Treaded, balanced execution, with quickstop: fast C: 1.580815 (0 left) fast B: 1.636923 (158919 left) fast A: 1.788634 (310323 left)
msg103396 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-17 10:53
>I'm not trying to be a pain here, but do you have any explanation as to >why, with fair scheduling, the observed execution time of multiple CPU->bound threads is substantially worse than with unfair scheduling? Yes. This is because the GIL yield now actually succeeds most of the time, every 100 opcodes (the default). Profiling indicates that on multicore machines this causes significant instruction cache misses as the new thread is scheduled on the other core. Try raising the sys.checkinterval to 1000 and watch the performance difference fall away. >If so, why would I want fair locking? Wouldn't I want the solution >that offers the fastest overall execution time? Because of IO latency. Your IO thread, waiting to wake up when somethin happen also has to compete unfairly with the CPU threads. "a fair" lock allows you to balance the cost of switching (on multicore) vs thread latency using sys.checkinterval (if we are based on opcodes). The unfair lock all but disables this control. >One other comment. Running the modified fair.py file on my Linux >system using Python compiled with semaphores shows they they are >definitely not fair. Here's the relevant part of your test: Interesting. let me stress that I am using windows and making assumptions about how something like sem_wait() behaves. And the posix standard does not require "fair" behaviour of the semaphore functions according to http://www.opengroup.org/onlinepubs/009695399/functions/sem_wait.html. The kernel-based windows synchronization functions achieve an amount of "fairness" through WaitForSingleObject() and friends: http://msdn.microsoft.com/en-us/magazine/cc163642.aspx#S2 Are you absolutely sure that USE_SEMAPHORES is defined? There could be a latent config problem somewhere.
msg103397 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-17 11:16
I'm definitely sure that semaphores were being used in my test---I stuck a print statement inside the code that creates locks just to make sure it was using the semaphore version :-). Unfortunately, at this point I think most of this discussion is academic since no change is likely to be incorporated into Python 2.7. I can definitely see where fairness might help I/O performance if there is only 1 CPU bound thread. I just don't know for other situations. For example, if you have a server where it's all I/O-bound threads, but it suddenly comes under extreme load (e.g., slashdot effect), does a fair GIL help or hurt with that? I just don't know. In the big picture, all of the issues raised here should be on the minds of people fixing the GIL in py3k though. It's just one more aspect of why fixing the GIL is hard.
msg103399 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-17 12:21
As a followup, since I'm not sure anyone actually here actually tried a fair GIL on Linux, I incorporated your suggested fairness patch to the condition-variable version of the GIL (using this pseudocode you wrote as a guide): with gil.cond: if gil.n_waiting or gil.locked: gil.n_waiting += 1 while True: gil.cond.wait() #always wait at least once if not gil.locked: break gil.n_waiting -= 1 gil.locked = True I did some tests on this and it does appear to exhibit fairness. Here are the results of running the 'fair.py' test with a fair GIL on my Linux system: [ Fair GIL Linux ] Sequential execution slow: 6.246764 (0 left) fast: 0.465102 (0 left) Threaded execution slow: 7.534725 (0 left) fast: 7.674448 (0 left) Treaded, balanced execution: fast A: 10.415756 (0 left) fast B: 10.456502 (0 left) fast C: 10.520457 (0 left) Treaded, balanced execution, with quickstop: fast B: 8.423304 (0 left) fast A: 8.409794 (16016 left) fast C: 8.381977 (9162 left) beazley@ubuntu:~/Desktop/Python-2.6.4$ If I switch back to the unfair GIL, this is the result: [ Unfair GIL, original implementation, Linux] Sequential execution slow: 6.164739 (0 left) fast: 0.422626 (0 left) Threaded execution slow: 6.570084 (0 left) fast: 6.690927 (0 left) Treaded, balanced execution: fast A: 1.994143 (0 left) fast C: 2.014925 (0 left) fast B: 2.073212 (0 left) Treaded, balanced execution, with quickstop: fast A: 1.614533 (0 left) fast C: 1.607324 (377323 left) fast B: 1.625987 (111451 left) Probably the main thing to notice is the huge increase in performance over the fair GIL. For instance, the balance execution test runs about 5 times faster. Here are the two tests repeated with checkinterval = 1000. [ Fair GIL, checkinterval = 1000] Sequential execution slow: 6.175320 (0 left) fast: 0.424410 (0 left) Threaded execution slow: 6.505094 (0 left) fast: 6.746649 (0 left) Treaded, balanced execution: fast A: 2.243123 (0 left) fast B: 2.416043 (0 left) fast C: 2.442475 (0 left) Treaded, balanced execution, with quickstop: fast A: 1.565914 (0 left) fast C: 1.514024 (81254 left) fast B: 1.531937 (63740 left) [ Unfair GIL, checkinterval = 1000] Sequential execution slow: 6.258882 (0 left) fast: 0.411590 (0 left) Threaded execution slow: 6.255027 (0 left) fast: 0.409412 (0 left) Treaded, balanced execution: fast A: 1.291007 (0 left) fast C: 1.135373 (0 left) fast B: 1.437205 (0 left) Treaded, balanced execution, with quickstop: fast C: 1.331775 (0 left) fast A: 1.418670 (54841 left) fast B: 1.403853 (208732 left) Here, the unfair GIL is still quite a bit faster on raw performance. I tried kicking the check interval up to 10000 and the unfair GIL still won by a pretty significant margin on raw speed of completing the different tasks. I've attached a copy of the thread_pthread.h file I modified for this test. It's from Python-2.6.4.
msg103500 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-04-18 15:39
> Martin, I don't know if you were suggesting that a "fair" mutex would > make the emulated semaphore fair too. You probably weren't, but just > in case, the fairness of the mutex is immaterial because it is only > held for a short time to guard the internal state of the "semaphore". > You won't see threads queing up on it, but they will queue on the > Contition variable. Exactly so. I still don't see why you then infer that the GIL is unfair. It is not IF THE CONDITION VARIABLE IS FAIR. As I said, some implementations of condition variables are fair, e.g. the Linux one (which in itself isn't really relevant here, because Linux uses the semaphore GIL, anyway). However, it remains unclear why you think that the GIL is not fair in pthreads.
msg103525 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-18 20:40
Martin, I´ve explained it in my other dissue, issue 8411, with a step by step example. It is unfair because a thread can _bypass_ the condition variable. A thread just woken up from the condition variable has to race to get the lock, and it is a race that it will invariably loose if the other thread is doing a release/acquire (yielding the GIL as happens in ceval.py) The ConditionVariable can only endow the lock with its fairness property if all the threads play by the same rules. This was a design decision made by Tim (according to the comment) but a misguided one. It is a good stragegy for resources that are held for a short time to avoid lock convoying, but not appropriate in this case. You also don't have to take my word for it. Just try it out. Notice that 99% of all yields between threads fail, causing starvation of a thread which is the definition of "unfairness." Anyway, this is the last time I explain why the "emulated" semaphore is unfair. I think I´ve done so on at least four or five different occasions and it would be helpful if people would actually bother to read my comments.
msg103533 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-04-18 21:18
> Martin, I´ve explained it in my other dissue, issue 8411, with a step by step example. Hmm. Can't find it there. What message or file should I be looking at?
msg103551 - (view)	Author: David Beazley (dabeaz)	Date: 2010-04-19 01:59
Here are the results of running the fair.py test on a Mac OS-X system using a "fair" GIL implementation (modified condition variable): [ Fair GIL, Dual-Core, OS-X ] Sequential execution slow: 5.490943 (0 left) fast: 0.369257 (0 left) Threaded execution slow: 6.122093 (0 left) fast: 6.179179 (0 left) Treaded, balanced execution: fast C: 3.345452 (0 left) fast B: 3.389235 (0 left) fast A: 3.426407 (0 left) Treaded, balanced execution, with quickstop: fast C: 2.557972 (0 left) fast B: 2.558551 (35087 left) fast A: 2.558914 (13142 left) Here is the same test with the original GIL. [Unfair GIL, original implementation] Sequential execution slow: 5.444754 (0 left) fast: 0.361340 (0 left) Threaded execution slow: 5.542008 (0 left) fast: 5.225690 (0 left) Treaded, balanced execution: fast C: 1.381929 (0 left) fast B: 1.499969 (0 left) fast A: 1.549571 (0 left) Treaded, balanced execution, with quickstop: fast A: 1.284043 (0 left) fast B: 1.295507 (32490 left) fast C: 1.294981 (274777 left) Please observe that the performance of threads under the "fair" GIL are significantly worse than with the "unfair" GIL. Having studied this in more depth, I have to say that I would much rather have fast-running unfair threads than slow-running fair threads. Although I agree that there are other benefits to fairness, they just aren't enough to compensate for the huge performance hit.
msg103782 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-20 22:09
Sorry Martin, I meant issue 8410. I have so many of these going on :)
msg103785 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-20 22:33
It is interesting to see, David, the difference in the behaviour of the semaphore based and condition variable based lock on linux. It is clear that the semaphore and the condition varable have different queuing characteristics. I wouldn't be surprised if the semaphore were "fair" on average, but that it somehow colludes with the scheduler allow the current thread to cut in early based on, perhaps, timeslices or something. The condition variable seems to be much more FIFO. It is odd to me that you don't seem to be able to shake the performance hit even with a checkinterval of 10000. In the end, the performance hit is because of successful switches during the GIL yield in ceval.c. You ought to be able to raise the checkinterval up to the same level as the 'effective' one with the unfair condition variable approach and have the same sort of behaviour (inclusive IO latency). I'll do some modifcations locally to gather stats on successful yields vs unsuccessful ones. As for this being academic, yes it is. As I stated early on, I never intended this to be checked in as is. My aim was to take a step back from the radical GIL rewrite in 3.x. Provide a simple platform for GIL experiments in the know and trusted 2.x series. See if we could achieve the desired features without the added unknowns of the new GIL. And in fact, I also didn't want to expend this amount of ammunition on the "ROUNDROBIN_GIL" implementation. It was included merely as an example of one thing that was wrong with the old one and why it was so unpredictable. What I really wanted to look at was to combine the benefit of a large checkinterval (100000, say) with a priority based mechanism giving us a low IO latency. I've gotten so sidetracked with this whole fair/unfair business that I have neglected to provide useful performance benchmark of the PRIORITY_GIL implementation. The fact that I had to expend so much effort explaining exactly how the gil on mac (LEGACY_GIL) was not fair, also shows how tricky working with synchronization primitives and threads can be, even using such relatively modern primitives as mutexes and condition variables. There are many pitfalls for the unwary, and in my experience, only a lot of exposure will bend your brain into taking all those pesky race conditions into acccount. This is why I think it is important to _understand_ the problem completely before trying to fix it, and why fixing it stepwise, with a full understanding of each step along the way, is necessary for robust behaviour. As for missing the boat on 2.7, well, there is always 2.8 isn't there :)
msg103922 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-21 23:22
David, trying to get some more realistic IO benchmarks I did some more tests. The idea is to have a threaded socket server, serving requests that take different amounts of time to process, and see how io response measures up for two classes of requests being serviced simultaneously. Please see the evalsrv.rar for the client and server scripts. The client uses multiprocessing to distance itself from the GIL issue. The results on my dual core windows box are as follows (LEGACY_GIL is the mac, unfair GIL, ROUNDROBIN_GIL is the same with the fairness fix. "with affinity" means that the server process is restricted to running on one core. label time avg time std.dev serial ((30, 500), (2.7145072907310848, 0.09047581466359553, 0.0041867466462554535)) ((300, 10), (0.46542703053041656, 0.0015481250643121787, 0.0002282114778449236)) 3.36s (3.18s) (total time, sum of individual classes) simultaneous ((30, 500), (2.8820070707310563, 0.09605833283280416, 0.004430198440914231)) ((300, 10), (3.2082978423235358, 0.010690429928943495, 0.014415958519681225)) 3.21s (6.09s) (for each test, you get the indvidual timing for each request class, and then a sum of total time and sum of individual times.) Please don't read too much into small differences, this is a roughly one-off test here and likely contains noise. A few things become apparent: 1) with LEGACY_GIL, affinity appears not to matter. The 300 fast requests take longer to complete than the 30 slow requests if done in parallel, even though their serial execution time is roughly 1/5th. 2) With ROUNDROBIN_GIL, serial performance appears not to be affected, but simultaneous performance is much better: end-to-end time is the same, but the sum of individual classes is lower. That means that the client had to wait less for their IO results. 3) With ROUNDROBIN_GIL, if we put affinity on, we get the same kind of performance as with the LEGACY_GIL. The most important points here are the two last ones, I think. The fact that the sum of the individual request waits goes down is significant, and it is by no small amount that it drops. But equally perplexing is the fact that forcing the server to one cpu, removes the "fairness" again. It would appear that the behaviour of the synchronization object (an windows Semaphore in this case) changes depending on the number of cores, just as you had previously mentioned. This is, however, a windows only effect, I think. I must try to find out what is going on.
msg103923 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-04-21 23:25
Sorry, all the benchmarks were missing from my last comment. Here they are: total time avg time/request stddev LEGACY_GIL serial ((30, 500), (2.6908777512225717, 0.08968708313486773, 0.0015164644061278888)) ((300, 10), (0.4470272758129874, 0.0014862882945551276, 0.00024585959927491124)) 3.30s (3.14s) [end-to-end time, ( sum of total times) simultaneous ((30, 500), (2.876280574765787, 0.09586895587753937, 0.0011848842024515473)) ((300, 10), (3.2645357161315194, 0.010877886016239504, 0.027372366395914234)) 3.27s (6.14s) LEGACY_GIL, affinity on serial ((30, 500), (2.8003825905247735, 0.09333925587376796, 0.013360667457052446)) ((300, 10), (0.45636945477707364, 0.001517769716542185, 0.00020067298808990378)) 3.43s (3.26s) simultaneous ((30, 500), (2.8963266281049687, 0.09653735553913509, 0.0096323348697044)) ((300, 10), (3.212417639672081, 0.010703065845891957, 0.014433573531885001)) 3.21s (6.11s) ROUNDROBIN_GIL serial ((30, 500), (2.703059536896449, 0.09009326837163198, 0.0020036630273102198)) ((300, 10), (0.4539455433581643, 0.0015096094615377107, 0.00041274624690973134)) 3.32s (3.16s) simultaneous ((30, 500), (3.2613636649350686, 0.10870530565570016, 0.02613516235517462)) ((300, 10), (1.4737764855589188, 0.004908413639163637, 0.002818028070766402)) 3.26s (4.74s) ROUNDROBIN_GIL, affinity on serial ((30, 500), (2.7145072907310848, 0.09047581466359553, 0.0041867466462554535)) ((300, 10), (0.46542703053041656, 0.0015481250643121787, 0.0002282114778449236)) 3.36s (3.18s) simultaneous ((30, 500), (2.8820070707310563, 0.09605833283280416, 0.004430198440914231)) ((300, 10), (3.2082978423235358, 0.010690429928943495, 0.014415958519681225)) 3.21s (6.09s) ROUNDROBIN_GIL, USE_FIFO_COND serial ((30, 500), (2.7050386292112547, 0.09016071176643957, 0.0017078174777967025)) ((300, 10), (0.4478294727402505, 0.001488698051474884, 0.00012887034661806298)) 3.31s (3.15s) simultaneous ((30, 500), (3.262343957123042, 0.10873750248518549, 0.02640198429431565)) ((300, 10), (1.5242242379967286, 0.005076519036171731, 0.0033917786033315026)) 3.26s (4.79s)
msg113028 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-08-05 19:29
For better or worse, this did not make it into 2.7. Is it the sort of thing that could still go into a bug-fix release, without alpha/beta testing? Or should this be closes as out-of-date?
msg113039 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2010-08-05 20:22
That question should probably raised on python-dev, not the bug tracker.
msg113087 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2010-08-06 09:47
Although I did finally manage to explain the point of this patch (after a long, long discussion), I think the issue is still too controversial. We did, for example, see some strange behaviour in my last comment (Date: 2010-04-21 23:22) regarding affinity fixing of the process! What I hope comes out of this is that I think I have put my point across that with multithreading, a lock is not a lock. While a mutex may be indeed a mutex, its behaviour towards the threads that want to claim it can be different and can affect program behaviour and performance. This also goes for "emulated" or "constructed" entities, built out of something more primitive such as condition variables. Since 2.x is now frozen, and everyone seems happy (I think) with the more complicated 3.x method (although, being more complex, probably has more surprises in store), we should probably just let this fade away.
msg113111 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2010-08-06 16:50
OK. I would probably be better to expend energy on the 3.x new GIL, should issues arise.

History
Date	User	Action	Args
2022-04-11 14:56:59	admin	set	github: 52546
2010-08-06 16:50:13	terry.reedy	set	status: open -> closed keywords: patch, patch resolution: out of date messages: + msg113111
2010-08-06 09:47:42	kristjan.jonsson	set	keywords: patch, patch messages: + msg113087
2010-08-05 20:22:37	rhettinger	set	keywords: patch, patch nosy: + rhettinger messages: + msg113039
2010-08-05 19:29:00	terry.reedy	set	keywords: patch, patch nosy: + terry.reedy messages: + msg113028
2010-04-21 23:25:25	kristjan.jonsson	set	keywords: patch, patch messages: + msg103923
2010-04-21 23:22:04	kristjan.jonsson	set	keywords: patch, patch files: + evalsrv.rar messages: + msg103922
2010-04-20 22:33:23	kristjan.jonsson	set	keywords: patch, patch messages: + msg103785
2010-04-20 22:09:27	kristjan.jonsson	set	keywords: patch, patch messages: + msg103782
2010-04-19 01:59:04	dabeaz	set	messages: + msg103551
2010-04-18 21:18:38	loewis	set	messages: + msg103533
2010-04-18 20:40:44	kristjan.jonsson	set	keywords: patch, patch messages: + msg103525
2010-04-18 15:39:49	loewis	set	messages: + msg103500
2010-04-17 12:21:47	dabeaz	set	files: + thread_pthread.h messages: + msg103399
2010-04-17 11:16:53	dabeaz	set	messages: + msg103397
2010-04-17 10:53:15	kristjan.jonsson	set	keywords: patch, patch messages: + msg103396
2010-04-17 03:22:19	dabeaz	set	messages: + msg103385
2010-04-17 03:18:09	dabeaz	set	messages: + msg103384
2010-04-16 19:57:26	kristjan.jonsson	set	keywords: patch, patch files: + fair.py messages: + msg103355
2010-04-16 11:21:27	dabeaz	set	files: + fair.py messages: + msg103311
2010-04-16 10:46:42	dabeaz	set	messages: + msg103310
2010-04-15 21:09:59	pitrou	set	messages: + msg103258
2010-04-15 17:32:38	kristjan.jonsson	set	keywords: patch, patch messages: + msg103239
2010-04-15 17:23:14	kristjan.jonsson	set	keywords: patch, patch messages: + msg103238
2010-04-15 17:09:56	dabeaz	set	messages: + msg103236
2010-04-15 17:05:17	r.david.murray	set	keywords: patch, patch messages: + msg103235
2010-04-15 17:04:04	r.david.murray	set	keywords: patch, patch nosy: + r.david.murray messages: + msg103234
2010-04-15 17:01:12	pitrou	set	messages: + msg103233
2010-04-15 16:54:49	kristjan.jonsson	set	keywords: patch, patch messages: + msg103232
2010-04-15 16:48:38	pitrou	set	messages: + msg103230
2010-04-15 16:34:43	kristjan.jonsson	set	keywords: patch, patch messages: + msg103228
2010-04-15 15:42:36	pitrou	set	messages: + msg103225
2010-04-15 15:39:51	kristjan.jonsson	set	keywords: patch, patch messages: + msg103224
2010-04-15 14:28:03	pitrou	set	messages: + msg103219
2010-04-15 14:09:20	kristjan.jonsson	set	keywords: patch, patch messages: + msg103215
2010-04-13 23:41:46	dabeaz	set	messages: + msg103098
2010-04-13 23:22:58	loewis	set	messages: + msg103093
2010-04-13 23:01:22	pitrou	set	messages: + msg103088
2010-04-13 22:48:41	kristjan.jonsson	set	keywords: patch, patch messages: + msg103087
2010-04-11 15:20:20	dabeaz	set	messages: + msg102845
2010-04-11 14:59:05	pitrou	set	messages: + msg102840
2010-04-11 14:46:26	kristjan.jonsson	set	keywords: patch, patch messages: + msg102838
2010-04-11 12:38:52	pitrou	set	messages: + msg102831
2010-04-11 12:33:49	dabeaz	set	messages: + msg102830
2010-04-11 12:22:04	dabeaz	set	messages: + msg102828
2010-04-11 12:21:16	kristjan.jonsson	set	keywords: patch, patch messages: + msg102827
2010-04-11 11:09:04	kristjan.jonsson	set	keywords: patch, patch files: + ccbench.patch messages: + msg102823
2010-04-09 23:47:59	techtonik	set	nosy: + techtonik messages: + msg102759
2010-04-09 23:34:14	kristjan.jonsson	set	keywords: patch, patch messages: + msg102758
2010-04-07 00:11:56	pitrou	set	messages: + msg102508
2010-04-07 00:08:19	dabeaz	set	messages: + msg102507
2010-04-06 23:53:09	kristjan.jonsson	set	keywords: patch, patch messages: + msg102506
2010-04-06 22:35:15	pitrou	set	messages: + msg102502
2010-04-06 22:22:16	kristjan.jonsson	set	keywords: patch, patch messages: + msg102500
2010-04-05 21:55:10	flox	set	keywords: patch, patch nosy: + flox
2010-04-05 21:33:09	kristjan.jonsson	set	keywords: patch, patch files: + gil2.patch messages: + msg102416
2010-04-03 20:11:02	dabeaz	set	messages: + msg102291
2010-04-03 20:01:43	torsten	set	nosy: + torsten messages: + msg102290
2010-04-03 13:04:46	pitrou	set	messages: + msg102256
2010-04-03 12:55:02	dabeaz	set	messages: + msg102255
2010-04-03 12:45:36	dabeaz	set	messages: + msg102254
2010-04-03 12:37:34	kristjan.jonsson	set	keywords: patch, patch messages: + msg102252
2010-04-03 12:33:43	kristjan.jonsson	set	keywords: patch, patch messages: + msg102250
2010-04-03 12:32:22	dabeaz	set	nosy: + dabeaz messages: + msg102248
2010-04-03 10:58:09	loewis	set	keywords: patch, patch nosy: + loewis messages: + msg102235
2010-04-03 10:27:23	kristjan.jonsson	create