Background
----------

In order to multitask with threads, a critical part of the Python
interpreter implementation concerns the behavior of I/O operations such as
read, write, send, and receive. Specifically, whenever an I/O operation is
carried out, the interpreter releases the GIL so that other threads can run
while the program waits for the I/O operation to complete.

Observed Behavior of I/O Operations
-----------------------------------

The release of the GIL for I/O is a critical aspect of making sure the
interpreter doesn't make all threads hang while waiting. However, the
release of the GIL also assumes a worst-case scenario. In practice, a large
number of I/O requests actually complete immediately with no blocking at
all. For example, if a program is sending on a socket, a send() operation
will typically complete immediately if buffer space is available to accept
the data. Likewise, read() and recv() operations may return immediately if
data is already available in the operating system.

For system calls that complete immediately, a thread quickly releases and
then almost immediately reacquires the GIL to continue running. However,
given that the I/O operation didn't block, the release of the GIL was
technically unnecessary.

Behavior of the new GIL
-----------------------

A feature of the new Python GIL implementation is that the interpreter no
longer periodically signals waiting threads (e.g., the check interval).
Instead, thread switching is based on a timed wait on a condition variable.
Should a timeout occur, a thread will indicate that it wants the GIL and
the currently running thread will be forced to give it up.

Although this scheme solves the problem of CPU-bound threads thrashing, it
introduces a highly pronounced "convoy effect" when CPU-bound threads and
I/O-bound threads get mixed together. A major part of the problem is caused
by the behavior of I/O described above.
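The claim that many I/O calls complete without blocking is easy to check
directly. Here is a small sketch (not part of the original report) that
uses a connected socketpair to show that recv() returns essentially
instantly when data is already sitting in the receive buffer:

```python
# Sketch: recv() on a socket whose buffer already holds data returns
# immediately, so the GIL release/reacquire around it is pure overhead.
import socket
import time

a, b = socket.socketpair()      # connected pair of local sockets
b.sendall(b"hello")             # data now sits in a's receive buffer
time.sleep(0.1)                 # give the kernel time to deliver it

start = time.perf_counter()
data = a.recv(8192)             # completes without blocking
elapsed = time.perf_counter() - start

print(data, "%0.6f seconds" % elapsed)
a.close()
b.close()
```

On a typical machine the measured time is on the order of microseconds,
which is why the GIL handoff around such calls dominates the cost.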
Specifically, when an I/O-bound thread executes an I/O call, it always
releases the GIL. Since the GIL is released, a CPU-bound thread is now free
to acquire the GIL and run. However, if the I/O call completes immediately
(which is common), the I/O-bound thread immediately stalls upon return from
the system call. To get the GIL back, it now has to go through the timeout
process to force the CPU-bound thread to release the GIL again.

It should be noted that the behavior described here also occurs in Python
2, but due to the small check interval, an I/O-bound thread that wants the
GIL back will tend to get it without having to wait very long.

Example
-------

Here is a very simple example that illustrates the problem. In this
example, there is one CPU-bound thread that hammers the CPU and one
I/O-bound thread that handles some network communication (an echo server):

    # iotest.py
    import threading
    from socket import *

    # CPU-bound thread (just hammers the CPU)
    def spin():
        while True:
            pass

    # I/O-bound thread (an echo TCP server)
    def echo_server():
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("", 15000))
        s.listen(1)
        while True:
            c, a = s.accept()
            while True:
                data = c.recv(8192)
                if not data:
                    break
                c.sendall(data)
            c.close()
        s.close()

    # Launch the CPU-bound thread
    t1 = threading.Thread(target=spin)
    t1.daemon = True
    t1.start()

    # Run the I/O server
    echo_server()

Here is a benchmark program that runs as a client for the echo_server()
thread defined above. It sends a sequence of messages and reads the
response back. It then reports some timings at the end.
    # echoclient.py
    from socket import *
    import time

    CHUNKSIZE = 16384
    NUMMESSAGES = 640        # Total of 10MB

    # Dummy message
    msg = b"x" * CHUNKSIZE

    # Connect and send messages
    s = socket(AF_INET, SOCK_STREAM)
    s.connect(("", 15000))
    start = time.time()
    for n in range(NUMMESSAGES):
        s.sendall(msg)
        bytes_recv = len(msg)
        # Get the response back
        while bytes_recv > 0:
            data = s.recv(bytes_recv)
            bytes_recv -= len(data)
    s.close()
    end = time.time()
    print("%0.3f seconds (%0.3f bytes/sec)" % (end - start,
          (CHUNKSIZE * NUMMESSAGES) / (end - start)))

Performance Results
-------------------

These results are from running the above programs on a dual-core MacBook
running OS X Snow Leopard. I also get similar behavior on a quad-core
desktop machine.

If you run the iotest.py program using Python 2.6.4 and execute the
client, you get this result:

    bash % python echoclient.py
    1.064 seconds (9854148.739 bytes/sec)

If you switch iotest.py to Python 3.2 and rerun, you get this result:

    bash % python echoclient.py
    12.340 seconds (849726.150 bytes/sec)

Notice that there is a factor-of-12 performance difference.

Now, modify the iotest.py program so that there are two CPU-bound threads
spinning, by adding this extra code:

    t2 = threading.Thread(target=spin)
    t2.daemon = True
    t2.start()

Repeat the above tests. For Python 2.6.4, you get this:

    bash-3.2$ python echoclient.py
    0.358 seconds (29319821.410 bytes/sec)

(Yes, the performance actually improves! Figuring out why is left as an
exercise for the reader.)

Now, switch the iotest.py server to Python 3.2 and retry:

    bash-3.2$ python echoclient.py
    59.545 seconds (176098.609 bytes/sec)

Notice how the addition of one CPU-bound thread made the time go up by
more than a factor of 4! Now, disable all but one of the CPU cores and try
the test again in Python 3.2:

    bash-3.2$ python echoclient.py
    0.120 seconds (87246036.201 bytes/sec)

Here, you see that it runs about 500 times faster than with two cores
(yikes!).

What's causing this behavior?
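Disabling cores was done through system settings in the test above; on
Linux the same single-core condition can be approximated by pinning the
process. A minimal sketch, assuming a Linux host (os.sched_setaffinity is
not available on macOS or Windows):

```python
# Sketch: restrict the current process to a single core so the CPU-bound
# and I/O-bound threads can never run in true parallel (Linux only).
import os

if hasattr(os, "sched_setaffinity"):    # guard for non-Linux platforms
    os.sched_setaffinity(0, {0})        # pid 0 means the calling process
    print(os.sched_getaffinity(0))      # the allowed core set is now {0}
```

Running iotest.py under such a restriction reproduces the one-core
condition without touching hardware settings; `taskset -c 0 python
iotest.py` achieves the same thing from the shell.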
-----------------------------

In the iotest.py program, there is an inner loop that looks like this:

    while True:
        data = c.recv(8192)
        if not data:
            break
        c.sendall(data)

The I/O operations recv() and sendall() always release the GIL when they
execute. However, when this happens, CPU-bound threads jump in and start
running again. The problem gets worse as the number of CPU-bound threads
increases--the CPU-bound threads might cycle round-robin before the
I/O-bound thread runs again. The problem is more pronounced on multiple
CPU cores because when the GIL is released, one of the cores will
typically go handle the system call while the other core wakes up the
waiting CPU-bound thread (which then almost immediately steals the GIL).

Is it worth fixing?
-------------------

I claim yes. There are many applications, such as those carried out with
the multiprocessing library, that will operate by trying to overlap
computation and I/O in some manner (for example, receiving the next chunk
of data to work on while carrying out calculations on the currently
received data).

In heavily loaded I/O-bound applications such as servers with hundreds of
simultaneously connected clients, the release of the GIL on short I/O
operations may cause a lot of unintended thrashing as threads cycle
amongst themselves. This would most likely manifest itself as an increased
turnaround time for requests.

How to fix?
-----------

Modify all I/O operations in the interpreter to not release the GIL if
they won't block. Either that, or maybe there's some sort of really sneaky
easy solution (unknown).

The effect can be minimized by setting the switch interval to a really
small value using sys.setswitchinterval(). However, doing this greatly
increases the amount of thread context switching--something that's also
undesirable.
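For reference, the sys.setswitchinterval() workaround looks like this in
practice (the 0.5 ms value is purely illustrative, not a recommendation
from this report):

```python
# Sketch: shrink the GIL switch interval so a waiting I/O-bound thread
# can force the GIL away from a CPU-bound thread sooner, at the cost of
# far more frequent context switches.
import sys

print(sys.getswitchinterval())      # CPython default: 0.005 (5 ms)
sys.setswitchinterval(0.0005)       # time out after 0.5 ms instead
print(sys.getswitchinterval())
```

This shortens the timeout an I/O-bound thread must wait through before it
can reclaim the GIL, which narrows the gap in the benchmark above but does
not eliminate the underlying convoy effect.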