Message 99438 - Python tracker

Author	dabeaz
Recipients	dabeaz
Date	2010-02-16.20:48:41
Message-id	<1266353325.38.0.278549753357.issue7946@psf.upfronthosting.co.za>
In-reply-to

Content

Background
-----------
In order to multitask with threads, a critical part of the Python
interpreter implementation concerns the behavior of I/O operations
such as read, write, send, and receive.  Specifically, whenever an I/O
operation is carried out, the interpreter releases the GIL so that
other threads can run while the program waits for the I/O operation to
complete.

Observed Behavior of I/O Operations
------------------------------------
The release of the GIL for I/O is a critical aspect of making sure the
interpreter doesn't make all threads hang while waiting.  However, the
release of the GIL also assumes a worst-case scenario.  In practice,
a large number of I/O requests actually complete immediately with no
actual blocking.  For example, if a program is sending on a socket,
send() operations will typically complete immediately if buffer space
is available to accept the data.  Likewise, read() and recv()
operations may return immediately if data is already available in the
operating system.

For system calls that complete immediately, a thread quickly releases
and then almost immediately reacquires the GIL to continue running.
However, given that the I/O operation didn't block, the release of the
GIL was technically unnecessary in this case.
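
The claim that many I/O calls finish without blocking can be checked
directly.  The sketch below is only an illustration: it assumes a local
socket pair made with socket.socketpair() (not one of the programs in
this report) and a hypothetical helper named immediate_io_demo().  Both
ends are put in non-blocking mode, so a call that would have had to wait
raises BlockingIOError instead of sleeping.

```python
import select
import socket


def immediate_io_demo():
    """Show a send() and a recv() that complete without waiting."""
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        # Buffer space is available, so send() accepts the data at once.
        sent = a.send(b"hello")

        # Wait (up to 1 second) until the data is visible on the other end.
        readable, _, _ = select.select([b], [], [], 1.0)
        b.setblocking(False)
        # The data is already buffered by the OS, so recv() returns at once.
        data = b.recv(8192) if readable else b""
        return sent, data
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(immediate_io_demo())
```

Neither call raises BlockingIOError here, which is the situation in
which releasing the GIL around the call buys nothing.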

Behavior of the new GIL
-----------------------
A feature of the new Python GIL implementation is that the interpreter
no longer periodically signals waiting threads (the old check-interval
mechanism).  Instead, thread switching is based on a timed wait on a
condition variable.  Should a timeout occur, the waiting thread will
indicate that it wants the GIL and the currently running thread will be
forced to give it up.

Although this scheme solves the problem of CPU-bound threads
thrashing, it introduces a highly pronounced "convoy effect" when
CPU-bound threads and I/O-bound threads get mixed together.  A major
part of the problem is caused by the behavior of I/O as described
above.  Specifically, when an I/O-bound thread executes an I/O call,
it always releases the GIL.  Since the GIL is released, a CPU-bound
thread is now free to acquire the GIL and run.  However, if the I/O
call completes immediately (which is common), the I/O-bound thread
immediately stalls upon return from the system call.  To get the GIL
back, it now has to go through the timeout process to force the
CPU-bound thread to release the GIL again.
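
The stall described above can be observed in isolation.  The following
is a minimal sketch, assuming a GIL-based CPython 3.2 or later; the
helper name average_send_time() and the use of socket.socketpair() are
illustrative choices, not part of the test programs below.  It times
one-byte send() calls, which never really block, first alone and then
with one CPU-bound thread running.

```python
import socket
import threading
import time


def average_send_time(n_calls=20, with_spinner=False):
    """Return the average seconds per send() on a socket with free buffer space."""
    stop = [False]

    def spin():
        # CPU-bound loop that competes for the GIL until told to stop.
        while not stop[0]:
            pass

    spinner = None
    if with_spinner:
        spinner = threading.Thread(target=spin)
        spinner.daemon = True
        spinner.start()

    a, b = socket.socketpair()
    try:
        start = time.perf_counter()
        for _ in range(n_calls):
            a.send(b"x")  # releases the GIL even though it returns at once
        elapsed = time.perf_counter() - start
    finally:
        stop[0] = True
        if spinner is not None:
            spinner.join()
        a.close()
        b.close()
    return elapsed / n_calls


if __name__ == "__main__":
    print("no CPU-bound thread : %.6f s per send" % average_send_time())
    print("one CPU-bound thread: %.6f s per send"
          % average_send_time(with_spinner=True))
```

If the explanation above is right, the second figure should be on the
order of the switch interval rather than the few microseconds a
non-blocking send() normally takes.  Exact numbers will vary by machine
and Python version.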

It should be noted that the behavior described also occurs in Python
2, but due to the small check interval, an I/O bound thread that wants
the GIL back will tend to get it without having to wait very long.

Example
-------
Here is a very simple example that illustrates the problem.  In this
example, there is one CPU-bound thread that hammers the CPU and there
is an I/O bound thread that handles some network communication (an
echo server):

# iotest.py
import time
import threading
from socket import *

# CPU-bound thread (just hammers the CPU)
def spin():
    while True:
        pass

# I/O-bound thread (an echo TCP server)
def echo_server():
    s = socket(AF_INET, SOCK_STREAM)
    s.setsockopt(SOL_SOCKET, SO_REUSEADDR,1)
    s.bind(("",15000))
    s.listen(1)
    while True:
        c,a = s.accept()
        while True:
            data = c.recv(8192)
            if not data:
                break
            c.sendall(data)
        c.close()
    s.close()

# Launch the CPU-bound thread
t1 = threading.Thread(target=spin)
t1.daemon=True
t1.start()

# Run the I/O server
echo_server()

Here is a benchmark program that runs as a client for the echo_server()
thread defined above.  It sends a sequence of messages and reads the
response back.  It then reports some timings at the end.

# echoclient.py
from socket import *
import time

CHUNKSIZE = 16384
NUMMESSAGES = 640     # Total of 10MB

# Dummy message
msg = b"x"*CHUNKSIZE

# Connect and send messages
s = socket(AF_INET,SOCK_STREAM)
s.connect(("",15000))
start = time.time()
for n in range(NUMMESSAGES):
    s.sendall(msg)
    bytes_recv = len(msg)
    # Get the response back
    while bytes_recv > 0:
        data = s.recv(bytes_recv)
        bytes_recv -= len(data)
s.close()
end = time.time()
print("%0.3f seconds (%0.3f bytes/sec)" % (end-start, (CHUNKSIZE*NUMMESSAGES)/(end-start)))

Performance Results
-------------------
These results are from running the above programs on a dual-core
MacBook running OS-X Snow Leopard.  I also get similar behavior on a
quad-core desktop machine.

If you run the iotest.py program using Python 2.6.4 and execute
the client, you get this result:

   bash % python echoclient.py
   1.064 seconds (9854148.739 bytes/sec)

If you switch the iotest.py to Python 3.2 and rerun, you get this
result:

   bash % python echoclient.py
   12.340 seconds (849726.150 bytes/sec)

Notice that there is roughly a factor of 12 performance difference.

Modify the iotest.py program so that there are 2 CPU-bound
threads spinning.  Just add this extra code:

    t2 = threading.Thread(target=spin)
    t2.daemon = True
    t2.start()

Now, repeat the above tests.    For Python 2.6.4, you get this:

    bash-3.2$ python echoclient.py
    0.358 seconds (29319821.410 bytes/sec)

(Yes, the performance actually improves!  Figuring out why is left as
an exercise for the reader.)

Now, switch the iotest.py server to Python 3.2 and retry:

    bash-3.2$ python echoclient.py
    59.545 seconds (176098.609 bytes/sec)

Notice how the addition of one CPU-bound thread made the time go up by
more than a factor of 4!

Now, disable all but one of the CPU cores and try the test again in
Python 3.2:

    bash-3.2$ python echoclient.py
    0.120 seconds (87246036.201 bytes/sec)

Here, you see that it runs about 500 times faster than the same test
(two CPU-bound threads) did with two cores (yikes!)
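
The single-core result above was obtained by disabling cores at the
system level.  Where that is not convenient, a roughly comparable
experiment can be set up by pinning the server process to one core.
This is a different technique from the one used for the measurement
above, and it rests on an assumption: os.sched_setaffinity() exists
only on some platforms (for example Linux, with Python 3.3 or later).
The helper name pin_to_single_core() is hypothetical.

```python
import os


def pin_to_single_core():
    """Restrict this process to one CPU core.

    Returns the new set of allowed cores, or None if the platform
    does not provide os.sched_setaffinity().
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available here (e.g. macOS, Windows)
    allowed = os.sched_getaffinity(0)        # cores currently allowed
    os.sched_setaffinity(0, {min(allowed)})  # keep the lowest-numbered one
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print(pin_to_single_core())
```

Calling this at the top of iotest.py, before any threads are started,
would keep all of the server's threads on the same core.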

What's causing this behavior?
-----------------------------
In the iotest.py program, there is an inner loop that
looks like this:

        while True:
            data = c.recv(8192)
            if not data:
                break
            c.sendall(data)

The I/O operations recv() and sendall() always release the GIL when
they execute.  Whenever that happens, CPU-bound threads jump in
and start running again.  The problem gets worse as the number of
CPU-bound threads increases: the CPU-bound threads might cycle
round-robin before the I/O-bound thread runs again.  The problem is more
pronounced on multiple CPU cores because when the GIL is released, one
of the cores will typically go handle the system call while the other
core wakes up the waiting CPU-bound thread (which then almost
immediately steals the GIL).
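
A back-of-the-envelope estimate shows that this explanation is at least
consistent with the one-CPU-bound-thread timing.  The sketch below makes
two loud assumptions that are not measured anywhere in this report:
every recv() on the server returns a full 8192 bytes, and every recv()
or sendall() costs one full switch interval (5 milliseconds by default)
before the I/O-bound thread gets the GIL back.

```python
CHUNKSIZE = 16384        # bytes per client message (from echoclient.py)
NUMMESSAGES = 640        # messages sent by the client
RECV_SIZE = 8192         # server recv() buffer size (from iotest.py)
SWITCH_INTERVAL = 0.005  # assumed: default switch interval in seconds


def estimated_stall_seconds():
    """Return (number of server I/O calls, estimated seconds spent stalled)."""
    total_bytes = CHUNKSIZE * NUMMESSAGES
    # Assumption: every recv() returns a full RECV_SIZE bytes.
    recv_calls = total_bytes // RECV_SIZE
    sendall_calls = recv_calls  # the loop does one sendall() per recv()
    io_calls = recv_calls + sendall_calls
    # Assumption: each call costs one full switch interval to get the GIL back.
    return io_calls, io_calls * SWITCH_INTERVAL


if __name__ == "__main__":
    calls, seconds = estimated_stall_seconds()
    print("%d I/O calls, about %.1f seconds of stalls" % (calls, seconds))
```

Under these assumptions the estimate comes to 2560 calls and about 12.8
seconds, which is in the same range as the 12.340 seconds measured with
one CPU-bound thread.  The model is too simple to account for the
two-thread result.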

Is it worth fixing?
-------------------
I claim yes.  There are many applications, such as those carried
out with the multiprocessing library, that will operate by trying
to overlap computation and I/O in some manner (for example, receiving
the next chunk of data to work on while carrying out calculations
on the currently received data).

In heavily loaded I/O-bound applications such as servers with
hundreds of simultaneously connected clients, the release of the GIL
on short I/O operations may cause a lot of unintended thrashing
as threads cycle amongst themselves.  This would most likely
manifest itself as an increased turnaround time for requests.

How to fix?
-----------
Modify all I/O operations in the interpreter to not release the
GIL if they won't block.  Either that or maybe there's some sort of
really sneaky easy solution (unknown).
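
The kind of "would this block?" test such a fix needs can be sketched
with select().  This is only an illustration of the check, written at
the Python level with a hypothetical helper named recv_would_block().
The real change would have to live in the interpreter's C code, and
calling select() from Python is itself an I/O call, so this snippet
does nothing to avoid a GIL release.

```python
import select
import socket


def recv_would_block(sock):
    """Return True if a recv() on sock would have to wait for data."""
    # A timeout of 0 makes select() poll and return immediately.
    readable, _, _ = select.select([sock], [], [], 0)
    return not readable


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(recv_would_block(b))       # nothing has been sent yet
    a.send(b"ping")
    select.select([b], [], [], 1.0)  # give the data time to arrive
    print(recv_would_block(b))       # data is now buffered
    a.close()
    b.close()
```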

The effect can be minimized by setting the switch interval to a really
small value using sys.setswitchinterval().   However, doing this
greatly increases the amount of thread context-switching--something
that's also undesirable.
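
For experimenting with this workaround, here is a small sketch that
changes the switch interval temporarily and restores the old value
afterwards.  The context manager name switch_interval() is hypothetical
and the value 0.0001 is an arbitrary choice for illustration, not a
recommendation.

```python
import sys
from contextlib import contextmanager


@contextmanager
def switch_interval(seconds):
    """Temporarily use a different thread switch interval."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield
    finally:
        # Always restore the previous value, even if the body raises.
        sys.setswitchinterval(previous)


if __name__ == "__main__":
    with switch_interval(0.0001):
        print("inside :", sys.getswitchinterval())
    print("outside:", sys.getswitchinterval())
```

Wrapping only the I/O-heavy part of a program this way limits the extra
context switching to the code that benefits from it.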

History
Date	User	Action	Args
2010-02-16 20:48:47	dabeaz	set	recipients: + dabeaz
2010-02-16 20:48:45	dabeaz	set	messageid: <1266353325.38.0.278549753357.issue7946@psf.upfronthosting.co.za>
2010-02-16 20:48:43	dabeaz	link	issue7946 messages
2010-02-16 20:48:43	dabeaz	create