classification
Title: Convoy effect with I/O bound threads and New GIL
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: pitrou Nosy List: DazWorrall, Dima.Tisnek, Michele, aconrad, alex, andrix, brian.curtin, carljm, coderanger, cool-RR, dabeaz, djc, donaldjeo, durin42, eric.araujo, eric.smith, flox, gregory.p.smith, haypo, hozn, jcea, jhylton, jmehnle, karld, kevinwatters, konryd, larry, loewis, mahmoudimus, movement, ncoghlan, neologix, nirai, phsilva, pitrou, portante, rcohen, rh0dium, scoder, tarek, thouis, victorpoluceno, ysj.ray
Priority: low Keywords: patch

Created on 2010-02-16 20:48 by dabeaz, last changed 2014-09-03 06:45 by scoder.

Files
File name Uploaded Description Edit
udp-ioclient.py pitrou, 2010-02-22 19:17
udp-iotest.py pitrou, 2010-02-22 19:17
gilprio2.patch pitrou, 2010-02-22 21:41
prioritygil.tar.gz dabeaz, 2010-02-22 21:52 Priority GIL implementation from PyCON
linux-7946.patch nirai, 2010-03-16 22:36 Linux (libc) patch for py32 review
gilinter2.patch pitrou, 2010-03-27 13:26
writes.py nirai, 2010-04-18 00:23
schedtest.py dabeaz, 2010-04-26 13:31 Scheduling test
dabeaz_gil.patch dabeaz, 2010-04-27 14:32 Dave's GIL patch review
nir-ccbench-xp32.log nirai, 2010-05-03 22:34
nir-ccbench-linux.log nirai, 2010-05-16 19:11
ccbench-osx.log Michele, 2010-05-19 02:29
bfs.patch nirai, 2010-05-30 20:50 Updated implementation of the BFS scheduler.
Messages (83)
msg99438 - (view) Author: David Beazley (dabeaz) Date: 2010-02-16 20:48
Background
-----------
In order to multitask with threads, a critical part of the Python
interpreter implementation concerns the behavior of I/O operations
such as read, write, send, and receive.  Specifically, whenever an I/O
operation is carried out, the interpreter releases the GIL so that
other threads can run while the program waits for the I/O operation to
complete.

Observed Behavior of I/O Operations
------------------------------------
The release of the GIL for I/O is a critical aspect of making sure the
interpreter doesn't make all threads hang while waiting.  However, the
release of the GIL also assumes a worst-case scenario.  In practice,
a large number of I/O requests actually complete immediately with no
actual blocking.  For example, if a program is sending on a socket,
send() operations will typically complete immediately if buffer space
is available to accept the data.  Likewise, read() and recv()
operations may return immediately if data is already available in the
operating system.

For system calls that complete immediately, a thread quickly releases
and then almost immediately reacquires the GIL to continue running.
However, given that the I/O operation didn't block, the release of the
GIL was technically unnecessary in this case.

Behavior of the new GIL
-----------------------
A feature of the new Python GIL implementation is that the interpreter
no longer periodically signals waiting threads (e.g., the check
interval).  Instead, thread switching is based on a timed wait on a
condition variable. Should a timeout occur, a thread will indicate
that it wants the GIL and the currently running thread will be forced
to give it up.

Although this scheme solves the problem of CPU-bound threads
thrashing, it introduces a highly pronounced "convoy effect" when
CPU-bound threads and I/O bound threads get mixed together.  A major
part of the problem is caused by the bahvior of I/O as described
above.  Specifically, when an I/O bound thread executes an I/O call,
it always releases the GIL.  Since the GIL is released, a CPU bound
thread is now free to acquire the GIL and run.  However, if the I/O
call completes immediately (which is common), the I/O bound thread
immediately stalls upon return from the system call.  To get the GIL
back, it now has to go through the timeout process to force the
CPU-bound thread to release the GIL again.

It should be noted that the behavior described also occurs in Python
2, but due to the small check interval, an I/O bound thread that wants
the GIL back will tend to get it without having to wait very long.

Example
-------
Here is a very simple example that illustrates the problem.  In this
example, there is one CPU-bound thread that hammers the CPU and there
is an I/O bound thread that handles some network communication (an
echo server):

# iotest.py
import time
import threading
from socket import *

# CPU-bound thread (just hammers the CPU)
def spin():
    while True:
        pass

# I/O-bound thread (an echo TCP server)
def echo_server():
    s = socket(AF_INET, SOCK_STREAM)
    s.setsockopt(SOL_SOCKET, SO_REUSEADDR,1)
    s.bind(("",15000))
    s.listen(1)
    while True:
        c,a = s.accept()
        while True:
            data = c.recv(8192)
            if not data:
                break
            c.sendall(data)
        c.close()
    s.close()

# Launch the CPU-bound thread
t1 = threading.Thread(target=spin)
t1.daemon=True
t1.start()

# Run the I/O server
echo_server()

Here is a benchmark program that runs as a client for the echo_server()
thread defined above.  It sends a sequence of messages and reads the
response back.  It then reports some timings at the end.

# echoclient.py
from socket import *
import time

CHUNKSIZE = 16384
NUMMESSAGES = 640     # Total of 10MB

# Dummy message
msg = b"x"*CHUNKSIZE

# Connect and send messages
s = socket(AF_INET,SOCK_STREAM)
s.connect(("",15000))
start = time.time()
for n in range(NUMMESSAGES):
    s.sendall(msg)
    bytes_recv = len(msg)
    # Get the response back
    while bytes_recv > 0:
        data = s.recv(bytes_recv)
        bytes_recv -= len(data)
s.close()
end = time.time()
print("%0.3f seconds (%0.3f bytes/sec)" % (end-start, (CHUNKSIZE*NUMMESSAGES)/(end-start)))

Performance Results
-------------------
These results are from running the above programs on a dual-core
MacBook running OS-X Snow Leopard.  I also get similar behavior on a
quad-core desktop machine.

If you run the iotest.py program using Python 2.6.4 and execute
the client, you get this result:

   bash % python echoclient.py
   1.064 seconds (9854148.739 bytes/sec)

If you switch the iotest.py to Python 3.2 and rerun, you get this
result:

   bash % python echoclient.py
   12.340 seconds (849726.150 bytes/sec)

Notice that there is a factor 12 performance difference.

Modify the iotest.py program so that there are 2 CPU-bound
threads spinning.  Just add this extra code:

    t2 = threading.Thread(target=spin)
    t2.daemon
    t2.start()

Now, repeat the above tests.    For Python 2.6.4, you get this:

    bash-3.2$ python echoclient.py
    0.358 seconds (29319821.410 bytes/sec)

(Yes the performance actually improves!  That's left as an exercise
for the reader to figure out why)

Now, switch the iotest.py server to Python 3.2 and retry:

    base-3 $ python echoclient.py
    59.545 seconds (176098.609 bytes/sec)    

Notice how the addition of one CPU-bound thread made the time go up by
more than a factor 4!

Now, disable all but one of the CPU cores and try the test again in
Python 3.2:

    bash-3.2$ python echoclient.py
    0.120 seconds (87246036.201 bytes/sec)

Here, you see that it runs about 500 times faster than with two cores
(yikes!)

What's causing this behavior?
-----------------------------
In the iotest.py program, there is an inner loop that
looks like this:

        while True:
            data = c.recv(8192)
            if not data:
                break
            c.sendall(data)

The I/O operations recv() and sendall() always release the GIL when
they execute.  However, when this happens, CPU bound threads jump in
and start running again. The problem gets worse as the number of
CPU-bound threads increases--CPU bound threads might cycle round-robin
before the I/O bound thread runs again.  The problem is more
pronounced on multiple CPU cores because when the GIL is released, one
of the cores will typically go handle the system call while the other
core wakes up the waiting CPU-bound thread (which then almost
immediately steals the GIL).

Is it worth fixing?
-------------------
I claim yes.  There are many applications, such as those carried
out with the multiprocessing library, that will operate by trying
to overlap computation and I/O in some manner (for example, receiving
the next chunk of data to work on while carrying out calculations
on the currently received data).

In heavily loaded I/O bound applications such as servers with 
hundreds of simultaneously connected clients, the release of the GIL
on short I/O operations may cause a lot of unintended thrashing
as threads cycle amongst themselves.    This would most likely
manifest itself as an increased turnaround time for requests.

How to fix?
-----------
Modify all I/O operations in the interpreter to not release the
GIL if they won't block.  Either that or maybe there's some sort of
really sneaky easy solution (unknown).

The effect can be minimized by setting the switch interval to a really
small value using sys.setswitchinterval().   However, doing this
greatly increases the amount of thread context-switching--something
that's also undesirable.
msg99439 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2010-02-16 20:57
Would the idea of priority-GIL requests that Antoine had in his original patch solve this issue?
msg99459 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-17 03:00
Just a quick test under Linux (on a dual quad core machine):
- with iotest.py and echo_client.py both running Python 2.7: 25.562 seconds (410212.450 bytes/sec)
- with iotest.py and echo_client.py both running Python 3.2: 28.160 seconds (372362.459 bytes/sec)

As already said, the "spinning endlessly" loop is a best case for thread switching latency in 2.x, because the opcodes are very short. If each opcode in the loop has an average duration of 20 ns, and with the default check interval of 100, the GIL gets speculatively released every 2 us (yes, microseconds). That's why I suggested trying more "realistic" workloads, as in ccbench.

Also, as I told you, there might also be interactions with the various timing heuristics the TCP stack of the kernel applies. It would be nice to test with UDP.

That said, the observations are interesting.
msg99461 - (view) Author: David Beazley (dabeaz) Date: 2010-02-17 04:01
The comment on the CPU-bound workload is valid--it is definitely true that Python 2.6 results will degrade as the workload of each tick is increased.   Maybe a better way to interpreter those results is as a baseline of what kind of I/O performance is possible if there is a quick I/O response time.

However, ignoring that and the comparison between Python 2.6 and 3.2, there is still a serious performance issue with I/O in 3.2.  For example, the dramatic decrease in I/O performance as there are more CPU bound threads competing and the fact that there is a huge performance gain when all but one CPU core is disabled. 

I tried the test using UDP packets and get virtually the exact same behavior described.  For instance, echoing 10MB (sent in 8k UDP packets) takes about 0.6s in Python 2.6 and 12.0s in Python-3.2.   The time shoots up to more than 40s if there are two CPU-bound threads. 

The problem being described really doesn't have anything to do with TCP vs. UDP or any part of the network stack.  It has everything to do with how the operating system buffers I/O requests and how I/O operations such as sends and receives complete immediately without blocking depending on system buffer characteristics (e.g., if there is space in the send buffer, a send will return immediately without blocking).   The fact that the GIL is released when it's not necessary in these cases is really the source of the problem.
msg99477 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-17 18:41
We could try not to release the GIL when socket methods are called on a non-blocking socket.

Regardless, I've re-run the tests under the Linux machine, with two spinning threads:

* python 2.7: 
25.580 seconds (409914.612 bytes/sec)
* python 3.2:
32.216 seconds (325485.029 bytes/sec)

(and as someone mentioned, the "priority requests" mechanism which was in the original new GIL patch might improve things. It's not an ideal time for me to test, right now :-))
msg99814 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-22 19:17
I'm attaching Dave's new UDP-based benchmarks, which eliminate the dependency on the TCP stack's behaviour.
msg99815 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-22 19:23
And here is an experimental patch which enables the "priority requests" mechanism which was in the original new GIL patch. "Experimental" because it only enables them on a couple of socket methods (enough to affect the benchmarks).

Here are the UDP benchmark results (with 2 background threads):
* 2.x trunk (old GIL):
11.379 seconds (921515.168 bytes/sec)
* vanilla py3k (new GIL):
27.400 seconds (382689.529 bytes/sec)
* patched py3k (new GIL + priority requests):
1.828 seconds (5737130.676 bytes/sec)

And here's patched py3k with 8 background threads:
3.058 seconds (3429136.193 bytes/sec)

(benchmarks run on a 8-core Linux machine)
msg99838 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-22 21:31
See also issue7993 for a patch adding a similar bandwidth benchmark to ccbench.
msg99840 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-22 21:41
Here is an improved version of the priority requests patch.
With this patch I get the following result (2 background threads):
0.255 seconds (41109347.194 bytes/sec)
msg99846 - (view) Author: David Beazley (dabeaz) Date: 2010-02-22 21:52
I posted some details about the priority GIL modifications I showed during my PyCON open-space session here:

http://www.dabeaz.com/blog/2010/02/revisiting-thread-priorities-and-new.html

I am attaching the .tar.gz file with modifications if anyone wants to look at them.  Note: I would not consider this to be solid enough to be any kind of official patch.   People should only look at it for the purposes of experimentation and for coming up with something better.
msg99875 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-02-22 23:35
Here is another patch based on a slightly different approach. Instead of being explicitly triggered in I/O methods, priority requests are decided based on the computed "interactiveness" of a thread. Interactiveness itself is a simple saturated counter (incremented when the GIL is dropped without request, decremented when the GIL is dropped after a request).

Benchmark numbers are basically the same as with gilprio2.patch.
msg100094 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2010-02-25 17:25
The UDP benchmarks look even better on Windows.
These are results of the attached scripts when running on an Intel Xeon (quad-core) 2.33GHz running Windows Server 2003 x64.

vanilla py3k
32-bit: 26.323 || 64-bit: 25.29

gilprio2 patched py3k
32-bit: 0.932  || 64-bit: 0.682

gilinter patched py3k
32-bit: 0.958  || 64-bit: 0.703
msg101116 - (view) Author: David Beazley (dabeaz) Date: 2010-03-15 14:21
Here's a short benchmark for everyone who thinks that my original benchmark was somehow related to TCP behavior.   This one doesn't even involve sockets:

from threading import Thread
import time

def writenums(f,n):
    start = time.time()
    for x in range(n):
        f.write("%d\n" % x)
    end = time.time()
    print(end-start)

def spin():
    while True:
        pass

# Uncomment to add a thread
#t1 = Thread(target=spin)
#t1.daemon=True
#t1.start()

writenums(open("/tmp/nums","w"),1000000)


If I run this on my Macbook with no threads, it takes about 1.05 seconds.  If the one spinning thread is turned on, the time jumps to about 4.5 seconds.  What you're seeing is that the spinning thread unfairly hogs the CPU.

If I use my own patched version (new GIL with priorities), the threaded version drops back down to about 1.10 seconds.   I have not tried it with Antoine's latest patch, but would expect similar results as he is also using priorities.

Just to be clear, this issue is not specific to sockets or TCP.
msg101118 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-15 15:06
On some platforms the difference is not so important.
I ran it in Debian Lenny AMD64 "Core2 Duo P9600 @2.66GHz".

# Python 3.2a0 (py3k:78982M, Mar 15 2010, 15:40:42) 
# [GCC 4.3.4] on linux2

0.67s without thread
0.84s with spinning thread
msg101120 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-15 15:44
With line buffering, I see the issue.

 * 6 s without thread
 * 115 s with the spinning thread (varying: 60 s, 98 s)
 * 16 s with the spinning thread and the last "gilinter.patch"


# Modified version of the test case, with bufsize=1

from threading import Thread
import time

def writenums(f, n):
    start = time.time()
    for x in range(n):
        f.write("%d\n" % x)
    end = time.time()
    print(end-start)

def spin():
    while True:
        pass

t1 = Thread(target=spin)
t1.daemon=True
# Uncomment to add a thread
#t1.start()

# With line buffering
writenums(open("./nums", "w", 1), 1000000)
msg101125 - (view) Author: David Beazley (dabeaz) Date: 2010-03-15 16:38
Whoa, that's pretty diabolically evil with bufsize=1.  On my machine, doing that just absolutely kills the performance (13 seconds without the spinning thread versus 557 seconds with the thread!).  Or, put another way, the writing performance drops from about 0.5 Mbyte/sec down to 12 Kbytes/sec with the thread.   With my priority GIL, the time is about 29 seconds with the thread (consistent with your experiment using the gilinter patch).
  
One thing to think about with this example is the proper priority of I/O handling generally.  What if, instead of a file, this example code was writing on a pipe to another process?   For that, you would probably want that I/O thread to be able to blast its data to the receiver as fast as it reasonably can so that it can be done with it and get back to other work. 

In terms of practical relevance, this test again represents a simple situation where computation is overlapped with I/O processing.  Perhaps the program has just computed a big result which is now being streamed somewhere else by a background thread.  In the meantime, the program is now working on computing the next result (the spinning thread).  Think queues, actors, or any number of similar things---there are programs that try to operate like this.
msg101127 - (view) Author: David Beazley (dabeaz) Date: 2010-03-15 16:53
Almost forgot--if I turn off one of the CPU cores, the time drops from 557 seconds to 32 seconds.   Gotta love it!
msg101141 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-16 00:59
> One thing to think about with
> this example is the proper priority of I/O handling generally.  What
> if, instead of a file, this example code was writing on a pipe to
> another process?   For that, you would probably want that I/O thread
> to be able to blast its data to the receiver as fast as it reasonably
> can so that it can be done with it and get back to other work. 

We should be careful with statements such as "you want probably want
[...]". There may be situations where you want such a thing; others
where you don't really want (or care about) it.

While IO responsiveness and throughput can be an important measure of
performance, it is not the only one and depending on the situation it
actually may not matter at all.
msg101142 - (view) Author: David Beazley (dabeaz) Date: 2010-03-16 01:13
Oh the situation definitely matters.   Although, in the big picture, most programmers would probably prefer to have fast I/O performance over slow I/O performance :-).
msg101144 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-16 01:22
> Oh the situation definitely matters.   Although, in the big picture,
> most programmers would probably prefer to have fast I/O performance
> over slow I/O performance :-).

Yes, of course. But that's not the point. We could try to improve GIL
behaviour in every thinkable situation, at the expense of code
complexity. But we don't want to make the code too complex and
delicate to maintain, and that's why I've asked for advice on
python-dev.
msg101146 - (view) Author: David Beazley (dabeaz) Date: 2010-03-16 02:06
I absolutely agree 100% that it is not worth trying to fix the GIL for every conceivable situation (although if you could, I wouldn't complain).

To me, there are really only two scenarios worth worrying about:

1. Get rid of all of that multicore lock thrashing that happens if there are multiple CPU-bound threads competing.  Python programmers should know that intentionally using threads like this won't offer a performance boost.  However, it would still be nice to avoid all of that needless overhead when it happens by design or by accident (the new GIL already addresses this).

2. Make sure that you can get reasonable I/O performance when there is *one* CPU-bound thread competing with some background I/O processing.  This covers the use case of overlapped I/O and computation as well as almost every practical problem related to communicating processes, multiprocessing, queuing, etc. 

As for everything else, it's probably not worth worrying about so much.  If someone is only doing I/O (e.g., a web server), their code is going to behave about the same as before (although maybe slightly better under heavy load if there's less GIL contention).  Situations where someone intentionally tries to set up multiple long-running CPU-bound threads seems pretty unlikely---the very presence of the GIL wouldn't make that kind of programming model attractive in the first place, so why would they do it?
msg101148 - (view) Author: David Beazley (dabeaz) Date: 2010-03-16 02:36
You know, I almost wonder whether this whole issue could be fixed by just adding a user-callable function to optionally set a thread priority number.  For example:

    sys.setpriority(n)

Modify the new GIL code so that it checks the priority of the currently running thread against the priority of the thread that wants the GIL.  If the running thread has lower priority, it immediately drops the GIL. 

Other than having this added preemption, do nothing else---just throw it all back to the user to come up with the proper "priorities." 

If there was something like this, it would completely fix the overlapped  compute and I/O problem I mentioned.  I'd just set a higher priority on the background I/O threads and be done with it.  Problem solved.

Ok, it's only a thought.
msg101192 - (view) Author: Nir Aides (nirai) Date: 2010-03-16 21:17
I tried Florent's modification to the write test and did not see the effect on my machine with an updated revision of Python32.

I am running Ubuntu Karmic 64 bit.
7s - no background threads.
20s - one background thread.

According to the following documentation the libc condition is using scheduling policy when waking a thread and not FIFO order:
The following documentation suggests ordering in Linux is not FIFO:
http://www.opengroup.org/onlinepubs/000095399/functions/pthread_cond_timedwait.html#tag_03_518_08_06

I upload a quick and dirty patch (linux-7946.patch) to the new GIL just to reflect this by avoiding the timed waits.  

On my machine it behaves reasonably both with the TCP server and with the write test, but so does unpatched Python 3.2.

I noticed high context switching rate with dave's priority GIL - with both tests it goes above 40K/s context switches.
msg101197 - (view) Author: Nir Aides (nirai) Date: 2010-03-16 22:36
I updated the patch with a small fix and increased the ticks countdown-to-release considerably. This seems to help the OS classify CPU bound threads as such and actually improves IO performance.
msg101612 - (view) Author: Nir Aides (nirai) Date: 2010-03-24 00:41
I upload bfs.patch

To apply the patch use the following commands on updated python 3.2:
$ patch -fp1 < bfs.patch
$ ./configure

The patch replaces the GIL with a scheduler. The scheduler is a simplified implementation of the recent kernel Brain F**k Scheduler by the Linux hacker Con Kolivas:

http://ck.kolivas.org/patches/bfs/sched-BFS.txt
"The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
completely do away with the complex designs of the past for the cpu process
scheduler and instead implement one that is very simple in basic design.
The main focus of BFS is to achieve excellent desktop interactivity and
responsiveness without heuristics and tuning knobs that are difficult to
understand, impossible to model and predict the effect of, and when tuned to
one workload cause massive detriment to another."

Con Kolivas is the hacker whose work inspired the current CFS scheduler of the Linux Kernel. 

On my core 2 duo laptop it performs as follows compared to the other patches:

1) Florent's writenums() test: ~same
2) UDP test: x6 faster
3) cpued test: works as expected, while the other patches starve the pure python threads.

cpued test spins 3 threads, 2 of them pure python and the 3rd does time.sleep(0) every ~1ms:

import threading
import time

def foo(n):
    while n > 0:
        'y' in 'x' * n
        n -= 1

def bar(sleep, name):
    for i in range(100):
        print (name, i, sleep)
        for j in range(300):
            foo(1500)
            if sleep:
                time.sleep(0)

t0 = threading.Thread(target=bar, args=(False, 't0'))
t1 = threading.Thread(target=bar, args=(False, 't1'))
t2 = threading.Thread(target=bar, args=(True, 't2-interactive'))

list(map(threading.Thread.start, [t0, t1, t2]))
list(map(threading.Thread.join, [t0, t1, t2]))


The patch is still work in progress. In particular:
1) I still need to add support for Windows.
2) It currently requires Posix clock_gettime() and assumes good timer resolution.
3) I only verified it builds on Ubuntu Karmic 64bit.
4) I still need to optimize it and address cleanup.

The scheduler is very simple, straight forward and flexible, and it addresses the tuning problems discussed recently.

I think it can be a good replacement to the GIL, since Python really needs a scheduler, not a lock.
msg101613 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-24 01:13
> I upload bfs.patch

Interesting patch, but:
- Please give understandable benchmark numbers, including an explicit
comparison with baseline 3.2, and patched 3.2 (e.g. gilinter.patch)
- Please also measure single-thread performance, because it looks like
you are adding significant work inside the core eval loop
- Do you need a hi-res clock? gettimeofday() already gives you
microseconds. It looks like a bit of imprecision shouldn't be
detrimental.
- The magic number DEADLINE_FACTOR looks gratuitous (why 1.1^20 ?)
- By the way, I would put COND_SIGNAL inside the LOCK_MUTEX /
UNLOCK_MUTEX pair in bfs_yield().

If this gets accepted there will be cosmetic issues to watch out (and
the patch should be cross-platform).
msg101697 - (view) Author: Nir Aides (nirai) Date: 2010-03-25 13:51
I upload an updated bfs.patch. Apply to updated py32 and ignore the error with:

$ patch -fp1 < bfs.patch
$ ./configure


> Please give understandable benchmark numbers, including an explicit comparison with baseline 3.2, and patched 3.2 (e.g. gilinter.patch)

Below.

> Please also measure single-thread performance, because it looks like you are adding significant work inside the core eval loop

Removed most of it now. last bit will be removed soon.

> Do you need a hi-res clock? gettimeofday() already gives you microseconds. It looks like a bit of imprecision shouldn't be detrimental.

I use clock_gettime() to get the thread running time to calculate slice depletion. Wall clock can not help with that.

> The magic number DEADLINE_FACTOR looks gratuitous (why 1.1^20 ?) 

To my understanding it controls the CPU load (~6) beyond which threads tend to expire. Since expired threads are handled in FIFO order, IO threads do not preempt them (IO threads are chronically expired). So beyond that load IO threads become less responsive.

> By the way, I would put COND_SIGNAL inside the LOCK_MUTEX / UNLOCK_MUTEX pair in bfs_yield().

Done.

Here are benchmark results of the UDP test as timed with ipython, where client.work() is a single run of the client:

System: Core 2 Duo (locked at 2.4 GHz) with Ubuntu Karmic 64 bit.

Vanilla Python 3.2: 

* Note on my system the original problem discussed in this issue report does not manifest since conditions wake up threads according to OS scheduling policy.

In [28]: %timeit -n3 client.work()
1.290 seconds (8127084.435 bytes/sec)
1.488 seconds (7045285.926 bytes/sec)
2.449 seconds (4281485.217 bytes/sec)
1.874 seconds (5594303.222 bytes/sec)
1.853 seconds (5659626.496 bytes/sec)
0.872 seconds (12023425.779 bytes/sec)
4.951 seconds (2117942.079 bytes/sec)
0.728 seconds (14409157.126 bytes/sec)
1.743 seconds (6016999.707 bytes/sec)
3 loops, best of 3: 1.53 s per loop

gilinter.patch:

In [31]: %timeit -n3 client.work()
5.192 seconds (2019676.396 bytes/sec)
1.613 seconds (6500071.475 bytes/sec)
3.057 seconds (3429689.199 bytes/sec)
3.486 seconds (3007596.468 bytes/sec)
4.324 seconds (2424791.868 bytes/sec)
0.964 seconds (10872708.606 bytes/sec)
3.510 seconds (2987722.960 bytes/sec)
1.362 seconds (7698999.458 bytes/sec)
1.013 seconds (10353913.920 bytes/sec)
3 loops, best of 3: 1.96 s per loop

PyCON patch:

In [32]: %timeit -n3 client.work()
2.483 seconds (4223256.889 bytes/sec)
1.330 seconds (7882880.263 bytes/sec)
1.737 seconds (6036251.315 bytes/sec)
1.348 seconds (7778296.679 bytes/sec)
0.983 seconds (10670811.638 bytes/sec)
1.419 seconds (7387226.333 bytes/sec)
1.057 seconds (9919412.977 bytes/sec)
2.483 seconds (4223205.791 bytes/sec)
2.121 seconds (4944231.292 bytes/sec)
3 loops, best of 3: 1.25 s per loop

bfs.patch:

In [33]: %timeit -n3 client.work()
0.289 seconds (36341875.356 bytes/sec)
0.271 seconds (38677439.991 bytes/sec)
0.476 seconds (22033958.947 bytes/sec)
0.329 seconds (31872974.070 bytes/sec)
0.478 seconds (21925125.894 bytes/sec)
0.242 seconds (43386204.271 bytes/sec)
0.213 seconds (49195701.418 bytes/sec)
0.309 seconds (33967467.196 bytes/sec)
0.256 seconds (41008076.688 bytes/sec)
3 loops, best of 3: 259 ms per loop


Output of cpued.py test:

Vanilla Python 3.2, gilinter.patch and PyCON patch all starve the pure Python threads and output the following:

$ ~/build/python/python32/python cpued.py 
t0 0 False
t1 0 False
t2-interactive 0 True
t2-interactive 1 True
t2-interactive 2 True
t2-interactive 3 True
t2-interactive 4 True
t2-interactive 5 True
t2-interactive 6 True
t2-interactive 7 True
.
.
.


Output from bfs.patch run:

$ ~/build/python/bfs/python cpued.py 
t0 0 False
t1 0 False
t2-interactive 0 True
t0 1 False
t1 1 False
t2-interactive 1 True
t0 2 False
t1 2 False
t2-interactive 2 True
t0 3 False
t1 3 False
t2-interactive 3 True
.
.
.

Note: I have not tested on other Posix systems, and expect to have some complications on Windows, since its thread timers are low resolution (10ms+), and there are issues with its high-precision wall clock. ...will soon know better.
msg101700 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-25 14:21
> I use clock_gettime() to get the thread running time to calculate slice depletion.

Ouch. CLOCK_THREAD_CPUTIME_ID is not a required part of the standard. Only CLOCK_REALTIME is guaranteed to exist.

By the way, it's not obvious cpued tests anything meaningful. I understand the bias you are trying to avoid but constructing artificial test cases is not very useful, because we are playing with heuristics and it's always possible to defeat some expectations. That's why benchmarks should try to model/represent real-world situations.


I've tried ccbench with your patch and there's a clear regression in latency numbers:

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 1 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 2 ms.)
CPU threads=3: 0 ms. (std dev: 2 ms.)
CPU threads=4: 2 ms. (std dev: 8 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 969 ms. (std dev: 577 ms.)
CPU threads=2: 1050 ms. (std dev: 577 ms.)
CPU threads=3: 1217 ms. (std dev: 577 ms.)
CPU threads=4: 1416 ms. (std dev: 577 ms.)

Vanilla py3k and py3k+gilinter both don't have this problem (on the regular expression workload).

IO bandwidth numbers are similar between the BFS patch:

Background CPU task: Pi calculation (Python)

CPU threads=0: 6952.7 packets/s.
CPU threads=1: 4350.3 ( 62 %)
CPU threads=2: 3332.2 ( 47 %)
CPU threads=3: 2632.0 ( 37 %)
CPU threads=4: 2052.0 ( 29 %)

and py3k + gilinter:

Background CPU task: Pi calculation (Python)

CPU threads=0: 7023.4 packets/s.
CPU threads=1: 5019.4 ( 71 %)
CPU threads=2: 4474.6 ( 63 %)
CPU threads=3: 2728.5 ( 38 %)
CPU threads=4: 2244.9 ( 31 %)
msg101707 - (view) Author: Nir Aides (nirai) Date: 2010-03-25 15:32
> Ouch. CLOCK_THREAD_CPUTIME_ID is not a required part of the standard. Only CLOCK_REALTIME is guaranteed to exist.

Right, however the man page at kernel.org says the following on CLOCK_THREAD_CPUTIME_ID: 
"Sufficiently recent versions of glibc and the Linux kernel support the following clocks"
http://www.kernel.org/doc/man-pages/online/pages/man2/clock_getres.2.html

The same statement shows up as early as 2003:
http://www.tin.org/bin/man.cgi?section=3&topic=clock_gettime

However, if this is indeed a problem on some systems (none Linux?), then a fall back could be attempted for them. 

There could also be a problem on systems where the counter exists but has low resolution 10ms+

What platforms do you think this could be a problem on?

> By the way, it's not obvious cpued tests anything meaningful. I understand the bias you are trying to avoid but constructing artificial test cases is not very useful, because we are playing with heuristics and it's always possible to defeat some expectations. That's why benchmarks should try to model/represent real-world situations.

I came up with cpued.py after reading the patches in an attempt to understand how they behave. In this case one thread is pure Python while the other occasionally releases the GIL, both CPU bound. I don't claim this is a real-world situation. However, it is a case in which bfs.patch behaves as expected.

> I've tried ccbench with your patch and there's a clear regression in latency numbers.

Please specify system and test details so I can try to look into it. On my system ccbench behaves as expected:

$ ~/build/python/bfs/python ccbench.py
== CPython 3.2a0.0 (py3k) ==
== x86_64 Linux on '' ==

--- Throughput ---

Pi calculation (Python)

threads=1: 1252 iterations/s.
threads=2: 1199 ( 95 %)
threads=3: 1178 ( 94 %)
threads=4: 1173 ( 93 %)

regular expression (C)

threads=1: 491 iterations/s.
threads=2: 478 ( 97 %)
threads=3: 472 ( 96 %)
threads=4: 477 ( 97 %)

SHA1 hashing (C)

threads=1: 2239 iterations/s.
threads=2: 3719 ( 166 %)
threads=3: 3772 ( 168 %)
threads=4: 3464 ( 154 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 1 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 1 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 6 ms. (std dev: 0 ms.)
CPU threads=2: 2 ms. (std dev: 2 ms.)
CPU threads=3: 1 ms. (std dev: 0 ms.)
CPU threads=4: 5 ms. (std dev: 7 ms.)

Background CPU task: SHA1 hashing (C)

CPU threads=0: 0 ms. (std dev: 1 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 1 ms.)
CPU threads=3: 1 ms. (std dev: 1 ms.)
CPU threads=4: 1 ms. (std dev: 0 ms.)
msg101711 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-25 15:52
> Please specify system and test details so I can try to look into it.

It's a dual-core Linux x86-64 system. But, looking at the patch again,
the reason is obvious:

#define CHECK_SLICE_DEPLETION(tstate) (bfs_check_depleted || (tstate->tick_counter % 1000 == 0))

`tstate->tick_counter % 1000` is replicating the behaviour of the old
GIL, which based its speculative operation on the number of elapsed
opcodes (and which also gave bad latency numbers on the regex workload).

By the way, I configure --with-computed-gotos.
msg101714 - (view) Author: Nir Aides (nirai) Date: 2010-03-25 16:03
> It's a dual-core Linux x86-64 system. But, looking at the patch again, the reason is obvious:
>
> #define CHECK_SLICE_DEPLETION(tstate) (bfs_check_depleted || (tstate
> >tick_counter % 1000 == 0))
>
> `tstate->tick_counter % 1000` is replicating the behaviour of the old GIL, which based its speculative operation on the number of elapsed opcodes (and which also gave bad latency numbers on the regex workload).

The flag_check_depleted is there to work around this problem. It is raised by waiters which timeout.

What distribution and version of GNU/Linux are you using?

As for the CLOCK_THREAD_CPUTIME_ID clock, support was added to FreeBSD recently in version 7.1, which I guess is not good enough:
http://www.freebsd.org/releases/7.1R/relnotes.html

I did not find yet anything on Solaris. Do you know of an alternative way to measure thread running time on Posix?
msg101715 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-25 16:08
> What distribution and version of GNU/Linux are you using?

$ cat /etc/mandriva-release 
Mandriva Linux release 2010.0 (Official) for x86_64

$ rpm -qv glibc
glibc-2.10.1-6.2mnb2

$ uname -a
Linux localhost 2.6.31.5-desktop-1mnb #1 SMP Fri Oct 23 00:05:22 EDT
2009 x86_64 AMD Athlon(tm) 64 X2 Dual Core Processor 3600+ GNU/Linux

I'll try on an Ubuntu system some other day, if I find the time.

> I did not find yet anything on Solaris. Do you know of an alternative
> way to measure thread running time on Posix?

No, but I'm not an expert.
msg101724 - (view) Author: Nir Aides (nirai) Date: 2010-03-25 17:58
Well, on initial check the scheduler seems to work well with regular gettimeofday() wall clock instead of clock_gettime().

:)

/* Return thread running time in seconds (with nsec precision). */
static inline long double get_thread_timestamp(void) {
    return get_timestamp(); // wall clock via gettimeofday()
    /*struct timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return (long double) ts.tv_sec + ts.tv_nsec * 0.000000001;*/
}

Does it make things better on your system?
msg101737 - (view) Author: Nir Aides (nirai) Date: 2010-03-25 23:52
Uploaded an updated bfs.patch

The latency problem was related to the --with-computed-gotos flag. I fixed it and it seems to work fine now. 

I also switched to gettimeofday() so it should work now on all Posix with high resolution timer.
msg101772 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-26 20:09
> I also switched to gettimeofday() so it should work now on all Posix 
> with high resolution timer

But on a busy system, won't measuring wall clock time rather than CPU time give bogus results?
msg101773 - (view) Author: Nir Aides (nirai) Date: 2010-03-26 20:25
> But on a busy system, won't measuring wall clock time rather than CPU time give bogus results?

This was the motivation for using clock_gettime(). I tried the wall clock version under load (including on single core system) and it seems to behave. Now it remains to rationalize it :)
msg101821 - (view) Author: Nir Aides (nirai) Date: 2010-03-27 08:46
gilinter.patch has good IO latency in UDP test on my system when built with --with-computed-gotos:

In [34]: %timeit -n3 client.work()
0.320 seconds (32782026.509 bytes/sec)
0.343 seconds (30561727.443 bytes/sec)
0.496 seconds (21154075.417 bytes/sec)
0.326 seconds (32171215.998 bytes/sec)
0.462 seconds (22701809.421 bytes/sec)
0.378 seconds (27722146.793 bytes/sec)
0.391 seconds (26826713.409 bytes/sec)
0.315 seconds (33335858.720 bytes/sec)
0.281 seconds (37349508.136 bytes/sec)
3 loops, best of 3: 329 ms per loop
msg101825 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-03-27 13:26
Hmm, the gilinter patch shouldn't be sensitive to whether computed gotos are enabled or not. Here is an updated patch, though, the previous one didn't apply cleanly anymore. I've also made the priority condition a bit stricter.
msg101854 - (view) Author: Nir Aides (nirai) Date: 2010-03-28 04:31
I update bfs.patch. It now builds on Windows (and Posix).
msg102043 - (view) Author: Nir Aides (nirai) Date: 2010-03-31 22:58
I upload a new update to bfs.patch which improves scheduling and reduces overhead.
msg102649 - (view) Author: Nir Aides (nirai) Date: 2010-04-08 22:08
Uploaded an update.
msg102890 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-11 21:18
A couple remarks on BFS-based patch:
- nothing guarantees that you'll get a msec resolution
- gettimeofday returns you wall clock time: if a process that modifies time is running, e.g. ntpd, you'll likely to run into trouble. the value returned is _not_ monotonic, but clock_gettime(CLOCK_MONOTONIC) is
- inline functions are used, but it's not ANSI
- static inline long double get_timestamp(void) {
    struct timeval tv;
    GETTIMEOFDAY(&tv);
    return (long double) tv.tv_sec + tv.tv_usec * 0.000001;
}
the product is computed as double, and then promoted as (long double).
- the code uses a lot of floating point calculation, which is slower than integer

Otherwise:
"- You know, I almost wonder whether this whole issue could be fixed by just adding a user-callable function to optionally set a thread priority number.  For example:

    sys.setpriority(n)

Modify the new GIL code so that it checks the priority of the currently running thread against the priority of the thread that wants the GIL.  If the running thread has lower priority, it immediately drops the GIL."

The problem with this type of fixed-priority is starvation. And it shouldn't be up to the user to set the priorities. And some threads can mix I/O and CPU intensive tasks.

> It's a dual-core Linux x86-64 system. But, looking at the patch again, the reason is obvious:
>
> #define CHECK_SLICE_DEPLETION(tstate) (bfs_check_depleted || (tstate
> >tick_counter % 1000 == 0))
>
> `tstate->tick_counter % 1000` is replicating the behaviour of the old GIL, which based its speculative operation on the number of elapsed opcodes (and which also gave bad latency numbers on the regex workload).

I find this suspicous too. I haven't looked at the patch in detail, but what does the number of elapsed opcodes offers you over the timesplice expiration approach ?
msg103154 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-14 22:00
Some more remarks:
- COND_TIMED_WAIT macro modifies timeout_result when pthread_cond_timewait expires. But timeout_result is not an int pointer, just an int. So it is never updated, and as a result, bfs_check_depleted is never set after a thread has waited for the current running thread to schedule it in vain (in _bfs_timed_wait).
- the scheduling function bfs_find_task returns the first task that has an expired deadline. since an expired deadline probably means that the scheduler hasn't run for a while, it might be worth it to look for the thread with the oldest deadline and serve it first, instead of stopping at the first one
- calls to COND_WAIT/COND_TIMED_WAIT should be run in loops checking for the predicate, since it might be false even after these call return (spurious wakeups, etc):
"In general, whenever a condition wait returns, the thread has to re-evaluate the predicate associated with the condition wait to determine whether it can safely proceed, should wait again, or should declare a timeout. A return from the wait does not imply that the associated predicate is either true or false. 

It is thus recommended that a condition wait be enclosed in the equivalent of a "while loop" that checks the predicate."
msg103186 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-15 07:27
Please disregard my remark on COND_TIMED_WAIT not updating timeout_result, it's wrong (it's really a macro, not a function...)
msg103320 - (view) Author: Nir Aides (nirai) Date: 2010-04-16 13:10
I uploaded an update to bfs.patch which improves behavior in particular on non-Linux multi-core (4+) machines. 

Hi Charles-Francois, Thanks for taking the time to review this patch!

> - nothing guarantees that you'll get a msec resolution

Right, the code should behave well with low precision clocks as long as short (sub-tick) tasks are not synchronized with the slice interval. There is a related discussion of this problem in schedulers in the section on sub-tick accounting in: http://ck.kolivas.org/patches/bfs/sched-BFS.txt

On which target systems can we expect not to have high precision clock?

> - gettimeofday returns you wall clock time: if a process that modifies time is running, e.g. ntpd, you'll likely to run into trouble. the value returned is _not_ monotonic, but clock_gettime(CLOCK_MONOTONIC) is
> - inline functions are used, but it's not ANSI
> - static inline long double get_timestamp(void) {
>    struct timeval tv;
>    GETTIMEOFDAY(&tv);
>    return (long double) tv.tv_sec + tv.tv_usec * 0.000001;
> }

I added timestamp capping to the code. timestamp is used for waiting and therefore I think the source should be either CLOCK_REALTIME or gettimeofday().

> > `tstate->tick_counter % 1000` is replicating the behaviour of the old GIL, which based its speculative operation on the number of elapsed opcodes (and which also gave bad latency numbers on the regex workload).
>
> I find this suspicous too. I haven't looked at the patch in detail, but what does the number of elapsed opcodes offers you over the timesplice expiration approach ?

More accurate yielding. It is possible a better mechanism can be thought of and/or maybe it is indeed redundant.

> It is thus recommended that a condition wait be enclosed in the equivalent of a "while loop" that checks the predicate."

Done.
msg103414 - (view) Author: Nir Aides (nirai) Date: 2010-04-17 17:47
> the scheduling function bfs_find_task returns the first task that 
> has an expired deadline. since an expired deadline probably means 
> that the scheduler hasn't run for a while, it might be worth it to 
> look for the thread with the oldest deadline and serve it first, 
> instead of stopping at the first one

This is by design of BFS as I understand it. Next thread to run is either first expired or oldest deadline:

http://ck.kolivas.org/patches/bfs/sched-BFS.txt
"Once a task is descheduled, it is put back on the queue, and an
O(n) lookup of all queued-but-not-running tasks is done to determine which has
the earliest deadline and that task is chosen to receive CPU next. The one
caveat to this is that if a deadline has already passed (jiffies is greater
than the deadline), the tasks are chosen in FIFO (first in first out) order as
the deadlines are old and their absolute value becomes decreasingly relevant
apart from being a flag that they have been asleep and deserve CPU time ahead
of all later deadlines."
msg103460 - (view) Author: Nir Aides (nirai) Date: 2010-04-18 00:23
Yet another update to bfs.patch.

I upload a variation on Florent's write test which prints progress of background CPU bound threads as: thread-name timestamp progress

Here are some numbers from Windows XP 32bit with Intel q9400 (4 cores). Builds produced with VS express (alas - no optimizations, so official builds may behave differently).

BFS - 
33% CPU, 37,000 context switches per second

z:\bfs\PCbuild\python.exe y:\writes.py
t1 2.34400010109 0
t2 2.42199993134 0
t1 4.67199993134 1
t2 4.79699993134 1
t1 7.01600003242 2
t2 7.20300006866 2
t1 9.375 3
t2 9.625 3
t1 11.7039999962 4
t2 12.0309998989 4
t1 14.0469999313 5
t2 14.4219999313 5
t1 16.4070000648 6
t2 16.7809998989 6
t1 18.7820000648 7
t2 19.125 7
t1 21.1570000648 8
t2 21.4839999676 8
t1 23.5 9
t2 23.8589999676 9
t1 25.8599998951 10
t2 26.2339999676 10
t1 28.2349998951 11
28.2189998627

gilinter - starves both bg threads and high rate of context switches.
45% CPU, 203,000 context switches per second

z:\gilinter\PCbuild\python.exe y:\writes.py
t1 13.0939998627 0
t1 26.4219999313 1
t1 39.8129999638 2
t1 53.1559998989 3
57.5470001698

PyCON - starves one bg thread and slow IO thread 
Py32 - starves IO thread as expected.

Note, PyCON, gilinter and py32 starve the bg thread with Dave's original buffered write test as well - http://bugs.python.org/issue7946#msg101116
msg104194 - (view) Author: David Beazley (dabeaz) Date: 2010-04-26 04:13
The attached patch makes two simple refinements to the new GIL implemented in Python 3.2.   Each is briefly described below.

1. Changed mechanism for thread time expiration

In the current implementation, threads perform a timed-wait on a condition variable.  If time expires and no thread switches have occurred, the currently running thread is forced to drop the GIL.

In the patch, timeouts are now performed by a special "GIL monitor" thread.  This thread runs independently of Python and simply handles time expiration.  Basically, it records the number of thread switches, sleeps for a specified interval (5ms), and then looks at the number of thread switches again.  If no switches occurred, it forces the currently running thread to drop the GIL.

With this monitor thread, it is no longer necessary to perform any timed condition variable waits.  This approach has a few subtle benefits.  First, threads no longer sit in a wait/timeout cycle when trying to get the GIL (so, there is less overhead).   Second, you get FIFO scheduling of threads.  When time expires, the thread that has been waiting the longest on the condition variable runs next.  Generally, you want this.

2. A very simple two-level priority mechanism

A new attribute 'cpu_bound' is added to the PyThreadState structure.  If a thread is ever forced to drop the GIL, this attribute is simply set True (1).  If a thread gives up the GIL voluntarily, it is set back to False (0).  This attribute is used to set up simple scheduling (described next).

There are now two separate condition variables (gil_cpu_cond) and (gil_io_cond) that separate waiting threads according to their cpu_bound attribute setting.  CPU-bound threads wait on gil_cpu_cond whereas I/O-bound threads wait on gil_io_cond. 

Using the two condition variables, the following scheduling rules are enforced:

   - If there are any waiting I/O bound threads, they are always signaled first, before any CPU-bound threads.
   - If an I/O bound thread wants the GIL, but a CPU-bound thread is running, the CPU-bound thread is immediately forced to drop the GIL.
   - If a CPU-bound thread wants the GIL, but another CPU-bound thread is running, the running thread is immediately forced to drop the GIL if its time period has already expired.

Results
-------
This patch gives excellent results for both the ccbench test and all of my previous I/O bound tests.  Here is the output:

== CPython 3.2a0.0 (py3k:80470:80497M) ==
== i386 Darwin on 'i386' ==

--- Throughput ---

Pi calculation (Python)

threads=1: 871 iterations/s.
threads=2: 844 ( 96 %)
threads=3: 838 ( 96 %)
threads=4: 826 ( 94 %)

regular expression (C)

threads=1: 367 iterations/s.
threads=2: 345 ( 94 %)
threads=3: 339 ( 92 %)
threads=4: 327 ( 89 %)

bz2 compression (C)

threads=1: 384 iterations/s.
threads=2: 728 ( 189 %)
threads=3: 695 ( 180 %)
threads=4: 707 ( 184 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 1 ms. (std dev: 2 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 2 ms. (std dev: 1 ms.)
CPU threads=2: 1 ms. (std dev: 1 ms.)
CPU threads=3: 1 ms. (std dev: 1 ms.)
CPU threads=4: 2 ms. (std dev: 1 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 2 ms.)
CPU threads=2: 2 ms. (std dev: 3 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)

--- I/O bandwidth ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 5850.9 packets/s.
CPU threads=1: 5246.8 ( 89 %)
CPU threads=2: 4228.9 ( 72 %)
CPU threads=3: 4222.8 ( 72 %)
CPU threads=4: 2959.5 ( 50 %)

Particular attention should be given to tests involving I/O performance.  In particular, here are the results of the I/O bandwidth test using the unmodified GIL:

--- I/O bandwidth ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 6007.1 packets/s.
CPU threads=1: 189.0 ( 3 %)
CPU threads=2: 19.7 ( 0 %)
CPU threads=3: 19.7 ( 0 %)
CPU threads=4: 5.1 ( 0 %)

Other Benefits
--------------
This patch does not involve any complicated libraries, platform specific functionality, low-level lock twiddling, or mathematically complex priority scheduling algorithms.  Emphasize: The code is simple.

Negative Aspects
----------------
This modification might introduce a starvation effect where CPU-bound threads never get to run if there is an extremely heavy load of I/O-bound threads competing for the GIL.

Comparison to BFS
-----------------
Still need to test. Would be curious.
msg104195 - (view) Author: David Beazley (dabeaz) Date: 2010-04-26 04:17
One comment on that patch I just submitted. Basically, it's an attempt to make an extremely simple tweak to the GIL that fixes most of the problems discussed here in an extremely simple manner.  I don't have any special religious attachment to it though.  Would love to see a BFS comparison.
msg104197 - (view) Author: David Beazley (dabeaz) Date: 2010-04-26 04:56
Here is the result of running the writes.py test with the patch I submitted.   This is on OS-X.

bash-3.2$ ./python.exe writes.py
t1 2.83990693092 0
t2 3.27937912941 0
t1 5.54346394539 1
t2 6.68237304688 1
t1 8.9648039341 2
t2 9.60041999817 2
t1 12.1856160164 3
t2 12.5866689682 3
t1 15.3869640827 4
t2 15.7042851448 4
t1 18.4115200043 5
t2 18.5771169662 5
t2 21.4922711849 6
t1 21.6835460663 6
t2 24.6117911339 7
t1 24.9126679897 7
t1 27.1683580875 8
t2 28.2728791237 8
t1 29.4513950348 9
t1 32.2438161373 10
t2 32.5283250809 9
t1 34.8905010223 11
t2 36.0952250957 10
t1 38.109760046 12
t2 39.3465380669 11
t1 41.5758800507 13
t2 42.587772131 12
t1 45.1536290646 14
t2 45.8339021206 13
t1 48.6495029926 15
t2 49.1581180096 14
t1 51.5414950848 16
t2 52.6768190861 15
t1 54.818582058 17
t2 56.1163961887 16
t1 58.1549630165 18
t2 59.6944830418 17
t1 61.4515309334 19
t2 62.7685520649 18
t1 64.3223180771 20
t2 65.8158640862 19
65.8578810692
msg104198 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-04-26 05:23
Nice dabeaz.

One potential concern with "dabeaz_gil.patch 2010-04-25 21:13" is that it appears to always leave the gil_monitor thread running.  This is bad on mobile/embedded platforms where waking up at regular intervals prevents advanced sleep states and wastes power/battery.  (practical example: the OLPC project has run into this issue in other code in the past)

Could this be modified so that gil_monitor stops looping (blocks) so long as there are only IO bound Python threads running or while no python thread owns the GIL?

In that situation a multithreaded python process that has either reverted to one thread or has all threads blocked in IO would be truly idle rather than leaving the gil_monitor polling.
msg104203 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-04-26 09:36
Dave,

> In the current implementation, threads perform a timed-wait on a
> condition variable.  If time expires and no thread switches have
> occurred, the currently running thread is forced to drop the GIL.

A problem, as far as I can see, is that these timeout sleeps run
periodically, regardless of the actual times at which thread switching
takes place. I'm not sure it's really an issue but it's a bit of a
departure from the "ideal" behaviour of the switching interval.

> A new attribute 'cpu_bound' is added to the PyThreadState structure.
> If a thread is ever forced to drop the GIL, this attribute is simply
> set True (1).  If a thread gives up the GIL voluntarily, it is set
> back to False (0).  This attribute is used to set up simple scheduling
> (described next).

Ok, so it's not very different, at least in principle, from what
gilinter.patch does, right?
(and actually, the benchmark results look very similar)
msg104219 - (view) Author: David Beazley (dabeaz) Date: 2010-04-26 11:08
Greg,

I like the idea of the monitor suspending if no thread owns the GIL.  Let me work on that.   Good point on embedded systems.

Antoine, 

Yes, the gil monitor is completely independent and simply ticks along every 5 ms.   A worst case scenario is that an I/O bound thread is scheduled shortly after the 5ms tick and then becomes CPU-bound afterwards.  In that case, the monitor might let it run up to about 10ms before switching it.  Hard to say if it's a real problem though---the normal timeslice on many systems is 10 ms so it doesn't seem out of line.  

As for the priority part, this patch should have similar behavior to the glinter patch except for very subtle differences in thread scheduling due to the use of the GIL monitor.  For instance, since threads never time out on the condition variable anymore, they tend to cycle execution in a purely round-robin fashion.
msg104233 - (view) Author: David Beazley (dabeaz) Date: 2010-04-26 13:21
I've updated the GIL patch to reflect concerns about the monitor thread running forever.  This version has a suspension mechanism where the monitor goes to sleep if nothing is going on for awhile.  It gets resumed if threads try to acquire the GIL, but timeout for some reason.
msg104234 - (view) Author: David Beazley (dabeaz) Date: 2010-04-26 13:31
I've also attached a new file schedtest.py that illustrates a subtle difference between having the GIL monitor thread and not having the monitor.

Without the monitor, every thread is responsible for its own scheduling.  If you have a lot of threads running, you may have a lot of threads all performing a timed wait and then waking up only to find that the GIL is locked and that they have to go back to waiting.  One side effect is that certain threads have a tendency to starve.

For example, if you run the schedtest.py with the original GIL, you get a trace where three CPU-bound threads run like this:

Thread-3 16632
Thread-2 16517
Thread-1 31669
Thread-2 16610
Thread-1 16256
Thread-2 16445
Thread-1 16643
Thread-2 16331
Thread-1 16494
Thread-3 16399
Thread-1 17090
Thread-1 20860
Thread-3 16306
Thread-1 19684
Thread-3 16258
Thread-1 16669
Thread-3 16515
Thread-1 16381
Thread-3 16600
Thread-1 16477
Thread-3 16507
Thread-1 16740
Thread-3 16626
Thread-1 16564
Thread-3 15954
Thread-2 16727
...

You will observe that Threads 1 and 2 alternate, but Thread 3 starves.  Then at some point, Threads 1 and 3 alternate, but Thread 2 starves. 

By having a separate GIL monitor, threads are no longer responsible for making scheduling decisions concerning timeouts.  Instead, the monitor is what times out and yanks threads off the GIL.  If you run the same test with the GIL monitor, you get scheduling like this:

Thread-1 33278
Thread-2 32278
Thread-3 31981
Thread-1 33760
Thread-2 32385
Thread-3 32019
Thread-1 32700
Thread-2 32085
Thread-3 32248
Thread-1 31630
Thread-2 32200
Thread-3 32054
Thread-1 32721
Thread-2 32659
Thread-3 34150

Threads nicely cycle round-robin.  There also appears to be about half as much thread switching (for reasons I don't quite understand).
msg104288 - (view) Author: Nir Aides (nirai) Date: 2010-04-26 23:22
Dave, there seems to be a bug in your patch on Windows XP. It crashes in ccbench.py with the following output:

>python_d.exe y:\ccbench.py
== CPython 3.2a0.0 (py3k) ==
== x86 Windows on 'x86 Family 6 Model 23 Stepping 10, GenuineIntel' ==

--- Throughput ---

Pi calculation (Python)

threads= 1:   840 iterations/s. balance
Fatal Python error: ReleaseMutex(mon_mutex) failed
threads= 2:   704 ( 83%)        0.8167
threads= 3:   840 (100%)        1.6706
threads= 4:   840 (100%)        2.0000

and the following stack trace:

 	ntdll.dll!7c90120e() 	
 	[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]	
 	python32_d.dll!Py_FatalError(const char * msg)  Line 2033	C
>	python32_d.dll!gil_monitor(void * arg)  Line 314 + 0x24 bytes	C
 	python32_d.dll!bootstrap(void * call)  Line 122 + 0x7 bytes	C
 	msvcr100d.dll!_callthreadstartex()  Line 314 + 0xf bytes	C
 	msvcr100d.dll!_threadstartex(void * ptd)  Line 297	C
 	kernel32.dll!7c80b729()
msg104293 - (view) Author: David Beazley (dabeaz) Date: 2010-04-27 00:52
New version of patch that will probably fix Windows-XP problems. Was doing something stupid in the monitor (not sure how it worked on Unix).
msg104303 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-27 09:23
@dabeaz
I'm getting random segfaults with your patch (even with the last one), pretty much everywhere malloc or free is called.
Ater skimming through the code, I think the problem is due to gil_last_holder:
In drop_gil and take_gil, you dereference gil_last_holder->cpu_bound, but it might very well happen that gil_last_holder points to a thread that has been deleted (through tstate_delete_common). Dereferencing is not risky, because there's a high chance that the address is still valid, but in drop_gil, you do this:
    /* Make the thread as CPU-bound or not depending on whether it was forced off */
    gil_last_holder->cpu_bound = gil_drop_request;

Here, if the thread has been deleted in meantine, you end up writting to a random location on the heap, and probably corrupting malloc administration data, which would explain why I get segfaults sometimes later on unrelated malloc() or free() calls.
I looked at it really quickly though, so please forgive me if I missed something obvious ;-)

@nirai
I have some more remarks on your patch:
- /* Diff timestamp capping results to protect against clock differences 
 * between cores. */
_LOCAL(long double) _bfs_diff_ts(long double ts1, long double ts0) {

I'm not sure I understand. You can have problem with multiple cores when reading directly the TSC register, but that doesn't affect gettimeofday. gettimeofday should be reliable and accurate (unless the OS is broken of course), the only issue is that since it's wall clock time, if a process like ntpd is running, then you'll run into problem
- pretty much all your variables are declared as volatile, but volatile was never meant as a thread-synchronization primitive. Since your variables are protected by mutexes, you already have all necessary memory barriers and synchronization, so volatile just prevents optimization
- you use some funtions just to perform a comparison or substraction, maybe it would be better to just remove those functions and perform the substractions/comparisons inline (you declared the functions inline but there's no garantee that the compiler will honor it).
- did you experiment with the time slice ? I tried some higher values and got better results, without penalizing the latency. Maybe it could be interesting to look at it in more detail (and on various platforms).
msg104309 - (view) Author: David Beazley (dabeaz) Date: 2010-04-27 10:48
Added extra pointer check to avoid possible segfault.
msg104312 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-27 12:20
I don't see segfaults anymore, but there's still an unsafe dereference of gil_last_holder inside take_gil:

    /* Wait on the appropriate GIL depending on thread's classification */
    if (!tstate->cpu_bound) {
      /* We are I/O bound.  If the current thread is CPU-bound, force it off now! */
      if (gil_last_holder->cpu_bound) {
	SET_GIL_DROP_REQUEST();
      }

You're still accessing a location that may have been free()'d previously: while it will work most of the time (that's why I said it's not as risky), if the page gets unmapped between the time the current thread is deleted and the next thread takes over, you'll get a segfault. And that's undefined behaviour anyway ;-)
msg104317 - (view) Author: David Beazley (dabeaz) Date: 2010-04-27 12:44
That second access of gil_last_holder->cpu_bound is safe because that block of code is never entered unless some other thread currently holds the GIL.   If a thread holds the GIL, then gil_last_holder is guaranteed to have a valid value.
msg104319 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-27 13:24
Didn't have much sleep last night, so please forgive me if I say something stupid, but:

Python/pystate.c:
void
PyThreadState_DeleteCurrent()
{
	PyThreadState *tstate = _PyThreadState_Current;
	if (tstate == NULL)
		Py_FatalError(
			"PyThreadState_DeleteCurrent: no current tstate");
	_PyThreadState_Current = NULL;
	tstate_delete_common(tstate);
	if (autoTLSkey && PyThread_get_key_value(autoTLSkey) == tstate)
		PyThread_delete_key_value(autoTLSkey);
	PyEval_ReleaseLock();
}

the current tstate is deleted and freed before releasing the GIL, so if another thread calls take_gil after the current thread has called tstate_delete_common but before it calls PyEval_ReleaseLock (which calls drop_gil and set gil_locked to 0), then it will enter this section and dereference gil_last_holder.
I just checked with valgrind, and he also reports an illegal dereference at this precise line.
msg104320 - (view) Author: David Beazley (dabeaz) Date: 2010-04-27 13:44
I stand corrected.   However, I'm going to have to think of a completely different approach for carrying out that functionality as I don't know how the take_gil() function is able to determine whether gil_last_holder has been deleted or not.   Will think about it and post an updated patch later. 

Do you have any examples or insight you can provide about how these segfaults have shown up in Python code?   I'm not able to observe any such behavior on OS-X or Linux.  Is this happening while running the ccbench program?  Some other program?
msg104324 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-04-27 14:31
> Do you have any examples or insight you can provide about how these segfaults have shown up in Python code?   I'm not able to observe any such behavior on OS-X or Linux.  Is this happening while running the ccbench program?  Some other program?

If you're talking about the first issue (segfaults due to writting to gil_last_holder->cpu_bound), it was occuring quite often during ccbench (pretty much anywhere malloc/free was called). I'm running a regular dual-core Linux box, nothing special.

For the second one, I didn't observe any segfault, I just figured this out reading the code and confirmed it with valgrind, but it's much less likely because the race window is very short and it also requires that the page is unmmaped in between.

If someone really wanted to get segfaults, I guess a good start would be:
- get a fast machine, multi-core is a bonus
- use a kernel with full preemption
- use a lot of threads (-n option with ccbench)
- use purify or valgrind --free-fill option so that you're sure to jump to noland if you dereference a previously-free'd pointer
msg104325 - (view) Author: David Beazley (dabeaz) Date: 2010-04-27 14:32
One more attempt at fixing tricky segfaults.   Glad someone had some eagle eyes on this :-).
msg104330 - (view) Author: Nir Aides (nirai) Date: 2010-04-27 15:00
On Tue, Apr 27, 2010 at 12:23 PM, Charles-Francois Natali wrote:

> @nirai
> I have some more remarks on your patch:
> - /* Diff timestamp capping results to protect against clock differences
>  * between cores. */
> _LOCAL(long double) _bfs_diff_ts(long double ts1, long double ts0) {
>
> I'm not sure I understand. You can have problem with multiple cores when reading directly the 
> TSC register, but that doesn't affect gettimeofday. gettimeofday should be reliable and accurate 
> (unless the OS is broken of course), the only issue is that since it's wall clock time, if a process 
> like ntpd is running, then you'll run into problem

I think gettimeofday() might return different results on different cores as result of kernel/hardware problems or clock drift issues in VM environments:
http://kbase.redhat.com/faq/docs/DOC-7864
https://bugzilla.redhat.com/show_bug.cgi?id=461640

In Windows the high-precision counter might return different results on different cores in some hardware configurations (older multi-core processors). I attempted to alleviate these problems by using capping and by using a "python time" counter constructed from accumulated slices, with the assumption that IO bound threads are unlikely to get migrated often between cores while running. I will add references to the patch docs.

> - did you experiment with the time slice ? I tried some higher values and got better results, 
> without penalizing the latency. Maybe it could be interesting to look at it in more detail (and 
> on various platforms).

Can you post more details on your findings? It is possible that by using a bigger slice, you helped the OS classify CPU bound threads as such and improved "synchronization" between BFS and the OS scheduler.

Notes on optimization of code taken, thanks.
msg104335 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-04-27 16:45
> I stand corrected.   However, I'm going to have to think of a
> completely different approach for carrying out that functionality as I
> don't know how the take_gil() function is able to determine whether
> gil_last_holder has been deleted or not.

Please note take_gil() currently doesn't depend on the validity of the
pointer. gil_last_holder is just used as an opaque value, equivalent to
a thread id.
msg104369 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2010-04-27 21:41
> In Windows the high-precision counter might return different results
> on different cores in some hardware configurations (older multi-core
> processors).

More specifically: some older multi-core processors where the HAL implements QueryPerformanceCounter using the TSC from the CPU, and the HAL doesn't keep the cores in sync and QPC doesn't otherwise account for it.  This is rare; frequently QPC is implemented using another source of time.

But it's true: QPC is not 100% reliable.  QPC can unfortunately jump backwards (when using TSC and you switch cores), jump forwards (when using TSC and you switch cores, or when using the PCI bus timer on P3-era machines with a specific buggy PCI south bridge controller), speed up or slow down (when using TSC and not accounting for changing CPU speed via SpeedStep &c).  The simple solution: give up QPC and use timeGetTime() with timeBeginPeriod(1), which is totally reliable but only has millisecond accuracy at best.

http://www.virtualdub.org/blog/pivot/entry.php?id=106
http://support.microsoft.com/default.aspx?scid=KB;EN-US;Q274323&;
msg104452 - (view) Author: Nir Aides (nirai) Date: 2010-04-28 19:17
On Wed, Apr 28, 2010 at 12:41 AM, Larry Hastings wrote:

> The simple solution: give up QPC and use timeGetTime() with timeBeginPeriod(1), which is totally 
> reliable but only has millisecond accuracy at best.

It is preferable to use a high precision clock and I think the code addresses the multi-core time skew problem (pending testing).
msg104453 - (view) Author: Nir Aides (nirai) Date: 2010-04-28 19:21
Dave, there seems to be some problem with your patch on Windows:

F:\dev>z:\dabeaz-wcg\PCbuild\python.exe y:\ccbench.py -b
== CPython 3.2a0.0 (py3k) ==
== x86 Windows on 'x86 Family 6 Model 23 Stepping 10, GenuineIntel' ==

--- I/O bandwidth ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 8551.2 packets/s.
CPU threads=1: 26.1 ( 0 %)
CPU threads=2: 26.0 ( 0 %)
CPU threads=3: 37.2 ( 0 %)
CPU threads=4: 33.2 ( 0 %)
msg104473 - (view) Author: David Beazley (dabeaz) Date: 2010-04-28 23:03
Wow, that is a *really* intriguing performance result with radically different behavior than Unix.  Do you have any ideas of what might be causing it?
msg104478 - (view) Author: Nir Aides (nirai) Date: 2010-04-29 06:35
On Thu, Apr 29, 2010 at 2:03 AM, David Beazley wrote:

> Wow, that is a *really* intriguing performance result with radically different behavior than Unix.  Do you have any ideas of what might be causing it?

Instrument the code and I'll send you a trace.
msg104613 - (view) Author: Nir Aides (nirai) Date: 2010-04-30 08:24
Dave, 

The behavior of your patch on Windows XP/2003 (and earlier) might be related to the way Windows boosts thread priority when it is signaled. 

Try to increase priority of monitor thread and slice size. Another thing to look at is how to prevent Python CPU bound threads from (starving) messing up scheduling of threads of other processes. Maybe increasing slice significantly can help in this too (50ms++ ?).

XP/NT/CE scheduling and thread boosting affect all patches and the current GIL undesirably (in different ways). Maybe it is possible to make your patch work nicely on these systems:
http://www.sriramkrishnan.com/blog/2006/08/tale-of-two-schedulers-win_115489794858863433.html

Vista and Windows 7 involve CPU cycle counting which results in more sensible scheduling:
http://technet.microsoft.com/en-us/magazine/2007.02.vistakernel.aspx
msg104899 - (view) Author: Nir Aides (nirai) Date: 2010-05-03 22:43
I updated bfs.patch with improvements on Windows XP. 

The update disables priority boosts associated with the scheduler condition on Windows for CPU bound threads.

Here is a link to ccbench results:
http://bugs.python.org/file17194/nir-ccbench-xp32.log

Summary:

Windows XP 32bit q9400 2.6GHz Release build (no PG optimizations).
Test runs in background, ccbench modified to run both bz2 and sha1.

bfs.patch - seems to behave.

gilinter2.patch
single core: high latency, low IO bandwidth.

dabeaz_gil.patch 
single core: low IO bandwidth.
4 cores: throughput threads starvation (balance), some latency, low IO bandwidth.
msg105687 - (view) Author: Nir Aides (nirai) Date: 2010-05-14 07:19
Duck, here comes another update to bfs.patch.

This one with some cleanups which simplify the code and improve behavior (on Windows XP), shutdown code, comments, and "experimental" use of TSC for timestamps, which eliminates timestamp reading overhead.

TSC (http://en.wikipedia.org/wiki/Time_Stamp_Counter) is a fast way to get high precision timing read. On some systems this is what gettimeofday() uses under the hood while on other systems it will use HPET or another source which is slower, typically ~1usec, but can be higher (e.g. my core 2 duo laptop occasionally goes into a few hours of charging 3usec per HPET gettimeofday() call - god knows why)

This overhead is incurred twice for every GIL release/acquire pair and can be eliminated with:
1) Hack the scheduler not to call gettimeofday() when no other threads are waiting to run, or
2) Use TSC on platforms it is available (the Linux BFS scheduler uses TSC).

I took cycle.h pointed by the Wikipedia article on TSC for a spin and it works well on my boxes. It is BSD, (un)maintained? and includes implementation for a gazillion of platforms (I did not yet modify configure.in as it recommends). 

If it breaks on your system please ping with details.

Some benchmarks running (Florent's) writes.py on Core 2 Quad q9400 Ubuntu 64bit:

bfs.patch - 35K context switches per second, threads balanced, runtime is 3 times that of running IO thread alone:

~/dev/python$ ~/build/python/bfs/python writes.py
t1 1.60293507576 1
t2 1.78533816338 1
t1 2.88939499855 2
t2 3.19518113136 2
t1 4.38062310219 3
t2 4.70725703239 3
t1 6.26874804497 4
t2 6.4078810215 4
t1 7.83273100853 5
t2 7.92976212502 5
t1 9.4341750145 6
t2 9.57891893387 6
t1 11.077393055 7
t2 11.164755106 7
t2 12.8495900631 8
t1 12.8979620934 8
t1 14.577999115 9
t2 14.5791089535 9
t1 15.9246580601 10
t2 16.1618289948 10
t1 17.365830183 11
t2 17.7345991135 11
t1 18.9782481194 12
t2 19.2790091038 12
t1 20.4994370937 13
t2 20.5710251331 13
21.0179870129


dabeaz_gil.patch - sometimes runs well but sometimes goes into high level of context switches (250K/s) and produces output such as this:

~/dev/python$ ~/build/python/dabeaz/python writes.py 
t1 0.742760896683 1
t1 7.50052189827 2
t2 8.63794493675 1
t1 10.1924870014 3
17.9419858456

gilinter2.patch - 300K context switches per second, bg threads starved:

~/dev/python$ ~/build/python/gilinter/python writes.py 
t2 6.1153190136 1
t2 11.7834780216 2
14.5995650291
msg105835 - (view) Author: Nir Aides (nirai) Date: 2010-05-15 22:50
Updated bfs.patch to patch cleanly updated py3k branch. Use:
$ patch -p1 < bfs.patch
msg105874 - (view) Author: Nir Aides (nirai) Date: 2010-05-16 19:16
A link to ccbench results comparing old GIL, old GIL with long check interval, new GIL and BFS:
http://bugs.python.org/file17370/nir-ccbench-linux.log

Summary:

Results for ccbench latency and bandwidth test run on Ubuntu Karmic 64bit, q9400 2.6GHz, all Python versions built with computed gotos optimization.

Old GIL:
Hi level of context switching and reduced performance. 
~90ms IO latency with pure Python CPU bound background threads and low IO bandwidth results.

Old GIL with sys.setcheckinterval(2500) as done by Zope:
Context switching level back to normal.
IO latency shoots through the roof. ~950ms (avg) is the maximum recordable value in this test since CPU load duration is 2sec.

New GIL:
The expected 5ms wait related IO latency and low IO bandwidth.

BFS patch:
Behaves.
msg106030 - (view) Author: Michele (Michele) Date: 2010-05-19 02:29
Attached ccbench-osx.log made today on OSX on latest svn checkout. Hope it helps
msg106780 - (view) Author: Nir Aides (nirai) Date: 2010-05-30 20:50
Updated bfs.patch with BSD license and copyright notice.

! Current version patches cleanly and builds with Python revision svn r81201.

Issue 7946 and proposed patches were put on hold indefinitely following this python-dev discussion: http://mail.python.org/pipermail/python-dev/2010-May/100115.html

I would like to thank the Python developer community and in particular David and Antoine for a most interesting ride.

Any party interested in sponsoring further development or porting patch to Python 2.x is welcome to contact me directly at nir@winpdb.org

Nir
msg106782 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-05-30 21:44
Thanks for all your work Nir!  I personally think the BFS approach is the best we've seen yet for this problem!

Having read the thread you linked to in full (ignoring the tagents  bikeshedding and mudslinging that went on there), it sounds like the general consensus is that we should take thread scheduling changes slowly and let the existing new implementation bake in the 3.2 release.  That puts this issue as a possibility for 3.3 if users demonstrate real world application problems in 3.2.

(personally I'd say it is already obvious that there are problems an wde should go ahead with your BFS based approach but realistically the we're still better off in 3.2 than we were in 3.1 and 2.x as is)
msg156883 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-03-27 00:05
> gettimeofday returns you wall clock time: if a process
> that modifies time is running, e.g. ntpd, you'll likely
> to run into trouble. the value returned is _not_ monotonic,
> ...

The issue #12822 asks to use monotonic clocks when available.
msg223109 - (view) Author: Dima Tisnek (Dima.Tisnek) Date: 2014-07-15 13:52
What happened to this bug and patch?
msg223110 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-07-15 13:53
Not much :) The patch is complex and the issue hasn't proved to be 
significant in production code.
Do you have a (real-world) workload where this shows up?

Le 15/07/2014 09:52, Dima Tisnek a écrit :
>
> Dima Tisnek added the comment:
>
> What happened to this bug and patch?
History
Date User Action Args
2014-09-03 06:45:13scodersetnosy: + scoder
2014-07-15 13:53:32pitrousetmessages: + msg223110
2014-07-15 13:52:17Dima.Tisneksetnosy: + Dima.Tisnek
messages: + msg223109
2012-03-27 00:05:20hayposetnosy: + haypo
messages: + msg156883
2011-06-09 05:56:15jmehnlesetnosy: + jmehnle
2011-01-03 19:49:32pitrousetpriority: high -> low
nosy: loewis, jhylton, gregory.p.smith, jcea, ncoghlan, pitrou, movement, larry, eric.smith, kevinwatters, tarek, djc, karld, carljm, coderanger, phsilva, durin42, eric.araujo, nirai, alex, andrix, konryd, brian.curtin, hozn, victorpoluceno, flox, DazWorrall, cool-RR, rh0dium, rcohen, dabeaz, mahmoudimus, portante, aconrad, ysj.ray, neologix, thouis, donaldjeo, Michele
versions: - Python 3.2
2010-07-13 10:29:08phsilvasetnosy: + phsilva
2010-07-10 18:41:05hoznsetnosy: + hozn
2010-05-30 21:44:33gregory.p.smithsetmessages: + msg106782
versions: + Python 3.3
2010-05-30 20:50:23niraisetfiles: + bfs.patch

messages: + msg106780
2010-05-30 20:38:25niraisetfiles: - bfs.patch
2010-05-19 02:29:42Michelesetfiles: + ccbench-osx.log
nosy: + Michele
messages: + msg106030

2010-05-18 15:38:21portantesetnosy: + portante
2010-05-17 12:35:57victorpolucenosetnosy: + victorpoluceno
2010-05-16 21:03:32ncoghlansetnosy: + ncoghlan
2010-05-16 19:16:34niraisetmessages: + msg105874
2010-05-16 19:11:22niraisetfiles: + nir-ccbench-linux.log
2010-05-15 22:50:30niraisetfiles: + bfs.patch

messages: + msg105835
2010-05-15 22:48:14niraisetfiles: - bfs.patch
2010-05-14 07:19:39niraisetfiles: + bfs.patch

messages: + msg105687
2010-05-14 07:19:02niraisetfiles: - bfs.patch
2010-05-03 22:44:11niraisetfiles: + bfs.patch

messages: + msg104899
2010-05-03 22:35:02niraisetfiles: - bfs.patch
2010-05-03 22:34:29niraisetfiles: + nir-ccbench-xp32.log
2010-04-30 08:24:03niraisetmessages: + msg104613
2010-04-29 06:35:57niraisetmessages: + msg104478
2010-04-28 23:03:23dabeazsetmessages: + msg104473
2010-04-28 19:21:08niraisetmessages: + msg104453
2010-04-28 19:17:43niraisetmessages: + msg104452
2010-04-27 21:41:46larrysetmessages: + msg104369
2010-04-27 16:45:56pitrousetmessages: + msg104335
2010-04-27 15:00:56niraisetmessages: + msg104330
2010-04-27 14:32:34dabeazsetfiles: + dabeaz_gil.patch

messages: + msg104325
2010-04-27 14:31:15neologixsetmessages: + msg104324
2010-04-27 14:31:01dabeazsetfiles: - dabeaz_gil.patch
2010-04-27 13:44:44dabeazsetmessages: + msg104320
2010-04-27 13:24:10neologixsetmessages: + msg104319
2010-04-27 12:44:45dabeazsetmessages: + msg104317
2010-04-27 12:43:53salgadosetnosy: - salgado
2010-04-27 12:40:22pitrousetnosy: loewis, jhylton, gregory.p.smith, jcea, pitrou, movement, larry, eric.smith, kevinwatters, tarek, djc, karld, carljm, coderanger, durin42, eric.araujo, nirai, alex, andrix, konryd, brian.curtin, flox, DazWorrall, salgado, cool-RR, rh0dium, rcohen, dabeaz, mahmoudimus, aconrad, ysj.ray, neologix, thouis, donaldjeo
components: + Interpreter Core, - None
2010-04-27 12:39:53pitrousetmessages: - msg104314
2010-04-27 12:24:50donaldjeosetnosy: + donaldjeo
messages: + msg104314
components: + None, - Interpreter Core
2010-04-27 12:20:43neologixsetmessages: + msg104312
2010-04-27 10:48:23dabeazsetfiles: + dabeaz_gil.patch

messages: + msg104309
2010-04-27 10:40:32dabeazsetfiles: - dabeaz_gil.patch
2010-04-27 09:23:30neologixsetmessages: + msg104303
2010-04-27 00:52:57dabeazsetfiles: + dabeaz_gil.patch

messages: + msg104293
2010-04-27 00:50:48dabeazsetfiles: - dabeaz_gil.patch
2010-04-26 23:22:53niraisetmessages: + msg104288
2010-04-26 13:31:15dabeazsetfiles: + schedtest.py

messages: + msg104234
2010-04-26 13:21:16dabeazsetfiles: + dabeaz_gil.patch

messages: + msg104233
2010-04-26 13:19:03dabeazsetfiles: - dabeaz_gil.patch
2010-04-26 11:08:46dabeazsetmessages: + msg104219
2010-04-26 09:36:27pitrousetmessages: + msg104203
2010-04-26 09:13:38pitroulinkissue8532 superseder
2010-04-26 05:23:26gregory.p.smithsetnosy: + gregory.p.smith
messages: + msg104198
2010-04-26 04:56:06dabeazsetmessages: + msg104197
2010-04-26 04:17:48dabeazsetmessages: + msg104195
2010-04-26 04:13:29dabeazsetfiles: + dabeaz_gil.patch

messages: + msg104194
2010-04-25 13:25:38ysj.raysetnosy: + ysj.ray
2010-04-20 16:43:08andrixsetnosy: + andrix
2010-04-18 00:23:08niraisetfiles: + writes.py

messages: + msg103460
2010-04-18 00:20:04niraisetfiles: + bfs.patch
2010-04-18 00:19:33niraisetfiles: - bfs.patch
2010-04-17 17:47:49niraisetmessages: + msg103414
2010-04-16 13:10:18niraisetfiles: + bfs.patch

messages: + msg103320
2010-04-16 13:08:40niraisetfiles: - bfs.patch
2010-04-15 07:27:34neologixsetmessages: + msg103186
2010-04-14 22:00:05neologixsetmessages: + msg103154
2010-04-11 21:18:59neologixsetnosy: + neologix
messages: + msg102890
2010-04-11 00:03:10eric.araujosetnosy: + eric.araujo
2010-04-08 22:08:32niraisetfiles: + bfs.patch

messages: + msg102649
2010-04-08 22:08:04niraisetfiles: - bfs.patch
2010-04-05 14:36:05thouissetnosy: + thouis
2010-03-31 22:58:10niraisetfiles: + bfs.patch

messages: + msg102043
2010-03-31 22:57:24niraisetfiles: - bfs.patch
2010-03-28 16:54:10cool-RRsetnosy: + cool-RR
2010-03-28 05:58:40niraisetfiles: + bfs.patch
2010-03-28 05:58:08niraisetfiles: - bfs.patch
2010-03-28 04:32:24niraisetfiles: + bfs.patch

messages: + msg101854
2010-03-28 04:30:24niraisetfiles: - bfs.patch
2010-03-27 13:27:06pitrousetfiles: - gilinter.patch
2010-03-27 13:27:01pitrousetfiles: - gilprio.patch
2010-03-27 13:26:54pitrousetfiles: - issuegil.txt
2010-03-27 13:26:26pitrousetfiles: + gilinter2.patch

messages: + msg101825
2010-03-27 08:46:49niraisetmessages: + msg101821
2010-03-26 22:50:11durin42setnosy: + durin42
2010-03-26 22:48:57larrysetnosy: + larry
2010-03-26 20:25:05niraisetmessages: + msg101773
2010-03-26 20:09:56pitrousetmessages: + msg101772
2010-03-26 18:24:29salgadosetnosy: + salgado
2010-03-26 16:32:23aconradsetnosy: + aconrad
2010-03-26 16:14:15Konrad.Delongsetnosy: + konryd
2010-03-26 08:43:03djcsetnosy: + djc
2010-03-26 05:23:26movementsetnosy: + movement
2010-03-25 23:52:12niraisetfiles: + bfs.patch

messages: + msg101737
2010-03-25 23:47:29niraisetfiles: - bfs.patch
2010-03-25 17:58:04niraisetmessages: + msg101724
2010-03-25 16:08:11pitrousetmessages: + msg101715
2010-03-25 16:03:52niraisetmessages: + msg101714
2010-03-25 15:52:57pitrousetmessages: + msg101711
2010-03-25 15:32:37niraisetmessages: + msg101707
2010-03-25 14:21:50pitrousetmessages: + msg101700
2010-03-25 13:55:11niraisetfiles: - linux-7946.patch
2010-03-25 13:54:52niraisetfiles: - bfs.patch
2010-03-25 13:52:06niraisetfiles: + bfs.patch

messages: + msg101697
2010-03-24 20:43:04jceasetnosy: + jcea
2010-03-24 01:13:45pitrousetmessages: + msg101613
2010-03-24 00:41:34niraisetfiles: + bfs.patch

messages: + msg101612
2010-03-16 22:36:57niraisetfiles: + linux-7946.patch

messages: + msg101197
2010-03-16 21:17:32niraisetfiles: + linux-7946.patch

messages: + msg101192
2010-03-16 02:36:33dabeazsetmessages: + msg101148
2010-03-16 02:06:50dabeazsetmessages: + msg101146
2010-03-16 01:22:08pitrousetmessages: + msg101144
2010-03-16 01:13:37dabeazsetmessages: + msg101142
2010-03-16 00:59:40pitrousetmessages: + msg101141
2010-03-15 20:39:37niraisetnosy: + nirai
2010-03-15 16:53:48dabeazsetmessages: + msg101127
2010-03-15 16:38:48dabeazsetmessages: + msg101125
2010-03-15 15:44:40floxsetmessages: + msg101120
2010-03-15 15:06:38floxsetmessages: + msg101118
2010-03-15 14:21:01dabeazsetmessages: + msg101116
2010-02-26 17:12:00kevinwatterssetnosy: + kevinwatters
2010-02-25 17:25:35brian.curtinsetnosy: + brian.curtin
messages: + msg100094
2010-02-23 23:43:36rcohensetnosy: + rcohen
2010-02-22 23:40:54pitrousetfiles: + gilinter.patch
2010-02-22 23:40:41pitrousetfiles: - gilinter.patch
2010-02-22 23:35:36pitrousetfiles: + gilinter.patch

messages: + msg99875
2010-02-22 22:10:14Jeremy.Hyltonsetnosy: + jhylton
2010-02-22 21:52:19dabeazsetfiles: + prioritygil.tar.gz

messages: + msg99846
2010-02-22 21:41:32pitrousetfiles: + gilprio2.patch

messages: + msg99840
2010-02-22 21:31:52pitrousetmessages: + msg99838
2010-02-22 19:23:26pitrousetfiles: + gilprio.patch
keywords: + patch
messages: + msg99815
2010-02-22 19:17:34pitrousetfiles: + udp-iotest.py
2010-02-22 19:17:10pitrousetfiles: + udp-ioclient.py

messages: + msg99814
2010-02-17 18:41:13pitrousetmessages: + msg99477
2010-02-17 14:23:11rh0diumsetnosy: + rh0dium
2010-02-17 11:02:23DazWorrallsetnosy: + DazWorrall
2010-02-17 06:12:27tareksetnosy: + tarek
2010-02-17 05:18:38karldsetnosy: + karld
2010-02-17 04:01:14dabeazsetmessages: + msg99461
2010-02-17 03:00:38pitrousetmessages: + msg99459
2010-02-17 02:11:18mahmoudimussetnosy: + mahmoudimus
2010-02-16 23:37:06eric.smithsetnosy: + eric.smith
2010-02-16 22:11:37coderangersetnosy: + coderanger
2010-02-16 21:56:18loewissetnosy: + loewis
2010-02-16 21:55:58carljmsetnosy: + carljm
2010-02-16 20:57:06alexsetnosy: + alex
messages: + msg99439
2010-02-16 20:55:38floxsetpriority: normal -> high
nosy: + flox
2010-02-16 20:55:10floxsetpriority: normal
assignee: pitrou

type: behavior -> performance
nosy: + pitrou
2010-02-16 20:49:05dabeazsettype: behavior
2010-02-16 20:48:43dabeazcreate