Author sitbon
Recipients sitbon
Date 2009-05-27.22:09:44
SpamBayes Score 3.33067e-16
Marked as misclassified No
Message-id <1243462195.91.0.81958628587.issue6132@psf.upfronthosting.co.za>
In-reply-to
Content
At the suggestion of others on the Python-Dev list, I'm going to outline
what I've been doing with the GIL and where I see possiblity for
improvement.

Discussion:
http://mail.python.org/pipermail/python-dev/2009-May/089746.html


Detail:

Currently, Python's global interpreter lock is implemented with the same
mechanism that provides locks for the threading module. Moving the GIL's
low-level handling to a separate mechanism would free core developers
from the feature requirements imposed upon thread locks, specifically:

1. Blocking recursion

A thread can acquire a lock when it has already been acquired, and this
action will block.

2. Release-anywhere

Any thread not owning a lock can release it (provided it is currently
owned), thus preventing permanent deadlock in #1.

These semantics are that of a semaphore with an initial value of 1,
where acquisitions decrease its value and releases increment its value.
This behavior cannot change without changing the fundamental behavior of
the threading.Lock object that many have probably come to expect.

I have been looking into ways to minimize GIL overhead in high
concurrency situations. At the moment, I'm only looking at Windows,
partly because it was an easy target and I have experience with its
synchronization objects. While I have come up with a faster alternative
for GIL handling, it breaks the expectations of thread lock objects. I
then got to wondering:

- Does anyone actually use the GIL-locking/unlocking functions in a
manner that requires the features listed above?

- If so, could it be accomplished a different way if the GIL were
expected to behave more like a recursive mutex?

I'm not going to make any assumptions about what extensions do with the
GIL, but I can hope at least they do use the API functions. If so, the
question is if we can safely apply different locking semantics to the GIL:

- Non-blocking recursion

A thread that already owns the lock can re-acquire it without blocking,
increasing a recursion count.

- Owner-only releasing

Only the thread owning the lock can release it, and only as many times
as it has acquired it.

In Windows, this is how critical sections behave. I'm not sure about the
Linux futex, but I know pthread mutexes and others like it can be made
recursive. I'm mainly interested in separating the GIL from thread locks
so we can say "do x and y with it but not z" without changing the
threading module's lock mechanism.

I decided to make a case of this when I was testing my ISAPI extension
under high load. The current mechanism is as fast as it can possibly be
for a single thread, but does not scale well at all. This has a lot to
do with its use of event synchronization objects. Critical sections, on
the other hand, scale well even when contention is unavoidable. My first
experiment involved modifying the thread lock functions in thread_nt.h
to use critical sections. Yes, it broke the behavior of threading.Lock,
but it still worked for the GIL. The performance improvement can vary,
but it's always noticeable in the presence of concurrency. I saw the
performance of my server extension nearly double when serving up Django
pages from 10-12 threads concurrently.

Obviously we can't just slap on a critical section and call it done,
unless nobody cares about changing what threading.Lock does. That's what
led me believe that a separate GIL mechanism is very important to have.

For your hacking pleasure, I've included a patch for thread_nt.h that
will facilitate GIL performance testing. It includes the critical
section implementation and a semaphore implementation (which is about as
fast as the current behavior). I've also included a _very_ basic script
for some ballpark GIL contention measurements. The numbers will be a bit
exaggerated, especially with a _Py_CheckInterval of 1, but you will see
the difference between the current method and the critical section
method right away. Especially in terms of scalability.

If you want a baseline cost for the iteration loops, change ThreadCount
to 1 and make CheckInterval something big (>1 million). There are other
costs included with the reported numbers, but going between the two lock
implementations should make it clear that any big difference you see is
due to locking/unlocking/contention. Please feel free to correct me if I
did it all wrong - I'm sure there are other/better ways to test GIL
performance.

My testing was done with Python 2.6.2 x86 on Vista x64. I can provide
measurements if it's not convenient for anyone to try out the patch.

So where to go from here?

It'd be good to know if changing the restrictions on GIL usage will
break anything. Also, whether the same performance increase can be had
for Linux/Mac/others.

I'll continue to look into giving the GIL its own locking API, and
hopefully my efforts are not for naught.
History
Date User Action Args
2009-05-27 22:09:56sitbonsetrecipients: + sitbon
2009-05-27 22:09:55sitbonsetmessageid: <1243462195.91.0.81958628587.issue6132@psf.upfronthosting.co.za>
2009-05-27 22:09:54sitbonlinkissue6132 messages
2009-05-27 22:09:49sitboncreate