Author neologix
Recipients bobbyi, gregory.p.smith, neologix, nirai, pitrou, sdaoden, vstinner
Date 2011-05-13.11:24:32
SpamBayes Score 0.0
Marked as misclassified No
Message-id <BANLkTinXqo5D6vm51AVoPEwneRVPoxmfKw@mail.gmail.com>
In-reply-to <1305230511.18.0.0895360934602.issue6721@psf.upfronthosting.co.za>
Content
> Hi,
>

Hello Nir,

> Option (2) makes sense but is probably not always applicable.
> Option (1) depends on being able to acquire locks in locking order, but how
> can we determine correct locking order across libraries?
>

There are indeed a couple problems with 1:
1) actually, releasing the mutex/semaphore from the child is not
guaranteed to be safe, see this comment from glibc's malloc:
/* In NPTL, unlocking a mutex in the child process after a
   fork() is currently unsafe, whereas re-initializing it is safe and
   does not leak resources.  Therefore, a special atfork handler is
   installed for the child. */
We could just destroy/reinit them, though.

2) acquiring locks just before fork is probably one of the best way to
deadlock (acquiring a lock we already hold, or acquiring a lock needed
by another thread before it releases its own lock). Apart from adding
dealock avoidance/recovery mechanisms - which would be far from
trivial - I don't see how we could solve this, given that each library
can use its own locks, not counting the user-created ones

3) there's another special lock we must take into account, the GIL:
contrarily to a typical C program, we can't have the thread forking
blindly try to acquire all locks just before fork, because since we
hold the GIL, other threads won't be able to proceed (unless of course
they're in a section where they don't run without the GIL held).

So, we would have to:
- release the GIL
- acquire all locks in the correct order
- re-acquire the GIL
- fork
- reinit all locks after fork

I think this is going to be very complicated.

4) Python locks differ from usual mutexes/semaphores in that they can
be held for quite some time (for example while performing I/O). Thus,
acquiring all the locks could take a long time, and users might get
irritated if fork takes 2 seconds to complete.

5) Finally, there's a fundamental problem with this approach, because
Python locks can be released by a thread other than the one that owns
it.
Imagine this happens:

T1                         T2
                          lock.acquire()
                          (do something without releasing lock)
fork()
lock.release()

This is perfectly valid with the current lock implementation (for
example, it can be used to implement a rendez-vous point so that T2
doesn't start processing before T1 forked worker processes, or
whatever).
But if T1 tries to acquire lock (held by T2) before fork, then it will
deadlock, since it will never be release by T2.

For all those reasons, I don't think that this approach is reasonable,
but I could be wrong :-)

> Initializing locks in child after fork without acquiring them before the
> fork may result in corrupted program state and so is probably not a good
> idea.

Yes, but in practise, I think that this shouldn't be too much of a
problem. Also note that you can very well have the same type of
problem with sections not protected explicitely by locks: for example,
if you have a thread working exclusively on an object (maybe part of a
threadpool), a fork can very well happen while the object is in an
inconsistent state. Acquiring locks before fork won't help that.
But I think this should eventually be addressed, maybe by specific
atfork handlers.

> On a positive note, if I understand correctly, Python signal handler
> functions are actually run in the regular interpreter loop (as pending
> calls) after the signal has been handled and so os.fork() atfork handlers
> will not be restricted to async-signal-safe operations (since a Python fork
> is never done in a signal handler).

That's correct.

In short, I think that we could first try to avoid common deadlocks by
just resetting locks in the child process. This is not panacea, but
this should solve the vast majority of deadlocks, and would open the
door to potential future refinements using atfork-like handlers.

Attached is a first draft for a such patch (with tests).
Synopsis:
- when a PyThread_type_lock is created, it's added to a linked-list,
when it's deleted, it's removed from the linked list
- PyOS_AfterFork() calls PyThread_ReinitLocks() which calls
PyThread_reinit_lock() for each lock in the linked list
- PyThread_reinit_lock() does the right thing (i.e. sem_destroy/init
for USE_SEMAPHORES and pthread_(mutex|cond)_destroy/init for emulated
semaphores).

Notes:
- since it's only applicable to POSIX (since other Unix thread
implementations will be dropped), I've only defined a
PyThread_ReinitLocks inside Python/thread_pthread.h, so it won't build
on other platforms. How should I proceed: like PyThread_ReInitTLS(),
add a stub function to all Python/thread_xxx.h, or guard the call to
PyThread_ReinitLocks() with #ifdef _POSIX_THREADS ?
- I'm not sure of how to handle sem_init/etc failures in the reinit
code: for now I just ignore this possibility, like what's done for the
import lock reset
- insertions/removals from the linked list are not protected from
concurrent access because I assume that locks are created/deleted with
the GIL held: is that a reasonable assumption, or should I add a mutex
to protect those accesses?

This fixes common deadlocks with threading.Lock, and
PyThread_type_lock (used for example by I/O code).
Files
File name Uploaded
reinit_locks.diff neologix, 2011-05-13.11:24:31
History
Date User Action Args
2011-05-13 11:24:36neologixsetrecipients: + neologix, gregory.p.smith, pitrou, vstinner, nirai, bobbyi, sdaoden
2011-05-13 11:24:34neologixlinkissue6721 messages
2011-05-13 11:24:32neologixcreate