
Locks in the standard library should be sanitized on fork #50970

Closed
gpshead opened this issue Aug 17, 2009 · 134 comments
Labels: 3.7 (EOL) [end of life], stdlib [Python modules in the Lib dir], type-feature [A feature request or enhancement]

Comments

gpshead (Member) commented Aug 17, 2009

BPO 6721
Nosy @rhettinger, @gpshead, @vsajip, @jcea, @nirs, @pitrou, @vstinner, @applio, @cagney, @Birne94, @ochedru, @kevans91, @jessefarnham, @rojer, @koubaa
PRs
  • bpo-6721: Sanitize logging locks while forking #4071
  • [3.7] bpo-6721: Hold logging locks across fork() (GH-4071) #9291
  • bpo-1635741 port _curses_panel to multi-phase init (PEP 489) #21986
  • Update all Python Cookbook links after migration to GitHub #22205
  • bpo-40423: Optimization: use close_range(2) if available #22651
  Files
  • lock_fork_thread_deadlock_demo.py
  • forklocktests.patch
  • reinit_locks.diff: patch adding locks reinitialization upon fork
  • emit_warning_on_fork.patch
  • atfork.patch
  • reinit_locks_2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2009-08-17.23:06:17.321>
    labels = ['3.7', 'type-feature', 'library']
    title = 'Locks in the standard library should be sanitized on fork'
    updated_at = <Date 2020-10-11.20:42:06.563>
    user = 'https://github.com/gpshead'

    bugs.python.org fields:

    activity = <Date 2020-10-11.20:42:06.563>
    actor = 'kevans'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2009-08-17.23:06:17.321>
    creator = 'gregory.p.smith'
    dependencies = []
    files = ['14740', '21874', '22005', '22525', '24303', '25776']
    hgrepos = []
    issue_num = 6721
    keywords = ['patch']
    message_count = 133.0
    messages = ['91674', '91936', '92766', '94102', '94115', '94133', '94135', '128282', '128307', '128311', '128316', '128369', '135012', '135067', '135069', '135079', '135083', '135095', '135096', '135143', '135157', '135173', '135543', '135857', '135866', '135897', '135899', '135948', '135965', '135984', '136003', '136039', '136045', '136047', '136120', '136147', '139084', '139245', '139470', '139474', '139480', '139485', '139488', '139489', '139509', '139511', '139521', '139522', '139584', '139599', '139608', '139800', '139808', '139850', '139852', '139858', '139869', '139897', '139929', '140215', '140402', '140550', '140658', '140659', '140668', '140689', '140690', '140691', '141286', '143174', '143274', '143279', '151168', '151266', '151267', '151845', '151846', '151853', '161019', '161029', '161389', '161405', '161470', '161953', '162019', '162031', '162034', '162036', '162038', '162039', '162040', '162041', '162053', '162054', '162063', '162113', '162114', '162115', '162117', '162120', '162137', '162160', '270015', '270017', '270018', '270019', '270020', '270021', '270022', '270023', '270028', '289716', '294726', '294834', '304714', '304716', '304722', '304723', '314983', '325326', '327267', '329474', '339369', '339371', '339393', '339418', '339454', '339458', '339473', '365169', '367528', '367702', '368882']
    nosy_count = 29.0
    nosy_names = ['rhettinger', 'gregory.p.smith', 'vinay.sajip', 'jcea', 'nirs', 'pitrou', 'vstinner', 'nirai', 'forest_atq', 'ionelmc', 'bobbyi', 'neologix', 'Giovanni.Bajo', 'sdaoden', 'tshepang', 'sbt', 'lesha', 'dan.oreilly', 'davin', 'Connor.Wolf', 'Winterflower', 'cagney', 'Birne94', 'ochedru', 'kevans', 'jesse.farnham', 'hugh', 'rojer', 'koubaa']
    pr_nums = ['4071', '9291', '21986', '22205', '22651']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue6721'
    versions = ['Python 3.7']

    gpshead (Member Author) commented Aug 17, 2009

    The Python logging module uses a lock to surround many of its operations.
    This causes deadlocks in programs that use logging, fork, and threading
    simultaneously.

    1. Spawn one or more threads in your program.
    2. Have at least one of those threads make logging calls that will be
      emitted.
    3. Have your main thread or another thread use os.fork() to run some
      Python code in a child process.
    4. If the fork happened while one of your threads was within the
      logging.Handler.handle() critical section (or anywhere else where
      handler.lock is acquired), your child process will deadlock as soon as
      it tries to log anything: it inherited a held lock.

    The deadlock is more likely to happen on a highly loaded system which
    tends to widen the deadlock opportunity window due to context switching.

    A demo of the problem simplified into one file is attached.
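
    For illustration, a minimal sketch of such a demo, following the steps
    above (this is not the attached lock_fork_thread_deadlock_demo.py, just a
    hedged reconstruction; exact timing values are arbitrary):

        import logging
        import os
        import sys
        import threading
        import time

        logging.basicConfig(stream=sys.stderr, level=logging.INFO)
        log = logging.getLogger("demo")

        def spam():
            # A background thread that logs constantly, so a fork is likely
            # to happen while Handler.handle() holds the handler's lock.
            while True:
                log.info("spam")

        threading.Thread(target=spam, daemon=True).start()
        time.sleep(0.5)

        pid = os.fork()
        if pid == 0:
            # The child inherited the handler lock, possibly in the "held"
            # state; this call can then block forever.
            log.info("hello from the child")
            os._exit(0)
        os.waitpid(pid, 0)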

    The Python standard library should not be the cause of these deadlocks.
    We need a way for all standard library locks to be cleaned up when
    forking, by doing one of the following:

    A) acquire all locks before forking and release them immediately after, or
    B) forcibly release all standard library locks after forking in the
    child process.

    Code was added to call some cleanups after forking in
    http://bugs.python.org/issue874900 but there are more things that also
    need this same sort of cleanup (logging for example).

    Rather than having to manually add after-fork code hooks into every file
    in the standard library that uses locks, a more general solution to
    track and manage locks across fork would be a good idea.

    @gpshead gpshead self-assigned this Aug 17, 2009
    gpshead (Member Author) commented Aug 24, 2009

    I've started a project to patch up this and similar messes for Python
    2.4 and later here:

    http://code.google.com/p/python-atfork/

    I'd like to take ideas or implementations from that when possible for
    future use in the python standard library.

    gpshead (Member Author) commented Sep 17, 2009

    bpo-6923 has been opened to provide a C API for an atfork mechanism for
    use by extension modules.

    @gpshead gpshead added the stdlib Python modules in the Lib dir label Sep 17, 2009
    pitrou (Member) commented Oct 15, 2009

    Rather than having a kind of global module registry, locks could keep
    track of what was the last PID, and reinitialize themselves if it changed.
    This is assuming getpid() is fast :-)

    gpshead (Member Author) commented Oct 16, 2009

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Rather than having a kind of global module registry, locks could keep
    track of what was the last PID, and reinitialize themselves if it changed.
    This is assuming getpid() is fast :-)

    Locks can't blindly release themselves because they find themselves
    running in another process.

    If anything, if a lock is held and finds itself running in a new
    process, any attempt to use the lock should raise an exception so that
    the bug is noticed.

    I'm not sure a PID check is good enough. Old Linux using LinuxThreads
    had a different PID for every thread; current Linux with NPTL is more
    like other OSes, with the same PID for all threads.

    pitrou (Member) commented Oct 16, 2009

    I was suggesting "reinitialize", rather than "release". That is, create
    a new lock (mutex, semaphore, etc.) and let the old one die (or occupy
    some tiny bit of memory).

    gpshead (Member Author) commented Oct 16, 2009

    No need for that. The problem is that they're held by a thread that
    does not exist in the newly forked child process, so they will never be
    released in the new process.

    Example: if you fork while another thread is in the middle of logging
    something and then try to log something yourself in the child, your
    child process will deadlock on the logging module's lock.

    Locks are not shared between processes, so reinitializing them with a
    new object wouldn't do anything.

    @bitdancer bitdancer added the type-bug An unexpected behavior, bug, or error label Dec 14, 2010
    neologix (mannequin) commented Feb 10, 2011

    I'm not sure that releasing the mutex is enough; it can still lead to a segfault, as is probably the case in this issue:
    http://bugs.python.org/issue11148

    Quoting the pthread_atfork man page:

    To understand the purpose of pthread_atfork, recall that fork duplicates the whole memory space, including mutexes in their current locking state, but only the calling thread: other threads are not running in the child process. The mutexes are not usable after the fork and must be initialized with pthread_mutex_init in the child process. This is a limitation of the current implementation and might or might not be present in future versions.

    To avoid this, install handlers with pthread_atfork as follows: have the prepare handler lock the mutexes (in locking order), and the parent handler unlock the mutexes. The child handler should reset the mutexes using pthread_mutex_init, as well as any other synchronization objects such as condition variables.

    Locking the global mutexes before the fork ensures that all other threads are locked out of the critical regions of code protected by those mutexes. Thus when fork takes a snapshot of the parent's address space, that snapshot will copy valid, stable data. Resetting the synchronization objects in the child process will ensure they are properly cleansed of any artifacts from the threading subsystem of the parent process. For example, a mutex may inherit a wait queue of threads waiting for the lock; this wait queue makes no sense in the child process. Initializing the mutex takes care of this.

    pthread_atfork might be worth looking into
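
    For reference, Python 3.7 later grew os.register_at_fork(), which exposes
    the same prepare/parent/child hook structure at the Python level. A
    minimal sketch of the idiom quoted above, applied to a single lock (note
    that later comments in this thread argue that releasing, as opposed to
    reinitializing, in the child is not safe on every platform):

        import os
        import threading

        some_lock = threading.Lock()

        # prepare: take the lock before fork; parent and child: release it
        # again once the fork has completed.
        os.register_at_fork(
            before=some_lock.acquire,
            after_in_parent=some_lock.release,
            after_in_child=some_lock.release,
        )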

    gpshead (Member Author) commented Feb 10, 2011

    FWIW, http://bugs.python.org/issue6643 recently fixed an issue where a mutex was being closed instead of reinitialized after a fork. More are likely needed.

    Are you suggesting to use pthread_atfork to call pthread_mutex_init on all mutexes created by Python in the child after a fork? I'll have to ponder that some more. Given the mutexes are all useless post fork it does not strike me as a bad idea.

    pitrou (Member) commented Feb 10, 2011

    Are you suggesting to use pthread_atfork to call pthread_mutex_init on all mutexes created by Python in the child after a fork? I'll have to ponder that some more. Given the mutexes are all useless post fork it does not strike me as a bad idea.

    I don't really understand. It's quite similar to the idea you shot down in msg94135. Or am I missing something?

    gpshead (Member Author) commented Feb 10, 2011

    Yeah, I'm trying to figure out what I was thinking then or if I was just plain wrong. :)

    I was clearly wrong about a release being done in the child being the right thing to do (bpo-6643 proved that, the state held by a lock is not usable to another process on all platforms such that release even works).

    Part of it looks like I wanted a way to detect it was happening as any lock that is held during a fork indicates a _potential_ bug (the lock wasn't registered anywhere to be released before the fork) but not everything needs to care about that.

    neologix (mannequin) commented Feb 11, 2011

    I was clearly wrong about a release being done in the child being the right thing to do (bpo-6643 proved that, the state held by a lock is not usable to another process on all platforms such that release even works).

    Yeah, apparently OS-X is one of them, the reporter in bpo-11148 is
    experiencing segfaults under OS-X.

    Are you suggesting to use pthread_atfork to call pthread_mutex_init on all mutexes created by Python in the child after a fork? I'll have to ponder that some more. Given the mutexes are all useless post fork it does not strike me as a bad idea.

    Yes, that's what I was thinking. Instead of scattering the
    lock-reclaiming code all over the place, try to use a more standard
    API specifically designed with that in mind.
    Note the base issue is that we're authorizing things which are
    forbidden : in a multi-threaded process, only sync-safe calls are
    authorized between fork and exec.


    pitrou (Member) commented May 3, 2011

    I encountered this issue while debugging some multiprocessing code; fork() would be called from one thread while sys.stdout was in use in another thread (simply because of a couple of debugging statements). As a result the IO lock would be already "taken" in the child process and any operation on sys.stdout would deadlock.

    This is definitely something that can happen more easily than I thought.
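
    A hypothetical sketch of that scenario (one thread writes to sys.stdout
    while another forks; whether the child actually hangs depends on timing
    and platform, so treat this as an illustration only):

        import os
        import sys
        import threading
        import time

        def chatter():
            # Debugging-style output from a background thread keeps the
            # buffered writer's internal lock busy.
            while True:
                print("debug output", file=sys.stdout)

        threading.Thread(target=chatter, daemon=True).start()
        time.sleep(0.5)

        pid = os.fork()
        if pid == 0:
            # The child may have inherited the IO lock in the "taken" state,
            # so this write can block forever.
            print("child says hi")
            sys.stdout.flush()
            os._exit(0)
        os.waitpid(pid, 0)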

    pitrou (Member) commented May 3, 2011

    Here is a patch with tests for the issue (some of which fail of course).
    Do we agree that these tests are right?

    gpshead (Member Author) commented May 3, 2011

    Those tests make sense to me.

    neologix (mannequin) commented May 3, 2011

    # A lock taken from the current thread should stay taken in the
    # child process.

    Note that I'm not sure of how to implement this.
    After a fork, even releasing the lock can be unsafe; it must be re-initialized. See the following comment in glibc's malloc implementation:
    /* In NPTL, unlocking a mutex in the child process after a
    fork() is currently unsafe, whereas re-initializing it is safe and
    does not leak resources. Therefore, a special atfork handler is
    installed for the child. */

    Note that this means that even the current code allocating new locks after fork (in Lib/threading.py, _after_fork and _reset_internal_locks) is unsafe, because the old locks will be deallocated, and the lock deallocation tries to acquire and release the lock before destroying it (in issue bpo-11148 the OP experienced a segfault on OS-X when locking a mutex, but I'm not sure of the exact context).

    Also, this would imply keeping track of the thread currently owning the lock, and doesn't match the typical pthread_atfork idiom (acquire locks just before fork, release just after in parent and child, or just reinit them in the child process)

    Finally, IMHO, forking while holding a lock and expecting it to be usable after fork doesn't make much sense, since a lock is acquired by a thread, and that thread doesn't exist in the child process. It's explicitly described as "undefined" by POSIX, see http://pubs.opengroup.org/onlinepubs/007908799/xsh/sem_init.html:
    """
    The use of the semaphore by threads other than those created in the same process is undefined.
    """

    So I'm not sure whether it's feasible/wise to provide such a guarantee.

    pitrou (Member) commented May 3, 2011

    Also, this would imply keeping track of the thread currently owning
    the lock,

    Yes, we would need to keep track of the thread id and process id inside
    the lock. We also need a global variable of the main thread id after
    fork, and a per-lock "taken" flag.

    Synopsis:

        def _reinit_if_needed(self):
            # Call this before each acquire() or release()
            if self.pid != getpid():
                sem_init(self.sem, 0, 1)
                if self.taken:
                    if self.tid == main_thread_id_after_fork:
                        # Lock was taken in forked thread, re-take it
                        sem_wait(self.sem)
                    else:
                        # It's now released
                        self.taken = False
                self.pid = getpid()
                self.tid = current_thread_id()

    and doesn't match the typical pthread_atfork idiom (acquire locks just
    before fork, release just after in parent and child, or just reinit
    them in the child process)

    Well, I fail to understand how that idiom can help us. We're not a
    self-contained application, we're a whole programming language.
    Calling fork() only when no lock is held is unworkable (for example, we
    use locks around buffered I/O objects).
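
    For reference, a pure-Python sketch of the PID-check idea in the synopsis
    above. It is simplified: it drops the "taken"/thread-id bookkeeping and
    just replaces the underlying lock whenever the PID changes. The class and
    method names are illustrative only:

        import os
        import threading

        class ForkDetectingLock:
            def __init__(self):
                self._lock = threading.Lock()
                self._pid = os.getpid()

            def _reinit_if_needed(self):
                # Called before each acquire() or release(): if the PID
                # changed, we are in a freshly forked child and the inherited
                # lock may be stuck in the "held" state, so start over.
                if self._pid != os.getpid():
                    self._lock = threading.Lock()
                    self._pid = os.getpid()

            def acquire(self, blocking=True, timeout=-1):
                self._reinit_if_needed()
                return self._lock.acquire(blocking, timeout)

            def release(self):
                self._reinit_if_needed()
                self._lock.release()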

    neologix (mannequin) commented May 4, 2011

    Yes, we would need to keep track of the thread id and process id inside
    the lock. We also need a global variable of the main thread id after
    fork, and a per-lock "taken" flag.

    Synopsis:

       def _reinit_if_needed(self):
           # Call this before each acquire() or release()
           if self.pid != getpid():
               sem_init(self.sem, 0, 1)
               if self.taken:
                   if self.tid == main_thread_id_after_fork:
                       # Lock was taken in forked thread, re-take it
                       sem_wait(self.sem)
                   else:
                       # It's now released
                       self.taken = False
               self.pid = getpid()
               self.tid = current_thread_id()

    A couple remarks:

    • with linuxthreads, different threads within the same process have
      the same PID - it may be true for other implementations - so this
      would lead to spurious reinitializations
    • what's current_thread_id ? If it's thread_get_ident (pthread_self),
      since TID is not guaranteed to be inherited across fork, this won't
      work
    • calling getpid at every acquire/release is expensive, even though
      it's a trivial syscall (it'll have to be measured though)
    • imagine the following happens:

    P1                                 P2

    lock.acquire()
    fork()       ------------------->  start_new_thread T2
                                       T1               T2
                                                         lock.acquire()

    The acquisition of the lock by T2 will cause the lock's reinitialization:
    what happens to the lock wait queue? Who owns the lock?
    That's why I don't think we can delay the reinitialization of locks, but
    I could be wrong.

    Well, I fail to understand how that idiom can help us. We're not a
    self-contained application, we're a whole programming language.
    Calling fork() only when no lock is held is unworkable (for example, we
    use locks around buffered I/O objects).

    Yes, but in that case, you don't have to reacquire the locks after fork.
    In the deadlock you experienced above, the thread that forked wasn't
    the one in the I/O code, so the corresponding lock can be
    re-initialized anyway, since the thread in the I/O code at that time
    won't exist after fork.
    And it's true with every lock in the library code: they're only held
    in short critical sections (typically acquired when entering a
    function and released when leaving), and since it's not the threads
    inside those libraries that fork, all those locks can simply be
    reinitialized on fork, without having to reacquire them.

    neologix (mannequin) commented May 4, 2011

    Oops, for linuxthreads, you should of course read "different PIDs", not "same PID".

    pitrou (Member) commented May 4, 2011

    • what's current_thread_id ? If it's thread_get_ident (pthread_self),
      since TID is not guaranteed to be inherited across fork, this won't
      work

    Ouch, then the approach I'm proposing is probably doomed.

    And it's true with every lock in the library code: they're only held
    in short critical sections (typically acquired when entering a
    function and released when leaving), and since it's not the threads
    inside those libraries that fork, all those locks can simply be
    reinitialized on fork, without having to reacquire them.

    Well, this means indeed that *some* locks can be handled, but not all of
    them and not automatically, right?
    Also, how would you propose they be dealt with in practice? How do they
    get registered, and how does the reinitialization happen?

    (do note that library code can call arbitrary third-party code, by the
    way: for example through encodings in the text I/O layer)

    neologix (mannequin) commented May 4, 2011

    > - what's current_thread_id ? If it's thread_get_ident (pthread_self),
    > since TID is not guaranteed to be inherited across fork, this won't
    > work

    Ouch, then the approach I'm proposing is probably doomed.

    Well, it works on Linux with NPTL, but I'm not sure at all it holds
    for other implementations (pthread_t is only meaningful within the
    same process).
    But I'm not sure it's really the "killer" point: PID with linuxthreads
    and lock being acquired by a second thread before the main thread
    releases it in the child process also look like serious problems.

    Well, this means indeed that *some* locks can be handled, but not all of
    them and not automatically, right?
    Also, how would you propose they be dealt with in practice? How do they
    get registered, and how does the reinitialization happen?

    When a lock object is allocated in Modules/threadmodule.c
    (PyThread_allocate_lock/newlockobject), add the underlying lock
    (self->lock_lock) to a linked list (since it's called with the GIL
    held, we don't need to protect the linked list from concurrent
    access). Each thread implementation (thread_pthread.h, thread_nt.h)
    would provide a new PyThread_reinit_lock function that would do the
    right thing (pthread_mutex_destroy/init, sem_destroy/init, etc).
    Modules/threadmodule.c would provide a new PyThread_ReInitLocks that
    would walk through the linked list and call PyThread_reinit_lock for
    each lock.
    PyOS_AfterFork would call this PyThread_ReInitLocks right after fork.
    This would have the advantage of being consistent with what's already
    done to reinit the TLS key and the import lock. So, we guarantee to be
    in a consistent and usable state when PyOS_AfterFork returns. Also,
    it's somewhat simpler because we're sure that at that point only one
    thread is running (once again, no need to protect the linked-list
    walk).
    I don't think that the performance impact would be noticeable (I know
    it's O(N) where N is the number of locks), and contrary to the
    automatic approach, this wouldn't penalize every acquire/release.
    Of course, this would solve the problem of threading's module locks,
    so PyEval_ReInitThreads could be removed, along with threading.py's
    _after_fork and _reset_internal_locks.
    In short, this would reset every lock held so that they're usable in
    the child process, even locks allocated e.g. from
    Modules/_io/bufferedio.c.
    But this wouldn't allow a lock's state to be inherited across fork for
    the main thread (but like I said, I don't think that this makes much
    sense anyway, and to my knowledge no implementation makes such a
    guarantee - and definitely not POSIX).
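
    A rough Python-level analogue of that C-level design, for illustration
    only: track every lock handed out and reinitialize each one in the child
    after fork. It leans on the private, CPython-only lock._at_fork_reinit()
    method, which only appeared much later (bpo-40089, mentioned further down
    in this thread); the helper names here are made up:

        import os
        import threading

        _tracked_locks = []   # the C proposal uses a linked list and removes
                              # entries when a lock is deallocated

        def new_tracked_lock():
            lock = threading.Lock()
            _tracked_locks.append(lock)
            return lock

        def _reinit_tracked_locks():
            # Runs in the child with only one thread alive, so no extra
            # locking is needed around the registry walk.
            for lock in _tracked_locks:
                lock._at_fork_reinit()   # CPython >= 3.9, private API

        os.register_at_fork(after_in_child=_reinit_tracked_locks)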

    neologix (mannequin) commented May 5, 2011

    Please disregard my comment on PyEval_ReInitThreads and _after_fork:
    it will of course still be necessary, because it does much more than
    just reinitializing locks (e.g. stop threads).
    Also, note that both approaches don't handle synchronization
    primitives other than bare Lock and RLock. For example, Condition and
    Event used in the threading module wouldn't be reset automatically:
    that's maybe something that could be handled by Gregory's atfork
    mechanism.

    pitrou (Member) commented May 8, 2011

    Thanks for the explanations. This sounds like an interesting path.

    Each thread implementation (thread_pthread.h, thread_nt.h)
    would provide a new PyThread_reinit_lock function that would do the
    right thing (pthread_mutex_destroy/init, sem_destroy/init, etc).
    Modules/threadmodule.c would provide a new PyThread_ReInitLocks that
    would walk through the linked list and call PyThread_reinit_lock for
    each lock.

    Actually, I think the issue is POSIX-specific: Windows has no fork(),
    and we don't care about other platforms anymore (they are, are being, or
    will be soon deprecated).
    It means only the POSIX implementation needs to register its locks in a
    linked list.

    But this wouldn't allow a lock's state to be inherited across fork for
    the main thread (but like I said, I don't think that this makes much
    sense anyway, and to my knowledge no implementation makes such a
    guarantee - and definitely not POSIX).

    Well, the big difference between Python locks and POSIX mutexes is that
    Python locks can be released from another thread. They're a kind of
    trivial semaphore really, and this makes them usable for other purpose
    than mutual exclusion (you can e.g. use a lock as a simple event by
    blocking on a second acquire() until another thread calls release()).

    But even though we might not be "fixing" arbitrary Python code
    automatically, fixing the interpreter's internal locks (especially the
    IO locks) would be great already.

    (we could also imagine that the creator of the lock decides whether it
    should get reinitialized after fork)
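
    A tiny illustration of that last point: a Lock used as a one-shot event,
    released by a different thread than the one that acquired it (which a
    POSIX mutex would not allow):

        import threading

        gate = threading.Lock()
        gate.acquire()                    # main thread takes the lock

        def worker():
            print("working...")
            gate.release()                # another thread releases it

        threading.Thread(target=worker).start()
        gate.acquire()                    # blocks until worker() releases
        print("worker finished")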

    nirai (mannequin) commented May 12, 2011

    Hi,

    There seem to be two alternatives for atfork handlers:

    1. acquire locks during prepare phase and unlock them in parent and child after fork.
    2. reset library to some consistent state in child after fork.

    http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html

    Option (2) makes sense but is probably not always applicable.
    Option (1) depends on being able to acquire locks in locking order, but how can we determine correct locking order across libraries?

    Initializing locks in child after fork without acquiring them before the fork may result in corrupted program state and so is probably not a good idea.

    On a positive note, if I understand correctly, Python signal handler functions are actually run in the regular interpreter loop (as pending calls) after the signal has been handled and so os.fork() atfork handlers will not be restricted to async-signal-safe operations (since a Python fork is never done in a signal handler).

    http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html
    http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html
    "It is therefore undefined for the fork handlers to execute functions that are not async-signal-safe when fork() is called from a signal handler."

    Opinion by Butenhof who was involved in the standardization effort of POSIX threads:
    http://groups.google.com/group/comp.programming.threads/msg/3a43122820983fde

    ...so how can we establish correct (cross library) locking order during prepare stage?

    Nir

    sdaoden (mannequin) commented May 12, 2011

    @nir Aides: *thanks* for this link:
    http://groups.google.com/group/comp.programming.threads/msg/3a43122820983fde
    You made my day!

    pitrou (Member) commented May 13, 2011

    ...so how can we establish correct (cross library) locking order
    during prepare stage?

    That sounds like a lost battle, if it requires the libraries'
    cooperation. I think resetting locks is the best we can do. It might not
    work ok in all cases, but if it can handle simple cases (such as I/O and
    logging locks), it is already very good.

    neologix (mannequin) commented May 13, 2011

    Hi,

    Hello Nir,

    Option (2) makes sense but is probably not always applicable.
    Option (1) depends on being able to acquire locks in locking order, but how
    can we determine correct locking order across libraries?

    There are indeed a couple problems with 1:

    1. actually, releasing the mutex/semaphore from the child is not
      guaranteed to be safe, see this comment from glibc's malloc:
      /* In NPTL, unlocking a mutex in the child process after a
      fork() is currently unsafe, whereas re-initializing it is safe and
      does not leak resources. Therefore, a special atfork handler is
      installed for the child. */
      We could just destroy/reinit them, though.

    2. acquiring locks just before fork is probably one of the best ways to
      deadlock (acquiring a lock we already hold, or acquiring a lock needed
      by another thread before it releases its own lock). Apart from adding
      deadlock avoidance/recovery mechanisms - which would be far from
      trivial - I don't see how we could solve this, given that each library
      can use its own locks, not counting the user-created ones

    3. there's another special lock we must take into account, the GIL:
      contrarily to a typical C program, we can't have the thread forking
      blindly try to acquire all locks just before fork, because since we
      hold the GIL, other threads won't be able to proceed (unless of course
      they're in a section where they don't run without the GIL held).

    So, we would have to:

    • release the GIL
    • acquire all locks in the correct order
    • re-acquire the GIL
    • fork
    • reinit all locks after fork

    I think this is going to be very complicated.

    1. Python locks differ from usual mutexes/semaphores in that they can
      be held for quite some time (for example while performing I/O). Thus,
      acquiring all the locks could take a long time, and users might get
      irritated if fork takes 2 seconds to complete.

    2. Finally, there's a fundamental problem with this approach, because
      Python locks can be released by a thread other than the one that owns
      it.
      Imagine this happens:

    T1                                      T2
                                            lock.acquire()
                                            (do something without releasing lock)
    fork()
    lock.release()

    This is perfectly valid with the current lock implementation (for
    example, it can be used to implement a rendez-vous point so that T2
    doesn't start processing before T1 forked worker processes, or
    whatever).
    But if T1 tries to acquire the lock (held by T2) before fork, then it will
    deadlock, since it will never be released by T2.

    For all those reasons, I don't think that this approach is reasonable,
    but I could be wrong :-)

    Initializing locks in child after fork without acquiring them before the
    fork may result in corrupted program state and so is probably not a good
    idea.

    Yes, but in practice, I think that this shouldn't be too much of a
    problem. Also note that you can very well have the same type of
    problem with sections not protected explicitly by locks: for example,
    if you have a thread working exclusively on an object (maybe part of a
    threadpool), a fork can very well happen while the object is in an
    inconsistent state. Acquiring locks before fork won't help that.
    But I think this should eventually be addressed, maybe by specific
    atfork handlers.

    On a positive note, if I understand correctly, Python signal handler
    functions are actually run in the regular interpreter loop (as pending
    calls) after the signal has been handled and so os.fork() atfork handlers
    will not be restricted to async-signal-safe operations (since a Python fork
    is never done in a signal handler).

    That's correct.

    In short, I think that we could first try to avoid common deadlocks by
    just resetting locks in the child process. This is not panacea, but
    this should solve the vast majority of deadlocks, and would open the
    door to potential future refinements using atfork-like handlers.

    Attached is a first draft for such a patch (with tests).
    Synopsis:

    • when a PyThread_type_lock is created, it's added to a linked-list,
      when it's deleted, it's removed from the linked list
    • PyOS_AfterFork() calls PyThread_ReinitLocks() which calls
      PyThread_reinit_lock() for each lock in the linked list
    • PyThread_reinit_lock() does the right thing (i.e. sem_destroy/init
      for USE_SEMAPHORES and pthread_(mutex|cond)_destroy/init for emulated
      semaphores).

    Notes:

    • since it's only applicable to POSIX (since other Unix thread
      implementations will be dropped), I've only defined a
      PyThread_ReinitLocks inside Python/thread_pthread.h, so it won't build
      on other platforms. How should I proceed: like PyThread_ReInitTLS(),
      add a stub function to all Python/thread_xxx.h, or guard the call to
      PyThread_ReinitLocks() with #ifdef _POSIX_THREADS ?
    • I'm not sure of how to handle sem_init/etc failures in the reinit
      code: for now I just ignore this possibility, like what's done for the
      import lock reset
    • insertions/removals from the linked list are not protected from
      concurrent access because I assume that locks are created/deleted with
      the GIL held: is that a reasonable assumption, or should I add a mutex
      to protect those accesses?

    This fixes common deadlocks with threading.Lock, and
    PyThread_type_lock (used for example by I/O code).

    @gpshead gpshead added the 3.7 (EOL) end of life label May 30, 2017
    Birne94 (mannequin) commented May 31, 2017

    While having to deal with this bug for a while I have written a small library using pthread_atfork: https://github.com/Birne94/python-atfork-lock-release

    It allows registering atfork-hooks (similar to the ones available by now) and frees the stdout/stderr as well as manually provided io locks. I guess it uses some hacky ways to get the job done, but resolved the issue for me and has been working without problems for some weeks now.

    pitrou (Member) commented Oct 21, 2017

    I think we should somehow move forward on this, at least for logging locks which can be quite an annoyance.

    There are two possible approaches:

    • either a generic mechanism as posted by sbt in reinit_locks_2.diff
    • or a logging-specific fix using os.register_at_fork()

    What do you think?

    pitrou (Member) commented Oct 21, 2017

    Oh, I forgot that IO buffered objects also have a lock. So we would have to special-case those as well, unless we take the generic approach...

    A problem with the generic approach is that it would leave higher-level synchronization objects such as RLock, Event etc. in an inconsistent state. Not to mention the case where the lock is taken by the thread calling fork()...

    gpshead (Member Author) commented Oct 21, 2017

    logging is pretty easy to deal with so I created a PR.

    bufferedio.c is a little more work, as we either need to use the posixmodule.c os.register_at_fork API or expose it as an internal C API so that we can call it to add acquires and releases around the buffer's self->lock member when non-NULL. Either way, that needs to be written safely so that it doesn't crash if fork happens after a buffered io struct is freed (unregister the at-fork handlers when freeing it? messy).

    pitrou (Member) commented Oct 21, 2017

    Actually, we already have a doubly-linked list of buffered IO objects
    (for another purpose), so we can reuse that and register a single set of
    global callbacks.

    ochedru (mannequin) commented Apr 5, 2018

    FWIW, I encountered the same kind of issue when using the mkstemp() function: under the hood, it calls gettempdir() and this one is protected by a lock too.

    Current thread 0x00007ff10231f700 (most recent call first):
    File "/usr/lib/python3.5/tempfile.py", line 432 in gettempdir
    File "/usr/lib/python3.5/tempfile.py", line 269 in _sanitize_params
    File "/usr/lib/python3.5/tempfile.py", line 474 in mkstemp

    gpshead (Member Author) commented Sep 14, 2018

    New changeset 1900384 by Gregory P. Smith in branch 'master':
    bpo-6721: Hold logging locks across fork() (GH-4071)
    1900384

    gpshead (Member Author) commented Oct 7, 2018

    New changeset 3b69993 by Gregory P. Smith (Miss Islington (bot)) in branch '3.7':
    bpo-6721: Hold logging locks across fork() (GH-4071) (bpo-9291)
    3b69993

    vstinner (Member) commented Nov 8, 2018

    New changeset 3b69993 by Gregory P. Smith (Miss Islington (bot)) in branch '3.7':
    bpo-6721: Hold logging locks across fork() (GH-4071) (bpo-9291)

    It seems like this change caused a regression in the Anaconda installer of Fedora:
    https://bugzilla.redhat.com/show_bug.cgi?id=1644936

    But we are not sure at this point. I have to investigate to understand exactly what is happening.

    cagney (mannequin) commented Apr 2, 2019

    I suspect 3b69993 is causing a hang in libreswan's kvmrunner.py on Fedora.

    Looking at the Fedora RPMs:

    python3-3.7.0-9.fc29.x86_64 didn't contain the fix and worked
    python3-3.7.1-4.fc29.x86_64 reverted the fix (for anaconda) and worked
    python3-3.7.2-4.fc29.x86_64 included the fix; eventually hangs

    I believe the hang looks like:

    Traceback (most recent call last):
      File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 389, in _process_test
        test_domains = _boot_test_domains(logger, test, domain_prefix, boot_executor)
      File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 203, in _boot_test_domains
        TestDomain.boot_and_login)
      File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 150, in submit_job_for_domain
        logger.debug("scheduled %s on %s", job, domain)
      File "/usr/lib64/python3.7/logging/__init__.py", line 1724, in debug
        
      File "/usr/lib64/python3.7/logging/__init__.py", line 1768, in log
        def __repr__(self):
      File "/usr/lib64/python3.7/logging/__init__.py", line 1449, in log
        """
      File "/usr/lib64/python3.7/logging/__init__.py", line 1519, in _log
        break
      File "/usr/lib64/python3.7/logging/__init__.py", line 1529, in handle
        logger hierarchy. If no handler was found, output a one-off error
      File "/usr/lib64/python3.7/logging/__init__.py", line 1591, in callHandlers
        
      File "/usr/lib64/python3.7/logging/__init__.py", line 905, in handle
        try:
      File "/home/build/libreswan-web/master/testing/utils/fab/logutil.py", line 163, in emit
        stream_handler.emit(record)
      File "/usr/lib64/python3.7/logging/__init__.py", line 1038, in emit
        Handler.__init__(self)
      File "/usr/lib64/python3.7/logging/__init__.py", line 1015, in flush
        name += ' '
      File "/usr/lib64/python3.7/logging/__init__.py", line 854, in acquire
        self.emit(record)
    KeyboardInterrupt

    gpshead (Member Author) commented Apr 2, 2019

    We need a small test case that can reproduce your problem. I believe 3b69993 to be correct.

    acquiring locks before fork in the thread doing the forking and releasing them afterwards is always the safe thing to do.

    Example possibility: Does your code use any C code that forks on its own without properly calling the C Python PyOS_BeforeFork(), PyOS_AfterFork_Parent(), and PyOS_AfterFork_Child() APIs?

    cagney (mannequin) commented Apr 3, 2019

    Does your code use any C code that forks on its own without properly calling the C Python PyOS_BeforeFork(), PyOS_AfterFork_Parent(), and PyOS_AfterFork_Child() APIs?

    No.

    Is there a web page explaining how to pull a python backtrace from all the threads running within a daemon?

    gpshead (Member Author) commented Apr 3, 2019

    I'd start with faulthandler.register with all_threads=True and see if that gives you what you need.

    https://docs.python.org/3/library/faulthandler.html
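
    For example, something along these lines in the daemon lets you dump all
    thread stacks on demand (signal choice is just an illustration):

        import faulthandler
        import signal

        # Dump the traceback of every Python thread when SIGUSR1 arrives,
        # e.g. via `kill -USR1 <pid>` of the hung process.
        faulthandler.register(signal.SIGUSR1, all_threads=True)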

    cagney (mannequin) commented Apr 4, 2019

    acquiring locks before fork in the thread doing the forking and releasing them afterwards is always the safe thing to do.

    It's also an easy way to cause a deadlock:

    • register_at_fork() et al. will cause per-logger locks to be acquired before the global lock (this isn't immediately obvious from the code)

    If a thread were to grab its logging lock before the global lock then it would deadlock. I'm guessing this isn't allowed - however I didn't see any comments to this effect?

    Can I suggest documenting this, and also merging the two callbacks so that the ordering of these two acquires is made explicit.

    • the per-logger locks are acquired in a random order

    If a thread were to acquire two per-logger locks in a different order then things would deadlock.

    cagney (mannequin) commented Apr 4, 2019

    Below is a backtrace from the deadlock.

    It happens because the logging code is trying to acquire two per-logger locks, in an order different from the random order used by the fork() handler.

    The code in question has a custom class DebugHandler(logging.Handler). The default logging.Handler.handle() method grabs its lock and calls .emit(), viz.:

            if rv:
                self.acquire()
                try:
                    self.emit(record)
                finally:
                    self.release()

    the custom .emit() then sends the record to a sub-logger stream, viz.:

        def emit(self, record):
            for stream_handler in self.stream_handlers:
                stream_handler.emit(record)
            if _DEBUG_STREAM:
                _DEBUG_STREAM.emit(record)

    and one of these emit() functions calls flush() which tries to acquire a further lock.

    Thread 0x00007f976b7fe700 (most recent call first):
    File "/usr/lib64/python3.7/logging/init.py", line 854 in acquire
    File "/usr/lib64/python3.7/logging/init.py", line 1015 in flush

        def flush(self):
            """
            Flushes the stream.
            """
            self.acquire()   # <---- blocked here
            try:
                if self.stream and hasattr(self.stream, "flush"):
                    self.stream.flush()
            finally:
                self.release()

    File "/usr/lib64/python3.7/logging/init.py", line 1038 in emit

            self.flush() <\----
    

    File "/home/build/libreswan-web/master/testing/utils/fab/logutil.py", line 163 in emit

        def emit(self, record):
            for stream_handler in self.stream_handlers:
                stream_handler.emit(record)   # <----
            if _DEBUG_STREAM:
                _DEBUG_STREAM.emit(record)

    File "/usr/lib64/python3.7/logging/init.py", line 905 in handle

        def handle(self, record):
            """
            Conditionally emit the specified logging record.

            Emission depends on filters which may have been added to the handler.
            Wrap the actual emission of the record with acquisition/release of
            the I/O thread lock. Returns whether the filter passed the record for
            emission.
            """
            rv = self.filter(record)
            if rv:
                self.acquire()
                try:
                    self.emit(record)   # <----
                finally:
                    self.release()
            return rv
    

    File "/usr/lib64/python3.7/logging/init.py", line 1591 in callHandlers

                        hdlr.handle(record)

    File "/usr/lib64/python3.7/logging/init.py", line 1529 in handle

                self.callHandlers(record)

    File "/usr/lib64/python3.7/logging/init.py", line 1519 in _log

            self.handle(record)

    File "/usr/lib64/python3.7/logging/init.py", line 1449 in log

            self._log(level, msg, args, **kwargs)

    File "/usr/lib64/python3.7/logging/init.py", line 1768 in log

                self.logger.log(level, msg, *args, **kwargs)

    File "/usr/lib64/python3.7/logging/init.py", line 1724 in debug

            self.log(DEBUG, msg, *args, **kwargs)

    File "/home/build/libreswan-web/master/testing/utils/fab/shell.py", line 110 in write

            self.logger.debug(self.message, ascii(text))

    gpshead (Member Author) commented Apr 5, 2019

    Thanks for the debugging details! I've filed https://bugs.python.org/issue36533 to specifically track this potential regression in the 3.7 stable branch. Let's carry on there, where the discussion thread isn't too long for bug tracker sanity.

    vstinner (Member) commented:

    I created bpo-40089: Add _at_fork_reinit() method to locks.

    pitrou (Member) commented Apr 28, 2020

    Related issue:
    https://bugs.python.org/issue40399
    """
    IO streams locking can be broken after fork() with threads
    """

    rojer (mannequin) commented Apr 29, 2020

    https://bugs.python.org/issue40442 is a fresh instance of this, entirely self-inflicted.

    vstinner (Member) commented:

    See also bpo-25920: PyOS_AfterFork should reset socketmodule's lock.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    vstinner (Member) commented Jun 6, 2022

    While it's true that "Locks in the standard library should be sanitized on fork", IMO having such a "meta-issue" to track the problem across the 300+ stdlib modules is a bad idea, since it's hard to track how many modules got fixed and how many still need fixing. Multiple modules have been fixed. I suggest opening more specific issues for the remaining ones. I'm closing this issue. Thanks to everyone who was involved in fixing issues, and good luck to the volunteers fixing the remaining ones :-) Also, avoid fork without exec: it's no longer supported on macOS, it was never supported on Windows, and it causes tons of very complex bugs on Linux :-)
