Issue6721
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2009-08-17 23:06 by gregory.p.smith, last changed 2022-04-11 14:56 by admin.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
lock_fork_thread_deadlock_demo.py | gregory.p.smith, 2009-08-17 23:06 | |||
forklocktests.patch | pitrou, 2011-05-03 20:38 | review | ||
reinit_locks.diff | neologix, 2011-05-15 19:39 | patch adding locks reinitialization upon fork | review | |
emit_warning_on_fork.patch | avian, 2011-06-30 15:04 | review | ||
atfork.patch | sbt, 2012-01-23 21:23 | review | ||
reinit_locks_2.diff | sbt, 2012-05-31 20:48 | review |
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 4071 | merged | gregory.p.smith, 2017-10-21 20:24 | |
PR 9291 | merged | miss-islington, 2018-09-14 05:08 | |
PR 21986 | koubaa, 2020-09-07 15:38 | ||
PR 22205 | adelfino, 2020-09-15 20:22 | ||
PR 22651 | kevans, 2020-10-11 20:42 |
Messages (133) | |||
---|---|---|---|
msg91674 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2009-08-17 23:06 | |
The python logging module uses a lock to surround many operations, in particular. This causes deadlocks in programs that use logging, fork and threading simultaneously. 1) spawn one or more threads in your program 2) have at least one of those threads make logging calls that will be emitted. 3) have your main thread or another thread use os.fork() to run some python code in a child process. 4) If the fork happened while one of your threads was within the logging.Handler.handle() critical section (or anywhere else where handler.lock is acquired), your child process will deadlock as soon as it tries to log anything. It inherited a held lock. The deadlock is more likely to happen on a highly loaded system which tends to widen the deadlock opportunity window due to context switching. A demo of the problem simplified into one file is attached. The Python standard library should not be the cause of these deadlocks. We need a way for all standard library locks to be cleaned up when forking. By doing one of the following: A) acquire all locks before forking, release them immediately after. B) forceably release all standard library locks after forking in the child process. Code was added to call some cleanups after forking in http://bugs.python.org/issue874900 but there are more things that also need this same sort of cleanup (logging for example). Rather than having to manually add after fork code hooks into every file in the standard library that uses locks, a more general solution to track and manage locks across fork would be a good idea. |
|||
msg91936 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2009-08-24 19:48 | |
I've started a project to patch this and similar messes up for Python 2.4 and later here: http://code.google.com/p/python-atfork/ I'd like to take ideas or implementations from that when possible for future use in the python standard library. |
|||
msg92766 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2009-09-17 14:25 | |
issue 6923 has been opened to provide a C API for an atfork mechanism for use by extension modules. |
|||
msg94102 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2009-10-15 18:15 | |
Rather than having a kind of global module registry, locks could keep track of what was the last PID, and reinitialize themselves if it changed. This is assuming getpid() is fast :-) |
|||
msg94115 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2009-10-16 00:24 | |
> Antoine Pitrou <pitrou@free.fr> added the comment: > > Rather than having a kind of global module registry, locks could keep > track of what was the last PID, and reinitialize themselves if it changed. > This is assuming getpid() is fast :-) Locks can't blindly release themselves because they find themselves running in another process. If anything if a lock is held and finds itself running in a new process any attempt to use the lock should raise an exception so that the bug is noticed. I'm not sure a PID check is good enough. old linux using linuxthreads had a different pid for every thread, current linux with NPTL is more like other oses with the same pid for all threads. |
|||
msg94133 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2009-10-16 10:23 | |
I was suggesting "reinitialize", rather than "release". That is, create a new lock (mutex, semaphore, etc.) and let the old one die (or occupy some tiny bit of memory). |
|||
msg94135 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2009-10-16 10:28 | |
no need for that. the problem is that they're held by a thread that does not exist in the newly forked child process so they will never be released in the new process. example: if you fork while another thread is in the middle of logging something and then try to log something yourself in the child, your child process will deadlock on the logging module's lock. locks are not shared between processes so reinitializing them with a new object wouldn't do anything. |
|||
msg128282 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-02-10 11:12 | |
I'm not sure that releasing the mutex is enough, it can still lead to a segfault, as is probably the case in this issue : http://bugs.python.org/issue11148 Quoting pthread_atfork man page : To understand the purpose of pthread_atfork, recall that fork duplicates the whole memory space, including mutexes in their current locking state, but only the calling thread: other threads are not running in the child process. The mutexes are not usable after the fork and must be initialized with pthread_mutex_init in the child process. This is a limitation of the current implementation and might or might not be present in future versions. To avoid this, install handlers with pthread_atfork as follows: have the prepare handler lock the mutexes (in locking order), and the parent handler unlock the mutexes. The child handler should reset the mutexes using pthread_mutex_init, as well as any other synchronization objects such as condition variables. Locking the global mutexes before the fork ensures that all other threads are locked out of the critical regions of code protected by those mutexes. Thus when fork takes a snapshot of the parent's address space, that snapshot will copy valid, stable data. Resetting the synchronization objects in the child process will ensure they are properly cleansed of any artifacts from the threading subsystem of the parent process. For example, a mutex may inherit a wait queue of threads waiting for the lock; this wait queue makes no sense in the child process. Initializing the mutex takes care of this. pthread_atfork might be worth looking into |
|||
msg128307 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2011-02-10 16:56 | |
fwiw http://bugs.python.org/issue6643 recently fixed on issue where a mutex was being closed instead of reinitialized after a fork. more are likely needed. Are you suggesting to use pthread_atfork to call pthread_mutex_init on all mutexes created by Python in the child after a fork? I'll have to ponder that some more. Given the mutexes are all useless post fork it does not strike me as a bad idea. |
|||
msg128311 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-02-10 17:02 | |
> Are you suggesting to use pthread_atfork to call pthread_mutex_init on > all mutexes created by Python in the child after a fork? I'll have to > ponder that some more. Given the mutexes are all useless post fork it > does not strike me as a bad idea. I don't really understand. It's quite similar to the idea you shot down in msg94135. Or am I missing something? |
|||
msg128316 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2011-02-10 17:20 | |
Yeah, I'm trying to figure out what I was thinking then or if I was just plain wrong. :) I was clearly wrong about a release being done in the child being the right thing to do (issue6643 proved that, the state held by a lock is not usable to another process on all platforms such that release even works). Part of it looks like I wanted a way to detect it was happening as any lock that is held during a fork indicates a _potential_ bug (the lock wasn't registered anywhere to be released before the fork) but not everything needs to care about that. |
|||
msg128369 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-02-11 09:13 | |
> I was clearly wrong about a release being done in the child being the right thing to do (issue6643 proved that, the state held by a lock is not usable to another process on all platforms such that release even works). Yeah, apparently OS-X is one of them, the reporter in #11148 is experiencing segfaults under OS-X. > Are you suggesting to use pthread_atfork to call pthread_mutex_init on all mutexes created by Python in the child after a fork? I'll have to ponder that some more. Given the mutexes are all useless post fork it does not strike me as a bad idea. Yes, that's what I was thinking. Instead of scattering the lock-reclaiming code all over the place, try to use a more standard API specifically designed with that in mind. Note the base issue is that we're authorizing things which are forbidden : in a multi-threaded process, only sync-safe calls are authorized between fork and exec. 2011/2/10 Gregory P. Smith <report@bugs.python.org>: > > Gregory P. Smith <greg@krypto.org> added the comment: > > Yeah, I'm trying to figure out what I was thinking then or if I was just plain wrong. :) > > I was clearly wrong about a release being done in the child being the right thing to do (issue6643 proved that, the state held by a lock is not usable to another process on all platforms such that release even works). > > Part of it looks like I wanted a way to detect it was happening as any lock that is held during a fork indicates a _potential_ bug (the lock wasn't registered anywhere to be released before the fork) but not everything needs to care about that. > > ---------- > versions: +Python 3.3 > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue6721> > _______________________________________ > |
|||
msg135012 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-05-03 00:20 | |
I encountered this issue while debugging some multiprocessing code; fork() would be called from one thread while sys.stdout was in use in another thread (simply because of a couple of debugging statements). As a result the IO lock would be already "taken" in the child process and any operation on sys.stdout would deadlock. This is definitely something that can happen more easily than I thought. |
|||
msg135067 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-05-03 20:38 | |
Here is a patch with tests for the issue (some of which fail of course). Do we agree that these tests are right? |
|||
msg135069 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2011-05-03 20:47 | |
Those tests make sense to me. |
|||
msg135079 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-03 21:39 | |
# A lock taken from the current thread should stay taken in the # child process. Note that I'm not sure of how to implement this. After a fork, even releasing the lock can be unsafe, it must be re-initialized, see following comment in glibc's malloc implementation: /* In NPTL, unlocking a mutex in the child process after a fork() is currently unsafe, whereas re-initializing it is safe and does not leak resources. Therefore, a special atfork handler is installed for the child. */ Note that this means that even the current code allocating new locks after fork (in Lib/threading.py, _after_fork and _reset_internal_locks) is unsafe, because the old locks will be deallocated, and the lock deallocation tries to acquire and release the lock before destroying it (in issue #11148 the OP experienced a segfault on OS-X when locking a mutex, but I'm not sure of the exact context). Also, this would imply keeping track of the thread currently owning the lock, and doesn't match the typical pthread_atfork idiom (acquire locks just before fork, release just after in parent and child, or just reinit them in the child process) Finally, IMHO, forking while holding a lock and expecting it to be usable after fork doesn't make much sense, since a lock is acquired by a thread, and this threads doesn't exist in the child process. It's explicitely described as "undefined" by POSIX, see http://pubs.opengroup.org/onlinepubs/007908799/xsh/sem_init.html : """ The use of the semaphore by threads other than those created in the same process is undefined. """ So I'm not sure whether it's feasable/wise to provide such a guarantee. |
|||
msg135083 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-05-03 22:05 | |
> Also, this would imply keeping track of the thread currently owning > the lock, Yes, we would need to keep track of the thread id and process id inside the lock. We also need a global variable of the main thread id after fork, and a per-lock "taken" flag. Synopsis: def _reinit_if_needed(self): # Call this before each acquire() or release() if self.pid != getpid(): sem_init(self.sem, 0, 1) if self.taken: if self.tid == main_thread_id_after_fork: # Lock was taken in forked thread, re-take it sem_wait(self.sem) else: # It's now released self.taken = False self.pid = getpid() self.tid = current_thread_id() > and doesn't match the typical pthread_atfork idiom (acquire locks just > before fork, release just after in parent and child, or just reinit > them in the child process) Well, I fail to understand how that idiom can help us. We're not a self-contained application, we're a whole programming language. Calling fork() only when no lock is held is unworkable (for example, we use locks around buffered I/O objects). |
|||
msg135095 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-04 05:56 | |
> Yes, we would need to keep track of the thread id and process id inside > the lock. We also need a global variable of the main thread id after > fork, and a per-lock "taken" flag. > > Synopsis: > > def _reinit_if_needed(self): > # Call this before each acquire() or release() > if self.pid != getpid(): > sem_init(self.sem, 0, 1) > if self.taken: > if self.tid == main_thread_id_after_fork: > # Lock was taken in forked thread, re-take it > sem_wait(self.sem) > else: > # It's now released > self.taken = False > self.pid = getpid() > self.tid = current_thread_id() > A couple remarks: - with linuxthreads, different threads within the same process have the same PID - it may be true for other implementations - so this would lead to spurious reinitializations - what's current_thread_id ? If it's thread_get_ident (pthread_self), since TID is not guaranteed to be inherited across fork, this won't work - calling getpid at every acquire/release is expensive, even though it's a trivial syscall (it'll have to measured though) - imagine the following happens: P1 lock.acquire() fork() -> P2 start_new_thread T2 T1 T2 lock.acquire() The acquisition of lock by T2 will cause lock's reinitialization: what happens to the lock wait queue ? who owns the lock ? That why I don't think we can delay the reinitialization of locks, but I could be wrong. > Well, I fail to understand how that idiom can help us. We're not a > self-contained application, we're a whole programming language. > Calling fork() only when no lock is held is unworkable (for example, we > use locks around buffered I/O objects). Yes, but in that case, you don't have to reacquire the locks after fork. In the deadlock you experienced above, the thread that forked wasn't the one in the I/O code, so the corresponding lock can be re-initialized anyway, since the thread in the I/O code at that time won't exist after fork. And it's true with every lock in the library code: they're only held in short critical sections (typically acquired when entering a function and released when leaving), and since it's not the threads inside those libraries that fork, all those locks can simply be reinitialized on fork, without having the reacquire them. |
|||
msg135096 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-04 06:00 | |
Oops, for liunxthreads, you should of course read "different PIDs", not "same PID". |
|||
msg135143 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-05-04 17:53 | |
> - what's current_thread_id ? If it's thread_get_ident (pthread_self), > since TID is not guaranteed to be inherited across fork, this won't > work Ouch, then the approach I'm proposing is probably doomed. > And it's true with every lock in the library code: they're only held > in short critical sections (typically acquired when entering a > function and released when leaving), and since it's not the threads > inside those libraries that fork, all those locks can simply be > reinitialized on fork, without having the reacquire them. Well, this means indeed that *some* locks can be handled, but not all of them and not automatically, right? Also, how would you propose they be dealt with in practice? How do they get registered, and how does the reinitialization happen? (do note that library code can call arbitrary third-party code, by the way: for example through encodings in the text I/O layer) |
|||
msg135157 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-04 20:58 | |
>> - what's current_thread_id ? If it's thread_get_ident (pthread_self), >> since TID is not guaranteed to be inherited across fork, this won't >> work > > Ouch, then the approach I'm proposing is probably doomed. > Well, it works on Linux with NPTL, but I'm not sure at all it holds for other implementations (pthread_t it's only meaningful within the same process). But I'm not sure it's really the "killer" point: PID with linuxthreads and lock being acquired by a second thread before the main thread releases it in the child process also look like serious problems. > Well, this means indeed that *some* locks can be handled, but not all of > them and not automatically, right? > Also, how would you propose they be dealt with in practice? How do they > get registered, and how does the reinitialization happen? When a lock object is allocated in Modules/threadmodule.c (PyThread_allocate_lock/newlockobject), add the underlying lock (self->lock_lock) to a linked list (since it's called with the GIL held, we don't need to protect the linked list from concurrent access). Each thread implementation (thread_pthread.h, thread_nt.h) would provide a new PyThread_reinit_lock function that would do the right thing (pthread_mutex_destroy/init, sem_destroy/init, etc). Modules/threadmodule.c would provide a new PyThread_ReInitLocks that would walk through the linked list and call PyThread_reinit_lock for each lock. PyOS_AfterFork would call this PyThread_ReInitLocks right after fork. This would have the advantage of being consistent with what's already done to reinit the TLS key and the import lock. So, we guarantee to be in a consistent and usable state when PyOS_AfterFork returns. Also, it's somewhat simpler because we're sure that at that point only one thread is running (once again, no need to protect the linked-list walk). I don't think that the performance impact would be noticable (I know it's O(N) where N is the number of locks), and contrarily to the automatic approach, this wouldn't penalize every acquire/release. Of course, this would solve the problem of threading's module locks, so PyEval_ReInitThreads could be removed, along with threading.py's _after_fork and _reset_internal_locks. In short, this would reset every lock held so that they're usable in the child process, even locks allocated e.g. from Modules/_io/bufferedio.c. But this wouldn't allow a lock's state to be inherited across fork for the main thread (but like I said, I don't think that this makes much sense anyway, and to my knowledge no implementation makes such a guarantee - and definitely not POSIX). |
|||
msg135173 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-05 05:41 | |
Please disregard my comment on PyEval_ReInitThreads and _after_fork: it will of course still be necessary, because it does much more than just reinitializing locks (e.g. stop threads). Also, note that both approaches don't handle synchronization primitives other than bare Lock and RLock. For example, Condition and Event used in the threading module wouldn't be reset automatically: that's maybe something that could be handled by Gregory's atfork mechanism. |
|||
msg135543 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-05-08 21:25 | |
Thanks for the explanations. This sounds like an interesting path. > Each thread implementation (thread_pthread.h, thread_nt.h) > would provide a new PyThread_reinit_lock function that would do the > right thing (pthread_mutex_destroy/init, sem_destroy/init, etc). > Modules/threadmodule.c would provide a new PyThread_ReInitLocks that > would walk through the linked list and call PyThread_reinit_lock for > each lock. Actually, I think the issue is POSIX-specific: Windows has no fork(), and we don't care about other platforms anymore (they are, are being, or will be soon deprecated). It means only the POSIX implementation needs to register its locks in a linked list. > But this wouldn't allow a lock's state to be inherited across fork for > the main thread (but like I said, I don't think that this makes much > sense anyway, and to my knowledge no implementation makes such a > guarantee - and definitely not POSIX). Well, the big difference between Python locks and POSIX mutexes is that Python locks can be released from another thread. They're a kind of trivial semaphore really, and this makes them usable for other purpose than mutual exclusion (you can e.g. use a lock as a simple event by blocking on a second acquire() until another thread calls release()). But even though we might not be "fixing" arbitrary Python code automatically, fixing the interpreter's internal locks (especially the IO locks) would be great already. (we could also imagine that the creator of the lock decides whether it should get reinitialized after fork) |
|||
msg135857 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-05-12 20:01 | |
Hi, There seem to be two alternatives for atfork handlers: 1) acquire locks during prepare phase and unlock them in parent and child after fork. 2) reset library to some consistent state in child after fork. http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html Option (2) makes sense but is probably not always applicable. Option (1) depends on being able to acquire locks in locking order, but how can we determine correct locking order across libraries? Initializing locks in child after fork without acquiring them before the fork may result in corrupted program state and so is probably not a good idea. On a positive note, if I understand correctly, Python signal handler functions are actually run in the regular interpreter loop (as pending calls) after the signal has been handled and so os.fork() atfork handlers will not be restricted to async-signal-safe operations (since a Python fork is never done in a signal handler). http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html "It is therefore undefined for the fork handlers to execute functions that are not async-signal-safe when fork() is called from a signal handler." Opinion by Butenhof who was involved in the standardization effort of POSIX threads: http://groups.google.com/group/comp.programming.threads/msg/3a43122820983fde ...so how can we establish correct (cross library) locking order during prepare stage? Nir |
|||
msg135866 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-05-12 21:10 | |
@Nir Aides: *thanks* for this link: http://groups.google.com/group/comp.programming.threads/msg/3a43122820983fde You made my day! |
|||
msg135897 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-05-13 11:15 | |
> ...so how can we establish correct (cross library) locking order > during prepare stage? That sounds like a lost battle, if it requires the libraries' cooperation. I think resetting locks is the best we can do. It might not work ok in all cases, but if it can handle simple cases (such as I/O and logging locks), it is already very good. |
|||
msg135899 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-13 11:24 | |
> Hi, > Hello Nir, > Option (2) makes sense but is probably not always applicable. > Option (1) depends on being able to acquire locks in locking order, but how > can we determine correct locking order across libraries? > There are indeed a couple problems with 1: 1) actually, releasing the mutex/semaphore from the child is not guaranteed to be safe, see this comment from glibc's malloc: /* In NPTL, unlocking a mutex in the child process after a fork() is currently unsafe, whereas re-initializing it is safe and does not leak resources. Therefore, a special atfork handler is installed for the child. */ We could just destroy/reinit them, though. 2) acquiring locks just before fork is probably one of the best way to deadlock (acquiring a lock we already hold, or acquiring a lock needed by another thread before it releases its own lock). Apart from adding dealock avoidance/recovery mechanisms - which would be far from trivial - I don't see how we could solve this, given that each library can use its own locks, not counting the user-created ones 3) there's another special lock we must take into account, the GIL: contrarily to a typical C program, we can't have the thread forking blindly try to acquire all locks just before fork, because since we hold the GIL, other threads won't be able to proceed (unless of course they're in a section where they don't run without the GIL held). So, we would have to: - release the GIL - acquire all locks in the correct order - re-acquire the GIL - fork - reinit all locks after fork I think this is going to be very complicated. 4) Python locks differ from usual mutexes/semaphores in that they can be held for quite some time (for example while performing I/O). Thus, acquiring all the locks could take a long time, and users might get irritated if fork takes 2 seconds to complete. 5) Finally, there's a fundamental problem with this approach, because Python locks can be released by a thread other than the one that owns it. Imagine this happens: T1 T2 lock.acquire() (do something without releasing lock) fork() lock.release() This is perfectly valid with the current lock implementation (for example, it can be used to implement a rendez-vous point so that T2 doesn't start processing before T1 forked worker processes, or whatever). But if T1 tries to acquire lock (held by T2) before fork, then it will deadlock, since it will never be release by T2. For all those reasons, I don't think that this approach is reasonable, but I could be wrong :-) > Initializing locks in child after fork without acquiring them before the > fork may result in corrupted program state and so is probably not a good > idea. Yes, but in practise, I think that this shouldn't be too much of a problem. Also note that you can very well have the same type of problem with sections not protected explicitely by locks: for example, if you have a thread working exclusively on an object (maybe part of a threadpool), a fork can very well happen while the object is in an inconsistent state. Acquiring locks before fork won't help that. But I think this should eventually be addressed, maybe by specific atfork handlers. > On a positive note, if I understand correctly, Python signal handler > functions are actually run in the regular interpreter loop (as pending > calls) after the signal has been handled and so os.fork() atfork handlers > will not be restricted to async-signal-safe operations (since a Python fork > is never done in a signal handler). That's correct. In short, I think that we could first try to avoid common deadlocks by just resetting locks in the child process. This is not panacea, but this should solve the vast majority of deadlocks, and would open the door to potential future refinements using atfork-like handlers. Attached is a first draft for a such patch (with tests). Synopsis: - when a PyThread_type_lock is created, it's added to a linked-list, when it's deleted, it's removed from the linked list - PyOS_AfterFork() calls PyThread_ReinitLocks() which calls PyThread_reinit_lock() for each lock in the linked list - PyThread_reinit_lock() does the right thing (i.e. sem_destroy/init for USE_SEMAPHORES and pthread_(mutex|cond)_destroy/init for emulated semaphores). Notes: - since it's only applicable to POSIX (since other Unix thread implementations will be dropped), I've only defined a PyThread_ReinitLocks inside Python/thread_pthread.h, so it won't build on other platforms. How should I proceed: like PyThread_ReInitTLS(), add a stub function to all Python/thread_xxx.h, or guard the call to PyThread_ReinitLocks() with #ifdef _POSIX_THREADS ? - I'm not sure of how to handle sem_init/etc failures in the reinit code: for now I just ignore this possibility, like what's done for the import lock reset - insertions/removals from the linked list are not protected from concurrent access because I assume that locks are created/deleted with the GIL held: is that a reasonable assumption, or should I add a mutex to protect those accesses? This fixes common deadlocks with threading.Lock, and PyThread_type_lock (used for example by I/O code). |
|||
msg135948 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-05-13 23:12 | |
@ Charles-François Natali <report@bugs.python.org> wrote (2011-05-13 13:24+0200): > I happily posted a reinit patch I must say in advance that we have implemented our own thread support 2003-2005 and i'm thus lucky not to need to use anything else ever since. So. And of course i have no overview about Python. But i looked and saw no errors in the default path and the tests run without errors. Then i started to try your semaphore path which is a bit problematic because Mac OS X doesn't offer anon sems ;). ( By the way, in PyThread_acquire_lock_timed() these lines if (microseconds > 0) MICROSECONDS_TO_TIMESPEC(microseconds, ts); result in these compiler warnings. python/thread_pthread.h: In function ‘PyThread_acquire_lock_timed’: Python/thread_pthread.h:424: warning: ‘ts.tv_sec’ may be used uninitialized in this function Python/thread_pthread.h:424: warning: ‘ts.tv_nsec’ may be used uninitialized in this function ) #ifdef USE_SEMAPHORES #define broken_sem_init broken_sem_init static int broken_sem_init(sem_t **sem, int shared, unsigned int value) { int ret; auto char buffer[32]; static long counter = 3000; sprintf(buffer, "%016ld", ++counter); *sem = sem_open(buffer, O_CREAT, (mode_t)0600, (unsigned int)value); ret = (*sem == SEM_FAILED) ? -1 : 0; //printf("BROKEN_SEM_INIT WILL RETURN %d (value=%u)\n", ret,value); return ret; } static int sem_timedwait(sem_t *sem, struct timespec *ts) { int success = -1, iters = 1000; struct timespec now, wait; printf("STARTING LOOP\n"); for (;;) { if (sem_trywait(sem) == 0) { printf("TRYWAIT OK\n"); success = 0; break; } wait.tv_sec = 0, wait.tv_nsec = 200 * 1000; //printf("DOWN "); fflush(stdout); nanosleep(&wait, NULL); MICROSECONDS_TO_TIMESPEC(0, now); //printf("WOKE UP NOW=%ld:%ld END=%ld:%ld\n", now.tv_sec,now.tv_nsec, ts->tv_sec,ts->tv_nsec); if (now.tv_sec > ts->tv_sec || (now.tv_sec == ts->tv_sec && now.tv_nsec >= ts->tv_nsec)) break; if (--iters < 0) { printf("BREAKING OFF LOOP, 1000 iterations\n"); errno = ETIMEDOUT; break; } } return success; } #define sem_destroy sem_close typedef struct _pthread_lock { sem_t *sem; struct _pthread_lock*next; sem_t sem_buf; } pthread_lock; #endif plus all the changes the struct change implies, say. Yes it's silly, but i wanted to test. And this is the result: == CPython 3.3a0 (default:804abc2c60de+, May 14 2011, 01:09:53) [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] == Darwin-10.7.0-i386-64bit little-endian == /Users/steffen/src/cpython/build/test_python_19230 Testing with flags: sys.flags(debug=0, inspect=0, interactive=0, optimize=0, dont_write_bytecode=0, no_user_site=0, no_site=0, ignore_environment=1, verbose=0, bytes_warning=0, quiet=0) Using random seed 1362049 [1/1] test_threading STARTING LOOP test_acquire_contended (test.test_threading.LockTests) ... ok test_acquire_destroy (test.test_threading.LockTests) ... ok test_acquire_release (test.test_threading.LockTests) ... ok test_constructor (test.test_threading.LockTests) ... ok test_different_thread (test.test_threading.LockTests) ... ok test_reacquire (test.test_threading.LockTests) ... ok test_state_after_timeout (test.test_threading.LockTests) ... ok test_thread_leak (test.test_threading.LockTests) ... ok test_timeout (test.test_threading.LockTests) ... STARTING LOOP TRYWAIT OK FAIL test_try_acquire (test.test_threading.LockTests) ... ok test_try_acquire_contended (test.test_threading.LockTests) ... ok test_with (test.test_threading.LockTests) ... ok test__is_owned (test.test_threading.PyRLockTests) ... ok test_acquire_contended (test.test_threading.PyRLockTests) ... ok test_acquire_destroy (test.test_threading.PyRLockTests) ... ok test_acquire_release (test.test_threading.PyRLockTests) ... ok test_constructor (test.test_threading.PyRLockTests) ... ok test_different_thread (test.test_threading.PyRLockTests) ... ok test_reacquire (test.test_threading.PyRLockTests) ... ok test_release_unacquired (test.test_threading.PyRLockTests) ... ok test_thread_leak (test.test_threading.PyRLockTests) ... ok test_timeout (test.test_threading.PyRLockTests) ... STARTING LOOP TRYWAIT OK FAIL test_try_acquire (test.test_threading.PyRLockTests) ... ok test_try_acquire_contended (test.test_threading.PyRLockTests) ... ok test_with (test.test_threading.PyRLockTests) ... ok test__is_owned (test.test_threading.CRLockTests) ... ok test_acquire_contended (test.test_threading.CRLockTests) ... ok test_acquire_destroy (test.test_threading.CRLockTests) ... ok test_acquire_release (test.test_threading.CRLockTests) ... ok test_constructor (test.test_threading.CRLockTests) ... ok test_different_thread (test.test_threading.CRLockTests) ... ok test_reacquire (test.test_threading.CRLockTests) ... ok test_release_unacquired (test.test_threading.CRLockTests) ... ok test_thread_leak (test.test_threading.CRLockTests) ... BREAKING OFF LOOP, 1000 iterations Timeout (1:00:00)! Thread 0x00007fff70677ca0: File "/Users/steffen/src/cpython/Lib/test/lock_tests.py", line 17 in _wait File "/Users/steffen/src/cpython/Lib/test/lock_tests.py", line 52 in wait_for_finished File "/Users/steffen/src/cpython/Lib/test/lock_tests.py", line 152 in test_thread_leak File "/Users/steffen/src/cpython/Lib/unittest/case.py", line 407 in _executeTestPart File "/Users/steffen/src/cpython/Lib/unittest/case.py", line 462 in run File "/Users/steffen/src/cpython/Lib/unittest/case.py", line 514 in __call__ File "/Users/steffen/src/cpython/Lib/unittest/suite.py", line 105 in run File "/Users/steffen/src/cpython/Lib/unittest/suite.py", line 67 in __call__ File "/Users/steffen/src/cpython/Lib/unittest/suite.py", line 105 in run File "/Users/steffen/src/cpython/Lib/unittest/suite.py", line 67 in __call__ File "/Users/steffen/src/cpython/Lib/unittest/runner.py", line 168 in run File "/Users/steffen/src/cpython/Lib/test/support.py", line 1187 in _run_suite File "/Users/steffen/src/cpython/Lib/test/support.py", line 1213 in run_unittest File "/Users/steffen/src/cpython/Lib/test/test_threading.py", line 748 in test_main File "/Users/steffen/src/cpython/Lib/test/regrtest.py", line 1044 in runtest_inner File "/Users/steffen/src/cpython/Lib/test/regrtest.py", line 838 in runtest File "/Users/steffen/src/cpython/Lib/test/regrtest.py", line 662 in main File "/Users/steffen/src/cpython/Lib/test/__main__.py", line 13 in <module> File "/Users/steffen/src/cpython/Lib/runpy.py", line 73 in _run_code File "/Users/steffen/src/cpython/Lib/runpy.py", line 160 in _run_module_as_main Hope that helps a bit. Bâillement. |
|||
msg135965 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-14 08:26 | |
Hello Steffen, First, thanks for testing this on OS-X: I only have access to Linux systems (I tested both the semaphore and the emulated semaphore paths). If I understand correctly, the patch works fine with the default build option on OS-X. Then, you're saying that OS-X doesn't have POSIX unnamed semaphores: this means that the default build uses the mutex+condition variable version. Am I correct? But I don't understand the last part of your message. Do you mean that you implemented your own version of the semaphore path using named semaphores, and that it fails with my patch (well, your adapted version of it) ? If yes, then you'll understand that I can't comment on this, since it's not my code :-) But after a quick look at the code you posted, I think that your acquire code is broken: sem_timedwait(timeout) if not the same as trying sem_trywait multiple times until timeout expires: in case of contention, this will fail. |
|||
msg135984 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-05-14 18:54 | |
I think that generally it is better to deadlock than corrupt data. > 2) acquiring locks just before fork is probably one of the best way to > deadlock (acquiring a lock we already hold, or acquiring a lock needed > by another thread before it releases its own lock). Apart from adding > dealock avoidance/recovery mechanisms - which would be far from > trivial - I don't see how we could solve this, given that each library > can use its own locks, not counting the user-created ones a) We know the correct locking order in Python's std libraries so the problem there is kind of solved. b) We can put the burden of other locks on application developers and since currently no one registers atfork handlers, there is no problem there yet. > 4) Python locks differ from usual mutexes/semaphores in that they can > be held for quite some time (for example while performing I/O). Thus, > acquiring all the locks could take a long time, and users might get > irritated if fork takes 2 seconds to complete. We only need a prepare handler to acquire locks that protect data from corruption. A lock synchronizing IO which is held for long periods may possibly be initialized in child without being acquired in a prepare handler; for example, a lock serializing logging messages. In other cases or in general an atfork handler may reset or reinitialize a library without acquiring locks in a prepare handler. > 5) Finally, there's a fundamental problem with this approach, because > Python locks can be released by a thread other than the one that owns > it. > Imagine this happens: > > T1 T2 > lock.acquire() > (do something without releasing lock) > fork() > lock.release() > > This is perfectly valid with the current lock implementation (for > example, it can be used to implement a rendez-vous point so that T2 > doesn't start processing before T1 forked worker processes, or > whatever). > But if T1 tries to acquire lock (held by T2) before fork, then it will > deadlock, since it will never be release by T2. I think we do not need to acquire rendezvous locks in a prepare handler. > > Initializing locks in child after fork without acquiring them before the > > fork may result in corrupted program state and so is probably not a good > > idea. > > Yes, but in practise, I think that this shouldn't be too much of a > problem. Also note that you can very well have the same type of > problem with sections not protected explicitely by locks: for example, > if you have a thread working exclusively on an object (maybe part of a > threadpool), a fork can very well happen while the object is in an > inconsistent state. Acquiring locks before fork won't help that. I think a worker thread that works exclusively on an object does not create the problem: a) If the fork thread eventually needs to read the object then you need synchronization. b) If the worker thread eventually writes data into file or DB then that operation will be completed at the parent process. To summarize I think we should take the atfork path. An atfork handler does not need to acquire all locks, but only those required by library logic, which the handler is aware of, and as a bonus it can be used to do all sort of stuff such as cleaning up, reinitializing a library, etc... |
|||
msg136003 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-14 23:14 | |
> a) We know the correct locking order in Python's std libraries so the problem there is kind of solved. I think that you're greatly under-estimating the complexity of lock ordering. If we were just implementing a malloc implementation protected with a single mutex, then yes, it would be simple. But here, you have multiple libraries with each their own locks, locks at the I/O layer, in the socket module (some name resolution libraries are not thread-safe), and in many other places. And all those interact. For example, buffered I/O objects each have their own lock (Antoine, correct me if I'm wrong). It's a common cause of deadlock. Now imagine I have a thread that logs information to a bz2 stream, so that it's compressed on-the-fly. Sounds reasonable, no? Well, the lock hierarchy is: buffered stream lock bz2-level lock logging object I/O lock Do you still think that getting the locking order right is easy? Another example, with I/O locks (and if you're concerned with data corruption, those are definitely the one you would want to handle with atfork): I have a thread blocking on a write (maybe the output pipe is full, maybe it's a NFS file system and the server takes a long time to respond, etc. Or maybe it's just waiting for someone to type something on stdin.). Another thread forks. The atfork-handler will try to acquire the buffered I/O object's lock: it won't succeed until the other threads finally manages to write/read. It could take seconds, or forever. And there are many other things that could go wrong, because contrarily to a standalone and self-contained library, Python is made of several components, at different layers, that can call each other in an arbitrary order. Also, some locks can be held for arbitrarily long. That's why I still think that this can be fully handled by atfork handlers. But don't get me wrong: like you, I think that we should definitely have an atfork mechanism. I just think it won't be able to solve all the issues, and that I can also bring its own set of troubles. Concerning the risk of corruption (invariant broken), you're right. But resetting the locks is the approach currently in use for the threading module, and it seems to work reasonably well there. Finally, I'd just like to insist on a point: In a multi-threaded program, between fork and exec, the code must be async-safe. This means that in theory, you can't call pthread_mutex_release/pthread_mutex_destroy, fwrite, malloc, etc. Period. This means that in theory, we shouldn't be running Python code at all! So if we really wanted to be safe, the only solution would be to forbid fork() in a multi-threaded program. Since it's not really a reasonable option, and that the underlying platform (POSIX) doesn't allow to be safe either, I guess that the only choice left is to provide a bet-try implementation, knowing perfectly that there will always be some corner cases that can't be handled. |
|||
msg136039 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-05-15 17:03 | |
@ Charles-François Natali wrote (2011-05-15 01:14+0200): > So if we really wanted to be safe, the only solution would be to > forbid fork() in a multi-threaded program. > Since it's not really a reasonable option But now - why this? The only really acceptable thing if you have control about what you are doing is the following: class SMP::Process /*! * \brief Daemonize process. *[.] * \note * The implementation of this function is not trivial. * To avoid portability no-goes and other such problems, * you may \e not call this function after you have initialized * Thread::enableSMP(), * nor may there (have) be(en) Child objects, * nor may you have used an EventLoop! * I.e., the process has to be a single threaded, "synchronous" one. * [.] */ pub static si32 daemonize(ui32 _daemon_flags=df_default); namespace SMP::POSIX /*! * \brief \fn fork(2). *[.] * Be aware that this passes by all \SMP and Child related code, * i.e., this simply \e is the system-call. * Signal::resetAllSignalStates() and Child::killAll() are thus if * particular interest; thread handling is still entirely up to you. */ pub static sir fork(void); Which kind of programs cannot be written with this restriction? |
|||
msg136045 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-05-15 18:51 | |
Is it possible the following issue is related to this one? http://bugs.python.org/issue10037 - "multiprocessing.pool processes started by worker handler stops working" |
|||
msg136047 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-05-15 19:39 | |
> Is it possible the following issue is related to this one? It's hard to tell, the original report is rather vague. But the comment about the usage of the maxtasksperchild argument reminds me of issue #10332 "Multiprocessing maxtasksperchild results in hang": basically, there's a race window in the Pool shutdown code where worker threads having completed their work can exit without being replaced. So the connection with the current issue does not strike me, but you never know (that's the problem with those nasty race conditions ;-). Concerning this issue, here's an updated patch. I removed calls to pthread_mutex_destroy/pthread_condition_destroy/sem_destroy from the reinit functions: the reason is that I experienced a deadlock in test_concurrent_futures with the emulated semaphore code on Linux/NPTL inside pthread_condition_destroy: the new version strictly mimics what's done in glibc's malloc, and just calls pthrad_mutex_init and friends. It's safe, and shouldn't leak resources (and even if it does, it's way better than a deadlock). I also placed the note on the interaction between locks and fork() at the top of Python/thread_pthread.h. |
|||
msg136120 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-05-16 18:57 | |
Steffen, can you explain in layman's terms? On Sun, May 15, 2011 at 8:03 PM, Steffen Daode Nurpmeso <report@bugs.python.org> wrote: > > @ Charles-François Natali wrote (2011-05-15 01:14+0200): >> So if we really wanted to be safe, the only solution would be to >> forbid fork() in a multi-threaded program. >> Since it's not really a reasonable option > > But now - why this? The only really acceptable thing if you have > control about what you are doing is the following: > > class SMP::Process > /*! > * \brief Daemonize process. > *[.] > * \note > * The implementation of this function is not trivial. > * To avoid portability no-goes and other such problems, > * you may \e not call this function after you have initialized > * Thread::enableSMP(), > * nor may there (have) be(en) Child objects, > * nor may you have used an EventLoop! > * I.e., the process has to be a single threaded, "synchronous" one. > * [.] > */ > pub static si32 daemonize(ui32 _daemon_flags=df_default); > > namespace SMP::POSIX > /*! > * \brief \fn fork(2). > *[.] > * Be aware that this passes by all \SMP and Child related code, > * i.e., this simply \e is the system-call. > * Signal::resetAllSignalStates() and Child::killAll() are thus if > * particular interest; thread handling is still entirely up to you. > */ > pub static sir fork(void); > > Which kind of programs cannot be written with this restriction? |
|||
msg136147 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-05-17 10:35 | |
@ Nir Aides wrote (2011-05-16 20:57+0200): > Steffen, can you explain in layman's terms? I am the layman here. Charles-François has written a patch for Python which contradicted his own proposal from msg135079, but he seems to have tested a lot so that he then was even able to prove that his own proposal was correct. His new patch does implement that with a nice introductional note. He has also noticed that the only really safe solution is to simply disallow multi-threading in programs which fork(). And this either-or is exactly the conclusion we have taken and implemented in our C++ library - which is not an embeddable programming language that needs to integrate nicely in whatever environment it is thrown into, but "even replaces main()". And i don't know any application which cannot be implemented regardless of fork()-or-threads instead of fork()-and-threads. (You *can* have fork()+exec()-and-threads at any time!) So what i tried to say is that it is extremely error-prone and resource intensive to try to implement anything that tries to achieve both. I.e. on Solaris they do have a forkall() and it seems they have atfork handlers for everything (and even document that in the system manual). atfork handlers for everything!! And for what? To implement a standart which is obviously brain-dead because it is *impossible* to handle it - as your link has shown this is even confessed by members of the committee. And writing memory in the child causes page-faults. That's all i wanted to say. (Writing this mail required more than 20 minutes, the mentioned one was out in less than one. And it is much more meaningful AFAIK.) |
|||
msg139084 - (view) | Author: Giovanni Bajo (Giovanni.Bajo) | Date: 2011-06-25 15:15 | |
If there's agreement that the general problem is unsolvable (so fork and threads just don't get along with each other), what we could attempt is trying to limit the side effects in the standard library, so that fewest users as possible are affected by this problem. For instance, having deadlocks just because of print statements sounds like a bad QoI that we could attempt to improve. Is there a reason while BufferedIO needs to hold its internal data-structure lock (used to make it thread-safe) while it's doing I/O and releasing the GIL? I would think that it's feasible to patch it so that its internal lock is only used to synchronize accesses to the internal data structures, but it is never held while I/O is performed (and thus the GIL is released -- at which point, if another threads forks, the problem appears). |
|||
msg139245 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-06-27 08:30 | |
> If there's agreement that the general problem is unsolvable (so fork and > threads just don't get along with each other), what we could attempt is > trying to limit the side effects in the standard library, so that fewest > users as possible are affected by this problem. Actually, I think Charles-François' suggested approach is a good one. > For instance, having deadlocks just because of print statements sounds > like a bad QoI that we could attempt to improve. Is there a reason while > BufferedIO needs to hold its internal data-structure lock (used to make > it thread-safe) while it's doing I/O and releasing the GIL? I would think > that it's feasible to patch it so that its internal lock is only used to > synchronize accesses to the internal data structures, but it is never > held while I/O is performed (and thus the GIL is released -- at which > point, if another threads forks, the problem appears). Not really. Whether you update the internal structures depends on the result of the I/O (so that e.g. two threads don't flush the same buffer simultaneously). Also, finer-grained locking is always a risky endeavour and we already have a couple of bugs to fix in the current buffered I/O implementation :-/ |
|||
msg139470 - (view) | Author: Tomaž Šolc (avian) | Date: 2011-06-30 10:05 | |
The way I see it is that Charles-François' patch trades a possibility of a deadlock for a possibility of a child process running with inconsistent states due to forcibly reinitialized locks. Personally, I would rather have an occasional deadlock than an occasional random crash. I don't like increasing complexity with fine-grained locking either. While the general case is unsolvable what Giovanni proposed at least solves the specific case where only the basic IO code is involved after a fork. In hindsight the only real life use-case I can find that it would solve is doing an exec() right after a fork(). There are quite a few bugs in the tracker that seem to have this same root cause, so it appears the impossibility of cleanly handling threads and forks is not something people are generally aware of. Since I think we agree you can't just disable fork() when multiple threads are running, how about at least issuing a warning in that case? That would be a two-line change in threading.py. |
|||
msg139474 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-06-30 12:55 | |
> The way I see it is that Charles-François' patch trades a possibility of a deadlock for a possibility of a child process running with inconsistent states due to forcibly reinitialized locks. > Yeah, that's why I let this stale: that's really an unsolvable problem in the general case. Don't mix fork() and threads, that's it. > I don't like increasing complexity with fine-grained locking either. While the general case is unsolvable what Giovanni proposed at least solves the specific case where only the basic IO code is involved after a fork. In hindsight the only real life use-case I can find that it would solve is doing an exec() right after a fork(). > Antoine seems to think that you can't release the I/O locks around I/O syscalls (when the GIL is released). But I'm sure that if you come up with a working patch it'll get considered for inclusion ;-) > Since I think we agree you can't just disable fork() when multiple threads are running, how about at least issuing a warning in that case? That would be a two-line change in threading.py. You mean a runtime warning? That would be ugly and clumsy. A warning is probably a good idea, but it should be added somewhere in os.fork() and threading documentation. |
|||
msg139480 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-06-30 14:16 | |
Well, I ping my view that we should: 1) Add general atfork() mechanism. 2) Dive into the std lib and add handlers one by one, that depending on case, either do the lock/init thing or just init the state of the library to some valid state in the child. Once this mechanism is in place and committed with a few obvious handlers such as the one for the logging library, other handlers can be added over time. Following this path we will slowly resolve the problem, handler by handler, without introducing the invalid state problem. |
|||
msg139485 - (view) | Author: Tomaž Šolc (avian) | Date: 2011-06-30 15:04 | |
> You mean a runtime warning? That would be ugly and clumsy. > A warning is probably a good idea, but it should be added somewhere in os.fork() and threading documentation. I was thinking about a run time warning that is emitted if you call os.fork() while multiple threads are active. It is ugly, but at least it tells you you are doing something that will in most cases not work correctly. I certainly agree that a warning should also be added to os.fork() documentation. I'm attaching an example patch that adds it into _after_fork() in threading.py, but there are a number of other places where it might go instead. I believe that the comp.programming.threads post from David Butenhof linked above explains why adding atfork() handlers isn't going to solve this. |
|||
msg139488 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-06-30 15:08 | |
> I was thinking about a run time warning that is emitted if you call > os.fork() while multiple threads are active. It is ugly, but at least > it tells you you are doing something that will in most cases not work > correctly. The problem is that multiprocessing itself, by construction, uses fork() with multiple threads. Perhaps there's a way to use only non-blocking communication instead (rendering the helper threads useless), but that's not a trivial change. |
|||
msg139489 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-06-30 15:10 | |
> I believe that the comp.programming.threads post from > David Butenhof linked above explains why adding atfork() > handlers isn't going to solve this. In Python atfork() handlers will never run from signal handlers, and if I understood correctly, Charles-François described a way to "re-initialize" a Python lock safely under that assumption. |
|||
msg139509 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-06-30 19:05 | |
My suggestion to this would be that it should be outdated in the same way that Georg Brandl has suggested for changing the default encoding on python-dev [1], and add clear documentation on that, also in respect to the transition phase .. > The problem is that multiprocessing itself, by construction, > uses fork() with multiple threads. .. and maybe add some switches which allow usage of fork() for the time being. Today a '$ grep -Fir fork' does not show up threading.rst at all, which seems to be little in regard to the great problem. I would add a big fat note that multiprocessing.Process should be used instead today, because how about those of us who are not sophisticated enough to be appointed to standard committees? But anyway we should be lucky: fork(2) is UNIX specific, and thus it can be expected that all thread-safe libraries etc. are aware of the fact that they may be cloned by it. Except mine, of course. ,~) [1] http://mail.python.org/pipermail/python-dev/2011-June/112126.html -- Ciao, Steffen sdaoden(*)(gmail.com) () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments |
|||
msg139511 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-06-30 19:14 | |
> I would add a big fat note that multiprocessing.Process should be > used instead today, because how about those of us who are not > sophisticated enough to be appointed to standard committees? How do you think multiprocessing.Process launches a new process? |
|||
msg139521 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-06-30 20:25 | |
> How do you think multiprocessing.Process launches a new process? But it's a single piece of code under control and even multi-OS/multi-architecture test-coverage, not a general purpose "Joe, you may just go that way and Python will handle it correctly"? What i mean is: ten years ago (or so), Java did not offer true selection on sockets (unless i'm mistaken) - servers needed a 1:1 mapping of threads:sockets to handle connections?! But then, a "this thread has finished the I/O, let's use it for something different" seems to be pretty obvious. This is ok if it's your professor who is forcefully misleading you into the wrong direction, but otherwise you will have problems, maybe sooner, maybe later (, maybe never). And currently there is not a single piece of documentation which points you onto the problems. (And there *are* really people without Google.) The problem is that it looks so simple and easy - but it's not. In my eyes it's an unsolvable problem. And for the sake of resource usage, simplicity and execution speed i appreciate all solutions which don't try to do the impossible. I want to add that all this does not really help just as long just *any* facility which is used by Python *itself* is not under control of atfork. Solaris e.g. uses atfork for it's memory allocator, because that is surely needed if anything else but async-safe facilities are used in the newly forked process. Can Python give that guarantee for all POSIX systems it supports? Good night. -- Ciao, Steffen sdaoden(*)(gmail.com) () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments |
|||
msg139522 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-06-30 20:53 | |
> > How do you think multiprocessing.Process launches a new process? > > But it's a single piece of code under control and even > multi-OS/multi-architecture test-coverage, not a general purpose > "Joe, you may just go that way and Python will handle it > correctly"? Sorry, how does that make the problem any different? |
|||
msg139584 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-07-01 15:23 | |
> Well, I ping my view that we should: > Could you please detail the following points: - what would be the API of this atfork() mechanism (with an example of how it would be used in the library)? - how do you find the correct order to acquire locks in the parent process? - what do you do with locks that can be held for arbitrarily long (e.g. I/O locks)? |
|||
msg139599 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-01 19:50 | |
> - what would be the API of this atfork() mechanism (with an example of how it would be used in the library)? The atfork API is defined in POSIX and Gregory P. Smith proposed a Python one above that we can look into. http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html We may need an API to reset a lock. > - how do you find the correct order to acquire locks in the parent process? One option is to use the import graph to determine call order of atfork handlers. If a current std library does not fit into this scheme we can possibly fix it when writing its handlers. > - what do you do with locks that can be held for arbitrarily long (e.g. I/O locks)? It is likely that such a lock does not need acquiring at the parent, and re-initializing the library in the child handler will do. A "critical section" lock that protects in-memory data should not be held for long. |
|||
msg139608 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-07-01 21:09 | |
>> - how do you find the correct order to acquire locks in the parent process? > > One option is to use the import graph to determine call order of atfork handlers. > If a current std library does not fit into this scheme we can possibly fix it when writing its handlers. > Sorry, I fail to see how the "import graph" is related to the correct lock acquisition order. Some locks are created dynamically, for example. That's why I asked for a specific API: when do you register a handler? When are they called? When are they reset? >> - what do you do with locks that can be held for arbitrarily long (e.g. I/O locks)? > > It is likely that such a lock does not need acquiring at the parent, and re-initializing the library in the child handler will do. The whole point of atfork is to avoid breaking invariants and introduce invalid state in the child process. If there is one thing we want to avoid, it's precisely reading/writting corrupted data from/to files, so eluding the I/O problem seems foolish to me. > A "critical section" lock that protects in-memory data should not be held for long. Not necessarily. See for example I/O locks and logging module, which hold locks until I/O completion. |
|||
msg139800 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-04 19:41 | |
> Sorry, I fail to see how the "import graph" is related to the correct > lock acquisition order. Some locks are created dynamically, for > example. Import dependency is a reasonable heuristic to look into for inter-module locking order. The rational is explained in the following pthread_atfork man page: http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html "A higher-level package may acquire locks on its own data structures before invoking lower-level packages. Under this scenario, the order specified for fork handler calls allows a simple rule of initialization for avoiding package deadlock: a package initializes all packages on which it depends before it calls the pthread_atfork() function for itself." (The rational section is an interpretation which is not part of the standard) A caveat is that since Python is an object oriented language it is more common than with C that code from a higher level module will be invoked by code from a lower level module, for example by calling an object method that was over-ridden by the higher level module - this actually happens in the logging module (emit method). > That's why I asked for a specific API: when do you register a handler? > When are they called? When are they reset? Read the pthread_atfork man page. > The whole point of atfork is to avoid breaking invariants and > introduce invalid state in the child process. If there is one thing we > want to avoid, it's precisely reading/writting corrupted data from/to > files, so eluding the I/O problem seems foolish to me. Please don't use insulting adjectives. If you think I am wrong, convincing me logically will do. you can "avoid breaking invariants" using two different strategies: 1) Acquire locks before the fork and release/reset them after it. 2) Initialize the module to some known state after the fork. For some (most?) modules it may be quite reasonable to initialize the module to a known state after the fork without acquiring its locks before the fork; this too is explained in the pthread_atfork man page: "Alternatively, some libraries might be able to supply just a child routine that reinitializes the mutexes in the library and all associated states to some known value (for example, what it was when the image was originally executed)." > > A "critical section" lock that protects in-memory data should not be held for long. > > Not necessarily. See for example I/O locks and logging module, which > hold locks until I/O completion. Oops, I have always used the term "critical section" to describe a lock that protects data state as tightly as possible, ideally not even across function calls but now I see the Wikipedia defines one to protect any resource including IO. The logging module locks the entire emit() function which I think is wrong. It should let the derived handler take care of locking when it needs to, if it needs to at all. The logging module is an example for a module we should reinitialize after the fork without locking its locks before the fork. |
|||
msg139808 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-07-04 21:22 | |
[...] > A caveat is that since Python is an object oriented language it is more common than with C that code from a higher level module will be invoked by code from a lower level module, for example by calling an object method that was over-ridden by the higher level module - this actually happens in the logging module (emit method). Exactly. That's why registering atfork() handler in independent modules can - and will - lead to deadlocks, if we get the order wrong. Also, most locks are allocated dynamically (at the same time as the object they protect), so the import order is not really relevant here. Furthermore, there's not a strict ordering between the modules: how is bz2 compared to loglib, for example? > >> That's why I asked for a specific API: when do you register a handler? >> When are they called? When are they reset? > > Read the pthread_atfork man page. > No, it won't do it, since when an object - and its protecting lock - is deallocated, the related atfork handler must be removed, for example. You might handle this with wearefs, but that's definitely something not accounted for by the pthread_atfork standard API. >> The whole point of atfork is to avoid breaking invariants and >> introduce invalid state in the child process. If there is one thing we >> want to avoid, it's precisely reading/writting corrupted data from/to >> files, so eluding the I/O problem seems foolish to me. > > Please don't use insulting adjectives. > If you think I am wrong, convincing me logically will do. > I'm sorry if that sounded insulting to you, it was really unintentional (English is not my mother tongue). > you can "avoid breaking invariants" using two different strategies: > 1) Acquire locks before the fork and release/reset them after it. > 2) Initialize the module to some known state after the fork. > > For some (most?) modules it may be quite reasonable to initialize the module to a known state after the fork without acquiring its locks before the fork; this too is explained in the pthread_atfork man page: > "Alternatively, some libraries might be able to supply just a child routine that reinitializes the mutexes in the library and all associated states to some known value (for example, what it was when the image was originally executed)." > The most problematic place is the I/O layer, since those are the locks held longer (see for example issue #7123). And I'm not sure we can simply reinit the I/O object after fork() without corrupting or losing data. But this approach (reinitializing after fork()) works well most of the time, and is actually already used in multiple places (threading and multiprocessing modules, and probably others). > Oops, I have always used the term "critical section" to describe a lock that protects data state as tightly as possible, ideally not even across function calls but now I see the Wikipedia defines one to protect any resource including IO. > Yes, that's one peculiarity of Python locks. Another one is that a lock can be released by a process other than the one who acquired it. > The logging module locks the entire emit() function which I think is wrong. > It should let the derived handler take care of locking when it needs to, if it needs to at all. > > The logging module is an example for a module we should reinitialize after the fork without locking its locks before the fork. It's possible. Like I said earlier in this thread, I'm not at all opposed to the atfork mechanism. I'm actually in favor of it, the first reason being that we could factorize the various ad-hoc atfork handlers scattered through the standard library. My point is just that it's not as simple as it sounds because of long-held locks, and that we've got to be really careful because of inter-module dependencies. Would you like to work on a patch to add an atfork mechanism? |
|||
msg139850 - (view) | Author: Tomaž Šolc (avian) | Date: 2011-07-05 11:20 | |
Except for multiprocessing, does anyone know of any other module in the standard library that uses fork() and threads at the same time? After some grepping through the source I couldn't find any other cases. I'm still in favor of just deprecating using fork() on a multithreaded process (with appropriate warnings and documentation) and I'm prepared to work on a patch that would remove the need for helper threads in the multiprocessing module. I gather that having atfork would be useful beyond attempting to solve the locking problem, so this doesn't mean I'm opposed to it. However debugging rare problems in multithreaded/multiprocess applications is such a frustrating task that I really don't like a solution that only works in the common case. > In Python atfork() handlers will never run from signal handlers, and > if I understood correctly, Charles-François described a way to > "re-initialize" a Python lock safely under that assumption. Just to clarify: it's not that POSIX atfork() handlers run from signal handlers. It's that after a fork in a multithreaded process POSIX only guarantees calls to "safe" functions, which is the same set of functions as those that are safe to call from signal handlers. This fact does not change for Python's os.fork(). |
|||
msg139852 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2011-07-05 11:29 | |
> Except for multiprocessing, does anyone know of any other module in > the standard library that uses fork() and threads at the same time? > After some grepping through the source I couldn't find any other > cases. It's quite common to launch a subprocess from a thread, so as to communicate with the subprocess without blocking the main thread. I'm not sure the stdlib itself does it, but the test suite does (when run in parallel mode). > I'm prepared to work on a patch that would remove the need for helper > threads in the multiprocessing module. Your contribution is welcome. > Just to clarify: it's not that POSIX atfork() handlers run from signal > handlers. It's that after a fork in a multithreaded process POSIX only > guarantees calls to "safe" functions, which is the same set of > functions as those that are safe to call from signal handlers. For the record, I would consider POSIX totally broken from this point of view. It seems most modern systems allow much more than that, fortunately. |
|||
msg139858 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-07-05 11:47 | |
> Except for multiprocessing, does anyone know of any other module in the standard library that uses fork() and threads at the > same time? After some grepping through the source I couldn't find any other cases. The same problem arises in case of user-created threads, this problem is not specific to the multiprocessing. > Just to clarify: it's not that POSIX atfork() handlers run from signal handlers. It's that after a fork in a multithreaded process POSIX only guarantees calls to "safe" functions, which is the same set of functions as those that are safe to call from signal handlers. This fact does not change for Python's os.fork(). > I think Nir knows perfectly that, he was just referring to a limitation of pthread_atfork: - fork() is async-safe, and thus can be called from a signal handler - but if pthread_atfork handlers are installed, then fork() can become non async-safe, if the handlers are not async-safe (and it's the case when you're dealing with POSIX mutexes for example) But since Python's user-defined signal handlers are actually called synchronously (and don't run on behalf of the signal handler), there's no risk of fork() being called from a signal handler. > I'm still in favor of just deprecating using fork() on a multithreaded process (with appropriate warnings and documentation) We can't do that, it would break existing code. Furthermore, some libraries use threads behind the scene. > I'm prepared to work on a patch that would remove the need for helper threads in the multiprocessing module. What do you mean by helper threads? |
|||
msg139869 - (view) | Author: Tomaž Šolc (avian) | Date: 2011-07-05 13:04 | |
> We can't do that, it would break existing code. I would argue that such code is already broken. > What do you mean by helper threads? multiprocessing uses threads behind the scenes to handle queue traffic and such for individual forked processes. It's something I also wasn't aware of until Antoine pointed it out. It also has its own implementation of atfork hooks in an attempt to handle the locking issue. |
|||
msg139897 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-07-05 16:33 | |
>> We can't do that, it would break existing code. > > I would argue that such code is already broken. > - that's not necessarily true, if your code is carefully designed - we can't forbid fork() in a multi-threaded application while it's allowed by POSIX - backward compatibility is *really* important >> What do you mean by helper threads? > > multiprocessing uses threads behind the scenes to handle queue traffic and such for individual forked processes. It's something I also wasn't aware of until Antoine pointed it out. It also has its own implementation of atfork hooks in an attempt to handle the locking issue. > I'm curious as to how you'll manage to implement multiprocessing.queues without threads. Please open a dedicated issue for this. |
|||
msg139929 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-06 08:55 | |
> Would you like to work on a patch to add an atfork mechanism? I will start with an attempt to generate a summary "report" on this rabbit hole of a problem, the views and suggestions raised by people here and what we may expect from atfork approach, its limitations, etc... I will also take a deeper look into the code. Hopefully my brain will not deadlock or fork while I am at it. more words, I know... |
|||
msg140215 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-12 20:57 | |
Well, my brain did not deadlock, but after spinning on the problem for a while longer, it now thinks Tomaž Šolc and Steffen are right. We should try to fix the multiprocessing module so it does not deadlock single-thread programs and deprecate fork in multi-threaded programs. Here is the longer version, which is a summary of what people said here in various forms, observations from diving into the code and Googling around: 1) The rabbit hole a) In a multi-threaded program, fork() may be called while another thread is in a critical section. That thread will not exist in the child and the critical section will remain locked. Any attempt to enter that critical section will deadlock the child. b) POSIX included the pthread_atfork() API in an attempt to deal with the problem: http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html c) But it turns out atfork handlers are limited to calling async-signal-safe functions since fork may be called from a signal handler: http://download.oracle.com/docs/cd/E19963-01/html/821-1601/gen-61908.html#gen-95948 This means atfork handlers may not actually acquire or release locks. See opinion by David Butenhof who was involved in the standardization effort of POSIX threads: http://groups.google.com/group/comp.programming.threads/msg/3a43122820983fde d) One consequence is that we can not assume third party libraries are safe to fork in multi-threaded program. It is likely their developers consider this scenario broken. e) It seems the general consensus across the WWW concerning this problem is that it has no solution and that a fork should be followed by exec as soon as possible. Some references: http://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html http://austingroupbugs.net/view.php?id=62 http://sourceware.org/bugzilla/show_bug.cgi?id=4737 http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them 2) Python’s killer rabbit The standard library multiprocessing module does two things that force us into the rabbit hole; it creates worker threads and forks without exec. Therefore, any program that uses the multiprocessing module is a multi-threading forking program. One immediate problem is that a multiprocessing.Pool may fork from its worker thread in Pool._handle_workers(). This puts the forked child at risk of deadlock with any code that was run by the parent’s main thread (the entire program logic). More problems may be found with a code review. Other modules to look at are concurrent.futures.process (creates a worker thread and uses multiprocessing) and socketserver (ForkingMixIn forks without exec). 3) God bless the GIL a) Python signal handlers run synchronously in the interpreter loop of the main thread, so os.fork() will never be called from a POSIX signal handler. This means Python atfork prepare and parent handlers may run any code. The code run at the child is still restricted though and may deadlock into any acquired locks left behind by dead threads in the standard library or lower level third party libraries. b) Turns out the GIL also helps by synchronizing threads. Any lock held for the duration of a function call while the GIL is held will be released by the time os.fork() is called. But if a thread in the program calls a function that yields the GIL we are in la la land again and need to watch out step. 4) Landing gently at the bottom a) I think we should try to review and sanitize the worker threads of the multiprocessing module and other implicit worker threads in the standard library. Once we do (and since os.fork() is never run from a POSIX signal handler) the multiprocessing library should be safe to use in programs that do not start other threads. b) Then we should declare the user scenario of mixing the threading and multiprocessing modules as broken by design. c) Finally, optionally provide atfork API The atfork API can be used to refactor existing fork handlers in the standard library, provide handlers for modules such as the logging module that will reduce the risk of deadlock in existing programs, and can be used by developers who insist on mixing threading and forking in their programs. 5) Sanitizing worker threads in the multiprocessing module TODO :) (will try to post some ideas soon) |
|||
msg140402 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-15 11:17 | |
Here is a morning reasoning exercise - please help find the flaws or refine it: 5) Sanitizing worker threads in the multiprocessing module Sanitizing a worker thread in the context of this problem is to make sure it can not create a state that may deadlock another thread that calls fork(); or in other words fork-safe. Keep in mind that in Python os.fork() is never called from a POSIX signal handler. So what are examples of a fork-safe thread? a) A thread that spins endlessly doing nothing in a C for(;;) loop is safe. Another thread may call fork() without restrictions. b) A Python thread that only calls function that do not yield the GIL and that does not acquire locks that are held beyond a Python tick is safe. An example for such a lock is a critical-section lock acquired by a lower level third party library for the duration of a function call. Such a lock will be released by the time os.fork() is called because of the GIL. c) A Python thread that in addition to (2) also acquires a lock that is handled at fork is safe. d) A Python thread that in addition to (2) and (3) calls function that yield the GIL but while the GIL is released only calls async-signal-safe code. This is a bit tricky. We know that it is safe for thread A to fork and call async-signal-safe functions regardless of what thread B has been doing, but I do not know that thread A can fork and call non async-signal-safe functions if thread B was only calling async-signal-safe functions. Nevertheless it makes sense: For example lets assume it isn't true, and that hypothetical thread A forked while thread B was doing the async-signal-safe function safe_foo(), and then thread A called non async-signal-safe function unsafe_bar() and deadlocked. unsafe_bar() could have deadlocked trying to acquire a lock that was acquired by safe_foo(). But if this is so, then it could also happen the other way around. Are there other practical possibilities? Either way, we could double check and white list the async-signal-safe functions we are interested in, in a particular implementation. e) Socket related functions such as bind() accept() send() and recv(), that Python calls without holding the GIL, are all async-signal-safe. This means that in principle we can have a fork-safe worker thread for the purpose of communicating with a forked process using a socket. |
|||
msg140550 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2011-07-18 00:27 | |
No Python thread is ever fork safe because the Python interpreter itself can never be made fork safe. Nor should anyone try to make the interpreter itself safe. It is too complex and effectively impossible to guarantee. There is no general solution to this, fork and threading is simply broken in POSIX and no amount of duct tape outside of the OS kernel can fix it. My only desire is that we attempt to do the right thing when possible with the locks we know about within the standard library. |
|||
msg140658 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-07-19 12:32 | |
If Nir's analysis is right, and Antoines comment pushes me into this direction, (i personally have not looked at that code), then multiprocessing is completely brain-damaged and has been implemented by a moron. And yes, I know this is a bug tracker, and even that of Python. Nir should merge his last two messages into a single mail to python-dev, and those guys should give Nir or Thomas or a group of persons who have time and mental power a hg(1) repo clone and committer access to that and multiprocessing should be rewritten, maybe even from scratch, but i dunno. For the extremely unlikely case that all that doesn't happen maybe the patch of neologix should make it? --Steffen Ciao, sdaoden(*)(gmail.com) () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments |
|||
msg140659 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-07-19 12:54 | |
Um, and just to add: i'm not watching out for anything, and it won't and it can't be me: ?0%0[steffen@sherwood sys]$ grep -F smp CHANGELOG.svn -B3 | grep -E '^r[[:digit:]]+' | tail -n 1 r162 | steffen | 2006-01-18 18:29:58 +0100 (Wed, 18 Jan 2006) | 35 lines --Steffen Ciao, sdaoden(*)(gmail.com) () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments |
|||
msg140668 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-19 14:11 | |
> then multiprocessing is completely brain-damaged and has been > implemented by a moron. Please do not use this kind of language. Being disrespectful to other people hurts the discussion. |
|||
msg140689 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-07-19 19:04 | |
P.S.: I have to apologize, it's Tomaž, not Thomas. (And unless i'm mistaken this is pronounced "TomAsch" rather than the english "Tommes", so i was just plain wrong.) --Steffen Ciao, sdaoden(*)(gmail.com) () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments |
|||
msg140690 - (view) | Author: Steffen Daode Nurpmeso (sdaoden) | Date: 2011-07-19 19:14 | |
> > then multiprocessing is completely brain-damaged and has been > > implemented by a moron. > > Please do not use this kind of language. > Being disrespectful to other people hurts the discussion. So i apologize once again. 'Still i think this should go to python-dev in the mentioned case. (BTW: there are religions without "god", so whom shall e.g. i praise for the GIL?) --Steffen Ciao, sdaoden(*)(gmail.com) () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments |
|||
msg140691 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-19 19:45 | |
> (BTW: there are religions without "god", so whom shall e.g. i praise for the GIL?) Guido? ;) |
|||
msg141286 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-07-28 08:26 | |
Hi Gregory, > Gregory P. Smith <greg@krypto.org> added the comment: > No Python thread is ever fork safe because the Python interpreter itself can never be made fork safe. > Nor should anyone try to make the interpreter itself safe. It is too complex and effectively impossible to guarantee. a) I think the term "Guarantee" is not meaningful here since the interpreter is probably too complex to guarantee it does not contain other serious problems. b) If no Python thread is ever fork safe, can you illustrate how a trivial Python thread spinning endlessly might deadlock a child forked by another Python thread? I was not able to find reports of deadlocks clearly related to multiprocessing worker threads so they could be practically safe already, to the point other Python-Dev developers would be inclined to bury this as a theoretical problem :) Anyway, there exists at least the problem of forking from the pool worker thread and possibly other issues, so the code should be reviewed. Another latent problem is multiprocessing logging which is disabled by default? > There is no general solution to this, fork and threading is simply broken in POSIX and no amount of duct tape outside of the OS kernel can fix it. This is why we should "sanitize" the multithreading module and deprecate mixing of threading and multiprocessing. I bet most developers using Python are not even aware of this problem. We should make sure they are through documentation. Here is another way to look at the current situation: 1) Don't use threading for concurrency because of the GIL. 2) Don't mix threading with multiprocessing because threading and forking don't mix well. 3) Don't use multiprocessing because it can deadlock. We should make sure developers are aware of (2) and can use (3) safely***. > My only desire is that we attempt to do the right thing when possible with the locks we know about within the standard library. Right, with an atfork() mechanism. |
|||
msg143174 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2011-08-29 19:04 | |
multiprocessing.util already has register_after_fork() which it uses for cleaning up certain things when a new process (launched by multiprocessing) is starting. This is very similar to the proposed atfork mechanism. Multiprocessing assumes that it is always safe to delete lock objects. If reinit_locks.diff is committed then I guess this won't be a problem. I will try to go through multiprocessing's use of threads: Queue ----- Queue's have a feeder thread which pushes objects in to the underlying pipe as soon as possible. The state which can be modified by this thread is a threading.Condition object and a collections.deque buffer. Both of these are replaced by fresh copies by the after-fork mechanism. However, because objects in the buffer may have __del__ methods or weakref callbacks associated, arbitrary code may be run by the background thread if the reference count falls to zero. Simply pickling the argument of put() before adding it to the buffer fixes that problem -- see the patch for Issue 10886. With this patch I think Queue's use of threads is fork-safe. Pool ---- If a fork occurs while a pool is running then a forked process will get a copy of the pool object in an inconsistent state -- but that does not matter since trying to use a pool from a forked process will *never* work. Also, some of a pool's methods support callbacks which can execute arbitrary code in a background thread. This can create inconsistent state in a forked process As with Queue.put, pool methods should pickle immediately for similar reasons. I would suggest documenting clearly that a pool should only ever be used or deleted by the process which created it. We can use register_after_fork to make all of a pool's methods raise an error after a fork. We should also document that callbacks should only be used if no more processes will be forked. allow_connection_pickling ------------------------- Currently multiprocessing.allow_connection_pickling() does not work because types are registered with ForkingPickler instead of copyreg -- see Issue 4892. However, the code in multiprocessing.reduction uses a background thread to support the transfer of sockets/connections between processes. If this code is ever resurrected I think the use of register_after_fork makes this safe. Managers -------- A manager uses a threaded server process. This is not a problem unless you create a user defined manager which forks new processes. The documentation should just say Don't Do That. I think multiprocessing's threading issues are all fixable. |
|||
msg143274 - (view) | Author: Nir Aides (nirai) ![]() |
Date: 2011-08-31 18:25 | |
For the record, turns out there was a bit of misunderstanding. I used the term deprecate above to mean "warn users (through documentation) that they should not use (a feature)" and not in its Python-dev sense of "remove (a feature) after a period of warning". I do not think the possibility to mix threading and multiprocessing together should be somehow forcibly disabled. Anyway, since my view does not seem to resonate with core developers I I'll give it a rest for now. |
|||
msg143279 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2011-08-31 21:02 | |
> Anyway, since my view does not seem to resonate with core developers I I'll > give it a rest for now. Well, the problem is that many views have been expressed in this thread, which doesn't help getting a clear picture of what's needed to make progress on this issue. AFAIC, I think the following seems reasonable: 1) add an atfork module which provides a generic and pthread_atfork-like mechanism to setup handlers that must be called after fork (right now several modules use their own ad-hoc mechanism) 2) for multiprocessing, call exec() after fork() (issue #8713) 3) for buffered file objects locks, use the approach similar to the patch I posted (reinit locks in the child process right after fork()) Does that sound reasonable? |
|||
msg151168 - (view) | Author: (lesha) | Date: 2012-01-13 11:02 | |
Just wanted to say that I spent something like 8 hours debugging a subprocess + threading + logging deadlock on a real production system. I suspected one of my locks at first, but I couldn't find any. The post-fork code was very simple, and I didn't suspect that logging would be subject to the same issue. The good news that I see a very clean solution for fixing this. We can't free all locks across fork -- that is unsafe and mad, because the child might end up corrupting some shared (network) resource, for example/ However, extending RLock to provide ForkClearedRLock (this would be used by logging, i.e.) is quite straighforward. The extended class would simply need to record the process ID, in which the lock was created, and the process ID, which is trying to acquire it. Done! |
|||
msg151266 - (view) | Author: Charles-François Natali (neologix) * ![]() |
Date: 2012-01-14 19:32 | |
> However, extending RLock to provide ForkClearedRLock (this would be used by logging, i.e.) is quite straighforward. > > The extended class would simply need to record the process ID, in which the lock was created, and the process ID, which is trying to acquire it. Done! There are at least two problems with this approach. - with LinuxThreads, threads have different PIDs, so this would break. LinuxThreads have now been replaced with NPTL, so this may not be a showstopper, though - however, the other problem is more serious: it has the exact same problem as the lock reinitialization upon fork(): locks are used to protect critical sections, to make sure that threads see a consistent state: if you simply proceed and reset/acquire the lock when the process is not the last one that owned it, the invariants protected by the lock will be broken. The proper solution is to setup handlers to be called upon fork, that not only reset locks, but also internal state of objects they protect. However, this is a dull and boring task, and would require patching dozens of different places. It's been on my todo list for some time... Another "solution" would be to remove the only place in the standard library where a bare fork() is used, in multiprocessing (issue #8713). Then, it's the user's problem if he calls fork() :-) |
|||
msg151267 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2012-01-14 19:40 | |
A new lock type will NOT solve this. It is ALWAYS okay to clear all thread/threading module locks after a fork. They are and always have been process-local by definition so they are also by definition 100% invalid to any child process. Anyone who has written code using them to "lock" an out-of-process resource has written code that is already broken today. Thread locks can't guard network resources. |
|||
msg151845 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-01-23 21:23 | |
Attached is a patch (without documentation) which creates an atfork module for Unix. Apart from the atfork() function modelled on pthread_atfork() there is also a get_fork_lock() function. This returns a recursive lock which is held whenever a child process is created using os.fork(), subprocess.Popen() or multiprocessing.Process(). It can be used like with atfork.get_fork_lock(): r, w = os.pipe() pid = os.fork() if pid == 0: try: os.close(r) # do something with w finally: os._exit(0) else: os.close(w) # do something with r This prevents processes forked by other threads from accidentally inheriting the writable end (which would potentially cause EOF to be delayed when reading from the pipe). It can also be used to eliminate the potential race where you create an fd and then set the CLOEXEC flag on it. The patch modifies Popen() and Process.start() to acquire the lock when they create their pipes. (A race condition previously made Process.sentinel and Process.join() potentially unreliable in a multithreaded program.) Note that using the deprecated os.popen?() and os.spawn?() functions can still cause accidental inheritance of fds. (I have also done a hopefully complete patch to multiprocessing to optionally allow fork+exec on Unix -- see Issue 8713.) |
|||
msg151846 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-01-23 21:29 | |
Is there any particular reason not to merge Charles-François's reinit_locks.diff? Reinitialising all locks to unlocked after a fork seems the only sane option. |
|||
msg151853 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2012-01-23 21:58 | |
> Is there any particular reason not to merge Charles-François's reinit_locks.diff? > > Reinitialising all locks to unlocked after a fork seems the only sane option. I agree with this. I haven't looked at the patch very closely. I think perhaps each lock could have an optional callback for specific code to be run after forking, but that may come in another patch. (this would allow to make e.g. the C RLock fork-safe) |
|||
msg161019 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2012-05-17 21:49 | |
Should we go forward on this? |
|||
msg161029 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2012-05-18 01:26 | |
going forward with reinit_locks.diff makes sense. I've added comments to it in the code review link. It is "Patch Set 3" |
|||
msg161389 - (view) | Author: (lesha) | Date: 2012-05-23 02:00 | |
This is a reply to: http://bugs.python.org/issue6721#msg151266 Charles-François raises two problems: 1) LinuxThreads have different PIDs per thread. He then points out that LinuxThreads have long been deprecated. 2) you cannot release locks on fork() because that might let the forked process access protected resources. My replies: (1) Good catch. I suspect that this could be mitigated even if we cared about LinuxThreads. I haven't looked, but there's got to be a way to determine if we are a thread or a fork child. (2) I think I didn't explain my idea very well. I don't mean that we should release *all* locks on fork. That will end in disaster, as Charles-François amply explained. All I meant is that we could introduce a special lock class "ForkClearedRLock" that self-releases on fork(). We could even use Charles-François's reinit magic for this. Using ForkClearedRLock in "logging" would prevent deadlocks. The only potential harm that would come from this is that your logfile might get pretty ugly, i.e. the fork parent and child might be printing simultaneously, resulting in logs like: Error: parentparentparError: childchildchildchildchild entparentparent It's not great, but it's definitely better than deadlocking. I don't think logging can do anything more sensible across fork() anyway. Did this explanation make more sense? |
|||
msg161405 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-05-23 12:49 | |
> (1) Good catch. I suspect that this could be mitigated even if we cared > about LinuxThreads. I haven't looked, but there's got to be a way to > determine if we are a thread or a fork child. Using a generation count would probably work just as well as the PID: main process has generation 0, children have generation 1, grandchildren have generation 2, ... > (2) I think I didn't explain my idea very well. I don't mean that we > should release *all* locks on fork. That will end in disaster, as > Charles-François amply explained. So what are you suggesting? That a lock of the default type should raise an error if you try to acquire it when it has been acquired in a previous process? |
|||
msg161470 - (view) | Author: (lesha) | Date: 2012-05-23 23:39 | |
> So what are you suggesting? That a lock of the default type should > raise an error if you try to acquire it when it has been acquired in a > previous process? I was suggesting a way to make 'logging' fork-safe. No more, no less. Does what my previous comment make sense in light of this? > Using a generation count Sure, that's a good idea. |
|||
msg161953 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-05-30 14:04 | |
> > Is there any particular reason not to merge Charles-François's reinit_locks.diff? > > > > Reinitialising all locks to unlocked after a fork seems the only sane option. > > I agree with this. > I haven't looked at the patch very closely. I think perhaps each lock > could have an optional callback for specific code to be run after > forking, but that may come in another patch. > (this would allow to make e.g. the C RLock fork-safe) An alternative way of handling RLock.acquire() would be to always start by trying a non-blocking acquire while holding the GIL: if this succeeds and self->rlock_count != 0 then we can assume that the lock was cleared by PyThread_ReinitLocks(). |
|||
msg162019 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-05-31 20:48 | |
Attached is an updated version of Charles-François's reinit_locks.diff. Changes: * Handles RLock by assuming that if self->count != 0 when we acquire the lock, then the lock must have been reinitialized by PyThread_ReInitLocks(). * Applies existing fork tests for Lock to RLock. * Fixes capitalization issues with PyThread_ReInitLocks()/PyThread_ReinitLocks(). * Defines PyThread_ReInitLocks() to be empty on non-pthread platforms. Note that RLock._is_owned() is unreliable after a fork until RLock.acquire() has been called. Also, no synchronization has been added for the list of locks. Are PyThread_allocate_lock() and PyThread_free_lock() supposed to be safe to call while not holding the GIL? |
|||
msg162031 - (view) | Author: (lesha) | Date: 2012-05-31 23:17 | |
I am really alarmed by the reinit_locks patches. I scanned the comment history, and looked at the patch. I may have missed something, but it looks to me like the basic behavior is this: "After fork(), all locks are replaced by brand-new lock objects that are NOT locked." *Grim Prediction*: This is going to cause some disastrous, unexpected, and hilarious data loss or corruption for somebody. Here is why: class MySQLConn: def __init__(self): self.lock = Lock() def doWork(self): self.lock.acquire() # do a sequence of DB operations that must not be interrupted, # and cannot be made transactional. self.lock.release() Run this in a thread: def thread1(conn): while True: conn.doWork() time.sleep(0.053) Run this in another thread: def thread2(conn): while True: conn.doWork() time.sleep(0.071) Run this in a third thread: def thread2(): while True: subprocess.call(["ls", "-l"]) time.sleep(0.3) With reinit_locks(), this will eventually break horribly. a) fork() is called with the DB lock held by thread1. b) Some time passes before the child gets to exec(). c) In that time, the child's thread2 gets to doWork(). d) Simultaneously, the parent's doWork is still running and holding a lock. e) Thanks to reinit_locks, the child's thread2 does not have a lock, and it will merrily proceed to corrupt the parent's work. So I think this approach is basically doomed. I think my approach of marking _some_ locks as safe to reinit upon fork is workable (i.e. to solve the bad interaction with logging or import). However, there's also an orthogonal approach that might work well: 1) Right before the first thread gets created in any Python program, fork off a fork() server. From then on, subprocess will only use the fork server to call commands. Thoughts? |
|||
msg162034 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-06-01 00:11 | |
> a) fork() is called with the DB lock held by thread1. > b) Some time passes before the child gets to exec(). > c) In that time, the child's thread2 gets to doWork(). > d) Simultaneously, the parent's doWork is still running and holding a lock. > e) Thanks to reinit_locks, the child's thread2 does not have a lock, and > it will merrily proceed to corrupt the parent's work. You seem to be saying that all three threads survive the fork. I think forkall() on Solaris acts like that, but the normal fork() function does not. Only the thread which performs fork() will survive in the child process. So doWork() never runs in the child process, and the lock is never used in the child process. |
|||
msg162036 - (view) | Author: (lesha) | Date: 2012-06-01 00:25 | |
> I think forkall() on Solaris acts like that, but the normal fork() > function does not. Only the thread which performs fork() will survive > in the child process. Sorry, brain fail. A slightly more contrived failure case is this: subprocess.Popen( ..., preexec_fn=lambda: conn.doWork() ) Everything else is the same. Another failure case is: class MySQLConn: ... doWork as before ... def __del__(self): self.doWork() Followed by: def thread3(conn): while True: subprocess.call(['nonexistent_program']) time.sleep(0.1) The destructor will fire in the child and corrupt the parent's data. An analogous example is: conn = MySQLConn() start_thread1(conn) start_thread2(conn): while True: if os.fork() == 0: # child raise Exception('doom') # triggers destructor Basically, it is really really dangerous to release locks that protect any resources that are not copied by fork (i.e. network resources, files, DB connections, etc, etc). |
|||
msg162038 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2012-06-01 01:16 | |
Anyone using a preexec function in subprocess has already declared that they like deadlocks so that isn't an issue. :) |
|||
msg162039 - (view) | Author: (lesha) | Date: 2012-06-01 01:20 | |
Deadlocks are dandy, but corruption is cruel. |
|||
msg162040 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2012-06-01 01:26 | |
threading locks cannot be used to protect things outside of a single process. Any code using them to do that is broken. In your examples you are suggesting a class that wants to do one or more mysql actions within a destructor and worried that the __del__ method would be called in the fork()'ed child process. With the subprocess module, this will never happen. the child exec's or does a hard exit. http://hg.python.org/cpython/file/bd2c2def77a7/Modules/_posixsubprocess.c#l634 When someone is using os.fork() directly, they are responsible for all destructors in their application behaving sanely within the child process. Destructors are an evil place to put code that does actual work and are best avoided. When required, they must be written defensively because they really cannot depend on much of the Python execution environment around them being in a functional state as they have no control over _when_ they will be called during shutdown. Nothing new here. |
|||
msg162041 - (view) | Author: (lesha) | Date: 2012-06-01 01:51 | |
Re "threading locks cannot be used to protect things outside of a single process": The Python standard library already violates this, in that the "logging" module uses such a lock to protect the file/socket/whatever, to which it is writing. If I had a magic wand that could fix all the places in the world where people do this, I'd accept your argument. In practice, threading locks are abused in this way all the time. Most people don't even think about the interaction of fork and threads until they hit a bug of this nature. Right now, we are discussing a patch that will take broken code, and instead of having it deadlock, make it actually destroy data. I think this is a bad idea. That is all I am arguing. I am glad that my processes deadlocked instead of corrupting files. A deadlock is easier to diagnose. You are right: subprocess does do a hard exit on exceptions. However, the preexec_fn and os.fork() cases definitely happen in the wild. I've done both of these before. I'm arguing for a simple thing: let's not increase the price of error. A deadlock sucks, but corrupted data sucks much worse -- it's both harder to debug, and harder to fix. |
|||
msg162053 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2012-06-01 06:43 | |
We could make any later attempt to acquire or release a lock that was reinitialized while it was held raise an exception. Such exception raising behavior should be conditional at lock construction time; some code (such as logging) never wants to deal with one because it is unacceptable for it to ever fail or deadlock. It should also be possible for the exception to be usefully handled if caught; locks should gain an API to clear their internal "reinitialized while held" flag. Q: Should .release() raise the exception? Or just warn? I'm thinking no exception on release(). Releasing a lock that was held during re-initialization could just reset the "reinitialized while held" flag. The acquire() that would deadlock or crash today would be where raising an exception makes the most sense. Deadlocks are unacceptable. The whole point of this bug is that we can do better. An exception would provide a stack trace of exactly which thing where caused the offending operation so that the code can be fixed to not misuse locks in the first place or that specific lock can be changed to silent reinitialization on fork. (or better yet, the fork can be removed entirely) Both behaviors are better than today. This change would surface bugs in people's code much better than difficult to debug deadlocks. It should be a pretty straightforward change to reinit_locks_2 (Patch Set 6) to behave that way. Looking back, raising an exception is pretty much what I suggested in http://bugs.python.org/issue6721#msg94115 2.5 years ago. |
|||
msg162054 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-06-01 06:58 | |
> conn = MySQLConn() > start_thread1(conn) > start_thread2(conn): > while True: > if os.fork() == 0: # child > raise Exception('doom') # triggers destructor There is no guarantee here that the lock will be held at the time of the fork. So even if we ensure that a lock acquired before the fork stayed lock, we won't necessarily get a deadlock. More importantly, you should never fork without ensuring that you exit with os._exit() or os.exec*(). So your example should be something like conn = MySQLConn() start_thread1(conn) start_thread2(conn): while True: if os.fork() == 0: # child try: raise Exception('doom') # does NOT trigger destructor except: sys.excepthook(*sys.exc_info()) os._exit(1) else: os._exit(0) With this hard exit the destructor never runs. |
|||
msg162063 - (view) | Author: Vinay Sajip (vinay.sajip) * ![]() |
Date: 2012-06-01 09:18 | |
>> Re "threading locks cannot be used to protect things outside of a >> single process": > The Python standard library already violates this, in that the > "logging" module uses such a lock to protect the file/socket/whatever, > to which it is writing. logging is not doing anything to protect things *outside* of a single process - the logging docs make that clear, and give specific recommendations for how to use logging in a multi-process scenario. Logging is just using locks to manage contention between multiple threads in a single process. In that sense, it is no different to any other Python code that uses locks. |
|||
msg162113 - (view) | Author: (lesha) | Date: 2012-06-02 00:33 | |
I feel like I'm failing to get my thesis across. I'll write it out fully: == Thesis start == Basic fact: It is an error to use threading locks in _any_ way after a fork. I think we mostly agree on this. The programs we discussing are **inherently buggy**. We disagree on the right action when such a bug happens. I see 3 possibilities: 1) deadlock (the current behavior, if the lock was held in the parent at the time of fork) 2) continue to execute: a) as if nothing happened (the current behavior, if the lock was not held in the parent) b) throw an Exception (equivalent to a, see below) 3) crash hard. I think both 1 and 3 are tolerable, while 2 is **completely unsafe** because the resulting behavior of the program is unexpected and unpredictable (data corruption, deletion, random actions, etc). == Thesis end == I will now address Gregory's, Richard's, and Vinay's comments in view of this thesis: 1) Gregory suggests throwing an exception when the locks are used in a child. He also discusses some cases, in which he believes one could safely continue execution. My responses: a) Throwing an exception is tatamount to continuing execution. Imagine that the parent has a tempfile RAII object that erases the file after the object disappears, or in some exception handler. The destructor / handler will now get called in the child... and the parent's tempfile is gone. Good luck tracking that one down. b) In general, is not safe to continue execution on release(). If you release() and reinitialize, the lock could still later be reused by both parent and child, and there would still be contention leading to data corruption. c) Re: deadlocks are unacceptable... A deadlock is better than data corruption. Whether you prefer a deadlock or a crash depends on whether your system is set up to dump core. You can always debug a deadlock with gdb. A crash without a core dump is impossible to diagnose. However, a crash is harder to ignore, and it lets the process recover. So, in my book, though I'm not 100% certain: hard crash > deadlock > corruption d) However, we can certainly do better than today: i) Right now, we sometimes deadlock, and sometimes continue execution. It would be better to deadlock always (or crash always), no matter how the child uses the lock. ii) We can log before deadlocking (this is hard in general, because it's unclear where to log to), but it would immensely speed up debugging. iii) We can hard-crash with an extra-verbose stack dump (i.e. dump the lock details in addition to the stack) 2) Richard explains how my buggy snippets are buggy, and how to fix them. I respond: Richard, thanks for explaining how to avoid these bugs! Nonetheless, people make bugs all the time, especially in areas like this. I made these bugs. I now know better, mostly, but I wouldn't bet on it. We should choose the safest way to handle these bugs: deadlocking always, or crashing always. Reinitializing the locks is going to cost Python users a lot more in the long run. Deadlocking _sometimes_, as we do now, is equally bad. Also, even your code is potentially unsafe: when you execute the excepthook in the child, you could be running custom exception logic, or even a custom excepthook. Those could well-intentionedly, but stupidly, destroy some of the parent's valuable data. 3) Vinay essentially says "using logging after fork is user error". I respond: Yes, it is. In any other logging library, this error would only result in mangled log lines, but no lasting harm. In Python, you sometimes get a deadlock, and other times, mangled lines. > logging is not doing anything to protect things *outside* of a single process A file is very much outside a single process. If you are logging to a file, the only correct way is to use a file lock. Thus, I stand by my assertion that "logging" is buggy. Windows programs generally have no problems with this. fork() on UNIX gives you both the rope and the gallows to hang yourself. Specifically for logging, I think reasonable options include: a) [The Right Way (TM)] Using a file lock + CLOEXEC when available; this lets multiple processes cooperate safely. b) It's okay to deadlock & log with an explanation of why the deadlock is happening. c) It's okay to crash with a similar explanation. d) It's pretty okay even to reinitialize logs, although mangled log lines do prevent automated parsing. I really hope that my compact thesis can help us get closer to a consensus, instead of arguing about the details of specific bugs. |
|||
msg162114 - (view) | Author: (lesha) | Date: 2012-06-02 00:42 | |
A slightly more ambitious solution than crashing / deadlocking always is to have Python automatically spawn a "fork server" whenever you start using threads. Then, you would be able to have "subprocess" work cleanly, and not worry about any of this stuff. I don't know if we want to take the perf hit on "import threading" though... |
|||
msg162115 - (view) | Author: (lesha) | Date: 2012-06-02 00:54 | |
Actually, we might be able to automatically spawn a safe fork server _only_ when people start mixing threading and subprocess. I'm not totally sure if this would allow us to salvage multiprocessing as well... The tricky bit is that we'd need to proxy into the fork server all the calls having to do with file descriptors / sockets that we would want to pass into the child processes. That suggests to me that it'll be really hard to do this in a backwards-compatible way. Given that subprocess is a pretty broken library, this might be a good time to replace it anyway. |
|||
msg162117 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2012-06-02 02:15 | |
subprocess has nothing to do with this bug. subprocess is safe as of Python 3.2 (and the subprocess32 backport for 2.x). Its preexec_fn argument is already documented as an unsafe legacy. If you want to replace subprocess, go ahead, write something new and post it on pypi. That is out of the scope of this issue. Look at the original message I opened this bug with. I *only* want to make the standard library use of locks not be a source of deadlocks as it is unacceptable for a standard library itself to force your code to adopt a threads only or a fork only programming style. How we do that is irrelevant; I merely started the discussion with one suggestion. Third party libraries are always free to hang their users however they see fit. If you want to "log" something before deadlocking, writing directly to the stderr file descriptor is the best that can be done. That is what exceptions that escape __del__ destructors do. logging, http.cookiejar, _strptime - all use locks that could be dealt with in a sane manner to avoid deadlocks after forking. Queue, concurrent.futures & threading.Condition - may not make sense to fix as these are pretty threading specific as is and should just carry the "don't fork" caveats in their documentation. (My *real* preference would be to remove os.fork() from the standard library. Not going to happen.) |
|||
msg162120 - (view) | Author: (lesha) | Date: 2012-06-02 02:57 | |
1) I'm totally in favor of making the standard library safe. For that purpose, I think we should do a combination of: a) Use file locks in logging, whenever possible. b) Introduce LockUnsafelyReinitializedAtFork, using a generation counter, or whatever else, which can be used by the few places in the standard library that can safely deal with lock reinitialization. 2) http://docs.python.org/library/subprocess.html#module-subprocess does not actually document that preexec_fn is unsafe and in need of deprecation. New users will continue to shoot themselves in the foot. 3) I think that in addition to making the standard library safe, all other locks need to be made sane (crash or deadlock), so that we at least always avoid the option "2) continue to execute the child despite it relying on an unsafe lock". |
|||
msg162137 - (view) | Author: Vinay Sajip (vinay.sajip) * ![]() |
Date: 2012-06-02 14:49 | |
> Use file locks in logging, whenever possible. Logging doesn't just log to files, and moreover, also has locks to serialise access to internal data structures (nothing to do with files). Hence, using file locks in logging is not going to magically solve problems caused in threading+forking scenarios. Apart from logging a commonly used part of the stdlib library which uses locks, I don't think this issue is to do with logging, specifically; logging uses locks in an unexceptional, conventional way, much as any other code might. Whatever solution is come up with for this thorny issue, it needs to be generic, in my view; otherwise we might just be papering over the cracks. |
|||
msg162160 - (view) | Author: Richard Oudkerk (sbt) * ![]() |
Date: 2012-06-02 18:06 | |
Lesha, the problems about "magical" __del__ methods you are worried about actually have nothing to do with threading and locks. Even in a single threaded program using fork, exactly the same issues of potential corruption would be present because the object might be finalized at the same in multiple processes. The idea that protecting the object with a thread lock will help you is seriously misguided UNLESS you also make sure you acquire them all before the fork -- and then you need to worry about the order in which you acquire all these locks. There are much easier and more direct ways to deal with the issue than wrapping all objects with locks and trying to acquire them all before forking. You could of course use multiprocessing.Lock() if you want a lock shared between processes. But even then, depending on what the __del__ method does, it is likely that you will not want the object to be finalized in both processes. However, the suggestion that locked-before-fork-locks should by default raise an error is reasonable enough. |
|||
msg270015 - (view) | Author: Connor Wolf (Connor.Wolf) | Date: 2016-07-09 01:57 | |
Is anything happening with these fixes? This is still an issue (I'm running into it now)? |
|||
msg270017 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2016-07-09 03:19 | |
For me the momentum on fixing these things has stalled because I no longer work on any code that runs into this. There is a fundamental problem: You cannot safely use threading and os.fork() in the same application per POSIX rules. So even if the standard library and interpreter to tried to force its locks into some sort of consistent state post os.fork(), the much more fundamental POSIX problem remains. IMNSHO, working on "fixes" for this issue while ignoring the larger application design flaw elephant in the room doesn't make a lot of sense. |
|||
msg270018 - (view) | Author: Connor Wolf (Connor.Wolf) | Date: 2016-07-09 03:46 | |
> IMNSHO, working on "fixes" for this issue while ignoring the larger application design flaw elephant in the room doesn't make a lot of sense. I understand the desire for a canonically "correct" fix, but it seems the issue with fixing it "correctly" has lead to the /actual/ implementation being broken for at least 6 years now. As it is, my options are: A. Rewrite the many, many libraries I use that internally spawn threads. B. Not use multiprocessing. (A) is prohibitive from a time perspective (I don't even know how many libraries I'd have to rewrite!), and (B) means I'd get 1/24-th of my VMs performance, so it's somewhat prohibitive. At the moment, I've thrown together a horrible, horrible fix where I reach into the logging library (which is where I'm seeing deadlocks), and manually iterate over all attached log managers, resetting the locks in each immediately when each process spawns. This is, I think it can be agreed, a horrible, horrible hack, but in my particular case it works (the worst case result is garbled console output for a line or two). --- If a canonical fix is not possible, at least add a facility to the threading fork() call that lets the user decide what to do. In my case, my program is wedging in the logging system, and I am entirely OK with having transiently garbled logs, if it means I don't wind up deadlocking and having to force kill the interpreter (which is, I think, far /more/ destructive an action). If I could basically do `multiprocessing.Process(*args, *kwargs, _clear_locks=True)`, that would be entirely sufficient, and not change existing behaviour at all. |
|||
msg270019 - (view) | Author: Connor Wolf (Connor.Wolf) | Date: 2016-07-09 03:47 | |
Arrrgh, s/threading/multiprocessing/g in my last message. |
|||
msg270020 - (view) | Author: (lesha) | Date: 2016-07-09 04:03 | |
I work on a multi-million-line C++ codebase that uses fork() from multithreaded programs all over the place. We use `glog` ubiquitously. This bug here that spans 6 years and has affected dozens of people (conservatively) simply does not exist for us. That is because glog takes the pragmatic approach of sanitizing its mutex on fork: https://github.com/google/glog/blob/4d391fe692ae6b9e0105f473945c415a3ce5a401/src/base/mutex.h#L249 In my opinion, "thou shalt never fork() in a threaded program" is impractical purism. It is good to be aware of the dangers that lie therein, but it is completely possible to safely spawn **subprocesses** from multithreaded programs on modern OSes like Linux. Python's subprocess **ought** to be safe to use in threaded programs. Any issues with this (aside from `pre_exec_fn`, obviously) are bugs in `subprocess`. Here is a C++ implementation of the concept that can be safely used in threaded programs: https://github.com/facebook/folly/blob/master/folly/Subprocess.cpp Note that unlike Python's subprocess `pre_exec_fn`, the C++ analog is very loud in warning you about the scary world in which your closure will execute: https://github.com/facebook/folly/blob/master/folly/Subprocess.h#L252 The point to my message is simple: there is a pragmatic way to save hundreds of debugging hours for users of Python. People are going to find it necessary to do such things from time to time, so instead of telling them to redesign their programs, it is better to give them a safer tool. Taking the glog approach in `logging` has no cost to the standard library, but it does have real world benefits. Please don't block shipping such a fix. |
|||
msg270021 - (view) | Author: (lesha) | Date: 2016-07-09 04:19 | |
On a moment's reflection, a lot of my prior comment is wrong. Sorry about that. - glog does not, that I know of, sanitize locks on fork. You just shouldn't log after fork but before exec. - Using `pthread_atfork` to clean up the `logging` lock might be enough to make it safe from the "just forked" context, but without adding fairly exhaustive tests around this logic, it would be fragile with respect to further improvements to `logging`. So even just making this one library safe is a considerable amount of work. So I retract most of my previous opinion. Sorry. |
|||
msg270022 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2016-07-09 04:21 | |
My intent is not to block anything. I'm Just explaining why I'm not motivated to spend much time on this issue myself. Others are welcome to. subprocess is not related to this issue, it has been fixed for use with threads (in 3.2 and higher) with an extremely widely used drop in replacement back-port for 2.7 https://pypi.python.org/pypi/subprocess32. But even 2.7's poor subprocess implementation never triggered this specific issue in the first place (unless someone logged from a pre_exec_fn which would be a laughable thing to do anyways). multiprocessing: It has spawn (as of 3.4) and forkserver methods both of which can help avoid this issue. Caveats: spawn understandably has negative performance implications and forkserver requires the forkserver to be forked before threads potentially holding locks have been started. As for the gross hacky monkey patching workaround: That was the approach I took in https://github.com/google/python-atfork/blob/master/atfork/stdlib_fixer.py#L51 Definitely a hack, but one that does work on existing interpreters. Conner & lesha: Which Python version(s) are you using? |
|||
msg270023 - (view) | Author: Connor Wolf (Connor.Wolf) | Date: 2016-07-09 04:44 | |
> Python 3.5.1+ (default, Mar 30 2016, 22:46:26) Whatever the stock 3.5 on ubuntu 16.04 x64 is. I've actually been running into a whole horde of really bizarre issues related to what I /think/ is locking in stdout. Basically, I have a context where I have thousands and thousands of (relatively short lived) `multiprocessing.Process()` processes, and over time they all get wedged (basically, I have ~4-32 processes alive at any time, but they all get recycled every few minutes). After doing some horrible (https://github.com/fake-name/ReadableWebProxy/blob/master/logSetup.py#L21-L78) hackery in the logging module, I'm not seeing processes get wedged there, but I do still encounter issues with what I can only assume is a lock in the print statement. I'm hooking into a wedged process using [pystuck](https://github.com/alonho/pystuck) durr@rwpscrape:/media/Storage/Scripts/ReadableWebProxy⟫ pystuck --port 6675 Welcome to the pystuck interactive shell. Use the 'modules' dictionary to access remote modules (like 'os', or '__main__') Use the `%show threads` magic to display all thread stack traces. In [1]: show threads <_MainThread(MainThread, started 140574012434176)> File "runScrape.py", line 74, in <module> go() File "runScrape.py", line 57, in go runner.run() File "/media/Storage/Scripts/ReadableWebProxy/WebMirror/Runner.py", line 453, in run living = sum([manager.check_run_jobs() for manager in managers]) File "/media/Storage/Scripts/ReadableWebProxy/WebMirror/Runner.py", line 453, in <listcomp> living = sum([manager.check_run_jobs() for manager in managers]) File "/media/Storage/Scripts/ReadableWebProxy/WebMirror/Runner.py", line 364, in check_run_jobs proc.start() File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start self._popen = self._Popen(self) File "/usr/lib/python3.5/multiprocessing/context.py", line 212, in _Popen return _default_context.get_context().Process._Popen(process_obj) File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen return Popen(process_obj) File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__ self._launch(process_obj) File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 74, in _launch code = process_obj._bootstrap() File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap self.run() File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "/media/Storage/Scripts/ReadableWebProxy/WebMirror/Runner.py", line 145, in run run.go() File "/media/Storage/Scripts/ReadableWebProxy/WebMirror/Runner.py", line 101, in go self.log.info("RunInstance starting!") File "/usr/lib/python3.5/logging/__init__.py", line 1279, in info self._log(INFO, msg, args, **kwargs) File "/usr/lib/python3.5/logging/__init__.py", line 1415, in _log self.handle(record) File "/usr/lib/python3.5/logging/__init__.py", line 1425, in handle self.callHandlers(record) File "/usr/lib/python3.5/logging/__init__.py", line 1487, in callHandlers hdlr.handle(record) File "/usr/lib/python3.5/logging/__init__.py", line 855, in handle self.emit(record) File "/media/Storage/Scripts/ReadableWebProxy/logSetup.py", line 134, in emit print(outstr) <Thread(Thread-4, started daemon 140573656733440)> File "/usr/lib/python3.5/threading.py", line 882, in _bootstrap self._bootstrap_inner() File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner self.run() File "/usr/lib/python3.5/threading.py", line 862, in run self._target(*self._args, **self._kwargs) File "/usr/local/lib/python3.5/dist-packages/rpyc/utils/server.py", line 241, in start self.accept() File "/usr/local/lib/python3.5/dist-packages/rpyc/utils/server.py", line 128, in accept sock, addrinfo = self.listener.accept() File "/usr/lib/python3.5/socket.py", line 195, in accept fd, addr = self._accept() <Thread(Thread-5, started daemon 140573665126144)> File "/usr/local/lib/python3.5/dist-packages/pystuck/thread_probe.py", line 15, in thread_frame_generator yield (thread_, frame) So, somehow the print() statement is blocking, which I have /no/ idea how to go about debugging. I assume there's a lock /in/ the print statement function call, and I'm probably going to look into wrapping both the print() call and the multiprocessing.Process() call execution in a single, shared multiprocessing lock, but that seems like a very patchwork solution to something that should just work. |
|||
msg270028 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2016-07-09 08:02 | |
I suggest to close the issue as WONT FIX. Python code base is huge and Python depends on a lot of external code. We cannot warranty anything. It might be possible to track all kinds of locks with an infinite time. But I'm not sure that it's worth it. It is possible to use fork() with threads. The problem is more to execute non trivial code after the fork. In short, the POSIX advices to only call exec() syscall after fork and nothing else. The list of functions which are safe after fork() is very short. You can still use the multiprocessing module using the fork server for example. |
|||
msg289716 - (view) | Author: Daniel Birnstiel (Birne94) * | Date: 2017-03-16 14:10 | |
Currently using Python 3.6.0 (default, Mar 4 2017, 12:32:34) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin > So, somehow the print() statement is blocking, which I have /no/ idea how to go about debugging. I assume there's a lock /in/ the print statement function call, and I'm probably going to look into wrapping both the print() call and the multiprocessing.Process() call execution in a single, shared multiprocessing lock, but that > seems like a very patchwork solution to something that should just work. I am currently having a similar issue where I get a deadlock to a stdout.flush call (which I assume is called when printing). Flush internally appears to acquire a lock which is getting stuck in the subprocess. This is the backtrace I was able to obtain through lldb, showing the waiting for a lock after the flush-call: * thread #1: tid = 0x77c27d, 0x00007fffe33c9c86 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP frame #0: 0x00007fffe33c9c86 libsystem_kernel.dylib`__psynch_cvwait + 10 frame #1: 0x00007fffe34b396a libsystem_pthread.dylib`_pthread_cond_wait + 712 frame #2: 0x00000001021ecad8 Python`PyThread_acquire_lock_timed + 256 frame #3: 0x000000010221cc2f Python`_enter_buffered_busy + 169 frame #4: 0x000000010221ed36 Python`_io_BufferedWriter_write + 203 frame #5: 0x000000010215448b Python`_PyCFunction_FastCallDict + 529 frame #6: 0x000000010211b3f0 Python`_PyObject_FastCallDict + 237 frame #7: 0x000000010211be9e Python`PyObject_CallMethodObjArgs + 240 frame #8: 0x000000010222171a Python`_textiowrapper_writeflush + 150 * frame #9: 0x00000001022224a8 Python`_io_TextIOWrapper_flush + 239 Is there any update on this issue or any solution to avoid deadlocking without wrapping every fork/print/logging call with a multiprocessing (or billiard in my case) lock? |
|||
msg294726 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2017-05-30 00:14 | |
http://bugs.python.org/issue16500 added the os.register_at_fork() API which may be usable for this. |
|||
msg294834 - (view) | Author: Daniel Birnstiel (Birne94) * | Date: 2017-05-31 11:40 | |
While having to deal with this bug for a while I have written a small library using `pthread_atfork`: https://github.com/Birne94/python-atfork-lock-release It allows registering atfork-hooks (similar to the ones available by now) and frees the stdout/stderr as well as manually provided io locks. I guess it uses some hacky ways to get the job done, but resolved the issue for me and has been working without problems for some weeks now. |
|||
msg304714 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2017-10-21 17:00 | |
I think we should somehow move forward on this, at least for logging locks which can be quite an annoyance. There are two possible approaches: - either a generic mechanism as posted by sbt in reinit_locks_2.diff - or a logging-specific fix using os.register_at_fork() What do you think? |
|||
msg304716 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2017-10-21 17:12 | |
Oh, I forgot that IO buffered objects also have a lock. So we would have to special-case those as well, unless we take the generic approach... A problem with the generic approach is that it would leave higher-level synchronization objects such as RLock, Event etc. in an inconsistent state. Not to mention the case where the lock is taken by the thread calling fork()... |
|||
msg304722 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2017-10-21 20:27 | |
logging is pretty easy to deal with so I created a PR. bufferedio.c is a little more work as we either need to use the posixmodule.c os.register_at_fork API or expose it as an internal C API to be able to call it to add acquires and releases around the buffer's self->lock member when non-NULL. either way, that needs to be written safely so that it doesn't crash if fork happens after a buffered io struct is freed. (unregister at fork handlers when freeing it? messy) |
|||
msg304723 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2017-10-21 20:29 | |
Actually, we already have a doubly-linked list of buffered IO objects (for another purpose), so we can reuse that and register a single set of global callbacks. |
|||
msg314983 - (view) | Author: Olivier Chédru (ochedru) | Date: 2018-04-05 13:53 | |
FWIW, I encountered the same kind of issue when using the mkstemp() function: under the hood, it calls gettempdir() and this one is protected by a lock too. Current thread 0x00007ff10231f700 (most recent call first): File "/usr/lib/python3.5/tempfile.py", line 432 in gettempdir File "/usr/lib/python3.5/tempfile.py", line 269 in _sanitize_params File "/usr/lib/python3.5/tempfile.py", line 474 in mkstemp |
|||
msg325326 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2018-09-14 05:08 | |
New changeset 19003841e965bbf56fd06824d6093620c1b66f9e by Gregory P. Smith in branch 'master': bpo-6721: Hold logging locks across fork() (GH-4071) https://github.com/python/cpython/commit/19003841e965bbf56fd06824d6093620c1b66f9e |
|||
msg327267 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2018-10-07 07:10 | |
New changeset 3b699932e5ac3e76031bbb6d700fbea07492641d by Gregory P. Smith (Miss Islington (bot)) in branch '3.7': bpo-6721: Hold logging locks across fork() (GH-4071) (#9291) https://github.com/python/cpython/commit/3b699932e5ac3e76031bbb6d700fbea07492641d |
|||
msg329474 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2018-11-08 14:18 | |
> New changeset 3b699932e5ac3e76031bbb6d700fbea07492641d by Gregory P. Smith (Miss Islington (bot)) in branch '3.7': > bpo-6721: Hold logging locks across fork() (GH-4071) (#9291) It seems like this change caused a regression in the Anaconda installer of Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=1644936 But we are not sure at this point. I have to investigate to understand exactly what is happening. |
|||
msg339369 - (view) | Author: cagney (cagney) | Date: 2019-04-02 21:27 | |
I suspect 3b699932e5ac3e7 is causing a hang in libreswan's kvmrunner.py on Fedora. Looking at the Fedora RPMs: python3-3.7.0-9.fc29.x86_64 didn't contain the fix and worked python3-3.7.1-4.fc29.x86_64 reverted the fix (for anaconda) and worked python3-3.7.2-4.fc29.x86_64 included the fix; eventually hangs I believe the hang looks like: Traceback (most recent call last): File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 389, in _process_test test_domains = _boot_test_domains(logger, test, domain_prefix, boot_executor) File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 203, in _boot_test_domains TestDomain.boot_and_login) File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 150, in submit_job_for_domain logger.debug("scheduled %s on %s", job, domain) File "/usr/lib64/python3.7/logging/__init__.py", line 1724, in debug File "/usr/lib64/python3.7/logging/__init__.py", line 1768, in log def __repr__(self): File "/usr/lib64/python3.7/logging/__init__.py", line 1449, in log """ File "/usr/lib64/python3.7/logging/__init__.py", line 1519, in _log break File "/usr/lib64/python3.7/logging/__init__.py", line 1529, in handle logger hierarchy. If no handler was found, output a one-off error File "/usr/lib64/python3.7/logging/__init__.py", line 1591, in callHandlers File "/usr/lib64/python3.7/logging/__init__.py", line 905, in handle try: File "/home/build/libreswan-web/master/testing/utils/fab/logutil.py", line 163, in emit stream_handler.emit(record) File "/usr/lib64/python3.7/logging/__init__.py", line 1038, in emit Handler.__init__(self) File "/usr/lib64/python3.7/logging/__init__.py", line 1015, in flush name += ' ' File "/usr/lib64/python3.7/logging/__init__.py", line 854, in acquire self.emit(record) KeyboardInterrupt |
|||
msg339371 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2019-04-02 22:21 | |
We need a small test case that can reproduce your problem. I believe https://github.com/python/cpython/commit/3b699932e5ac3e76031bbb6d700fbea07492641d to be correct. acquiring locks before fork in the thread doing the forking and releasing them afterwards is always the safe thing to do. Example possibility: Does your code use any C code that forks on its own without properly calling the C Python PyOS_BeforeFork(), PyOS_AfterFork_Parent(), and PyOS_AfterFork_Child() APIs? |
|||
msg339393 - (view) | Author: cagney (cagney) | Date: 2019-04-03 14:40 | |
> Does your code use any C code that forks on its own without properly calling the C Python PyOS_BeforeFork(), PyOS_AfterFork_Parent(), and PyOS_AfterFork_Child() APIs? No. Is there a web page explaining how to pull a python backtrace from all the threads running within a daemon? |
|||
msg339418 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2019-04-03 22:15 | |
I'd start with faulthandler.register with all_threads=True and see if that gives you what you need. https://docs.python.org/3/library/faulthandler.html |
|||
msg339454 - (view) | Author: cagney (cagney) | Date: 2019-04-04 17:11 | |
> acquiring locks before fork in the thread doing the forking and releasing them afterwards is always the safe thing to do. It's also an easy way to cause a deadlock: - register_at_fork() et.al. will cause per-logger locks to be acquired before the global lock (this isn't immediately obvious from the code) If a thread were to grab its logging lock before the global lock then it would deadlock. I'm guessing this isn't allowed - however I didn't see any comments to this effect? Can I suggest documenting this, and also merging the two callbacks so that the ordering of these two acquires is made explicit. - the per-logger locks are acquired in a random order If a thread were to acquire two per-logger locks in a different order then things would deadlock. |
|||
msg339458 - (view) | Author: cagney (cagney) | Date: 2019-04-04 17:35 | |
Below is a backtrace from the deadlock. It happens because the logging code is trying to acquire two per-logger locks; and in an order different to the random order used by the fork() handler. The code in question has a custom class DebugHandler(logging.Handler). The default logging.Handler.handle() method grabs its lock and calls .emit() vis: if rv: self.acquire() try: self.emit(record) finally: self.release() the custom .emit() then sends the record to a sub-logger stream vis: def emit(self, record): for stream_handler in self.stream_handlers: stream_handler.emit(record) if _DEBUG_STREAM: _DEBUG_STREAM.emit(record) and one of these emit() functions calls flush() which tries to acquire a further lock. Thread 0x00007f976b7fe700 (most recent call first): File "/usr/lib64/python3.7/logging/__init__.py", line 854 in acquire File "/usr/lib64/python3.7/logging/__init__.py", line 1015 in flush def flush(self): """ Flushes the stream. """ self.acquire() <---- try: if self.stream and hasattr(self.stream, "flush"): self.stream.flush() finally: self.release() File "/usr/lib64/python3.7/logging/__init__.py", line 1038 in emit self.flush() <---- File "/home/build/libreswan-web/master/testing/utils/fab/logutil.py", line 163 in emit def emit(self, record): for stream_handler in self.stream_handlers: stream_handler.emit(record) <--- if _DEBUG_STREAM: _DEBUG_STREAM.emit(record) File "/usr/lib64/python3.7/logging/__init__.py", line 905 in handle def handle(self, record): """ Conditionally emit the specified logging record. Emission depends on filters which may have been added to the handler. Wrap the actual emission of the record with acquisition/release of the I/O thread lock. Returns whether the filter passed the record for emission. """ rv = self.filter(record) if rv: self.acquire() try: self.emit(record) <--- finally: self.release() return rv File "/usr/lib64/python3.7/logging/__init__.py", line 1591 in callHandlers hdlr.handle(record) File "/usr/lib64/python3.7/logging/__init__.py", line 1529 in handle self.callHandlers(record) File "/usr/lib64/python3.7/logging/__init__.py", line 1519 in _log self.handle(record) File "/usr/lib64/python3.7/logging/__init__.py", line 1449 in log self._log(level, msg, args, **kwargs) File "/usr/lib64/python3.7/logging/__init__.py", line 1768 in log self.logger.log(level, msg, *args, **kwargs) File "/usr/lib64/python3.7/logging/__init__.py", line 1724 in debug self.log(DEBUG, msg, *args, **kwargs) File "/home/build/libreswan-web/master/testing/utils/fab/shell.py", line 110 in write self.logger.debug(self.message, ascii(text)) |
|||
msg339473 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2019-04-05 08:17 | |
Thanks for the debugging details! I've filed https://bugs.python.org/issue36533 to specifically track this potential regression in the 3.7 stable branch. lets carry on there where the discussion thread isn't too long for bug tracker sanity. |
|||
msg365169 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-03-27 16:53 | |
I created bpo-40089: Add _at_fork_reinit() method to locks. |
|||
msg367528 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2020-04-28 13:20 | |
Related issue: https://bugs.python.org/issue40399 """ IO streams locking can be broken after fork() with threads """ |
|||
msg367702 - (view) | Author: Deomid Ryabkov (rojer) | Date: 2020-04-29 20:49 | |
https://bugs.python.org/issue40442 is a fresh instance of this, entirely self-inflicted. |
|||
msg368882 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-05-14 23:43 | |
See also bpo-25920: PyOS_AfterFork should reset socketmodule's lock. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:52 | admin | set | github: 50970 |
2021-06-26 08:10:31 | pitrou | link | issue41539 superseder |
2020-10-11 20:42:06 | kevans | set | nosy:
+ kevans pull_requests: + pull_request21627 |
2020-10-03 13:38:34 | adelfino | set | nosy:
- adelfino |
2020-10-03 07:08:56 | scoder | set | nosy:
- scoder |
2020-10-03 07:08:30 | scoder | set | pull_requests: - pull_request21519 |
2020-10-03 06:56:08 | scoder | set | nosy:
+ scoder pull_requests: + pull_request21519 |
2020-09-15 20:22:21 | adelfino | set | nosy:
+ adelfino pull_requests: + pull_request21315 |
2020-09-07 15:38:07 | koubaa | set | nosy:
+ koubaa pull_requests: + pull_request21218 |
2020-09-07 07:36:12 | kulikjak | set | nosy:
- kulikjak |
2020-09-07 07:36:00 | kulikjak | set | pull_requests: - pull_request21197 |
2020-09-05 19:32:02 | kulikjak | set | nosy:
+ kulikjak pull_requests: + pull_request21197 |
2020-08-29 17:05:58 | gregory.p.smith | set | pull_requests: - pull_request21110 |
2020-08-29 16:33:59 | rhettinger | set | nosy:
+ rhettinger pull_requests: + pull_request21110 |
2020-06-11 21:05:42 | jesse.farnham | set | nosy:
+ jesse.farnham |
2020-05-14 23:43:26 | vstinner | set | messages: + msg368882 |
2020-04-29 20:49:08 | rojer | set | nosy:
+ rojer messages: + msg367702 |
2020-04-28 13:20:23 | pitrou | set | messages: + msg367528 |
2020-03-27 16:53:39 | vstinner | set | messages: + msg365169 |
2019-04-05 17:02:47 | hugh | set | nosy:
+ hugh |
2019-04-05 08:17:15 | gregory.p.smith | set | messages: + msg339473 |
2019-04-04 18:01:34 | emptysquare | set | nosy:
- emptysquare |
2019-04-04 17:35:46 | cagney | set | messages: + msg339458 |
2019-04-04 17:11:09 | cagney | set | messages: + msg339454 |
2019-04-03 22:15:37 | gregory.p.smith | set | messages: + msg339418 |
2019-04-03 14:40:50 | cagney | set | messages: + msg339393 |
2019-04-02 22:21:33 | gregory.p.smith | set | messages: + msg339371 |
2019-04-02 21:27:32 | cagney | set | nosy:
+ cagney messages: + msg339369 |
2018-11-08 14:18:53 | vstinner | set | messages: + msg329474 |
2018-10-07 07:10:11 | gregory.p.smith | set | messages: + msg327267 |
2018-09-14 05:08:43 | miss-islington | set | pull_requests: + pull_request8722 |
2018-09-14 05:08:35 | gregory.p.smith | set | messages: + msg325326 |
2018-04-05 13:53:19 | ochedru | set | nosy:
+ ochedru messages: + msg314983 |
2017-10-21 20:29:53 | pitrou | set | messages: + msg304723 |
2017-10-21 20:27:53 | gregory.p.smith | set | messages: + msg304722 |
2017-10-21 20:24:03 | gregory.p.smith | set | pull_requests: + pull_request4042 |
2017-10-21 17:12:00 | pitrou | set | messages: + msg304716 |
2017-10-21 17:00:42 | pitrou | set | messages: + msg304714 |
2017-05-31 11:40:53 | Birne94 | set | messages: + msg294834 |
2017-05-30 00:14:19 | gregory.p.smith | set | versions: + Python 3.7, - Python 3.5 |
2017-05-30 00:14:09 | gregory.p.smith | set | messages: + msg294726 |
2017-03-16 14:10:29 | Birne94 | set | nosy:
+ Birne94 messages: + msg289716 |
2017-02-17 23:29:07 | Winterflower | set | nosy:
+ Winterflower |
2016-09-08 01:43:23 | davin | set | nosy:
+ davin |
2016-07-09 08:02:15 | vstinner | set | messages: + msg270028 |
2016-07-09 04:44:44 | Connor.Wolf | set | messages: + msg270023 |
2016-07-09 04:21:51 | gregory.p.smith | set | messages: + msg270022 |
2016-07-09 04:19:06 | lesha | set | messages: + msg270021 |
2016-07-09 04:03:23 | lesha | set | messages: + msg270020 |
2016-07-09 03:47:53 | Connor.Wolf | set | messages: + msg270019 |
2016-07-09 03:46:16 | Connor.Wolf | set | messages: + msg270018 |
2016-07-09 03:19:21 | gregory.p.smith | set | messages: + msg270017 |
2016-07-09 01:57:37 | Connor.Wolf | set | nosy:
+ Connor.Wolf messages: + msg270015 |
2016-03-24 14:56:29 | emptysquare | set | nosy:
+ emptysquare |
2014-11-03 11:21:12 | nirs | set | nosy:
+ nirs |
2014-11-02 16:24:44 | ionelmc | set | nosy:
+ ionelmc |
2014-11-02 15:11:51 | serhiy.storchaka | link | issue22697 superseder |
2014-08-24 03:56:21 | dan.oreilly | set | nosy:
+ dan.oreilly |
2014-06-30 15:03:46 | tshepang | set | nosy:
+ tshepang |
2014-06-30 15:02:39 | tshepang | set | versions:
+ Python 3.5, - Python 3.3 title: Locks in python standard library should be sanitized on fork -> Locks in the standard library should be sanitized on fork |
2014-04-22 20:05:29 | forest_atq | set | nosy:
+ forest_atq |
2012-06-02 18:06:20 | sbt | set | messages: + msg162160 |
2012-06-02 14:49:28 | vinay.sajip | set | messages: + msg162137 |
2012-06-02 13:38:42 | avian | set | nosy:
- avian |
2012-06-02 02:57:02 | lesha | set | messages: + msg162120 |
2012-06-02 02:15:42 | gregory.p.smith | set | messages: + msg162117 |
2012-06-02 00:54:59 | lesha | set | messages: + msg162115 |
2012-06-02 00:42:00 | lesha | set | messages: + msg162114 |
2012-06-02 00:33:12 | lesha | set | messages: + msg162113 |
2012-06-01 09:18:59 | vinay.sajip | set | messages: + msg162063 |
2012-06-01 06:58:59 | sbt | set | messages: + msg162054 |
2012-06-01 06:43:09 | gregory.p.smith | set | messages: + msg162053 |
2012-06-01 01:51:38 | lesha | set | messages: + msg162041 |
2012-06-01 01:26:55 | gregory.p.smith | set | messages: + msg162040 |
2012-06-01 01:20:22 | lesha | set | messages: + msg162039 |
2012-06-01 01:16:41 | gregory.p.smith | set | messages: + msg162038 |
2012-06-01 00:25:01 | lesha | set | messages: + msg162036 |
2012-06-01 00:11:00 | sbt | set | messages: + msg162034 |
2012-05-31 23:17:24 | lesha | set | messages: + msg162031 |
2012-05-31 20:48:41 | sbt | set | files:
+ reinit_locks_2.diff messages: + msg162019 |
2012-05-30 14:04:45 | sbt | set | messages: + msg161953 |
2012-05-23 23:39:43 | lesha | set | messages: + msg161470 |
2012-05-23 12:49:07 | sbt | set | messages: + msg161405 |
2012-05-23 02:00:55 | lesha | set | messages: + msg161389 |
2012-05-18 01:26:50 | gregory.p.smith | set | messages: + msg161029 |
2012-05-17 21:49:12 | pitrou | set | versions:
- Python 2.7, Python 3.2 messages: + msg161019 assignee: gregory.p.smith -> type: behavior -> enhancement stage: test needed -> patch review |
2012-02-13 11:55:21 | vinay.sajip | set | nosy:
+ vinay.sajip |
2012-01-23 21:58:31 | pitrou | set | messages: + msg151853 |
2012-01-23 21:29:10 | sbt | set | messages: + msg151846 |
2012-01-23 21:23:23 | sbt | set | files:
+ atfork.patch messages: + msg151845 |
2012-01-14 19:40:51 | gregory.p.smith | set | messages: + msg151267 |
2012-01-14 19:32:37 | neologix | set | messages: + msg151266 |
2012-01-14 17:47:47 | jcea | set | nosy:
+ jcea |
2012-01-13 11:02:10 | lesha | set | nosy:
+ lesha messages: + msg151168 |
2012-01-13 10:42:12 | neologix | link | issue13778 superseder |
2011-08-31 21:02:04 | neologix | set | messages: + msg143279 |
2011-08-31 18:25:42 | nirai | set | messages: + msg143274 |
2011-08-29 19:04:01 | sbt | set | nosy:
+ sbt messages: + msg143174 |
2011-07-28 08:26:10 | nirai | set | messages: + msg141286 |
2011-07-19 19:45:04 | nirai | set | messages: + msg140691 |
2011-07-19 19:14:28 | sdaoden | set | messages: + msg140690 |
2011-07-19 19:04:48 | sdaoden | set | messages: + msg140689 |
2011-07-19 14:11:56 | nirai | set | messages: + msg140668 |
2011-07-19 12:54:49 | sdaoden | set | messages: + msg140659 |
2011-07-19 12:32:45 | sdaoden | set | messages: + msg140658 |
2011-07-18 00:27:23 | gregory.p.smith | set | priority: high -> normal messages: + msg140550 |
2011-07-15 11:17:44 | nirai | set | messages: + msg140402 |
2011-07-12 20:57:36 | nirai | set | messages: + msg140215 |
2011-07-06 08:55:49 | nirai | set | messages: + msg139929 |
2011-07-05 16:33:33 | neologix | set | messages: + msg139897 |
2011-07-05 13:04:43 | avian | set | messages: + msg139869 |
2011-07-05 11:47:12 | neologix | set | messages: + msg139858 |
2011-07-05 11:29:34 | pitrou | set | messages: + msg139852 |
2011-07-05 11:20:41 | avian | set | messages: + msg139850 |
2011-07-04 21:22:46 | neologix | set | messages: + msg139808 |
2011-07-04 19:41:37 | nirai | set | messages: + msg139800 |
2011-07-03 09:41:58 | neologix | link | issue7123 superseder |
2011-07-01 21:09:05 | neologix | set | messages: + msg139608 |
2011-07-01 19:50:54 | nirai | set | messages: + msg139599 |
2011-07-01 15:23:47 | neologix | set | messages: + msg139584 |
2011-06-30 20:53:13 | pitrou | set | messages: + msg139522 |
2011-06-30 20:25:55 | sdaoden | set | messages: + msg139521 |
2011-06-30 19:14:14 | pitrou | set | messages: + msg139511 |
2011-06-30 19:05:28 | sdaoden | set | messages: + msg139509 |
2011-06-30 15:10:11 | nirai | set | messages: + msg139489 |
2011-06-30 15:08:49 | pitrou | set | messages: + msg139488 |
2011-06-30 15:04:08 | avian | set | files:
+ emit_warning_on_fork.patch messages: + msg139485 |
2011-06-30 14:16:33 | nirai | set | messages: + msg139480 |
2011-06-30 12:55:56 | neologix | set | messages: + msg139474 |
2011-06-30 10:05:36 | avian | set | messages: + msg139470 |
2011-06-27 08:30:10 | pitrou | set | messages: + msg139245 |
2011-06-26 18:49:46 | terry.reedy | set | versions: - Python 3.1 |
2011-06-25 15:18:31 | avian | set | nosy:
+ avian |
2011-06-25 15:15:46 | Giovanni.Bajo | set | nosy:
+ Giovanni.Bajo messages: + msg139084 |
2011-05-17 10:35:25 | sdaoden | set | messages: + msg136147 |
2011-05-16 18:57:29 | nirai | set | messages: + msg136120 |
2011-05-15 19:39:55 | neologix | set | files:
+ reinit_locks.diff messages: + msg136047 |
2011-05-15 19:34:38 | neologix | set | files: - reinit_locks.diff |
2011-05-15 18:51:26 | nirai | set | messages: + msg136045 |
2011-05-15 17:03:50 | sdaoden | set | messages: + msg136039 |
2011-05-14 23:14:32 | neologix | set | messages: + msg136003 |
2011-05-14 18:54:57 | nirai | set | messages: + msg135984 |
2011-05-14 08:26:14 | neologix | set | messages: + msg135965 |
2011-05-13 23:12:19 | sdaoden | set | messages: + msg135948 |
2011-05-13 11:24:34 | neologix | set | files:
+ reinit_locks.diff messages: + msg135899 |
2011-05-13 11:15:03 | pitrou | set | messages: + msg135897 |
2011-05-12 21:10:07 | sdaoden | set | nosy:
+ sdaoden messages: + msg135866 |
2011-05-12 20:01:50 | nirai | set | messages: + msg135857 |
2011-05-10 02:32:27 | nirai | set | nosy:
+ nirai |
2011-05-08 21:25:05 | pitrou | set | messages: + msg135543 |
2011-05-08 21:12:08 | pitrou | set | files: - unnamed |
2011-05-05 05:41:46 | neologix | set | messages: + msg135173 |
2011-05-04 20:58:55 | neologix | set | messages: + msg135157 |
2011-05-04 17:53:47 | pitrou | set | messages: + msg135143 |
2011-05-04 07:18:08 | vstinner | set | nosy:
+ vstinner |
2011-05-04 06:00:59 | neologix | set | messages: + msg135096 |
2011-05-04 05:56:18 | neologix | set | messages: + msg135095 |
2011-05-03 22:05:36 | pitrou | set | messages: + msg135083 |
2011-05-03 21:39:56 | neologix | set | messages: + msg135079 |
2011-05-03 20:47:32 | gregory.p.smith | set | files:
+ unnamed messages: + msg135069 |
2011-05-03 20:38:12 | pitrou | set | files:
+ forklocktests.patch keywords: + patch messages: + msg135067 |
2011-05-03 00:20:59 | pitrou | set | messages: + msg135012 |
2011-02-11 09:13:04 | neologix | set | nosy:
gregory.p.smith, pitrou, bobbyi, neologix messages: + msg128369 |
2011-02-10 17:20:17 | gregory.p.smith | set | nosy:
gregory.p.smith, pitrou, bobbyi, neologix messages: + msg128316 versions: + Python 3.3 |
2011-02-10 17:02:53 | pitrou | set | nosy:
gregory.p.smith, pitrou, bobbyi, neologix messages: + msg128311 |
2011-02-10 16:56:33 | gregory.p.smith | set | nosy:
gregory.p.smith, pitrou, bobbyi, neologix messages: + msg128307 |
2011-02-10 11:12:37 | neologix | set | nosy:
+ neologix messages: + msg128282 |
2010-12-15 18:25:41 | bobbyi | set | nosy:
+ bobbyi |
2010-12-14 02:43:00 | r.david.murray | set | stage: test needed type: behavior versions: - Python 2.6 |
2009-10-16 10:28:42 | gregory.p.smith | set | messages: + msg94135 |
2009-10-16 10:23:10 | pitrou | set | messages: + msg94133 |
2009-10-16 00:24:04 | gregory.p.smith | set | messages: + msg94115 |
2009-10-15 18:15:51 | pitrou | set | nosy:
+ pitrou messages: + msg94102 |
2009-09-17 14:25:18 | gregory.p.smith | set | messages:
+ msg92766 components: + Library (Lib) |
2009-08-24 19:48:16 | gregory.p.smith | set | messages: + msg91936 |
2009-08-17 23:06:17 | gregory.p.smith | create |