queue.Queue() is not reentrant, so signals and GC can cause deadlocks #59181

JohanAR · 2012-06-01T08:43:04Z

BPO	14976
Nosy	@gvanrossum, @tim-one, @rhettinger, @gpshead, @ncoghlan, @pitrou, @serhiy-storchaka, @1st1, @applio
PRs	bpo-14976: Reentrant simple queue #3346
Files	queue_deadlock.py: signal hander reading/writing a queue queue_sendint.c: send 2 signals to specified process in rapid succession queuebug2.py: gc-based-queue-problem.py q_reentrancy.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-01-15.23:27:50.989>
created_at = <Date 2012-06-01.08:43:03.911>
labels = ['3.7', 'type-bug', 'library']
title = 'queue.Queue() is not reentrant, so signals and GC can cause deadlocks'
updated_at = <Date 2018-01-17.01:04:26.008>
user = 'https://bugs.python.org/JohanAR'

bugs.python.org fields:

activity = <Date 2018-01-17.01:04:26.008>
actor = 'gregory.p.smith'
assignee = 'none'
closed = True
closed_date = <Date 2018-01-15.23:27:50.989>
closer = 'pitrou'
components = ['Library (Lib)']
creation = <Date 2012-06-01.08:43:03.911>
creator = 'JohanAR'
dependencies = []
files = ['25786', '25787', '46806', '47092']
hgrepos = []
issue_num = 14976
keywords = ['patch']
message_count = 51.0
messages = ['162062', '162065', '162069', '162076', '211436', '275181', '275377', '275404', '275421', '275485', '291764', '300421', '300422', '300455', '300478', '300489', '300491', '300499', '300504', '300505', '300509', '300510', '300511', '300532', '300567', '300574', '300575', '300577', '300581', '300723', '300825', '301170', '301173', '301175', '301176', '301207', '301299', '301321', '301322', '301378', '301379', '301380', '301383', '301386', '301409', '310014', '310025', '310026', '310109', '310111', '310125']
nosy_count = 15.0
nosy_names = ['gvanrossum', 'tim.peters', 'rhettinger', 'gregory.p.smith', 'ncoghlan', 'pitrou', 'zzzeek', 'python-dev', 'sbt', 'serhiy.storchaka', 'JohanAR', 'yselivanov', 'itamarst', 'davin', 'Catalin Patulea']
pr_nums = ['3346']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue14976'
versions = ['Python 3.7']

JohanAR · 2012-06-01T08:43:04Z

Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2

If a signal handler calls Queue.PriorityQueue.put and a second signal is received while one is already being processed, neither of the calls to put will terminate.

Highly dependent on timing so it might be difficult to reproduce.

JohanAR · 2012-06-01T09:44:19Z

Start queue_deadlock.py in one terminal and note the PID it prints.

Compile queue_sendint.c in another terminal and execute it with previous PID as only argument.

If the bug is triggered, nothing will print in the python terminal window when you press Enter. To terminate the application you can press Ctrl-\

sbt · 2012-06-01T12:02:09Z

I don't think there is anything special about PriorityQueue.

There is a similar concerning the use of the Python implementation of RLock in signal handlers -- see http://bugs.python.org/issue13697.

Maybe the signal handler should temporarily mask or ignore SIGINT while it runs.

JohanAR · 2012-06-01T14:43:02Z

I did read some more on the subject and it seems like using locks with interrupts is in general a very difficult problem and not limited to Python.

Don't know if this is considered common knowledge among programmers or if it would be useful with at least a warning on the signal manual page.

itamarst · 2014-02-17T18:50:43Z

This is not specifically a signal issue; it can happen with garbage collection as well if you have a Queue.put that runs in __del__ or a weakref callback function.

This can happen in real code. In my case, a thread that reads log messages from a Queue and writes them to disk. The thread putting log messages into the Queue can deadlock if GC happens to cause a log message to be written right after Queue.put() acquired the lock.
(see itamarst/crochet#25).

zzzeek · 2016-09-08T21:59:52Z

SQLAlchemy suffered from this issue long ago as we use a Queue for connections, which can be collected via weakref callback and sent back to put(), which we observed can occur via gc. For many years (like since 2007 or so) we've packaged a complete copy of Queue.Queue with an RLock inside of it to work around it.

I'm just noticing this issue because I'm writing a new connection pool implementation that tries to deal with this issue a little differently (not using Queue.Queue though).

rhettinger · 2016-09-09T18:38:54Z

Guido opposes having us going down the path of using an RLock and altering the queue module to cope with reentrancy.

In the case of mixing signals with threads, we all agree with Johan Aires Rastén that "using locks with interrupts is in general a very difficult problem and not limited to Python", nor is it even limited to the queue module.

I believe that the usual technique for mixing signals and threads is to have the signal handler do the minimal work necessary to record the event without interacting with the rest of the program. Later, another thread can act on the recorded signal but can do so with normal use of locks so that code invariants are not violated. This design makes reasoning about the program more tractable.

python-dev · 2016-09-09T19:31:29Z

New changeset 137806ca59ce by Gregory P. Smith in branch '3.5':
Add a note about queue not being safe for use from signal handlers.
https://hg.python.org/cpython/rev/137806ca59ce

New changeset 4e15e7618715 by Gregory P. Smith in branch 'default':
Add a note about queue not being safe for use from signal handlers.
https://hg.python.org/cpython/rev/4e15e7618715

zzzeek · 2016-09-09T20:16:39Z

yep, that's what im doing in my approach. though longer term thing, I noticed it's very hard to find documentation on exactly when gc might run. E.g. would it ever run if I did something innocuous, like "self.thread_id = None" (probably not). Just wasn't easy to get that answer.

python-dev · 2016-09-09T22:59:25Z

New changeset 8c00cbbd3ff9 by Raymond Hettinger in branch '3.5':
bpo-14976: Note that the queue module is not designed to protect against reentrancy
https://hg.python.org/cpython/rev/8c00cbbd3ff9

itamarst · 2017-04-16T20:44:34Z

This bug was closed on the basis that signals + threads don't interact well. Which is a good point.

Unfortunately this bug can happen in cases that have nothing to do with signals. If you look at the title and some of the comments it also happens as a result of garbage collection: in __del__ methods and weakref callbacks.

Specifically, garbage collection can happen on any bytecode boundary and cause reentrancy problems with Queue.

The attached file is an attempt to demonstrate this: it runs and runs until GC happens and then deadlocks for me (on Python 3.6). I.e. it prints GC! and after that no further output is printed and no CPU usage reported.

Please re-open this bug: if you don't want to fix signal case that's fine, but the GC case is still an issue.

ncoghlan · 2017-08-17T13:52:09Z

Itamar wrote up a post describing the GC variant of this problem in more detail: https://codewithoutrules.com/2017/08/16/concurrency-python/

In particular, he highlighted a particularly nasty action-at-a-distance variant of the deadlock where:

Someone registers a logging.handlers.QueueHandler instance with the logging system
One or more types in the application or libraries it uses call logging functions in a __del__ method or a weakref callback
A GC cycle triggers while a log message is already being processed and hence the thread already holds the queue's put() lock
Things deadlock because the put() operation isn't re-entrant

As far as I can see, there's no application level way of resolving that short of "Only register logging.handlers.QueueHandler with a logger you completely control and hence can ensure is never used in a __del__ method or weakref callback", which doesn't feel like a reasonable restriction to place on the safe use of a standard library logging handler.

1st1 · 2017-08-17T13:54:49Z

An idea from the blog post: if we rewrite queue in C it will use the GIL as a lock which will fix this particular bug. I can make a patch.

pitrou · 2017-08-17T20:01:27Z

Using any kind of potentially-blocking synchronization primitive from __del__ or weakref callback is indeed a bug waiting for happen. I agree non-trivial cases can be hard to debug, especially when people don't expect that kind of cause.

It would be ok to submit a patch solving this issue using C code IMHO. Note the maxsize argument complicates things even though most uses of Queue don't use maxsize.

Of course, the general issue isn't only about Queue: other synchronization primitives are widely used. See tornadoweb/tornado#1876 for an example.

pitrou · 2017-08-18T09:06:20Z

Here is a pure Python PoC patch that allows unbounded Queue and LifoQueue to have reentrant put().

serhiy-storchaka · 2017-08-18T11:08:52Z

Are all changes are necessary for this issue or some of them are just an optimization? The patch changes the semantic of public attribute unfinished_tasks. Seems that old unfinished_tasks is new unfinished_tasks + _qsize().

pitrou · 2017-08-18T11:35:34Z

unfinished_tasks is not a public attribute AFAIK (it's not documented).

The change is necessary: you cannot increment unfinished_tasks in reentrant put(), since incrementing in pure Python is not atomic. So the incrementation is moved to get(), which probably cannot be made reentrant at all.

If keeping the visible semantics of the unfinished_tasks attribute is important, we could make it a property that computes the desired value.

zzzeek · 2017-08-18T14:25:19Z

Here is a pure Python PoC patch that allows unbounded Queue and LifoQueue to have reentrant put().

per http://bugs.python.org/msg275377 guido does not want an RLock here.

pitrou · 2017-08-18T15:24:53Z

I'll believe it when Guido chimes in and actually says so himself. Otherwise, I am skeptical Guido cares a lot about whether the internal implementation of Queue uses a Lock, a RLock or whatever else.

pitrou · 2017-08-18T15:29:41Z

Guido, apparently you are opposed to the Queue implementation using a RLock instead of a Lock, according to Raymond in https://bugs.python.org/issue14976#msg275377. I find this a bit surprising, so could you confirm it yourself?

pitrou · 2017-08-18T15:34:11Z

Also, to elaborate a bit, I don't think we should aim to make Queue fully reentrant, as that would be extremely involved. I think we can settle on the simpler goal of making put() reentrant for unbounded LIFO and FIFO queues, which is what most people care about (and which is incidentally what the posted patch claims to do).

1st1 · 2017-08-18T15:37:59Z

Here is a pure Python PoC patch that allows unbounded Queue and LifoQueue to have reentrant put().

Is it guaranteed that the GC will happen in the same thread that is holding the lock? IOW will RLock help with all GC/del deadlocking scenarios?

ncoghlan · 2017-08-19T06:46:08Z

+1 for treating Queue.put() specifically as the case to be handled, as that's the mechanism that can be used to *avoid* running complex operations directly in __del__ methods and weakref callbacks.

For testing purposes, the current deadlock can be reliably reproduced with sys.settrace:

>>> import sys
>>> import queue
>>> the_queue=queue.Queue()
>>> counter = 0
>>> def bad_trace(*args):
...     global counter
...     counter += 1
...     print(counter)
...     the_queue.put(counter)
...     return bad_trace
... 
>>> sys.settrace(bad_trace)
>>> the_queue.put(None)
1
2
3
4
5
6
7
[and here we have a deadlock]

pitrou · 2017-08-19T08:53:41Z

Le 18/08/2017 à 23:26, Guido van Rossum a écrit :

IIUC the end result would be a Queue whose put() works from signal handlers, GC callbacks and __del__, as long as it's unbounded, right?

Yes.

And when it *is* bounded, it will give a decent message if the queue is full and the lock is already taken, right?

Currently it gives a decent message on any bounded queue, even if not
full. That may be improved, I'd have to investigate how.

pitrou · 2017-08-19T08:54:37Z

Oh and:

Le 18/08/2017 à 23:26, Guido van Rossum a écrit :

I can't say I understand all of Antoine's patch, but it's probably okay to do it this way; however I would rather see if we can add _is_owned() to Lock, assuming it can be implemented using any of the threading/locking libraries we still support (I presume that's basically posix and Windows).

Regular Locks don't have the notion of an owning thread so, while we
could add it, that would be a bizarre API.

We can also detect reentrancy in get() and raise an error.

ncoghlan · 2017-08-19T10:09:18Z

Would it be feasible to change the behaviour of non-reentrant locks such that:

They *do* keep track of the owning thread
Trying to acquire them again when the current thread already has them locked raises RuntimeError instead of deadlocking the way it does now?

Then they could sensibly expose the same "_is_locked()" API as RLock, while still disallowing reentrancy by default.

pitrou · 2017-08-19T12:05:56Z

Le 19/08/2017 à 12:09, Nick Coghlan a écrit :

Would it be feasible to change the behaviour of non-reentrant locks such that:

They *do* keep track of the owning thread

Yes.

Trying to acquire them again when the current thread already has them locked raises RuntimeError instead of deadlocking the way it does now?

No. It's not a deadlock, since you can release a Lock from another thread.

pitrou · 2017-08-22T22:05:09Z

After experimenting a bit more with this approach, I now realize that the case where a get() is waiting and gets interrupted by a put() call is not handled properly: there is no obvious way for the get() call to realize (when the interruption finishes) that the queue is now non-empty and can be popped from.

So perhaps we need C code after all.

rhettinger · 2017-08-25T05:07:07Z

[Antoine Pitrou]

So perhaps we need C code after all.

This matches my experience with functools.lru_cache() where I used an RLock() to handle reentrancy. That by itself was insufficient. I also had to make otherwise unnecessary variable assignments to hang onto object references to avoid a decref triggering arbitrary Python code from reentering before the links were all in a consistent state. Further, I had to create a key wrapper to make sure a potentially reentrant __hash__() call wouldn't be made before the state was fully updated. Even then, a potentially reentrant __eq__() call couldn't be avoided, so I had to re-order the operations to make sure this was the last call after the other state updates. This defended against all normal code, but all these measures still could not defend against signals or a GC invocation of __del__, either of which can happen at any time.

On the plus side, we now have a C version of functools.lru_cache() that is protected somewhat by the GIL. On the minus side, it was hard to get right. Even with the pure python code as a model, the person who wrote the C code didn't fully think through all sources of reentrancy and wrote buggy code that shipped in 3.5 and 3.6 (resulting in normal code code triggering hard-to-reproduce reentrancy bugs). The lesson here is that while the C code can be written correctly, it isn't easy to do and it is hard to notice when it is incorrect.

One other thought: Given that __del__() can be invoked at almost any time and can potentially call any other piece of Python code, we should consider turning every lock into an rlock. Also, there should be some guidance on __del__() advising considerable restraint on what gets called. The world is likely full of pure Python code that can't defend itself against arbitrary re-entrancy.

pitrou · 2017-09-02T18:25:42Z

Just a random thought: if there was a SimpleQueue class with very basic functionality (only FIFO, only get(), put() and empty(), no size limit, no task management), it would be easier to make it reentrant using C.

(FTR, multiprocessing also has a light-weight SimpleQueue)

gvanrossum · 2017-09-02T20:01:06Z

Agreed; the Queue class has a bunch of rarely used functionality rolled
in...

Why was task management ever added?

tim-one · 2017-09-02T21:43:23Z

[Guido]

Why was task management ever added?

Raymond published a "joinable" queue class as a recipe here:

http://code.activestate.com/recipes/475160-taskqueue/

and later folded it into the standard Python queue. So the usual answer applies: "it was easy to add and sometimes useful, so it seemed like a good idea at the time" ;-)

rhettinger · 2017-09-02T22:38:55Z

[Guido]

Why was task management ever added?

See http://bugs.python.org/issue1455676

Problem being solved: How can a requestor of a service get notified when that service is complete given that the work is being done by a daemon thread that never returns.

gvanrossum · 2017-09-04T04:18:39Z

Oh well. While it is undoubtedly useful I wish we had had more experience and factored the API differently. Ditto for the maxsize=N feature.

So, while it's not too late, perhaps we should indeed follow Antoine's advice and implement a different queue that has fewer features but is guaranteed to be usable by signal handlers and GC callbacks (including __del__). The nice part here is that a queue is mostly a wrapper around a deque anyways, and deque itself is reentrant. (At least one would hope so -- Antoine's patch to Queue assumes this too, and I can't think of a reason why deque would need to release the GIL.)

rhettinger · 2017-09-05T06:19:53Z

I can't think of a reason why deque would need to release the GIL.

Yes, in deque.append(item) and deque.popleft() are atomic. Likewise list.append() and list.pop() are atomic. The heappush() and heappop() operations aren't as fortunate since they call __lt__() on arbitrary Python objects.

perhaps we should indeed follow Antoine's advice and implement
a different queue that has fewer features but is guaranteed
to be usable by signal handlers and GC callbacks
(including __del__).

Is the idea to use a regular non-reentrant lock but write the whole thing in C to avoid running any pure python instructions (any of which could allow a signal handler to run)?

If so, it seems like the only feature that needs to be dropped is subclassing. The maxsize feature and unfinished tasks tracking could still be supported. Also, I suppose the guarantees would have be marked as CPython implementation details, leaving the other implementations to fend for themselves.

Or were you thinking about a pure python simplified queue class? If so, an RLock() will likely be required (the act of assigning deque.popleft's return value to a pure python variable can allow a pending signal to be handled before the lock could be released).

One other thought: Would it make sense get() and put() to add gc.disable() and gc.enable() whenever GC is already enabled? That would eliminate a source of reentrancy.

pitrou · 2017-09-05T15:50:58Z

Would it make sense get() and put() to add gc.disable() and gc.enable() whenever GC is already enabled?

That doesn't sound very nice if some thread is waiting on a get() for a very long time (which is reasonable if you have a thread pool that's only used sporadically, for example). Also, using gc.disable() and gc.enable() in multi-threaded programs is a bit delicate.

pitrou · 2017-09-05T15:52:17Z

I don't mind someone to reimplement a full-fledged Queue in C. As for me, I am currently implementing a SimpleQueue in C that's reentrant and has only the most basic functionality.

gpshead · 2017-09-05T21:21:54Z

FYI - here is an appropriately licensed (WTFPL) very simple lock-free queue implemented in C: https://github.com/darkautism/lfqueue

pitrou · 2017-09-05T21:25:06Z

Le 05/09/2017 à 23:21, Gregory P. Smith a écrit :

FYI - here is an appropriately licensed (WTFPL) very simple lock-free queue implemented in C

Looks like it uses busy-waiting inside its get() equivalent.
Also we would need a portable version of the atomic instructions used
there (e.g. __sync_bool_compare_and_swap) :-)

gpshead · 2017-09-05T21:43:37Z

yeah, there are others such as https://www.liblfds.org/ that seem better from that standpoint (gcc, and microsoft compiler support - I'm sure clang is fine as well - anything gcc can do they can do). Ensuring they're supported across build target platforms (all the hardware technically has the ability) is the fun part if we're going to go that route.

pitrou · 2017-09-05T22:10:19Z

One tangential note about a potential full-fledged Queue in C: the pure Python implementation is fair towards consumers, as it uses a threading.Condition which is itself fair. Achieving the same thing in C may significantly complicate the implementation. Raymond would have to decide whether it's an important property to keep.

The SimpleQueue class, being a separate API, is not tied by this constraint.

rhettinger · 2017-09-05T22:19:28Z

To handle the logging.handlers.QueueHandler case, the SimpleQueue needs a put_nowait() method (see line 1959 of Lib/logging/handlers.py.

[Mike Bayer]

I noticed it's very hard to find documentation on exactly when
gc might run.

The answer I got from the other coredevs is that GC can trigger whenever a new GCable object is allocated (pretty much any container or any pure python object).

rhettinger · 2017-09-06T00:04:43Z

Just for the record, Guido explained his aversion to RLocks. Roughly: 1) RLocks are slower and more complex 2) It is difficult to correctly write code that can survive reentrancy, so it is easy fool yourself into believing you've written correct code 3) Getting a deadlock with a regular lock is a better outcome than having the invariants subtly violated, 4) Deadlocks have a more clear-cut failure mode and are easier to debug.

Guido also explained that he favored some sort of minimal queue class even if it might not handle all possible gc/weakref/signal/del induced issues. Essentially, he believes there is some value in harm reduction by minimizing the likelihood of a failure. The will help sysadmins who already have a practice of doing monitoring and occasional restarts as a way of mitigating transient issues.

pitrou · 2018-01-15T20:05:40Z

Hi all,

The PR has been ready for quite some time now. Raymond posted some review comments back in September, which I addressed by making the requested changes.

If someone wants to add their comments, now is the time. Otherwise I plan to merge the PR sometimes soon, so that it gets in before 3.7 beta.

pitrou · 2018-01-15T23:27:19Z

New changeset 94e1696 by Antoine Pitrou in branch 'master':
bpo-14976: Reentrant simple queue (bpo-3346)
94e1696

pitrou · 2018-01-15T23:27:51Z

Ok, there has been enough reviews in the last four months :-) This is now merged.

gpshead · 2018-01-16T20:00:05Z

Catalin has been examining code... switching concurrent.futures.thread to use SimpleQueue instead of Queue is probably a good idea as the queues in there get used from weakref callbacks.

pitrou · 2018-01-16T20:00:47Z

Could you open a new issue for it?

gpshead · 2018-01-17T01:04:26Z

https://bugs.python.org/issue32576 filed for that

JohanAR mannequin added type-crash A hard crash of the interpreter, possibly with a core dump stdlib Python modules in the Lib dir labels Jun 1, 2012

itamarst mannequin changed the title ~~Queue.PriorityQueue() is not interrupt safe~~ queue.Queue() is not reentrant, so signals and GC can cause deadlocks Feb 17, 2014

rhettinger closed this as completed Sep 9, 2016

1st1 reopened this Aug 17, 2017

pitrou added 3.7 (EOL) end of life type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Aug 17, 2017

pitrou closed this as completed Jan 15, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

YouJiacheng mentioned this issue Jul 13, 2022

Fix race condition for weakref destructor by catching rare exceptions. google/jax#10110

Merged

queue.Queue() is not reentrant, so signals and GC can cause deadlocks #59181

queue.Queue() is not reentrant, so signals and GC can cause deadlocks #59181

Comments

JohanAR mannequin commented Jun 1, 2012

JohanAR mannequin commented Jun 1, 2012

JohanAR mannequin commented Jun 1, 2012

sbt mannequin commented Jun 1, 2012

JohanAR mannequin commented Jun 1, 2012

itamarst mannequin commented Feb 17, 2014

zzzeek mannequin commented Sep 8, 2016

rhettinger commented Sep 9, 2016

python-dev mannequin commented Sep 9, 2016

zzzeek mannequin commented Sep 9, 2016

python-dev mannequin commented Sep 9, 2016

itamarst mannequin commented Apr 16, 2017

ncoghlan commented Aug 17, 2017

1st1 commented Aug 17, 2017

pitrou commented Aug 17, 2017

pitrou commented Aug 18, 2017

serhiy-storchaka commented Aug 18, 2017

pitrou commented Aug 18, 2017

zzzeek mannequin commented Aug 18, 2017

pitrou commented Aug 18, 2017

pitrou commented Aug 18, 2017

pitrou commented Aug 18, 2017

1st1 commented Aug 18, 2017

ncoghlan commented Aug 19, 2017

pitrou commented Aug 19, 2017

pitrou commented Aug 19, 2017

ncoghlan commented Aug 19, 2017

pitrou commented Aug 19, 2017

pitrou commented Aug 22, 2017

rhettinger commented Aug 25, 2017

pitrou commented Sep 2, 2017

gvanrossum commented Sep 2, 2017

tim-one commented Sep 2, 2017

rhettinger commented Sep 2, 2017

gvanrossum commented Sep 4, 2017

rhettinger commented Sep 5, 2017

pitrou commented Sep 5, 2017

pitrou commented Sep 5, 2017

gpshead commented Sep 5, 2017

pitrou commented Sep 5, 2017

gpshead commented Sep 5, 2017

pitrou commented Sep 5, 2017

rhettinger commented Sep 5, 2017

rhettinger commented Sep 6, 2017

pitrou commented Jan 15, 2018

pitrou commented Jan 15, 2018

pitrou commented Jan 15, 2018

gpshead commented Jan 16, 2018

pitrou commented Jan 16, 2018

gpshead commented Jan 17, 2018