Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

queue.Queue() is not reentrant, so signals and GC can cause deadlocks #59181

Closed
JohanAR mannequin opened this issue Jun 1, 2012 · 51 comments
Closed

queue.Queue() is not reentrant, so signals and GC can cause deadlocks #59181

JohanAR mannequin opened this issue Jun 1, 2012 · 51 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@JohanAR
Copy link
Mannequin

JohanAR mannequin commented Jun 1, 2012

BPO 14976
Nosy @gvanrossum, @tim-one, @rhettinger, @gpshead, @ncoghlan, @pitrou, @serhiy-storchaka, @1st1, @applio
PRs
  • bpo-14976: Reentrant simple queue #3346
  • Files
  • queue_deadlock.py: signal hander reading/writing a queue
  • queue_sendint.c: send 2 signals to specified process in rapid succession
  • queuebug2.py: gc-based-queue-problem.py
  • q_reentrancy.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-01-15.23:27:50.989>
    created_at = <Date 2012-06-01.08:43:03.911>
    labels = ['3.7', 'type-bug', 'library']
    title = 'queue.Queue() is not reentrant, so signals and GC can cause deadlocks'
    updated_at = <Date 2018-01-17.01:04:26.008>
    user = 'https://bugs.python.org/JohanAR'

    bugs.python.org fields:

    activity = <Date 2018-01-17.01:04:26.008>
    actor = 'gregory.p.smith'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-01-15.23:27:50.989>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2012-06-01.08:43:03.911>
    creator = 'JohanAR'
    dependencies = []
    files = ['25786', '25787', '46806', '47092']
    hgrepos = []
    issue_num = 14976
    keywords = ['patch']
    message_count = 51.0
    messages = ['162062', '162065', '162069', '162076', '211436', '275181', '275377', '275404', '275421', '275485', '291764', '300421', '300422', '300455', '300478', '300489', '300491', '300499', '300504', '300505', '300509', '300510', '300511', '300532', '300567', '300574', '300575', '300577', '300581', '300723', '300825', '301170', '301173', '301175', '301176', '301207', '301299', '301321', '301322', '301378', '301379', '301380', '301383', '301386', '301409', '310014', '310025', '310026', '310109', '310111', '310125']
    nosy_count = 15.0
    nosy_names = ['gvanrossum', 'tim.peters', 'rhettinger', 'gregory.p.smith', 'ncoghlan', 'pitrou', 'zzzeek', 'python-dev', 'sbt', 'serhiy.storchaka', 'JohanAR', 'yselivanov', 'itamarst', 'davin', 'Catalin Patulea']
    pr_nums = ['3346']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue14976'
    versions = ['Python 3.7']

    @JohanAR
    Copy link
    Mannequin Author

    JohanAR mannequin commented Jun 1, 2012

    Python 2.7.3 (default, Apr 20 2012, 22:39:59)
    [GCC 4.6.3] on linux2

    If a signal handler calls Queue.PriorityQueue.put and a second signal is received while one is already being processed, neither of the calls to put will terminate.

    Highly dependent on timing so it might be difficult to reproduce.

    @JohanAR JohanAR mannequin added type-crash A hard crash of the interpreter, possibly with a core dump stdlib Python modules in the Lib dir labels Jun 1, 2012
    @JohanAR
    Copy link
    Mannequin Author

    JohanAR mannequin commented Jun 1, 2012

    Start queue_deadlock.py in one terminal and note the PID it prints.

    Compile queue_sendint.c in another terminal and execute it with previous PID as only argument.

    If the bug is triggered, nothing will print in the python terminal window when you press Enter. To terminate the application you can press Ctrl-\

    @sbt
    Copy link
    Mannequin

    sbt mannequin commented Jun 1, 2012

    I don't think there is anything special about PriorityQueue.

    There is a similar concerning the use of the Python implementation of RLock in signal handlers -- see http://bugs.python.org/issue13697.

    Maybe the signal handler should temporarily mask or ignore SIGINT while it runs.

    @JohanAR
    Copy link
    Mannequin Author

    JohanAR mannequin commented Jun 1, 2012

    I did read some more on the subject and it seems like using locks with interrupts is in general a very difficult problem and not limited to Python.

    Don't know if this is considered common knowledge among programmers or if it would be useful with at least a warning on the signal manual page.

    @itamarst
    Copy link
    Mannequin

    itamarst mannequin commented Feb 17, 2014

    This is not specifically a signal issue; it can happen with garbage collection as well if you have a Queue.put that runs in __del__ or a weakref callback function.

    This can happen in real code. In my case, a thread that reads log messages from a Queue and writes them to disk. The thread putting log messages into the Queue can deadlock if GC happens to cause a log message to be written right after Queue.put() acquired the lock.
    (see itamarst/crochet#25).

    @itamarst itamarst mannequin changed the title Queue.PriorityQueue() is not interrupt safe queue.Queue() is not reentrant, so signals and GC can cause deadlocks Feb 17, 2014
    @zzzeek
    Copy link
    Mannequin

    zzzeek mannequin commented Sep 8, 2016

    SQLAlchemy suffered from this issue long ago as we use a Queue for connections, which can be collected via weakref callback and sent back to put(), which we observed can occur via gc. For many years (like since 2007 or so) we've packaged a complete copy of Queue.Queue with an RLock inside of it to work around it.

    I'm just noticing this issue because I'm writing a new connection pool implementation that tries to deal with this issue a little differently (not using Queue.Queue though).

    @rhettinger
    Copy link
    Contributor

    Guido opposes having us going down the path of using an RLock and altering the queue module to cope with reentrancy.

    In the case of mixing signals with threads, we all agree with Johan Aires Rastén that "using locks with interrupts is in general a very difficult problem and not limited to Python", nor is it even limited to the queue module.

    I believe that the usual technique for mixing signals and threads is to have the signal handler do the minimal work necessary to record the event without interacting with the rest of the program. Later, another thread can act on the recorded signal but can do so with normal use of locks so that code invariants are not violated. This design makes reasoning about the program more tractable.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 9, 2016

    New changeset 137806ca59ce by Gregory P. Smith in branch '3.5':
    Add a note about queue not being safe for use from signal handlers.
    https://hg.python.org/cpython/rev/137806ca59ce

    New changeset 4e15e7618715 by Gregory P. Smith in branch 'default':
    Add a note about queue not being safe for use from signal handlers.
    https://hg.python.org/cpython/rev/4e15e7618715

    @zzzeek
    Copy link
    Mannequin

    zzzeek mannequin commented Sep 9, 2016

    yep, that's what im doing in my approach. though longer term thing, I noticed it's very hard to find documentation on exactly when gc might run. E.g. would it ever run if I did something innocuous, like "self.thread_id = None" (probably not). Just wasn't easy to get that answer.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 9, 2016

    New changeset 8c00cbbd3ff9 by Raymond Hettinger in branch '3.5':
    bpo-14976: Note that the queue module is not designed to protect against reentrancy
    https://hg.python.org/cpython/rev/8c00cbbd3ff9

    @itamarst
    Copy link
    Mannequin

    itamarst mannequin commented Apr 16, 2017

    This bug was closed on the basis that signals + threads don't interact well. Which is a good point.

    Unfortunately this bug can happen in cases that have nothing to do with signals. If you look at the title and some of the comments it also happens as a result of garbage collection: in __del__ methods and weakref callbacks.

    Specifically, garbage collection can happen on any bytecode boundary and cause reentrancy problems with Queue.

    The attached file is an attempt to demonstrate this: it runs and runs until GC happens and then deadlocks for me (on Python 3.6). I.e. it prints GC! and after that no further output is printed and no CPU usage reported.

    Please re-open this bug: if you don't want to fix signal case that's fine, but the GC case is still an issue.

    @ncoghlan
    Copy link
    Contributor

    Itamar wrote up a post describing the GC variant of this problem in more detail: https://codewithoutrules.com/2017/08/16/concurrency-python/

    In particular, he highlighted a particularly nasty action-at-a-distance variant of the deadlock where:

    1. Someone registers a logging.handlers.QueueHandler instance with the logging system
    2. One or more types in the application or libraries it uses call logging functions in a __del__ method or a weakref callback
    3. A GC cycle triggers while a log message is already being processed and hence the thread already holds the queue's put() lock
    4. Things deadlock because the put() operation isn't re-entrant

    As far as I can see, there's no application level way of resolving that short of "Only register logging.handlers.QueueHandler with a logger you completely control and hence can ensure is never used in a __del__ method or weakref callback", which doesn't feel like a reasonable restriction to place on the safe use of a standard library logging handler.

    @1st1
    Copy link
    Member

    1st1 commented Aug 17, 2017

    An idea from the blog post: if we rewrite queue in C it will use the GIL as a lock which will fix this particular bug. I can make a patch.

    @1st1 1st1 reopened this Aug 17, 2017
    @pitrou
    Copy link
    Member

    pitrou commented Aug 17, 2017

    Using any kind of potentially-blocking synchronization primitive from __del__ or weakref callback is indeed a bug waiting for happen. I agree non-trivial cases can be hard to debug, especially when people don't expect that kind of cause.

    It would be ok to submit a patch solving this issue using C code IMHO. Note the maxsize argument complicates things even though most uses of Queue don't use maxsize.

    Of course, the general issue isn't only about Queue: other synchronization primitives are widely used. See tornadoweb/tornado#1876 for an example.

    @pitrou pitrou added 3.7 (EOL) end of life type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Aug 17, 2017
    @pitrou
    Copy link
    Member

    pitrou commented Aug 18, 2017

    Here is a pure Python PoC patch that allows unbounded Queue and LifoQueue to have reentrant put().

    @serhiy-storchaka
    Copy link
    Member

    Are all changes are necessary for this issue or some of them are just an optimization? The patch changes the semantic of public attribute unfinished_tasks. Seems that old unfinished_tasks is new unfinished_tasks + _qsize().

    @pitrou
    Copy link
    Member

    pitrou commented Aug 18, 2017

    unfinished_tasks is not a public attribute AFAIK (it's not documented).

    The change is necessary: you cannot increment unfinished_tasks in reentrant put(), since incrementing in pure Python is not atomic. So the incrementation is moved to get(), which probably cannot be made reentrant at all.

    If keeping the visible semantics of the unfinished_tasks attribute is important, we could make it a property that computes the desired value.

    @zzzeek
    Copy link
    Mannequin

    zzzeek mannequin commented Aug 18, 2017

    Here is a pure Python PoC patch that allows unbounded Queue and LifoQueue to have reentrant put().

    per http://bugs.python.org/msg275377 guido does not want an RLock here.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 18, 2017

    I'll believe it when Guido chimes in and actually says so himself. Otherwise, I am skeptical Guido cares a lot about whether the internal implementation of Queue uses a Lock, a RLock or whatever else.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 18, 2017

    Guido, apparently you are opposed to the Queue implementation using a RLock instead of a Lock, according to Raymond in https://bugs.python.org/issue14976#msg275377. I find this a bit surprising, so could you confirm it yourself?

    @pitrou
    Copy link
    Member

    pitrou commented Aug 18, 2017

    Also, to elaborate a bit, I don't think we should aim to make Queue fully reentrant, as that would be extremely involved. I think we can settle on the simpler goal of making put() reentrant for unbounded LIFO and FIFO queues, which is what most people care about (and which is incidentally what the posted patch claims to do).

    @1st1
    Copy link
    Member

    1st1 commented Aug 18, 2017

    Here is a pure Python PoC patch that allows unbounded Queue and LifoQueue to have reentrant put().

    Is it guaranteed that the GC will happen in the same thread that is holding the lock? IOW will RLock help with all GC/del deadlocking scenarios?

    @ncoghlan
    Copy link
    Contributor

    +1 for treating Queue.put() specifically as the case to be handled, as that's the mechanism that can be used to *avoid* running complex operations directly in __del__ methods and weakref callbacks.

    For testing purposes, the current deadlock can be reliably reproduced with sys.settrace:

    >>> import sys
    >>> import queue
    >>> the_queue=queue.Queue()
    >>> counter = 0
    >>> def bad_trace(*args):
    ...     global counter
    ...     counter += 1
    ...     print(counter)
    ...     the_queue.put(counter)
    ...     return bad_trace
    ... 
    >>> sys.settrace(bad_trace)
    >>> the_queue.put(None)
    1
    2
    3
    4
    5
    6
    7
    [and here we have a deadlock]
    

    @pitrou
    Copy link
    Member

    pitrou commented Aug 19, 2017

    Le 18/08/2017 à 23:26, Guido van Rossum a écrit :

    IIUC the end result would be a Queue whose put() works from signal handlers, GC callbacks and __del__, as long as it's unbounded, right?

    Yes.

    And when it *is* bounded, it will give a decent message if the queue is full and the lock is already taken, right?

    Currently it gives a decent message on any bounded queue, even if not
    full. That may be improved, I'd have to investigate how.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 19, 2017

    Oh and:

    Le 18/08/2017 à 23:26, Guido van Rossum a écrit :

    I can't say I understand all of Antoine's patch, but it's probably okay to do it this way; however I would rather see if we can add _is_owned() to Lock, assuming it can be implemented using any of the threading/locking libraries we still support (I presume that's basically posix and Windows).

    Regular Locks don't have the notion of an owning thread so, while we
    could add it, that would be a bizarre API.

    We can also detect reentrancy in get() and raise an error.

    @ncoghlan
    Copy link
    Contributor

    Would it be feasible to change the behaviour of non-reentrant locks such that:

    1. They *do* keep track of the owning thread
    2. Trying to acquire them again when the current thread already has them locked raises RuntimeError instead of deadlocking the way it does now?

    Then they could sensibly expose the same "_is_locked()" API as RLock, while still disallowing reentrancy by default.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 19, 2017

    Le 19/08/2017 à 12:09, Nick Coghlan a écrit :

    Would it be feasible to change the behaviour of non-reentrant locks such that:

    1. They *do* keep track of the owning thread

    Yes.

    1. Trying to acquire them again when the current thread already has them locked raises RuntimeError instead of deadlocking the way it does now?

    No. It's not a deadlock, since you can release a Lock from another thread.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 22, 2017

    After experimenting a bit more with this approach, I now realize that the case where a get() is waiting and gets interrupted by a put() call is not handled properly: there is no obvious way for the get() call to realize (when the interruption finishes) that the queue is now non-empty and can be popped from.

    So perhaps we need C code after all.

    @rhettinger
    Copy link
    Contributor

    [Antoine Pitrou]

    So perhaps we need C code after all.

    This matches my experience with functools.lru_cache() where I used an RLock() to handle reentrancy. That by itself was insufficient. I also had to make otherwise unnecessary variable assignments to hang onto object references to avoid a decref triggering arbitrary Python code from reentering before the links were all in a consistent state. Further, I had to create a key wrapper to make sure a potentially reentrant __hash__() call wouldn't be made before the state was fully updated. Even then, a potentially reentrant __eq__() call couldn't be avoided, so I had to re-order the operations to make sure this was the last call after the other state updates. This defended against all normal code, but all these measures still could not defend against signals or a GC invocation of __del__, either of which can happen at any time.

    On the plus side, we now have a C version of functools.lru_cache() that is protected somewhat by the GIL. On the minus side, it was hard to get right. Even with the pure python code as a model, the person who wrote the C code didn't fully think through all sources of reentrancy and wrote buggy code that shipped in 3.5 and 3.6 (resulting in normal code code triggering hard-to-reproduce reentrancy bugs). The lesson here is that while the C code can be written correctly, it isn't easy to do and it is hard to notice when it is incorrect.

    One other thought: Given that __del__() can be invoked at almost any time and can potentially call any other piece of Python code, we should consider turning every lock into an rlock. Also, there should be some guidance on __del__() advising considerable restraint on what gets called. The world is likely full of pure Python code that can't defend itself against arbitrary re-entrancy.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 2, 2017

    Just a random thought: if there was a SimpleQueue class with very basic functionality (only FIFO, only get(), put() and empty(), no size limit, no task management), it would be easier to make it reentrant using C.

    (FTR, multiprocessing also has a light-weight SimpleQueue)

    @gvanrossum
    Copy link
    Member

    Agreed; the Queue class has a bunch of rarely used functionality rolled
    in...

    Why was task management ever added?

    @tim-one
    Copy link
    Member

    tim-one commented Sep 2, 2017

    [Guido]

    Why was task management ever added?

    Raymond published a "joinable" queue class as a recipe here:

    http://code.activestate.com/recipes/475160-taskqueue/

    and later folded it into the standard Python queue. So the usual answer applies: "it was easy to add and sometimes useful, so it seemed like a good idea at the time" ;-)

    @rhettinger
    Copy link
    Contributor

    [Guido]

    Why was task management ever added?

    See http://bugs.python.org/issue1455676

    Problem being solved: How can a requestor of a service get notified when that service is complete given that the work is being done by a daemon thread that never returns.

    @gvanrossum
    Copy link
    Member

    Oh well. While it is undoubtedly useful I wish we had had more experience and factored the API differently. Ditto for the maxsize=N feature.

    So, while it's not too late, perhaps we should indeed follow Antoine's advice and implement a different queue that has fewer features but is guaranteed to be usable by signal handlers and GC callbacks (including __del__). The nice part here is that a queue is mostly a wrapper around a deque anyways, and deque itself is reentrant. (At least one would hope so -- Antoine's patch to Queue assumes this too, and I can't think of a reason why deque would need to release the GIL.)

    @rhettinger
    Copy link
    Contributor

    I can't think of a reason why deque would need to release the GIL.

    Yes, in deque.append(item) and deque.popleft() are atomic. Likewise list.append() and list.pop() are atomic. The heappush() and heappop() operations aren't as fortunate since they call __lt__() on arbitrary Python objects.

    perhaps we should indeed follow Antoine's advice and implement
    a different queue that has fewer features but is guaranteed
    to be usable by signal handlers and GC callbacks
    (including __del__).

    Is the idea to use a regular non-reentrant lock but write the whole thing in C to avoid running any pure python instructions (any of which could allow a signal handler to run)?

    If so, it seems like the only feature that needs to be dropped is subclassing. The maxsize feature and unfinished tasks tracking could still be supported. Also, I suppose the guarantees would have be marked as CPython implementation details, leaving the other implementations to fend for themselves.

    Or were you thinking about a pure python simplified queue class? If so, an RLock() will likely be required (the act of assigning deque.popleft's return value to a pure python variable can allow a pending signal to be handled before the lock could be released).

    One other thought: Would it make sense get() and put() to add gc.disable() and gc.enable() whenever GC is already enabled? That would eliminate a source of reentrancy.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 5, 2017

    Would it make sense get() and put() to add gc.disable() and gc.enable() whenever GC is already enabled?

    That doesn't sound very nice if some thread is waiting on a get() for a very long time (which is reasonable if you have a thread pool that's only used sporadically, for example). Also, using gc.disable() and gc.enable() in multi-threaded programs is a bit delicate.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 5, 2017

    I don't mind someone to reimplement a full-fledged Queue in C. As for me, I am currently implementing a SimpleQueue in C that's reentrant and has only the most basic functionality.

    @gpshead
    Copy link
    Member

    gpshead commented Sep 5, 2017

    FYI - here is an appropriately licensed (WTFPL) very simple lock-free queue implemented in C: https://github.com/darkautism/lfqueue

    @pitrou
    Copy link
    Member

    pitrou commented Sep 5, 2017

    Le 05/09/2017 à 23:21, Gregory P. Smith a écrit :

    FYI - here is an appropriately licensed (WTFPL) very simple lock-free queue implemented in C

    Looks like it uses busy-waiting inside its get() equivalent.
    Also we would need a portable version of the atomic instructions used
    there (e.g. __sync_bool_compare_and_swap) :-)

    @gpshead
    Copy link
    Member

    gpshead commented Sep 5, 2017

    yeah, there are others such as https://www.liblfds.org/ that seem better from that standpoint (gcc, and microsoft compiler support - I'm sure clang is fine as well - anything gcc can do they can do). Ensuring they're supported across build target platforms (all the hardware technically has the ability) is the fun part if we're going to go that route.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 5, 2017

    One tangential note about a potential full-fledged Queue in C: the pure Python implementation is fair towards consumers, as it uses a threading.Condition which is itself fair. Achieving the same thing in C may significantly complicate the implementation. Raymond would have to decide whether it's an important property to keep.

    The SimpleQueue class, being a separate API, is not tied by this constraint.

    @rhettinger
    Copy link
    Contributor

    To handle the logging.handlers.QueueHandler case, the SimpleQueue needs a put_nowait() method (see line 1959 of Lib/logging/handlers.py.

    [Mike Bayer]

    I noticed it's very hard to find documentation on exactly when
    gc might run.

    The answer I got from the other coredevs is that GC can trigger whenever a new GCable object is allocated (pretty much any container or any pure python object).

    @rhettinger
    Copy link
    Contributor

    Just for the record, Guido explained his aversion to RLocks. Roughly: 1) RLocks are slower and more complex 2) It is difficult to correctly write code that can survive reentrancy, so it is easy fool yourself into believing you've written correct code 3) Getting a deadlock with a regular lock is a better outcome than having the invariants subtly violated, 4) Deadlocks have a more clear-cut failure mode and are easier to debug.

    Guido also explained that he favored some sort of minimal queue class even if it might not handle all possible gc/weakref/signal/del induced issues. Essentially, he believes there is some value in harm reduction by minimizing the likelihood of a failure. The will help sysadmins who already have a practice of doing monitoring and occasional restarts as a way of mitigating transient issues.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 15, 2018

    Hi all,

    The PR has been ready for quite some time now. Raymond posted some review comments back in September, which I addressed by making the requested changes.

    If someone wants to add their comments, now is the time. Otherwise I plan to merge the PR sometimes soon, so that it gets in before 3.7 beta.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 15, 2018

    New changeset 94e1696 by Antoine Pitrou in branch 'master':
    bpo-14976: Reentrant simple queue (bpo-3346)
    94e1696

    @pitrou
    Copy link
    Member

    pitrou commented Jan 15, 2018

    Ok, there has been enough reviews in the last four months :-) This is now merged.

    @pitrou pitrou closed this as completed Jan 15, 2018
    @gpshead
    Copy link
    Member

    gpshead commented Jan 16, 2018

    Catalin has been examining code... switching concurrent.futures.thread to use SimpleQueue instead of Queue is probably a good idea as the queues in there get used from weakref callbacks.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 16, 2018

    Could you open a new issue for it?

    @gpshead
    Copy link
    Member

    gpshead commented Jan 17, 2018

    https://bugs.python.org/issue32576 filed for that

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    8 participants