Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asyncio: Cancelling wait() after notification leaves Condition in an inconsistent state #67159

Closed
dcoles mannequin opened this issue Dec 1, 2014 · 20 comments
Closed
Assignees
Labels
release-blocker topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@dcoles
Copy link
Mannequin

dcoles mannequin commented Dec 1, 2014

BPO 22970
Nosy @gvanrossum, @jcea, @vstinner, @larryhastings, @ned-deily, @dcoles, @1st1
Files
  • asyncio-fix-wait-cancellation-race.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/1st1'
    closed_at = <Date 2016-06-11.16:05:13.591>
    created_at = <Date 2014-12-01.03:29:41.832>
    labels = ['type-bug', 'release-blocker', 'expert-asyncio']
    title = 'asyncio: Cancelling wait() after notification leaves Condition in an inconsistent state'
    updated_at = <Date 2016-06-29.01:03:33.992>
    user = 'https://github.com/dcoles'

    bugs.python.org fields:

    activity = <Date 2016-06-29.01:03:33.992>
    actor = 'jcea'
    assignee = 'yselivanov'
    closed = True
    closed_date = <Date 2016-06-11.16:05:13.591>
    closer = 'yselivanov'
    components = ['asyncio']
    creation = <Date 2014-12-01.03:29:41.832>
    creator = 'dcoles'
    dependencies = []
    files = ['37330']
    hgrepos = []
    issue_num = 22970
    keywords = ['patch']
    message_count = 20.0
    messages = ['231912', '233694', '233696', '234044', '234046', '240278', '249723', '249746', '250044', '250046', '265424', '265429', '265476', '265492', '265493', '265531', '265534', '265703', '268220', '268221']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'jcea', 'vstinner', 'larry', 'ned.deily', 'python-dev', 'dcoles', 'yselivanov']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22970'
    versions = ['Python 3.6']

    @dcoles
    Copy link
    Mannequin Author

    dcoles mannequin commented Dec 1, 2014

    If a task that is waiting on an asyncio.Condition is cancelled after notification but before having successfully reacquired the associated lock, the acquire() will be cancelled causing wait() to return without the lock held (violating wait()'s contract and leaving the program in an inconsistent state).

    This can be reproduced in cases where there's some contention on the Condition's lock. For example:

        import asyncio
    
        loop = asyncio.get_event_loop()
        cond = asyncio.Condition()
    
        @asyncio.coroutine
        def cond_wait_timeout(condition, timeout):
          wait_task = asyncio.async(condition.wait())
          loop.call_later(timeout, wait_task.cancel)
          try:
              yield from wait_task
              return True
          except asyncio.CancelledError:
              print("Timeout (locked: {0})".format(condition.locked()))
              return False
    
        @asyncio.coroutine
        def waiter():
          yield from cond.acquire()
          try:
              print("Wait")
              if (yield from cond_wait_timeout(cond, 1)):
                  # Cause some lock contention
                  print("Do work")
                  yield from asyncio.sleep(1)
          finally:
              cond.release()
    
        @asyncio.coroutine
        def notifier():
          # Yield to the waiters
          yield from asyncio.sleep(0.1)
      yield from cond.acquire()
      try:
          print("Notify")
          cond.notify_all()
      finally:
          cond.release()
    
    loop.run_until_complete(asyncio.wait([waiter(), waiter(), notifier()]))
    

    The most straightforward fix appears to be just to have wait() retry to acquire the lock, effectively ignoring cancellation at this point (since the condition has already finished waiting and just trying to reacquire the lock before returning).

    @dcoles dcoles mannequin added topic-asyncio type-bug An unexpected behavior, bug, or error labels Dec 1, 2014
    @vstinner
    Copy link
    Member

    vstinner commented Jan 8, 2015

    I don't like the idea of ignoring exceptions (CancelledError). An option may be to store the latest exception and reraise it when the condition is acquired.

    I'm not sure that it's safe or correct to "retry" to acquire the condition.

    I don't know what I should think about this issue. I don't see any obvious and simple fix, but I agree that inconsistencies in lock primitives is a severe issue.

    @vstinner
    Copy link
    Member

    vstinner commented Jan 8, 2015

    threading.Condition.wait() implementation is very similar to asyncio.Condition.wait(), and the threading code only call acquire() once, it doesn't loop or ignore exceptions.

    Does it mean that threading.Condition.wait() has the same issue?

    @dcoles
    Copy link
    Mannequin Author

    dcoles mannequin commented Jan 15, 2015

    Hi Victor,

    (Sorry for the delay, I think GMail ate the notification)

    The main issue is that Condition.wait MUST reacquire the associated lock released on entry to wait() before returning or else the caller's assumption that it holds a lock (such as `with lock:`) will be incorrect.

    With the threading module, it's typically not possible to interrupt or cancel lock acquisition (no Thread.interrupt for example). Even on Python 3.2 where lock acquisition can be interrupted by POSIX signals, Condition.wait's implementation uses the non-interruptable lock._acquire_restore (calls rlock_acquire_restore → PyThread_acquire_lock → PyThread_acquire_lock → PyThread_acquire_lock_timed with intr=0 [Python/thread_pthread.h] which does the uninterruptable retry loop). So, not a problem for threading.Condition.wait().

    As for re-raising the CancelledError after the lock is reacquired, I'm not sure it serves any useful purpose to the caller since by this point we're either in the process of waking up after a notify() or already raising an exception (such as a CancelledError if someone called task.cancel() _before_ the condition was notified). Making the caller also handle re-acquisition failure would be quite messy.

    For example, one idea would be to have the caller check cond.locked() after a CancelledError and have the caller reacquire, but it turns in asyncio you can't tell whom owns the lock (could have been grabbed by another task).

    @dcoles
    Copy link
    Mannequin Author

    dcoles mannequin commented Jan 15, 2015

    Just for context, the reason for using a custom wait function (cond_wait_timeout) rather than the more idiomatic asyncio.wait_for(cond.wait(), timeout) is that the combination introduces another subtle, but nasty locking inconsistency (see https://groups.google.com/forum/#!topic/python-tulip/eSm7rZAe9LM ). Should probably open a ticket for that issue too.

    @dcoles
    Copy link
    Mannequin Author

    dcoles mannequin commented Apr 8, 2015

    This issue can still be reproduced on Python 3.5.0a1.
    (Triggers a "RuntimeError: Lock is not acquired" on cond.release())

    Please let me know if there's any further steps you'd like me to take here.

    @vstinner vstinner changed the title Cancelling wait() after notification leaves Condition in an inconsistent state asyncio: Cancelling wait() after notification leaves Condition in an inconsistent state Apr 9, 2015
    @larryhastings
    Copy link
    Contributor

    Is anyone going to try and fix this for 3.5.0? Or should we "kick the can down the road" and reassign it to 3.6?

    @vstinner
    Copy link
    Member

    vstinner commented Sep 4, 2015

    asyncio keeps its "provisional API" status in the Python 3.5 cycle, so this issue must not block the 3.5.0 release. It can be fixed later.

    This issue is complex, I'm not convinced that asyncio-fix-wait-cancellation-race.patch is the best fix.

    @larryhastings
    Copy link
    Contributor

    Reassigning to 3.6.

    @gvanrossum
    Copy link
    Member

    (It could be fixed in 3.5.1, since asyncio is still provisional in 3.5.)

    @ned-deily
    Copy link
    Member

    Is this still a Release Blocker for 3.6?

    @gvanrossum
    Copy link
    Member

    Presumably. :-(

    @dcoles
    Copy link
    Mannequin Author

    dcoles mannequin commented May 13, 2016

    Since I reported this, I haven't seen any proposed solutions other other than a retry loop to ensure that the lock is guaranteed to be reacquired when the sleeping coroutine is woken.

    The four possibilities of cancelling the waiting coroutine are:

    1. Before wait is scheduled
      (Because the coroutine is never run, the lock is never released and a CancelledError is thrown by the runtime)
    2. While waiting to be notified
      (CancelledError is thrown, but the 'finally' clause reacquires lock)
    3. After wait has returned
      (This is a no-op since the wait coroutine has already completed - no CancelledError is thrown)
    4. After notification, but while waiting on lock reaquisition
      (The problem discussed here)

    Cancellation is quite analogous to EINTER of system calls interrupted by signals - you'll notice that the CPython runtime will retry acquiring the lock forever https://hg.python.org/cpython/file/default/Python/thread_pthread.h#l355 . This is acceptable behaviour acording the the POSIX spec http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_timedwait.html , in particular:

    "If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup."

    "Upon successful return, the mutex shall have been locked and shall be owned by the calling thread."

    "These functions shall not return an error code of [EINTR]."

    Thus, other than throwing something like a RuntimeError, there's no way to legally return from a Condition.wait without having that lock reacquired. Thus the only option is to retry if the lock reacquire is interrupted.

    @gvanrossum
    Copy link
    Member

    Yury can you take this on?

    @1st1
    Copy link
    Member

    1st1 commented May 13, 2016

    OK, I'll see what I can do (this is some code that I don't know/use).

    @1st1
    Copy link
    Member

    1st1 commented May 14, 2016

    Hi David, thanks for working on this issue. Your patch and reasoning behind it looks correct, but it's hard to say that definitively without tests and full code review.

    Could you please submit a PR to https://github.com/python/asyncio (upstream for asyncio) with tests, and I'll do my best to review it before 3.5.2 happens.

    @dcoles
    Copy link
    Mannequin Author

    dcoles mannequin commented May 14, 2016

    Hi Yury,

    Sure - I'll create a PR along with a test that reproduces the issue. Should be able to get that done this weekend.

    @dcoles
    Copy link
    Mannequin Author

    dcoles mannequin commented May 16, 2016

    Please find the PR including a test to reproduce the issue here: python/asyncio#346

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jun 11, 2016

    New changeset 0bda9bc443ce by Yury Selivanov in branch '3.5':
    Issue bpo-22970: asyncio: Fix inconsistency cancelling Condition.wait.
    https://hg.python.org/cpython/rev/0bda9bc443ce

    New changeset 00a9de0f3fdc by Yury Selivanov in branch 'default':
    Merge 3.5 (issue bpo-22970)
    https://hg.python.org/cpython/rev/00a9de0f3fdc

    @1st1
    Copy link
    Member

    1st1 commented Jun 11, 2016

    I'm merging David's patch. I concur with David that the only viable way to fix this is to loop through the CancelledError until the lock is acquired (and Condition is in a right state).

    Thanks, David. It should make it to 3.5.2 now.

    @1st1 1st1 closed this as completed Jun 11, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker topic-asyncio type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants