classification
Title: asyncio: Cancelling wait() after notification leaves Condition in an inconsistent state
Type: behavior Stage: resolved
Components: asyncio Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: yselivanov Nosy List: dcoles, gvanrossum, jcea, larry, ned.deily, python-dev, vstinner, yselivanov
Priority: release blocker Keywords: patch

Created on 2014-12-01 03:29 by dcoles, last changed 2016-06-29 01:03 by jcea. This issue is now closed.

Files
File name Uploaded Description Edit
asyncio-fix-wait-cancellation-race.patch dcoles, 2014-12-01 03:29 review
Messages (20)
msg231912 - (view) Author: David Coles (dcoles) * Date: 2014-12-01 03:29
If a task that is waiting on an asyncio.Condition is cancelled after notification but before having successfully reacquired the associated lock, the acquire() will be cancelled causing wait() to return without the lock held (violating wait()'s contract and leaving the program in an inconsistent state).

This can be reproduced in cases where there's some contention on the Condition's lock. For example:

    import asyncio

    loop = asyncio.get_event_loop()
    cond = asyncio.Condition()

    @asyncio.coroutine
    def cond_wait_timeout(condition, timeout):
      wait_task = asyncio.async(condition.wait())
      loop.call_later(timeout, wait_task.cancel)
      try:
          yield from wait_task
          return True
      except asyncio.CancelledError:
          print("Timeout (locked: {0})".format(condition.locked()))
          return False

    @asyncio.coroutine
    def waiter():
      yield from cond.acquire()
      try:
          print("Wait")
          if (yield from cond_wait_timeout(cond, 1)):
              # Cause some lock contention
              print("Do work")
              yield from asyncio.sleep(1)
      finally:
          cond.release()

    @asyncio.coroutine
    def notifier():
      # Yield to the waiters
      yield from asyncio.sleep(0.1)

      yield from cond.acquire()
      try:
          print("Notify")
          cond.notify_all()
      finally:
          cond.release()

    loop.run_until_complete(asyncio.wait([waiter(), waiter(), notifier()]))


The most straightforward fix appears to be just to have wait() retry to acquire the lock, effectively ignoring cancellation at this point (since the condition has already finished waiting and just trying to reacquire the lock before returning).
msg233694 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-08 22:46
I don't like the idea of ignoring exceptions (CancelledError). An option may be to store the latest exception and reraise it when the condition is acquired.

I'm not sure that it's safe or correct to "retry" to acquire the condition.

I don't know what I should think about this issue. I don't see any obvious and simple fix, but I agree that inconsistencies in lock primitives is a severe issue.
msg233696 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-08 22:54
threading.Condition.wait() implementation is very similar to asyncio.Condition.wait(), and the threading code only call acquire() once, it doesn't loop or ignore exceptions.

Does it mean that threading.Condition.wait() has the same issue?
msg234044 - (view) Author: David Coles (dcoles) * Date: 2015-01-15 02:16
Hi Victor,

(Sorry for the delay, I think GMail ate the notification)

The main issue is that Condition.wait MUST reacquire the associated lock released on entry to wait() before returning or else the caller's assumption that it holds a lock (such as `with lock:`) will be incorrect.

With the threading module, it's typically not possible to interrupt or cancel lock acquisition (no Thread.interrupt for example). Even on Python 3.2 where lock acquisition can be interrupted by POSIX signals, Condition.wait's implementation uses the non-interruptable lock._acquire_restore (calls rlock_acquire_restore →  PyThread_acquire_lock → PyThread_acquire_lock → PyThread_acquire_lock_timed with intr=0 [Python/thread_pthread.h] which does the uninterruptable retry loop). So, not a problem for threading.Condition.wait().

As for re-raising the CancelledError after the lock is reacquired, I'm not sure it serves any useful purpose to the caller since by this point we're either in the process of waking up after a notify() or already raising an exception (such as a CancelledError if someone called task.cancel() _before_ the condition was notified). Making the caller also handle re-acquisition failure would be quite messy.

For example, one idea would be to have the caller check cond.locked() after a CancelledError and have the caller reacquire, but it turns in asyncio you can't tell whom owns the lock (could have been grabbed by another task).
msg234046 - (view) Author: David Coles (dcoles) * Date: 2015-01-15 02:44
Just for context, the reason for using a custom wait function (cond_wait_timeout) rather than the more idiomatic asyncio.wait_for(cond.wait(), timeout) is that the combination introduces another subtle, but nasty locking inconsistency (see https://groups.google.com/forum/#!topic/python-tulip/eSm7rZAe9LM ). Should probably open a ticket for that issue too.
msg240278 - (view) Author: David Coles (dcoles) * Date: 2015-04-08 16:47
This issue can still be reproduced on Python 3.5.0a1.
(Triggers a "RuntimeError: Lock is not acquired" on cond.release())

Please let me know if there's any further steps you'd like me to take here.
msg249723 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-09-04 07:07
Is anyone going to try and fix this for 3.5.0?  Or should we "kick the can down the road" and reassign it to 3.6?
msg249746 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-09-04 09:11
asyncio keeps its "provisional API" status in the Python 3.5 cycle, so this issue must not block the 3.5.0 release. It can be fixed later.

This issue is complex, I'm not convinced that asyncio-fix-wait-cancellation-race.patch is the best fix.
msg250044 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-09-07 03:56
Reassigning to 3.6.
msg250046 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2015-09-07 03:57
(It could be fixed in 3.5.1, since asyncio is still provisional in 3.5.)
msg265424 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-05-12 20:34
Is this still a Release Blocker for 3.6?
msg265429 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-05-12 21:33
Presumably. :-(
msg265476 - (view) Author: David Coles (dcoles) * Date: 2016-05-13 13:54
Since I reported this, I haven't seen any proposed solutions other other than a retry loop to ensure that the lock is guaranteed to be reacquired when the sleeping coroutine is woken.

The four possibilities of cancelling the waiting coroutine are:

1. Before wait is scheduled
   (Because the coroutine is never run, the lock is never released and a CancelledError is thrown by the runtime)
2. While waiting to be notified
   (CancelledError is thrown, but the 'finally' clause reacquires lock)
3. After wait has returned
   (This is a no-op since the wait coroutine has already completed - no CancelledError is thrown)
4. After notification, but while waiting on lock reaquisition
   (The problem discussed here)

Cancellation is quite analogous to EINTER of system calls interrupted by signals - you'll notice that the CPython runtime will retry acquiring the lock forever https://hg.python.org/cpython/file/default/Python/thread_pthread.h#l355 . This is acceptable behaviour acording the the POSIX spec http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_timedwait.html , in particular:

"If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup."

"Upon successful return, the mutex shall have been locked and shall be owned by the calling thread."

"These functions shall not return an error code of [EINTR]."

Thus, other than throwing something like a RuntimeError, there's no way to legally return from a Condition.wait without having that lock reacquired. Thus the only option is to retry if the lock reacquire is interrupted.
msg265492 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-05-13 21:35
Yury can you take this on?
msg265493 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-05-13 21:42
OK, I'll see what I can do (this is some code that I don't know/use).
msg265531 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-05-14 14:26
Hi David, thanks for working on this issue.  Your patch and reasoning behind it looks correct, but it's hard to say that definitively without tests and full code review.

Could you please submit a PR to https://github.com/python/asyncio (upstream for asyncio) with tests, and I'll do my best to review it before 3.5.2 happens.
msg265534 - (view) Author: David Coles (dcoles) * Date: 2016-05-14 16:56
Hi Yury,

Sure - I'll create a PR along with a test that reproduces the issue. Should be able to get that done this weekend.
msg265703 - (view) Author: David Coles (dcoles) * Date: 2016-05-16 16:01
Please find the PR including a test to reproduce the issue here: https://github.com/python/asyncio/pull/346
msg268220 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-06-11 16:01
New changeset 0bda9bc443ce by Yury Selivanov in branch '3.5':
Issue #22970: asyncio: Fix inconsistency cancelling Condition.wait.
https://hg.python.org/cpython/rev/0bda9bc443ce

New changeset 00a9de0f3fdc by Yury Selivanov in branch 'default':
Merge 3.5 (issue #22970)
https://hg.python.org/cpython/rev/00a9de0f3fdc
msg268221 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-06-11 16:05
I'm merging David's patch.  I concur with David that the only viable way to fix this is to loop through the CancelledError until the lock is acquired (and Condition is in a right state).

Thanks, David.  It should make it to 3.5.2 now.
History
Date User Action Args
2016-06-29 01:03:34jceasetnosy: + jcea
2016-06-11 16:05:13yselivanovsetstatus: open -> closed
resolution: fixed
messages: + msg268221

stage: resolved
2016-06-11 16:01:32python-devsetnosy: + python-dev
messages: + msg268220
2016-05-16 16:01:58dcolessetmessages: + msg265703
2016-05-14 16:56:13dcolessetmessages: + msg265534
2016-05-14 14:26:42yselivanovsetmessages: + msg265531
2016-05-13 21:42:46yselivanovsetmessages: + msg265493
2016-05-13 21:35:16gvanrossumsetassignee: yselivanov
messages: + msg265492
2016-05-13 13:54:14dcolessetmessages: + msg265476
2016-05-12 21:33:32gvanrossumsetmessages: + msg265429
2016-05-12 20:34:42ned.deilysetnosy: + ned.deily
messages: + msg265424
2015-09-07 03:57:24gvanrossumsetmessages: + msg250046
2015-09-07 03:56:03larrysetpriority: deferred blocker -> release blocker

messages: + msg250044
versions: + Python 3.6, - Python 3.4, Python 3.5
2015-09-04 09:11:45vstinnersetmessages: + msg249746
2015-09-04 07:07:16larrysetnosy: + larry
messages: + msg249723
2015-05-19 04:40:00yselivanovsetpriority: normal -> deferred blocker
2015-04-09 20:24:17vstinnersettitle: Cancelling wait() after notification leaves Condition in an inconsistent state -> asyncio: Cancelling wait() after notification leaves Condition in an inconsistent state
2015-04-08 16:47:25dcolessetmessages: + msg240278
versions: + Python 3.5
2015-01-15 02:44:45dcolessetmessages: + msg234046
2015-01-15 02:16:20dcolessetmessages: + msg234044
2015-01-08 22:54:37vstinnersetmessages: + msg233696
2015-01-08 22:46:09vstinnersetmessages: + msg233694
2014-12-01 03:29:41dcolescreate