msg231912 - (view) |
Author: David Coles (dcoles) * |
Date: 2014-12-01 03:29 |
If a task that is waiting on an asyncio.Condition is cancelled after notification but before having successfully reacquired the associated lock, the acquire() will be cancelled causing wait() to return without the lock held (violating wait()'s contract and leaving the program in an inconsistent state).
This can be reproduced in cases where there's some contention on the Condition's lock. For example:
import asyncio
loop = asyncio.get_event_loop()
cond = asyncio.Condition()
@asyncio.coroutine
def cond_wait_timeout(condition, timeout):
wait_task = asyncio.async(condition.wait())
loop.call_later(timeout, wait_task.cancel)
try:
yield from wait_task
return True
except asyncio.CancelledError:
print("Timeout (locked: {0})".format(condition.locked()))
return False
@asyncio.coroutine
def waiter():
yield from cond.acquire()
try:
print("Wait")
if (yield from cond_wait_timeout(cond, 1)):
# Cause some lock contention
print("Do work")
yield from asyncio.sleep(1)
finally:
cond.release()
@asyncio.coroutine
def notifier():
# Yield to the waiters
yield from asyncio.sleep(0.1)
yield from cond.acquire()
try:
print("Notify")
cond.notify_all()
finally:
cond.release()
loop.run_until_complete(asyncio.wait([waiter(), waiter(), notifier()]))
The most straightforward fix appears to be just to have wait() retry to acquire the lock, effectively ignoring cancellation at this point (since the condition has already finished waiting and just trying to reacquire the lock before returning).
|
msg233694 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2015-01-08 22:46 |
I don't like the idea of ignoring exceptions (CancelledError). An option may be to store the latest exception and reraise it when the condition is acquired.
I'm not sure that it's safe or correct to "retry" to acquire the condition.
I don't know what I should think about this issue. I don't see any obvious and simple fix, but I agree that inconsistencies in lock primitives is a severe issue.
|
msg233696 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2015-01-08 22:54 |
threading.Condition.wait() implementation is very similar to asyncio.Condition.wait(), and the threading code only call acquire() once, it doesn't loop or ignore exceptions.
Does it mean that threading.Condition.wait() has the same issue?
|
msg234044 - (view) |
Author: David Coles (dcoles) * |
Date: 2015-01-15 02:16 |
Hi Victor,
(Sorry for the delay, I think GMail ate the notification)
The main issue is that Condition.wait MUST reacquire the associated lock released on entry to wait() before returning or else the caller's assumption that it holds a lock (such as `with lock:`) will be incorrect.
With the threading module, it's typically not possible to interrupt or cancel lock acquisition (no Thread.interrupt for example). Even on Python 3.2 where lock acquisition can be interrupted by POSIX signals, Condition.wait's implementation uses the non-interruptable lock._acquire_restore (calls rlock_acquire_restore → PyThread_acquire_lock → PyThread_acquire_lock → PyThread_acquire_lock_timed with intr=0 [Python/thread_pthread.h] which does the uninterruptable retry loop). So, not a problem for threading.Condition.wait().
As for re-raising the CancelledError after the lock is reacquired, I'm not sure it serves any useful purpose to the caller since by this point we're either in the process of waking up after a notify() or already raising an exception (such as a CancelledError if someone called task.cancel() _before_ the condition was notified). Making the caller also handle re-acquisition failure would be quite messy.
For example, one idea would be to have the caller check cond.locked() after a CancelledError and have the caller reacquire, but it turns in asyncio you can't tell whom owns the lock (could have been grabbed by another task).
|
msg234046 - (view) |
Author: David Coles (dcoles) * |
Date: 2015-01-15 02:44 |
Just for context, the reason for using a custom wait function (cond_wait_timeout) rather than the more idiomatic asyncio.wait_for(cond.wait(), timeout) is that the combination introduces another subtle, but nasty locking inconsistency (see https://groups.google.com/forum/#!topic/python-tulip/eSm7rZAe9LM ). Should probably open a ticket for that issue too.
|
msg240278 - (view) |
Author: David Coles (dcoles) * |
Date: 2015-04-08 16:47 |
This issue can still be reproduced on Python 3.5.0a1.
(Triggers a "RuntimeError: Lock is not acquired" on cond.release())
Please let me know if there's any further steps you'd like me to take here.
|
msg249723 - (view) |
Author: Larry Hastings (larry) *  |
Date: 2015-09-04 07:07 |
Is anyone going to try and fix this for 3.5.0? Or should we "kick the can down the road" and reassign it to 3.6?
|
msg249746 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2015-09-04 09:11 |
asyncio keeps its "provisional API" status in the Python 3.5 cycle, so this issue must not block the 3.5.0 release. It can be fixed later.
This issue is complex, I'm not convinced that asyncio-fix-wait-cancellation-race.patch is the best fix.
|
msg250044 - (view) |
Author: Larry Hastings (larry) *  |
Date: 2015-09-07 03:56 |
Reassigning to 3.6.
|
msg250046 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2015-09-07 03:57 |
(It could be fixed in 3.5.1, since asyncio is still provisional in 3.5.)
|
msg265424 - (view) |
Author: Ned Deily (ned.deily) *  |
Date: 2016-05-12 20:34 |
Is this still a Release Blocker for 3.6?
|
msg265429 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2016-05-12 21:33 |
Presumably. :-(
|
msg265476 - (view) |
Author: David Coles (dcoles) * |
Date: 2016-05-13 13:54 |
Since I reported this, I haven't seen any proposed solutions other other than a retry loop to ensure that the lock is guaranteed to be reacquired when the sleeping coroutine is woken.
The four possibilities of cancelling the waiting coroutine are:
1. Before wait is scheduled
(Because the coroutine is never run, the lock is never released and a CancelledError is thrown by the runtime)
2. While waiting to be notified
(CancelledError is thrown, but the 'finally' clause reacquires lock)
3. After wait has returned
(This is a no-op since the wait coroutine has already completed - no CancelledError is thrown)
4. After notification, but while waiting on lock reaquisition
(The problem discussed here)
Cancellation is quite analogous to EINTER of system calls interrupted by signals - you'll notice that the CPython runtime will retry acquiring the lock forever https://hg.python.org/cpython/file/default/Python/thread_pthread.h#l355 . This is acceptable behaviour acording the the POSIX spec http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_timedwait.html , in particular:
"If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup."
"Upon successful return, the mutex shall have been locked and shall be owned by the calling thread."
"These functions shall not return an error code of [EINTR]."
Thus, other than throwing something like a RuntimeError, there's no way to legally return from a Condition.wait without having that lock reacquired. Thus the only option is to retry if the lock reacquire is interrupted.
|
msg265492 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2016-05-13 21:35 |
Yury can you take this on?
|
msg265493 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2016-05-13 21:42 |
OK, I'll see what I can do (this is some code that I don't know/use).
|
msg265531 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2016-05-14 14:26 |
Hi David, thanks for working on this issue. Your patch and reasoning behind it looks correct, but it's hard to say that definitively without tests and full code review.
Could you please submit a PR to https://github.com/python/asyncio (upstream for asyncio) with tests, and I'll do my best to review it before 3.5.2 happens.
|
msg265534 - (view) |
Author: David Coles (dcoles) * |
Date: 2016-05-14 16:56 |
Hi Yury,
Sure - I'll create a PR along with a test that reproduces the issue. Should be able to get that done this weekend.
|
msg265703 - (view) |
Author: David Coles (dcoles) * |
Date: 2016-05-16 16:01 |
Please find the PR including a test to reproduce the issue here: https://github.com/python/asyncio/pull/346
|
msg268220 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2016-06-11 16:01 |
New changeset 0bda9bc443ce by Yury Selivanov in branch '3.5':
Issue #22970: asyncio: Fix inconsistency cancelling Condition.wait.
https://hg.python.org/cpython/rev/0bda9bc443ce
New changeset 00a9de0f3fdc by Yury Selivanov in branch 'default':
Merge 3.5 (issue #22970)
https://hg.python.org/cpython/rev/00a9de0f3fdc
|
msg268221 - (view) |
Author: Yury Selivanov (yselivanov) *  |
Date: 2016-06-11 16:05 |
I'm merging David's patch. I concur with David that the only viable way to fix this is to loop through the CancelledError until the lock is acquired (and Condition is in a right state).
Thanks, David. It should make it to 3.5.2 now.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:58:10 | admin | set | github: 67159 |
2016-06-29 01:03:34 | jcea | set | nosy:
+ jcea
|
2016-06-11 16:05:13 | yselivanov | set | status: open -> closed resolution: fixed messages:
+ msg268221
stage: resolved |
2016-06-11 16:01:32 | python-dev | set | nosy:
+ python-dev messages:
+ msg268220
|
2016-05-16 16:01:58 | dcoles | set | messages:
+ msg265703 |
2016-05-14 16:56:13 | dcoles | set | messages:
+ msg265534 |
2016-05-14 14:26:42 | yselivanov | set | messages:
+ msg265531 |
2016-05-13 21:42:46 | yselivanov | set | messages:
+ msg265493 |
2016-05-13 21:35:16 | gvanrossum | set | assignee: yselivanov messages:
+ msg265492 |
2016-05-13 13:54:14 | dcoles | set | messages:
+ msg265476 |
2016-05-12 21:33:32 | gvanrossum | set | messages:
+ msg265429 |
2016-05-12 20:34:42 | ned.deily | set | nosy:
+ ned.deily messages:
+ msg265424
|
2015-09-07 03:57:24 | gvanrossum | set | messages:
+ msg250046 |
2015-09-07 03:56:03 | larry | set | priority: deferred blocker -> release blocker
messages:
+ msg250044 versions:
+ Python 3.6, - Python 3.4, Python 3.5 |
2015-09-04 09:11:45 | vstinner | set | messages:
+ msg249746 |
2015-09-04 07:07:16 | larry | set | nosy:
+ larry messages:
+ msg249723
|
2015-05-19 04:40:00 | yselivanov | set | priority: normal -> deferred blocker |
2015-04-09 20:24:17 | vstinner | set | title: Cancelling wait() after notification leaves Condition in an inconsistent state -> asyncio: Cancelling wait() after notification leaves Condition in an inconsistent state |
2015-04-08 16:47:25 | dcoles | set | messages:
+ msg240278 versions:
+ Python 3.5 |
2015-01-15 02:44:45 | dcoles | set | messages:
+ msg234046 |
2015-01-15 02:16:20 | dcoles | set | messages:
+ msg234044 |
2015-01-08 22:54:37 | vstinner | set | messages:
+ msg233696 |
2015-01-08 22:46:09 | vstinner | set | messages:
+ msg233694 |
2014-12-01 03:29:41 | dcoles | create | |