classification
Title: [Windows] asyncio: race condition when cancelling a _WaitHandleFuture
Type: Stage:
Components: asyncio, Windows Versions: Python 3.4, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: gvanrossum, python-dev, sbt, steve.dower, tim.golden, vstinner, yselivanov, zach.ware
Priority: normal Keywords:

Created on 2014-12-21 00:47 by vstinner, last changed 2015-01-26 22:53 by vstinner. This issue is now closed.

Messages (14)
msg232987 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-12-21 00:47
On Windows using the IOCP (proactor) event loop, I noticed race conditions when running the test suite of Trollius. For examples, sometimes the returncode of a process is None, which should never happen. It looks like wait_for_handle() has an invalid behaviour.

When I run the test suite of asyncio in debug mode (PYTHONASYNCIODEBUG=1), sometimes I see the message "GetQueuedCompletionStatus() returned an unexpected event" which should never occur neither.

I added debug traces. I saw that the IocpProactor.wait_for_handle() calls later PostQueuedCompletionStatus() through its internal C callback (PostToQueueCallback). It looks like sometimes the callback is called whereas the wait was cancelled/acked with UnregisterWait().

I didn't understand the logic between RegisterWaitForSingleObject(), UnregisterWait() and the callback.

It looks like sometimes the overlapped object created in Python ("ov = _overlapped.Overlapped(NULL)") is destroyed, before PostToQueueCallback() was called. In the unit tests, it doesn't crash because a different overlapped object is created and it gets the same memory address (probably because we are lucky!).

The current implementation of wait_for_handle() has an optimization: it polls immediatly the wait to check if it already completed. I tried to remove it, but I got some different issues. If I understood correctly, this optimization hides other bugs and reduce the probability of getting the race condition.

wait_for_handle() in used to wait for the completion of a subprocess, so by all unit tests running subprocesses, but also in test_wait_for_handle() and test_wait_for_handle_cancel() tests. I suspect that running test_wait_for_handle() or test_wait_for_handle_cancel() schedule the bug.

Note: Removing "_winapi.CloseHandle(self._iocp)" in IocpProactor.close() works around the bug. The bug looks to be an expected call to PostToQueueCallback() which calls PostQueuedCompletionStatus() on an IOCP. Not closing the IOCP means using a different IOCP for each test, so the unexpected call to PostQueuedCompletionStatus() has no effect on following tests.

--

I rewrote some parts of the IOCP code in asyncio. Maybe I introduced this issue during the refactoring. Maybe it already existed before but nobody noticed it, asyncio had fewer unit tests before.

At the beginning, I wanted to fix this crash:
https://code.google.com/p/tulip/issues/detail?id=195
"_WaitHandleFuture.cancel() crash if the wait event was already unregistered"

Later, I tried to make the code more reliable in this issue:
https://code.google.com/p/tulip/issues/detail?id=196
"_OverlappedFuture.set_result() should clear the its reference to the overlapped object"

Read Trollius 1.0.1 changelog which lists these changes:
http://trollius.readthedocs.org/changelog.html#version-1-0-1

--

Note: The IOCP code still has code which can be enhanced:

- "Investigate IocpProactor.accept_pipe() special case (don't register overlapped)"
  https://code.google.com/p/tulip/issues/detail?id=204

- "Rewrite IocpProactor.connect_pipe() with non-blocking calls to avoid non interruptible QueueUserWorkItem()"
  https://code.google.com/p/tulip/issues/detail?id=197
msg234134 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-16 16:35
The race condition occurs when _WaitHandleFuture().cancel() is called. This object is created by IocpProactor.wait_for_handle().

_WaitHandleFuture().cancel() is only called by 4 tests:

- test_cancel_post_init()
- test_cancel_make_subprocess_transport_exec
- test_wait_for_handle
- test_wait_for_handle_cancel

It looks like the "GetQueuedCompletionStatus() returned an unexpected event" error is logged when test_wait_for_handle() and/or test_wait_for_handle_cancel() is executed(). Disabling these tests is enough to workaround this issue.

Using Tulip, use "python3 runtests.py -r" to reproduce the issue. You may have to run this command multiple times to see the error, the bug is random.
msg234440 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-21 16:35
To wait for the exit of the subprocess, we use RegisterWaitForSingleObject(). To cancel this wait, we can use UnregisterWait() which returns immediatly. Problem: UnregisterWait() doesn't tell us if the wait was cancelled or not, the cancellation is asynchronous. Second problem: the wait may have been signaled to the IOCP... or not. The wait may be signaled after the call to UnregisterWait(), since the cancellation is asynchronous (I'm not sure of that, but it doesn't change everything). This can be explained by the implementation: RegisterWaitForSingleObject() is implemented with a pool of threads.

Windows XP introduced UnregiterWaitEx() which can be used to be notified when the wait has been cancelled. Cool. But the notification requires an Event object. And how can we asynchronously wait for this Event? Using RegisterWaitForSingleObject()! Wait, what? We were cancelling another RegisterWaitForSingleObject().

To be fully asynchronous (no performance impact), cancelling a RegisterWaitForSingleObject() wait requires a new Event object and call RegisterWaitForSingleObject() on it.

--

In Python, we must ensure that the Overlapped object used by RegisterWaitForSingleObject() is kept alive until the wait is signalled, or until we are sure that the wait was cancelled. Otherwise, the program may crash.

To keep the Overlapped object alive, we keep indirectly in a _WaitHandleFuture object, and this future is registered in IocpProactor._cache.

I'm working on a change to use UnregiterWaitEx().
msg234451 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-01-21 22:42
New changeset fb8a093db8b1 by Victor Stinner in branch '3.4':
Issue #23095, asyncio: Rewrite _WaitHandleFuture.cancel()
https://hg.python.org/cpython/rev/fb8a093db8b1
msg234452 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-21 22:46
It took me several months to understand this issue. For the beginning of the story, see:
https://code.google.com/p/tulip/issues/detail?id=196

But I think that *this* issue can be closed: UnregisterWaitEx() really do what we need in asyncio.

I don't like the complex IocpProactor._unregister() function and _WaitCancelFuture class, but it looks that it's how we are supposed to wait until a wait for a handle is cancelled... Windows IOCP API is much complex that what I expected. It's probably because some parts (especially RegisterWaitForSingleObject()) are implemented with threads in user land, not in the kernel.

In short, I'm very happy that have fixed this very complex but also very annoying IOCP bug in asyncio.
msg234453 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2015-01-21 22:53
Congrats with the fix, and thanks for your perseverance!
msg234454 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-01-21 23:19
New changeset d3804307cce4 by Victor Stinner in branch '3.4':
Issue #23095, asyncio: IocpProactor.close() must not cancel pending
https://hg.python.org/cpython/rev/d3804307cce4
msg234455 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-21 23:22
"IocpProactor.close() must not cancel pending _WaitCancelFuture futures"

FYI I found this bug when running the trollius test suite.
msg234479 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-22 08:01
test_asyncio hangs on "AMD64 Windows7 SP1 3.x" buildbot:
http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/5562

The most significant change of this build is the changeset d3804307cce4: "IocpProactor.close() must not cancel pending _WaitCancelFuture futures".

[391/391] test_asyncio
Timeout (1:00:00)!
Current thread 0x000013f8 (most recent call first):
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 617 in _unregister
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 193 in _unregister_wait_cb
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 209 in _unregister_wait
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 152 in set_result
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 671 in _poll
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 386 in select
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\base_events.py", line 1081 in _run_once
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\base_events.py", line 258 in run_forever
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\base_events.py", line 286 in run_until_complete
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_asyncio\test_events.py", line 1695 in test_subprocess_stderr_redirect_to_stdout
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 577 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 625 in __call__
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\runner.py", line 168 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1770 in _run_suite
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1804 in run_unittest
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1283 in test_runner
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1284 in runtest_inner
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 967 in runtest
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 532 in main
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1568 in main_in_temp_cwd
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1593 in <module>
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 85 in _run_code
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 170 in _run_module_as_main
msg234482 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-22 08:16
A different hang on "AMD64 Windows7 SP1 3.4/" buildbot:

http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.4/builds/805/steps/test/logs/stdio

This build is also related to the changeset d3804307cce44f7f02e38166daf6d8227aa45021: "IocpProactor.close() must not cancel pending _WaitCancelFuture futures".

IMO it's not a new bug, it's just that it was not seen before. I may be related to the issue #23293.

[390/390/1] test_asyncio
Timeout (1:00:00)!
Current thread 0x000005c8 (most recent call first):
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 620 in _unregister
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 196 in _unregister_wait_cb
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 212 in _unregister_wait
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 152 in set_result
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 674 in _poll
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 389 in select
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\base_events.py", line 1081 in _run_once
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\base_events.py", line 258 in run_forever
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\base_events.py", line 286 in run_until_complete
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\test_asyncio\test_subprocess.py", line 211 in test_pause_reading
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\case.py", line 577 in run
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\case.py", line 625 in __call__
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\runner.py", line 168 in run
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\support\__init__.py", line 1769 in _run_suite
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\support\__init__.py", line 1803 in run_unittest
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1279 in test_runner
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1280 in runtest_inner
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 967 in runtest
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 532 in main
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1564 in main_in_temp_cwd
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1589 in <module>
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\runpy.py", line 85 in _run_code
  File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\runpy.py", li
msg234535 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-23 01:36
Running "runtests.py test_cancel_post_init test_shell" and "runtests.py test_wait_for_handle test_wait_for_handle_cancel" of Tulip show a different behaviour of _WaitHandleFuture. In one case, the cancelled wait is still signaled, on other case, it's never signaled.

Currently, a cancelled _WaitHandleFuture always unregisters the overlapped operation, which causes unexpected event or may lead tests to hang. Never unregisters causes a different issue. Unregistering the overlapped indirectly delete it in memory, which is bad if the completion is still signaled.

A workaround is to keep a reference to the unregistered overlopped, but it's an ugly workaround.
msg234761 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-26 16:58
Pseudo-code:
---
handle = ...
fut = proactor.wait_for_handle(handle)
fut.cancel()
loop.close()
---

Windows functions called in this code:

- A: call RegisterWaitForSingleObject() on an handle
- A: call UnregisterWaitEx() with an event
- B: call RegisterWaitForSingleObject() on this event
- B: the wait on this event completes: the completion is signaled with PostToQueueCallback() (our internal callback passed to RegisterWaitForSingleObject())
- A: _unregister_wait_cb() is called on the future A
- (but the wait on the object A is never signaled by PostToQueueCallback())

In short, UnregisterWaitEx() is called before the callback passed to RegisterWaitForSingleObject() is called. In this case, the callback is never called, and so the completion of the wait is never signaled.

Internally, RegisterWaitForSingleObject() is implemented as a pool of threads calling WaitForMultipleObjects(). The default limit for the pool is 500 threads. WaitForMultipleObjects() is limited to 64 objects. Each thread uses a timer and computes a timeout. The timer is awaken to cancel a wait.
msg234778 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-01-26 21:32
New changeset 36e80c6599aa by Victor Stinner in branch '3.4':
Issue #23095, asyncio: Fix _WaitHandleFuture.cancel()
https://hg.python.org/cpython/rev/36e80c6599aa
msg234791 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-26 22:53
I fixed many other issues related to the IocpProactor. This time it should be ok. I ran the Tulip and Trollius test suite a lot of times, in release and debug mode. I didn't see any warning nor hang. I also ran test_asyncio of Python as well.

I close this issue again.
History
Date User Action Args
2015-01-26 22:53:28vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg234791
2015-01-26 21:32:54python-devsetmessages: + msg234778
2015-01-26 16:58:52vstinnersetmessages: + msg234761
2015-01-23 01:36:28vstinnersetmessages: + msg234535
2015-01-22 08:16:53vstinnersetmessages: + msg234482
2015-01-22 08:01:24vstinnersetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg234479
2015-01-21 23:22:01vstinnersetmessages: + msg234455
2015-01-21 23:19:33python-devsetmessages: + msg234454
2015-01-21 22:53:59gvanrossumsetmessages: + msg234453
2015-01-21 22:46:32vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg234452
2015-01-21 22:42:01python-devsetnosy: + python-dev
messages: + msg234451
2015-01-21 16:35:39vstinnersetmessages: + msg234440
2015-01-20 21:27:50vstinnersettitle: asyncio: race condition when cancelling a _WaitHandleFuture -> [Windows] asyncio: race condition when cancelling a _WaitHandleFuture
2015-01-16 16:35:00vstinnersetmessages: + msg234134
2015-01-16 16:12:28vstinnersettitle: asyncio: race condition in IocpProactor.wait_for_handle() -> asyncio: race condition when cancelling a _WaitHandleFuture
2014-12-21 02:28:49pitrousetnosy: + sbt
2014-12-21 00:48:08vstinnersettitle: asyncio: race condition in the IOCP code (proactor event loop) -> asyncio: race condition in IocpProactor.wait_for_handle()
2014-12-21 00:47:46vstinnercreate