Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] asyncio: race condition when cancelling a _WaitHandleFuture #67284

Closed
vstinner opened this issue Dec 21, 2014 · 14 comments
Closed

[Windows] asyncio: race condition when cancelling a _WaitHandleFuture #67284

vstinner opened this issue Dec 21, 2014 · 14 comments

Comments

@vstinner
Copy link
Member

BPO 23095
Nosy @gvanrossum, @vstinner, @tjguk, @zware, @1st1, @zooba

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2015-01-26.22:53:28.317>
created_at = <Date 2014-12-21.00:47:46.193>
labels = ['OS-windows', 'expert-asyncio']
title = '[Windows] asyncio: race condition when cancelling a _WaitHandleFuture'
updated_at = <Date 2015-01-26.22:53:28.316>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2015-01-26.22:53:28.316>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2015-01-26.22:53:28.317>
closer = 'vstinner'
components = ['Windows', 'asyncio']
creation = <Date 2014-12-21.00:47:46.193>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 23095
keywords = []
message_count = 14.0
messages = ['232987', '234134', '234440', '234451', '234452', '234453', '234454', '234455', '234479', '234482', '234535', '234761', '234778', '234791']
nosy_count = 8.0
nosy_names = ['gvanrossum', 'vstinner', 'tim.golden', 'python-dev', 'sbt', 'zach.ware', 'yselivanov', 'steve.dower']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue23095'
versions = ['Python 3.4', 'Python 3.5']

@vstinner
Copy link
Member Author

On Windows using the IOCP (proactor) event loop, I noticed race conditions when running the test suite of Trollius. For examples, sometimes the returncode of a process is None, which should never happen. It looks like wait_for_handle() has an invalid behaviour.

When I run the test suite of asyncio in debug mode (PYTHONASYNCIODEBUG=1), sometimes I see the message "GetQueuedCompletionStatus() returned an unexpected event" which should never occur neither.

I added debug traces. I saw that the IocpProactor.wait_for_handle() calls later PostQueuedCompletionStatus() through its internal C callback (PostToQueueCallback). It looks like sometimes the callback is called whereas the wait was cancelled/acked with UnregisterWait().

I didn't understand the logic between RegisterWaitForSingleObject(), UnregisterWait() and the callback.

It looks like sometimes the overlapped object created in Python ("ov = _overlapped.Overlapped(NULL)") is destroyed, before PostToQueueCallback() was called. In the unit tests, it doesn't crash because a different overlapped object is created and it gets the same memory address (probably because we are lucky!).

The current implementation of wait_for_handle() has an optimization: it polls immediatly the wait to check if it already completed. I tried to remove it, but I got some different issues. If I understood correctly, this optimization hides other bugs and reduce the probability of getting the race condition.

wait_for_handle() in used to wait for the completion of a subprocess, so by all unit tests running subprocesses, but also in test_wait_for_handle() and test_wait_for_handle_cancel() tests. I suspect that running test_wait_for_handle() or test_wait_for_handle_cancel() schedule the bug.

Note: Removing "_winapi.CloseHandle(self._iocp)" in IocpProactor.close() works around the bug. The bug looks to be an expected call to PostToQueueCallback() which calls PostQueuedCompletionStatus() on an IOCP. Not closing the IOCP means using a different IOCP for each test, so the unexpected call to PostQueuedCompletionStatus() has no effect on following tests.

--

I rewrote some parts of the IOCP code in asyncio. Maybe I introduced this issue during the refactoring. Maybe it already existed before but nobody noticed it, asyncio had fewer unit tests before.

At the beginning, I wanted to fix this crash:
https://code.google.com/p/tulip/issues/detail?id=195
"_WaitHandleFuture.cancel() crash if the wait event was already unregistered"

Later, I tried to make the code more reliable in this issue:
https://code.google.com/p/tulip/issues/detail?id=196
"_OverlappedFuture.set_result() should clear the its reference to the overlapped object"

Read Trollius 1.0.1 changelog which lists these changes:
http://trollius.readthedocs.org/changelog.html#version-1-0-1

--

Note: The IOCP code still has code which can be enhanced:

@vstinner vstinner changed the title asyncio: race condition in the IOCP code (proactor event loop) asyncio: race condition in IocpProactor.wait_for_handle() Dec 21, 2014
@vstinner vstinner changed the title asyncio: race condition in IocpProactor.wait_for_handle() asyncio: race condition when cancelling a _WaitHandleFuture Jan 16, 2015
@vstinner
Copy link
Member Author

The race condition occurs when _WaitHandleFuture().cancel() is called. This object is created by IocpProactor.wait_for_handle().

_WaitHandleFuture().cancel() is only called by 4 tests:

  • test_cancel_post_init()
  • test_cancel_make_subprocess_transport_exec
  • test_wait_for_handle
  • test_wait_for_handle_cancel

It looks like the "GetQueuedCompletionStatus() returned an unexpected event" error is logged when test_wait_for_handle() and/or test_wait_for_handle_cancel() is executed(). Disabling these tests is enough to workaround this issue.

Using Tulip, use "python3 runtests.py -r" to reproduce the issue. You may have to run this command multiple times to see the error, the bug is random.

@vstinner vstinner changed the title asyncio: race condition when cancelling a _WaitHandleFuture [Windows] asyncio: race condition when cancelling a _WaitHandleFuture Jan 20, 2015
@vstinner
Copy link
Member Author

To wait for the exit of the subprocess, we use RegisterWaitForSingleObject(). To cancel this wait, we can use UnregisterWait() which returns immediatly. Problem: UnregisterWait() doesn't tell us if the wait was cancelled or not, the cancellation is asynchronous. Second problem: the wait may have been signaled to the IOCP... or not. The wait may be signaled after the call to UnregisterWait(), since the cancellation is asynchronous (I'm not sure of that, but it doesn't change everything). This can be explained by the implementation: RegisterWaitForSingleObject() is implemented with a pool of threads.

Windows XP introduced UnregiterWaitEx() which can be used to be notified when the wait has been cancelled. Cool. But the notification requires an Event object. And how can we asynchronously wait for this Event? Using RegisterWaitForSingleObject()! Wait, what? We were cancelling another RegisterWaitForSingleObject().

To be fully asynchronous (no performance impact), cancelling a RegisterWaitForSingleObject() wait requires a new Event object and call RegisterWaitForSingleObject() on it.

--

In Python, we must ensure that the Overlapped object used by RegisterWaitForSingleObject() is kept alive until the wait is signalled, or until we are sure that the wait was cancelled. Otherwise, the program may crash.

To keep the Overlapped object alive, we keep indirectly in a _WaitHandleFuture object, and this future is registered in IocpProactor._cache.

I'm working on a change to use UnregiterWaitEx().

@python-dev
Copy link
Mannequin

python-dev mannequin commented Jan 21, 2015

New changeset fb8a093db8b1 by Victor Stinner in branch '3.4':
Issue bpo-23095, asyncio: Rewrite _WaitHandleFuture.cancel()
https://hg.python.org/cpython/rev/fb8a093db8b1

@vstinner
Copy link
Member Author

It took me several months to understand this issue. For the beginning of the story, see:
https://code.google.com/p/tulip/issues/detail?id=196

But I think that *this* issue can be closed: UnregisterWaitEx() really do what we need in asyncio.

I don't like the complex IocpProactor._unregister() function and _WaitCancelFuture class, but it looks that it's how we are supposed to wait until a wait for a handle is cancelled... Windows IOCP API is much complex that what I expected. It's probably because some parts (especially RegisterWaitForSingleObject()) are implemented with threads in user land, not in the kernel.

In short, I'm very happy that have fixed this very complex but also very annoying IOCP bug in asyncio.

@gvanrossum
Copy link
Member

Congrats with the fix, and thanks for your perseverance!

@python-dev
Copy link
Mannequin

python-dev mannequin commented Jan 21, 2015

New changeset d3804307cce4 by Victor Stinner in branch '3.4':
Issue bpo-23095, asyncio: IocpProactor.close() must not cancel pending
https://hg.python.org/cpython/rev/d3804307cce4

@vstinner
Copy link
Member Author

"IocpProactor.close() must not cancel pending _WaitCancelFuture futures"

FYI I found this bug when running the trollius test suite.

@vstinner
Copy link
Member Author

test_asyncio hangs on "AMD64 Windows7 SP1 3.x" buildbot:
http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.x/builds/5562

The most significant change of this build is the changeset d3804307cce4: "IocpProactor.close() must not cancel pending _WaitCancelFuture futures".

[391/391] test_asyncio
Timeout (1:00:00)!
Current thread 0x000013f8 (most recent call first):
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 617 in _unregister
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 193 in _unregister_wait_cb
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 209 in _unregister_wait
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 152 in set_result
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 671 in _poll
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\windows_events.py", line 386 in select
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\base_events.py", line 1081 in _run_once
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\base_events.py", line 258 in run_forever
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\asyncio\base_events.py", line 286 in run_until_complete
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_asyncio\test_events.py", line 1695 in test_subprocess_stderr_redirect_to_stdout
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 577 in run
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\case.py", line 625 in __call__
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\unittest\runner.py", line 168 in run
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\init.py", line 1770 in _run_suite
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\init.py", line 1804 in run_unittest
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1283 in test_runner
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1284 in runtest_inner
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 967 in runtest
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 532 in main
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1568 in main_in_temp_cwd
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\regrtest.py", line 1593 in <module>
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 85 in _run_code
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 170 in _run_module_as_main

@vstinner vstinner reopened this Jan 22, 2015
@vstinner
Copy link
Member Author

A different hang on "AMD64 Windows7 SP1 3.4/" buildbot:

http://buildbot.python.org/all/builders/AMD64%20Windows7%20SP1%203.4/builds/805/steps/test/logs/stdio

This build is also related to the changeset d3804307cce44f7f02e38166daf6d8227aa45021: "IocpProactor.close() must not cancel pending _WaitCancelFuture futures".

IMO it's not a new bug, it's just that it was not seen before. I may be related to the issue bpo-23293.

[390/390/1] test_asyncio
Timeout (1:00:00)!
Current thread 0x000005c8 (most recent call first):
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 620 in _unregister
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 196 in _unregister_wait_cb
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 212 in _unregister_wait
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 152 in set_result
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 674 in _poll
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\windows_events.py", line 389 in select
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\base_events.py", line 1081 in _run_once
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\base_events.py", line 258 in run_forever
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\asyncio\base_events.py", line 286 in run_until_complete
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\test_asyncio\test_subprocess.py", line 211 in test_pause_reading
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\case.py", line 577 in run
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\case.py", line 625 in __call__
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 125 in run
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\suite.py", line 87 in __call__
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\unittest\runner.py", line 168 in run
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\support\init.py", line 1769 in _run_suite
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\support\init.py", line 1803 in run_unittest
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1279 in test_runner
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1280 in runtest_inner
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 967 in runtest
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 532 in main
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1564 in main_in_temp_cwd
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\test\regrtest.py", line 1589 in <module>
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\runpy.py", line 85 in _run_code
File "C:\buildbot.python.org\3.4.kloth-win64\build\lib\runpy.py", li

@vstinner
Copy link
Member Author

Running "runtests.py test_cancel_post_init test_shell" and "runtests.py test_wait_for_handle test_wait_for_handle_cancel" of Tulip show a different behaviour of _WaitHandleFuture. In one case, the cancelled wait is still signaled, on other case, it's never signaled.

Currently, a cancelled _WaitHandleFuture always unregisters the overlapped operation, which causes unexpected event or may lead tests to hang. Never unregisters causes a different issue. Unregistering the overlapped indirectly delete it in memory, which is bad if the completion is still signaled.

A workaround is to keep a reference to the unregistered overlopped, but it's an ugly workaround.

@vstinner
Copy link
Member Author

Pseudo-code:
---
handle = ...
fut = proactor.wait_for_handle(handle)
fut.cancel()
loop.close()
---

Windows functions called in this code:

  • A: call RegisterWaitForSingleObject() on an handle
  • A: call UnregisterWaitEx() with an event
  • B: call RegisterWaitForSingleObject() on this event
  • B: the wait on this event completes: the completion is signaled with PostToQueueCallback() (our internal callback passed to RegisterWaitForSingleObject())
  • A: _unregister_wait_cb() is called on the future A
  • (but the wait on the object A is never signaled by PostToQueueCallback())

In short, UnregisterWaitEx() is called before the callback passed to RegisterWaitForSingleObject() is called. In this case, the callback is never called, and so the completion of the wait is never signaled.

Internally, RegisterWaitForSingleObject() is implemented as a pool of threads calling WaitForMultipleObjects(). The default limit for the pool is 500 threads. WaitForMultipleObjects() is limited to 64 objects. Each thread uses a timer and computes a timeout. The timer is awaken to cancel a wait.

@python-dev
Copy link
Mannequin

python-dev mannequin commented Jan 26, 2015

New changeset 36e80c6599aa by Victor Stinner in branch '3.4':
Issue bpo-23095, asyncio: Fix _WaitHandleFuture.cancel()
https://hg.python.org/cpython/rev/36e80c6599aa

@vstinner
Copy link
Member Author

I fixed many other issues related to the IocpProactor. This time it should be ok. I ran the Tulip and Trollius test suite a lot of times, in release and debug mode. I didn't see any warning nor hang. I also ran test_asyncio of Python as well.

I close this issue again.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants