New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Windows] asyncio: race condition when cancelling a _WaitHandleFuture #67284
Comments
On Windows using the IOCP (proactor) event loop, I noticed race conditions when running the test suite of Trollius. For examples, sometimes the returncode of a process is None, which should never happen. It looks like wait_for_handle() has an invalid behaviour. When I run the test suite of asyncio in debug mode (PYTHONASYNCIODEBUG=1), sometimes I see the message "GetQueuedCompletionStatus() returned an unexpected event" which should never occur neither. I added debug traces. I saw that the IocpProactor.wait_for_handle() calls later PostQueuedCompletionStatus() through its internal C callback (PostToQueueCallback). It looks like sometimes the callback is called whereas the wait was cancelled/acked with UnregisterWait(). I didn't understand the logic between RegisterWaitForSingleObject(), UnregisterWait() and the callback. It looks like sometimes the overlapped object created in Python ("ov = _overlapped.Overlapped(NULL)") is destroyed, before PostToQueueCallback() was called. In the unit tests, it doesn't crash because a different overlapped object is created and it gets the same memory address (probably because we are lucky!). The current implementation of wait_for_handle() has an optimization: it polls immediatly the wait to check if it already completed. I tried to remove it, but I got some different issues. If I understood correctly, this optimization hides other bugs and reduce the probability of getting the race condition. wait_for_handle() in used to wait for the completion of a subprocess, so by all unit tests running subprocesses, but also in test_wait_for_handle() and test_wait_for_handle_cancel() tests. I suspect that running test_wait_for_handle() or test_wait_for_handle_cancel() schedule the bug. Note: Removing "_winapi.CloseHandle(self._iocp)" in IocpProactor.close() works around the bug. The bug looks to be an expected call to PostToQueueCallback() which calls PostQueuedCompletionStatus() on an IOCP. Not closing the IOCP means using a different IOCP for each test, so the unexpected call to PostQueuedCompletionStatus() has no effect on following tests. -- I rewrote some parts of the IOCP code in asyncio. Maybe I introduced this issue during the refactoring. Maybe it already existed before but nobody noticed it, asyncio had fewer unit tests before. At the beginning, I wanted to fix this crash: Later, I tried to make the code more reliable in this issue: Read Trollius 1.0.1 changelog which lists these changes: -- Note: The IOCP code still has code which can be enhanced:
|
The race condition occurs when _WaitHandleFuture().cancel() is called. This object is created by IocpProactor.wait_for_handle(). _WaitHandleFuture().cancel() is only called by 4 tests:
It looks like the "GetQueuedCompletionStatus() returned an unexpected event" error is logged when test_wait_for_handle() and/or test_wait_for_handle_cancel() is executed(). Disabling these tests is enough to workaround this issue. Using Tulip, use "python3 runtests.py -r" to reproduce the issue. You may have to run this command multiple times to see the error, the bug is random. |
To wait for the exit of the subprocess, we use RegisterWaitForSingleObject(). To cancel this wait, we can use UnregisterWait() which returns immediatly. Problem: UnregisterWait() doesn't tell us if the wait was cancelled or not, the cancellation is asynchronous. Second problem: the wait may have been signaled to the IOCP... or not. The wait may be signaled after the call to UnregisterWait(), since the cancellation is asynchronous (I'm not sure of that, but it doesn't change everything). This can be explained by the implementation: RegisterWaitForSingleObject() is implemented with a pool of threads. Windows XP introduced UnregiterWaitEx() which can be used to be notified when the wait has been cancelled. Cool. But the notification requires an Event object. And how can we asynchronously wait for this Event? Using RegisterWaitForSingleObject()! Wait, what? We were cancelling another RegisterWaitForSingleObject(). To be fully asynchronous (no performance impact), cancelling a RegisterWaitForSingleObject() wait requires a new Event object and call RegisterWaitForSingleObject() on it. -- In Python, we must ensure that the Overlapped object used by RegisterWaitForSingleObject() is kept alive until the wait is signalled, or until we are sure that the wait was cancelled. Otherwise, the program may crash. To keep the Overlapped object alive, we keep indirectly in a _WaitHandleFuture object, and this future is registered in IocpProactor._cache. I'm working on a change to use UnregiterWaitEx(). |
New changeset fb8a093db8b1 by Victor Stinner in branch '3.4': |
It took me several months to understand this issue. For the beginning of the story, see: But I think that *this* issue can be closed: UnregisterWaitEx() really do what we need in asyncio. I don't like the complex IocpProactor._unregister() function and _WaitCancelFuture class, but it looks that it's how we are supposed to wait until a wait for a handle is cancelled... Windows IOCP API is much complex that what I expected. It's probably because some parts (especially RegisterWaitForSingleObject()) are implemented with threads in user land, not in the kernel. In short, I'm very happy that have fixed this very complex but also very annoying IOCP bug in asyncio. |
Congrats with the fix, and thanks for your perseverance! |
New changeset d3804307cce4 by Victor Stinner in branch '3.4': |
"IocpProactor.close() must not cancel pending _WaitCancelFuture futures" FYI I found this bug when running the trollius test suite. |
test_asyncio hangs on "AMD64 Windows7 SP1 3.x" buildbot: The most significant change of this build is the changeset d3804307cce4: "IocpProactor.close() must not cancel pending _WaitCancelFuture futures". [391/391] test_asyncio |
A different hang on "AMD64 Windows7 SP1 3.4/" buildbot: This build is also related to the changeset d3804307cce44f7f02e38166daf6d8227aa45021: "IocpProactor.close() must not cancel pending _WaitCancelFuture futures". IMO it's not a new bug, it's just that it was not seen before. I may be related to the issue bpo-23293. [390/390/1] test_asyncio |
Running "runtests.py test_cancel_post_init test_shell" and "runtests.py test_wait_for_handle test_wait_for_handle_cancel" of Tulip show a different behaviour of _WaitHandleFuture. In one case, the cancelled wait is still signaled, on other case, it's never signaled. Currently, a cancelled _WaitHandleFuture always unregisters the overlapped operation, which causes unexpected event or may lead tests to hang. Never unregisters causes a different issue. Unregistering the overlapped indirectly delete it in memory, which is bad if the completion is still signaled. A workaround is to keep a reference to the unregistered overlopped, but it's an ugly workaround. |
Pseudo-code: Windows functions called in this code:
In short, UnregisterWaitEx() is called before the callback passed to RegisterWaitForSingleObject() is called. In this case, the callback is never called, and so the completion of the wait is never signaled. Internally, RegisterWaitForSingleObject() is implemented as a pool of threads calling WaitForMultipleObjects(). The default limit for the pool is 500 threads. WaitForMultipleObjects() is limited to 64 objects. Each thread uses a timer and computes a timeout. The timer is awaken to cancel a wait. |
New changeset 36e80c6599aa by Victor Stinner in branch '3.4': |
I fixed many other issues related to the IocpProactor. This time it should be ok. I ran the Tulip and Trollius test suite a lot of times, in release and debug mode. I didn't see any warning nor hang. I also ran test_asyncio of Python as well. I close this issue again. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: