Message 232987 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	vstinner
Recipients	gvanrossum, steve.dower, tim.golden, vstinner, yselivanov, zach.ware
Date	2014-12-21.00:47:44
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1419122866.29.0.725788641938.issue23095@psf.upfronthosting.co.za>
In-reply-to

Content
On Windows using the IOCP (proactor) event loop, I noticed race conditions when running the test suite of Trollius. For examples, sometimes the returncode of a process is None, which should never happen. It looks like wait_for_handle() has an invalid behaviour. When I run the test suite of asyncio in debug mode (PYTHONASYNCIODEBUG=1), sometimes I see the message "GetQueuedCompletionStatus() returned an unexpected event" which should never occur neither. I added debug traces. I saw that the IocpProactor.wait_for_handle() calls later PostQueuedCompletionStatus() through its internal C callback (PostToQueueCallback). It looks like sometimes the callback is called whereas the wait was cancelled/acked with UnregisterWait(). I didn't understand the logic between RegisterWaitForSingleObject(), UnregisterWait() and the callback. It looks like sometimes the overlapped object created in Python ("ov = _overlapped.Overlapped(NULL)") is destroyed, before PostToQueueCallback() was called. In the unit tests, it doesn't crash because a different overlapped object is created and it gets the same memory address (probably because we are lucky!). The current implementation of wait_for_handle() has an optimization: it polls immediatly the wait to check if it already completed. I tried to remove it, but I got some different issues. If I understood correctly, this optimization hides other bugs and reduce the probability of getting the race condition. wait_for_handle() in used to wait for the completion of a subprocess, so by all unit tests running subprocesses, but also in test_wait_for_handle() and test_wait_for_handle_cancel() tests. I suspect that running test_wait_for_handle() or test_wait_for_handle_cancel() schedule the bug. Note: Removing "_winapi.CloseHandle(self._iocp)" in IocpProactor.close() works around the bug. The bug looks to be an expected call to PostToQueueCallback() which calls PostQueuedCompletionStatus() on an IOCP. Not closing the IOCP means using a different IOCP for each test, so the unexpected call to PostQueuedCompletionStatus() has no effect on following tests. -- I rewrote some parts of the IOCP code in asyncio. Maybe I introduced this issue during the refactoring. Maybe it already existed before but nobody noticed it, asyncio had fewer unit tests before. At the beginning, I wanted to fix this crash: https://code.google.com/p/tulip/issues/detail?id=195 "_WaitHandleFuture.cancel() crash if the wait event was already unregistered" Later, I tried to make the code more reliable in this issue: https://code.google.com/p/tulip/issues/detail?id=196 "_OverlappedFuture.set_result() should clear the its reference to the overlapped object" Read Trollius 1.0.1 changelog which lists these changes: http://trollius.readthedocs.org/changelog.html#version-1-0-1 -- Note: The IOCP code still has code which can be enhanced: - "Investigate IocpProactor.accept_pipe() special case (don't register overlapped)" https://code.google.com/p/tulip/issues/detail?id=204 - "Rewrite IocpProactor.connect_pipe() with non-blocking calls to avoid non interruptible QueueUserWorkItem()" https://code.google.com/p/tulip/issues/detail?id=197

On Windows using the IOCP (proactor) event loop, I noticed race conditions when running the test suite of Trollius. For examples, sometimes the returncode of a process is None, which should never happen. It looks like wait_for_handle() has an invalid behaviour.

When I run the test suite of asyncio in debug mode (PYTHONASYNCIODEBUG=1), sometimes I see the message "GetQueuedCompletionStatus() returned an unexpected event" which should never occur neither.

I added debug traces. I saw that the IocpProactor.wait_for_handle() calls later PostQueuedCompletionStatus() through its internal C callback (PostToQueueCallback). It looks like sometimes the callback is called whereas the wait was cancelled/acked with UnregisterWait().

I didn't understand the logic between RegisterWaitForSingleObject(), UnregisterWait() and the callback.

It looks like sometimes the overlapped object created in Python ("ov = _overlapped.Overlapped(NULL)") is destroyed, before PostToQueueCallback() was called. In the unit tests, it doesn't crash because a different overlapped object is created and it gets the same memory address (probably because we are lucky!).

The current implementation of wait_for_handle() has an optimization: it polls immediatly the wait to check if it already completed. I tried to remove it, but I got some different issues. If I understood correctly, this optimization hides other bugs and reduce the probability of getting the race condition.

wait_for_handle() in used to wait for the completion of a subprocess, so by all unit tests running subprocesses, but also in test_wait_for_handle() and test_wait_for_handle_cancel() tests. I suspect that running test_wait_for_handle() or test_wait_for_handle_cancel() schedule the bug.

Note: Removing "_winapi.CloseHandle(self._iocp)" in IocpProactor.close() works around the bug. The bug looks to be an expected call to PostToQueueCallback() which calls PostQueuedCompletionStatus() on an IOCP. Not closing the IOCP means using a different IOCP for each test, so the unexpected call to PostQueuedCompletionStatus() has no effect on following tests.

--

I rewrote some parts of the IOCP code in asyncio. Maybe I introduced this issue during the refactoring. Maybe it already existed before but nobody noticed it, asyncio had fewer unit tests before.

At the beginning, I wanted to fix this crash:
https://code.google.com/p/tulip/issues/detail?id=195
"_WaitHandleFuture.cancel() crash if the wait event was already unregistered"

Later, I tried to make the code more reliable in this issue:
https://code.google.com/p/tulip/issues/detail?id=196
"_OverlappedFuture.set_result() should clear the its reference to the overlapped object"

Read Trollius 1.0.1 changelog which lists these changes:
http://trollius.readthedocs.org/changelog.html#version-1-0-1

--

Note: The IOCP code still has code which can be enhanced:

- "Investigate IocpProactor.accept_pipe() special case (don't register overlapped)"
  https://code.google.com/p/tulip/issues/detail?id=204

- "Rewrite IocpProactor.connect_pipe() with non-blocking calls to avoid non interruptible QueueUserWorkItem()"
  https://code.google.com/p/tulip/issues/detail?id=197

History
Date	User	Action	Args
2014-12-21 00:47:46	vstinner	set	recipients: + vstinner, gvanrossum, tim.golden, zach.ware, yselivanov, steve.dower
2014-12-21 00:47:46	vstinner	set	messageid: <1419122866.29.0.725788641938.issue23095@psf.upfronthosting.co.za>
2014-12-21 00:47:46	vstinner	link	issue23095 messages
2014-12-21 00:47:44	vstinner	create