Issue 35662: Windows #define _PY_EMULATED_WIN_CV 0 bug

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79843

classification

Title:	Windows #define _PY_EMULATED_WIN_CV 0 bug
Type:	crash	Stage:	resolved
Components:	Windows	Versions:	Python 3.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Enable optimized locks on Windows View: 29871
Assigned To:		Nosy List:	jeffr@livedata.com, paul.moore, steve.dower, tim.golden, zach.ware
Priority:	normal	Keywords:

Created on 2019-01-05 03:25 by jeffr@livedata.com, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
thread_test2.py	jeffr@livedata.com, 2019-01-05 03:25

Messages (9)
msg333036 - (view)	Author: Jeff Robbins (jeffr@livedata.com) *	Date: 2019-01-05 03:25
Python 3.x defaults to using emulated condition variables on Windows. I tested a build with native Windows condition variables (#define _PY_EMULATED_WIN_CV 0), and found a serious issue. The problem is in condvar.h, in this routine: /* This implementation makes no distinction about timeouts. Signal * 2 to indicate that we don't know. / Py_LOCAL_INLINE(int) PyCOND_TIMEDWAIT(PyCOND_T cv, PyMUTEX_T cs, long long us) { return SleepConditionVariableSRW(cv, cs, (DWORD)(us/1000), 0) ? 2 : -1; } The issue is that `SleepConditionVariableSRW` returns FALSE in the timeout case. PyCOND_TIMEDWAIT returns -1 in that case. But... COND_TIMED_WAIT, which calls PyCOND_TIMEDWAIT, in ceval_gil.h, fatals(!) on a negative return value #define COND_TIMED_WAIT(cond, mut, microseconds, timeout_result) \ { \ int r = PyCOND_TIMEDWAIT(&(cond), &(mut), (microseconds)); \ if (r < 0) \ Py_FatalError("PyCOND_WAIT(" #cond ") failed"); \ I'd like to suggest that we use the documented behavior of the OS API call already being used (SleepConditionVariableSRW https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-sleepconditionvariablesrw) and return 0 on regular success and 1 on timeout, like in the _POSIX_THREADS case. """ Return Value If the function succeeds, the return value is nonzero. If the function fails, the return value is zero. To get extended error information, call GetLastError. If the timeout expires the function returns FALSE and GetLastError returns ERROR_TIMEOUT. """ I've tested this rewrite -- the main difference is in the FALSE case, check GetLastError() for ERROR_TIMEOUT and then do not* treat this as a fatal error. /* * PyCOND_TIMEDWAIT, in addition to returning negative on error, * thus returns 0 on regular success, 1 on timeout / Py_LOCAL_INLINE(int) PyCOND_TIMEDWAIT(PyCOND_T cv, PyMUTEX_T *cs, long long us) { BOOL result = SleepConditionVariableSRW(cv, cs, (DWORD)(us / 1000), 0); if (result) return 0; if (GetLastError() == ERROR_TIMEOUT) return 1; return -1; } I've attached the test I ran to reproduce the crash.
msg333044 - (view)	Author: Steve Dower (steve.dower) *	Date: 2019-01-05 06:11
There's an existing issue for this somewhere - we've tried a couple times to switch over and run into various issues. I'm not in a place to find it right now, but worth looking.
msg333046 - (view)	Author: Jeff Robbins (jeffr@livedata.com) *	Date: 2019-01-05 07:59
I did a search and couldn't find exactly this issue. This issue is about a broken function. It is broken because it treats a timeout as a fatal error which crashes your Python program. I supplied a proposed fix for the function. If there are other known issues or tests, happy to dig in. Seems a shame that Python 3 on Windows needs to be running on emulated condition variables when the OS has (apparently) working actual ones. Jeff On Sat, Jan 5, 2019, 1:11 AM Steve Dower <report@bugs.python.org wrote: > > Steve Dower <steve.dower@python.org> added the comment: > > There's an existing issue for this somewhere - we've tried a couple times > to switch over and run into various issues. I'm not in a place to find it > right now, but worth looking. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue35662> > _______________________________________ >
msg333049 - (view)	Author: Jeff Robbins (jeffr@livedata.com) *	Date: 2019-01-05 08:46
I searched harder. :-) https://bugs.python.org/issue29871 I see that someone else already noticed this broken function, but I guess it was left broken because of other issues with using condition variables instead of the emulated ones? Still, the code is wrong as written... Jeff On Sat, Jan 5, 2019, 1:11 AM Steve Dower <report@bugs.python.org wrote: > > Steve Dower <steve.dower@python.org> added the comment: > > There's an existing issue for this somewhere - we've tried a couple times > to switch over and run into various issues. I'm not in a place to find it > right now, but worth looking. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <https://bugs.python.org/issue35662> > _______________________________________ >
msg333725 - (view)	Author: Steve Dower (steve.dower) *	Date: 2019-01-15 18:31
It's broken, but unused. And the entire section needs fixing before it can be used, which necessitates fixing this function. So issue 29871 covers this sufficiently (though I'll post a link back to this one for the added context on this particular issue).
msg336256 - (view)	Author: Jeff Robbins (jeffr@livedata.com) *	Date: 2019-02-21 19:51
Steve, I did some more digging into why the native condition variable approach might be causing problems on Windows. Testing my fix revealed that there is at least one place in Modules\overlapped.c that either a) waits too long to call GetLastError(), or b) reveals an expectation that Py_END_ALLOW_THREADS won't change the results of GetLastError(). Py_BEGIN_ALLOW_THREADS ret = GetQueuedCompletionStatus(CompletionPort, &NumberOfBytes, &CompletionKey, &Overlapped, Milliseconds); save_err = GetLastError(); Py_END_ALLOW_THREADS err = ret ? ERROR_SUCCESS : GetLastError(); The problem in this code is that it allows other Windows API calls between the original Windows API call (in this case GetQueuedCompletionStatus()) and the call to GetLastError(). If those other Windows API calls change the thread-specific GetLastError state, the info we need is lost. To test for this possibility, I added a diagnostic test right after the code above if (!ret && (err != save_err)) { printf("GetQueuedCompletionStatus returned 0 but we lost the error=%d lost=%d Overlapped=%d\n", save_err, err, (long)Overlapped); } and ran a test that eventually produced this on the console: GetQueuedCompletionStatus returned 0 but we lost the error=258 lost=0 Overlapped=0 error 258 is WAIT_TIMEOUT. The next lines of code are looking for that "error" in order to decide if GetQueuedCompletionStatus failed, or merely timed out. if (Overlapped == NULL) { if (err == WAIT_TIMEOUT) Py_RETURN_NONE; else return SetFromWindowsErr(err); } So the impact of this problem is severe. Instead of returning None to the caller (in this case _poll in asyncio\windows_events.py), it will raise an error! while True: status = _overlapped.GetQueuedCompletionStatus(self._iocp, ms) if status is None: break And, to make things extra confusing, the error raised via SetFromWindowsErr(err) (where err == 0) ends up looking like this: OSError: [WinError 0] The operation completed successfully A valid WAIT_TIMEOUT thus gets converted to a Python error, but also loses the original Windows Error Code of 258, so you are left scratching your head about how a WinError 0 (ERROR_SUCCESS) could have crashed your call to, say, asyncio.run()? (See traceback below.) So either we need to make sure that all calls to GetLastError() are made immediately after the relevant Windows API call, without any intervening other Windows API calls, and thereby prevent case a) above, or as in case b), the GIL code (using either emulated or native condition variables from condvar.h) needs to preserve the Error state. Some code in Python\thread_nt.h in fact does this already, e.g. void * PyThread_get_key_value(int key) { /* because TLS is used in the Py_END_ALLOW_THREAD macro, * it is necessary to preserve the windows error state, because * it is assumed to be preserved across the call to the macro. * Ideally, the macro should be fixed, but it is simpler to * do it here. / DWORD error = GetLastError(); void result = TlsGetValue(key); SetLastError(error); return result; } Of course there might be other problems associated with using native condition variables on Windows, but this is the only one I've experienced after some fairly heavy use of Python 3.7.2 asyncio on Windows. traceback: ------------------------------------------------------------------------------------------------------------------------ asyncio.run(self.main()) File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\runners.py", line 43, in run return loop.run_until_complete(main) File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\base_events.py", line 571, in run_until_complete self.run_forever() File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\base_events.py", line 539, in run_forever self._run_once() File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\base_events.py", line 1739, in _run_once event_list = self._selector.select(timeout) File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\windows_events.py", line 405, in select self._poll(timeout) File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\windows_events.py", line 703, in _poll status = _overlapped.GetQueuedCompletionStatus(self._iocp, ms) OSError: [WinError 0] The operation completed successfully
msg336263 - (view)	Author: Steve Dower (steve.dower) *	Date: 2019-02-21 21:46
> reveals an expectation that Py_END_ALLOW_THREADS won't change the results of GetLastError() Fantastic work, Jeff! That's almost certainly the major problem there - Py_END_ALLOW_THREADS can totally change the error code, and we haven't ever done a full sweep to check it. Feel free to send a PR against issue29871 with your changes. If the tests are happy, then I am too.
msg336266 - (view)	Author: Jeff Robbins (jeffr@livedata.com) *	Date: 2019-02-21 22:17
Steve, sorry to be dense, but I'm unfortunately ignorant as to what tests I ought to be running. The only test I have right now is much too complicated, and I'd rather be running some official regression test that reveals the problem without my app code, if possible.
msg336276 - (view)	Author: Steve Dower (steve.dower) *	Date: 2019-02-22 00:57
The regular test suite ought to be enough - see devguide.python.org for the info. It was definitely failing in multiprocessing last time I tried this. You could also just push changes and start a PR, as that will run the tests automatically.

History
Date	User	Action	Args
2022-04-11 14:59:09	admin	set	github: 79843
2019-02-22 13:47:52	vstinner	set	nosy: - vstinner
2019-02-22 00:57:38	steve.dower	set	messages: + msg336276
2019-02-21 22:17:47	jeffr@livedata.com	set	messages: + msg336266
2019-02-21 21:46:58	steve.dower	set	messages: + msg336263
2019-02-21 19:51:34	jeffr@livedata.com	set	messages: + msg336256
2019-01-15 18:31:20	steve.dower	set	status: open -> closed superseder: Enable optimized locks on Windows messages: + msg333725 resolution: duplicate stage: resolved
2019-01-07 14:56:10	vstinner	set	nosy: + vstinner
2019-01-05 08:46:48	jeffr@livedata.com	set	messages: + msg333049
2019-01-05 07:59:13	jeffr@livedata.com	set	messages: + msg333046
2019-01-05 06:11:45	steve.dower	set	messages: + msg333044
2019-01-05 03:25:25	jeffr@livedata.com	create