This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Windows #define _PY_EMULATED_WIN_CV 0 bug
Type: crash Stage: resolved
Components: Windows Versions: Python 3.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Enable optimized locks on Windows
View: 29871
Assigned To: Nosy List: jeffr@livedata.com, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal Keywords:

Created on 2019-01-05 03:25 by jeffr@livedata.com, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
thread_test2.py jeffr@livedata.com, 2019-01-05 03:25
Messages (9)
msg333036 - (view) Author: Jeff Robbins (jeffr@livedata.com) * Date: 2019-01-05 03:25
Python 3.x defaults to using emulated condition variables on Windows.  I tested a build with native Windows condition variables (#define _PY_EMULATED_WIN_CV 0), and found a serious issue.

The problem is in condvar.h, in this routine:


/* This implementation makes no distinction about timeouts.  Signal
 * 2 to indicate that we don't know.
 */
Py_LOCAL_INLINE(int)
PyCOND_TIMEDWAIT(PyCOND_T *cv, PyMUTEX_T *cs, long long us)
{
    return SleepConditionVariableSRW(cv, cs, (DWORD)(us/1000), 0) ? 2 : -1;
}


The issue is that `SleepConditionVariableSRW` returns FALSE in the timeout case.  PyCOND_TIMEDWAIT returns -1 in that case. 

But... COND_TIMED_WAIT, which calls PyCOND_TIMEDWAIT, in ceval_gil.h, fatals(!) on a negative return value


#define COND_TIMED_WAIT(cond, mut, microseconds, timeout_result) \
    { \
        int r = PyCOND_TIMEDWAIT(&(cond), &(mut), (microseconds)); \
        if (r < 0) \
            Py_FatalError("PyCOND_WAIT(" #cond ") failed"); \



I'd like to suggest that we use the documented behavior of the OS API call already being used (SleepConditionVariableSRW https://docs.microsoft.com/en-us/windows/desktop/api/synchapi/nf-synchapi-sleepconditionvariablesrw) and return 0 on regular success and 1 on timeout, like in the _POSIX_THREADS case.  

"""
Return Value
If the function succeeds, the return value is nonzero.

If the function fails, the return value is zero. To get extended error information, call GetLastError.

If the timeout expires the function returns FALSE and GetLastError returns ERROR_TIMEOUT.
"""

I've tested this rewrite -- the main difference is in the FALSE case, check GetLastError() for ERROR_TIMEOUT and then *do not* treat this as a fatal error.
 


/*
 * PyCOND_TIMEDWAIT, in addition to returning negative on error,
 * thus returns 0 on regular success, 1 on timeout
*/
Py_LOCAL_INLINE(int)
PyCOND_TIMEDWAIT(PyCOND_T *cv, PyMUTEX_T *cs, long long us)
{
    BOOL result = SleepConditionVariableSRW(cv, cs, (DWORD)(us / 1000), 0);

    if (result)
        return 0;

    if (GetLastError() == ERROR_TIMEOUT)
        return 1;

    return -1;
}


I've attached the test I ran to reproduce the crash.
msg333044 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-01-05 06:11
There's an existing issue for this somewhere - we've tried a couple times to switch over and run into various issues. I'm not in a place to find it right now, but worth looking.
msg333046 - (view) Author: Jeff Robbins (jeffr@livedata.com) * Date: 2019-01-05 07:59
I did a search and couldn't find exactly this issue. This issue is about a
broken function. It is broken because it treats a timeout as a fatal error
which crashes your Python program.

I supplied a proposed fix for the function.

If there are other known issues or tests, happy to dig in.  Seems a shame
that Python 3 on Windows needs to be running on emulated condition
variables when the OS has (apparently) working actual ones.

Jeff

On Sat, Jan 5, 2019, 1:11 AM Steve Dower <report@bugs.python.org wrote:

>
> Steve Dower <steve.dower@python.org> added the comment:
>
> There's an existing issue for this somewhere - we've tried a couple times
> to switch over and run into various issues. I'm not in a place to find it
> right now, but worth looking.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue35662>
> _______________________________________
>
msg333049 - (view) Author: Jeff Robbins (jeffr@livedata.com) * Date: 2019-01-05 08:46
I searched harder.  :-)

https://bugs.python.org/issue29871

I see that someone else already noticed this broken function, but I guess
it was left broken because of other issues with using condition variables
instead of the emulated ones?

Still, the code is wrong as written...

Jeff

On Sat, Jan 5, 2019, 1:11 AM Steve Dower <report@bugs.python.org wrote:

>
> Steve Dower <steve.dower@python.org> added the comment:
>
> There's an existing issue for this somewhere - we've tried a couple times
> to switch over and run into various issues. I'm not in a place to find it
> right now, but worth looking.
>
> ----------
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <https://bugs.python.org/issue35662>
> _______________________________________
>
msg333725 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-01-15 18:31
It's broken, but unused. And the entire section needs fixing before it can be used, which necessitates fixing this function. So issue 29871 covers this sufficiently (though I'll post a link back to this one for the added context on this particular issue).
msg336256 - (view) Author: Jeff Robbins (jeffr@livedata.com) * Date: 2019-02-21 19:51
Steve, I did some more digging into why the native condition variable approach might be causing problems on Windows.  Testing my fix revealed that there is at least one place in Modules\overlapped.c that either 

a) waits too long to call GetLastError(), or 

b) reveals an expectation that Py_END_ALLOW_THREADS won't change the results of GetLastError().



    Py_BEGIN_ALLOW_THREADS
    ret = GetQueuedCompletionStatus(CompletionPort, &NumberOfBytes,
                                    &CompletionKey, &Overlapped, Milliseconds);
    save_err = GetLastError();
    Py_END_ALLOW_THREADS

    err = ret ? ERROR_SUCCESS : GetLastError();





The problem in this code is that it allows *other* Windows API calls between the original Windows API call (in this case GetQueuedCompletionStatus()) and the call to GetLastError().  If those other Windows API calls change the thread-specific GetLastError state,
the info we need is lost.


To test for this possibility, I added a diagnostic test right after the code above



    if (!ret && (err != save_err)) {
        printf("GetQueuedCompletionStatus returned 0 but we lost the error=%d lost=%d Overlapped=%d\n", save_err, err, (long)Overlapped);
    }




 and ran a test that eventually produced this on the console:



GetQueuedCompletionStatus returned 0 but we lost the error=258 lost=0 Overlapped=0




error 258 is WAIT_TIMEOUT.  The next lines of code are looking for that "error" in order to decide if GetQueuedCompletionStatus failed, or merely timed out.



    if (Overlapped == NULL) {
        if (err == WAIT_TIMEOUT)
            Py_RETURN_NONE;
        else
            return SetFromWindowsErr(err);
    }



So the impact of this problem is severe.   Instead of returning None to the caller (in this case _poll in asyncio\windows_events.py), it will raise an error!



        while True:
            status = _overlapped.GetQueuedCompletionStatus(self._iocp, ms)
            if status is None:
                break




And, to make things extra confusing, the error raised via SetFromWindowsErr(err) (where err == 0) ends up looking like this:



OSError: [WinError 0] The operation completed successfully




A valid WAIT_TIMEOUT thus gets converted to a Python error, but also loses the original Windows Error Code of 258, so you are left scratching your head about how a WinError 0 (ERROR_SUCCESS) could have crashed your call to, say, asyncio.run()? (See traceback below.)


So either we need to make sure that all calls to GetLastError() are made immediately after the relevant Windows API call, without any intervening other Windows API calls, and thereby prevent case a) above, or as in case b), the GIL code (using either emulated or native condition variables from condvar.h) needs to preserve the Error state. 

Some code in Python\thread_nt.h in fact does this already, e.g.



void *
PyThread_get_key_value(int key)
{
    /* because TLS is used in the Py_END_ALLOW_THREAD macro,
     * it is necessary to preserve the windows error state, because
     * it is assumed to be preserved across the call to the macro.
     * Ideally, the macro should be fixed, but it is simpler to
     * do it here.
     */
    DWORD error = GetLastError();
    void *result = TlsGetValue(key);
    SetLastError(error);
    return result;
}



Of course there might be *other* problems associated with using native condition variables on Windows, but this is the only one 
I've experienced after some fairly heavy use of Python 3.7.2 asyncio on Windows.


traceback:
------------------------------------------------------------------------------------------------------------------------
    asyncio.run(self.main())
  File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\base_events.py", line 571, in run_until_complete
    self.run_forever()
  File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\base_events.py", line 539, in run_forever
    self._run_once()
  File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\base_events.py", line 1739, in _run_once
    event_list = self._selector.select(timeout)
  File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\windows_events.py", line 405, in select
    self._poll(timeout)
  File "C:\Users\jeffr\Documents\projects\Python-3.7.2\lib\asyncio\windows_events.py", line 703, in _poll
    status = _overlapped.GetQueuedCompletionStatus(self._iocp, ms)
OSError: [WinError 0] The operation completed successfully
msg336263 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-02-21 21:46
> reveals an expectation that Py_END_ALLOW_THREADS won't change the results of GetLastError()

Fantastic work, Jeff! That's almost certainly the major problem there - Py_END_ALLOW_THREADS can totally change the error code, and we haven't ever done a full sweep to check it.

Feel free to send a PR against issue29871 with your changes. If the tests are happy, then I am too.
msg336266 - (view) Author: Jeff Robbins (jeffr@livedata.com) * Date: 2019-02-21 22:17
Steve, sorry to be dense, but I'm unfortunately ignorant as to what tests I ought to be running.  The only test I have right now is much too complicated, and I'd rather be running some official regression test that reveals the problem without my app code, if possible.
msg336276 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-02-22 00:57
The regular test suite ought to be enough - see devguide.python.org for the info. It was definitely failing in multiprocessing last time I tried this.

You could also just push changes and start a PR, as that will run the tests automatically.
History
Date User Action Args
2022-04-11 14:59:09adminsetgithub: 79843
2019-02-22 13:47:52vstinnersetnosy: - vstinner
2019-02-22 00:57:38steve.dowersetmessages: + msg336276
2019-02-21 22:17:47jeffr@livedata.comsetmessages: + msg336266
2019-02-21 21:46:58steve.dowersetmessages: + msg336263
2019-02-21 19:51:34jeffr@livedata.comsetmessages: + msg336256
2019-01-15 18:31:20steve.dowersetstatus: open -> closed
superseder: Enable optimized locks on Windows
messages: + msg333725

resolution: duplicate
stage: resolved
2019-01-07 14:56:10vstinnersetnosy: + vstinner
2019-01-05 08:46:48jeffr@livedata.comsetmessages: + msg333049
2019-01-05 07:59:13jeffr@livedata.comsetmessages: + msg333046
2019-01-05 06:11:45steve.dowersetmessages: + msg333044
2019-01-05 03:25:25jeffr@livedata.comcreate