classification
Title: [Windows] Potential for GIL deadlock on Windows in threadmodule acquire_lock
Type: behavior Stage: resolved
Components: Library (Lib), Windows Versions: Python 3.6, Python 3.4, Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: paul.moore, pitrou, sdeibel, steve.dower, tim.golden, zach.ware
Priority: normal Keywords:

Created on 2017-06-15 13:34 by sdeibel, last changed 2017-06-15 17:30 by sdeibel. This issue is now closed.

Messages (3)
msg296088 - (view) Author: Stephan R.A. Deibel (sdeibel) Date: 2017-06-15 13:34
In acquire_timed in _threadmodule.c (used to implement threading lock acquire()) there is an optimization from https://github.com/python/cpython/commit/e75ff35af2b6c85d48c68b95f295aeac7396b162 that says it first tries non-blocking lock w/o releasing the GIL.  However, it seems that this call can block on Windows.

The call to PyThread_acquire_lock_timed(lock, 0, 0) (w/o releasing GIL) on Windows calls EnterNonRecursiveMutex which in turn, in the _PY_USE_CV_LOCKS impl, immediately calls PyMUTEX_LOCK as if it's not going to block.  However, various implementations of this seem like they can block.  Specifically, the impls of PyMUTEX_LOCK that end up using AcquireSRWLockExclusive and EnterCriticalSection both block.  The only case that is correct (does not block) is when !_PY_USE_CV_LOCKS where EnterNonRecursiveMutex() instead calls WaitForSingleObjectEx() with the zero timeout.

This seems like an incorrect potential for GIL deadlock, because the thread blocks without releasing GIL.  Shouldn't it ultimately be using TryAcquireSRWLockExclusive and TryEnterCriticalSection in these two cases?

This seems like an issue on Windows only.  The pthreads implementations look like they correctly handle timeout of 0 sent to PyThread_acquire_lock_timed().
msg296113 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2017-06-15 16:56
We recently tried switching the GIL implementation on Windows by changing the setting you refer to. It didn't work, so we didn't keep the change, but I don't recall the details - likely the same issue you bring up here.

I don't have the other issue number handy right now, but it may be worth trying to find it. We should be able to switch to the OS implementation of condition variables, and I'd expect performance improvement, but we need to resolve these issues first.
msg296117 - (view) Author: Stephan R.A. Deibel (sdeibel) Date: 2017-06-15 17:30
I think I misunderstood the implementation of EnterNonRecursiveMutex -- the mutex that could block there is the internal 'cs' mutex, which would only be held only briefly while Enter/LeaveNonRecursiveMutex are running, and it looks like the 'cs' mutex is released before doing anything that blocks (in the two impls of PyCOND_WAIT and PyCOND_TIMEDWAIT).

So my report is invalid and I'm closing it.  Sorry about that!
History
Date User Action Args
2017-06-15 17:30:43sdeibelsetstatus: open -> closed
resolution: not a bug
messages: + msg296117

stage: resolved
2017-06-15 16:56:22steve.dowersetmessages: + msg296113
2017-06-15 13:44:58vstinnersettitle: Potential for GIL deadlock on Windows in threadmodule acquire_lock -> [Windows] Potential for GIL deadlock on Windows in threadmodule acquire_lock
2017-06-15 13:44:33vstinnersetnosy: + pitrou
2017-06-15 13:34:17sdeibelcreate