Title: KeyboardInterrupt during Thread.join hangs that Thread
Type: behavior Stage:
Components: Library (Lib), Windows Versions: Python 3.8, Python 3.7, Python 3.6
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, gaborbernat, neologix, paul.moore, pdgoins-work, pitrou, steve.dower, tim.golden, tim.peters, tupl, zach.ware
Priority: normal Keywords:

Created on 2014-06-21 15:44 by tupl, last changed 2019-03-13 12:33 by gaborbernat.

File name Uploaded Description Edit tupl, 2014-06-21 15:44 Interactive script to reproduce the problem pdgoins-work, 2018-04-25 22:43
Messages (15)
msg221180 - (view) Author: Steve (tupl) Date: 2014-06-21 15:44
I am attempting to join a thread after a previous join was interrupted by Ctrl-C.

I did not find any warning for this case in threading module docs, so I assume this is legal.

The full test script is attached, but the essential code is:

def work(t):

twork = 3; twait = 4
t = Thread(target=work, args=(twork,))
	t.join(twait)  # here I do Ctrl-C
except KeyboardInterrupt:
t.join()  # this hangs if twork < twait

I can observe the following reproduce sequence:

1. start another thread that sleeps (or works) for T seconds
2. join that thread with timeout longer than T
3. before thread finishes and join returns, hit Ctrl-C to raise KeyboardInterrupt (KI)
4. after T seconds, thread finishes its (Python) code and KI is raised
5. Process Explorer confirms that thread really terminates (looked at .ident())
6. thread still reports .is_alive() == True
7. second attempt to join that thread hangs indefinitely

I tried replacing try-except clause with custom signal handler for SIGINT, as shown in the script. If the handler does not raise an exception, the thread can be normally joined. If it does, however, the behavior is the same as with default handler.

My _guess_ is that the exception prevents some finishing code that puts Thread into proper stopped state after its target completes.

Running Python 3.4.0 on Windows 7 x64
msg223608 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-07-21 20:22
This works for me under Linux.
msg315761 - (view) Author: Paul Goins (pdgoins-work) Date: 2018-04-25 22:43
I've confirmed this issue on Windows.  Attached is an SSCCE which seems to reliably reproduce the issue in a Windows environment.

Long story short: if a KeyboardInterrupt occurs during Thread.join(), there's a good chance that Thread._is_stopped never gets set.
msg315762 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-04-25 22:49
Thanks for the update.  Since I still can't reproduce on Linux, perhaps you (or one of our resident Windows experts) can try to propose a patch fixing the issue?
msg315811 - (view) Author: Paul Goins (pdgoins-work) Date: 2018-04-26 19:04
Not sure if I'll do the full fix (need to check w/ my employer), but I'm doing some investigation.  Here's what I know so far:

At the Python level, the KeyboardInterrupt is being raised within _wait_for_tstate_lock, on "elif lock.acquire(block, timeout)".

Going into the C code, it looks like this goes through lock_PyThread_acquire_lock -> acquire_timed -> PyThread_acquire_lock_timed.  acquire_timed .  lock_PyThread_acquire_lock will abort the lock if it receives PY_LOCK_INTR from acquire_timed.

My best guess right now is that PyThread_acquire_lock_timed never returns PY_LOCK_INTR.  Indeed, I see this comment at the top of the NT version of that function:

    /* Fow now, intr_flag does nothing on Windows, and lock acquires are
     * uninterruptible.  */

And indeed, the thread_pthread.h implementations both have a path for returning PY_LOCK_INTR, while the thread_nt.h version does not.

...And that's where I am thus far.
msg315815 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-04-26 19:32
For POSIX systems, try the following test function several times. For the bug to manifest, Thread._wait_for_tstate_lock has to be interrupted in between acquiring and releasing the sentinel lock. Maybe it could use a reentrant lock in order to avoid this problem.

    import os
    import signal
    import threading
    import time

    def raise_sigint():
        print('raising SIGINT')
        if == 'nt':
            os.kill(0, signal.CTRL_C_EVENT)
            os.kill(os.getpid(), signal.SIGINT)

    def test(f=raise_sigint):
        global g_sigint
        g_sigint = threading.Thread(target=f, name='SIGINT')
        for i in range(100):
                if not g_sigint.is_alive():
            except KeyboardInterrupt:
                print('KeyboardInterrupt ignored')
        print('g_sigint is alive:', g_sigint.is_alive())

POSIX-only code normally wouldn't join() with a small timeout in a loop as in the above example. This is a workaround for the problem that's demonstrated in msg221180, in which a signal doesn't interrupt a wait on the main thread. Other than time.sleep(), most waits in Windows CPython have not been updated to include the SIGINT Event object when waiting in the main thread. (It's possible to fix some cases such as waiting on locks, but only as long as the implementation continues to use semaphores and kernel waits instead of native condition variables with SRW locks.)
msg315966 - (view) Author: Paul Goins (pdgoins-work) Date: 2018-04-30 21:16
Focusing on the Windows case specifically...  One way to possibly make this work (although perhaps not as clean as may be desired) would be to add polling logic into the thread_nt.h version of PyThread_acquire_lock_timed.

That would have the benefit of avoiding the complexity of the various "non recursive mutex" implementations (i.e. semaphores vs "emulated" condition variables vs native condition variables) and may be less code than setting up "alertable" WaitForObjectSignleObjectEx calls (plus whatever else needs to be done for the SRW-lock-based code path).

Thoughts or feedback?  (I've not done any mainline Python commits yet so I'm totally ready to be completely wrong or misguided here.)
msg315967 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-04-30 21:21
> One way to possibly make this work (although perhaps not as clean as may be desired) would be to add polling logic into the thread_nt.h version of PyThread_acquire_lock_timed.

I think adding polling to such a widely-used primitive is out of question.
msg315968 - (view) Author: Paul Goins (pdgoins-work) Date: 2018-04-30 21:32
> I think adding polling to such a widely-used primitive is out of question.

I'm guessing this is because of the perceived performance impact?  (That's the main thought I have against polling in this case.)  Or is it something else?

I can certainly look at tweaking the 3 mutex implementations I mentioned previously, but I do expect that will be a bit more code; I at least wanted to put the "simpler" idea out there.  Fully knowing that "simpler" isn't necessarily better.
msg315969 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-04-30 21:39
> I'm guessing this is because of the perceived performance impact?

The problem is polling is pretty detrimental to power saving on modern CPUs.  There might also be a performance impact that makes things worse :-)

More generally, we really would like locks to be interruptible under Windows, as it would be useful in many more situations than joining threads. Unfortunately, as you have discovered we have several lock implementations for Windows right now.  The SRWLock one is probably difficult to make interruptible, but it's never enabled by default.  The Semaphore one looks legacy, so could perhaps be removed.  Remains the condition variable-based implementation.
msg315970 - (view) Author: Paul Goins (pdgoins-work) Date: 2018-04-30 21:47
Ahh, thanks for the explanation.  I didn't think about that.  Let's *not* do that then. :)

I'll see if I can squeeze in some time to play with the alternatives.
msg316023 - (view) Author: Paul Goins (pdgoins-work) Date: 2018-05-01 21:50
Okay, I think I need to abandon my research into this.  This does seem to have quite an amount of complexity to it and is probably more than I should be taking on at this point in time.

Anyone else who wants to look at this, consider it fair game.

Parting thoughts based on my limited expertise in the area, take them or ignore them:

* Semaphore-based AllocNonRecursiveMutex (!_PY_USE_CV_LOCKS): Using WaitForSingleObjectEx with alertable set to TRUE may be one path forward, however it seems like that would involve tracking all threads performing a ctrl-c-interruptible wait and calling QueueUserAPC with a no-op callback for each such thread to cause the wait to terminate early.  I don't particularly like the need to roll-our-own tracking, but at least off-hand and based on my somewhat limited experience in this area, I don't know of a better way.  Hopefully someone else does.

* CriticalSection/Semaphore-based AllocNonRecursiveMutex (_PY_USE_CV_LOCKS with _PY_EMULATED_WIN_CV): I don't know of a way of interrupting the CriticalSection lock, but the WaitForSingleObjectEx strategy may be usable on the semaphore-based condition variable.

* SRWLock/ConditionVariable-based AllocNonRecursiveMutex (_PY_USE_CV_LOCKS with !_PY_EMULATED_WIN_CV): Similarly, I don't know of a way to interrupt the SRWLock.  However, similar to WaitForSingleObjectEx, it may be possible to wake the condition variables via a Ctrl-C if we add our own tracking for such.  I'm not sure if this gets us quite to where we need to be or not.

Hopefully the above notes are of some value.
msg316024 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2018-05-01 22:08
multiprocessing semaphores support Ctrl-C under Windows, so it should be doable for regular locks as well (notice the `sigint_event`):
msg316025 - (view) Author: Paul Goins (pdgoins-work) Date: 2018-05-01 22:24
Good point, I forgot about WaitForMultipleObjectsEx; something like that seems like it would be much simpler for the first 2 cases.
msg337846 - (view) Author: gaborbernat (gaborbernat) Date: 2019-03-13 12:33
I think I'm hitting this with subprocesses inside tox (parallel feature), any plans to fix this?
Date User Action Args
2019-03-13 12:33:07gaborbernatsetnosy: + gaborbernat
messages: + msg337846
2018-05-02 10:02:23vstinnersetnosy: - vstinner
2018-05-01 22:24:28pdgoins-worksetmessages: + msg316025
2018-05-01 22:08:40pitrousetmessages: + msg316024
2018-05-01 21:50:41pdgoins-worksetmessages: + msg316023
2018-04-30 21:47:24pdgoins-worksetmessages: + msg315970
2018-04-30 21:39:14pitrousetmessages: + msg315969
2018-04-30 21:32:42pdgoins-worksetmessages: + msg315968
2018-04-30 21:21:49pitrousetmessages: + msg315967
2018-04-30 21:16:25pdgoins-worksetmessages: + msg315966
2018-04-26 19:32:52eryksunsetnosy: + eryksun
messages: + msg315815
2018-04-26 19:04:40pdgoins-worksetmessages: + msg315811
2018-04-25 22:49:32pitrousetversions: + Python 3.7, Python 3.8, - Python 3.4, Python 3.5
nosy: + paul.moore, tim.golden, zach.ware, steve.dower

messages: + msg315762

components: + Windows
2018-04-25 22:43:47pdgoins-worksetfiles: +
versions: + Python 3.6
nosy: + pdgoins-work

messages: + msg315761
2014-07-21 20:22:56pitrousetmessages: + msg223608
2014-07-21 20:17:37pitrousetnosy: + tim.peters, pitrou, neologix

type: behavior
versions: + Python 3.5
2014-07-21 16:38:36vstinnersetnosy: + vstinner
2014-06-21 15:44:35tuplcreate