Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] KeyboardInterrupt during Thread.join hangs that Thread #66021

Closed
tupl mannequin opened this issue Jun 21, 2014 · 18 comments
Closed

[Windows] KeyboardInterrupt during Thread.join hangs that Thread #66021

tupl mannequin opened this issue Jun 21, 2014 · 18 comments
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@tupl
Copy link
Mannequin

tupl mannequin commented Jun 21, 2014

BPO 21822
Nosy @tim-one, @pfmoore, @pitrou, @vstinner, @tjguk, @zware, @eryksun, @zooba, @maggyero, @gaborbernat
Superseder
  • bpo-45274: Race condition in Thread._wait_for_tstate_lock()
  • Files
  • join.py: Interactive script to reproduce the problem
  • sscce_thread_join_fails_on_keyboardinterrupt.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-09-27.14:11:30.231>
    created_at = <Date 2014-06-21.15:44:35.111>
    labels = ['3.8', '3.9', 'extension-modules', 'interpreter-core', 'type-bug', '3.10', 'library', 'OS-windows']
    title = '[Windows] KeyboardInterrupt during Thread.join hangs that Thread'
    updated_at = <Date 2021-09-27.14:11:30.230>
    user = 'https://bugs.python.org/tupl'

    bugs.python.org fields:

    activity = <Date 2021-09-27.14:11:30.230>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-09-27.14:11:30.231>
    closer = 'vstinner'
    components = ['Extension Modules', 'Interpreter Core', 'Library (Lib)', 'Windows']
    creation = <Date 2014-06-21.15:44:35.111>
    creator = 'tupl'
    dependencies = []
    files = ['35715', '47549']
    hgrepos = []
    issue_num = 21822
    keywords = []
    message_count = 18.0
    messages = ['221180', '223608', '315761', '315762', '315811', '315815', '315966', '315967', '315968', '315969', '315970', '316023', '316024', '316025', '337846', '346191', '402714', '402717']
    nosy_count = 12.0
    nosy_names = ['tim.peters', 'paul.moore', 'pitrou', 'vstinner', 'tim.golden', 'neologix', 'zach.ware', 'eryksun', 'steve.dower', 'pdgoins-work', 'maggyero', 'gaborjbernat']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '45274'
    type = 'behavior'
    url = 'https://bugs.python.org/issue21822'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @tupl
    Copy link
    Mannequin Author

    tupl mannequin commented Jun 21, 2014

    I am attempting to join a thread after a previous join was interrupted by Ctrl-C.

    I did not find any warning for this case in threading module docs, so I assume this is legal.

    The full test script is attached, but the essential code is:

    def work(t):
        sleep(t)
    
    twork = 3; twait = 4
    t = Thread(target=work, args=(twork,))
    try:
    	t.start()
    	t.join(twait)  # here I do Ctrl-C
    except KeyboardInterrupt:
    	pass
    t.join()  # this hangs if twork < twait

    I can observe the following reproduce sequence:

    1. start another thread that sleeps (or works) for T seconds
    2. join that thread with timeout longer than T
    3. before thread finishes and join returns, hit Ctrl-C to raise KeyboardInterrupt (KI)
    4. after T seconds, thread finishes its (Python) code and KI is raised
    5. Process Explorer confirms that thread really terminates (looked at .ident())
    6. thread still reports .is_alive() == True
    7. second attempt to join that thread hangs indefinitely

    I tried replacing try-except clause with custom signal handler for SIGINT, as shown in the script. If the handler does not raise an exception, the thread can be normally joined. If it does, however, the behavior is the same as with default handler.

    My _guess_ is that the exception prevents some finishing code that puts Thread into proper stopped state after its target completes.

    Running Python 3.4.0 on Windows 7 x64

    @tupl tupl mannequin added the stdlib Python modules in the Lib dir label Jun 21, 2014
    @pitrou pitrou added the type-bug An unexpected behavior, bug, or error label Jul 21, 2014
    @pitrou
    Copy link
    Member

    pitrou commented Jul 21, 2014

    This works for me under Linux.

    @pdgoins-work
    Copy link
    Mannequin

    pdgoins-work mannequin commented Apr 25, 2018

    I've confirmed this issue on Windows. Attached is an SSCCE which seems to reliably reproduce the issue in a Windows environment.

    Long story short: if a KeyboardInterrupt occurs during Thread.join(), there's a good chance that Thread._is_stopped never gets set.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 25, 2018

    Thanks for the update. Since I still can't reproduce on Linux, perhaps you (or one of our resident Windows experts) can try to propose a patch fixing the issue?

    @pitrou pitrou added OS-windows 3.7 (EOL) end of life 3.8 only security fixes labels Apr 25, 2018
    @pdgoins-work
    Copy link
    Mannequin

    pdgoins-work mannequin commented Apr 26, 2018

    Not sure if I'll do the full fix (need to check w/ my employer), but I'm doing some investigation. Here's what I know so far:

    At the Python level, the KeyboardInterrupt is being raised within _wait_for_tstate_lock, on "elif lock.acquire(block, timeout)".

    Going into the C code, it looks like this goes through lock_PyThread_acquire_lock -> acquire_timed -> PyThread_acquire_lock_timed. acquire_timed . lock_PyThread_acquire_lock will abort the lock if it receives PY_LOCK_INTR from acquire_timed.

    My best guess right now is that PyThread_acquire_lock_timed never returns PY_LOCK_INTR. Indeed, I see this comment at the top of the NT version of that function:

    /* Fow now, intr_flag does nothing on Windows, and lock acquires are
     * uninterruptible.  */
    

    And indeed, the thread_pthread.h implementations both have a path for returning PY_LOCK_INTR, while the thread_nt.h version does not.

    ...And that's where I am thus far.

    @eryksun
    Copy link
    Contributor

    eryksun commented Apr 26, 2018

    For POSIX systems, try the following test function several times. For the bug to manifest, Thread._wait_for_tstate_lock has to be interrupted in between acquiring and releasing the sentinel lock. Maybe it could use a reentrant lock in order to avoid this problem.

        import os
        import signal
        import threading
        import time
    
        def raise_sigint():
            print('raising SIGINT')
            time.sleep(0.5)
            if os.name == 'nt':
                os.kill(0, signal.CTRL_C_EVENT)
            else:
                os.kill(os.getpid(), signal.SIGINT)
            print('finishing')
    
        def test(f=raise_sigint):
            global g_sigint
            g_sigint = threading.Thread(target=f, name='SIGINT')
            g_sigint.start()
            print('waiting')
            for i in range(100):
                try:
                    if not g_sigint.is_alive():
                        break
                    g_sigint.join(0.01)
                except KeyboardInterrupt:
                    print('KeyboardInterrupt ignored')
            print('g_sigint is alive:', g_sigint.is_alive())

    POSIX-only code normally wouldn't join() with a small timeout in a loop as in the above example. This is a workaround for the problem that's demonstrated in msg221180, in which a signal doesn't interrupt a wait on the main thread. Other than time.sleep(), most waits in Windows CPython have not been updated to include the SIGINT Event object when waiting in the main thread. (It's possible to fix some cases such as waiting on locks, but only as long as the implementation continues to use semaphores and kernel waits instead of native condition variables with SRW locks.)

    @pdgoins-work
    Copy link
    Mannequin

    pdgoins-work mannequin commented Apr 30, 2018

    Focusing on the Windows case specifically... One way to possibly make this work (although perhaps not as clean as may be desired) would be to add polling logic into the thread_nt.h version of PyThread_acquire_lock_timed.

    That would have the benefit of avoiding the complexity of the various "non recursive mutex" implementations (i.e. semaphores vs "emulated" condition variables vs native condition variables) and may be less code than setting up "alertable" WaitForObjectSignleObjectEx calls (plus whatever else needs to be done for the SRW-lock-based code path).

    Thoughts or feedback? (I've not done any mainline Python commits yet so I'm totally ready to be completely wrong or misguided here.)

    @pitrou
    Copy link
    Member

    pitrou commented Apr 30, 2018

    One way to possibly make this work (although perhaps not as clean as may be desired) would be to add polling logic into the thread_nt.h version of PyThread_acquire_lock_timed.

    I think adding polling to such a widely-used primitive is out of question.

    @pdgoins-work
    Copy link
    Mannequin

    pdgoins-work mannequin commented Apr 30, 2018

    I think adding polling to such a widely-used primitive is out of question.

    I'm guessing this is because of the perceived performance impact? (That's the main thought I have against polling in this case.) Or is it something else?

    I can certainly look at tweaking the 3 mutex implementations I mentioned previously, but I do expect that will be a bit more code; I at least wanted to put the "simpler" idea out there. Fully knowing that "simpler" isn't necessarily better.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 30, 2018

    I'm guessing this is because of the perceived performance impact?

    The problem is polling is pretty detrimental to power saving on modern CPUs. There might also be a performance impact that makes things worse :-)

    More generally, we really would like locks to be interruptible under Windows, as it would be useful in many more situations than joining threads. Unfortunately, as you have discovered we have several lock implementations for Windows right now. The SRWLock one is probably difficult to make interruptible, but it's never enabled by default. The Semaphore one looks legacy, so could perhaps be removed. Remains the condition variable-based implementation.

    @pdgoins-work
    Copy link
    Mannequin

    pdgoins-work mannequin commented Apr 30, 2018

    Ahh, thanks for the explanation. I didn't think about that. Let's *not* do that then. :)

    I'll see if I can squeeze in some time to play with the alternatives.

    @pdgoins-work
    Copy link
    Mannequin

    pdgoins-work mannequin commented May 1, 2018

    Okay, I think I need to abandon my research into this. This does seem to have quite an amount of complexity to it and is probably more than I should be taking on at this point in time.

    Anyone else who wants to look at this, consider it fair game.

    Parting thoughts based on my limited expertise in the area, take them or ignore them:

    • Semaphore-based AllocNonRecursiveMutex (!_PY_USE_CV_LOCKS): Using WaitForSingleObjectEx with alertable set to TRUE may be one path forward, however it seems like that would involve tracking all threads performing a ctrl-c-interruptible wait and calling QueueUserAPC with a no-op callback for each such thread to cause the wait to terminate early. I don't particularly like the need to roll-our-own tracking, but at least off-hand and based on my somewhat limited experience in this area, I don't know of a better way. Hopefully someone else does.

    • CriticalSection/Semaphore-based AllocNonRecursiveMutex (_PY_USE_CV_LOCKS with _PY_EMULATED_WIN_CV): I don't know of a way of interrupting the CriticalSection lock, but the WaitForSingleObjectEx strategy may be usable on the semaphore-based condition variable.

    • SRWLock/ConditionVariable-based AllocNonRecursiveMutex (_PY_USE_CV_LOCKS with !_PY_EMULATED_WIN_CV): Similarly, I don't know of a way to interrupt the SRWLock. However, similar to WaitForSingleObjectEx, it may be possible to wake the condition variables via a Ctrl-C if we add our own tracking for such. I'm not sure if this gets us quite to where we need to be or not.

    Hopefully the above notes are of some value.

    @pitrou
    Copy link
    Member

    pitrou commented May 1, 2018

    multiprocessing semaphores support Ctrl-C under Windows, so it should be doable for regular locks as well (notice the sigint_event):
    https://github.com/python/cpython/blob/master/Modules/_multiprocessing/semaphore.c#L109-L146

    @pdgoins-work
    Copy link
    Mannequin

    pdgoins-work mannequin commented May 1, 2018

    Good point, I forgot about WaitForMultipleObjectsEx; something like that seems like it would be much simpler for the first 2 cases.

    @gaborbernat
    Copy link
    Mannequin

    gaborbernat mannequin commented Mar 13, 2019

    I think I'm hitting this with subprocesses inside tox (parallel feature), any plans to fix this?

    @maggyero
    Copy link
    Mannequin

    maggyero mannequin commented Jun 21, 2019

    I am having the same blocked signal issue on Windows when using Thread.join. This program does not print "interrupted" after pressing Ctrl+C:

    import threading
    import time
    
    
    def f():
        while True:
            print("processing")
            time.sleep(1)
    
    
    if __name__ == "__main__":
        try:
            thread = threading.Thread(target=f)
            thread.start()
            thread.join()
        except KeyboardInterrupt:
            print("interrupted")

    For reference, 2 years ago Nathaniel Smith gave an interesting explanation here:
    https://mail.python.org/pipermail/python-dev/2017-August/148800.html.

    @eryksun eryksun added extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.9 only security fixes 3.10 only security fixes and removed 3.7 (EOL) end of life labels Mar 17, 2021
    @vstinner vstinner changed the title KeyboardInterrupt during Thread.join hangs that Thread [Windows] KeyboardInterrupt during Thread.join hangs that Thread Mar 18, 2021
    @vstinner
    Copy link
    Member

    I am having the same blocked signal issue on Windows when using Thread.join. This program does not print "interrupted" after pressing Ctrl+C:

    This is a different issue: bpo-29971. Currently, threading.Lock.acquire() cannot be interrupted by CTRL+C.

    @vstinner
    Copy link
    Member

    I mark this issue as a duplicate of bpo-45274.

    --

    I fixed bpo-45274 with this change:

    New changeset a22be49 by Victor Stinner in branch 'main':
    bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532)
    a22be49

    I tested join.py with the fix. It nows displays:
    ---
    vstinner@DESKTOP-DK7VBIL C:\vstinner\python\main>python x.py
    Running Debug|x64 interpreter...
    started. Press Ctrl-C now
    Ctrl-C [2.99] done=True alive=False
    finish [2.99] done=True alive=False
    Terminate batch job (Y/N)? n
    ---

    The script no longer hangs.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes 3.9 only security fixes 3.10 only security fixes extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants