Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in Thread._wait_for_tstate_lock() #89437

Closed
vstinner opened this issue Sep 23, 2021 · 12 comments
Closed

Race condition in Thread._wait_for_tstate_lock() #89437

vstinner opened this issue Sep 23, 2021 · 12 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir

Comments

@vstinner
Copy link
Member

BPO 45274
Nosy @serhiy-storchaka, @eryksun, @pablogsal, @maggyero, @miss-islington, @bensimner, @gaborbernat
PRs
  • bpo-45274: Fix Thread._wait_for_tstate_lock() race condition #28532
  • [3.9] bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532) #28579
  • [3.10] bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532) #28580
  • bpo-45274: Fix one race condition Thread.join #31290
  • Files
  • threading_bug.patch
  • threading_bug.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-09-27.12:53:47.975>
    created_at = <Date 2021-09-23.20:13:09.454>
    labels = ['library', '3.9', '3.10', '3.11']
    title = 'Race condition in Thread._wait_for_tstate_lock()'
    updated_at = <Date 2022-03-12.00:31:02.958>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2022-03-12.00:31:02.958>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-09-27.12:53:47.975>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2021-09-23.20:13:09.454>
    creator = 'vstinner'
    dependencies = []
    files = ['50299', '50300']
    hgrepos = []
    issue_num = 45274
    keywords = ['patch']
    message_count = 12.0
    messages = ['402523', '402524', '402532', '402535', '402540', '402542', '402597', '402704', '402706', '402707', '402710', '402718']
    nosy_count = 7.0
    nosy_names = ['serhiy.storchaka', 'eryksun', 'pablogsal', 'maggyero', 'miss-islington', 'bjs', 'gaborjbernat']
    pr_nums = ['28532', '28579', '28580', '31290']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue45274'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @vstinner
    Copy link
    Member Author

    Bernát Gábor found an interesting bug on Windows. Sometimes, when a process is interrupted on Windows with CTRL+C, its parent process hangs in thread.join():
    https://twitter.com/gjbernat/status/1440949682759426050

    Reproducer:

    tox.ini:
    ---

    [testenv]
    skip_install = true
    commands = python -c 'import os; print(f"start {os.getpid()}"); import time; time.sleep(100); print("end");'
    ---

    Source: https://gist.github.com/gaborbernat/f1e1aff0f2ee514b50a92a4d019d4d13

    I tried to attach the Python process in Python: there is a single thread, the main thread which is blocked in thread.join(). You can also see it in the faulthandler traceback.

    I did a long analysis of the _tstate_lock and I checked that thread really completed. Raw debug traces:

    == thread 6200 exit ==

    thread_run[pid=3984, thread_id=6200]: clear
    PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete()
    release_sentinel[pid=3984, thread_id=6200]: enter
    release_sentinel[pid=3984, thread_id=6200]: release(obj=000001C1122669C0, lock=000001C110BBDA00)
    release_sentinel[pid=3984, thread_id=6200]: exit
    PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete()--

    == main thread is calling join() but gets a KeyboardInterrupt ==

    [pid=3984, thread_id=8000] Lock<obj=000001C1122669C0>.acquire() -> ACQUIRED
    Current thread 0x00001f40 (most recent call first):
    File "C:\vstinner\python\3.10\lib\threading.py", line 1118 in _wait_for_tstate_lock
    File "C:\vstinner\python\3.10\lib\threading.py", line 1092 in join
    File "C:\vstinner\env\lib\site-packages\tox\session\cmthread_run[pid=3984, thread_id=6200]: exit
    d\run\common.py", line 203 in execute
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20 in run_sequential
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104 in legacy
    File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49 in main
    File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23 in run
    File "C:\vstinner\env\lib\site-packages\tox\main.py", line 4 in <module>
    File "C:\vstinner\python\3.10\lib\runpy.py", line 86 in _run_code
    File "C:\vstinner\python\3.10\lib\runpy.py", line 196 in _run_module_as_main
    _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: EXC: KeyboardInterrupt(); acquired? None

    == main thread calls repr(thread) ==

    ROOT: [3984] KeyboardInterrupt - teardown started
    _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=False, timeout=-1): lock obj= 0x1c1122669c0
    File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
    File "C:\vstinner\env\lib\site-packages\tox\main.py", line 4, in <module>
    run()
    File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run
    result = main(sys.argv[1:] if args is None else args)
    File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main
    result = handler(state)
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy
    return run_sequential(state)
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential
    return execute(state, max_workers=1, has_spinner=False, live=True)
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 217, in execute
    print(f'join {thread}')
    File "C:\vstinner\python\3.10\lib\threading.py", line 901, in __repr__
    self.is_alive() # easy way to get ._is_stopped set when appropriate
    File "C:\vstinner\python\3.10\lib\threading.py", line 1181, in is_alive
    self._wait_for_tstate_lock(False)
    File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock
    traceback.print_stack()
    _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: failed to acquire 0x1c1122669c0
    _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: exit 0x1c1122669c0
    join <Thread(tox-interrupt, started 6200)>

    == main thread calls thread.join()... which hangs ==

    _wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=True, timeout=-1): lock obj= 0x1c1122669c0
    File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
    File "C:\vstinner\env\lib\site-packages\tox\main.py", line 4, in <module>
    run()
    File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run
    result = main(sys.argv[1:] if args is None else args)
    File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main
    result = handler(state)
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy
    return run_sequential(state)
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential
    return execute(state, max_workers=1, has_spinner=False, live=True)
    File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 218, in execute
    thread.join()
    File "C:\vstinner\python\3.10\lib\threading.py", line 1092, in join
    self._wait_for_tstate_lock()
    File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock
    traceback.print_stack()

    Explanation:

    • Context: The main thread is waiting on thread.join()
    • The thread 6200 completes: release_sentinel() is called to to release _tstate_lock
    • The main thread succeed to acquire _tstate_lock (of the thread 6200) since it was just release
    • Ooops oops oops, the main thread gets KeyboardInterrupt in Thread._wait_for_tstate_lock() before being able to release the lock. As if the function if interrupted here:
        def _wait_for_tstate_lock(self, block=True, timeout=-1):
            lock = self._tstate_lock
            if lock is None:
                assert self._is_stopped
            elif lock.acquire(block, timeout):
                # -- got KeyboardInterrupt here ---
                lock.release()
                self._stop()
    • (tox does something in the main thread)
    • (there are only one remaining thread: the main thread)
    • tox catchs KeyboardInterrupt and calls thread.join() again
    • thread.join() hangs because the _tstate_lock was already acquire, so lock.acquire() hangs forever => NOT GOOD

    You can reproduce the issue on Linux with attached patch and script:

    • Apply threading_bug.patch on Python
    • Run threading_bug.py
    • See that the script hangs forever

    Example:
    ---

    $ git apply threading_bug.patch
    $ ./python threading_bug.py 
    join...
    join failed with: KeyboardInterrupt()
    join again...

    I'm now working on a PR to fix the race condition.

    @vstinner vstinner added 3.11 only security fixes stdlib Python modules in the Lib dir labels Sep 23, 2021
    @vstinner
    Copy link
    Member Author

    For curious people, see also bpo-44422 "Fix threading.enumerate() reentrant call": another example of race condition recently fixed in the threading module. But it's unrelated to this bug ;-)

    @serhiy-storchaka
    Copy link
    Member

    I am not sure that it can be solved at Python level.

    Possible solutions:

    • Add a Lock method (or just a builtin function) which acquires and immediately releases the lock, without possibility to interrupt.
            if lock._blink(block, timeout):
                self._stop()
    • Add a context manager which suppresses keyboard interruption.
            with suppress_interrupt():
                if not lock._blink(block, timeout):
                    return
            self._stop()

    @pablogsal
    Copy link
    Member

    Add a context manager which suppresses keyboard interruption.

    That's not enough technically. This can be any signal handler that raises an exception, no?

    @eryksun
    Copy link
    Contributor

    eryksun commented Sep 23, 2021

    See bpo-21822 from 2014, which I think was the first report of this issue.

    @gaborbernat
    Copy link
    Mannequin

    gaborbernat mannequin commented Sep 24, 2021

    I just want to note that my maximal reproducer moved to tox-dev/tox@e5d1a43 (that has been merged on the branch, so should be now stable).

    @vstinner
    Copy link
    Member Author

    I am not sure that it can be solved at Python level.

    Right. In pure Python, we cannot write code which works in all cases. My PR 28532 fix the most common case: application interrupted by a single CTRL+C.

    with suppress_interrupt(): (...)

    It's important to be able to interrupt acquire() which can be called in blocking mode with no timeout: it's exactly what tox does, and users expect to be able to interrupt tox in this case.

    • Add a Lock method (or just a builtin function) which acquires and immediately releases the lock, without possibility to interrupt.

    The acquire()+release() sequence can be made atomic in C, but it doesn't solve the problem of _stop() which can be interrupted by a second exception.

    This bug is likely as old as Python. I don't think that we should attempt to design a perfect solution. I only propose to make the race condition (way) less likely.

    @vstinner
    Copy link
    Member Author

    New changeset a22be49 by Victor Stinner in branch 'main':
    bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532)
    a22be49

    @miss-islington
    Copy link
    Contributor

    New changeset 1ecb641 by Miss Islington (bot) in branch '3.9':
    bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532)
    1ecb641

    @vstinner
    Copy link
    Member Author

    New changeset fae2694 by Miss Islington (bot) in branch '3.10':
    bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532) (GH-28580)
    fae2694

    @vstinner vstinner added 3.9 only security fixes 3.10 only security fixes labels Sep 27, 2021
    @vstinner vstinner added 3.9 only security fixes 3.10 only security fixes labels Sep 27, 2021
    @eryksun
    Copy link
    Contributor

    eryksun commented Sep 27, 2021

    This is the same as bpo-21822, so I suppose it should be closed as well with a reference to this issue.

    @vstinner
    Copy link
    Member Author

    This is the same as bpo-21822, so I suppose it should be closed as well with a reference to this issue.

    You're right. I marked bpo-21822 as a duplicate of this issue.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes 3.10 only security fixes 3.11 only security fixes stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants