classification
Title: Race condition in Thread._wait_for_tstate_lock()
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.11, Python 3.10, Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, gaborjbernat, miss-islington, pablogsal, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2021-09-23 20:13 by vstinner, last changed 2021-09-27 14:11 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
threading_bug.patch vstinner, 2021-09-23 20:13
threading_bug.py vstinner, 2021-09-23 20:13
Pull Requests
URL Status Linked Edit
PR 28532 merged vstinner, 2021-09-23 20:22
PR 28579 merged miss-islington, 2021-09-27 12:21
PR 28580 merged miss-islington, 2021-09-27 12:21
Messages (12)
msg402523 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-23 20:13
Bernát Gábor found an interesting bug on Windows. Sometimes, when a process is interrupted on Windows with CTRL+C, its parent process hangs in thread.join():
https://twitter.com/gjbernat/status/1440949682759426050

Reproducer:

* Install https://github.com/gaborbernat/tox/tree/2159
* Make an empty folder and put the above tox.ini in it
* Invoke python -m tox and once that command is sleeping press CTRL+C (the app should lock there indefinitely).

tox.ini:
---

[testenv]
skip_install = true
commands = python -c 'import os; print(f"start {os.getpid()}"); import time; time.sleep(100); print("end");'
---

Source: https://gist.github.com/gaborbernat/f1e1aff0f2ee514b50a92a4d019d4d13

I tried to attach the Python process in Python: there is a single thread, the main thread which is blocked in thread.join(). You can also see it in the faulthandler traceback.

I did a long analysis of the _tstate_lock and I checked that thread really completed. Raw debug traces:

== thread 6200 exit ==
 
thread_run[pid=3984, thread_id=6200]: clear
PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete()
release_sentinel[pid=3984, thread_id=6200]: enter
release_sentinel[pid=3984, thread_id=6200]: release(obj=000001C1122669C0, lock=000001C110BBDA00)
release_sentinel[pid=3984, thread_id=6200]: exit
PyThreadState_Clear[pid=3984, thread_id=6200]: on_delete()--
 
== main thread is calling join() but gets a KeyboardInterrupt ==
 
[pid=3984, thread_id=8000] Lock<obj=000001C1122669C0>.acquire() -> ACQUIRED
Current thread 0x00001f40 (most recent call first):
  File "C:\vstinner\python\3.10\lib\threading.py", line 1118 in _wait_for_tstate_lock
  File "C:\vstinner\python\3.10\lib\threading.py", line 1092 in join
  File "C:\vstinner\env\lib\site-packages\tox\session\cmthread_run[pid=3984, thread_id=6200]: exit
d\run\common.py", line 203 in execute
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20 in run_sequential
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104 in legacy
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49 in main
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23 in run
  File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4 in <module>
  File "C:\vstinner\python\3.10\lib\runpy.py", line 86 in _run_code
  File "C:\vstinner\python\3.10\lib\runpy.py", line 196 in _run_module_as_main
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: EXC: KeyboardInterrupt(); acquired? None
 
== main thread calls repr(thread) ==
 
ROOT: [3984] KeyboardInterrupt - teardown started
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=False, timeout=-1): lock obj= 0x1c1122669c0
  File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4, in <module>
    run()
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run
    result = main(sys.argv[1:] if args is None else args)
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main
    result = handler(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy
    return run_sequential(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential
    return execute(state, max_workers=1, has_spinner=False, live=True)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 217, in execute
    print(f'join {thread}')
  File "C:\vstinner\python\3.10\lib\threading.py", line 901, in __repr__
    self.is_alive() # easy way to get ._is_stopped set when appropriate
  File "C:\vstinner\python\3.10\lib\threading.py", line 1181, in is_alive
    self._wait_for_tstate_lock(False)
  File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock
    traceback.print_stack()
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: failed to acquire 0x1c1122669c0
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: exit 0x1c1122669c0
join <Thread(tox-interrupt, started 6200)>
 
== main thread calls thread.join()... which hangs ==
 
_wait_for_tstate_lock[pid=3984, current thread_id=8000, self thread_id=6200]: acquire(block=True, timeout=-1): lock obj= 0x1c1122669c0
  File "C:\vstinner\python\3.10\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\vstinner\python\3.10\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\vstinner\env\lib\site-packages\tox\__main__.py", line 4, in <module>
    run()
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 23, in run
    result = main(sys.argv[1:] if args is None else args)
  File "C:\vstinner\env\lib\site-packages\tox\run.py", line 49, in main
    result = handler(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\legacy.py", line 104, in legacy
    return run_sequential(state)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\sequential.py", line 20, in run_sequential
    return execute(state, max_workers=1, has_spinner=False, live=True)
  File "C:\vstinner\env\lib\site-packages\tox\session\cmd\run\common.py", line 218, in execute
    thread.join()
  File "C:\vstinner\python\3.10\lib\threading.py", line 1092, in join
    self._wait_for_tstate_lock()
  File "C:\vstinner\python\3.10\lib\threading.py", line 1113, in _wait_for_tstate_lock
    traceback.print_stack()


Explanation:

* Context: The main thread is waiting on thread.join()
* The thread 6200 completes: release_sentinel() is called to to release _tstate_lock
* The main thread succeed to acquire _tstate_lock (of the thread 6200) since it was just release
* Ooops oops oops, the main thread gets KeyboardInterrupt in Thread._wait_for_tstate_lock() before being able to release the lock. As if the function if interrupted here:

    def _wait_for_tstate_lock(self, block=True, timeout=-1):
        lock = self._tstate_lock
        if lock is None:
            assert self._is_stopped
        elif lock.acquire(block, timeout):
            # -- got KeyboardInterrupt here ---
            lock.release()
            self._stop()

* (tox does something in the main thread)
* (there are only one remaining thread: the main thread)
* tox catchs KeyboardInterrupt and calls thread.join() again
* thread.join() hangs because the _tstate_lock was already acquire, so lock.acquire() hangs forever => NOT GOOD


You can reproduce the issue on Linux with attached patch and script:

* Apply threading_bug.patch on Python
* Run threading_bug.py
* See that the script hangs forever

Example:
---
$ git apply threading_bug.patch
$ ./python threading_bug.py 
join...
join failed with: KeyboardInterrupt()
join again...
---


I'm now working on a PR to fix the race condition.
msg402524 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-23 20:21
For curious people, see also bpo-44422 "Fix threading.enumerate() reentrant call": another example of race condition recently fixed in the threading module. But it's unrelated to this bug ;-)
msg402532 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-09-23 21:39
I am not sure that it can be solved at Python level.

Possible solutions:

* Add a Lock method (or just a builtin function) which acquires and immediately releases the lock, without possibility to interrupt.

        if lock._blink(block, timeout):
            self._stop()

* Add a context manager which suppresses keyboard interruption.

        with suppress_interrupt():
            if not lock._blink(block, timeout):
                return
        self._stop()
msg402535 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-09-23 21:59
> Add a context manager which suppresses keyboard interruption.

That's not enough technically. This can be any signal handler that raises an exception, no?
msg402540 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-23 23:03
See bpo-21822 from 2014, which I think was the first report of this issue.
msg402542 - (view) Author: gaborjbernat (gaborjbernat) * Date: 2021-09-24 06:56
I just want to note that my maximal reproducer moved to https://github.com/tox-dev/tox/commit/e5d1a439be0790c8104d4caf943b3d82f23a0039 (that has been merged on the branch, so should be now stable).
msg402597 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-24 23:42
> I am not sure that it can be solved at Python level.

Right. In pure Python, we cannot write code which works in all cases. My PR 28532 fix the most common case: application interrupted by a single CTRL+C.

> with suppress_interrupt(): (...)

It's important to be able to interrupt acquire() which can be called in blocking mode with no timeout: it's exactly what tox does, and users expect to be able to interrupt tox in this case.

> * Add a Lock method (or just a builtin function) which acquires and immediately releases the lock, without possibility to interrupt.

The acquire()+release() sequence can be made atomic in C, but it doesn't solve the problem of _stop() which can be interrupted by a second exception.

This bug is likely as old as Python. I don't think that we should attempt to design a perfect solution. I only propose to make the race condition (way) less likely.
msg402704 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-27 12:20
New changeset a22be4943c119fecf5433d999227ff78fc2e5741 by Victor Stinner in branch 'main':
bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532)
https://github.com/python/cpython/commit/a22be4943c119fecf5433d999227ff78fc2e5741
msg402706 - (view) Author: miss-islington (miss-islington) Date: 2021-09-27 12:49
New changeset 1ecb641b887af2feb026a266e2fceedee0815ca8 by Miss Islington (bot) in branch '3.9':
bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532)
https://github.com/python/cpython/commit/1ecb641b887af2feb026a266e2fceedee0815ca8
msg402707 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-27 12:53
New changeset fae2694bea5e9e5a114af8cb40b60e7131a6340c by Miss Islington (bot) in branch '3.10':
bpo-45274: Fix Thread._wait_for_tstate_lock() race condition (GH-28532) (GH-28580)
https://github.com/python/cpython/commit/fae2694bea5e9e5a114af8cb40b60e7131a6340c
msg402710 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-09-27 13:34
This is the same as bpo-21822, so I suppose it should be closed as well with a reference to this issue.
msg402718 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-09-27 14:11
> This is the same as bpo-21822, so I suppose it should be closed as well with a reference to this issue.

You're right. I marked bpo-21822 as a duplicate of this issue.
History
Date User Action Args
2021-09-27 14:11:56vstinnersetmessages: + msg402718
2021-09-27 14:11:30vstinnerlinkissue21822 superseder
2021-09-27 13:34:32eryksunsetmessages: + msg402710
2021-09-27 12:53:47vstinnersetstatus: open -> closed
stage: patch review -> resolved
resolution: fixed
versions: + Python 3.9, Python 3.10
2021-09-27 12:53:37vstinnersetmessages: + msg402707
2021-09-27 12:49:39miss-islingtonsetmessages: + msg402706
2021-09-27 12:21:25miss-islingtonsetpull_requests: + pull_request26963
2021-09-27 12:21:21miss-islingtonsetnosy: + miss-islington

pull_requests: + pull_request26962
stage: patch review
2021-09-27 12:20:35vstinnersetmessages: + msg402704
2021-09-24 23:42:10vstinnersetmessages: + msg402597
2021-09-24 08:16:32eryksunsetnosy: + eryksun
2021-09-24 06:56:23gaborjbernatsetnosy: - eryksun
messages: + msg402542
2021-09-23 23:03:20eryksunsetnosy: + eryksun
messages: + msg402540
2021-09-23 22:40:44gaborjbernatsetnosy: + gaborjbernat
2021-09-23 21:59:53pablogsalsetmessages: + msg402535
2021-09-23 21:39:41serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg402532
2021-09-23 20:27:19vstinnersetnosy: + pablogsal

stage: patch review -> (no value)
2021-09-23 20:22:50vstinnersetstage: patch review
pull_requests: + pull_request26920
2021-09-23 20:21:39vstinnersetmessages: + msg402524
2021-09-23 20:13:16vstinnersetfiles: + threading_bug.py
2021-09-23 20:13:09vstinnercreate