classification
Title: threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly
Type: Stage: resolved
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eric.snow, koobs, kristjan.jonsson, pablogsal, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2019-03-22 23:18 by vstinner, last changed 2019-08-09 11:08 by kristjan.jonsson. This issue is now closed.

Files
File name Uploaded Description Edit
py_finalize.patch vstinner, 2019-06-10 23:10
Pull Requests
URL Status Linked Edit
PR 13889 closed vstinner, 2019-06-07 10:53
PR 13946 closed vstinner, 2019-06-10 20:45
PR 13948 merged vstinner, 2019-06-10 22:39
PR 13949 closed vstinner, 2019-06-10 23:04
PR 14033 closed miss-islington, 2019-06-12 23:30
PR 14034 closed miss-islington, 2019-06-12 23:30
PR 14047 merged vstinner, 2019-06-13 08:58
PR 14050 merged vstinner, 2019-06-13 10:11
PR 14054 merged miss-islington, 2019-06-13 11:45
Messages (16)
msg338622 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-22 23:18
https://buildbot.python.org/all/#/builders/168/builds/801

0:23:17 load avg: 5.00 [334/420/1] test_threading crashed (Exit code -6) -- running: test_tools (8 min 42 sec), test_multiprocessing_spawn (5 min 41 sec), test_zipfile (30 sec 787 ms)
Fatal Python error: Py_EndInterpreter: not the last thread

Current thread 0x0000000800acd000 (most recent call first):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/__init__.py", line 2778 in run_in_subinterp
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test_threading.py", line 917 in test_threads_join_2
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 642 in run
  ...

The test crashed once, but then passed when run again in verbose mode ("Re-running test 'test_threading' in verbose mode").
msg338623 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-22 23:20
As far as I remember, test_threads_join_2() was already unstable.

I created this issue to try to track if it's a regression or not.

If it's a regression, I would suggest to have a look at Eric Snow's recent commits.

At this point, I simply have no idea if the test fails exactly one in the lifetime of the buildbot worker, or if it started to fail frequently on this FreeBSD buildbot.
msg344611 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-04 17:36
New failure:

AMD64 FreeBSD CURRENT Shared 3.x:
https://buildbot.python.org/all/#/builders/168/builds/1154

...
test_thread_leak (test.test_threading.PyRLockTests) ... ok
test_timeout (test.test_threading.PyRLockTests) ... ok
test_try_acquire (test.test_threading.PyRLockTests) ... ok
test_try_acquire_contended (test.test_threading.PyRLockTests) ... ok
test_weakref_deleted (test.test_threading.PyRLockTests) ... ok
test_weakref_exists (test.test_threading.PyRLockTests) ... ok
test_with (test.test_threading.PyRLockTests) ... ok
test_acquire (test.test_threading.SemaphoreTests) ... ok
test_acquire_contended (test.test_threading.SemaphoreTests) ... ok
test_acquire_destroy (test.test_threading.SemaphoreTests) ... ok
test_acquire_timeout (test.test_threading.SemaphoreTests) ... ok
test_constructor (test.test_threading.SemaphoreTests) ... ok
test_default_value (test.test_threading.SemaphoreTests) ... ok
test_release_unacquired (test.test_threading.SemaphoreTests) ... ok
test_try_acquire (test.test_threading.SemaphoreTests) ... ok
test_try_acquire_contended (test.test_threading.SemaphoreTests) ... ok
test_with (test.test_threading.SemaphoreTests) ... ok
test_daemon_threads_fatal_error (test.test_threading.SubinterpThreadingTests) ... ok
test_threads_join (test.test_threading.SubinterpThreadingTests) ... ok
Fatal Python error: Py_EndInterpreter: not the last thread

Current thread 0x0000000800ac3000 (most recent call first):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/__init__.py", line 2794 in run_in_subinterp
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test_threading.py", line 923 in test_threads_join_2
test_threads_join_2 (test.test_threading.SubinterpThreadingTests) ... *** Signal 6

Stop.
make: stopped in /usr/home/buildbot/python/3.x.koobs-freebsd-current/build
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 628 in _callTestMethod
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 671 in run
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 731 in __call__
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 122 in run
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 122 in run
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 122 in run
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 84 in __call__
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/runner.py", line 176 in run
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/__init__.py", line 1984 in _run_suite
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/__init__.py", line 2080 in run_unittest
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 203 in _test_module
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 228 in _runtest_inner2
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 264 in _runtest_inner
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 148 in _runtest
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 187 in runtest
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 291 in rerun_failed_tests
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 646 in _main
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 588 in main
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 663 in main
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/__main__.py", line 2 in <module>
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/runpy.py", line 85 in _run_code
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/runpy.py", line 192 in _run_module_as_main
program finished with exit code 1
elapsedTime=3523.834729
msg344613 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-04 17:39
In the same build:
https://buildbot.python.org/all/#builders/168/builds/1154

0:28:57 load avg: 12.93 [208/423/1] test_threading crashed (Exit code 1) -- running: test_lib2to3 (7 min 9 sec), test_multiprocessing_spawn (1 min 36 sec)
Timeout (0:25:00)!
Thread 0x0000000800ac3000 (most recent call first):
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 1015 in _wait_for_tstate_lock
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 999 in join
  File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test_threading.py", line 1287 in test_interrupt_main_subthread
  (...)
msg344922 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-07 11:00
I wrote PR 13889: with this change, I can easily reproduce the crash on Linux:

$ ./python -m test test_threading -m test_threads_join_2 -F
Run tests sequentially
0:00:00 load avg: 0.51 [  1] test_threading
Fatal Python error: Py_EndInterpreter: not the last thread

Current thread 0x00007f84ad74d740 (most recent call first):
  File "/home/vstinner/prog/python/master/Lib/test/support/__init__.py", line 2794 in run_in_subinterp
  File "/home/vstinner/prog/python/master/Lib/test/test_threading.py", line 923 in test_threads_join_2
...

Py_EndInterpreter() calls wait_for_thread_shutdown() to wait until threading._shutdown() completes.

When the assertion fails, threading.enumerate() only contains the main thread: the spawned thread is already gone. But the assertion fails, which means that the Python thread state of the thread (which looks to be completed) is still around.

This unit test comes from bpo-18808:

commit 7b4769937fb612d576b6829c3b834f3dd31752f1
Author: Antoine Pitrou <solipsis@pitrou.net>
Date:   Sat Sep 7 23:38:37 2013 +0200

    Issue #18808: Thread.join() now waits for the underlying thread state to be destroyed before returning.
    This prevents unpredictable aborts in Py_EndInterpreter() when some non-daemon threads are still running.
msg345143 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-10 20:59
Oh. Using PR 13889, I'm able to reproduce the bug up to Python 3.4. Example at commit commit e76cbc781044ee01b059f3702c580e66266b84c5 (tag: v3.4.10):

$ wget 'https://github.com/python/cpython/pull/13889.patch'
$ git apply 13889.patch
$ ./python -m test  -F -m test_threads_join_2  test_threading
[  1] test_threading
[  2] test_threading
(...)
[ 10] test_threading
[ 11] test_threading
Fatal Python error: Py_EndInterpreter: not the last thread

Current thread 0x00007f418b3280c0 (most recent call first):
  File "/home/vstinner/prog/python/3.4/Lib/test/support/__init__.py", line 2304 in run_in_subinterp
  File "/home/vstinner/prog/python/3.4/Lib/test/test_threading.py", line 877 in test_threads_join_2
  (...)
Aborted (core dumped)
msg345146 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-10 21:57
threading._shutdown() uses threading.enumerate() which iterations on threading._active.

threading.Thread registers itself into threading._active using its _bootstrap_inner() method. It unregisters itself when _bootstrap_inner() completes, whereas its is_alive() method still returns true: since the underlying native thread still runs and the Python thread state still exists.

_thread._set_sentinel() creates a lock and registers a tstate->on_delete callback to release this lock. It's called by threading.Thread._set_tstate_lock() to set threading.Thread._tstate_lock. This lock is used by threading.Thread.join() to wait until the thread completes.


_thread.start_new_thread() calls the C function t_bootstrap() which ends with:

    tstate->interp->num_threads--;
    PyThreadState_Clear(tstate);
    PyThreadState_DeleteCurrent();
    PyThread_exit_thread();

_PyThreadState_DeleteCurrent() calls tstate->on_delete() which releases threading.Thread._tstate_lock lock.

In test_threads_join_2() test, PyThreadState_Clear() blocks on clearing thread variables: the Sleeper destructor of the Sleeper instance sleeps.

The race condition is that:

* threading._shutdown() rely on threading._alive
* Py_EndInterpreter() rely on the interpreter linked list of Python thread states: interp->tstate_head.
msg345149 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-10 22:52
Other references to test_threads_join_2() failures:

* bpo-36989: fails once AIX
* bpo-27791: 2 failures in 2016 in "AMD64 FreeBSD 10 Shared 3.x" buildbots (I closed the issue with: "Sadly, such bug is hard to reproduce. I didn't look at buildbots recently, and I don't have access to FreeBSD, so I just close the issue :-/")
* bpo-28084: once failure on "AMD64 FreeBSD 9.x 3.x" buildbot (closed as a duplicate of bpo-27791)

See also bpo-18808: "Thread.join returns before PyThreadState is destroyed" (issue which added the test).
msg345150 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-10 23:10
test_threading: test_threads_join_2() was added by commit 7b47699 in 2013, but the test failed randomly since it was added. It's just that failures were ignored until I created https://bugs.python.org/issue36402 last March.

In fact, when the test failed randomly on buildbot (with tests run in parallel), it was fine since test_threading was re-run alone and then the test passed. The buildbot build was seen overall as a success. Previous issues were closed (see my previous comment).

The test shows the bug using subinterpreters (Py_EndInterpreter), but the bug also exists in Py_Finalize() which hash the same race condition (it also calls threading._shutdown()). It's just that Py_EndInterpreter() is stricter, it contains this assertion:

    if (tstate != interp->tstate_head || tstate->next != NULL)
        Py_FatalError("Py_EndInterpreter: not the last thread");

Attached py_finalize.patch adds the same assertion to Py_Finalize.

I added test_threading.test_finalization_shutdown() to PR 13948. If you run test_finalization_shutdown() with py_finalize.patch, Py_Finalize() fails with a similar assertion error.

But py_finalize.patch is incompatible with the principle of daemon threads and so cannot be commited.
msg345429 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-12 23:30
New changeset 468e5fec8a2f534f1685d59da3ca4fad425c38dd by Victor Stinner in branch 'master':
bpo-36402: Fix threading._shutdown() race condition (GH-13948)
https://github.com/python/cpython/commit/468e5fec8a2f534f1685d59da3ca4fad425c38dd
msg345492 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-13 10:06
New changeset 6f75c873752a16a7ad8f35855b1e29f59d048e84 by Victor Stinner in branch 'master':
tbpo-36402: Fix threading.Thread._stop() (GH-14047)
https://github.com/python/cpython/commit/6f75c873752a16a7ad8f35855b1e29f59d048e84
msg345502 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-13 11:44
New changeset e40a97a721d46307dfdc2b0322028ccded6eb571 by Victor Stinner in branch '3.8':
[3.8] bpo-36402: Fix threading._shutdown() race condition (GH-13948) (GH-14050)
https://github.com/python/cpython/commit/e40a97a721d46307dfdc2b0322028ccded6eb571
msg345508 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-13 12:22
New changeset 6eb2878e42152e9c45d7ee5e6f889532d753e67c by Victor Stinner (Miss Islington (bot)) in branch '3.7':
bpo-36402: Fix threading._shutdown() race condition (GH-13948) (GH-14050) (GH-14054)
https://github.com/python/cpython/commit/6eb2878e42152e9c45d7ee5e6f889532d753e67c
msg345509 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-13 12:27
The bpo-18808 "Thread.join returns before PyThreadState is destroyed" was not fixed in Python 2.7: threading.Thread has no _tstate_lock attribute. I'm not comfortable to backport bpo-18808 "feature" or "bugfix" to Python 2.7, not to backport this change.

Python 2.7 works as it is, and it's going to reach its end of life at the end of the year. I guess that people learnt how to work around Python 2.7 limitation like bpo-18808.
msg345510 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-06-13 12:28
Ok, the root issue (threading._shutdown() race condition) has been fixed in Python 3.7, 3.8 and master branches. I close the issue. Thanks for the reviews!
msg349281 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2019-08-09 11:08
Please note that this fix appears to be the cause of #37788
History
Date User Action Args
2019-08-09 11:08:52kristjan.jonssonsetnosy: + kristjan.jonsson
messages: + msg349281
2019-06-13 12:28:09vstinnersetstatus: open -> closed
messages: + msg345510

components: + Library (Lib), - Tests
resolution: fixed
stage: patch review -> resolved
2019-06-13 12:27:13vstinnersetmessages: + msg345509
2019-06-13 12:22:25vstinnersetmessages: + msg345508
2019-06-13 11:45:00miss-islingtonsetpull_requests: + pull_request13916
2019-06-13 11:44:29vstinnersetmessages: + msg345502
2019-06-13 10:11:05vstinnersetpull_requests: + pull_request13912
2019-06-13 10:06:30vstinnersetmessages: + msg345492
2019-06-13 08:58:12vstinnersetpull_requests: + pull_request13909
2019-06-12 23:31:59vstinnersetversions: + Python 3.7, Python 3.9
2019-06-12 23:30:37miss-islingtonsetpull_requests: + pull_request13898
2019-06-12 23:30:28miss-islingtonsetpull_requests: + pull_request13897
2019-06-12 23:30:20vstinnersetmessages: + msg345429
2019-06-11 01:06:28koobssetnosy: + koobs
2019-06-10 23:12:29vstinnersettitle: test_threading: test_threads_join_2() failed with "Fatal Python error: Py_EndInterpreter: not the last thread" on AMD64 FreeBSD CURRENT Shared 3.x -> threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly
2019-06-10 23:10:50vstinnersetfiles: + py_finalize.patch

messages: + msg345150
2019-06-10 23:04:33vstinnersetpull_requests: + pull_request13817
2019-06-10 22:52:15vstinnersetmessages: + msg345149
2019-06-10 22:39:36vstinnersetpull_requests: + pull_request13816
2019-06-10 21:57:27vstinnersetmessages: + msg345146
2019-06-10 20:59:51vstinnersetmessages: + msg345143
2019-06-10 20:45:58vstinnersetpull_requests: + pull_request13813
2019-06-07 11:00:36vstinnersetnosy: + pitrou, eric.snow
2019-06-07 11:00:19vstinnersetmessages: + msg344922
2019-06-07 10:53:05vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request13763
2019-06-04 17:39:53vstinnersetmessages: + msg344613
2019-06-04 17:36:29vstinnersetmessages: + msg344611
2019-03-22 23:20:28vstinnersetnosy: + pablogsal
messages: + msg338623
2019-03-22 23:18:24vstinnercreate