Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly #80583

Closed
vstinner opened this issue Mar 22, 2019 · 16 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir

Comments

@vstinner
Copy link
Member

BPO 36402
Nosy @pitrou, @kristjanvalur, @vstinner, @ericsnowcurrently, @koobs, @pablogsal
PRs
  • [WIP] bpo-36402: add random sleep to test_threads_join_2() #13889
  • [WIP] bpo-36402: Fix test_threading.test_threads_join_2() #13946
  • bpo-36402: Fix threading._shutdown() race condition #13948
  • bpo-36402: Add test_threading.test_finalization_shutdown() #13949
  • [3.8] bpo-36402: Fix threading._shutdown() race condition (GH-13948) #14033
  • [3.7] bpo-36402: Fix threading._shutdown() race condition (GH-13948) #14034
  • tbpo-36402: Fix threading.Thread._stop() #14047
  • [3.8] bpo-36402: Fix threading._shutdown() race condition (GH-13948) #14050
  • [3.7] bpo-36402: Fix threading._shutdown() race condition (GH-13948) (GH-14050) #14054
  • Files
  • py_finalize.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-06-13.12:28:09.923>
    created_at = <Date 2019-03-22.23:18:24.071>
    labels = ['3.8', '3.7', 'library', '3.9']
    title = 'threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly'
    updated_at = <Date 2019-08-09.11:08:52.420>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-08-09.11:08:52.420>
    actor = 'kristjan.jonsson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-06-13.12:28:09.923>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2019-03-22.23:18:24.071>
    creator = 'vstinner'
    dependencies = []
    files = ['48409']
    hgrepos = []
    issue_num = 36402
    keywords = ['patch']
    message_count = 16.0
    messages = ['338622', '338623', '344611', '344613', '344922', '345143', '345146', '345149', '345150', '345429', '345492', '345502', '345508', '345509', '345510', '349281']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'kristjan.jonsson', 'vstinner', 'eric.snow', 'koobs', 'pablogsal']
    pr_nums = ['13889', '13946', '13948', '13949', '14033', '14034', '14047', '14050', '14054']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue36402'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @vstinner
    Copy link
    Member Author

    https://buildbot.python.org/all/#/builders/168/builds/801

    0:23:17 load avg: 5.00 [334/420/1] test_threading crashed (Exit code -6) -- running: test_tools (8 min 42 sec), test_multiprocessing_spawn (5 min 41 sec), test_zipfile (30 sec 787 ms)
    Fatal Python error: Py_EndInterpreter: not the last thread

    Current thread 0x0000000800acd000 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/init.py", line 2778 in run_in_subinterp
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test_threading.py", line 917 in test_threads_join_2
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 642 in run
    ...

    The test crashed once, but then passed when run again in verbose mode ("Re-running test 'test_threading' in verbose mode").

    @vstinner vstinner added tests Tests in the Lib/test dir 3.8 only security fixes labels Mar 22, 2019
    @vstinner
    Copy link
    Member Author

    As far as I remember, test_threads_join_2() was already unstable.

    I created this issue to try to track if it's a regression or not.

    If it's a regression, I would suggest to have a look at Eric Snow's recent commits.

    At this point, I simply have no idea if the test fails exactly one in the lifetime of the buildbot worker, or if it started to fail frequently on this FreeBSD buildbot.

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 4, 2019

    New failure:

    AMD64 FreeBSD CURRENT Shared 3.x:
    https://buildbot.python.org/all/#/builders/168/builds/1154

    ...
    test_thread_leak (test.test_threading.PyRLockTests) ... ok
    test_timeout (test.test_threading.PyRLockTests) ... ok
    test_try_acquire (test.test_threading.PyRLockTests) ... ok
    test_try_acquire_contended (test.test_threading.PyRLockTests) ... ok
    test_weakref_deleted (test.test_threading.PyRLockTests) ... ok
    test_weakref_exists (test.test_threading.PyRLockTests) ... ok
    test_with (test.test_threading.PyRLockTests) ... ok
    test_acquire (test.test_threading.SemaphoreTests) ... ok
    test_acquire_contended (test.test_threading.SemaphoreTests) ... ok
    test_acquire_destroy (test.test_threading.SemaphoreTests) ... ok
    test_acquire_timeout (test.test_threading.SemaphoreTests) ... ok
    test_constructor (test.test_threading.SemaphoreTests) ... ok
    test_default_value (test.test_threading.SemaphoreTests) ... ok
    test_release_unacquired (test.test_threading.SemaphoreTests) ... ok
    test_try_acquire (test.test_threading.SemaphoreTests) ... ok
    test_try_acquire_contended (test.test_threading.SemaphoreTests) ... ok
    test_with (test.test_threading.SemaphoreTests) ... ok
    test_daemon_threads_fatal_error (test.test_threading.SubinterpThreadingTests) ... ok
    test_threads_join (test.test_threading.SubinterpThreadingTests) ... ok
    Fatal Python error: Py_EndInterpreter: not the last thread

    Current thread 0x0000000800ac3000 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/init.py", line 2794 in run_in_subinterp
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test_threading.py", line 923 in test_threads_join_2
    test_threads_join_2 (test.test_threading.SubinterpThreadingTests) ... *** Signal 6

    Stop.
    make: stopped in /usr/home/buildbot/python/3.x.koobs-freebsd-current/build
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 628 in _callTestMethod
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 671 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/case.py", line 731 in __call__
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 122 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 84 in __call__
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 122 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 84 in __call__
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 122 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/suite.py", line 84 in __call__
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/unittest/runner.py", line 176 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/init.py", line 1984 in _run_suite
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/support/init.py", line 2080 in run_unittest
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 203 in _test_module
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 228 in _runtest_inner2
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 264 in _runtest_inner
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 148 in _runtest
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest.py", line 187 in runtest
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 291 in rerun_failed_tests
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 646 in _main
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 588 in main
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 663 in main
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/main.py", line 2 in <module>
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/runpy.py", line 85 in _run_code
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/runpy.py", line 192 in _run_module_as_main
    program finished with exit code 1
    elapsedTime=3523.834729

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 4, 2019

    In the same build:
    https://buildbot.python.org/all/#builders/168/builds/1154

    0:28:57 load avg: 12.93 [208/423/1] test_threading crashed (Exit code 1) -- running: test_lib2to3 (7 min 9 sec), test_multiprocessing_spawn (1 min 36 sec)
    Timeout (0:25:00)!
    Thread 0x0000000800ac3000 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 1015 in _wait_for_tstate_lock
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 999 in join
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/test_threading.py", line 1287 in test_interrupt_main_subthread
    (...)

    @vstinner
    Copy link
    Member Author

    vstinner commented Jun 7, 2019

    I wrote PR 13889: with this change, I can easily reproduce the crash on Linux:

    $ ./python -m test test_threading -m test_threads_join_2 -F
    Run tests sequentially
    0:00:00 load avg: 0.51 [  1] test_threading
    Fatal Python error: Py_EndInterpreter: not the last thread

    Current thread 0x00007f84ad74d740 (most recent call first):
    File "/home/vstinner/prog/python/master/Lib/test/support/init.py", line 2794 in run_in_subinterp
    File "/home/vstinner/prog/python/master/Lib/test/test_threading.py", line 923 in test_threads_join_2
    ...

    Py_EndInterpreter() calls wait_for_thread_shutdown() to wait until threading._shutdown() completes.

    When the assertion fails, threading.enumerate() only contains the main thread: the spawned thread is already gone. But the assertion fails, which means that the Python thread state of the thread (which looks to be completed) is still around.

    This unit test comes from bpo-18808:

    commit 7b47699
    Author: Antoine Pitrou <solipsis@pitrou.net>
    Date: Sat Sep 7 23:38:37 2013 +0200

    Issue bpo-18808: Thread.join() now waits for the underlying thread state to be destroyed before returning.
    This prevents unpredictable aborts in Py_EndInterpreter() when some non-daemon threads are still running.
    

    @vstinner
    Copy link
    Member Author

    Oh. Using PR 13889, I'm able to reproduce the bug up to Python 3.4. Example at commit commit e76cbc7 (tag: v3.4.10):

    $ wget 'https://github.com/python/cpython/pull/13889.patch'
    $ git apply 13889.patch
    $ ./python -m test  -F -m test_threads_join_2  test_threading
    [  1] test_threading
    [  2] test_threading
    (...)
    [ 10] test_threading
    [ 11] test_threading
    Fatal Python error: Py_EndInterpreter: not the last thread

    Current thread 0x00007f418b3280c0 (most recent call first):
    File "/home/vstinner/prog/python/3.4/Lib/test/support/init.py", line 2304 in run_in_subinterp
    File "/home/vstinner/prog/python/3.4/Lib/test/test_threading.py", line 877 in test_threads_join_2
    (...)
    Aborted (core dumped)

    @vstinner
    Copy link
    Member Author

    threading._shutdown() uses threading.enumerate() which iterations on threading._active.

    threading.Thread registers itself into threading._active using its _bootstrap_inner() method. It unregisters itself when _bootstrap_inner() completes, whereas its is_alive() method still returns true: since the underlying native thread still runs and the Python thread state still exists.

    _thread._set_sentinel() creates a lock and registers a tstate->on_delete callback to release this lock. It's called by threading.Thread._set_tstate_lock() to set threading.Thread._tstate_lock. This lock is used by threading.Thread.join() to wait until the thread completes.

    _thread.start_new_thread() calls the C function t_bootstrap() which ends with:

        tstate->interp->num_threads--;
        PyThreadState_Clear(tstate);
        PyThreadState_DeleteCurrent();
        PyThread_exit_thread();

    _PyThreadState_DeleteCurrent() calls tstate->on_delete() which releases threading.Thread._tstate_lock lock.

    In test_threads_join_2() test, PyThreadState_Clear() blocks on clearing thread variables: the Sleeper destructor of the Sleeper instance sleeps.

    The race condition is that:

    • threading._shutdown() rely on threading._alive
    • Py_EndInterpreter() rely on the interpreter linked list of Python thread states: interp->tstate_head.

    @vstinner
    Copy link
    Member Author

    Other references to test_threads_join_2() failures:

    • bpo-36989: fails once AIX
    • bpo-27791: 2 failures in 2016 in "AMD64 FreeBSD 10 Shared 3.x" buildbots (I closed the issue with: "Sadly, such bug is hard to reproduce. I didn't look at buildbots recently, and I don't have access to FreeBSD, so I just close the issue :-/")
    • bpo-28084: once failure on "AMD64 FreeBSD 9.x 3.x" buildbot (closed as a duplicate of bpo-27791)

    See also bpo-18808: "Thread.join returns before PyThreadState is destroyed" (issue which added the test).

    @vstinner
    Copy link
    Member Author

    test_threading: test_threads_join_2() was added by commit 7b47699 in 2013, but the test failed randomly since it was added. It's just that failures were ignored until I created https://bugs.python.org/issue36402 last March.

    In fact, when the test failed randomly on buildbot (with tests run in parallel), it was fine since test_threading was re-run alone and then the test passed. The buildbot build was seen overall as a success. Previous issues were closed (see my previous comment).

    The test shows the bug using subinterpreters (Py_EndInterpreter), but the bug also exists in Py_Finalize() which hash the same race condition (it also calls threading._shutdown()). It's just that Py_EndInterpreter() is stricter, it contains this assertion:

        if (tstate != interp->tstate_head || tstate->next != NULL)
            Py_FatalError("Py_EndInterpreter: not the last thread");

    Attached py_finalize.patch adds the same assertion to Py_Finalize.

    I added test_threading.test_finalization_shutdown() to PR 13948. If you run test_finalization_shutdown() with py_finalize.patch, Py_Finalize() fails with a similar assertion error.

    But py_finalize.patch is incompatible with the principle of daemon threads and so cannot be commited.

    @vstinner vstinner changed the title test_threading: test_threads_join_2() failed with "Fatal Python error: Py_EndInterpreter: not the last thread" on AMD64 FreeBSD CURRENT Shared 3.x threading._shutdown() race condition: test_threading test_threads_join_2() fails randomly Jun 10, 2019
    @vstinner
    Copy link
    Member Author

    New changeset 468e5fe by Victor Stinner in branch 'master':
    bpo-36402: Fix threading._shutdown() race condition (GH-13948)
    468e5fe

    @vstinner vstinner added 3.7 (EOL) end of life 3.9 only security fixes labels Jun 12, 2019
    @vstinner
    Copy link
    Member Author

    New changeset 6f75c87 by Victor Stinner in branch 'master':
    tbpo-36402: Fix threading.Thread._stop() (GH-14047)
    6f75c87

    @vstinner
    Copy link
    Member Author

    New changeset e40a97a by Victor Stinner in branch '3.8':
    [3.8] bpo-36402: Fix threading._shutdown() race condition (GH-13948) (GH-14050)
    e40a97a

    @vstinner
    Copy link
    Member Author

    New changeset 6eb2878 by Victor Stinner (Miss Islington (bot)) in branch '3.7':
    bpo-36402: Fix threading._shutdown() race condition (GH-13948) (GH-14050) (GH-14054)
    6eb2878

    @vstinner
    Copy link
    Member Author

    The bpo-18808 "Thread.join returns before PyThreadState is destroyed" was not fixed in Python 2.7: threading.Thread has no _tstate_lock attribute. I'm not comfortable to backport bpo-18808 "feature" or "bugfix" to Python 2.7, not to backport this change.

    Python 2.7 works as it is, and it's going to reach its end of life at the end of the year. I guess that people learnt how to work around Python 2.7 limitation like bpo-18808.

    @vstinner
    Copy link
    Member Author

    Ok, the root issue (threading._shutdown() race condition) has been fixed in Python 3.7, 3.8 and master branches. I close the issue. Thanks for the reviews!

    @vstinner vstinner added stdlib Python modules in the Lib dir and removed tests Tests in the Lib/test dir labels Jun 13, 2019
    @kristjanvalur
    Copy link
    Mannequin

    kristjanvalur mannequin commented Aug 9, 2019

    Please note that this fix appears to be the cause of bpo-37788

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant