Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shutdown (exit) can hang or segfault with daemon threads running #46164

Closed
gpshead opened this issue Jan 17, 2008 · 32 comments
Closed

shutdown (exit) can hang or segfault with daemon threads running #46164

gpshead opened this issue Jan 17, 2008 · 32 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@gpshead
Copy link
Member

gpshead commented Jan 17, 2008

BPO 1856
Nosy @birkenfeld, @terryjreedy, @doko42, @gpshead, @kbkaiser, @amauryfa, @pitrou, @vstinner, @benjaminp, @ned-deily, @Trundle
Files
  • thread_exit.py: run this several times for several different behaviors
  • python-trunk-issue1856-threadexit.diff: Immediately exit non-main threads if the interpreter is exiting
  • thread_noswap.patch
  • python-thread_noswap-2.diff
  • finalizing.patch
  • finalizing2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-07-11.21:08:57.822>
    created_at = <Date 2008-01-17.02:01:14.628>
    labels = ['interpreter-core', 'type-crash']
    title = 'shutdown (exit) can hang or segfault with daemon threads running'
    updated_at = <Date 2020-03-26.23:05:06.906>
    user = 'https://github.com/gpshead'

    bugs.python.org fields:

    activity = <Date 2020-03-26.23:05:06.906>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-07-11.21:08:57.822>
    closer = 'ned.deily'
    components = ['Interpreter Core']
    creation = <Date 2008-01-17.02:01:14.628>
    creator = 'gregory.p.smith'
    dependencies = []
    files = ['9188', '9199', '9200', '10208', '21862', '21872']
    hgrepos = []
    issue_num = 1856
    keywords = ['patch']
    message_count = 32.0
    messages = ['60014', '60016', '60018', '60056', '60059', '60063', '60069', '60070', '60071', '60072', '60073', '60074', '60075', '60077', '66363', '66742', '73720', '81485', '112641', '115330', '116928', '135006', '135007', '135051', '135144', '135146', '194599', '220786', '222768', '222796', '231460', '365120']
    nosy_count = 20.0
    nosy_names = ['georg.brandl', 'terry.reedy', 'doko', 'gregory.p.smith', 'kbk', 'guettli', 'amaury.forgeotdarc', 'ggenellina', 'Rhamphoryncus', 'larsimmisch', 'pitrou', 'vstinner', 'forest', 'flub', 'benjamin.peterson', 'jamescooper', 'ned.deily', 'Trundle', 'Netto', 'python-dev']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue1856'
    versions = ['Python 3.2', 'Python 3.3']

    @gpshead
    Copy link
    Member Author

    gpshead commented Jan 17, 2008

    This probably applies to 3.0 as well but i have not tested it there.

    Here are some sample failures:

    =========== A ==============
    Exception in thread Thread-0000000001 (most likely raised during
    interpreter shutdown):Exception in thread Thread-0000000003 (most likely
    raised during interpreter shutdown):

    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0xe4cf1bb0 (LWP 22281)]
    0x080c0824 in PyEval_EvalFrameEx (f=0x820d59c, throwflag=0)
    at Python/ceval.c:2633
    2633 if (tstate->frame->f_exc_type != NULL)
    ============================

    =========== B ==============

    Exception in thread Thread-0000000001 (most likely raised during
    interpreter shutdown):
    Traceback (most recent call last):
      File "/home/gps/oss/python/trunk/Lib/threading.py", line 486, in
    __bootstrap_inner
      File "thread_exit.py", line 24, in run
    <type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'add'

    Program received signal SIGINT, Interrupt.
    [Switching to Thread 0xf7fe08e0 (LWP 21703)]
    0xffffe410 in __kernel_vsyscall ()
    (gdb) bt
    #0 0xffffe410 in __kernel_vsyscall ()
    #1 0x4dac1af0 in sem_wait@GLIBC_2.0 () from
    /lib/tls/i686/cmov/libpthread.so.0
    #2 0xf7fe08c0 in ?? ()
    #3 0x080ec9d5 in PyThread_acquire_lock (lock=0x0, waitflag=1)
    at Python/thread_pthread.h:349
    #4 0x080be6d5 in PyEval_RestoreThread (tstate=0x8168050) at
    Python/ceval.c:314
    #5 0x0806caf5 in file_dealloc (f=0xf7f29d58) at Objects/fileobject.c:396
    #6 0x0810be36 in frame_dealloc (f=0x8201f04) at Objects/frameobject.c:416
    #7 0x080e28d2 in PyThreadState_Clear (tstate=0x8184068)
    at Python/pystate.c:217
    #8 0x080e2994 in PyInterpreterState_Clear (interp=0x8168008)
    at Python/pystate.c:105
    #9 0x080e473b in Py_Finalize () at Python/pythonrun.c:444
    #10 0x080e428e in handle_system_exit () at Python/pythonrun.c:1641
    #11 0x080e447a in PyErr_PrintEx (set_sys_last_vars=1)
    at Python/pythonrun.c:1087
    #12 0x080e510d in PyRun_SimpleFileExFlags (fp=0x817bd98,
    filename=0xffffd73c "thread_exit.py", closeit=1, flags=0xffffd518)
    at Python/pythonrun.c:996
    #13 0x0805753b in Py_Main (argc=1, argv=0xffffd5b4) at Modules/main.c:574
    #14 0x4d95cea2 in __libc_start_main () from /lib/tls/i686/cmov/libc.so.6
    #15 0x080566c1 in _start () at ../sysdeps/i386/elf/start.S:119
    (gdb) bt
    #0 0x080c0824 in PyEval_EvalFrameEx (f=0x820d59c, throwflag=0)
    at Python/ceval.c:2633
    #1 0x080c451c in PyEval_EvalFrameEx (f=0x820d40c, throwflag=0)
    at Python/ceval.c:3690
    #2 0x080c451c in PyEval_EvalFrameEx (f=0x820d13c, throwflag=0)
    at Python/ceval.c:3690
    #3 0x080c5fd4 in PyEval_EvalCodeEx (co=0xf7f22d58, globals=0xf7f962d4,
    locals=0x0, args=0xf7f46e18, argcount=1, kws=0x0, kwcount=0, defs=0x0,
    defcount=0, closure=0x0) at Python/ceval.c:2876
    #4 0x0810d4d6 in function_call (func=0xf7f41454, arg=0xf7f46e0c, kw=0x0)
    at Objects/funcobject.c:524
    #5 0x0805b0b4 in PyObject_Call (func=0xf7f41454, arg=0x0, kw=0x0)
    at Objects/abstract.c:1872
    #6 0x08061008 in instancemethod_call (func=0xf7f41454, arg=0xf7f46e0c,
    kw=0x0)
    at Objects/classobject.c:2509
    #7 0x0805b0b4 in PyObject_Call (func=0xf7f3fdec, arg=0x0, kw=0x0)
    at Objects/abstract.c:1872
    #8 0x080be8f9 in PyEval_CallObjectWithKeywords (func=0x0, arg=0xf7fa002c,
    kw=0x0) at Python/ceval.c:3473
    #9 0x080efb4d in t_bootstrap (boot_raw=0x820c910)
    at ./Modules/threadmodule.c:422
    #10 0x4dabd341 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
    #11 0x4da114ee in clone () from /lib/tls/i686/cmov/libc.so.6
    ============================

    And as with all problems of this sort... sometimes the program exits
    normally without any problems.

    I ran python trunk:60012 under gdb above. But these problems occur on
    older versions such as 2.4.

    @gpshead gpshead added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump labels Jan 17, 2008
    @gvanrossum
    Copy link
    Member

    So can we definitely rule out that this could be caused by the recent
    changes to threading.py (r57216)?

    @gpshead
    Copy link
    Member Author

    gpshead commented Jan 17, 2008

    Yes i believe it is unrelated to any recent change.

    I can reproduce both behaviors on my OS X 10.4 dual core mac using
    Python 2.5.1.

    Python 2.3 on the mac appears to get stuck in a loop when run stand
    alone but gets a memory access fault when run under gdb.

    python 2.4.4 seems to hang most of the time.

    (all behaviors are possible i expect, i just ran it by hand under
    several versions a few times)

    The systems i ran it on when reporting the bug was SMP. As with many
    threading bugs it might be easier to reproduce on SMP systems but i
    haven't checked.

    @amauryfa
    Copy link
    Member

    I uploaded a script for a similar issue:
    http://bugs.python.org/issue1596321 (thread_crash.py)
    which sometimes segfaults on Windows XP (trunk version, debug build),
    with the same call stacks as printed by Gregory, on both threads.

    I explain it this way:
    On interpreter shutdown, the main thread clears the other's thread
    TreadState. There you find the instruction: 
          Py_CLEAR(tstate->frame);
    But this can call arbitrary python code on deallocation of locals, and
    release the GIL (file.close() in our case).
    The other thread can then continue to run. If it is given enough
    processor time before the whole process shutdowns, it will reach this
    line in ceval.c (line 2633)
    	if (tstate->frame->f_exc_type != NULL)
    since tstate->frame has been cleared by the main thread ==> boom.

    There can be several ways to solve this problem:

    • While clearing a thread state, prevent the "owner" thread to gain
      focus. Only the main thread can "Py_END_ALLOW_THREADS". The daemon
      thread is blocked forever, and will die there.
    • Be smarter when clearing tstate->frame: first call frame_clear(),
      which will free local variables, but let the frame in a valid state.
      Only after that, call Py_CLEAR(tstate->frame), when we are sure that no
      more thread switch can occur. Of course there are many other fields in a
      frame; we have to inspect them carefully. The first solution seems a
      more robust approach.

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented Jan 17, 2008

    I think non-main threads should kill themselves off if they grab the
    interpreter lock and the interpreter is tearing down. They're about to
    get killed off anyway, when the process exits.

    PyGILState_Ensure would still be broken. It touches various things that
    get torn down (autoInterpreterState, autoTLSkey, and the HEAD_LOCK
    macros) before it grabs the GIL. Reordering should be possible of course.

    @gpshead
    Copy link
    Member Author

    gpshead commented Jan 17, 2008

    agreed, during shutdown the other threads should be stopped. anything
    to do this complicates acquiring and releasing the GIL by adding another
    check to see if we're shutting down.

    brainstorm: I haven't looked at the existing BEGIN_ALLOW_THREADS and
    END_ALLOW_THREADS implementations but would it be possible to modify
    them on the fly from the thread doing the shutdown (main or not) while
    it holds the GIL such that all future calls to BEGIN_ALLOW_THREADS do
    not actually release the GIL and END_ALLOW_THREADS always blocks. That
    should bring other threads to a halt pretty quickly and prevent
    destructors from releasing the GIL (file.close, etc).

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented Jan 17, 2008

    I'm not sure I understand you, Gregory. Are arguing in favour of adding
    extra logic to the GIL code, or against it?

    I'm attaching a patch that has non-main thread exit, and it seems to fix
    the test case. It doesn't fix the PyGILState_Ensure problems though.

    Also note that PyThread_exit_thread() was completely broken, becoming a
    no-op once threading was initialized, and even before then it would exit
    the process rather than just the thread. I've fixed it for pthreads,
    but not for any of the other platforms.

    @amauryfa
    Copy link
    Member

    PyGILState_Ensure would still be broken.
    It touches various things that get torn down (autoInterpreterState,
    autoTLSkey, and the HEAD_LOCK macros) before it grabs the GIL.
    Reordering should be possible of course.

    Adam, did you notice the change on revision 59231 ? the
    PyGILState_Ensure stuff should now remain valid during the
    PyInterpreterState_Clear() call.

    @amauryfa
    Copy link
    Member

    Adam, your patch cover one case of the thread releasing the GIL
    (Py_ALLOW_THREADS), but there are more calls to PyThread_acquire_lock
    (the checkinterval for example).

    I have a competing patch: it makes the main thread never release the GIL
    after some point. Other threads stay blocked and don't clean themselves:
    the OS will take care of them.

    Both approaches correct the initial problem, though.
    A remaining question is where exactly in Py_Finalize() we should ban
    other threads. I think that this point should be as late as possible, to
    allow some object/resources to be cleaned properly.

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented Jan 17, 2008

    Adam, did you notice the change on revision 59231 ? the
    PyGILState_Ensure stuff should now remain valid during the
    PyInterpreterState_Clear() call.

    That doesn't matter. PyGILState_Ensure needs to remain valid *forever*.
    Only once the process is completely gone can we be sure it won't be called.

    Note that PyGILState_Ensure has two behaviours: it can be called when
    your thread is already running python code and has a tstate, or it can
    be called when you have none. revision 59231 fixed the former, but only
    when Py_Finalize hasn't finished. It doesn't fix it for any daemon
    threads, or for dummy threads (created outside of python's control).

    ...

    You're right, I did forget the 3 other places that acquire the
    interpreter_lock. The more I think about it the more I think our two
    approaches are equivalent, but conditionalizing the release means we
    don't need to touch the broken PyThread_exit_thread functions.

    I think the banning should be as early as possible, right after
    call_sys_exitfunc() has finished. You can't do anything sane once the
    interpreter is half-gone (despite the precedent of trying anyway.)

    @amauryfa
    Copy link
    Member

    That doesn't matter. PyGILState_Ensure needs to remain valid
    *forever*. Only once the process is completely gone can we be sure
    it won't be called.

    We could apply the same idea: when exiting, PyGILState_Ensure() blocks
    forever, except for the main thread of course.

    Note that all this state must be restartable: after Py_Finalize(), it
    should be possible to call Py_Initialize() again. This seems to raise
    the score of the "exit_thread" approach.
    I don't know if multiple interpreters are well supported, though.
    Is there somewhere a list of use cases, or a test script that can
    exercise this?

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented Jan 17, 2008

    PyGILState_Ensure WOULD block forever if it acquired the GIL before
    doing anything else.

    The only way to make Py_Initialize callable after Py_Finalize is to make
    various bits of the finalization into no-ops. For instance, it's
    currently impossible to unload C extension modules, so they must be left
    around permanently. I'm not convinced it works currently, or that
    there's use cases for it.

    Note that unloading python.so between Py_Finalize and Py_Initialize
    would definitely break things, as you'd lose all the global variables
    maintaining things. That eliminates the only "it's more elegant" argument.

    @amauryfa
    Copy link
    Member

    PyGILState_Ensure WOULD block forever if it acquired the GIL
    before doing anything else.
    Is it possible at all? PyThread_acquire_lock is not reentrant.

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented Jan 18, 2008

    Hrm. It seems you're right. Python needs thread-local data to
    determine if the GIL is held by the current thread. Thus, autoTLSkey
    and all that need to never be torn down. (The check could be done much
    more directly than the current PyThreadState_IsCurrent machinations, but
    I digress.)

    @Rhamphoryncus
    Copy link
    Mannequin

    Rhamphoryncus mannequin commented May 7, 2008

    Cleaned up version of Amaury's patch. I stop releasing the GIL after
    sys.exitfunc is called, which protects threads from the ensuing teardown.

    I also grab the import lock (and never release it). This should prevent
    the nasty issue with daemon threads doing imports mentioned in issue
    1720705, so that part of the documentation change can be removed. I
    didn't do any significant testing though.

    Importing raises a potential issue with this approach. The standard
    meme of "release GIL; grab lock; acquire GIL" becomes invalid. A child
    may grab the lock, then block on the GIL, while the main thread (which
    never releases the GIL) will block on acquiring the lock. I worked
    around it with the import lock due to it's recursive behaviour, but if
    it exists anywhere else in CPython it may cause a problem if used during
    shutdown.

    @birkenfeld
    Copy link
    Member

    Closed bpo-2077 as a duplicate.

    @amauryfa
    Copy link
    Member

    The threads don't have to be daemons: sys.exit() calls Py_Finalize()
    without waiting for non-daemons threads.

    @ggenellina
    Copy link
    Mannequin

    ggenellina mannequin commented Feb 9, 2009

    I think applying Rhamphoryncus' patch in bpo-1722344 fixes this too (that
    is, move WaitForThreadShutdown from the end of PyMain into Py_Finalize,
    so it is always called)
    But it should be tested on several platforms.

    @terryjreedy
    Copy link
    Member

    Is this still an issue in 2.7 or 3.x?
    Or should it be closed?

    @pitrou
    Copy link
    Member

    pitrou commented Sep 1, 2010

    It is certainly still a problem with 3.x, but I don't find a way to exhibit it here.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 20, 2010

    See bpo-9901 for a variation on the same global issue (running threads can access interpreter structures - the GIL - while the main thread is trying to destroy them).

    @pitrou
    Copy link
    Member

    pitrou commented May 2, 2011

    With Python 3.3, thread_exit.py still crashes more or less randomly:

    $ ./python thread_exit.py 
    [49205 refs]
    python: Python/_warnings.c:501: setup_context: Assertion `((((((PyObject*)(*filename))->ob_type))->tp_flags & ((1L<<28))) != 0)' failed.
    Abandon

    @pitrou
    Copy link
    Member

    pitrou commented May 2, 2011

    A patch that seems to work under Linux and Windows (for 3.2/3.3).

    @pitrou
    Copy link
    Member

    pitrou commented May 3, 2011

    Victor pointed out that Py_Finalize() is not necessarily called in the main Python thread. This new patch records the thread state of the finalizing thread, and also includes a test case.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 4, 2011

    New changeset 2a19d09b08f8 by Antoine Pitrou in branch '3.2':
    Issue bpo-1856: Avoid crashes and lockups when daemon threads run while the
    http://hg.python.org/cpython/rev/2a19d09b08f8

    New changeset c892b0321d23 by Antoine Pitrou in branch 'default':
    Issue bpo-1856: Avoid crashes and lockups when daemon threads run while the
    http://hg.python.org/cpython/rev/c892b0321d23

    @pitrou
    Copy link
    Member

    pitrou commented May 4, 2011

    Should be fixed in 3.2 and 3.3 now. I don't really want to bother with 2.7 and 3.1 (the GIL implementation is different), but someone can backport the patch if they want to :)

    @pitrou pitrou closed this as completed May 4, 2011
    @guettli
    Copy link
    Mannequin

    guettli mannequin commented Aug 7, 2013

    There are some examples to work around this for Python2: http://stackoverflow.com/questions/18098475/detect-interpreter-shut-down-in-daemon-thread

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jun 17, 2014

    New changeset 7741d0dd66ca by Benjamin Peterson in branch '2.7':
    avoid crashes and lockups from daemon threads during interpreter shutdown (bpo-1856)
    http://hg.python.org/cpython/rev/7741d0dd66ca

    @doko42
    Copy link
    Member

    doko42 commented Jul 11, 2014

    http://tracker.ceph.com/issues/8797 reports that the backport to 2.7 causes a regression in ceph.

    @doko42 doko42 reopened this Jul 11, 2014
    @ned-deily
    Copy link
    Member

    I've opened bpo-21963 to track the 2.7.8 regression. Please continue any discussion there.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 21, 2014

    New changeset 4ceca79d1c63 by Antoine Pitrou in branch '2.7':
    Issue bpo-21963: backout issue bpo-1856 patch (avoid crashes and lockups when
    https://hg.python.org/cpython/rev/4ceca79d1c63

    @vstinner
    Copy link
    Member

    bpo-1193099 was marked as a duplicate of this issue.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    9 participants