
Stuck during interpreter exit, attempting to take the GIL #80650

Closed
mocramis mannequin opened this issue Mar 29, 2019 · 18 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), stdlib (Python modules in the Lib dir)

Comments

@mocramis
Mannequin

mocramis mannequin commented Mar 29, 2019

BPO 36469
Nosy @methane, @ericsnowcurrently
Files
  • stacktraces
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-04-25.15:56:45.929>
    created_at = <Date 2019-03-29.10:45:19.443>
    labels = ['interpreter-core', 'library']
    title = 'Stuck during interpreter exit, attempting to take the GIL'
    updated_at = <Date 2019-04-25.15:56:45.927>
    user = 'https://bugs.python.org/mocramis'

    bugs.python.org fields:

    activity = <Date 2019-04-25.15:56:45.927>
    actor = 'mocramis'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-04-25.15:56:45.929>
    closer = 'mocramis'
    components = ['Interpreter Core', 'Library (Lib)']
    creation = <Date 2019-03-29.10:45:19.443>
    creator = 'mocramis'
    dependencies = []
    files = ['48239']
    hgrepos = []
    issue_num = 36469
    keywords = []
    message_count = 18.0
    messages = ['339103', '339111', '339116', '339118', '339120', '339125', '339132', '339133', '339134', '339137', '339139', '339140', '339141', '339146', '339147', '339187', '339947', '340847']
    nosy_count = 3.0
    nosy_names = ['methane', 'eric.snow', 'mocramis']
    pr_nums = []
    priority = 'normal'
    resolution = 'later'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue36469'
    versions = ['Python 3.5']

    @mocramis
    Mannequin Author

    mocramis mannequin commented Mar 29, 2019

    I have a script (sadly, I can't publish it) spawning multiple threads that, in rare occurrences, does not manage to exit properly and gets stuck forever.

    More precisely, this seems to happen during interpreter exit: the atexit callbacks are called successfully, and we then have multiple threads all attempting to take the GIL while no thread seems to own it (_PyThreadState_Current is always '{_value = 0}' while gil_locked is '{_value = 1}').

    The main thread stack looks like this:

    #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
    #1 0x000055d00997ce6a in PyCOND_TIMEDWAIT (cond=0x55d009e8ddc0 <gil_cond>, mut=0x55d009e8dd80 <gil_mutex>, us=<optimized out>) at ../Python/condvar.h:103
    #2 take_gil () at ../Python/ceval_gil.h:224
    #3 0x000055d00998580b in PyEval_EvalFrameEx () at ../Python/ceval.c:1273
    #4 0x000055d00996f16f in _PyEval_EvalCodeWithName.lto_priv.1929 (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwcount=0, kws=<optimized out>, argcount=<optimized out>,
    args=<optimized out>, locals=<optimized out>, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4033
    #5 PyEval_EvalCodeEx () at ../Python/ceval.c:4054
    #6 0x000055d0099b90e3 in function_call.lto_priv () at ../Objects/funcobject.c:627
    #7 0x000055d009a02e17 in PyObject_Call () at ../Objects/abstract.c:2166
    #8 0x000055d00992034e in method_call.lto_priv () at ../Objects/classobject.c:330
    #9 0x000055d009a02e17 in PyObject_Call () at ../Objects/abstract.c:2166
    #10 0x000055d00996df7d in PyEval_CallObjectWithKeywords () at ../Python/ceval.c:4595
    #11 0x000055d009a5d05d in slot_tp_repr () at ../Objects/typeobject.c:5992
    #12 0x000055d0099c9685 in PyObject_Repr () at ../Objects/object.c:482
    #13 0x000055d0099aa6be in unicode_fromformat_arg (vargs=0x7ffc2ca81110, f=0x55d009a8a837 "R", writer=0x7ffc2ca810b0) at ../Objects/unicodeobject.c:2645
    #14 PyUnicode_FromFormatV () at ../Objects/unicodeobject.c:2710
    #15 0x000055d009a572bc in PyErr_WarnFormat () at ../Python/_warnings.c:895
    #16 0x000055d0098840bb in sock_dealloc (s=0x7f43000fc528) at ../Modules/socketmodule.c:4177
    #17 0x000055d0099d031d in subtype_dealloc.lto_priv () at ../Objects/typeobject.c:1209
    #18 0x000055d0099b68f7 in frame_dealloc.lto_priv () at ../Objects/frameobject.c:431
    #19 0x000055d0098ab7b1 in PyThreadState_Clear (tstate=0x55d00bee8a70) at ../Python/pystate.c:386
    #20 0x000055d009a4d08a in PyInterpreterState_Clear () at ../Python/pystate.c:118
    #21 0x000055d009a4e1d2 in Py_Finalize () at ../Python/pylifecycle.c:633
    #22 0x000055d009a4e2a8 in Py_Exit (sts=sts@entry=0) at ../Python/pylifecycle.c:1465
    #23 0x000055d009a4e38e in handle_system_exit () at ../Python/pythonrun.c:602
    #24 0x000055d009a4e3f6 in PyErr_PrintEx () at ../Python/pythonrun.c:612
    #25 0x000055d009a4f667 in PyErr_Print () at ../Python/pythonrun.c:508
    #26 PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:401
    #27 0x000055d009a7c2e7 in run_file (p_cf=0x7ffc2ca814fc, filename=0x55d00bb01140 L"...", fp=0x55d00bb62e60) at ../Modules/main.c:318
    #28 Py_Main () at ../Modules/main.c:768
    #29 0x000055d00990bd71 in main () at ../Programs/python.c:65
    #30 0x00007f430b7cd2e1 in __libc_start_main (main=0x55d00990bc90 <main>, argc=11, argv=0x7ffc2ca81708, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffc2ca816f8)
    at ../csu/libc-start.c:291
    #31 0x000055d009a12a7a in _start ()

    We can see it is trying to take the GIL while finalizing (as it is emitting a warning when destroying a socket). However, this prevents any other thread from getting deleted, since the main thread holds the head_lock. For instance, thread 18 is trying to take the head lock:

    Thread 18 (Thread 0x7f4302ffd700 (LWP 21117)):
    #0 0x00007f430c6aa536 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, futex_word=0x55d00bb014c0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
    #1 do_futex_wait (sem=sem@entry=0x55d00bb014c0, abstime=0x0) at sem_waitcommon.c:111
    #2 0x00007f430c6aa5e4 in __new_sem_wait_slow (sem=0x55d00bb014c0, abstime=0x0) at sem_waitcommon.c:181
    #3 0x000055d00994d2d5 in PyThread_acquire_lock_timed () at ../Python/thread_pthread.h:352
    #4 0x000055d009a4dcec in tstate_delete_common () at ../Python/pystate.c:418
    #5 0x000055d009a4dd88 in PyThreadState_DeleteCurrent () at ../Python/pystate.c:457
    #6 0x000055d009a482a4 in t_bootstrap () at ../Modules/_threadmodule.c:1027
    #7 0x00007f430c6a2494 in start_thread (arg=0x7f4302ffd700) at pthread_create.c:333
    #8 0x00007f430b895acf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

    I attached the full stacktrace of the 18 threads.

    I am not sure whether we simply shouldn't try to take the GIL while finalizing, or whether I have somehow run into a thread that acquired the GIL without releasing it.

    The Python version is 3.5.3.

    I kept the problematic process running and can extract any information you may want from it.
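
    As a side illustration of the sock_dealloc()/PyErr_WarnFormat() frames above (this snippet is not from the actual script): destroying an unclosed socket makes CPython emit a ResourceWarning whose message embeds repr() of the socket, which is why the destructor re-enters Python code and needs the GIL.

    import socket
    import warnings

    # ResourceWarning is ignored by default; make it visible.
    warnings.simplefilter("always", ResourceWarning)

    s = socket.socket()
    del s  # sock_dealloc() issues "ResourceWarning: unclosed <socket.socket ...>", formatting repr(s)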

    @mocramis mocramis mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) stdlib Python modules in the Lib dir labels Mar 29, 2019
    @methane
    Member

    methane commented Mar 29, 2019

    Can you reproduce it on 3.7 or master branch?

Python 3.5 is in security-fix-only mode now.
    https://devguide.python.org/#status-of-python-branches

    @mocramis
    Mannequin Author

    mocramis mannequin commented Mar 29, 2019

    The bug happens about once every two weeks on a script that is fired more than 10K times a day.

Sadly, I can't update the whole production environment to try it on the latest version (and I was unable to trigger the bug myself).

    I was hoping we could find inconsistencies in the hanging process that could lead to clues about the origin of the error.

    @ericsnowcurrently
    Member

    As Inada-san indicated, the problem might be resolved already. So without the option of reproducing the problem, it will be hard to resolve this.

    Here's some information you could provide that might help narrow down the scope a bit:

    • what (stdlib/third-party/internal) modules are you using, particularly extension modules?
    • how many threads are you using at once?
    • how many atexit handlers do you have and what are they each doing?

    @ericsnowcurrently ericsnowcurrently added the type-bug An unexpected behavior, bug, or error label Mar 29, 2019
    @ericsnowcurrently
    Member

    Also:

    • are any of your threads daemon threads?

    @ericsnowcurrently
    Member

    • Are any signals (or signal handlers) involved?
    • Are you monkeypatching any builtins, builtin/stdlib modules (or any parts of those)?

    Keep in mind that a number of bugs have been fixed in later releases related to the various things I've asked about. For instance, see issue bpo-30703.

    @mocramis
    Mannequin Author

    mocramis mannequin commented Mar 29, 2019

    std modules: atexit, json, os, signal, socket, ssl, subprocess, sys, time, threading, xmlrpc.client
    The non-standard extensions: psutil, PIL (only for image encoding, though).

    The number of threads may vary, but I have around 20 threads at all times. Most of them are indeed daemon threads.

    I have one atexit handler: it executes a few subprocesses and performs a few xmlrpc calls, then exits. In this case, the handler got fully executed.

    There are signal handlers, but none of them got called.

    No monkeypatching is involved :)

    I only browsed the patches up to the 3.5 head (I guess I lacked the courage to go up to 3.7).

    I tried to write a reproduction case, but I failed to run into the error. Of course, I will try to improve it if I get a clue about a way to increase the likelihood of the problem.
    Sadly, all I have right now is a process I can attach to.

    @mocramis mocramis mannequin removed the type-bug An unexpected behavior, bug, or error label Mar 29, 2019
    @ericsnowcurrently
    Member

    At this point I think it's likely that the problem relates to how daemon threads are handled during runtime finalization.

    What normally happens in the main thread of the "python3" executable is this:

    1. Python runtime initializes
    2. main interpreter initializes
    3. the requested code is executed (in this case via PyRun_SimpleFileExFlags)
    4. the runtime is finalized
      a. wait for all non-daemon threads to finish
      b. call registered atexit funcs
      c. mark the runtime as finalizing
      d. ...

    From the stack trace you gave, the main thread is definitely past step 4c in the runtime finalization process.

    Note the following:

    • marking the runtime as finalizing is meant to cause all remaining daemon threads to exit the next time they take the GIL
    • further, runtime finalization assumes that all daemon threads will have finished soon after step 4c
    • not all C-API functions for acquiring the GIL actually cause the current thread to exit if the runtime is finalizing (see below)
    • step 4a applies only to the main interpreter; using subinterpreters complicates the situation a little (threads get finalized with each subinterpreter later on)
    • any thread created in an atexit func (4b) can cause problems

    Cause thread to exit if runtime is finalizing:

    • PyEval_RestoreThread()
    • PyEval_EvalFrameEx() (the eval loop)

    Do not cause thread to exit if runtime is finalizing:

    • PyEval_InitThreads()
    • PyEval_ReInitThreads()
    • PyEval_AcquireLock()
    • PyEval_AcquireThread()

    Regardless, from what you've reported it looks like the following is happening:

    m1. main thread starts
    m2. thread A (daemon) created
    m3. thread B (daemon) created (by main thread or thread A)
    m4. code in main thread raises SystemExit
    m5. the Python runtime begins finalization (incl. steps 4a-4c above)
    m6. the main thread starts cleaning up the main interpreter
    m7. it starts cleaning up the main interpreter's threads
    m8. it acquires the "head" lock
    m9. it starts cleaning up the frame tied to one of those threads
    m10. a socket object in that frame gets destroyed
    m11. a warning is issued (presumably about an unclosed socket)

    tB1. thread B (still running) acquires GIL
    tB2. it does not release the GIL

    m12. creating the warning causes a function to get called (socket.__repr__)
    m13. this causes the main thread to try to acquire the GIL
    m14. it blocks (waiting for thread B)

    tA1. thread A (still running) finishes and starts cleaning itself up
    tA2. it tries to acquire the "head" lock
    tA3. it blocks (waiting for the main thread)

    Notable:

    • at step m7 the code assumes that all the interpreter's threads have exited at that point
    • with the verbose flag to "python3", the runtime will print a warning if any thread is still running when its (finalizing) interpreter tries to clean it up
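
    To make the sequence above concrete, here is a minimal sketch of the ingredients involved (hypothetical code, not a confirmed reproducer and not taken from the reporter's script):

    import socket
    import threading
    import time

    def busy_daemon():
        # Plays the role of "thread B": keeps re-acquiring the GIL around short sleeps.
        while True:
            time.sleep(0.001)

    def socket_holder():
        # Its frame keeps an unclosed socket alive; clearing that frame at
        # finalization (steps m9/m10) triggers the ResourceWarning path.
        s = socket.socket()
        time.sleep(3600)

    threading.Thread(target=busy_daemon, daemon=True).start()
    threading.Thread(target=socket_holder, daemon=True).start()

    raise SystemExit(0)  # main thread begins runtime finalization while the daemons still run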

    @ericsnowcurrently
    Member

    Here are some things that would likely help:

    • in the main thread stop your daemon threads if you hit SystemExit
    • make sure daemon threads release/acquire the GIL frequently (so they notice finalization)
    • make sure daemon threads otherwise exit promptly (no long running C code)
    • stop using daemon threads
    • use a newer version of Python (may be fixed there)

    I'm going to take a look at master to see if it has a similar possible problem with daemon threads and runtime finalization. If there is, I'll likely open a separate issue (and reference it here).
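
    A minimal sketch of the first three recommendations (illustrative names, not from the reporter's script): give the daemon threads a stop flag, have them wait in short increments so they keep passing through the eval loop, and join them from an atexit handler before finalization proceeds.

    import atexit
    import threading

    stop = threading.Event()
    workers = []

    def worker():
        while not stop.is_set():
            # Do a bounded unit of work, then re-check the flag; short waits keep
            # the thread cycling through the GIL so it notices shutdown quickly.
            stop.wait(0.5)

    def shutdown():
        stop.set()
        for t in workers:
            t.join(timeout=5)

    atexit.register(shutdown)

    for _ in range(4):
        t = threading.Thread(target=worker, daemon=True)
        t.start()
        workers.append(t)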

    @mocramis
    Mannequin Author

    mocramis mannequin commented Mar 29, 2019

    Thank you very much for this detailed answer.

    Can these "causes of exit" terminate a thread without releasing the GIL?
    Because, as far as I can tell, none of the threads seems to own the GIL (I only checked _PyThreadState_Current, though there might be a better way to find the GIL owner).
    Therefore, the question is whether thread B is still alive after tB2 and, if so, whether we can find it (or whether we can understand why it exited without releasing the GIL).

    Is there any variable I could check to dig into this further?

    @mocramis
    Mannequin Author

    mocramis mannequin commented Mar 29, 2019

    Oh, also, I do not use any C extensions (apart from the ones I mentioned), so I do not acquire/release the GIL directly (a component of the standard library would do so).

    The daemon threads mainly spend their time listening on sockets and running short subprocesses (all in pure Python).
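
    (Illustrative sketch only, not the actual code.) A listening daemon of that shape can put a timeout on accept() so it regularly returns to pure-Python code, and therefore to the eval loop where finalization is noticed:

    import socket
    import threading

    stop = threading.Event()

    def listener(host="127.0.0.1", port=8765):
        srv = socket.socket()
        srv.bind((host, port))
        srv.listen(5)
        srv.settimeout(1.0)  # wake up periodically instead of blocking in accept() forever
        try:
            while not stop.is_set():
                try:
                    conn, _ = srv.accept()
                except socket.timeout:
                    continue
                conn.close()
        finally:
            srv.close()

    threading.Thread(target=listener, daemon=True).start()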

    @ericsnowcurrently
    Member

    I've opened bpo-36475 for the two C-API functions that do not cause daemon threads to exit.

    @ericsnowcurrently
    Member

    Looking at the stack traces for all your threads (super helpful, BTW), I saw 5 groups:

    • (1) main thread (blocked on GIL during runtime finalization)
      + Thread 1
    • (12) blocked on GIL in call to PyEval_RestoreThread() (socket, select, time.sleep(), file.read())
    • (3) waiting for a connection on a socket (with GIL not held, presumably)
      + Thread 5
      + Thread 12
      + Thread 14
    • (1) blocked on a threading.Lock
      + Thread 7
    • (1) blocked on "head" lock when cleaning up after execution
      + Thread 18

    So there's a third lock involved in this deadlock. It isn't actually clear to me (without further digging) which thread actually holds the GIL. I'd guess it's one of the last two (though it could be one of the 3 waiting on a socket). However, in each of those cases I would not expect the GIL to be held at that point.

    Regardless, the race likely involves the threading.Lock being held late in finalization. It could be that you are not releasing it somewhere where you should be.
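
    A hedged sketch of that last point (illustrative names, assuming a lock shared between workers): acquiring with a timeout and releasing in a finally block keeps any single thread from sitting on the lock across finalization.

    import threading

    shared_lock = threading.Lock()
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            # Bounded acquire: on timeout, loop back and re-check the stop flag
            # instead of blocking forever on a lock another thread never released.
            if not shared_lock.acquire(timeout=1.0):
                continue
            try:
                pass  # critical section
            finally:
                shared_lock.release()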

    @ericsnowcurrently
    Member

    I've also opened issues bpo-36476 and bpo-36477.

    @ericsnowcurrently
    Member

    @remy, aside from the recommendations I've made, I'm not sure what else we can do to help. Before we close the issue, I'd really like to ensure that one of those threads is holding the GIL still. It would definitely be a problem if a thread exited while still holding the GIL. The only way I can think of is to trace through the code. [1]

    [1] https://github.com/python/cpython/tree/v3.5.3

    @mocramis
    Mannequin Author

    mocramis mannequin commented Mar 30, 2019

    Thanks for the advice and the thorough analysis. I'll try to force thread shutdown from the cleanup callback, but I'd like to get to the root of this issue if possible.

    This is what the thread 7 python backtrace looks like:

    (gdb) py-bt
    Traceback (most recent call first):
      <built-in method acquire of _thread.lock object at remote 0x7f43088859b8>
      File "/usr/lib/python3.5/threading.py", line 293, in wait
        waiter.acquire()
      File "/usr/lib/python3.5/threading.py", line 549, in wait
        signaled = self._cond.wait(timeout)
      File "/usr/lib/python3.5/threading.py", line 849, in start
        self._started.wait()
      File "...", line 44, in __init__
        thr.start()

    So we are basically spawning a thread and waiting for it to start (which will likely never happen). That seems like "normal" behaviour to me (from a programming standpoint, that is), but this may be another cause of never-terminating threads (unless this is also caused by the head lock, and the thread is expected to spawn and release the lock even after finalizing).

    Also, I still have access to the process that I kept running. Is there any way for me to figure out which thread is currently holding the GIL? I just want to be sure I can't get this information myself before we close this ticket (at which point I will get rid of the culprit process).
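
    One way to avoid that particular trap (a hedged sketch with a hypothetical flag, not present in the original script): refuse to spawn new threads once shutdown has begun, since Thread.start() waits unboundedly on the internal _started event and a thread created during finalization may never reach its bootstrap code.

    import threading

    shutting_down = threading.Event()  # set from the cleanup/atexit path

    def spawn_worker(target):
        # Guard thread creation: during interpreter finalization a freshly created
        # thread may never actually start, so start() would block forever.
        if shutting_down.is_set():
            raise RuntimeError("shutting down, refusing to spawn new workers")
        thr = threading.Thread(target=target, daemon=True)
        thr.start()
        return thr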

    @mocramis
    Mannequin Author

    mocramis mannequin commented Apr 11, 2019

    @eric.snow Unless you confirm there is no way to figure out which thread is/was holding the GIL from a debugging session on the running process, I'll get rid of it at the end of the week.

    Should I close the ticket then, or will you do it?

    Thanks.

    @mocramis
    Mannequin Author

    mocramis mannequin commented Apr 25, 2019

    Closing the ticket for now.

    @mocramis mocramis mannequin closed this as completed Apr 25, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022