Stuck during interpreter exit, attempting to take the GIL #80650
I have a script (sadly, I can't publish it) spawning multiple threads that, on rare occasions, does not manage to exit properly and gets stuck forever. More precisely, this seems to happen during interpreter exit: the atexit callbacks are called successfully, and we then have multiple threads that are all attempting to take the GIL while no thread seems to own it (_PyThreadState_Current is always '{_value = 0}' while gil_locked is '{_value = 1}'). The main thread stack looks like this:

#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225

We can see it is trying to take the GIL while finalizing (as it is emitting a warning when destroying a socket). However, this prevents any other thread from being deleted, since the first thread holds the head_lock. For instance, we have thread 18 trying to take the head lock:

Thread 18 (Thread 0x7f4302ffd700 (LWP 21117)):

I attached the full stack trace of the 18 threads. I am not sure whether we shouldn't try to take the GIL while finalizing, or whether I somehow ran into a thread acquiring the GIL without releasing it. The Python version is 3.5.3. I kept the problematic process running and can extract any information you may want from it. |
Can you reproduce it on 3.7 or the master branch? Python 3.5 is in security-fix-only mode now. |
The bug happens about once every two weeks in a script that is fired more than 10K times a day. Sadly, I can't update the whole production environment to try it on the latest version (and I was unable to trigger the bug myself). I was hoping we could find inconsistencies in the hanging process that could provide clues about the origin of the error. |
As Inada-san indicated, the problem might be resolved already. So without the option of reproducing the problem, it will be hard to resolve this. Here's some information you could provide that might help narrow down the scope a bit:
|
Also:
|
Keep in mind that a number of bugs have been fixed in later releases related to the various things I've asked about. For instance, see issue bpo-30703. |
std modules: atexit, json, os, signal, socket, ssl, subprocess, sys, time, threading, xmlrpc.client. The number of threads may vary, but I have around 20 threads at all times. Most of them are indeed daemon threads. I have one atexit handler: it executes a few subprocesses and performs a few xmlrpc calls, then exits. In this case, the handler got fully executed. There are signal handlers, but none of them got called. No monkeypatching is involved :) I only browsed the patches up to the 3.5 head (I guess I lacked the courage to go up to 3.7). I tried to write a reproduction case, but I failed to run into the error. Of course, I will try to improve it if I get a clue about a way to increase the likelihood of triggering the problem. |
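For reference, here is a rough, hypothetical sketch of the kind of setup described above (daemon threads blocking on sockets, plus an atexit handler). This is not the reporter's script and is not known to reproduce the hang; the names and structure are assumptions for illustration only.

import atexit
import socket
import threading
import time

def worker():
    # Daemon thread that keeps a socket open and mostly blocks,
    # similar to the listeners described above.
    s = socket.socket()
    while True:
        time.sleep(0.1)

@atexit.register
def cleanup():
    # Stand-in for the reporter's atexit handler (subprocesses + xmlrpc calls).
    print("cleanup ran")

for _ in range(20):
    threading.Thread(target=worker, daemon=True).start()

time.sleep(1)
# The main thread exits here while the daemon threads are still running,
# which is exactly the situation runtime finalization has to deal with.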
At this point I think it's likely that the problem relates to how daemon threads are handled during runtime finalization. What normally happens in the main thread of the "python3" executable is this:
From the stack trace you gave, the main thread is definitely past step 4c in the runtime finalization process. Note the following:
Cause thread to exit if runtime is finalizing:
Do not cause thread to exit if runtime is finalizing:
Regardless, from what you've reported it looks like the following is happening:

m1. main thread starts
tB1. thread B (still running) acquires GIL
m12. creating the warning causes a function to get called (socket.__repr__)
tA1. thread A (still running) finishes and starts cleaning itself up

Notable:
|
Here are some things that would likely help:
I'm going to take a look at master to see if it has a similar possible problem with daemon threads and runtime finalization. If there is, then I'll likely open a separate issue (and reference it here). |
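As an illustration of one common mitigation for this class of hang (a sketch of general practice, not necessarily what was recommended in the list above, which did not survive migration): have each worker loop watch a shutdown Event and join the threads from an atexit handler, so that no daemon thread is still running when finalization starts. All names here are illustrative.

import atexit
import threading
import time

shutdown = threading.Event()
workers = []

def worker():
    # Loop until asked to stop instead of blocking forever.
    while not shutdown.is_set():
        time.sleep(0.1)

@atexit.register
def stop_workers():
    # Signal the threads and wait for them, so finalization never has to
    # terminate a thread that might still be holding a lock.
    shutdown.set()
    for t in workers:
        t.join(timeout=5)

for _ in range(4):
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    workers.append(t)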
Thank you very much for this detailed answer. Can these "causes of exit" terminate the thread without releasing the GIL? Is there any variable I could check to dig into this further? |
Oh, also, I do not use any C extension (apart from the one I mentioned), so I do not acquire/release the GIL directly (a component of the standard library would do so). The daemon threads mainly spend their time listening on sockets and running short subprocesses (all in pure Python). |
I've opened bpo-36475 for the two C-API functions that do not cause daemon threads to exit. |
Looking at the stack traces for all your threads (super helpful, BTW), I saw 4 groups:
So there's a third lock involved in this deadlock. It isn't actually clear to me (without further digging) which thread actually holds the GIL. I'd guess it's one of the last two (though it could be one of the 3 waiting on a socket). However, in each of those cases I would not expect the GIL to be held at that point. Regardless, the race likely involves the threading.Lock being held late in finalization. It could be that you are not releasing it somewhere where you should be. |
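To make the suspected mechanism concrete: a daemon thread that the runtime terminates during finalization never releases any threading.Lock it is holding, so another thread waiting on that lock can block forever. Below is a hypothetical, timing-dependent sketch of that pattern (it will not reliably hang, and it is not the reporter's code); it only illustrates how a lock can end up held late in finalization.

import threading
import time

shared_lock = threading.Lock()

def daemon_worker():
    while True:
        with shared_lock:
            # If finalization terminates this thread while it is inside this
            # block (e.g. when it next tries to take the GIL after sleeping),
            # shared_lock is left locked forever.
            time.sleep(0.05)

def other_worker():
    while True:
        # Blocks indefinitely if shared_lock was left held by a terminated thread.
        with shared_lock:
            time.sleep(0.05)

threading.Thread(target=daemon_worker, daemon=True).start()
threading.Thread(target=other_worker, daemon=True).start()
time.sleep(0.5)
# The main thread exits here; finalization begins while both daemon threads
# are still running.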
@remy, aside from the recommendations I've made, I'm not sure what else we can do to help. Before we close the issue, I'd really like to ensure that one of those threads is holding the GIL still. It would definitely be a problem if a thread exited while still holding the GIL. The only way I can think of is to trace through the code. [1] |
Thanks for the advice and the thorough analysis. I'll try to force the threads to shut down from the cleanup callback, but I'd like to get to the root of this issue if possible. This is what the thread 7 Python backtrace looks like:

(gdb) py-bt
Traceback (most recent call first):
<built-in method acquire of _thread.lock object at remote 0x7f43088859b8>
File "/usr/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()
File "/usr/lib/python3.5/threading.py", line 549, in wait
signaled = self._cond.wait(timeout)
File "/usr/lib/python3.5/threading.py", line 849, in start
self._started.wait()
File "...", line 44, in __init__
thr.start()

So we are basically spawning a thread and waiting for it to start (which will likely never happen). That seems like "normal" behaviour to me (from a programming standpoint, that is), but this may be another cause of never-terminating threads (unless this is also caused by the head_lock and the thread is expected to spawn/release the lock even after finalizing). Also, I have access to the process that I kept running. Is there any way for me to figure out which thread is currently holding the GIL? I just want to be sure I can't get this information myself before we close this ticket (at which point I will get rid of the culprit process). |
@eric.snow Unless you confirm there is no way to figure out which thread is/was holding the GIL from a debugging session on the running process, I'll get rid of it at the end of the week. Should I close the ticket then, or will you do it? Thanks. |
Closing the ticket for now. |