Issue 30165: faulthandler acquires lock from signal handler, can deadlock while crashing

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/74351

classification

Title:	faulthandler acquires lock from signal handler, can deadlock while crashing
Type:	behavior	Stage:	resolved
Components:	Extension Modules	Versions:	Python 2.7

process

Status:	closed	Resolution:	works for me
Dependencies:		Superseder:
Assigned To:	vstinner	Nosy List:	gregory.p.smith, vstinner
Priority:	normal	Keywords:

Created on 2017-04-26 06:23 by gregory.p.smith, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (10)
msg292307 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2017-04-26 06:23
https://github.com/python/cpython/blob/master/Modules/faulthandler.c#L240 faulthandler_dump_traceback() is called from the signal handler faulthandler_fatal_error() which needs to be async signal safe and only call async signal safe things... but faulthandler_dump_traceback calls PyGILState_GetThisThreadState() which ultimately calls thread.c's find_key() which acquires a lock: https://github.com/python/cpython/blob/master/Python/thread.c#L208 (and calls malloc!) This sometimes leads to a deadlock when the process is crashing, handled via faulthandler, instead of a crash with stacktrace information printed. The opposite of the happy debugging experience it is intended to provide. The _Py_DumpTracebackThreads() code that this calls also calls the same offending function. Despite having comments alluding to how it is called from within a signal handler. https://github.com/python/cpython/blob/master/Python/traceback.c#L754 This is a crashing exception. Rather than ever deadlock, we should do potentially dangerous things (we're already crashing!). Most of the time we'll be able to get and display useful information. On the occasions something bad happens as a result, at least the message printed to stderr before we started trying to do bad things will give a hint as to why the crash reporter crashed. I _believe_ we always want to use _PyThreadState_UncheckedGet() from the signal handler for the entire codepath. Effectively a simple read from TLS. No guarantees possible about the thread state list not in an intermediate state which will trip us up when dumping, but we could never guarantee that anyways. note: I saw https://bugs.python.org/issue23886 but it only seems quasi related though it is also about getting the thread state.
msg292314 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-04-26 08:00
> but faulthandler_dump_traceback calls PyGILState_GetThisThreadState() which ultimately calls thread.c's find_key() which acquires a lock: Hum, Python 3 now uses native TLS, not this fallback implementation using a lock. At least on Linux and Windows. I don't know if the fallback implementation is still used (which platforms supported by Python don't provide pthread API? PyGILState_GetThisThreadState -> PyThread_get_key_value: * pthread: pthread_getspecific() * Windows ("nt"): TlsGetValue()
msg292315 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-04-26 08:02
> I _believe_ we always want to use _PyThreadState_UncheckedGet() faulthandler_dump_traceback() uses PyGILState_GetThisThreadState(). There is a comment to explain why: /* SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL are synchronous signals and are thus delivered to the thread that caused the fault. Get the Python thread state of the current thread. PyThreadState_Get() doesn't give the state of the thread that caused the fault if the thread released the GIL, and so this function cannot be used. Read the thread local storage (TLS) instead: call PyGILState_GetThisThreadState(). / See also test_faulthandler tests, especially tests releasing the GIL: test_gil_released() * test_fatal_error_without_gil()
msg292316 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-04-26 08:06
> This sometimes leads to a deadlock when the process is crashing, handled via faulthandler, instead of a crash with stacktrace information printed. The opposite of the happy debugging experience it is intended to provide. Is it an hypothetical analysis by reading the C code, or did you get deadlock in practice? If yes, can you please explain how to reproduce the bug and try to get the gdb traceback of the thread which is blocked in faulthandler? In general, faulthandler is an opt-in option for developers, it's not perfect, it only promises best-effort ;-) But I'm a little bit surprised of your deadlock issue. test_faulthandler is running on a wide range of platforms and I didn't notice such deadlock on buildbots yet.
msg292391 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2017-04-27 03:12
This report is based on manual code inspection in CPython head after we encountered a deadlock using pytracemalloc on Python 2.7.12 where it _appeared_ to be the scenario I've described. I see now that I missed noticing the "#ifndef Py_HAVE_NATIVE_TLS" within thread.c which should imply a different PyThread_get_key_value() implementation that likely does not use our lock acquiring fallback find_key(). So my code analysis may not make sense... To give a taste of the large process setup we saw it in: * A CPython process that has extension modules which create their own threads as well as potentially Python having created its own threads. * We've called faulthandler.enable() and faulthandler.register(SIGTERM). * We send a SIGTERM to processes in this environment which are taking too long to complete; the goal was to get a stack trace of what the process was potentially stuck doing when our external timeout monitoring mechanism kicked in. our stuck process had received the SIGTERM and analyzing it revealed a deadlock between two threads which appeared to involve this faulthandler path. let me gather more info into one place. from there i should be able to come up with a way to reproduce it (or even better, not)
msg292393 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2017-04-27 03:50
So... it looks like this could and does happen only on the 2.7 backport of faulthandler. 2.7 is missing the following commit to use pthreads native TLS instead of our lock filled non-async signal safe implementation: https://github.com/python/cpython/commit/2fea9b961d6ea1041ace66037513fcad1fc2173d Which is easy to backport to 2.7 for anyone who maintains their own patched interpreter. For the faulthandler backport package on PyPI perspective I'm not sure what if anything you want to do about it or if you want to track that here or on the faulthandler github issue tracker. It is not safe to call PyGILState_GetThisThreadState() on POSIX systems in a 2.7 extension module without the above patch to the interpreter. No idea how many people run into this in practice, or if they do, have actually noticed what is going on. Because the process is already crashing when it deadlocks so if they do, they are likely to just wind up killing it anyways at some point. We happened to run into it in an integration test framework due to a crash introduced by an unrelated change that caused a single digit percentage of test runs to timeout instead of pass or fail as they had in the past.
msg292405 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-04-27 05:47
Oh sorry, you reported the bug on my 2.7 backport but I ask you to report the bug here, and now the bug is specific to 2.7 :-) Yes, faulthandler implementation is different on Python 2.7. If you look in depth, it is even very different especially "dump later" (2.7 uses alarm, 3.x uses a thread). The stdlib implementation is much more reliable because of different Python enhancements likr native TLS.
msg292414 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-04-27 09:59
Since CPython 3 is not affected, I close this issue and move back the discussion on the bug tracker of the CPython 2.7 backport: https://github.com/haypo/faulthandler/issues/29
msg293351 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2017-05-09 21:33
Follow up note: Note that even with POSIX TLS in use, it isn't entirely safe. pthread_getspecific() as used by the posix thread code is not required to be async signal safe by the POSIX standard. :( The Linux glibc implementation thankfully does not use locks. But it does suffer from a consistency issue in its current implementation when used asynchronously. A trivial patch to fix glibc's pthread_getspecific() with a test exists - https://sourceware.org/ml/libc-alpha/2014-12/msg00583.html - but upstream glibc maintainers are obstinate as usual. Regardless, I believe the worst that happens in that relatively uncommon race condition situation is a faulthandler crash reporting odd information or crashes while crashing. not a deadlock. I have not investigated other pthread_getspecific implementations. The systems I care most about today thankfully run a modified glibc.
msg293354 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-05-09 22:07
> Note that even with POSIX TLS in use, it isn't entirely safe. I wrote faulthandler to collect debug data just before dying, when something already gone very bad, like a deadlock or a memory corruption. I didn't design faulthandler for correctness. If you call faulthandler.dump_traceback() from regular Python code, you are safe. For example, if you register your own Python signal handler using signal.signal(), it's ok. The grey area is when you use C signal handlers using the C API of faulthandler, like faulthandler.register().

History
Date	User	Action	Args
2022-04-11 14:58:45	admin	set	github: 74351
2017-05-09 22:07:40	vstinner	set	messages: + msg293354
2017-05-09 21:33:36	gregory.p.smith	set	messages: + msg293351
2017-04-27 09:59:28	vstinner	set	status: open -> closed resolution: works for me messages: + msg292414 stage: needs patch -> resolved
2017-04-27 05:47:02	vstinner	set	messages: + msg292405
2017-04-27 03:50:43	gregory.p.smith	set	assignee: gregory.p.smith -> vstinner messages: + msg292393 versions: + Python 2.7, - Python 3.6, Python 3.7
2017-04-27 03:12:12	gregory.p.smith	set	assignee: vstinner -> gregory.p.smith messages: + msg292391
2017-04-26 08:06:51	vstinner	set	messages: + msg292316
2017-04-26 08:02:55	vstinner	set	messages: + msg292315
2017-04-26 08:00:14	vstinner	set	messages: + msg292314
2017-04-26 06:23:51	gregory.p.smith	create