test_faulthandler.test_register_chain fails on 64bit ppc/arm with kernel >= 3.10 #65330
Comments
test_faulthandler.test_register_chain fails on some 64bit architectures (arm8, ppc64) with kernel >= 3.10:

    ======================================================================
    Traceback (most recent call last):
      File "/builddir/build/BUILD/Python-3.3.2/Lib/test/test_faulthandler.py", line 588, in test_register_chain
        self.check_register(chain=True)
      File "/builddir/build/BUILD/Python-3.3.2/Lib/test/test_faulthandler.py", line 572, in check_register
        self.assertEqual(exitcode, 0)
    AssertionError: -11 != 0

I created a minimal reproducer of this segfault (attached); it reproduces the issue on 3.3, 3.4 and dev. When run with GC assertions turned on, Python fails with:

    python: /builddir/build/BUILD/Python-3.3.2/Modules/gcmodule.c:332: update_refs: Assertion `gc->gc.gc_refs == (-3)' failed.

We experienced this issue when building Python 3.3 on Fedora's arm8 builder [1]; it seems that openSUSE has experienced the same failure on ppc64 [2], and Ubuntu has a similar issue in their 64-bit ARM builds [3]. This may be caused by a change in the kernel, since it is only reproducible on kernel >= 3.10. A nice explanation of what goes on and how the problem can be narrowed down is in the openSUSE bug report [4]; this is basically where I got stuck with this problem, too.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1045193 |
"test_faulthandler.test_register_chain fails on some 64bit architectures (arm8, ppc64) with kernel >= 3.10" I am a little bit surprised that the bug depends on the kernel version. Does test_register_chain_segfault_reproducer.py also crash with chain=False? Can you check if HAVE_SIGACTION is defined in pyconfig.h? Or in Python: import sysconfig; print(sysconfig.get_config_var('HAVE_SIGACTION')). Wit chain=True, faulthandler_register() registers its signal handler with SA_NODEFER flag: /* if the signal is received while the kernel is executing a system
call, try to restart the system call instead of interrupting it and
return EINTR. */
action.sa_flags = SA_RESTART;
if (chain) {
/* do not prevent the signal from being received from within its
own signal handler */
action.sa_flags = SA_NODEFER;
} With chain=True, the faulthandler_user() function calls the previous signal handler with: #ifdef HAVE_SIGACTION
if (user->chain) {
(void)sigaction(signum, &user->previous, NULL);
errno = save_errno;
save_errno = errno;
(void)faulthandler_register(signum, user->chain, NULL);
errno = save_errno;
}
#else
if (user->chain) {
errno = save_errno;
/* call the previous signal handler */
user->previous(signum);
}
#endif You can try to add "#undef HAVE_SIGACTION" in faulthandler.c (after #include "Python.h") to see if the bug can be reproduced without sigaction. The code using signal() is very different, especially when chaining signal handlers. |
Would it be possible to list the kernel version (the full version, including the minor version) on which the test works or crashes? You are talking about "3.10" without the minor version. It may be a regression in the kernel. Example of change in Linux kernel 3.10.17:
--- Extract of changes of Linux 3.10.6:
I would like to make sure that the bug is not a kernel bug. |
I'm also surprised that this depends on the kernel version, however that's what I found out (and the openSUSE folks seem to have reproduced this only on kernel >= 3.10, too).

- Full kernel version (uname -r output): 3.13.0-0.rc7.28.sa2.aarch64
- The reproducer *doesn't* crash with chain=False.
- HAVE_SIGACTION:
  >>> import sysconfig; print(sysconfig.get_config_var('HAVE_SIGACTION'))
  1
- I'll do a rebuild with "#undef HAVE_SIGACTION" and post my results here as soon as it's finished. |
Ok, so with "#undef HAVE_SIGACTION" both the reproducer and the original test (as well as all tests in test_faulthandler) pass fine. |
We're running into this building Python 3.4.3 on EL6 ppc64. The host OS kernel is 4.7.2-201.fc24.ppc64, but the EL6 chroot kernel-headers are 2.6.32-642.4.2.el6. Any progress here? |
Sorry, but if I'm unable to reproduce the issue, I cannot make progress on analyzing the bug. I would need remote access (SSH) to a computer where the bug can be reproduced. I would also like to know if the issue can be reproduced in C. faulthandler depends a lot on how signals and threads are implemented. I'm not completely surprised to see bugs on some specific platforms. |
Hi - we ran into what looks like exactly this issue on an x86_64 sporadically, and tracked down the root cause.

When faulthandler.c uses sigaltstack(2), the stack size is set up with a buffer of size SIGSTKSZ. That is, sadly, only 8k. When a signal is raised, before the handler is called, the kernel stores the machine state on the user's (possibly "alternate") stack. The size of that state is very much variable, depending on the CPU. When we chain the signal handler in the sigaction variant of the code in faulthandler, we raise the signal with the existing handler still on the stack, and save a second copy of the CPU state. Finally, when any part of that signal handler has to invoke a function that requires the dynamic linker's intervention to resolve, it will call some form of _dl_runtime_resolve* - likely _dl_runtime_resolve_xsave or _dl_runtime_resolve_xsavec. These functions will also have to save machine state.

So, how big is the machine state? Well, it depends on the CPU.

On another machine, reporting as "Intel(R) Xeon(R) Gold 5118 CPU", I have:

This means that the required stack space to hold 3 sets of CPU state is over 7.5k. And, for the signal handlers, it's actually worse: more like 3.25k per frame. A chained signal handler that needs to invoke dynamic linking will therefore consume more than the default stack space allocated in faulthandler.c, just in machine-state saves alone. So, the failing test is failing because it's scribbling on random memory before the allocated stack space. My guess is that the previous architectures this manifested on have larger stack demands for signal handling than x86_64, but clearly newer x86_64 processors are starting to get tickled by this.

The fix is pretty simple - just allocate more stack space. The attached patch uses pthread_attr_getstacksize to find the system's default stack size, and then uses that as the default, and also defines an absolute minimum stack size of 1M. This fixes the issue on our machine with the big xsave state size. (I'm sure I'm getting the feature test macros wrong for testing for pthreads availability.)

Also, I think in the case of a threaded environment, using the altstack might not be the best choice - I think multiple threads handling signals that run on that stack will wind up stomping on the same memory - is there a strong reason to maintain this altstack behaviour? |
    + pthread_attr_t attrs;

PyThread_start_new_thread() of thread_pthread.h already contains logic to get a "good" stack size. I would prefer to reuse this code. See also _pythread_nt_set_stacksize() of thread_nt.h. Maybe we need a private function to get the default stack size? See also PyThread_get_stacksize() and _thread.stack_size(). |
Ok - let me submit a pull request with your suggestions On Tue, 28 May 2019 at 13:08, STINNER Victor <report@bugs.python.org> wrote:
|
One additional note on this. Thanks to a colleague at USC who pointed out that this bug does not seem to get exercised if one does not configure with shared libraries. I confirmed this using the distributed Python-3.7.4.tgz file. Applying the patch file attached to this report and rebuilding makes the failure go away. If someone was trying to reproduce and did not configure shared libraries, then that would have failed to reproduce. |
Perhaps I should add that we are able to reproduce this behavior on this hardware: Dell R640, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, but not on other chips, such as a recent i5, Haswell family and older. That may also make a difference if someone tries to reproduce this. |
I dislike PR 13649 because it touches the thread module only to fix a faulthandler unit test. The relationship between thread stack size and faulthandler is not obvious to me. Currently, faulthandler uses SIGSTKSZ, not the thread stack size. faulthandler's usage of the stack should be quite low: it should need less than 1 MiB, for example.

"When faulthandler.c uses sigaltstack(2), the stack size is set up with a buffer of size SIGSTKSZ. That is, sadly, only 8k."

"A chained signal handler that needs to invoke dynamic linking will therefore consume more than the default stack space allocated in faulthandler.c, just in machine-state saves alone. So, the failing test is failing because it's scribbling on random memory before the allocated stack space."

Aha, that's interesting: SIGSTKSZ should be enough for 1 signal handler, but test_register_chain calls 2 signal handlers using the same stack.

Can you please try the following patch?

    diff --git a/Modules/faulthandler.c b/Modules/faulthandler.c
    index 2331051f79..e7d13f2b2d 100644
    --- a/Modules/faulthandler.c
    +++ b/Modules/faulthandler.c
    @@ -1325,7 +1325,7 @@ _PyFaulthandler_Init(int enable)
          * be able to allocate memory on the stack, even on a stack overflow. If it
          * fails, ignore the error. */
         stack.ss_flags = 0;
    -    stack.ss_size = SIGSTKSZ;
    +    stack.ss_size = SIGSTKSZ * 2;
         stack.ss_sp = PyMem_Malloc(stack.ss_size);
         if (stack.ss_sp != NULL) {
             err = sigaltstack(&stack, &old_stack); |
Attached altstack.c mimics the faulthandler unit test test_register_chain(), except that faulthandler_user() uses almost no stack memory. This test should check if SIGSTKSZ is big enough to call a second signal handler from a first signal handler.

Example on my Fedora 30 with "Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz" model name in /proc/cpuinfo and Linux kernel 5.1.18-300.fc30.x86_64:

    $ gcc altstack.c -o altstack && ./altstack
    SIGSTKSZ = 8192
    our signal handler
    User defined signal 1

Note: the following gdb command didn't work for me:

How can I get xsave_state_size for my glibc/kernel/CPU? |
Ah, the cpuid command tells me "bytes required by XSAVE/XRSTOR area = 1088":

CPU 0:

/proc/cpuinfo also says: flags : ... xsave avx ... xsaveopt xsavec xgetbv1 xsaves ...

I recall that the Linux kernel was modified to only save AVX registers if a program uses AVX, so the process state size depends on the usage of AVX. But I cannot find the related LWN article. |
Ah, I found the recent change about XSAVE: it is a fix for CVE-2018-3665 vulnerability. "The software mitigation for this is to switch to an "eager" / immediate FPU state save and restore, in both kernels and hypervisors." "On Intel and AMD x86 processors, operating systems and hypervisors often use what is referred to as a deferred saving and restoring method of the x86 FPU state, as part of performance optimization. This is done in a "lazy" on-demand fashion." "It was found that due to the "lazy" approach, the x86 FPU states or FPU / XMM / AVX512 register content, could leak across process, or even VM boundaries, giving attackers possibilities to read private data from other processes, when using speculative execution side channel gadgets." https://www.suse.com/support/kb/doc/?id=7023076 See also: https://en.wikipedia.org/wiki/Lazy_FP_state_restore |
Hi Victor, thanks for the comments. Responses inline below. On Wed, 14 Aug 2019 at 12:25, STINNER Victor <report@bugs.python.org> wrote:
My original patch (posted in the comments above) was purely in faulthandler
PyThread_start_new_thread() of thread_pthread.h already contains logic to
I have no problem reformulating the code to avoid touching the threads
The point of contention here is really the choice of stack size. SIGSTKSZ is
It's more complex than that - in dynamically linked applications when you
|
I just tested the proposed change:

"Aha, that's interesting: SIGSTKSZ should be enough for 1 signal handler, but test_register_chain calls 2 signal handlers using the same stack. Can you please try the following patch?"

and the segfault no longer occurs at the faulthandler test.

Compiling and running the altstack.c using the system-installed GCC 4.8.5 on CentOS 7.6.1810, kernel version 3.10.0-957.10.1.el7.x86_64, running on a Dell R640 with Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, results in this output.

It does seem to me that relying on a statically set stack size when using dynamically loaded libraries is inviting similar problems in the future, for the reasons that Peter enumerated: there is no telling now what the requirements will be for some new chip family, and one cannot predict now what additional (if any) memory requirements might be needed by the linker in the future. But I think getting _some_ patch accepted and pushed to the main Python releases should have some priority, as the current state does seem undesirable. |
The patch I originally proposed here ( On Wed, 14 Aug 2019 at 14:26, Bennet Fauber <report@bugs.python.org> wrote:
|
"On a 64-bit system, consuming 8M of address space is a drop in the ocean." Let me disagree here. Python always allocates faulthandler stack, even if faulthandler is not used. Even when faulthandler is used, I would prefer to not waste memory if 8 KiB is good enough. By the way, I just created bpo-37851 to allocate this stack at the first faulthandler usage, instead of always allocating it, even when faulthandler is not used. I wrote PR 15276 to use a stack of SIGSTKSZ*2 bytes. According to Can someone please double check that PR 15276 fix test_faulthandler on a platform where the test crash without this change? |
On Wed, 14 Aug 2019 at 14:46, STINNER Victor <report@bugs.python.org> wrote:

I can understand the aversion to the waste when it's never used - I can

I do think SIGSTKSZ*2=16k is far too small considering the fault handler

I can confirm that on the specific hardware I could reproduce this, that
|
I updated the versions affected to include 3.6 and 3.7, both of which are affected. I am a bit concerned that the conversation might get fragmented, so I will put the full URL to the newly created PR at GitHub here,
just to make it easier for anyone else who finds this to see where things have gone. I am now also uncertain what the current status of
is now. It seems that we are now waiting on review from someone from the core developers? |
"I do think SIGSTKSZ*2=16k is far too small considering the fault handler could be running arbitrary python code," We are talking abou the faulthandler_user() function of Modules/faulthandler.c. It is implemented in pure C, it doesn't allocate memory on the heap, it uses a very small set of functions (write(), sigaction(), raise()) and it tries to minimize its usage of the stack memory. It is very different than the traceback module which is implemented in pure Python. faulthandler is really designed to debug segmentation fault, stack overflow, Python hang (like a deadlock), etc. |
"I can understand the aversion to the waste when its never used - I can address 37851 if you like - it seems pretty simple to fix. The pedant in me must point out that it's 8M of address space, not memory. The cost on 64-bit (well, with a 47-bit user address space) is vanishingly small, ..." Well, many users pay attention to the RSS value and don't care if the memory is physically allocated or not. Moreover, I'm not sure that we can fix bpo-37851 in Python 3.7. In general, the policy is to minimize changes in stable Python versions. I'm not sure for Python 3.8 neither. I would suggest to only modify Python 3.9, simply to reduce the risk of regressions. |
"I can confirm that on the specific hardware I could reproduce this, that Thanks for testing. I merged my PR. About PR 13649, I'm not sure that _PyThread_preferred_stacksize() is still relevant, since my change fixed test_faulthandler test_register_chain(). I chose my change since it's less invasive: it only impacts faulthandler, and it minimalizes the memory usage (especially when faulthandler is not used). Python/thread_pthread.h refactor changes of PR 13649 are interested. Would you like to extract them into a new PR which doesn't add _PyThread_preferred_stacksize() but just add new PLATFORM_xxx macros? -- Maybe test_faulthandler will fail tomorrow on a new platform, but I prefer to open a discussion once such case happens, rather than guessing how faulthandler can crash on an hypothetical platforms. I'm sure that libc developers are well aware of the FPU state size and update SIGSTKSZ accordingly. glibc code computing xsave_state_size: -- If tomorrow, it becomes too hard to choose a good default value for faulthandler stack size, another workaround would be to make it configurable, as Python lets developers choose the thread stack size: _thread.stack_size(size). |
The bug has been fixed in the 3.7, 3.8 and master (future 3.9) branches. I close the issue. Thanks to everyone who was involved in reporting the bug and helping to find the root issue! The relationship between faulthandler, the Linux kernel version, the CPU model, and the FPU state size wasn't obvious at first look ;-) If someone wants to clean up or rework how Python handles thread stack size, please open a separate issue. I prefer to restrict this issue to test_faulthandler.test_register_chain() (which is now fixed). |
On Wed, 14 Aug 2019 at 22:34, STINNER Victor <report@bugs.python.org> wrote:
That's totally reasonable, sure. |
On Wed, 14 Aug 2019 at 22:32, STINNER Victor <report@bugs.python.org> wrote:
I was more concerned about what was happening in the chained handler, which
Right, totally - I had jumped to the conclusion that it would end up
|
On Wed, 14 Aug 2019 at 23:13, STINNER Victor <report@bugs.python.org> wrote:
Sure - there's no reason for it to exist if you don't want to use it to fix
Yes, certainly. Maybe test_faulthandler will fail tomorrow on a new platform, but I prefer
Well, one argument for the dynamic approach is that existing python

https://github.com/torvalds/linux/blob/master/arch/ia64/include/uapi/asm/signal.h#L83
The current value comes from the kernel sources, and has not changed since
|
Python doesn't support the IA64 architecture. Note: I'm not sure that this architecture is going to survive in the long term... More and more systems have stopped supporting it.
PR 13649 gets the default thread stack size; it doesn't get SIGSTKSZ. I expect a thread stack size to be way larger than SIGSTKSZ. For example, on my Fedora 30 (Linux kernel 5.1.19-300.fc30.x86_64, glibc-2.29-15.fc30.x86_64) on x86-64, SIGSTKSZ is just 8 KiB (8192 bytes), whereas pthread_attr_getstacksize() returns 8 MiB (8388608 bytes). As I already wrote, using 8 MiB instead of 8 KiB looks like a waste of memory: faulthandler is only useful for a stack overflow, which is a very unlikely bug. |