classification
Title: test_faulthandler: test_register_chain() crash with SIGSEGV (signal 11) on Skylake chipset
Type: compile error Stage: resolved
Components: Build, Tests Versions: Python 3.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: test_faulthandler.test_register_chain fails on 64bit ppc/arm with kernel >= 3.10
View: 21131
Assigned To: Nosy List: Jeffrey.Kintscher, jshelly, justbennet, vstinner
Priority: normal Keywords:

Created on 2019-07-11 15:12 by jshelly, last changed 2019-08-19 10:35 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
sigusr1.py vstinner, 2019-07-14 09:12
script.py vstinner, 2019-07-15 15:46
Messages (12)
msg347688 - (view) Author: Michelle Johnson (jshelly) Date: 2019-07-11 15:12
CentOS 7.6, running Linux gl-build.arc-ts.umich.edu 3.10.0-957.10.1.el7.x86_64 #1 SMP Mon Mar 18 15:06:45 UTC 2019 x86_64 on Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz.

Using gcc 8.2.0 compiled from source to build python 3.7.4 from source, make test fails with the following errors:

```
0:00:22 load avg: 3.12 [402/416/1] test_faulthandler failed
test test_faulthandler failed -- Traceback (most recent call last):
  File "/tmp/shellysw/build/Python-3.7.4/Lib/test/test_faulthandler.py", line 724, in test_register_chain
    self.check_register(chain=True)
  File "/tmp/shellysw/build/Python-3.7.4/Lib/test/test_faulthandler.py", line 702, in check_register
    self.assertEqual(exitcode, 0)
AssertionError: -11 != 0
```

I found a similar error reported in RedHat bugzilla (1687171) and tried running the script provided by Victor Stinner, and this closed issue on GitHub seems like it might be related?
#32 because of the Skylake chipset.

Script:
```
import faulthandler, signal, os

def handler(signum, frame):
    handler.called = True
handler.called = False

faulthandler.enable()

signum = signal.SIGUSR1
signal.signal(signum, handler)
faulthandler.register(signum, chain=True)
os.kill(os.getpid(), signum)
print("called", handler.called)
```

It produced a segmentation fault. I also ran that script using the python 3.6.6 binary that came installed with centos and it also produced a segmentation fault as well.
msg347899 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-14 09:12
I don't know why you get a crash. Can you try attached sigusr1.py script? Expected output:

$ python3 sigusr1.py 
called True
msg347900 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-14 09:13
> I found a similar error reported in RedHat bugzilla (1687171) and tried running the script provided by Victor Stinner

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1687171

> and this closed issue on GitHub seems like it might be related?
#32 because of the Skylake chipset.

I guess that you mean: https://github.com/vstinner/faulthandler/issues/36
msg347962 - (view) Author: Michelle Johnson (jshelly) Date: 2019-07-15 14:30
Hello Victor.  Thank you for looking into this.  I ran the attached script, sigusr1.py as you requested.  The output was as expected:

[jshelly@gl-build bin]$ ./python3.7 ~/sigusr1.py
called True

Not sure if that helps identify the problem.  Regarding the related bugs I mentioned, I think my first comments were mixed up.  There were 2 possibly related bugs.

1.  Redhat bugzilla 1687171, which may have been related to the skylake chipset: https://bugzilla.redhat.com/show_bug.cgi?id=1687171 

2.  Github # 32:  https://github.com/vstinner/faulthandler/issues/32
That is an older one (2017), and maybe not related.

Bug # 36 at Github is one that I opened initially, and that one is a duplicate of this bug report. It is closed now as I was asked to open the bug here instead.

Thanks for any further insight or suggestions you can provide.
msg347975 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-15 15:46
More tests.

(*) What is the output of signal.getsignal(signal.SIGUSR1) before importing faulthandler (before calling any faulthandler method)? Example:

$ python3
Python 3.7.3 (default, May 11 2019, 00:38:04) 
>>> import signal; signal.getsignal(signal.SIGUSR1)
<Handlers.SIG_DFL: 0>

(*) Can you try to run attached script in gdb? If you get a crash, type "where" to get the traceback. Example:

$ gdb -args ./python script.py 
(gdb) handle SIGUSR1 nostop
Signal        Stop	Print	Pass to program	Description
SIGUSR1       No	Yes	Yes		User defined signal 1

(gdb) run
Starting program: /home/vstinner/python/master/python script.py

Program received signal SIGUSR1, User defined signal 1.
Current thread 0x00007ffff7c27740 (most recent call first):
  File "/home/vstinner/python/master/script.py", line 13 in <module>

Program received signal SIGUSR1, User defined signal 1.
called True
[Inferior 1 (process 24361) exited normally]
msg347977 - (view) Author: Michelle Johnson (jshelly) Date: 2019-07-15 16:05
1.  Here is the output of the signal.getsignal test:

[jshelly@gl-build bin]$ ./python3
Python 3.7.4 (default, Jul 15 2019, 10:08:37)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import signal
>>> signal.getsignal(signal.SIGUSR1)
<Handlers.SIG_DFL: 0>

2.  Here is the output after running the provided script in gdb:

[jshelly@gl-build bin]$ gdb -args ./python3 ~/script.py
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-114.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /sw/arcts/centos7/python/3.7.4/bin/python3.7...done.
(gdb) handle SIGUSR1 nostop
Signal        Stop	Print	Pass to program	Description
SIGUSR1       No	Yes	Yes		User defined signal 1
(gdb) run
Starting program: /sw/arcts/centos7/python/3.7.4/bin/./python3 /home/jshelly/script.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGUSR1, User defined signal 1.
Current thread 0x00002aaaaaaf59c0 (most recent call first):
  File "/home/jshelly/script.py", line 12 in <module>

Program received signal SIGUSR1, User defined signal 1.
called True

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaae88924 in visit_decref (op=0x2aaaaacd3468, data=0x0) at Modules/gcmodule.c:271
271	    if (PyObject_IS_GC(op)) {
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.3.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64
(gdb) where
#0  0x00002aaaaae88924 in visit_decref (op=0x2aaaaacd3468, data=0x0) at Modules/gcmodule.c:271
#1  0x00002aaaaad92d8d in dict_traverse (op=<optimized out>, visit=0x2aaaaae88920 <visit_decref>, arg=0x0) at Objects/dictobject.c:2987
#2  0x00002aaaaae87b3d in subtract_refs (containers=<optimized out>) at Modules/gcmodule.c:296
#3  collect (generation=generation@entry=2, n_collected=n_collected@entry=0x7fffffffce20,
    n_uncollectable=n_uncollectable@entry=0x7fffffffce28, nofail=nofail@entry=0) at Modules/gcmodule.c:853
#4  0x00002aaaaae891c9 in collect_with_callback (generation=2) at Modules/gcmodule.c:1028
#5  PyGC_Collect () at Modules/gcmodule.c:1573
#6  0x00002aaaaae5e62a in Py_FinalizeEx () at Python/pylifecycle.c:1185
#7  0x00002aaaaae86d58 in pymain_main (pymain=pymain@entry=0x7fffffffcf60) at Modules/main.c:3030
#8  0x00002aaaaae874e9 in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:3063
#9  0x00002aaaabd843d5 in __libc_start_main () from /lib64/libc.so.6
#10 0x000000000040067e in _start ()
(gdb)
msg347978 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-15 16:29
Oh. So Py_FinalizeEx() does get a SIGSEGV when triggering a garbage collection: the test doesn't crash, it's Python which crash at exit.

> Program received signal SIGSEGV, Segmentation fault.
> 0x00002aaaaae88924 in visit_decref (op=0x2aaaaacd3468, data=0x0) at Modules/gcmodule.c:271
> 271	    if (PyObject_IS_GC(op)) {

You can try to dump the content of op? For example:

(gdb) print *op
(gdb) print *op->ob_type


Can you try to rebuild Python from source in debug mode, and then retry the script? Something like:
---
wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tar.xz
tar -xf Python-3.7.4.tar.xz 
cd Python-3.7.4/
./configure --with-pydebug --prefix /opt/py37
make
./python script.py
# no need to install Python
---

If it works, maybe retry without --with-pydebug.
msg347979 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-15 16:29
My notes to debug a crash on a GC collection:
https://pythondev.readthedocs.io/debug_tools.html#debug-crash-in-garbage-collection-visit-decref
msg347983 - (view) Author: Michelle Johnson (jshelly) Date: 2019-07-15 17:08
Oh, ok.  So, it doesn't seem like trying to get the contents of op was successful:

(gdb) print *op
$1 = {ob_refcnt = 0, ob_type = 0x0}
(gdb) print *op->ob_type
Cannot access memory at address 0x0
(gdb)


I then followed your instructions to recompile in debug mode and retried the script.  Here are the results:

[jshelly@gl-build Python-3.7.4]$ ./python ~/script.py
Current thread 0x00002ab43fe0d400 (most recent call first):
  File "/home/jshelly/script.py", line 12 in <module>
called True
[jshelly@gl-build Python-3.7.4]$

Does that help at all?
msg348022 - (view) Author: Michelle Johnson (jshelly) Date: 2019-07-16 15:22
Is the problem I'm seeing possibly related to this issue:

https://bugs.python.org/issue21131

It seems to share a lot of similarities.
msg349898 - (view) Author: Bennet Fauber (justbennet) Date: 2019-08-17 13:00
It would appear that jshelly was correct and this is resolved by fixing the problem isolated in issue 21131, which was closed and a patch was committed.

https://github.com/python/cpython/pull/15276
msg349953 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-19 10:35
Bennet Fauber:
> It would appear that jshelly was correct and this is resolved by fixing the problem isolated in issue 21131, which was closed and a patch was committed.

Great! I close this issue as a duplicate of bpo-21131.
History
Date User Action Args
2019-08-19 10:35:01vstinnersetstatus: open -> closed
superseder: test_faulthandler.test_register_chain fails on 64bit ppc/arm with kernel >= 3.10
messages: + msg349953

resolution: duplicate
stage: resolved
2019-08-17 13:00:38justbennetsetnosy: + justbennet
messages: + msg349898
2019-07-16 15:22:37jshellysetmessages: + msg348022
2019-07-15 17:08:42jshellysetmessages: + msg347983
2019-07-15 16:29:32vstinnersetmessages: + msg347979
2019-07-15 16:29:11vstinnersetmessages: + msg347978
2019-07-15 16:05:50jshellysetmessages: + msg347977
2019-07-15 15:49:17vstinnersettitle: test_faulthandler: test_register_chain() crash on Skylake chipset -> test_faulthandler: test_register_chain() crash with SIGSEGV (signal 11) on Skylake chipset
2019-07-15 15:47:05vstinnersettitle: test_faulthandler failure -> test_faulthandler: test_register_chain() crash on Skylake chipset
2019-07-15 15:46:38vstinnersetfiles: + script.py

messages: + msg347975
2019-07-15 14:30:00jshellysetmessages: + msg347962
2019-07-14 09:13:35vstinnersetmessages: + msg347900
2019-07-14 09:12:28vstinnersetfiles: + sigusr1.py

messages: + msg347899
2019-07-12 20:01:09Jeffrey.Kintschersetnosy: + Jeffrey.Kintscher
2019-07-11 15:25:32xtreaksetnosy: + vstinner
2019-07-11 15:12:51jshellycreate