Issue 12936: armv5tejl segfaults: sched_setaffinity() vs. pthread_setaffinity_np()


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57145

classification

Title:	armv5tejl segfaults: sched_setaffinity() vs. pthread_setaffinity_np()
Type:	crash	Stage:	resolved
Components:	Extension Modules	Versions:	Python 3.3

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, meador.inge, neologix, skrah, vstinner
Priority:	normal	Keywords:	patch

Created on 2011-09-08 09:54 by skrah, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
crash.py	skrah, 2011-09-13 10:51
crash.c	skrah, 2011-09-13 15:14
pthread_nocrash.c	skrah, 2011-09-13 15:54
arm_setaffinity.diff	skrah, 2011-09-14 16:13

Messages (28)
msg143723 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-08 09:54
I'm getting random segfaults in `make buildbottest` on qemu-debian-arm:

Linux-2.6.26-2-versatile-armv5tejl-with-debian-5.0.8 little-endian

The segfaults occurred in test_robotparser and test_nntplib and couldn't be reproduced when running the tests separately. qemu-debian-arm is horrendously slow, so I don't think I'll have time to debug this. I'm submitting the report in case someone has access to fast ARM hardware.

[ 81/359/3] test_nntplib
Fatal Python error: Segmentation fault

Current thread 0x400225f0:
  File "/home/user/cpython-e91ad9669c08/Lib/socket.py", line 389 in create_connection
  File "/home/user/cpython-e91ad9669c08/Lib/nntplib.py", line 1024 in __init__
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_nntplib.py", line 291 in setUpClass
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 143 in _handleClassSetUp
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 97 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 105 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/runner.py", line 168 in run
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1293 in _run_suite
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1327 in run_unittest
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_nntplib.py", line 1260 in test_main
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 1140 in runtest_inner
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 905 in runtest
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 708 in main
  File "/home/user/cpython-e91ad9669c08/Lib/test/__main__.py", line 13 in <module>
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 73 in _run_code
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 160 in _run_module_as_main
Segmentation fault
msg143767 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-09 16:55
You don't have a core dump, do you?
msg143791 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-09 19:19
No such luck. Somehow gdb doesn't dump the core file:

[ 25/359] test_urllib2_localnet
Fatal Python error: Segmentation fault

Current thread 0x400225f0:
  File "/home/user/cpython-e91ad9669c08/Lib/socket.py", line 389 in create_connection
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 721 in connect
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 743 in send
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 805 in _send_output
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 960 in endheaders
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 1002 in _send_request
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 964 in request
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 1145 in do_open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 1165 in http_open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 347 in _call_chain
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 387 in _open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 369 in open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 138 in urlopen
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 136 in handle
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 572 in assertRaises
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_urllib2_localnet.py", line 537 in test_bad_address
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 386 in _executeTestPart
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 441 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 493 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 105 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 105 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/runner.py", line 168 in run
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1293 in _run_suite
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1327 in run_unittest
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_urllib2_localnet.py", line 561 in test_main
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1420 in decorator
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 1140 in runtest_inner
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 905 in runtest
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 708 in main
  File "/home/user/cpython-e91ad9669c08/Lib/test/__main__.py", line 13 in <module>
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 73 in _run_code
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 160 in _run_module_as_main
make: *** [buildbottest] Segmentation fault (core dumped)

user@debian-arm:~/cpython-e91ad9669c08$ ulimit -c unlimited
user@debian-arm:~/cpython-e91ad9669c08$ ls core
ls: cannot access core: No such file or directory
user@debian-arm:~/cpython-e91ad9669c08$ find . -name core
user@debian-arm:~/cpython-e91ad9669c08$

When I run under gdb, the tests are automatically interrupted by SIGINT at some point. Perhaps this is another broken threading implementation. I'll try --without-threads.
msg143835 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-10 09:56
> No such luck. Somehow gdb doesn't dump the core file:

What do

$ /sbin/sysctl -a | grep "kernel.core"

and

$ grep core /etc/security/limits.conf

return?
msg143840 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-10 15:05
This is slightly embarrassing: The partition containing the qemu images was full. I don't encounter this often, so it tends to be the last thing I think of.

Proudly presenting a core dump. Since the segfault occurs in libpthread, I suggest we close this. What do you think?

gdb ./python ./build/test_python_2217/core

Core was generated by `./python -m test -uall -r --randseed=8304772'.
Program terminated with signal 11, Segmentation fault.
[New process 2217]
#0 0x400356f4 in raise () from /lib/libpthread.so.0
(gdb) bt
#0 0x400356f4 in raise () from /lib/libpthread.so.0
#1 0x400356d8 in raise () from /lib/libpthread.so.0
Backtrace stopped: frame did not save the PC
(gdb)
msg143858 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-11 09:23
Traceback with faulthandler disabled:

Core was generated by `./python -m test -uall -r --randseed=8304772'.
Program terminated with signal 11, Segmentation fault.
[New process 3948]
#0 0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
(gdb) bt
#0 0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
#1 0x40011d10 in __tls_get_addr () from /lib/ld-linux.so.2
Backtrace stopped: frame did not save the PC
msg143859 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-11 09:47
> Traceback with faulthandler disabled:

It crashes when trying to look up TLS (which explains why it doesn't crash when built `--without-threads`). Looks like a libc bug, but would it be possible to have a backtrace with Python built `--with-pydebug`?
msg143865 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-11 13:53
Curiously enough python is built --with-pydebug. Version 9d658f000419, which is pre-faulthandler, runs without segfaults. Could faulthandler cause problems like these: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=370060
msg143869 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-11 15:07
> Could faulthandler cause problems like these:

Well, that would explain why it crashes in the TLS lookup code, and why the core dump looks borked.

1) Apparently, Etch on ARM uses linuxthreads instead of NPTL: what does
$ getconf GNU_LIBPTHREAD_VERSION
return on your box?

2) If it's using linuxthreads, the culprit is likely the call to PyGILState_GetThisThreadState() from faulthandler_fatal_error(), which does a TLS lookup (which screws up because it's running on a user-allocated stack set up with sigaltstack). However, this should only happen when a fatal signal is handled by faulthandler, which should - AFAICT - only happen in test_faulthandler. Rebuilding faulthandler with #undef HAVE_SIGALTSTACK at the top of the file should do the trick.
msg143877 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-11 21:06
I completely removed faulthandler from e91ad9669c08 and the problem still occurs (with the same broken backtrace).

$ getconf GNU_LIBPTHREAD_VERSION
NPTL 2.7

It is a bit unsatisfying that the segfault isn't reproducible with the earlier revision, but there are several glibc issues with __tls_get_addr():

1) http://www.cygwin.com/ml/libc-hacker/2008-10/msg00005.html
2) http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453

If I run the demo script from 2), I get a segfault on debian-arm as well as on Ubuntu Lucid. So, it may very well be that some recent change in Python exposes a glibc problem.
msg143880 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-09-11 23:39
> Looks like a libc bug ...
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453

Yes, the GNU libc has bugs (like all software!): this one has been fixed only recently (in glibc 2.14, released on 2011-05-31). I don't know if this issue is a duplicate of glibc bug 12453.

> Could faulthandler cause problems ...

faulthandler creates two locks at startup. faulthandler.enable() (e.g. called by regrtest when running the test suite) creates a thread and changes the signal mask of this thread (to ignore all signals). I don't see how faulthandler can be linked to this issue, but yes, it might be linked to it.

In your case, faulthandler only reads a TLS on a crash. So faulthandler is not the cause of the initial crash, but it may cause a new fault :-)

--

> Apparently, Etch on ARM uses linuxthreads instead of NPTL ...

FYI you can also try to print sys.thread_info (which should give the same information, "NPTL 2.7").

NPTL has known issues: see for example the Python issue #4970. NPTL is old and has been replaced by pthread in the glibc on Linux.

--

> Traceback with faulthandler disabled: ...

How did you disable faulthandler?

--

> Version 9d658f000419, which is pre-faulthandler, runs without segfaults.

If it's a regression, you must try hg bisect! It is slow but it is fully automated! Try something like:

hg bisect -r
hg bisect -b 9d658f000419
hg bisect -c 'make && ./python -m test test_urllib2_localnet test_robotparser test_nntplib'
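The sys.thread_info suggestion can be tried from Python directly. The following is a minimal sketch, not something posted in the thread: the helper name threading_library_info is made up for illustration, and the use of os.confstr("CS_GNU_LIBPTHREAD_VERSION") as a stand-in for `getconf GNU_LIBPTHREAD_VERSION` assumes a glibc-based system (the key is simply absent elsewhere).

```python
import os
import sys


def threading_library_info():
    """Return (sys.thread_info, libpthread version string or None).

    Hypothetical helper, for illustration only.
    """
    version = None
    # CS_GNU_LIBPTHREAD_VERSION is a glibc extension and may be missing.
    names = getattr(os, "confstr_names", {})
    if hasattr(os, "confstr") and "CS_GNU_LIBPTHREAD_VERSION" in names:
        try:
            version = os.confstr("CS_GNU_LIBPTHREAD_VERSION")
        except (OSError, ValueError):
            version = None
    return sys.thread_info, version


info, pthread_version = threading_library_info()
# name/lock/version describe the thread implementation Python was built with.
print(info.name, info.lock, info.version)
print(pthread_version)
```

On the reporter's machine both values would be expected to mention "NPTL 2.7", matching the getconf output quoted earlier.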
msg143886 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-12 06:59
> 2) http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453

We actually had another issue due to this particular libc bug: http://bugs.python.org/issue6059

Basically, the problem is that if some libraries are dynamically loaded in an interleaved way, the TLS can be returned uninitialized, hence the segfault upon access. This problem can show up now because the import order of some modules has been modified: depending on the test that crashes - or rather the tests that run just before - you might be able to pinpoint it quickly (or you could maybe use "ltrace -e dlopen").

>> Apparently, Etch on ARM uses linuxthreads instead of NPTL ...
>
> FYI you can also try to print sys.thread_info (which should give the same information, "NPTL 2.7").
>
> NPTL has known issues: see for example the Python issue #4970. NPTL is old and has been replaced by pthread in the glibc on Linux.

I think you're confusing it with linuxthreads ;-)
msg143887 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-12 07:18
Oh, and BTW, for the "Backtrace stopped: frame did not save the PC", you might want to install the libc-dbg package. This might help in finding precisely where it's crashing.
msg143890 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-12 08:36
STINNER Victor <report@bugs.python.org> wrote:
> > Traceback with faulthandler disabled: ...
>
> How did you disable faulthandler?

That was a run with all faulthandler references removed from regrtest.py. But as I said in my previous mail, I also did a run using e91ad9669c08 but without compiling and linking faulthandler, so that _PyFaulthandler_Init() wouldn't be called.

This had the same result, so faulthandler is _not_ the cause of this bug.

> > Version 9d658f000419, which is pre-faulthandler, runs without segfaults.
>
> If it's a regression, you must try hg bisect! It is slow but it is fully automated! Try something like:
>
> hg bisect -r
> hg bisect -b 9d658f000419
> hg bisect -c 'make && ./python -m test test_urllib2_localnet test_robotparser test_nntplib'

If it were that easy! I can't isolate the bug. The only way I can reproduce it is by running the whole test suite with various random seeds. Then it takes about 6 hours until the crash occurs in one of those tests. The whole test suite takes about 24 hours.

I could try to install libc-dbg though.
msg143952 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-13 10:51
The failure was introduced by issue #12655. I attach a minimal script to reproduce the segfault.
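The attached crash.py is not reproduced on this page. As a purely hypothetical sketch of the kind of sequence the thread goes on to describe (an affinity call followed by a name lookup that ends in getaddrinfo()), one could write something like the following. The function name affinity_then_lookup, the host "localhost" and the port are assumptions for illustration, and os.sched_setaffinity() is the Python-level wrapper (Python 3.3+, Linux), not necessarily the call the attachment used.

```python
import os
import socket


def affinity_then_lookup(host="localhost", port=80):
    """Re-apply the caller's CPU affinity, then resolve a host name.

    Hypothetical stand-in for the attachment, not a copy of it.
    """
    if hasattr(os, "sched_setaffinity"):
        # pid 0 means "the caller"; re-applying the current mask changes nothing.
        os.sched_setaffinity(0, os.sched_getaffinity(0))
    try:
        # On the affected system the crash was reported inside getaddrinfo().
        return len(socket.getaddrinfo(host, port))
    except OSError:
        # Name resolution may be unavailable (e.g. in a sandbox).
        return None


print(affinity_then_lookup())
```

On an unaffected system this simply prints the number of address records found (or None if resolution is unavailable).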
msg143953 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-13 11:00
And here's a full backtrace of crash.py:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x400225f0 (LWP 633)]
0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
(gdb) bt
#0 0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
#1 0x40035a14 in __h_errno_location () from /lib/libpthread.so.0
#2 0x40a788dc in __libc_res_nsearch () from /lib/libresolv.so.2
#3 0x40a66e9c in _nss_dns_gethostbyname3_r () from /lib/libnss_dns.so.2
#4 0x40a670ac in _nss_dns_gethostbyname2_r () from /lib/libnss_dns.so.2
#5 0x40180480 in gaih_inet () from /lib/libc.so.6
#6 0x40181da8 in getaddrinfo () from /lib/libc.so.6
#7 0x406084a4 in socket_getaddrinfo (self=0x405d7bcc, args=0x4089a8b4, kwargs=0x0) at /home/user/mercurial-1.9.2/cpython/Modules/socketmodule.c:4787
#8 0x001ea384 in PyCFunction_Call (func=0x405da1f4, arg=0x4089a8b4, kw=0x0) at Objects/methodobject.c:84
#9 0x000a3634 in call_function (pp_stack=0xbeab7d1c, oparg=4) at Python/ceval.c:4000
#10 0x0009cab8 in PyEval_EvalFrameEx (f=0x407457b4, throwflag=0) at Python/ceval.c:2625
#11 0x000a0bfc in PyEval_EvalCodeEx (_co=0x405d6ab8, globals=0x40591a34, locals=0x0, args=0x408884dc, argcount=2, kws=0x408884e4, kwcount=0, defs=0x40512a20, defcount=2, kwdefs=0x0, closure=0x0) at Python/ceval.c:3375
#12 0x000a3cfc in fast_function (func=0x405e30e4, pp_stack=0xbeab8068, n=2, na=2, nk=0) at Python/ceval.c:4098
#13 0x000a3838 in call_function (pp_stack=0xbeab8068, oparg=2) at Python/ceval.c:4021
#14 0x0009cab8 in PyEval_EvalFrameEx (f=0x40888374, throwflag=0) at Python/ceval.c:2625
#15 0x000a0bfc in PyEval_EvalCodeEx (_co=0x4089d5d8, globals=0x4088d854, locals=0x0, args=0x404e2ac8, argcount=2, kws=0x405b43c8, kwcount=2, defs=0x4098fbd0, defcount=6, kwdefs=0x0, closure=0x0) at Python/ceval.c:3375
#16 0x001c3060 in function_call (func=0x40a2dea4, arg=0x404e2ab4, kw=0x409a98f4) at Objects/funcobject.c:629
#17 0x0017f1a0 in PyObject_Call (func=0x40a2dea4, arg=0x404e2ab4, kw=0x409a98f4) at Objects/abstract.c:2149
#18 0x001a1a9c in method_call (func=0x40a2dea4, arg=0x404e2ab4, kw=0x409a98f4) at Objects/classobject.c:318
#19 0x0017f1a0 in PyObject_Call (func=0x4050b9d4, arg=0x404e2574, kw=0x409a98f4) at Objects/abstract.c:2149
#20 0x0004a6c0 in slot_tp_init (self=0x405ae504, args=0x404e2574, kwds=0x409a98f4) at Objects/typeobject.c:5431
#21 0x00037650 in type_call (type=0x40a31034, args=0x404e2574, kwds=0x409a98f4) at Objects/typeobject.c:691
#22 0x0017f1a0 in PyObject_Call (func=0x40a31034, arg=0x404e2574, kw=0x409a98f4) at Objects/abstract.c:2149
#23 0x000a46bc in do_call (func=0x40a31034, pp_stack=0xbeab84f0, na=1, nk=2) at Python/ceval.c:4220
#24 0x000a3858 in call_function (pp_stack=0xbeab84f0, oparg=513) at Python/ceval.c:4023
#25 0x0009cab8 in PyEval_EvalFrameEx (f=0x40558544, throwflag=0) at Python/ceval.c:2625
#26 0x000a0bfc in PyEval_EvalCodeEx (_co=0x40479d28, globals=0x403d5034, locals=0x403d5034, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3375
#27 0x000916f4 in PyEval_EvalCode (co=0x40479d28, globals=0x403d5034, locals=0x403d5034) at Python/ceval.c:770
#28 0x000e0cb4 in run_mod (mod=0x37c8f8, filename=0x405028c8 "crash.py", globals=0x403d5034, locals=0x403d5034, flags=0xbeab8864, arena=0x2e5178) at Python/pythonrun.c:1793
#29 0x000e0a58 in PyRun_FileExFlags (fp=0x2ce260, filename=0x405028c8 "crash.py", start=257, globals=0x403d5034, locals=0x403d5034, closeit=1, flags=0xbeab8864) at Python/pythonrun.c:1750
#30 0x000debcc in PyRun_SimpleFileExFlags (fp=0x2ce260, filename=0x405028c8 "crash.py", closeit=1, flags=0xbeab8864) at Python/pythonrun.c:1275
#31 0x000dde68 in PyRun_AnyFileExFlags (fp=0x2ce260, filename=0x405028c8 "crash.py", closeit=1, flags=0xbeab8864) at Python/pythonrun.c:1046
#32 0x000ff984 in run_file (fp=0x2ce260, filename=0x401fe028, p_cf=0xbeab8864) at Modules/main.c:299
#33 0x00100780 in Py_Main (argc=2, argv=0x401fc028) at Modules/main.c:693
#34 0x0001a914 in main (argc=2, argv=0xbeab8994) at ./Modules/python.c:59
msg143959 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-09-13 12:04
> The failure was introduced by issue #12655

Wow, great job!

crash.py looks like a libc and/or kernel bug. Can you try glibc 2.14 (released on 2011-05-31)? You should first check whether it is a duplicate of http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453
msg143974 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-13 15:14
I wonder whether it is http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453. The demo script from there crashes both on debian-arm and Ubuntu Lucid, but this specific segfault only occurs on debian-arm.

Attached is a minimal C test case that only crashes on debian-arm when sched_setaffinity() is called and the program is linked to pthread:

$ gcc -Wall -W -O0 -g -o crash crash.c
$ ./crash
$

$ gcc -Wall -W -O0 -g -o crash crash.c -pthread
$ ./crash
Segmentation fault (core dumped)

# comment out: sched_setaffinity(0, size, cpusetp);
$ gcc -Wall -W -O0 -g -o crash crash.c -pthread
$ ./crash
$

On Ubuntu all three cases run fine. Perhaps this is a bug in sched_setaffinity()?
msg143978 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-13 15:54
I think I got it: pthread_setaffinity_np() does not crash. `man sched_setaffinity` is slightly ambiguous, but there is this remark: (If you are using the POSIX threads API, then use pthread_setaffinity_np(3) instead of sched_setaffinity().) I'm attaching the non-crashing version.
msg143979 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-13 17:26
> I think I got it: pthread_setaffinity_np() does not crash. Nice. Out of curiosity, I just looked at the source code, and it just does sched_setaffinity(thread->tid), so you can do the same with sched_setaffinity(syscall(SYS_gettid)) for the current thread. However, I don't think we should/could add this to the posix module: it expects a pthread_t instead of a PID, to which we don't have access. Furthermore, even though we're linked with pthread, this should normally succeed - or at least not crash - when called from the main thread - and it does on my Debian squeeze box. So I'd suggest closing this issue.
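The "sched_setaffinity() on the kernel thread ID" idea can be sketched at the Python level as well. This is an illustration under assumptions, not code from the thread: threading.get_native_id() (Python 3.8+) is used here as the counterpart of syscall(SYS_gettid), os.sched_setaffinity() requires Linux, and pin_current_thread is a made-up helper name.

```python
import os
import threading


def pin_current_thread(cpus):
    """Restrict only the calling thread to the given CPU numbers.

    Hypothetical helper: returns the affinity mask in effect afterwards.
    """
    tid = threading.get_native_id()  # kernel thread ID, as from gettid(2)
    os.sched_setaffinity(tid, cpus)
    return os.sched_getaffinity(tid)


if hasattr(os, "sched_setaffinity"):
    # Re-applying the current mask is harmless and keeps the demo side-effect free.
    print(pin_current_thread(os.sched_getaffinity(0)))
```

Passing a kernel thread ID rather than a pthread_t sidesteps the objection that Python has no access to a pthread_t, but it does not by itself avoid the libc problem discussed in this issue.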
msg143981 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-09-13 17:55
> However, I don't think we should/could add this to the posix module:
> it expects a pthread_t instead of a PID, to which we don't have access.

We already have such a function: http://docs.python.org/dev/library/signal.html#signal.pthread_kill

I added threading.get_ident() to easily get the thread identifier. In Python < 3.3, you can use threading.current_thread().ident.

It's not documented, but if you pass a random integer, signal.pthread_kill() does crash.
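For reference, the combination of threading.get_ident() and signal.pthread_kill() mentioned here looks roughly as follows. This is a minimal sketch with a made-up helper name; it only ever passes the identifier of the calling thread, because, as noted above, an arbitrary integer can crash the interpreter. Signal number 0 sends nothing and only performs error checking.

```python
import signal
import threading


def current_thread_is_signalable():
    """Probe the calling thread with signal 0 (no signal is delivered)."""
    ident = threading.get_ident()  # same value as threading.current_thread().ident
    signal.pthread_kill(ident, 0)  # raises OSError if the check fails
    return True


print(current_thread_is_signalable())
```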
msg143984 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-13 19:13
Charles-François Natali <report@bugs.python.org> wrote:
> Out of curiosity, I just looked at the source code, and it just does
> sched_setaffinity(thread->tid), so you can do the same with
> sched_setaffinity(syscall(SYS_gettid)) for the current thread.

sched_setaffinity(syscall(SYS_gettid), size, cpusetp) crashes, too. This seems to be a violation of the man page, which states:

"The value returned from a call to gettid(2) can be passed in the argument pid."

Unless one uses a somewhat warped interpretation that linking against pthread constitutes "using the POSIX threads API". That would be the only loophole that would allow the crash.

> However, I don't think we should/could add this to the posix module:
> it expects a pthread_t instead of a PID, to which we don't have access.

If we have access (and as I understood from Victor's post we do): pthread_getaffinity_np() also exists on FreeBSD, which would be an advantage.

> So I'd suggest closing this issue.

I don't care strongly about using pthread_getaffinity_np(), but at least I'd like to skip the scheduling sections on arm-linux if they don't work reliably.
msg143988 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-13 20:47
> If we have access (and as I understood from Victor's post we do):
> pthread_getaffinity_np() also exists on FreeBSD, which would be
> an advantage.

Yes, but I see several drawbacks:
- as noted by Victor, it's really easy to crash the interpreter by passing an invalid thread ID, which IMHO should be avoided at all cost
- to be safe, we would need to have a different API depending on whether Python is built with threads or not (i.e. sched_setaffinity() without threads, and pthread_setaffinity_np())
- pthread_setaffinity_np() is really non-portable (it's guarded by __USE_GNU in my system's header)
- sched_setaffinity() seems to work fine on most systems even when linked with pthread

> I don't care strongly about using pthread_getaffinity_np(), but at least I'd
> like to skip the scheduling sections on arm-linux if they don't work reliably.

Sounds reasonable. I guess you could use os.uname() or platform.machine().
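A skip based on platform.machine() could look roughly like this. It is a sketch under assumptions rather than the patch attached to this issue: the helper is_arm_linux, the test class name and the skip message are invented for illustration, and matching on a machine string that starts with "arm" is a guess that would cover the armv5tejl reported above.

```python
import os
import platform
import sys
import unittest


def is_arm_linux(machine=None, system=None):
    """Return True for ARM machines running Linux (hypothetical helper)."""
    machine = platform.machine() if machine is None else machine
    system = sys.platform if system is None else system
    return system.startswith("linux") and machine.lower().startswith("arm")


class SchedulingTests(unittest.TestCase):
    @unittest.skipUnless(hasattr(os, "sched_setaffinity"),
                         "requires os.sched_setaffinity()")
    @unittest.skipIf(is_arm_linux(),
                     "sched_setaffinity() is unreliable on this platform")
    def test_setaffinity_roundtrip(self):
        # Re-applying the current mask must leave it unchanged.
        mask = os.sched_getaffinity(0)
        os.sched_setaffinity(0, mask)
        self.assertEqual(os.sched_getaffinity(0), mask)
```

Skipping only the tests keeps the functions available on ARM systems whose libc does not show the problem.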
msg143992 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-09-13 23:10
> as noted by Victor, it's really easy to crash the interpreter
> by passing an invalid thread ID, which IMHO, should be avoided
> at all cost

Do you mean that signal.pthread_kill() should be removed? This function is very useful and solves some issues that cannot be solved differently. At the same time, I don't think that it's possible to work around the crashes. At least, I don't see how: pthread_kill(tid, 0) is supposed to check if tid exists, but it does crash...

> to be safe, we would need to have a different API depending
> on whether Python is built with threads or not
> (i.e. sched_setaffinity() without threads,
> and pthread_setaffinity_np())

We cannot use the same name for two different C functions. One expects a process identifier, whereas the other expects a thread identifier! If Python is compiled without thread, the thread will not exist (as some modules and many other functions).

> pthread_setaffinity_np() is really non-portable
> (it's guarded by __USE_GNU in my system's header)

We can check it in configure. We already use some functions which are GNU extensions, like makedev(). Oh, os.makedev() availability is just not documented :-)

> sched_setaffinity() seems to work fine on most systems
> even when linked with pthread

Again, it looks like a libc/kernel bug. I don't think that Python can work around such an issue.

> I don't care strongly about using pthread_getaffinity_np()

I don't really care about pthread_getaffinity_np() :-) To add a new function, we need a use case and it should be requested. This issue is about a crash using sched_setaffinity(), not about pthread_getaffinity_np.

I don't know or need (), but the difference between sched_setaffinity and pthread_getaffinity_np is the same as the difference between sigprocmask() and pthread_sigmask(). I chose to expose only the latter because the behaviour of sigprocmask is undefined in a process using threads.

The sched_setaffinity manual contains the sentence "If you are using the POSIX threads API, then use pthread_setaffinity_np(3) instead of sched_setaffinity()".

See also Portable Hardware Locality (hwloc): http://www.open-mpi.org/projects/hwloc/
msg144009 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-14 05:19
> Do you mean that signal.pthread_kill() should be removed? This function is very useful and solve some issues that cannot be solved differently. At the same time, I don't think that it's possible to workaround the crashes. At least, I don't see how: pthread_kill(tid, 0) is supposed to check if tid exists, but it does crash...

No, I'm not suggesting removing it; it is useful. As for the crashes, with glibc pthread_t is really a pointer, so there's no way to check its validity beforehand. Even if we did check the thread ID against the list of Python-created thread IDs (stored in Thread._ident), this could still crash, because the ID becomes invalid as soon as the thread terminates (all threads are started detached). Furthermore, this wouldn't work for non-Python created threads.

> We cannot use the same name for two different C function. One expects a process identifier, whereas the other expects a thread identifier! If Python is compiled without thread, the thread will not exist (as some modules and many other functions).

I know, that's why I said "different API": but I must admit it was poorly worded ;-) However, this wouldn't solve this particular problem: as long as we expose sched_setaffinity(), it will crash as soon as someone passes `0` or getpid() as PID.

>> pthread_setaffinity_np() is really non-portable
>> (it's guarded by __USE_GNU in my system's header)
>
> We can check it in configure. We already use some functions which are GNU extensions, like makedev(). Oh, os.makedev() availability is just not documented :-)

As I said, this wouldn't solve this problem. If someone deems it necessary, we can open another issue for this feature request.

>> sched_setaffinity() seems to work fine on most systems
>> even when linked with pthread
>
> Again, it looks like a libc/kernel bug. I don't think that Python can work around such issue.

Agreed.

> I don't know or need (), but the difference between sched_setaffinity and pthread_getaffinity_np is the same between sigprocmask() and pthread_sigmask(). I chose to expose only the later because the behaviour of sigprocmask is undefined in a process using threads.

Exactly. However, nothing prevents someone from using sigprocmask() in a multithreaded process; the only difference is that it won't crash (AFAICT).

So I suggest that we:
1) skip the problematic tests on ARM when built with threads to avoid segfaults
2) open a separate feature request if someone wants pthread_getaffinity_np()
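For comparison, the thread-aware call that Python does expose, signal.pthread_sigmask(), can be used as in the sketch below. The helper name and the choice of SIGUSR1 are illustrative assumptions; the mask change applies to the calling thread only and is undone at the end.

```python
import signal


def block_in_current_thread(signum):
    """Block signum for the calling thread and return the previous mask."""
    return signal.pthread_sigmask(signal.SIG_BLOCK, {signum})


old_mask = block_in_current_thread(signal.SIGUSR1)
# Blocking an empty set changes nothing and returns the mask now in effect.
now_blocked = signal.pthread_sigmask(signal.SIG_BLOCK, set())
# Restore whatever mask the thread had before.
signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
print(signal.SIGUSR1 in now_blocked)
```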
msg144030 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-14 16:13
I'd prefer to disable the misbehaving functions entirely on arm. With the patch this combination of tests now works: ./python -m test -uall test_posix test_nntplib If you think the patch is good, I can run the whole test suite, too. [I'd rather wait for review due to the slowness of the setup.]
msg144031 - (view)	Author: Charles-François Natali (neologix) *	Date: 2011-09-14 17:03
> I'd prefer to disable the misbehaving functions entirely on arm.

-10

If we start disabling features on platforms with partly bogus implementations, we might as well drop threading on OpenBSD, sendmsg() on OS-X, etc. Furthermore, it's really just a libc bug, which might be fixed in a more recent version, or with another libc provider (eglibc, uclibc, etc.).
msg144174 - (view)	Author: Stefan Krah (skrah) *	Date: 2011-09-17 06:51
I cannot reproduce the crash on:

Linux debian-armel 2.6.32-5-versatile #1 Wed Jan 12 23:05:11 UTC 2011 armv5tejl GNU/Linux

Since the old (arm) port is deprecated, I'm closing this.

History
Date	User	Action	Args
2022-04-11 14:57:21	admin	set	github: 57145
2011-09-17 06:51:50	skrah	set	status: open -> closed resolution: wont fix messages: + msg144174 stage: test needed -> resolved
2011-09-14 17:03:06	neologix	set	messages: + msg144031
2011-09-14 16:13:53	skrah	set	files: + arm_setaffinity.diff keywords: + patch messages: + msg144030
2011-09-14 05:19:33	neologix	set	messages: + msg144009
2011-09-13 23:10:03	vstinner	set	messages: + msg143992
2011-09-13 20:47:09	neologix	set	messages: + msg143988
2011-09-13 19:13:59	skrah	set	messages: + msg143984
2011-09-13 17:55:49	vstinner	set	messages: + msg143981
2011-09-13 17:26:13	neologix	set	messages: + msg143979
2011-09-13 15:54:03	skrah	set	files: + pthread_nocrash.c messages: + msg143978 title: armv5tejl: random segfaults in getaddrinfo() -> armv5tejl segfaults: sched_setaffinity() vs. pthread_setaffinity_np()
2011-09-13 15:14:51	skrah	set	files: + crash.c messages: + msg143974
2011-09-13 12:04:12	vstinner	set	messages: + msg143959
2011-09-13 11:00:32	skrah	set	messages: + msg143953
2011-09-13 10:51:48	skrah	set	files: + crash.py nosy: + benjamin.peterson messages: + msg143952
2011-09-12 08:36:10	skrah	set	messages: + msg143890
2011-09-12 07:18:14	neologix	set	messages: + msg143887
2011-09-12 06:59:36	neologix	set	messages: + msg143886
2011-09-11 23:39:57	vstinner	set	messages: + msg143880
2011-09-11 21:06:28	skrah	set	messages: + msg143877
2011-09-11 15:07:04	neologix	set	messages: + msg143869
2011-09-11 13:53:24	skrah	set	nosy: + vstinner messages: + msg143865
2011-09-11 12:34:27	meador.inge	set	nosy: + meador.inge
2011-09-11 09:47:31	neologix	set	messages: + msg143859
2011-09-11 09:23:40	skrah	set	messages: + msg143858
2011-09-10 15:05:15	skrah	set	messages: + msg143840
2011-09-10 09:56:50	neologix	set	messages: + msg143835
2011-09-09 19:19:21	skrah	set	messages: + msg143791
2011-09-09 16:55:45	neologix	set	nosy: + neologix messages: + msg143767
2011-09-08 09:54:47	skrah	create