classification
Title: armv5tejl segfaults: sched_setaffinity() vs. pthread_setaffinity_np()
Type: crash
Stage: resolved
Components: Extension Modules
Versions: Python 3.3

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, meador.inge, neologix, skrah, vstinner
Priority: normal
Keywords: patch

Created on 2011-09-08 09:54 by skrah, last changed 2011-09-17 06:51 by skrah. This issue is now closed.

Files
File name Uploaded Description
crash.py skrah, 2011-09-13 10:51
crash.c skrah, 2011-09-13 15:14
pthread_nocrash.c skrah, 2011-09-13 15:54
arm_setaffinity.diff skrah, 2011-09-14 16:13
Messages (28)
msg143723 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-08 09:54
I'm getting random segfaults in `make buildbottest` on qemu-debian-arm:

Linux-2.6.26-2-versatile-armv5tejl-with-debian-5.0.8 little-endian

The segfaults occurred in test_robotparser and test_nntplib and
couldn't be reproduced when running the tests separately.
qemu-debian-arm is horrendously slow, so I don't think I'll have
time to debug this. I'm submitting the report in case someone has
access to fast ARM hardware.


[ 81/359/3] test_nntplib
Fatal Python error: Segmentation fault

Current thread 0x400225f0:
  File "/home/user/cpython-e91ad9669c08/Lib/socket.py", line 389 in create_connection
  File "/home/user/cpython-e91ad9669c08/Lib/nntplib.py", line 1024 in __init__
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_nntplib.py", line 291 in setUpClass
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 143 in _handleClassSetUp
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 97 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 105 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/runner.py", line 168 in run
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1293 in _run_suite
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1327 in run_unittest
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_nntplib.py", line 1260 in test_main
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 1140 in runtest_inner
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 905 in runtest
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 708 in main
  File "/home/user/cpython-e91ad9669c08/Lib/test/__main__.py", line 13 in <module>
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 73 in _run_code
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 160 in _run_module_as_main
Segmentation fault
msg143767 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-09 16:55
You don't have a core dump, do you?
msg143791 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-09 19:19
No such luck. Somehow gdb doesn't dump the core file:


[ 25/359] test_urllib2_localnet
Fatal Python error: Segmentation fault

Current thread 0x400225f0:
  File "/home/user/cpython-e91ad9669c08/Lib/socket.py", line 389 in create_connection
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 721 in connect
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 743 in send
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 805 in _send_output
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 960 in endheaders
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 1002 in _send_request
  File "/home/user/cpython-e91ad9669c08/Lib/http/client.py", line 964 in request
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 1145 in do_open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 1165 in http_open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 347 in _call_chain
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 387 in _open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 369 in open
  File "/home/user/cpython-e91ad9669c08/Lib/urllib/request.py", line 138 in urlopen
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 136 in handle
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 572 in assertRaises
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_urllib2_localnet.py", line 537 in test_bad_address
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 386 in _executeTestPart
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 441 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/case.py", line 493 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 105 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 105 in run
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/suite.py", line 67 in __call__
  File "/home/user/cpython-e91ad9669c08/Lib/unittest/runner.py", line 168 in run
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1293 in _run_suite
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1327 in run_unittest
  File "/home/user/cpython-e91ad9669c08/Lib/test/test_urllib2_localnet.py", line 561 in test_main
  File "/home/user/cpython-e91ad9669c08/Lib/test/support.py", line 1420 in decorator
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 1140 in runtest_inner
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 905 in runtest
  File "/home/user/cpython-e91ad9669c08/Lib/test/regrtest.py", line 708 in main
  File "/home/user/cpython-e91ad9669c08/Lib/test/__main__.py", line 13 in <module>
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 73 in _run_code
  File "/home/user/cpython-e91ad9669c08/Lib/runpy.py", line 160 in _run_module_as_main
make: *** [buildbottest] Segmentation fault (core dumped)


user@debian-arm:~/cpython-e91ad9669c08$ ulimit -c
unlimited
user@debian-arm:~/cpython-e91ad9669c08$ ls core
ls: cannot access core: No such file or directory
user@debian-arm:~/cpython-e91ad9669c08$ find . -name core
user@debian-arm:~/cpython-e91ad9669c08$ 


When I run under gcc, the tests are automatically interrupted
by SIGINT at some point. Perhaps this is another broken
threading implementation. I'll try --without-threads.
msg143835 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-10 09:56
> No such luck. Somehow gdb doesn't dump the core file:

What do
$ /sbin/sysctl -a | grep "kernel.core"

And
$ grep core /etc/security/limits.conf

return?
msg143840 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-10 15:05
This is slightly embarrassing: The partition containing the qemu images
was full. I don't encounter this often, so it tends to be the last thing
I think of.

Proudly presenting a core dump. Since the segfault occurs in
libpthread, I suggest we close this. What do you think?


gdb ./python ./build/test_python_2217/core


Core was generated by `./python -m test -uall -r --randseed=8304772'.
Program terminated with signal 11, Segmentation fault.
[New process 2217]
#0  0x400356f4 in raise () from /lib/libpthread.so.0
(gdb) bt
#0  0x400356f4 in raise () from /lib/libpthread.so.0
#1  0x400356d8 in raise () from /lib/libpthread.so.0
Backtrace stopped: frame did not save the PC
(gdb)
msg143858 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-11 09:23
Traceback with faulthandler disabled:

Core was generated by `./python -m test -uall -r --randseed=8304772'.
Program terminated with signal 11, Segmentation fault.
[New process 3948]
#0  0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
(gdb) bt
#0  0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
#1  0x40011d10 in __tls_get_addr () from /lib/ld-linux.so.2
Backtrace stopped: frame did not save the PC
msg143859 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-11 09:47
> Traceback with faulthandler disabled:

It crashes when trying to look up TLS (which explains why it doesn't crash when built `--without-threads`).
Looks like a libc bug, but would it be possible to have a backtrace with Python built `--with-pydebug`?
msg143865 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-11 13:53
Curiously enough python *is* built --with-pydebug.


Version 9d658f000419, which is pre-faulthandler, runs without segfaults.


Could faulthandler cause problems like these:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=370060
msg143869 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-11 15:07
> Could faulthandler cause problems like these:

Well, that would explain why it crashes in the TLS lookup code, and why the core dump looks borked.

1) Apparently, Etch on ARM uses linuxthreads instead of NPTL: what does
$ getconf GNU_LIBPTHREAD_VERSION
return on your box?

2) If it's using linuxthreads, the culprit is likely the call to PyGILState_GetThisThreadState() from faulthandler_fatal_error(), which does a TLS lookup (which screws up because it's running on a user-allocated stack set up with sigaltstack).
However, this should only happen when a fatal signal is handled by faulthandler, which should - AFAICT - only happen in test_faulthandler.

Rebuilding faulthandler with
#undef HAVE_SIGALTSTACK

at the top of the file, should do the trick.
msg143877 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-11 21:06
I completely removed faulthandler from e91ad9669c08 and the problem
still occurs (with the same broken backtrace).

$ getconf GNU_LIBPTHREAD_VERSION
NPTL 2.7


It is a bit unsatisfying that the segfault isn't reproducible with
the earlier revision, but there are several glibc issues with
__tls_get_addr():


1) http://www.cygwin.com/ml/libc-hacker/2008-10/msg00005.html
2) http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453


If I run the demo script from 2), I get a segfault both on
debian-arm and on Ubuntu Lucid.

So, it may very well be that some recent change in Python exposes a
glibc problem.
msg143880 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-09-11 23:39
> Looks like a libc bug ...
> http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453

Yes, the GNU libc has bugs (like all software!): this one was fixed only recently (in glibc 2.14, released on 2011-05-31). I don't know whether this issue is a duplicate of glibc bug 12453.

> Could faulthandler cause problems ...

faulthandler creates two locks at startup. faulthandler.enable() (e.g. called by regrtest when running the test suite) creates a thread and changes the signal mask of this thread (to ignore all signals).

I don't see how faulthandler can be linked to this issue, but yes, it might be linked to it.

In your case, faulthandler only reads a TLS on a crash. So faulthandler is not the cause of the initial crash, but it may cause a new fault :-)


--

> Apparently, Etch on ARM uses linuxthreads instead of NPTL ...

FYI you can also try to print sys.thread_info (which should give the same information, "NPTL 2.7").
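
As a minimal sketch of that check (the helper name `describe_threading` is made up for illustration), `sys.thread_info` exposes the thread implementation, the lock type and the thread library version as a named tuple on Python 3.3 and later. The version field can be None when it is unknown.

```python
import sys


def describe_threading():
    """Summarise the thread implementation this interpreter was built with."""
    info = sys.thread_info  # named tuple with fields: name, lock, version
    return "name={} lock={} version={}".format(info.name, info.lock, info.version)


print(describe_threading())
```

On a glibc-based Linux box the version field should carry the same string that `getconf GNU_LIBPTHREAD_VERSION` reports.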

NPTL has known issues: see for example Python issue #4970. NPTL is old and has been replaced by pthread in the glibc on Linux.

--

> Traceback with faulthandler disabled: ...

How did you disable faulthandler?

--

> Version 9d658f000419, which is pre-faulthandler, runs without segfaults.

If it's a regression, you must try hg bisect! It is slow but it is fully automated! Try something like:

hg bisect -r
hg bisect -g 9d658f000419
hg bisect -b e91ad9669c08
hg bisect -c 'make && ./python -m test test_urllib2_localnet test_robotparser test_nntplib'
msg143886 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-12 06:59
> 2) http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453

We actually had another issue due to this particular libc bug:

http://bugs.python.org/issue6059

Basically, the problem is that if some libraries are dynamically
loaded in an interleaved way, the TLS can be returned uninitialized,
hence the segfault upon access.
This problem can show up now because the import orders for some
modules have been modified: depending on the test that crashes - or
rather the tests that run just before - you might be able to pinpoint
it quickly (or you could maybe use "ltrace -e dlopen").

>> Apparently, Etch on ARM uses linuxthreads instead of NPTL ...
>
> FYI you can also try to print sys.thread_info (which should give the same information, "NPTL 2.7").
>
> NPTL has known issues: see for example Python issue #4970. NPTL is old and has been replaced by pthread in the glibc on Linux.

I think you're confusing with linuxthreads ;-)
msg143887 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-12 07:18
Oh, and BTW, for the "Backtrace stopped: frame did not save the PC", you might want to install the libc-dbg package. This might help in finding precisely where it's crashing.
msg143890 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-12 08:36
STINNER Victor <report@bugs.python.org> wrote:
> > Traceback with faulthandler disabled: ...
> 
> How did you disable faulthandler?

That was a run with all faulthandler references removed from regrtest.py.

But as I said in my previous mail, I also did a run using e91ad9669c08
but without compiling and linking faulthandler, so that _PyFaulthandler_Init()
wouldn't be called. This had the same result, so faulthandler is _not_ the cause
of this bug.

> > Version 9d658f000419, which is pre-faulthandler, runs without segfaults.
> 
> If it's a regression, you must try hg bisect! It is slow but it is fully automated! Try something like:
> 
> hg bisect -r
> hg bisect -g 9d658f000419
> hg bisect -b e91ad9669c08
> hg bisect -c 'make && ./python -m test test_urllib2_localnet test_robotparser test_nntplib'

If it were that easy! I can't isolate the bug. The only way I can reproduce it
is by running the whole test suite with various random seeds. Then it takes
about 6 hours until the crash occurs in one of those tests.

The whole test suite takes about 24 hours.

I could try to install libc-dbg though.
msg143952 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-13 10:51
The failure was introduced by issue #12655. I attach a minimal script
to reproduce the segfault.
msg143953 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-13 11:00
And here's a full backtrace of crash.py:


Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x400225f0 (LWP 633)]
0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
(gdb) bt
#0  0x40011d20 in __tls_get_addr () from /lib/ld-linux.so.2
#1  0x40035a14 in __h_errno_location () from /lib/libpthread.so.0
#2  0x40a788dc in __libc_res_nsearch () from /lib/libresolv.so.2
#3  0x40a66e9c in _nss_dns_gethostbyname3_r () from /lib/libnss_dns.so.2
#4  0x40a670ac in _nss_dns_gethostbyname2_r () from /lib/libnss_dns.so.2
#5  0x40180480 in gaih_inet () from /lib/libc.so.6
#6  0x40181da8 in getaddrinfo () from /lib/libc.so.6
#7  0x406084a4 in socket_getaddrinfo (self=0x405d7bcc, args=0x4089a8b4, 
    kwargs=0x0)
    at /home/user/mercurial-1.9.2/cpython/Modules/socketmodule.c:4787
#8  0x001ea384 in PyCFunction_Call (func=0x405da1f4, arg=0x4089a8b4, kw=0x0)
    at Objects/methodobject.c:84
#9  0x000a3634 in call_function (pp_stack=0xbeab7d1c, oparg=4)
    at Python/ceval.c:4000
#10 0x0009cab8 in PyEval_EvalFrameEx (f=0x407457b4, throwflag=0)
    at Python/ceval.c:2625
#11 0x000a0bfc in PyEval_EvalCodeEx (_co=0x405d6ab8, globals=0x40591a34, 
    locals=0x0, args=0x408884dc, argcount=2, kws=0x408884e4, kwcount=0, 
    defs=0x40512a20, defcount=2, kwdefs=0x0, closure=0x0)
    at Python/ceval.c:3375
#12 0x000a3cfc in fast_function (func=0x405e30e4, pp_stack=0xbeab8068, n=2, 
    na=2, nk=0) at Python/ceval.c:4098
#13 0x000a3838 in call_function (pp_stack=0xbeab8068, oparg=2)
    at Python/ceval.c:4021
#14 0x0009cab8 in PyEval_EvalFrameEx (f=0x40888374, throwflag=0)
    at Python/ceval.c:2625
#15 0x000a0bfc in PyEval_EvalCodeEx (_co=0x4089d5d8, globals=0x4088d854, 
    locals=0x0, args=0x404e2ac8, argcount=2, kws=0x405b43c8, kwcount=2, 
    defs=0x4098fbd0, defcount=6, kwdefs=0x0, closure=0x0)
    at Python/ceval.c:3375
#16 0x001c3060 in function_call (func=0x40a2dea4, arg=0x404e2ab4, 
    kw=0x409a98f4) at Objects/funcobject.c:629
#17 0x0017f1a0 in PyObject_Call (func=0x40a2dea4, arg=0x404e2ab4, 
    kw=0x409a98f4) at Objects/abstract.c:2149
#18 0x001a1a9c in method_call (func=0x40a2dea4, arg=0x404e2ab4, kw=0x409a98f4)
    at Objects/classobject.c:318
#19 0x0017f1a0 in PyObject_Call (func=0x4050b9d4, arg=0x404e2574, 
    kw=0x409a98f4) at Objects/abstract.c:2149
#20 0x0004a6c0 in slot_tp_init (self=0x405ae504, args=0x404e2574, 
    kwds=0x409a98f4) at Objects/typeobject.c:5431
#21 0x00037650 in type_call (type=0x40a31034, args=0x404e2574, kwds=0x409a98f4)
    at Objects/typeobject.c:691
#22 0x0017f1a0 in PyObject_Call (func=0x40a31034, arg=0x404e2574, 
    kw=0x409a98f4) at Objects/abstract.c:2149
#23 0x000a46bc in do_call (func=0x40a31034, pp_stack=0xbeab84f0, na=1, nk=2)
    at Python/ceval.c:4220
#24 0x000a3858 in call_function (pp_stack=0xbeab84f0, oparg=513)
    at Python/ceval.c:4023
#25 0x0009cab8 in PyEval_EvalFrameEx (f=0x40558544, throwflag=0)
    at Python/ceval.c:2625
#26 0x000a0bfc in PyEval_EvalCodeEx (_co=0x40479d28, globals=0x403d5034, 
    locals=0x403d5034, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, 
    defcount=0, kwdefs=0x0, closure=0x0) at Python/ceval.c:3375
#27 0x000916f4 in PyEval_EvalCode (co=0x40479d28, globals=0x403d5034, 
    locals=0x403d5034) at Python/ceval.c:770
#28 0x000e0cb4 in run_mod (mod=0x37c8f8, filename=0x405028c8 "crash.py", 
    globals=0x403d5034, locals=0x403d5034, flags=0xbeab8864, arena=0x2e5178)
    at Python/pythonrun.c:1793
#29 0x000e0a58 in PyRun_FileExFlags (fp=0x2ce260, 
    filename=0x405028c8 "crash.py", start=257, globals=0x403d5034, 
    locals=0x403d5034, closeit=1, flags=0xbeab8864) at Python/pythonrun.c:1750
#30 0x000debcc in PyRun_SimpleFileExFlags (fp=0x2ce260, 
    filename=0x405028c8 "crash.py", closeit=1, flags=0xbeab8864)
    at Python/pythonrun.c:1275
#31 0x000dde68 in PyRun_AnyFileExFlags (fp=0x2ce260, 
    filename=0x405028c8 "crash.py", closeit=1, flags=0xbeab8864)
    at Python/pythonrun.c:1046
#32 0x000ff984 in run_file (fp=0x2ce260, filename=0x401fe028, p_cf=0xbeab8864)
    at Modules/main.c:299
#33 0x00100780 in Py_Main (argc=2, argv=0x401fc028) at Modules/main.c:693
#34 0x0001a914 in main (argc=2, argv=0xbeab8994) at ./Modules/python.c:59
msg143959 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-09-13 12:04
> The failure was introduced by issue #12655

Wow, great job!

crash.py looks like a libc and/or kernel bug. Can you try glibc 2.14 (released on 2011-05-31)? You should first check whether it is a duplicate of http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453
msg143974 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-13 15:14
I wonder whether it is http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453.

The demo script from there crashes both on debian-arm and Ubuntu Lucid,
but this specific segfault only occurs on debian-arm.

Attached is a minimal C test case that only crashes on debian-arm
when sched_setaffinity() is called *and* the program is linked to
pthread:


$ gcc -Wall -W -O0 -g -o crash crash.c
$ ./crash
$
$ gcc -Wall -W -O0 -g -o crash crash.c -pthread
$ ./crash
Segmentation fault (core dumped)

# comment out: sched_setaffinity(0, size, cpusetp);

$ gcc -Wall -W -O0 -g -o crash crash.c -pthread
$ ./crash
$ 


On Ubuntu all three cases run fine. Perhaps this is a bug in
sched_setaffinity()?
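
The attachments crash.py and crash.c are not reproduced in this thread. A rough Python sketch of the same sequence could look like the following, assuming the `os.sched_getaffinity()` / `os.sched_setaffinity()` API as it shipped in Python 3.3 and later (CPU masks as sets of ints). The function name `affinity_then_resolve` and its defaults are hypothetical. The getaddrinfo() step is taken from the backtrace of crash.py above, not from the attachment itself.

```python
import os
import socket


def affinity_then_resolve(host="localhost", port=80):
    """Re-apply the current CPU mask, then resolve `host` via getaddrinfo()."""
    if hasattr(os, "sched_setaffinity"):   # only available on some platforms
        mask = os.sched_getaffinity(0)      # 0 means the calling process/thread
        os.sched_setaffinity(0, mask)       # the call that preceded the crash
    # On the affected box the segfault surfaced later, inside the resolver.
    return socket.getaddrinfo(host, port)
```

Calling `affinity_then_resolve()` on a healthy system should simply return a list of address records.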
msg143978 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-13 15:54
I think I got it: pthread_setaffinity_np() does not crash. 
 
`man sched_setaffinity` is slightly ambiguous, but there is this remark:

(If you are using the POSIX threads API, then use pthread_setaffinity_np(3) instead of sched_setaffinity().)


I'm attaching the non-crashing version.
msg143979 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-13 17:26
> I think I got it: pthread_setaffinity_np() does not crash.

Nice.
Out of curiosity, I just looked at the source code, and it just does sched_setaffinity(thread->tid), so you can do the same with sched_setaffinity(syscall(SYS_gettid)) for the current thread.
However, I don't think we should/could add this to the posix module: it expects a pthread_t instead of a PID, to which we don't have access.
Furthermore, even though we're linked with pthread, this should normally succeed - or at least not crash - when called from the main thread - and it does on my Debian squeeze box.
So I'd suggest closing this issue.
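
A sketch of the per-thread variant described above, in Python rather than C: `threading.get_native_id()` (Python 3.8+, the kernel thread ID, i.e. what gettid() returns on Linux) stands in for `syscall(SYS_gettid)`. The helper name `pin_current_thread` is an assumption for illustration, and the function returns None on platforms that lack either call.

```python
import os
import threading


def pin_current_thread(cpus=None):
    """Set the CPU affinity of the calling thread through its kernel thread ID.

    Returns the resulting CPU set, or None if the platform lacks the calls.
    """
    if not (hasattr(os, "sched_setaffinity")
            and hasattr(threading, "get_native_id")):
        return None
    tid = threading.get_native_id()       # kernel-level ID of this thread
    if cpus is None:
        cpus = os.sched_getaffinity(tid)  # default: keep the current mask
    os.sched_setaffinity(tid, cpus)       # the pid argument also accepts a tid
    return os.sched_getaffinity(tid)
```

This only touches the calling thread, so there is no pthread_t to pass around and no way to hand in a stale thread handle.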
msg143981 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-09-13 17:55
> However, I don't think we should/could add this to the posix module: 
> it expects a pthread_t instead of a PID, to which we don't have access.

We already have such a function:
http://docs.python.org/dev/library/signal.html#signal.pthread_kill

I added threading.get_ident() to easily get the thread identifier. In Python < 3.3, you can use threading.current_thread().ident.

It's not documented, but if you pass a random integer, signal.pthread_kill() does crash.
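
A minimal sketch of the safe case, assuming Python 3.3+ on a platform that provides `signal.pthread_kill()`: the ID comes from `threading.get_ident()` for the calling thread, so it is known to be valid, and signal number 0 only performs error checking without delivering anything. The helper name `current_thread_alive` is hypothetical.

```python
import signal
import threading


def current_thread_alive():
    """Probe the calling thread with signal 0 (no signal is delivered)."""
    if not hasattr(signal, "pthread_kill"):
        return None                        # not available on this platform
    ident = threading.get_ident()          # identifier of the calling thread
    signal.pthread_kill(ident, 0)          # error checking only
    return True
```

The crash mentioned above concerns arbitrary integers. This sketch deliberately never passes an ID that did not come from a live thread.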
msg143984 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-13 19:13
Charles-François Natali <report@bugs.python.org> wrote:
> Out of curiosity, I just looked at the source code, and it just does
> sched_setaffinity(thread->tid), so you can do the same with
> sched_setaffinity(syscall(SYS_gettid)) for the current thread.

sched_setaffinity(syscall(SYS_gettid), size, cpusetp) crashes, too.
This seems to be a violation of the man page, which states:

"The value returned from a call to gettid(2) can be passed in
 the argument pid."

Unless one uses a somewhat warped interpretation that linking
against pthread constitutes "using the POSIX threads API". That
would be the only loophole that would allow the crash.

> However, I don't think we should/could add this to the posix module:
> it expects a pthread_t instead of a PID, to which we don't have access.

If we have access (and as I understood from Victor's post we do):
pthread_getaffinity_np() also exists on FreeBSD, which would be
an advantage.

> So I'd suggest closing this issue.

I don't care strongly about using pthread_getaffinity_np(), but at least I'd
like to skip the scheduling sections on arm-linux if they don't work reliably.
msg143988 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-13 20:47
> If we have access (and as I understood from Victor's post we do):
> pthread_getaffinity_np() also exists on FreeBSD, which would be
> an advantage.

Yes, but I see several drawbacks:
- as noted by Victor, it's really easy to crash the interpreter by passing an invalid thread ID, which IMHO, should be avoided at all cost
- to be safe, we would need to have a different API depending on whether Python is built with threads or not (i.e. sched_setaffinity() without threads, and pthread_setaffinity_np() with threads)
- pthread_setaffinity_np() is really non-portable (it's guarded by __USE_GNU in my system's header)
- sched_setaffinity() seems to work fine on most systems even when linked with pthread

> I don't care strongly about using pthread_getaffinity_np(), but at least I'd
> like to skip the scheduling sections on arm-linux if they don't work reliably.

Sounds reasonable.
I guess you could use os.uname() or platform.machine().
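
One way such a skip could be written, as a sketch only: the predicate `on_arm_linux` and the test class are made-up names, and matching on an "arm" prefix in `platform.machine()` (which would cover the "armv5tejl" reported here) is an assumption about how broadly to skip.

```python
import os
import platform
import unittest


def on_arm_linux():
    """True on Linux machines whose machine string starts with 'arm'."""
    return platform.system() == "Linux" and platform.machine().startswith("arm")


class SchedAffinityTests(unittest.TestCase):

    @unittest.skipIf(on_arm_linux(),
                     "sched_setaffinity() is unreliable with pthread on this port")
    @unittest.skipUnless(hasattr(os, "sched_setaffinity"),
                         "requires os.sched_setaffinity")
    def test_roundtrip(self):
        # Re-apply the current mask and check that it is unchanged.
        mask = os.sched_getaffinity(0)
        os.sched_setaffinity(0, mask)
        self.assertEqual(os.sched_getaffinity(0), mask)
```

The test is reported as skipped rather than silently dropped, so the gap stays visible in buildbot logs.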
msg143992 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-09-13 23:10
> as noted by Victor, it's really easy to crash the interpreter
> by passing an invalid thread ID, which IMHO, should be avoided
> at all cost

Do you mean that signal.pthread_kill() should be removed? This function is very useful and solves some issues that cannot be solved otherwise. At the same time, I don't think that it's possible to work around the crashes. At least, I don't see how: pthread_kill(tid, 0) is supposed to check if tid exists, but it does crash...

> to be safe, we would need to have a different API depending
> on whether Python is built with threads or not
> (i.e. sched_setaffinity() without threads,
> and pthread_setaffinity_np())

We cannot use the same name for two different C functions. One expects a process identifier, whereas the other expects a thread identifier! If Python is compiled without threads, the function will not exist (like some modules and many other functions).

> pthread_setaffinity_np() is really non-portable
> (it's guarded by __USE_GNU in my system's header)

We can check it in configure. We already use some functions which are GNU extensions, like makedev(). Oh, os.makedev() availability is just not documented :-)

> sched_setaffinity() seems to work fine on most systems
> even when linked with pthread

Again, it looks like a libc/kernel bug. I don't think that Python can work around such an issue.

> I don't care strongly about using pthread_getaffinity_np()

I don't really care about pthread_getaffinity_np() :-) To add a new function, we need a use case and it should be requested. This issue is about a crash using sched_setaffinity(), not about pthread_getaffinity_np.

I don't know or need (), but the difference between sched_setaffinity and pthread_getaffinity_np is the same as the difference between sigprocmask() and pthread_sigmask(). I chose to expose only the latter because the behaviour of sigprocmask is undefined in a process using threads. The sched_setaffinity manual contains the sentence "If you are using the POSIX threads API, then use pthread_setaffinity_np(3) instead of sched_setaffinity()".

See also Portable Hardware Locality (hwloc):
http://www.open-mpi.org/projects/hwloc/
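
To illustrate the pthread_sigmask() side of that comparison, here is a small sketch using `signal.pthread_sigmask()` (Python 3.3+, Unix only), which acts on the calling thread's mask. The helper name `block_temporarily` and the choice of SIGUSR1 are assumptions for the example.

```python
import signal


def block_temporarily():
    """Block SIGUSR1 in the calling thread, confirm it, then restore the mask."""
    if not hasattr(signal, "pthread_sigmask"):
        return None                                   # e.g. on Windows
    signum = signal.SIGUSR1
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        # Blocking an empty set changes nothing and returns the current mask.
        current = signal.pthread_sigmask(signal.SIG_BLOCK, set())
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return signum in current
```

Restoring the saved mask in a `finally` clause keeps the example from leaking a blocked signal into the rest of the process.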
msg144009 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-14 05:19
> Do you mean that signal.pthread_kill() should be removed? This function is very useful and solves some issues that cannot be solved otherwise. At the same time, I don't think that it's possible to work around the crashes. At least, I don't see how: pthread_kill(tid, 0) is supposed to check if tid exists, but it does crash...

No, I'm not suggesting we remove it; it is useful.
As for the crashes, with glibc pthread_t is really a pointer, so
there's no way to check its validity beforehand. Even if we did check
the thread ID against the list of Python-created thread IDs (stored
in Thread._ident), this could still crash, because the ID becomes
invalid as soon as the thread terminates (all threads are started
detached). Furthermore, this wouldn't work for threads not created
by Python.

> We cannot use the same name for two different C functions. One expects a process identifier, whereas the other expects a thread identifier! If Python is compiled without threads, the function will not exist (like some modules and many other functions).
>

I know, that's why I said "different API": but I must admit it was
poorly worded ;-)
However, this wouldn't solve this particular problem: as long as we
expose sched_setaffinity(), it will crash as soon as someone passes
`0` or getpid() as PID.

>> pthread_setaffinity_np() is really non-portable
>> (it's guarded by __USE_GNU in my system's header)
>
> We can check it in configure. We already use some functions which are GNU extensions, like makedev(). Oh, os.makedev() availability is just not documented :-)

As I said, this wouldn't solve this problem. If someone deems it
necessary, we can open another issue for this feature request.

>> sched_setaffinity() seems to work fine on most systems
>> even when linked with pthread
>
> Again, it looks like a libc/kernel bug. I don't think that Python can work around such an issue.
>

Agreed.

> I don't know or need (), but the difference between sched_setaffinity and pthread_getaffinity_np is the same as the difference between sigprocmask() and pthread_sigmask(). I chose to expose only the latter because the behaviour of sigprocmask is undefined in a process using threads.

Exactly.
However, nothing prevents someone from using sigprocmask() in a
multithreaded process, the only difference is that it won't crash
(AFAICT).

So I suggest that we:
1) skip the problematic tests on ARM when built with threads to avoid segfaults
2) if someone wants pthread_getaffinity_np(), then we can still open a
separate feature request
msg144030 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-14 16:13
I'd prefer to disable the misbehaving functions entirely on arm.
With the patch this combination of tests now works:

  ./python -m test -uall test_posix test_nntplib


If you think the patch is good, I can run the whole test suite, too.
[I'd rather wait for review due to the slowness of the setup.]
msg144031 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-09-14 17:03
> I'd prefer to disable the misbehaving functions entirely on arm.

-10

If we start disabling features on platforms with partly bogus implementations, we might as well drop threading on OpenBSD, sendmsg() on OS-X, etc.

Furthermore, it's really just a libc bug, which might be fixed in a more recent version, or with another libc provider (eglibc, uclibc, etc.).
msg144174 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-17 06:51
I cannot reproduce the crash on:

Linux debian-armel 2.6.32-5-versatile #1 Wed Jan 12 23:05:11 UTC 2011 armv5tejl GNU/Linux


Since the old (arm) port is deprecated, I'm closing this.
History
Date                 User         Action  Args
2011-09-17 06:51:50  skrah        set     status: open -> closed; resolution: wont fix; messages: + msg144174; stage: test needed -> resolved
2011-09-14 17:03:06  neologix     set     messages: + msg144031
2011-09-14 16:13:53  skrah        set     files: + arm_setaffinity.diff; keywords: + patch; messages: + msg144030
2011-09-14 05:19:33  neologix     set     messages: + msg144009
2011-09-13 23:10:03  vstinner     set     messages: + msg143992
2011-09-13 20:47:09  neologix     set     messages: + msg143988
2011-09-13 19:13:59  skrah        set     messages: + msg143984
2011-09-13 17:55:49  vstinner     set     messages: + msg143981
2011-09-13 17:26:13  neologix     set     messages: + msg143979
2011-09-13 15:54:03  skrah        set     files: + pthread_nocrash.c; messages: + msg143978; title: armv5tejl: random segfaults in getaddrinfo() -> armv5tejl segfaults: sched_setaffinity() vs. pthread_setaffinity_np()
2011-09-13 15:14:51  skrah        set     files: + crash.c; messages: + msg143974
2011-09-13 12:04:12  vstinner     set     messages: + msg143959
2011-09-13 11:00:32  skrah        set     messages: + msg143953
2011-09-13 10:51:48  skrah        set     files: + crash.py; nosy: + benjamin.peterson; messages: + msg143952
2011-09-12 08:36:10  skrah        set     messages: + msg143890
2011-09-12 07:18:14  neologix     set     messages: + msg143887
2011-09-12 06:59:36  neologix     set     messages: + msg143886
2011-09-11 23:39:57  vstinner     set     messages: + msg143880
2011-09-11 21:06:28  skrah        set     messages: + msg143877
2011-09-11 15:07:04  neologix     set     messages: + msg143869
2011-09-11 13:53:24  skrah        set     nosy: + vstinner; messages: + msg143865
2011-09-11 12:34:27  meador.inge  set     nosy: + meador.inge
2011-09-11 09:47:31  neologix     set     messages: + msg143859
2011-09-11 09:23:40  skrah        set     messages: + msg143858
2011-09-10 15:05:15  skrah        set     messages: + msg143840
2011-09-10 09:56:50  neologix     set     messages: + msg143835
2011-09-09 19:19:21  skrah        set     messages: + msg143791
2011-09-09 16:55:45  neologix     set     nosy: + neologix; messages: + msg143767
2011-09-08 09:54:47  skrah        create