classification
Title: pyro4 with more than 15 threads often crashes 2.7.12
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: eric.smith, irmen, methane, pwp333, vstinner
Priority: normal Keywords: patch

Created on 2016-11-12 08:31 by pwp333, last changed 2019-02-15 09:56 by methane. This issue is now closed.

Files
File name Uploaded Description Edit
core_python2.7.10.gz pwp333, 2016-11-15 22:07
28673-reproduce.py methane, 2016-11-24 11:58
28673-reproduce2.py methane, 2016-11-25 03:07
finish-deadlock.py methane, 2016-11-25 13:39
28673-fix-deadlock.patch methane, 2016-11-27 07:38 review
Messages (19)
msg280640 - (view) Author: Michael Hu (pwp333) Date: 2016-11-12 08:31
When using pyro4 with more than 15 threads, python 2.7.12 cores frequently (>60% time)

Note "v" (op in frame 1) in frame 2 is NULL which has some value in frame 3. So some other thread cleans it.

=== gdb ===
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python /etc/remoting/remoting_agent.zip run --system'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  PyDict_SetItem (op=op@entry=0x0, key=key@entry=0x7f0aa11cd420, value=value@entry=0x920ce0 <_PyExc_AttributeError.12478>) at ../Objects/dictobject.c:832

warning: Source file is more recent than executable.
832    if (!PyDict_Check(op)) {
(gdb) bt
#0  PyDict_SetItem (op=op@entry=0x0, key=key@entry='exc_type', value=value@entry=<type at remote 0x920ce0>) at ../Objects/dictobject.c:832
#1  0x0000000000529e86 in PyDict_SetItemString (item=<type at remote 0x920ce0>, key=0x61d02a "exc_type", v=0x0) at ../Objects/dictobject.c:2469
#2  PySys_SetObject (name=name@entry=0x61d02a "exc_type", v=v@entry=<type at remote 0x920ce0>) at ../Python/sysmodule.c:83
#3  0x000000000055acc0 in set_exc_info (tb=<traceback at remote 0x7f0aa1232878>, value=exceptions.AttributeError("'NoneType' object has no attribute 'SHUT_RDWR'",),
    type=<type at remote 0x920ce0>, tstate=0x95ad10) at ../Python/ceval.c:3736
#4  PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a9e929b60, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 463, in close (self=<SocketConnection at remote 0x7f0a9e924510>),
    throwflag=throwflag@entry=0) at ../Python/ceval.c:3251
#5  0x0000000000559c27 in fast_function (nk=<optimized out>, na=<optimized out>, n=1, pp_stack=0x7ffce435ab00, func=<function at remote 0x7f0a9f6c0230>) at ../Python/ceval.c:4435
#6  call_function (oparg=<optimized out>, pp_stack=0x7ffce435ab00) at ../Python/ceval.c:4370
#7  PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a40000b50, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0a9e924510>),
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2987
#8  0x000000000056aa8a in PyEval_EvalCodeEx (closure=<optimized out>, defcount=<optimized out>, defs=0x0, kwcount=<optimized out>, kws=<optimized out>, argcount=1073744720, args=<optimized out>,
    locals=0x0, globals=<optimized out>, co=<optimized out>) at ../Python/ceval.c:3582
#9  function_call (func=func@entry=<function at remote 0x7f0a9f6c00c8>, arg=arg@entry=(<SocketConnection at remote 0x7f0a9e924510>,), kw=kw@entry=0x0) at ../Objects/funcobject.c:523
#10 0x00000000004be724 in PyObject_Call (kw=0x0, arg=(<SocketConnection at remote 0x7f0a9e924510>,), func=<function at remote 0x7f0a9f6c00c8>) at ../Objects/abstract.c:2546
#11 instancemethod_call.8803 (func=<function at remote 0x7f0a9f6c00c8>, arg=(<SocketConnection at remote 0x7f0a9e924510>,), kw=0x0) at ../Objects/classobject.c:2602
#12 0x00000000004c4683 in PyObject_Call (kw=0x0, arg=(), func=<instancemethod at remote 0x7f0a9f7b5b90>) at ../Objects/abstract.c:2546
#13 PyEval_CallObjectWithKeywords (kw=0x0, arg=(), func=<instancemethod at remote 0x7f0a9f7b5b90>) at ../Python/ceval.c:4219
#14 slot_tp_del.25647 (self=self@entry=<SocketConnection at remote 0x7f0a9e924510>) at ../Objects/typeobject.c:5844
#15 0x00000000004c5880 in subtype_dealloc.25650 (self=<SocketConnection at remote 0x7f0a9e924510>) at ../Objects/typeobject.c:1002
#16 0x00000000005336cf in dict_dealloc.18423 (mp=0x7f0a9e93c398) at ../Objects/dictobject.c:1040
#17 0x00000000004c56e7 in subtype_dealloc.25650 (
    self=<ClientConnectionJob(daemon=<Daemon(_Daemon__mustshutdown=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f0a9f759790>, acquire=<built-in method acquire of thread.lock object at remote 0x7f0a9f759790>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f0a9f759790>) at remote 0x7f0a9e9241d0>) at remote 0x7f0a9e924190>, objectsById={'root': <RootService(_scheduler=<Scheduler(_lock=<thread.lock at remote 0x7f0aa1211d50>, _operations={'cookie_9de0e8d93ed84452a1ce8cc92268979e': <Operation(_exception=None, _result='obj_450f0972022e41b2b7c1f29f49bdca83', _action=<function at remote 0x7f0a9f7b89b0>, _lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f0a9f759370>, _RLock__count=0) at remote 0x7f0a9e936190>, _started_at=<float at remote 0xab90f8>, cookie='cookie_9de0e8d93ed84452a1ce8cc92268979e', _complete=<_Condition(_Condition__lock=...(truncated)) at ../Objects/typeobject.c:1035
#18 0x0000000000534b93 in frame_dealloc.14921 (f=Frame 0x7f0a9f6d49d8, for file /etc/remoting/remoting_agent.zip/Pyro4/socketserver/threadpool.py, line 39, in run ())
    at ../Objects/frameobject.c:458
#19 0x00000000004d86c6 in tb_dealloc.46270 (tb=0x7f0aa1278f38) at ../Python/traceback.c:28
#20 0x00000000004d87e1 in tb_dealloc.46270 (tb=0x7f0aa1278ea8) at ../Python/traceback.c:27
#21 0x00000000005336cf in dict_dealloc.18423 (mp=0x7f0aa1277280) at ../Objects/dictobject.c:1040
#22 0x00000000005a52db in PyInterpreterState_Clear (interp=0x95ac80) at ../Python/pystate.c:111
#23 0x0000000000423aea in Py_Finalize () at ../Python/pythonrun.c:500
#24 0x00000000004672c5 in Py_Main (argc=<optimized out>, argv=0x7ffce435b2d8) at ../Modules/main.c:665
#25 0x00007f0aa0aa7f45 in __libc_start_main (main=0x4672f4 <main>, argc=4, argv=0x7ffce435b2d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffce435b2c8)
    at libc-start.c:287
#26 0x000000000057c581 in _start ()
====


Google shows similar trace has been reported back in 2009 for python 2.6
https://bugs.launchpad.net/ubuntu/+source/system-config-printer/+bug/478071
msg280664 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-11-12 17:30
Can you provide a short sample that causes this error? Without a way to reproduce it, there's not a lot we can do.
msg280750 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-14 11:04
Traceback tells:

#22 0x00000000005a52db in PyInterpreterState_Clear (interp=0x95ac80) at ../Python/pystate.c:111
    Py_CLEAR(interp->sysdict);

While cleaning up interp->sysdict...

#19 0x00000000004d86c6 in tb_dealloc.46270 (tb=0x7f0aa1278f38) at ../Python/traceback.c:28
#20 0x00000000004d87e1 in tb_dealloc.46270 (tb=0x7f0aa1278ea8) at ../Python/traceback.c:27

Removing some traceback from sysdict. Maybe, sys.exc_traceback.

#17 0x00000000004c56e7 in subtype_dealloc.25650 (
    self=<ClientConnectionJob(daemon=<Daemon(_Daemon__mustshutdown=<_Event(_Verbose__verbose=False, _Event__flag=True, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f0a9f759790>, acquire=<built-in method acquire of thread.lock object at remote 0x7f0a9f759790>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7f0a9f759790>) at remote 0x7f0a9e9241d0>) at remote 0x7f0a9e924190>, objectsById={'root': <RootService(_scheduler=<Scheduler(_lock=<thread.lock at remote 0x7f0aa1211d50>, _operations={'cookie_9de0e8d93ed84452a1ce8cc92268979e': <Operation(_exception=None, _result='obj_450f0972022e41b2b7c1f29f49bdca83', _action=<function at remote 0x7f0a9f7b89b0>, _lock=<_RLock(_Verbose__verbose=False, _RLock__owner=None, _RLock__block=<thread.lock at remote 0x7f0a9f759370>, _RLock__count=0) at remote 0x7f0a9e936190>, _started_at=<float at remote 0xab90f8>, cookie='cookie_9de0e8d93ed84452a1ce8cc92268979e', _complete=<_Condition(_Condition__lock=...(truncated)) at ../Objects/typeobject.c:1035
#18 0x0000000000534b93 in frame_dealloc.14921 (f=Frame 0x7f0a9f6d49d8, for file /etc/remoting/remoting_agent.zip/Pyro4/socketserver/threadpool.py, line 39, in run ())
    at ../Objects/frameobject.c:458

Removing frame object owned by the traceback.

...

#7  PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a40000b50, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0a9e924510>),
    throwflag=throwflag@entry=0) at ../Python/ceval.c:2987
#4  PyEval_EvalFrameEx (f=f@entry=Frame 0x7f0a9e929b60, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 463, in close (self=<SocketConnection at remote 0x7f0a9e924510>),
    throwflag=throwflag@entry=0) at ../Python/ceval.c:3251

Calling SocketConnection.__del__.  It calls SocketConnection.close.

#3  0x000000000055acc0 in set_exc_info (tb=<traceback at remote 0x7f0aa1232878>, value=exceptions.AttributeError("'NoneType' object has no attribute 'SHUT_RDWR'",),
    type=<type at remote 0x920ce0>, tstate=0x95ad10) at ../Python/ceval.c:3736

SocketConnection.close raise AttributeError.

#2  PySys_SetObject (name=name@entry=0x61d02a "exc_type", v=v@entry=<type at remote 0x920ce0>) at ../Python/sysmodule.c:83

Trying to register the exception to sys.

#0  PyDict_SetItem (op=op@entry=0x0, key=key@entry='exc_type', value=value@entry=<type at remote 0x920ce0>) at ../Objects/dictobject.c:832

But interp->sysdict is NULL already.
msg280896 - (view) Author: Michael Hu (pwp333) Date: 2016-11-15 22:07
Core is uploaded for python 2.7.10 to assist debugging.
msg280897 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-15 22:11
Try to get the Python traceback of thread on the crash: try the faulthandler module.
msg280909 - (view) Author: Michael Hu (pwp333) Date: 2016-11-16 02:31
(gdb) py-bt
#4 Frame 0x7f0ab7a2db60, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 463, in close (self=<SocketConnection at remote 0x7f0ab7a24190>)
    self.sock.shutdown(socket.SHUT_RDWR)
#7 Frame 0x7f0ab0001760, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0ab7a24190>)
    self.close()
(gdb) py-list
 458        def recv(self, size):
 459            return receiveData(self.sock, size)
 460
 461        def close(self):
 462            try:
>463                self.sock.shutdown(socket.SHUT_RDWR)
 464            except (OSError, socket.error):
 465                pass
 466            try:
 467                self.sock.close()
 468            except AttributeError:
(gdb) f 7
#7  PyEval_EvalFrameEx (
    f=f@entry=Frame 0x7f0ab0001760, for file /etc/remoting/remoting_agent.zip/Pyro4/socketutil.py, line 453, in __del__ (self=<SocketConnection at remote 0x7f0ab7a24190>), throwflag=throwflag@entry=0) at ../Python/ceval.c:2681
2681	in ../Python/ceval.c
(gdb) py-list
 448        def __init__(self, sock, objectId=None):
 449            self.sock = sock
 450            self.objectId = objectId
 451
 452        def __del__(self):
>453            self.close()
 454
 455        def send(self, data):
 456            sendData(self.sock, data)
 457
 458        def recv(self, size):
(gdb)
msg280911 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-16 04:51
Can you check backtrace of main thread?
msg280921 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-16 07:13
Ok but can you also try faulthandler to see all frames of all threads?
msg280922 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-16 07:15
By the way, you can try your application on Python 3.6 using
PYTHONMALLOC=debug env var.

Or retry your application on Python 2.7 recompiled using ./configure
--with-pydebug.

I would like to know if Python sees a buffer overflow.
msg281067 - (view) Author: Michael Hu (pwp333) Date: 2016-11-17 23:39
Could you please look into the core (attached into this bug) to figure out? This issue does not happen always. We are tight on resource to reproduce this.
msg281579 - (view) Author: Irmen de Jong (irmen) (Python triager) Date: 2016-11-23 19:00
Due to lack of example code to reproduce the issue, and because I'm mildly interested in this bug because it mentions Pyro4 (because I'm the author of that) I've tried to crash my system myself using Pyro4 and a simple torture test but it trucked on just fine. 
Pyro4 is not doing any "strange" things as far as I am aware. It does have its own (simple) thread pool of regular Python threads that are handling incoming proxy connections.
(Had Pyro4 been doing weird things I suppose Python itself still should never core dump on the user but rather raise a regular exception if something was wrong)
msg281630 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-24 11:58
I can't get same C traceback yet.
But attached script segfaults by different C traceback.

I think this issue is dup of #1856
msg281665 - (view) Author: Irmen de Jong (irmen) (Python triager) Date: 2016-11-24 22:28
The 28673-reproduce.py didn't crash on any of the systems I've tried it on. Are you sure it is complete? It looks like a part is missing.
msg281676 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-25 03:07
I'm sorry, I had uploaded script with wrong parameter.
Additionally, this script fails by chance.

$ while true; do python2 28673-reproduce2.py || break; echo -n .; done
.....Segmentation fault (core dumped)
msg281680 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-25 06:52
Python 3 would have #21963 too. If we can fix it, we can fix this issue
too.
msg281718 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-25 13:39
This script can cause deadlock on Python 3.6.

1. Other threads can register traceback to sys module after sys module dict is cleaned up.
2. In Py_Finalize, __del__ method of arbitrary object in the traceback can be called.
3. The __del__ method calls threading.Thread.start()
4. Thread.start() waits self.__started event which is set by Thread.__bootstrap
5. While finalizing, threads other than main thread exits when getting GIL. So Thread.__bootstrap() won't be called.
6. Thread.start() in main thread waits self.__started event forever.

I suggest thread_PyThread_start_new_thread() raise RuntimeError("can't start thread: finalizing")
msg281721 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-25 14:25
> I suggest thread_PyThread_start_new_thread() raise RuntimeError("can't start thread: finalizing")

Where *exactly*? Py_FinalizeEx() first calls
wait_for_thread_shutdown() and *then* sets _Py_Finalizing to tstate.

Does wait_for_thread_shutdown() complete on this bug?

Note: It's not the first time that I noticed a deadlock around this
code :-/ I recall a silly bug when you press CTRL+c while
wait_for_thread_shutdown() is running.

See for example threading_shutdown_interrupted.py attached to the
issue #20526. Run it and quickly press CTRL+c. Sometime, it does crash
Python with a segfault...
---
haypo@selma$ ./python threading_shutdown_interrupted.py
..........................................................................................................................................................................................................Exception
ignored in: <module 'threading' from
'/home/haypo/prog/python/default/Lib/threading.py'>
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/threading.py", line 1290,
in _shutdown
    t.join()
  File "/home/haypo/prog/python/default/Lib/threading.py", line 1056, in join
.    self._wait_for_tstate_lock()
  File "/home/haypo/prog/python/default/Lib/threading.py", line 1072,
in _wait_for_tstate_lock
...    elif lock.acquire(block, timeout):
  File "threading_shutdown_interrupted.py", line 29, in killer
..    raise KeyboardInterrupt()
KeyboardInterrupt:
^CSegmentation fault (core dumped)
---

Python finalization is a very fragile and complex task :-/
msg281811 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-11-27 07:38
> Where *exactly*?

See attached patch.
But python with this patch can deadlock in other state. Main thread wait GIL
but no other living threads have GIL. It seems caused by other issue around finalization and multithreading.

> Py_FinalizeEx() first calls wait_for_thread_shutdown() and *then* sets _Py_Finalizing to tstate.
> Does wait_for_thread_shutdown() complete on this bug?

Yes. wait_for_thread_shutdown() waits only daemon threads.
After _Py_Finalizing = tstate, main thread calls many __del__ methods
while shutdown. If one of them starts new thread, this deadlock happens.
msg335594 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-02-15 09:56
I think this issue is fixed by #21963.
History
Date User Action Args
2019-02-15 09:56:24methanesetstatus: open -> closed
resolution: fixed
messages: + msg335594

stage: test needed -> resolved
2016-11-27 07:38:24methanesetfiles: + 28673-fix-deadlock.patch
keywords: + patch
messages: + msg281811
2016-11-25 14:25:22vstinnersetmessages: + msg281721
2016-11-25 13:39:14methanesetfiles: + finish-deadlock.py

messages: + msg281718
2016-11-25 06:52:40methanesetmessages: + msg281680
2016-11-25 03:07:22methanesetfiles: + 28673-reproduce2.py

messages: + msg281676
2016-11-24 22:28:24irmensetmessages: + msg281665
2016-11-24 11:58:59methanesetfiles: + 28673-reproduce.py

messages: + msg281630
2016-11-23 19:00:53irmensetnosy: + irmen
messages: + msg281579
2016-11-18 17:35:55terry.reedysettitle: When using pyro4 with more than 15 threads, python 2.7.12 cores frequently -> pyro4 with more than 15 threads often crashes 2.7.12
stage: test needed
2016-11-17 23:39:05pwp333setmessages: + msg281067
2016-11-16 07:15:23vstinnersetmessages: + msg280922
2016-11-16 07:13:30vstinnersetmessages: + msg280921
2016-11-16 04:51:16methanesetmessages: + msg280911
2016-11-16 02:31:06pwp333setmessages: + msg280909
2016-11-15 22:11:00vstinnersetnosy: + vstinner
messages: + msg280897
2016-11-15 22:07:44pwp333setfiles: + core_python2.7.10.gz

messages: + msg280896
2016-11-14 11:04:36methanesetnosy: + methane
messages: + msg280750
2016-11-12 17:30:22eric.smithsetnosy: + eric.smith
messages: + msg280664
2016-11-12 08:31:38pwp333create