classification
Title: 2.7.8 backport of Issue1856 (avoid daemon thread problems at shutdown) breaks ceph
Type: behavior Stage: resolved
Components: Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Joe.Julian, benjamin.peterson, dmick, doko, neologix, pitrou, python-dev, vstinner
Priority: release blocker Keywords:

Created on 2014-07-11 21:06 by ned.deily, last changed 2014-12-03 00:55 by dmick. This issue is now closed.

Messages (17)
msg222795 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2014-07-11 21:06
doko in msg222768 of Issue1856:

http://tracker.ceph.com/issues/8797 reports that the backport to 2.7 causes a regression in ceph.
msg222800 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-11 21:57
Does anyone have a *simple* script to reproduce the regression?

The changeset 7741d0dd66ca looks good to me, Python 3 does the same thing since Python 3.2 (the new GIL).

Anyway, daemon threads are evil :-( Expecting them to exit cleanly automatically is not good. Last time I tried to improve code to cleanup Python at exit in Python 3.4, I also had a regression (just before the release of Python 3.4.0): see the issue #21788.
msg222801 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-07-11 22:00
> The changeset 7741d0dd66ca looks good to me, Python 3 does the same thing since Python 3.2 (the new GIL).

Oh, I forgot to say that the simplest way to fix the regression is to revert this commit... As I wrote modifying the code to cleanup Python at exit is risky, and it's maybe too late to do that in Python 2.7?

The risk of regression caused by 7741d0dd66ca is maybe higher than the benefit of the enhancement? :-(
msg227994 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-30 20:26
I would favour reverting. The crashiness of daemon threads is not new, so fixing it late in the 2.7 cycle wasn't really necessary IMO.
msg227995 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-30 20:28
As a sidenote, I wonder if the Ceph issue would also occur on 3.x. I guess we won't know until Ceph is ported (is it?).
msg227996 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-09-30 20:32
Note that this may very well be a bug in Ceph... Is any of the Ceph authors nosied here?
msg228022 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-09-30 22:55
@Benjamin: What do you think? Should we revert the changeset 7741d0dd66ca in Python 2.7?
msg228042 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2014-10-01 00:29
Antoine is probably right that we should revert. I wonder what the exact
problem with Ceph is, though.

On Tue, Sep 30, 2014, at 18:55, STINNER Victor wrote:
> 
> STINNER Victor added the comment:
> 
> @Benjamin: What do you think? Should we revert the changeset 7741d0dd66ca
> in Python 2.7?
> 
> ----------
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue21963>
> _______________________________________
msg228056 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-10-01 05:59
Agreed with Antoine and Benjamin.
msg228062 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-10-01 08:03
Oh by the way, I also prefer to revert the commit.
msg231408 - (view) Author: Dan Mick (dmick) Date: 2014-11-20 02:56
Hi; I'm the original author of the code in the Ceph CLI.  

The reason it does what it does is that the Python CLI calls into librados (Ceph, through ctypes) to connect to the cluster; that connection can block for various reasons, so it's spawned in a thread; after a timeout or ^C, we desire to exit, but if we've got a non-daemon thread stuck in Ceph, and no way to cancel it from the threading module, sys.exit() will also block waiting for the Ceph thread to finally give up.  If, however, we set that thread as a daemon thread, it doesn't block the sys.exit() (that being, I thought, the whole point of daemon threads).

I confess I don't fully understand the change in 7741d0dd66ca, but it does seem to have the side effect of not actually allowing exit while there are outstanding daemon threads not hitting Python.
msg231409 - (view) Author: Joe Julian (Joe.Julian) Date: 2014-11-20 03:02
I suspect the underlying problem is that the fix expects the daemon threads to hit a point where they try to acquire the GIL and that's not going to happen with these librados threads; they stay in librados.
msg231416 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-11-20 09:49
Thanks for the explanation.

> I confess I don't fully understand the change in 7741d0dd66ca, but it 
> does seem to have the side effect of not actually allowing exit while 
> there are outstanding daemon threads not hitting Python.

That's a bit weird, as that change doesn't introduce any wait. Are you sure there isn't something else happening, e.g. a lock is owned by the daemon thread and a non-daemon thread tries to acquire it?
msg231459 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-11-21 01:05
New changeset 4ceca79d1c63 by Antoine Pitrou in branch '2.7':
Issue #21963: backout issue #1856 patch (avoid crashes and lockups when
https://hg.python.org/cpython/rev/4ceca79d1c63
msg231461 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-11-21 01:06
I've reverted the original changeset. I still believe it is likely to be an issue on the ceph side, though, and the changes remain in 3.x.
msg231974 - (view) Author: Dan Mick (dmick) Date: 2014-12-02 03:36
So, finally got to a Fedora21 beta to try this out today; the immediate problem, as identified by Joe Julian, is the shutdown() call in the __del__ method of Rados.  

Presumably something about object destructors is clashing with starting new threads; the hang occurs with a gdb backtrace like so:

1710 in sem_wait () from /lib64/libpthread.so.0
#1  0x00007fad36fbce75 in PyThread_acquire_lock ()
   from /lib64/libpython2.7.so.1.0
#2  0x00007fad36fc0c82 in lock_PyThread_acquire_lock ()
   from /lib64/libpython2.7.so.1.0
#3  0x00007fad36f91ade in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#4  0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#5  0x00007fad36f90b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6  0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#7  0x00007fad36f90b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8  0x00007fad36f90c76 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fad36f90b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fad36f90c76 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#12 0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#13 0x00007fad36f1e6dc in function_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fad36ef98d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fad36f087e5 in instancemethod_call ()
   from /lib64/libpython2.7.so.1.0
#16 0x00007fad36ef98d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#17 0x00007fad36f8b557 in PyEval_CallObjectWithKeywords ()
   from /lib64/libpython2.7.so.1.0
#18 0x00007fad36f5082c in slot_tp_del () from /lib64/libpython2.7.so.1.0
#19 0x00007fad36f4b88a in subtype_dealloc () from /lib64/libpython2.7.so.1.0
#20 0x00007fad36f2e1ff in insertdict_by_entry ()
   from /lib64/libpython2.7.so.1.0
#21 0x00007fad36f2fcc0 in dict_set_item_by_hash_or_entry ()
   from /lib64/libpython2.7.so.1.0
#22 0x00007fad36f3467c in _PyModule_Clear () from /lib64/libpython2.7.so.1.0
#23 0x00007fad36fa1c5b in PyImport_Cleanup () from /lib64/libpython2.7.so.1.0
#24 0x00007fad36fad598 in Py_Finalize () from /lib64/libpython2.7.so.1.0
#25 0x00007fad36faced8 in Py_Exit () from /lib64/libpython2.7.so.1.0
#26 0x00007fad36fad01a in handle_system_exit.part ()
   from /lib64/libpython2.7.so.1.0
#27 0x00007fad36fad29d in PyErr_PrintEx () from /lib64/libpython2.7.so.1.0
#28 0x00007fad36fadea3 in PyRun_SimpleFileExFlags ()
   from /lib64/libpython2.7.so.1.0
#29 0x00007fad36fbf30a in Py_Main () from /lib64/libpython2.7.so.1.0
#30 0x00007fad361e6fe0 in __libc_start_main () from /lib64/libc.so.6
#31 0x000000000040071e in _start ()

I'm guessing something about the changed shutdown is causing the deadlock?...  Anyway, I'm pretty sure we can afford to remove that __del__ method, and that's another easily-applied workaround for the field.
msg232066 - (view) Author: Dan Mick (dmick) Date: 2014-12-03 00:55
Belaboring this a bit just in case what I learn helps with the interpreter change: 

seems like threading.Thread.start() is hanging in its normal "wait for start" code:

(gdb) py-bt
#4 Frame 0x25d4de0, for file /usr/lib64/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<thread.lock at remote 0x7fe6203c2e70>], release=<built-in method release of thread.lock object at remote 0x7fe6203c2e10>) at remote 0x7fe618b29110>, timeout=None, balancing=True, waiter=<thread.lock at remote 0x7fe6203c2e70>, saved_state=None)
    waiter.acquire()
#8 Frame 0x7fe618b06d38, for file /usr/lib64/python2.7/threading.py, line 623, in wait (self=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<thread.lock at remote 0x7fe6203c2e70>], release=<built-in method release of thread.lock object at remote 0x7fe6203c2e10>) at remote 0x7fe618b29110>) at remote 0x7fe6176a76d0>, timeout=None, balancing=True)
    self.__cond.wait(timeout, balancing)
#12 Frame 0x7fe6175917d0, for file /usr/lib64/python2.7/threading.py, line 753, in start (self=<RadosThread(_Thread__ident=140626100590336, target=<_FuncPtr(__name__='rados_shutdown') at remote 0x7fe617611bb0>, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2df0>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2df0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fe6203c2df0>) at remote 0x7fe618b29150>, _Thread__name='Thread-11', _Thread__daemonic=True, args=(<c_void_p at remote 0x7fe617694440>,), _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<thread.lock at remote 0x7fe6203c2e70>], release=<built-in method release of thread.lock object at remote 0x7fe6203c2...(truncated)
    self.__started.wait()
#15 Frame 0x24e0ae0, for file /usr/lib/python2.7/site-packages/rados.py, line 158, in run_in_thread (target=<_FuncPtr(__name__='rados_shutdown') at remote 0x7fe617611bb0>, args=(<c_void_p at remote 0x7fe617694440>,), timeout=0, sys=<module at remote 0x7fe6203efbb0>, interrupt=False, countdown=0, t=<RadosThread(_Thread__ident=140626100590336, target=<...>, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2df0>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2df0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fe6203c2df0>) at remote 0x7fe618b29150>, _Thread__name='Thread-11', _Thread__daemonic=True, args=(...), _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<th...(truncated)
    t.start()
#19 Frame 0x7fe61762e050, for file /usr/lib/python2.7/site-packages/rados.py, line 252, in shutdown (self=<Rados(cluster=<c_void_p at remote 0x7fe617694440>, state='connected', librados=<CDLL(_FuncPtr=<_ctypes.PyCFuncPtrType at remote 0x25c5a50>, rados_mon_command=<_FuncPtr(__name__='rados_mon_command') at remote 0x7fe617611a10>, rados_buffer_free=<_FuncPtr(__name__='rados_buffer_free') at remote 0x7fe617611ae0>, rados_conf_set=<_FuncPtr(__name__='rados_conf_set') at remote 0x7fe6176117a0>, _handle=39608352, _name='librados.so.2', rados_create2=<_FuncPtr(__name__='rados_create2') at remote 0x7fe6176116d0>, rados_conf_read_file=<_FuncPtr(__name__='rados_conf_read_file') at remote 0x7fe617611870>, rados_connect=<_FuncPtr(__name__='rados_connect') at remote 0x7fe617611940>, rados_shutdown=<_FuncPtr(__name__='rados_shutdown') at remote 0x7fe617611bb0>) at remote 0x7fe617572a10>, rados_id=None) at remote 0x7fe6175729d0>)
    run_in_thread(self.librados.rados_shutdown, (self.cluster,))
#22 Frame 0x7fe617683c20, for file /usr/lib/python2.7/site-packages/rados.py, line 264, in __del__ (self=<Rados(cluster=<c_void_p at remote 0x7fe617694440>, state='connected', librados=<CDLL(_FuncPtr=<_ctypes.PyCFuncPtrType at remote 0x25c5a50>, rados_mon_command=<_FuncPtr(__name__='rados_mon_command') at remote 0x7fe617611a10>, rados_buffer_free=<_FuncPtr(__name__='rados_buffer_free') at remote 0x7fe617611ae0>, rados_conf_set=<_FuncPtr(__name__='rados_conf_set') at remote 0x7fe6176117a0>, _handle=39608352, _name='librados.so.2', rados_create2=<_FuncPtr(__name__='rados_create2') at remote 0x7fe6176116d0>, rados_conf_read_file=<_FuncPtr(__name__='rados_conf_read_file') at remote 0x7fe617611870>, rados_connect=<_FuncPtr(__name__='rados_connect') at remote 0x7fe617611940>, rados_shutdown=<_FuncPtr(__name__='rados_shutdown') at remote 0x7fe617611bb0>) at remote 0x7fe617572a10>, rados_id=None) at remote 0x7fe6175729d0>)
    self.shutdown()
History
Date User Action Args
2014-12-03 00:55:16dmicksetmessages: + msg232066
2014-12-02 03:36:49dmicksetmessages: + msg231974
2014-11-21 01:06:32pitrousetstatus: open -> closed
type: behavior
messages: + msg231461

resolution: fixed
stage: needs patch -> resolved
2014-11-21 01:05:23python-devsetnosy: + python-dev
messages: + msg231459
2014-11-20 09:49:52pitrousetmessages: + msg231416
2014-11-20 03:02:26Joe.Juliansetmessages: + msg231409
2014-11-20 02:56:03dmicksetnosy: + dmick
messages: + msg231408
2014-11-20 02:34:25Joe.Juliansetnosy: + Joe.Julian
2014-10-01 08:03:00vstinnersetmessages: + msg228062
2014-10-01 05:59:07neologixsetmessages: + msg228056
2014-10-01 00:29:38benjamin.petersonsetmessages: + msg228042
2014-09-30 22:55:18vstinnersetmessages: + msg228022
2014-09-30 20:32:48pitrousetmessages: + msg227996
2014-09-30 20:28:55pitrousetmessages: + msg227995
2014-09-30 20:26:07pitrousetnosy: + pitrou
messages: + msg227994
2014-07-11 22:00:17vstinnersetmessages: + msg222801
2014-07-11 21:57:20vstinnersetnosy: + vstinner, neologix
messages: + msg222800
2014-07-11 21:09:55ned.deilysetnosy: - ned.deily
2014-07-11 21:06:05ned.deilycreate