
2.7.8 backport of Issue1856 (avoid daemon thread problems at shutdown) breaks ceph #66162

Closed
ned-deily opened this issue Jul 11, 2014 · 17 comments
Labels
release-blocker type-bug An unexpected behavior, bug, or error

Comments

@ned-deily
Member

BPO 21963
Nosy @doko42, @pitrou, @vstinner, @benjaminp

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2014-11-21.01:06:32.064>
created_at = <Date 2014-07-11.21:06:05.854>
labels = ['type-bug', 'release-blocker']
title = '2.7.8 backport of Issue1856 (avoid daemon thread problems at shutdown) breaks ceph'
updated_at = <Date 2014-12-03.00:55:16.514>
user = 'https://github.com/ned-deily'

bugs.python.org fields:

activity = <Date 2014-12-03.00:55:16.514>
actor = 'dmick'
assignee = 'none'
closed = True
closed_date = <Date 2014-11-21.01:06:32.064>
closer = 'pitrou'
components = []
creation = <Date 2014-07-11.21:06:05.854>
creator = 'ned.deily'
dependencies = []
files = []
hgrepos = []
issue_num = 21963
keywords = []
message_count = 17.0
messages = ['222795', '222800', '222801', '227994', '227995', '227996', '228022', '228042', '228056', '228062', '231408', '231409', '231416', '231459', '231461', '231974', '232066']
nosy_count = 8.0
nosy_names = ['doko', 'pitrou', 'vstinner', 'benjamin.peterson', 'neologix', 'python-dev', 'Joe.Julian', 'dmick']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue21963'
versions = ['Python 2.7']

@ned-deily
Member Author

doko in msg222768 of bpo-1856:

> http://tracker.ceph.com/issues/8797 reports that the backport to 2.7 causes a regression in ceph.

@vstinner
Member

Does anyone have a *simple* script to reproduce the regression?

The changeset 7741d0dd66ca looks good to me, Python 3 does the same thing since Python 3.2 (the new GIL).

Anyway, daemon threads are evil :-( Expecting them to exit cleanly automatically is not good. The last time I tried to improve the code that cleans up Python at exit, in Python 3.4, I also introduced a regression (just before the release of Python 3.4.0): see issue bpo-21788.

@vstinner
Member

> The changeset 7741d0dd66ca looks good to me, Python 3 does the same thing since Python 3.2 (the new GIL).

Oh, I forgot to say that the simplest way to fix the regression is to revert this commit... As I wrote, modifying the code that cleans up Python at exit is risky, and it's maybe too late to do that in Python 2.7?

The risk of regressions caused by 7741d0dd66ca is maybe higher than the benefit of the enhancement? :-(

@pitrou
Member

pitrou commented Sep 30, 2014

I would favour reverting. The crashiness of daemon threads is not new, so fixing it late in the 2.7 cycle wasn't really necessary IMO.

@pitrou
Member

pitrou commented Sep 30, 2014

As a sidenote, I wonder if the Ceph issue would also occur on 3.x. I guess we won't know until Ceph is ported (is it?).

@pitrou
Member

pitrou commented Sep 30, 2014

Note that this may very well be a bug in Ceph... Are any of the Ceph authors nosied here?

@vstinner
Member

@benjamin: What do you think? Should we revert the changeset 7741d0dd66ca in Python 2.7?

@benjaminp
Contributor

Antoine is probably right that we should revert. I wonder what the exact problem with Ceph is, though.

On Tue, Sep 30, 2014, at 18:55, STINNER Victor wrote:

> @benjamin: What do you think? Should we revert the changeset 7741d0dd66ca in Python 2.7?


@neologix
Mannequin

neologix mannequin commented Oct 1, 2014

Agreed with Antoine and Benjamin.

@vstinner
Member

vstinner commented Oct 1, 2014

Oh by the way, I also prefer to revert the commit.

@dmick
Mannequin

dmick mannequin commented Nov 20, 2014

Hi; I'm the original author of the code in the Ceph CLI.

The reason it does what it does: the Python CLI calls into librados (Ceph, through ctypes) to connect to the cluster. That connection can block for various reasons, so it's spawned in a thread. After a timeout or ^C, we want to exit, but if we've got a non-daemon thread stuck in Ceph, and no way to cancel it from the threading module, sys.exit() will also block waiting for the Ceph thread to finally give up. If, however, we set that thread as a daemon thread, it doesn't block sys.exit() (that being, I thought, the whole point of daemon threads).

I confess I don't fully understand the change in 7741d0dd66ca, but it does seem to have the side effect of not actually allowing exit while there are outstanding daemon threads not hitting Python.
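
To make the pattern concrete, here is a minimal sketch of the idea described above (not the actual rados.py code; libc's sleep() stands in for a blocking librados call, "libc.so.6" assumes Linux, and run_in_daemon_thread is a made-up helper name):

```python
import ctypes
import threading

libc = ctypes.CDLL("libc.so.6")  # stand-in for librados; assumes Linux

def run_in_daemon_thread(func, args, timeout):
    """Run a potentially blocking C call on a daemon thread, waiting at most `timeout` seconds."""
    result = []

    def target():
        result.append(func(*args))

    t = threading.Thread(target=target)
    t.daemon = True      # a non-daemon thread here would keep sys.exit() waiting
    t.start()
    t.join(timeout)      # give up waiting after the timeout (or on ^C)
    if t.is_alive():
        return None      # the call is still stuck inside the C library
    return result[0]

# Waits at most 2 seconds for a C call that would block for 30,
# then lets the interpreter exit with the daemon thread still blocked.
print(run_in_daemon_thread(libc.sleep, (30,), 2.0))
```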

@JoeJulian
Mannequin

JoeJulian mannequin commented Nov 20, 2014

I suspect the underlying problem is that the fix expects the daemon threads to hit a point where they try to acquire the GIL and that's not going to happen with these librados threads; they stay in librados.
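
If that reading is right, the shape of the problem is roughly the following (a speculative sketch, not a confirmed reproducer; libc's sleep() stands in for a librados call and "libc.so.6" assumes Linux): the daemon thread releases the GIL for the duration of the ctypes call and never runs Python code again, so a shutdown mechanism that waits for the thread to come back for the GIL has nothing to hook into.

```python
import ctypes
import threading

libc = ctypes.CDLL("libc.so.6")  # stand-in for librados; assumes Linux

# The daemon thread drops the GIL while inside the C call and never executes
# Python bytecode again, so it never tries to re-acquire the GIL.
t = threading.Thread(target=libc.sleep, args=(3600,))
t.daemon = True
t.start()

# The main thread falls off the end here with the daemon thread still inside
# the C library; what happens next depends on how interpreter shutdown treats
# daemon threads that never come back for the GIL.
```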

@pitrou
Member

pitrou commented Nov 20, 2014

Thanks for the explanation.

> I confess I don't fully understand the change in 7741d0dd66ca, but it does seem to have the side effect of not actually allowing exit while there are outstanding daemon threads not hitting Python.

That's a bit weird, as that change doesn't introduce any wait. Are you sure there isn't something else happening, e.g. a lock is owned by the daemon thread and a non-daemon thread tries to acquire it?
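
For reference, the scenario Antoine describes would look roughly like this (a hypothetical sketch, not taken from the ceph code; run as-is it hangs on the final acquire):

```python
import threading
import time

lock = threading.Lock()

def worker():
    with lock:            # the daemon thread takes the lock...
        time.sleep(3600)  # ...and holds it essentially forever

t = threading.Thread(target=worker)
t.daemon = True
t.start()
time.sleep(0.1)           # let the daemon thread win the race for the lock

# If cleanup code running in the main (non-daemon) thread does this, it blocks
# on a lock owned by a daemon thread that will never release it:
lock.acquire()
```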

@python-dev
Mannequin

python-dev mannequin commented Nov 21, 2014

New changeset 4ceca79d1c63 by Antoine Pitrou in branch '2.7':
Issue bpo-21963: backout issue bpo-1856 patch (avoid crashes and lockups when
https://hg.python.org/cpython/rev/4ceca79d1c63

@pitrou
Member

pitrou commented Nov 21, 2014

I've reverted the original changeset. I still believe it is likely to be an issue on the ceph side, though, and the changes remain in 3.x.

@pitrou pitrou closed this as completed Nov 21, 2014
@pitrou pitrou added the type-bug An unexpected behavior, bug, or error label Nov 21, 2014
@dmick
Mannequin

dmick mannequin commented Dec 2, 2014

So, I finally got to a Fedora 21 beta to try this out today; the immediate problem, as identified by Joe Julian, is the shutdown() call in the __del__ method of Rados.

Presumably something about object destructors is clashing with starting new threads; the hang occurs with a gdb backtrace like so:

#0 ...1710 in sem_wait () from /lib64/libpthread.so.0
#1 0x00007fad36fbce75 in PyThread_acquire_lock ()
from /lib64/libpython2.7.so.1.0
#2 0x00007fad36fc0c82 in lock_PyThread_acquire_lock ()
from /lib64/libpython2.7.so.1.0
#3 0x00007fad36f91ade in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#4 0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#5 0x00007fad36f90b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6 0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#7 0x00007fad36f90b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8 0x00007fad36f90c76 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9 0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fad36f90b74 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fad36f90c76 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#12 0x00007fad36f92490 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#13 0x00007fad36f1e6dc in function_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fad36ef98d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fad36f087e5 in instancemethod_call ()
from /lib64/libpython2.7.so.1.0
#16 0x00007fad36ef98d3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#17 0x00007fad36f8b557 in PyEval_CallObjectWithKeywords ()
from /lib64/libpython2.7.so.1.0
#18 0x00007fad36f5082c in slot_tp_del () from /lib64/libpython2.7.so.1.0
#19 0x00007fad36f4b88a in subtype_dealloc () from /lib64/libpython2.7.so.1.0
#20 0x00007fad36f2e1ff in insertdict_by_entry ()
from /lib64/libpython2.7.so.1.0
#21 0x00007fad36f2fcc0 in dict_set_item_by_hash_or_entry ()
from /lib64/libpython2.7.so.1.0
#22 0x00007fad36f3467c in _PyModule_Clear () from /lib64/libpython2.7.so.1.0
#23 0x00007fad36fa1c5b in PyImport_Cleanup () from /lib64/libpython2.7.so.1.0
#24 0x00007fad36fad598 in Py_Finalize () from /lib64/libpython2.7.so.1.0
#25 0x00007fad36faced8 in Py_Exit () from /lib64/libpython2.7.so.1.0
#26 0x00007fad36fad01a in handle_system_exit.part ()
from /lib64/libpython2.7.so.1.0
#27 0x00007fad36fad29d in PyErr_PrintEx () from /lib64/libpython2.7.so.1.0
#28 0x00007fad36fadea3 in PyRun_SimpleFileExFlags ()
from /lib64/libpython2.7.so.1.0
#29 0x00007fad36fbf30a in Py_Main () from /lib64/libpython2.7.so.1.0
#30 0x00007fad361e6fe0 in __libc_start_main () from /lib64/libc.so.6
#31 0x000000000040071e in _start ()

I'm guessing something about the changed shutdown is causing the deadlock?... Anyway, I'm pretty sure we can afford to remove that __del__ method, and that's another easily-applied workaround for the field.
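
As a sketch of that workaround at the application level (hypothetical code; it only assumes the rados.Rados class with the connect()/shutdown() methods visible in the backtrace above, plus the usual conffile argument), cleanup is done explicitly instead of from __del__ during interpreter exit:

```python
from contextlib import contextmanager

from rados import Rados  # assumes the ceph rados.py bindings are installed

@contextmanager
def connected_cluster(conffile):
    # Hypothetical helper: connect up front and shut down explicitly, so no
    # cleanup has to run from __del__ while the interpreter is finalizing.
    cluster = Rados(conffile=conffile)
    cluster.connect()
    try:
        yield cluster
    finally:
        cluster.shutdown()  # deterministic, happens before interpreter exit

# Example usage:
# with connected_cluster('/etc/ceph/ceph.conf') as cluster:
#     ...issue commands against the cluster...
```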

@dmick
Mannequin

dmick mannequin commented Dec 3, 2014

Belaboring this a bit just in case what I learn helps with the interpreter change:

It seems like threading.Thread.start() is hanging in its normal "wait for start" code:

(gdb) py-bt
#4 Frame 0x25d4de0, for file /usr/lib64/python2.7/threading.py, line 339, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<thread.lock at remote 0x7fe6203c2e70>], release=<built-in method release of thread.lock object at remote 0x7fe6203c2e10>) at remote 0x7fe618b29110>, timeout=None, balancing=True, waiter=<thread.lock at remote 0x7fe6203c2e70>, saved_state=None)
waiter.acquire()
#8 Frame 0x7fe618b06d38, for file /usr/lib64/python2.7/threading.py, line 623, in wait (self=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<thread.lock at remote 0x7fe6203c2e70>], release=<built-in method release of thread.lock object at remote 0x7fe6203c2e10>) at remote 0x7fe618b29110>) at remote 0x7fe6176a76d0>, timeout=None, balancing=True)
self.__cond.wait(timeout, balancing)
#12 Frame 0x7fe6175917d0, for file /usr/lib64/python2.7/threading.py, line 753, in start (self=<RadosThread(_Thread__ident=140626100590336, target=<_FuncPtr(name='rados_shutdown') at remote 0x7fe617611bb0>, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2df0>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2df0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fe6203c2df0>) at remote 0x7fe618b29150>, _Thread__name='Thread-11', _Thread__daemonic=True, args=(<c_void_p at remote 0x7fe617694440>,), _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<thread.lock at remote 0x7fe6203c2e70>], release=<built-in method release of thread.lock object at remote 0x7fe6203c2...(truncated)
self.__started.wait()
#15 Frame 0x24e0ae0, for file /usr/lib/python2.7/site-packages/rados.py, line 158, in run_in_thread (target=<_FuncPtr(name='rados_shutdown') at remote 0x7fe617611bb0>, args=(<c_void_p at remote 0x7fe617694440>,), timeout=0, sys=<module at remote 0x7fe6203efbb0>, interrupt=False, countdown=0, t=<RadosThread(_Thread__ident=140626100590336, target=<...>, _Thread__block=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2df0>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2df0>, _Condition__waiters=[], release=<built-in method release of thread.lock object at remote 0x7fe6203c2df0>) at remote 0x7fe618b29150>, _Thread__name='Thread-11', _Thread__daemonic=True, args=(...), _Thread__started=<_Event(_Verbose__verbose=False, _Event__flag=False, _Event__cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7fe6203c2e10>, acquire=<built-in method acquire of thread.lock object at remote 0x7fe6203c2e10>, _Condition__waiters=[<th...(truncated)
t.start()
#19 Frame 0x7fe61762e050, for file /usr/lib/python2.7/site-packages/rados.py, line 252, in shutdown (self=<Rados(cluster=<c_void_p at remote 0x7fe617694440>, state='connected', librados=<CDLL(_FuncPtr=<_ctypes.PyCFuncPtrType at remote 0x25c5a50>, rados_mon_command=<_FuncPtr(name='rados_mon_command') at remote 0x7fe617611a10>, rados_buffer_free=<_FuncPtr(name='rados_buffer_free') at remote 0x7fe617611ae0>, rados_conf_set=<_FuncPtr(name='rados_conf_set') at remote 0x7fe6176117a0>, _handle=39608352, _name='librados.so.2', rados_create2=<_FuncPtr(name='rados_create2') at remote 0x7fe6176116d0>, rados_conf_read_file=<_FuncPtr(name='rados_conf_read_file') at remote 0x7fe617611870>, rados_connect=<_FuncPtr(name='rados_connect') at remote 0x7fe617611940>, rados_shutdown=<_FuncPtr(name='rados_shutdown') at remote 0x7fe617611bb0>) at remote 0x7fe617572a10>, rados_id=None) at remote 0x7fe6175729d0>)
run_in_thread(self.librados.rados_shutdown, (self.cluster,))
#22 Frame 0x7fe617683c20, for file /usr/lib/python2.7/site-packages/rados.py, line 264, in __del__
(self=<Rados(cluster=<c_void_p at remote 0x7fe617694440>, state='connected', librados=<CDLL(_FuncPtr=<_ctypes.PyCFuncPtrType at remote 0x25c5a50>, rados_mon_command=<_FuncPtr(name='rados_mon_command') at remote 0x7fe617611a10>, rados_buffer_free=<_FuncPtr(name='rados_buffer_free') at remote 0x7fe617611ae0>, rados_conf_set=<_FuncPtr(name='rados_conf_set') at remote 0x7fe6176117a0>, _handle=39608352, _name='librados.so.2', rados_create2=<_FuncPtr(name='rados_create2') at remote 0x7fe6176116d0>, rados_conf_read_file=<_FuncPtr(name='rados_conf_read_file') at remote 0x7fe617611870>, rados_connect=<_FuncPtr(name='rados_connect') at remote 0x7fe617611940>, rados_shutdown=<_FuncPtr(name='rados_shutdown') at remote 0x7fe617611bb0>) at remote 0x7fe617572a10>, rados_id=None) at remote 0x7fe6175729d0>)
self.shutdown()
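
Boiled down, the backtrace above corresponds to a pattern like the following (a simplified sketch, not the rados.py code): __del__ starts a fresh thread, and Thread.start() waits on the thread's "started" event; if the new thread is killed the moment it tries to take the GIL during finalization, that event is never set and start() waits forever.

```python
import threading
import time

class Handle(object):
    # Simplified stand-in for rados.Rados: the cleanup call is pushed onto a
    # fresh daemon thread from __del__, mirroring
    # run_in_thread(self.librados.rados_shutdown, (self.cluster,)).
    def _cleanup(self):
        time.sleep(0.1)

    def __del__(self):
        t = threading.Thread(target=self._cleanup)
        t.daemon = True
        t.start()   # internally waits on self.__started.wait(), as in the py-bt
        t.join(1.0)

h = Handle()
# If `h` is only collected during module teardown (Py_Finalize -> _PyModule_Clear,
# frames #22-#24 in the gdb trace), the new thread may never get to run Python
# code, so start() can hang exactly where the py-bt shows.
```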

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022