New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2.7.8 backport of Issue1856 (avoid daemon thread problems at shutdown) breaks ceph #66162
Comments
doko in msg222768 of bpo-1856: http://tracker.ceph.com/issues/8797 reports that the backport to 2.7 causes a regression in ceph. |
Does anyone have a *simple* script to reproduce the regression? The changeset 7741d0dd66ca looks good to me, Python 3 does the same thing since Python 3.2 (the new GIL). Anyway, daemon threads are evil :-( Expecting them to exit cleanly automatically is not good. Last time I tried to improve code to cleanup Python at exit in Python 3.4, I also had a regression (just before the release of Python 3.4.0): see the issue bpo-21788. |
Oh, I forgot to say that the simplest way to fix the regression is to revert this commit... As I wrote modifying the code to cleanup Python at exit is risky, and it's maybe too late to do that in Python 2.7? The risk of regression caused by 7741d0dd66ca is maybe higher than the benefit of the enhancement? :-( |
I would favour reverting. The crashiness of daemon threads is not new, so fixing it late in the 2.7 cycle wasn't really necessary IMO. |
As a sidenote, I wonder if the Ceph issue would also occur on 3.x. I guess we won't know until Ceph is ported (is it?). |
Note that this may very well be a bug in Ceph... Is any of the Ceph authors nosied here? |
@benjamin: What do you think? Should we revert the changeset 7741d0dd66ca in Python 2.7? |
Antoine is probably right that we should revert. I wonder what the exact On Tue, Sep 30, 2014, at 18:55, STINNER Victor wrote:
|
Agreed with Antoine and Benjamin. |
Oh by the way, I also prefer to revert the commit. |
Hi; I'm the original author of the code in the Ceph CLI. The reason it does what it does is that the Python CLI calls into librados (Ceph, through ctypes) to connect to the cluster; that connection can block for various reasons, so it's spawned in a thread; after a timeout or ^C, we desire to exit, but if we've got a non-daemon thread stuck in Ceph, and no way to cancel it from the threading module, sys.exit() will also block waiting for the Ceph thread to finally give up. If, however, we set that thread as a daemon thread, it doesn't block the sys.exit() (that being, I thought, the whole point of daemon threads). I confess I don't fully understand the change in 7741d0dd66ca, but it does seem to have the side effect of not actually allowing exit while there are outstanding daemon threads not hitting Python. |
I suspect the underlying problem is that the fix expects the daemon threads to hit a point where they try to acquire the GIL and that's not going to happen with these librados threads; they stay in librados. |
Thanks for the explanation.
That's a bit weird, as that change doesn't introduce any wait. Are you sure there isn't something else happening, e.g. a lock is owned by the daemon thread and a non-daemon thread tries to acquire it? |
New changeset 4ceca79d1c63 by Antoine Pitrou in branch '2.7': |
I've reverted the original changeset. I still believe it is likely to be an issue on the ceph side, though, and the changes remain in 3.x. |
So, finally got to a Fedora21 beta to try this out today; the immediate problem, as identified by Joe Julian, is the shutdown() call in the __del__ method of Rados. Presumably something about object destructors is clashing with starting new threads; the hang occurs with a gdb backtrace like so: 1710 in sem_wait () from /lib64/libpthread.so.0 I'm guessing something about the changed shutdown is causing the deadlock?... Anyway, I'm pretty sure we can afford to remove that __del__ method, and that's another easily-applied workaround for the field. |
Belaboring this a bit just in case what I learn helps with the interpreter change: seems like threading.Thread.start() is hanging in its normal "wait for start" code: (gdb) py-bt |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: