Issue 17969: multiprocessing crash on exit

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62169

classification

Title:	multiprocessing crash on exit
Type:	crash	Stage:
Components:		Versions:	Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	kristjan.jonsson, neologix, pitrou, sbt
Priority:	normal	Keywords:

Created on 2013-05-13 11:31 by kristjan.jonsson, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (15)
msg189123 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2013-05-13 11:31
We have observed this crash with some frequency when running our compilation scripts using multiprocessing.Pool() By analysing the crashes, this is what is happening: 1) The Pool has a "daemon" thread managing the pool. 2) the worker is asleep, waiting for the GIL 3) The main thread exits. The system starts its shutdown. During PyInterpreterState_Clear, it has cleared among other things the sys dict. During this, it clears an old traceback. The traceback contains a multiprocessing.connection object. 4) The connection object is cleared. It it contains this code: Py_BEGIN_ALLOW_THREADS CLOSE(self->handle); Py_END_ALLOW_THREADS 5) The sleeping daemon thread is woken up and starts prancing around. Upon calling sys.exc_clear() it crashes, since the tstate->interp->sysdict == NULL. I have a workaround in place in our codebase: static void connection_dealloc(ConnectionObject* self) { if (self->weakreflist != NULL) PyObject_ClearWeakRefs((PyObject)self); if (self->handle != INVALID_HANDLE_VALUE) { / CCP Change. Cannot release threads here, because this * deallocation may be running during process shutdown, and * releaseing a daemon thread will cause a crash Py_BEGIN_ALLOW_THREADS CLOSE(self->handle); Py_END_ALLOW_THREADS */ CLOSE(self->handle); } PyObject_Del(self); } In general, deallocators should have no side effects, I think. Releaseing the GIL is certainly a side effect. I realize that process shutdown is a delicate matter. One delicate thing is that we cannot allow worker threads to run anymore. I see no general mechanism for ensuring this, but surely at least not releasing the GIL for deallocators is a first step?
msg189139 - (view)	Author: Richard Oudkerk (sbt) *	Date: 2013-05-13 14:41
> In general, deallocators should have no side effects, I think. > Releaseing the GIL is certainly a side effect. Notice that socket and file objects also release the GIL when being deallocated. At least for sockets close() can block (e.g. if you you use the SO_LINGER option). I am not sure whether we can ignore the possibility for connection objects. > I realize that process shutdown is a delicate matter. One delicate > thing is that we cannot allow worker threads to run anymore. I see no > general mechanism for ensuring this, but surely at least not releasing > the GIL for deallocators is a first step? I agree that shutdown matters are delicate, particularly when daemon threads are involved. In fact I'm starting to agree with Antoine that daemon threads are evil and should be avoided wherever possible. P.S. I think in Python 3.x this thread switching is stopped (by setting _Py_Finalizing to something non-NULL) before PyInterpreterState_Clear() is run.
msg189145 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2013-05-13 15:28
I think that socket.close() is the exception rather than the rule here. What kind of handle is this? It can't be a socket, since that would require closesocket. Also, even though an IO call _can_ block, that doesn't mean that we _must_ release the gil for the duration. I´m not very familiar with multiprocessing, I'm mainly trying to enhance robustness with our build tools here. Would an alternative fix, making the worker thread a non-daemon, be hard to do?
msg189146 - (view)	Author: Charles-François Natali (neologix) *	Date: 2013-05-13 15:36
> Also, even though an IO call _can_ block, that doesn't mean that > we _must_ release the gil for the duration. Yes, otherwise some people will complain when the whole interpreter is stuck while a socket/NFS file handle/whatever is shutdown. This problem isn't specific to multiprocessing: for example, if a daemon thread's thread state is cleared as part of the shutdown procedure, and one of the TLS object releases the GIL (e.g. a database connection), you'll have the same kind of problem. AFAICT, it's "solved" in Python 3 because, as Richard said, you can't acquire the GIL once the interpreter is shutting down. Daemon threads are really tricky, since they can wake-up while the interpreter is shutting down. So they should be avoided as much as possible. We can also try to think some more about a way to avoid/limit this type of issue, but it's not trivial...
msg189148 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-05-13 15:46
Yes, I think it's too delicate to change in 2.7 right now. As Charles-François said, daemon threads should be much less crashy in 3.3.
msg189150 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2013-05-13 16:14
Well, knowing that they crash less in 3.3 doesn't really fix the problem now, does it? We can consider two options then: 1) A multiprocessing specific fix. Removing this handle close gil release (which is superfluous, since these calls aren't blocking in any real sense) will certainly remove _this_ instance of the crash. An alternative is to make this worker thread non-daemon. That shouldn't be too hard and shouldn't have any other side effects. 2) A general daemon thread fix. For example, removing the GIL at the start of the shutdown process will make it impossible to release it. We can do this by setting interpreter_lock to NULL. I don't see the point of having 2.7 in bug fix mode if we can't fix bugs.
msg189152 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-05-13 17:27
Le lundi 13 mai 2013 à 16:14 +0000, Kristján Valur Jónsson a écrit : > I don't see the point of having 2.7 in bug fix mode if we can't fix > bugs. Delicate bug fixes may entail regressions, and we've had enough of them in 2.7.4. You've already patched your own Python anyway.
msg189154 - (view)	Author: Richard Oudkerk (sbt) *	Date: 2013-05-13 17:32
> We can consider two options then: > 1) A multiprocessing specific fix. Removing this handle close gil > release (which is superfluous, since these calls aren't blocking in any > real sense) will certainly remove _this_ instance of the crash. An > alternative is to make this worker thread non-daemon. That shouldn't > be too hard and shouldn't have any other side effects. Have you tried doing p = Pool() try: ... finally: p.close() # or perhaps p.terminate() p.join() Actually, making the worker thread non-daemonic is not so simple. This is because there is no way to interrupt a background thread which is blocked doing IO (unless you use multiplexing which is not possible with non-overlapped pipes). One can try to unblock such background threads by reading/writing messages from/to the result/task pipe, but it not straightforward.
msg189206 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2013-05-14 11:31
Catching regressions is what we have the regression tests for. If it is not in caught by the regression tests, then it is not a regression, by definition. Bug fix mode is for fixing bugs, IMHO. Yes, I have patched my local version. The reason I am here having this discussion is that I want to contribute to help others that are using 2.7 for multiprocessing. Others will have the same problem, and probably have, already. Anyway, I cannot easily reproduce the problem, it happens regularly in a live production environment. My patch is a conjecture based patch. But I actually had a different thought about how to handle this. The particular manifestation of this error occurs because an exception state is being cleared from the system dict. The dict contains a frame and there is where the connection object is being held. The problem can be avoided, by clearing the exception right at the start of the PyInterpreterState_Clear(), thus flushing out most side effects right at the start. I'll prepare a patch for you to review.
msg189208 - (view)	Author: Richard Oudkerk (sbt) *	Date: 2013-05-14 11:46
Kristjan, could you confirm whether joining the pool explicitly before shutdown (in the way I suggested earlier) fixes the problem. I think it should -- at shutdown you won't switch to a thread if it has already been joined.
msg189209 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-05-14 11:49
> Catching regressions is what we have the regression tests for. If it > is not in caught by the regression tests, then it is not a > regression, by definition. Call it what you want :-) The bottom line is that we'll release a 2.7.5 soon because of breakage introduced in 2.7.4. Whether or not it's a "regression" according to some platonic definition isn't very important here. > The problem can be avoided, by clearing the exception right at the > start of the PyInterpreterState_Clear(), thus flushing out most side > effects right at the start. > > I'll prepare a patch for you to review. I think this kind of stuff is too tricky to risk breaking 2.7 once again with it. If 3.x has the same issue, then a patch is helpful, otherwise not IMHO.
msg189211 - (view)	Author: Charles-François Natali (neologix) *	Date: 2013-05-14 12:04
> Catching regressions is what we have the regression tests for. If it is not in caught by the regression tests, then it is not a regression, by definition. You do realize this sentence doesn't make sense, do you?
msg189219 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2013-05-14 13:56
Richard, I'll review implement your change. It is a bit tricky to test this, since I can only tell after a few days if the particular (rare) problem has been fixed. The crash is a rare one but consistently happens with some probability we use multiprocessing to perform certain batch processing. About regressions: The term means that problems, previously fixed, become broken again. All fixes should reasonably have appropriate tests in the regression test suite. I don't know what particular "regressions" you have been battling since 2.7.4. I hope they are all testable by now. I only know that there is a exit problem with multiprocessing that statistically happens and is problemematic. It caused use to stop using multiprocessing and parallel jobs until I could diagnose the problem. I have suggested several fixes to this particular problem. I have two favorites, which would also help in the general case: 1) calling sys.exc_clear() at the beginning of the python finalize part, to throw out any lingering traceback and frame objects. This will cause side effects such as gil-twiddling to happen early, and without consequence. sys.exc_clear() is a safe api and used throughout the code, an extra call just prior to exit should be a very safe operation. 2) Turn off multi-threading at the start of python exit, by setting interpreter_lock to NULL. Again, this is a no-brainer, since the NULL value is already well defined and works well. It will cause all subsequent GIL releases to be no-ops. I personally favor the second one. It makes no sense to allow other threads to run after finalization has started and forcing them to stay dead is only prudent. 2.7 is not frozen, and it is in bug fix mode. If a solid bug fix to an actual problem is proposed, we should embrace it, and deal with any eventual fallout appropriate as the price of doing business. 2.7 is a product that is alive and well, used by millions of people and thousands of businesses and will remain so for quite some time. I know that most of the core devs have moved on to greener pastures, but I for one would like to stay loyal to this workhorse of a product and continue to make necessary improvements to it.
msg189221 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2013-05-14 14:03
Richard, reading the multiprocessing.py code, it appears that your suggestion of closign and joining the Pool() object should also do the trick. Doing that will certainly also fix this particular case. I'll implement that in our local application code. I'm still not happy that 2.7 has a potential exit crash if a daemon thread is left hanging :(
msg215511 - (view)	Author: Kristján Valur Jónsson (kristjan.jonsson) *	Date: 2014-04-04 11:50
Closing this as won-t fix. Exiting with running threads is a can of worms.

History
Date	User	Action	Args
2022-04-11 14:57:45	admin	set	github: 62169
2014-04-04 11:58:38	kristjan.jonsson	set	resolution: wont fix
2014-04-04 11:50:12	kristjan.jonsson	set	status: open -> closed messages: + msg215511
2013-05-14 14:03:53	kristjan.jonsson	set	messages: + msg189221
2013-05-14 13:56:06	kristjan.jonsson	set	messages: + msg189219
2013-05-14 12:04:21	neologix	set	messages: + msg189211
2013-05-14 11:49:41	pitrou	set	messages: + msg189209
2013-05-14 11:46:06	sbt	set	messages: + msg189208
2013-05-14 11:31:30	kristjan.jonsson	set	messages: + msg189206
2013-05-13 17:32:35	sbt	set	messages: + msg189154
2013-05-13 17:27:39	pitrou	set	messages: + msg189152
2013-05-13 16:14:04	kristjan.jonsson	set	messages: + msg189150
2013-05-13 15:46:24	pitrou	set	messages: + msg189148
2013-05-13 15:36:54	neologix	set	nosy: + neologix, pitrou messages: + msg189146
2013-05-13 15:28:12	kristjan.jonsson	set	messages: + msg189145
2013-05-13 14:41:11	sbt	set	messages: + msg189139
2013-05-13 11:38:06	sbt	set	nosy: + sbt
2013-05-13 11:31:34	kristjan.jonsson	create