Created on 2010-10-06 11:25 by asksol, last changed 2012-06-11 15:28 by sbt. This issue is now closed.
|multiprocessing-worker-poll.patch||asksol, 2010-10-06 11:25|
|msg118062 - (view)||Author: Ask Solem (asksol)||Date: 2010-10-06 11:25|
While working on an "autoscaling" (yes, people call it that...) feature for Celery, I noticed that the processes created by the _handle_workers thread doesn't always work. I have reproduced this in general, by just using the maxtasksperchild feature and letting the workers terminate themselves so this seems to have always been an issue (just not easy to reproduce unless workers are created with some frequency) I'm not quite sure of the reason yet, but I finally managed to track it down to the workers being stuck while receiving from the queue. The patch attached seems to resolve the issue by polling the queue before trying to receive. I know this is short, I may have some more data later.
|msg122262 - (view)||Author: ysj.ray (ysj.ray)||Date: 2010-11-24 07:11|
Could you give an example code which can reproduce this issue?
|msg155556 - (view)||Author: Sean Reifschneider (jafo) *||Date: 2012-03-13 02:04|
The attached patch does change the semantics somewhat, but I don't fully understand how much. In particular: It changes the "get()" call to be turned into "get(timeout=1.0)" if inqueue doesn't have a _reader attribute. In the case that inqueue doesn't have a _reader attribute, and "inqueue._reader.poll(timeout)" is false, "get()" isn't called at all. It introduces a continue. I'd want Jesse to pronounce on this.
|msg162412 - (view)||Author: Richard Oudkerk (sbt) *||Date: 2012-06-06 14:12|
It is not clear to me how to reproduce the bug. When you say "letting the workers terminate themselves" do mean calling sys.exit() or os._exit() in the submitted task? Are you trying to get the result of a task which caused the worker to exit? I'm not sure how the patch would change the current behaviour. The following seems to work for me: import sys, os import multiprocessing as mp if __name__ == '__main__': p = mp.Pool(4, maxtasksperchild=5) results =  for i in range(100): if i % 10 == 0: results.append(p.apply_async(sys.exit)) else: results.append(p.apply_async(os.getpid)) for i, res in enumerate(results): if i % 10 != 0: print(res.get()) else: pass # trying res.get() would block forever
|msg162467 - (view)||Author: Ask Solem (asksol)||Date: 2012-06-07 09:09|
Well, I still don't know exactly why restarting the socket read made it work, but the patch solved an issue where newly started pool processes would be stuck in socket read forever (happening to maybe 1/500 new processes) This and a dozen other pool related fixes are in my billiard fork of multiprocessing, e.g. what you describe in your comment: # trying res.get() would block forever works in billiard, where res.get() will raise WorkerLostError in that case. https://github.com/celery/billiard/ Earlier commit history for the pool can be found in Celery: https://github.com/ask/celery/commits/2.5/celery/concurrency/processes/pool.py My eventual goal is to merge these fixes back into Python, but except for people using Python 3.x, they would have to use billiard for quite some time anyway, so I don't feel in a hurry. I think this issue can be closed, the worker handler is simply borked and we could open up a new issue deciding how to fix it (merging billiard.Pool or someting else). (btw, Richard, you're sbt? I was trying to find your real name to give you credit for the no_execv patch in billiard)
|msg162495 - (view)||Author: Richard Oudkerk (sbt) *||Date: 2012-06-07 20:20|
> I think this issue can be closed, the worker handler is simply borked and > we could open up a new issue deciding how to fix it (merging billiard.Pool > or someting else). OK. I am not sure which option under "Resolution" should be chosen. "Later"? > (btw, Richard, you're sbt? Yes. > I was trying to find your real name to give you credit for the no_execv > patch in billiard) The execv stuff certainly won't go in by Py3.3. There has not been consensus that adding it is a good idea. (I also have the unit tests passing with a "fork server": the server process is forked at the beginning of the program and then forked children of the server process are started on request. It is about 10 times faster then using execv, and almost as fast as simple forking.)
|msg162497 - (view)||Author: Ask Solem (asksol)||Date: 2012-06-07 20:57|
Later works, or just close it. I can open up a new issue to merge the improvements in billiard later. > The execv stuff certainly won't go in by Py3.3. There has not been > consensus that adding it is a good idea. > (I also have the unit tests passing with a "fork server": the server >process is forked at the beginning of the program and then forked >children of the server process are started on request. It is about 10 >times faster then using execv, and almost as fast as simple forking.) Ah, a working 'fork server' would be just as good. Btw, Billiard now supports running Pool without threads, using epoll/kqueue/select instead. So Celery uses that when it can be nonblocking, and execv when it can't. It performs way better without threads, and in addition shutdown + replacing worker processes is much more responsive. Changing the default Pool is not going to happen, but ncluding a simple select() based Pool would be possible, and then it could also easily work with Twisted, Eventlet, Gevent, etc. (especially now that the Connection is rewritten in pure python).
|msg162502 - (view)||Author: Richard Oudkerk (sbt) *||Date: 2012-06-07 21:13|
> Ah, a working 'fork server' would be just as good. Only problem is that it depends on fd passing which is apparently broken on MacOSX. > Btw, Billiard now supports running Pool without threads, using > epoll/kqueue/select instead. So Celery uses that when it can be > nonblocking, and execv when it can't. It performs way better without > threads, and in addition shutdown + replacing worker processes is much > more responsive. If it were not for Windows I would have tried to avoid using threads.
|2012-06-11 15:28:01||sbt||set||status: open -> closed|
stage: patch review -> resolved
|2012-06-07 21:13:42||sbt||set||messages: + msg162502|
|2012-06-07 20:57:48||asksol||set||messages: + msg162497|
|2012-06-07 20:20:37||sbt||set||messages: + msg162495|
|2012-06-07 09:09:22||asksol||set||messages: + msg162467|
|2012-06-06 14:12:46||sbt||set||messages: + msg162412|
|2012-03-13 02:04:01||jafo||set||assignee: jnoller|
messages: + msg155556
nosy: + jafo, jnoller
|2011-06-12 18:36:09||terry.reedy||set||versions: - Python 3.1|
messages: + msg122262