classification
Title: multiprocessing.pool processes started by worker handler stops working
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: closed Resolution: later
Dependencies: Superseder:
Assigned To: jnoller Nosy List: asksol, jafo, jnoller, nirai, sbt, ysj.ray
Priority: critical Keywords: needs review, patch

Created on 2010-10-06 11:25 by asksol, last changed 2012-06-11 15:28 by sbt. This issue is now closed.

Files
File name Uploaded Description Edit
multiprocessing-worker-poll.patch asksol, 2010-10-06 11:25
Messages (8)
msg118062 - (view) Author: Ask Solem (asksol) (Python committer) Date: 2010-10-06 11:25
While working on an "autoscaling" (yes, people call it that...) feature for Celery, I noticed that the processes created by the _handle_workers thread doesn't always work.  I have reproduced this in general, by just using the maxtasksperchild feature and letting the workers terminate themselves so this seems to have always been an issue (just not easy to reproduce unless workers are created with some frequency)

I'm not quite sure of the reason yet, but I finally managed to track it down to the workers being stuck while receiving from the queue.

The patch attached seems to resolve the issue by polling the queue before trying to receive.

I know this is short, I may have some more data later.
msg122262 - (view) Author: ysj.ray (ysj.ray) Date: 2010-11-24 07:11
Could you give an example code which can reproduce this issue?
msg155556 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2012-03-13 02:04
The attached patch does change the semantics somewhat, but I don't fully understand how much.  In particular:

It changes the "get()" call to be turned into "get(timeout=1.0)" if inqueue doesn't have a _reader attribute.
In the case that inqueue doesn't have a _reader attribute, and "inqueue._reader.poll(timeout)" is false, "get()" isn't called at all.
It introduces a continue.

I'd want Jesse to pronounce on this.
msg162412 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-06-06 14:12
It is not clear to me how to reproduce the bug.

When you say "letting the workers terminate themselves" do mean calling sys.exit() or os._exit() in the submitted task?  Are you trying to get the result of a task which caused the worker to exit?

I'm not sure how the patch would change the current behaviour.

The following seems to work for me:

import sys, os
import multiprocessing as mp

if __name__ == '__main__':
    p = mp.Pool(4, maxtasksperchild=5)
    results = []

    for i in range(100):
        if i % 10 == 0:
            results.append(p.apply_async(sys.exit))
        else:
            results.append(p.apply_async(os.getpid))

    for i, res in enumerate(results):
        if i % 10 != 0:
            print(res.get())
        else:
            pass      # trying res.get() would block forever
msg162467 - (view) Author: Ask Solem (asksol) (Python committer) Date: 2012-06-07 09:09
Well, I still don't know exactly why restarting the socket read made it work, but the patch solved an issue where newly started pool processes would be stuck in socket read forever (happening to maybe 1/500 new processes)

This and a dozen other pool related fixes are in my billiard fork of multiprocessing, e.g. what you
describe in your comment:
# trying res.get() would block forever
works in billiard, where res.get() will raise WorkerLostError in that
case.

https://github.com/celery/billiard/

Earlier commit history for the pool can be found in Celery:
https://github.com/ask/celery/commits/2.5/celery/concurrency/processes/pool.py

My eventual goal is to merge these fixes back into Python, but except
for people using Python 3.x, they would have to use billiard for quite some time anyway, so I don't feel in a hurry.


I think this issue can be closed, the worker handler is simply borked and  we could open up a new issue deciding how to fix it (merging billiard.Pool or someting else).

(btw, Richard, you're sbt? I was trying to find your real name to give
you credit for the no_execv patch in billiard)
msg162495 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-06-07 20:20
> I think this issue can be closed, the worker handler is simply borked and  
> we could open up a new issue deciding how to fix it (merging billiard.Pool 
> or someting else).

OK.  I am not sure which option under "Resolution" should be chosen.  "Later"?

> (btw, Richard, you're sbt?

Yes.

> I was trying to find your real name to give you credit for the no_execv 
> patch in billiard)

The execv stuff certainly won't go in by Py3.3.  There has not been consensus that adding it is a good idea.

(I also have the unit tests passing with a "fork server": the server process is forked at the beginning of the program and then forked children of the server process are started on request.  It is about 10 times faster then using execv, and almost as fast as simple forking.)
msg162497 - (view) Author: Ask Solem (asksol) (Python committer) Date: 2012-06-07 20:57
Later works, or just close it.  I can open up a new issue to merge the improvements in billiard later.

> The execv stuff certainly won't go in by Py3.3.  There has not been 
> consensus that adding it is a good idea.

> (I also have the unit tests passing with a "fork server": the server >process is forked at the beginning of the program and then forked >children of the server process are started on request.  It is about 10 >times faster then using execv, and almost as fast as simple forking.)

Ah, a working 'fork server' would be just as good.
Btw, Billiard now supports running Pool without threads, using epoll/kqueue/select instead. So Celery uses that when it can be nonblocking, and execv when it can't.  It performs way better without threads, and in addition shutdown + replacing worker processes is much more responsive.  Changing the default Pool is not going to happen, but ncluding a simple select() based Pool would be possible, and then it could also easily work with Twisted, Eventlet, Gevent, etc. (especially now that the Connection is rewritten in pure python).
msg162502 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-06-07 21:13
> Ah, a working 'fork server' would be just as good.

Only problem is that it depends on fd passing which is apparently broken on MacOSX.

> Btw, Billiard now supports running Pool without threads, using 
> epoll/kqueue/select instead. So Celery uses that when it can be 
> nonblocking, and execv when it can't.  It performs way better without 
> threads, and in addition shutdown + replacing worker processes is much 
> more responsive.

If it were not for Windows I would have tried to avoid using threads.
History
Date User Action Args
2012-06-11 15:28:01sbtsetstatus: open -> closed
resolution: later
stage: patch review -> resolved
2012-06-07 21:13:42sbtsetmessages: + msg162502
2012-06-07 20:57:48asksolsetmessages: + msg162497
2012-06-07 20:20:37sbtsetmessages: + msg162495
2012-06-07 09:09:22asksolsetmessages: + msg162467
2012-06-06 14:12:46sbtsetmessages: + msg162412
2012-06-06 12:35:14sbtsetnosy: + sbt
2012-03-13 02:04:01jafosetassignee: jnoller

messages: + msg155556
nosy: + jafo, jnoller
2011-06-12 18:36:09terry.reedysetversions: - Python 3.1
2011-04-15 19:53:45niraisetnosy: + nirai
2010-11-24 07:11:02ysj.raysetnosy: + ysj.ray
messages: + msg122262
2010-10-06 11:25:57asksolcreate