Author asksol
Recipients asksol, gdb, jnoller
Date 2010-07-14.07:58:13
SpamBayes Score 0.0494337
Marked as misclassified No
Message-id <1279094296.21.0.17444669581.issue9205@psf.upfronthosting.co.za>
In-reply-to
Content
There's one more thing:

    if exitcode is not None:
        cleaned = True
        if exitcode != 0 and not worker._termination_requested:
            abnormal.append((worker.pid, exitcode))


Instead of restarting crashed worker processes it will simply bring down
the pool, right?

If so, then I think it's important to decide whether we want to keep
the supervisor functionality, and if so decide on a recovery strategy.

Some alternatives are:

A) Any missing worker brings down the pool.

B) Missing workers are replaced one by one. A maximum restart frequency
decides when the supervisor should give up trying to recover the pool
and crash it instead.

C) Same as B, except that any process crashing when trying to get() will bring down the pool.
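The restart-frequency bookkeeping in option B could be sketched roughly like this. Note that `spawn_worker`, `max_restarts` and `within` are hypothetical names for illustration; none of them exist in multiprocessing.pool:

```python
import time
from collections import deque

class Supervisor:
    """Sketch of option B: replace crashed workers one by one, but give
    up and crash the pool if restarts happen too frequently."""

    def __init__(self, spawn_worker, max_restarts=5, within=10.0):
        self.spawn_worker = spawn_worker  # hypothetical: starts a new worker
        self.max_restarts = max_restarts  # allowed restarts per window
        self.within = within              # window length in seconds
        self._restart_times = deque()

    def worker_crashed(self):
        """Called when a worker exits abnormally.  Returns True if the
        worker was replaced, False if the pool should be brought down."""
        now = time.monotonic()
        self._restart_times.append(now)
        # Forget restarts that fall outside the frequency window.
        while self._restart_times and now - self._restart_times[0] > self.within:
            self._restart_times.popleft()
        if len(self._restart_times) > self.max_restarts:
            return False  # too many crashes too quickly: crash the pool
        self.spawn_worker()
        return True
```

The deque keeps only the restarts inside the sliding window, so a handful of crashes spread over a long run won't bring the pool down, while a burst of crashes will.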

I think the supervisor is a good addition, so I would very much like to keep it. It's also a step closer to my goal of porting the enhancements Celery has made on top of multiprocessing.pool back upstream.

Using C is only a few changes away from this patch, but B would also be possible in combination with my accept_callback patch. The latter does add some overhead, so it depends on the level of recovery we want to support.

accept_callback: this is a callback triggered when a job is reserved by a worker process. The acks are sent on an additional Queue, with an additional thread processing them (hence the overhead mentioned above). This lets us keep track of what the worker processes are doing, and in particular get the PID of the worker processing any given job. Besides recovery, potential uses are monitoring and the ability to terminate a job (ApplyResult.terminate?). See http://github.com/ask/celery/blob/master/celery/concurrency/processes/pool.py
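The ack mechanism described above could be sketched as follows. The names here (AckTracker, ack_queue, worker_pids) are purely illustrative, not the actual attributes in the Celery pool linked above:

```python
import threading
import queue

class AckTracker:
    """Sketch of the accept_callback idea: workers put (job_id, pid)
    acks on a separate queue, and a dedicated thread records which PID
    picked up each job."""

    def __init__(self):
        self.ack_queue = queue.Queue()
        self.worker_pids = {}  # job_id -> pid of the worker running it
        self._thread = threading.Thread(target=self._consume, daemon=True)
        self._thread.start()

    def _consume(self):
        # The extra thread mentioned above: drain acks as they arrive.
        while True:
            item = self.ack_queue.get()
            if item is None:  # sentinel: shut down the ack thread
                break
            job_id, pid = item
            self.worker_pids[job_id] = pid

    def stop(self):
        self.ack_queue.put(None)
        self._thread.join()
```

With such a map, the supervisor can tell which in-flight jobs were lost with a crashed worker, and something like ApplyResult.terminate could signal the right PID.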
History
Date User Action Args
2010-07-14 07:58:16asksolsetrecipients: + asksol, jnoller, gdb
2010-07-14 07:58:16asksolsetmessageid: <1279094296.21.0.17444669581.issue9205@psf.upfronthosting.co.za>
2010-07-14 07:58:14asksollinkissue9205 messages
2010-07-14 07:58:13asksolcreate