Author asksol
Recipients asksol, gdb, jnoller
Date 2010-07-12.21:19:55
SpamBayes Score 0.00212888
Marked as misclassified No
Message-id <1278969597.77.0.994311547679.issue9205@psf.upfronthosting.co.za>
In-reply-to
Content
> Unfortunately, if you've lost a worker, you are no
> longer guaranteed that cache will eventually be empty.
> In particular, you may have lost a task, which could
> result in an ApplyResult waiting forever for a _set call.

> More generally, my chief assumption that went into this
> is that the unexpected death of a worker process is
> unrecoverable. It would be nice to have a better workaround
> than just aborting everything, but I couldn't see a way
> to do that.

It would be a problem if the process simply disappeared,
But in this case you have the ability to put a result on the queue,
so it doesn't have to wait forever.

For processes disappearing (if that can at all happen), we could solve
that by storing the jobs a process has accepted (started working on),
so if a worker process is lost, we can mark them as failed too.

> I could be wrong, but that's not what my experiments
> were indicating. In particular, if an unpickleable error occurs,
> then a task has been lost, which means that the relevant map,
> apply, etc. will wait forever for completion of the lost task.

It's lost now, but not if we handle the error...
For a single map operation this behavior may make sense, but what about
someone running the pool as s long-running service for users to submit map operations to? Errors in this context are expected to happen, even unpickleable errors.

I guess that the worker handler works as a supervisor is a side effect,
as it was made for the maxtasksperchild feature, but for me it's a welcome one. With the supervisor in place, multiprocessing.pool is already fairly stable to be used for this use case, and there's not much to be done to make it solid (Celery is already running for months without issue, unless there's a pickling error...)

> That does sound useful. Although, how can you determine the
> job (and the value of i) if it's an unpickleable error?
> It would be nice to be able to retrieve job/i without having
> to unpickle the rest.

I was already working on this issue last week actually, and I managed
to do that in a way that works well enough (at least for me):
http://github.com/ask/celery/commit/eaa4d5ddc06b000576a21264f11e6004b418bda1#diff-1
History
Date User Action Args
2010-07-12 21:19:58asksolsetrecipients: + asksol, jnoller, gdb
2010-07-12 21:19:57asksolsetmessageid: <1278969597.77.0.994311547679.issue9205@psf.upfronthosting.co.za>
2010-07-12 21:19:56asksollinkissue9205 messages
2010-07-12 21:19:55asksolcreate