Message 110142 - Python tracker

Author	asksol
Recipients	asksol, gdb, jnoller
Date	2010-07-12.21:19:55
Message-id	<1278969597.77.0.994311547679.issue9205@psf.upfronthosting.co.za>
In-reply-to

Content

> Unfortunately, if you've lost a worker, you are no
> longer guaranteed that cache will eventually be empty.
> In particular, you may have lost a task, which could
> result in an ApplyResult waiting forever for a _set call.

> More generally, my chief assumption that went into this
> is that the unexpected death of a worker process is
> unrecoverable. It would be nice to have a better workaround
> than just aborting everything, but I couldn't see a way
> to do that.

It would be a problem if the process simply disappeared,
but in this case you have the ability to put a result on the queue,
so it doesn't have to wait forever.

For processes disappearing (if that can happen at all), we could solve
that by storing the jobs a process has accepted (started working on),
so if a worker process is lost, we can mark those jobs as failed too.
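
A minimal sketch of that bookkeeping idea, for illustration only. The names
AcceptedJobTracker, WorkerLostError and fail_lost_jobs are hypothetical and
are not part of multiprocessing; the plain dict `results` stands in for the
pool's result cache.

```python
import collections


class WorkerLostError(Exception):
    """Hypothetical error stored for jobs whose worker exited unexpectedly."""


class AcceptedJobTracker:
    """Hypothetical bookkeeping: which jobs each worker pid has accepted."""

    def __init__(self):
        self._by_pid = collections.defaultdict(set)

    def on_accepted(self, pid, job_id):
        # Called when a worker reports that it has started working on a job.
        self._by_pid[pid].add(job_id)

    def on_done(self, pid, job_id):
        # Called when the result for the job has arrived.
        self._by_pid[pid].discard(job_id)

    def on_worker_lost(self, pid):
        # Return the jobs the lost worker had accepted but not finished.
        return sorted(self._by_pid.pop(pid, set()))


def fail_lost_jobs(tracker, pid, results):
    """Mark every in-flight job of a dead worker as failed.

    `results` maps job_id -> (success, value); it is an assumption standing
    in for whatever structure the pool uses to hold pending results.
    """
    for job_id in tracker.on_worker_lost(pid):
        results[job_id] = (False, WorkerLostError("worker %d exited" % pid))
```

With something like this, a waiter on a lost job would receive a failure
instead of blocking forever on a _set call that never arrives.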

> I could be wrong, but that's not what my experiments
> were indicating. In particular, if an unpickleable error occurs,
> then a task has been lost, which means that the relevant map,
> apply, etc. will wait forever for completion of the lost task.

It's lost now, but not if we handle the error...
For a single map operation this behavior may make sense, but what about
someone running the pool as a long-running service for users to submit
map operations to? Errors in this context are expected to happen,
even unpickleable errors.
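
One way "handling the error" could look on the worker side, as a rough
sketch: check that the value can be pickled before sending it, and
substitute a picklable failure if it cannot. The names safe_result and
UnpickleableResultError are hypothetical; the (job, i, (success, value))
shape follows the job/i terminology used in this thread.

```python
import pickle


class UnpickleableResultError(Exception):
    """Hypothetical picklable stand-in for a result that could not be sent."""

    def __init__(self, detail):
        super().__init__(detail)
        self.detail = detail


def safe_result(job, i, success, value):
    """Return a message that is safe to put on the result queue.

    The worker already knows job and i, so even when the value itself
    cannot be pickled, the failure can still be attributed to the task.
    """
    try:
        pickle.dumps(value)  # trial run; the queue will pickle it again
    except Exception as exc:
        return (job, i, (False, UnpickleableResultError(repr(exc))))
    return (job, i, (success, value))
```

The trial pickling costs an extra serialization per result; this sketch
accepts that cost for simplicity.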

I guess the fact that the worker handler works as a supervisor is a side
effect, as it was made for the maxtasksperchild feature, but for me it's a
welcome one. With the supervisor in place, multiprocessing.pool is already
fairly stable for this use case, and there's not much left to do to make it
solid (Celery has already been running for months without issue, unless
there's a pickling error...)

> That does sound useful. Although, how can you determine the
> job (and the value of i) if it's an unpickleable error?
> It would be nice to be able to retrieve job/i without having
> to unpickle the rest.

I was already working on this issue last week actually, and I managed
to do that in a way that works well enough (at least for me):
http://github.com/ask/celery/commit/eaa4d5ddc06b000576a21264f11e6004b418bda1#diff-1
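
For readers without access to the commit, here is one possible way to
recover job/i without unpickling the rest. This is a sketch under stated
assumptions and not necessarily what the linked commit does: the payload
is pickled separately and wrapped in a small envelope, so the envelope can
be read even when the payload cannot. The helper names pack, unpack_header
and unpack_payload are hypothetical.

```python
import pickle


def pack(job, i, obj):
    """Pickle the payload on its own, then wrap it with job and i."""
    payload = pickle.dumps(obj)
    return pickle.dumps((job, i, payload))


def unpack_header(message):
    """Read job and i; the payload stays as opaque bytes."""
    job, i, payload = pickle.loads(message)
    return job, i, payload


def unpack_payload(job, i, payload):
    """Unpickle the payload, turning any failure into a failed result."""
    try:
        return (job, i, (True, pickle.loads(payload)))
    except Exception as exc:
        return (job, i, (False, exc))
```

Because job and i come from the outer envelope, a payload that fails to
unpickle can still be reported against the right task.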

History
Date	User	Action	Args
2010-07-12 21:19:58	asksol	set	recipients: + asksol, jnoller, gdb
2010-07-12 21:19:57	asksol	set	messageid: <1278969597.77.0.994311547679.issue9205@psf.upfronthosting.co.za>
2010-07-12 21:19:56	asksol	link	issue9205 messages
2010-07-12 21:19:55	asksol	create