Author asksol
Recipients asksol, gdb, jnoller
Date 2010-07-15.07:10:56
Message-id <1279177860.09.0.522238435505.issue9205@psf.upfronthosting.co.za>
In-reply-to
Content
Greg,

> Before I forget, looks like we also need to deal with the
> result from a worker being un-unpickleable:

This is what my patch in bug 9244 does...

> Yep.  Again, as things stand, once you've lost a worker,
> you've lost a task, and you can't really do much about it.
> I guess that depends on your application though... is your
> use-case such that you can lose a task without it mattering?
> If tasks are idempotent, one could have the task handler
> resubmit them, etc..  But really, thinking about the failure
> modes I've seen (OOM kills/user-initiated interrupt) I'm not
> sure under what circumstances I'd like the pool to try to
> recover.

Losing a task is not fun, but there may still be other tasks
running that are just as important. I think you're thinking
from a map_async perspective here.

As for user-initiated interrupts, these are very important to
recover from. Think of some badly written library code suddenly
raising SystemExit: this shouldn't terminate other jobs, and it's
probably easy to recover from, so why shouldn't the pool try?
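
To make the SystemExit case concrete, here is a minimal sketch of how a worker could wrap a task call so that a stray SystemExit becomes a failed result instead of killing the worker. The name run_task and the (success, value) tuple are assumptions for illustration only, not what the pool currently does:

```python
def run_task(func, *args, **kwargs):
    """Run one task and report (success, value_or_exception).

    Hypothetical helper: SystemExit derives from BaseException, so a
    plain `except Exception` would let it escape and end the worker.
    Catching it here turns it into an ordinary failed result.
    """
    try:
        return (True, func(*args, **kwargs))
    except (Exception, SystemExit) as exc:
        return (False, exc)
```

Whether KeyboardInterrupt should be treated the same way is a separate decision; this sketch deliberately leaves it alone.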

> The idea of recording the mapping of tasks -> workers
> seems interesting. Getting all of the corner cases could
> be hard (e.g. making removing a task from the queue and
> recording which worker did the removing atomic, detecting
> if the worker crashed while still holding the queue lock)
> and doing this would require extra mechanism.  This feature
> does seem to be useful for pools running many different
> jobs, because that way a crashed worker need only terminate
> one job.

I think I may have an alternative solution. Instead of keeping
track of what the workers are doing, we could simply change the
result handler so that it gives up when no worker processes are
left alive.

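    while state != TERMINATE:
        # Poll with a timeout so the loop wakes up even if no result arrives.
        result = get(timeout=1)
        # Give up once no worker is left to produce further results.
        if all_processes_dead():
            break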
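
A slightly fuller sketch of the same idea, in runnable form. It assumes a queue whose get(timeout=...) raises queue.Empty (as multiprocessing.Queue and queue.Queue do) and worker objects exposing is_alive() (as multiprocessing.Process does). The names handle_results, get_state and poll_interval are made up for illustration, and this is not the pool's actual result handler:

```python
import queue

TERMINATE = "terminate"  # hypothetical stand-in for the handler's state flag


def handle_results(outqueue, workers, get_state, poll_interval=1.0):
    """Collect results until asked to terminate or until no worker is alive.

    outqueue:  queue whose get(timeout=...) raises queue.Empty on timeout
    workers:   objects exposing is_alive(), e.g. multiprocessing.Process
    get_state: callable returning the current handler state
    """
    results = []
    while get_state() != TERMINATE:
        try:
            results.append(outqueue.get(timeout=poll_interval))
            continue  # results are still flowing, keep reading
        except queue.Empty:
            pass
        if not any(w.is_alive() for w in workers):
            # Drain anything a worker queued just before it died,
            # then stop waiting for results that will never come.
            while True:
                try:
                    results.append(outqueue.get_nowait())
                except queue.Empty:
                    break
            break
    return results
```

Checking for dead workers only after a timeout keeps the common path cheap; the final drain is there so a result queued just before the last worker exits is not dropped.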