Message 110353 - Python tracker

Author	asksol
Recipients	asksol, gdb, jnoller
Date	2010-07-15.07:10:56
Message-id	<1279177860.09.0.522238435505.issue9205@psf.upfronthosting.co.za>

Content

Greg,

> Before I forget, looks like we also need to deal with the
> result from a worker being un-unpickleable:

This is what my patch in bug 9244 does...

> Yep.  Again, as things stand, once you've lost a worker,
> you've lost a task, and you can't really do much about it.
> I guess that depends on your application though... is your
> use-case such that you can lose a task without it mattering?
> If tasks are idempotent, one could have the task handler
> resubmit them, etc..  But really, thinking about the failure
> modes I've seen (OOM kills/user-initiated interrupt) I'm not
> sure under what circumstances I'd like the pool to try to
> recover.

Losing a task is not fun, but there may still be other tasks
running that are just as important. I think you're thinking
from a map_async perspective here.

As for user-initiated interrupts, these are very important to recover
from. Think of some badly written library code suddenly raising
SystemExit: this shouldn't terminate other jobs, and it's probably easy
to recover from, so why shouldn't the pool try?
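
To illustrate the SystemExit case, here is a minimal sketch of a task wrapper that reports the failure instead of letting it take the worker down. The name `run_task` and the `(ok, value)` return convention are hypothetical, made up for this example; this is not how multiprocessing.Pool currently behaves.

```python
def run_task(func, *args, **kwargs):
    """Run one task; return (ok, value) instead of letting it kill the worker.

    Hypothetical helper for illustration only, not part of multiprocessing.
    """
    try:
        return True, func(*args, **kwargs)
    except SystemExit as exc:
        # SystemExit derives from BaseException, not Exception, so a plain
        # `except Exception` would let it propagate and end the process.
        return False, exc
    except Exception as exc:
        # Ordinary task errors are reported the same way.
        return False, exc
```

A worker loop built on something like this could send the `(False, exc)` pair back as the task's result and move on to the next job. KeyboardInterrupt is deliberately not caught here; whether it should be is a separate question.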

> The idea of recording the mapping of tasks -> workers
> seems interesting. Getting all of the corner cases could
> be hard (e.g. making removing a task from the queue and
> recording which worker did the removing atomic, detecting
> if the worker crashed while still holding the queue lock)
> and doing this would require extra mechanism.  This feature
> does seem to be useful for pools running many different
> jobs, because that way a crashed worker need only terminate
> one job.

I think I may have an alternative solution. Instead of keeping track
of what the workers are doing, we could simply change the result
handler so it gives up when there are no more live processes.

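    while state != TERMINATE:
        # Poll with a timeout so the loop can notice dead workers.
        result = get(timeout=1)
        # Give up once no worker process is left alive.
        if all_processes_dead():
            break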
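
A slightly fuller sketch of the same idea, assuming the results arrive on a queue whose `get(timeout=...)` raises `queue.Empty` on timeout, and that each worker object exposes `is_alive()`. The function name `collect_results`, the `None` sentinel and the `poll_interval` parameter are hypothetical and not taken from the actual pool code.

```python
import queue


def collect_results(outqueue, workers, poll_interval=1.0):
    """Collect results until a sentinel arrives or every worker is dead.

    Illustrative only: names and the None sentinel are assumptions.
    """
    results = []
    while True:
        try:
            item = outqueue.get(timeout=poll_interval)
        except queue.Empty:
            # Nothing arrived within the poll interval. Stop waiting only
            # if no worker is left that could still produce a result.
            if not any(w.is_alive() for w in workers):
                break
            continue
        if item is None:  # assumed sentinel meaning "normal shutdown"
            break
        results.append(item)
    return results
```

Checking for dead workers only after a timeout means results that are already queued are still drained before the handler gives up.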

History
Date	User	Action	Args
2010-07-15 07:11:00	asksol	set	recipients: + asksol, jnoller, gdb
2010-07-15 07:11:00	asksol	set	messageid: <1279177860.09.0.522238435505.issue9205@psf.upfronthosting.co.za>
2010-07-15 07:10:58	asksol	link	issue9205 messages
2010-07-15 07:10:56	asksol	create