Author asksol
Recipients Albert.Strasheim, asksol, gdb, jnoller, vlasovskikh
Date 2010-08-27.18:47:40
SpamBayes Score 1.11022e-16
Marked as misclassified No
Message-id <1282934865.24.0.138816152346.issue9205@psf.upfronthosting.co.za>
In-reply-to
Content
New patch attach (termination-trackjobs3.patch).

> Hmm, a few notes. I have a bunch of nitpicks, but those
> can wait for a later iteration. (Just one style nit: I
> noticed a few unneeded whitespace changes... please try
> not to do that, as it makes the patch harder to read.)

Yeah, nitpicks can wait. We need a satisfactory solution first.
I forgot about the whitespace, the reason is that the patch was started
from the previous trackjobs patch.

> - Am I correct that you handle a crashed worker
> by aborting all running jobs?

No. The job's result is marked with the WorkerLostError, the
process is replaced by a new one, and the pool continue to be
functional.

>- If you're going to the effort of ACKing, why not record
>the mapping of tasks to workers so you can be more selective in your >termination?

I does have access to that. There's ApplyResult.worker_pids().
It doesn't terminate anything, it just clean up after whatever terminated. The MapResult could very well discard the job as a whole,
but my patch doesn't do that (at least not yet).

> Otherwise, what does the ACKing do towards fixing this particular
> issue?

It's what lets us find out what PID is processing the job. (It also
happens to be a required feature to reliably take advantage of
external ack semantics (like in AMQP), and also used by my job timeout
patch to know when a job was started, and then it shows to be useful
in this problem.

> - I think in the final version you'd need to introduce some
> interthread locking, because otherwise you're going to have weird race > conditions. I haven't thought too hard about whether you can
> get away with just catching unexpected exceptions, but it's
> probably better to do the locking.

Where is this required?

> - I'm getting hangs infrequently enough to make debugging annoying,
> and I don't have time to track down the bug right now.

Try this:

for i in 1 2 3 4 5; ./python.exe test.regrtest -v test_multiprocessing

it should show up quickly enough (at least on os x)

> Why don't you strip out any changes that are not needed (e.g. AFAICT, > the ACK logic), make sure there aren't weird race conditions,
> and if we start converging on a patch that looks right from a high > level we can try to make it work on all the corner case?

See the updated patch. I can't remove the ACK, but I removed the accept_callback, as it's not strictly needed to solve this problem.
History
Date User Action Args
2010-08-27 18:47:45asksolsetrecipients: + asksol, jnoller, vlasovskikh, gdb, Albert.Strasheim
2010-08-27 18:47:45asksolsetmessageid: <1282934865.24.0.138816152346.issue9205@psf.upfronthosting.co.za>
2010-08-27 18:47:43asksollinkissue9205 messages
2010-08-27 18:47:43asksolcreate