Message115114
New patch attach (termination-trackjobs3.patch).
> Hmm, a few notes. I have a bunch of nitpicks, but those
> can wait for a later iteration. (Just one style nit: I
> noticed a few unneeded whitespace changes... please try
> not to do that, as it makes the patch harder to read.)
Yeah, nitpicks can wait. We need a satisfactory solution first.
I forgot about the whitespace, the reason is that the patch was started
from the previous trackjobs patch.
> - Am I correct that you handle a crashed worker
> by aborting all running jobs?
No. The job's result is marked with the WorkerLostError, the
process is replaced by a new one, and the pool continue to be
functional.
>- If you're going to the effort of ACKing, why not record
>the mapping of tasks to workers so you can be more selective in your >termination?
I does have access to that. There's ApplyResult.worker_pids().
It doesn't terminate anything, it just clean up after whatever terminated. The MapResult could very well discard the job as a whole,
but my patch doesn't do that (at least not yet).
> Otherwise, what does the ACKing do towards fixing this particular
> issue?
It's what lets us find out what PID is processing the job. (It also
happens to be a required feature to reliably take advantage of
external ack semantics (like in AMQP), and also used by my job timeout
patch to know when a job was started, and then it shows to be useful
in this problem.
> - I think in the final version you'd need to introduce some
> interthread locking, because otherwise you're going to have weird race > conditions. I haven't thought too hard about whether you can
> get away with just catching unexpected exceptions, but it's
> probably better to do the locking.
Where is this required?
> - I'm getting hangs infrequently enough to make debugging annoying,
> and I don't have time to track down the bug right now.
Try this:
for i in 1 2 3 4 5; ./python.exe test.regrtest -v test_multiprocessing
it should show up quickly enough (at least on os x)
> Why don't you strip out any changes that are not needed (e.g. AFAICT, > the ACK logic), make sure there aren't weird race conditions,
> and if we start converging on a patch that looks right from a high > level we can try to make it work on all the corner case?
See the updated patch. I can't remove the ACK, but I removed the accept_callback, as it's not strictly needed to solve this problem. |
|
Date |
User |
Action |
Args |
2010-08-27 18:47:45 | asksol | set | recipients:
+ asksol, jnoller, vlasovskikh, gdb, Albert.Strasheim |
2010-08-27 18:47:45 | asksol | set | messageid: <1282934865.24.0.138816152346.issue9205@psf.upfronthosting.co.za> |
2010-08-27 18:47:43 | asksol | link | issue9205 messages |
2010-08-27 18:47:43 | asksol | create | |
|