Message 115114 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	asksol
Recipients	Albert.Strasheim, asksol, gdb, jnoller, vlasovskikh
Date	2010-08-27.18:47:40
SpamBayes Score	1.110223e-16
Marked as misclassified	No
Message-id	<1282934865.24.0.138816152346.issue9205@psf.upfronthosting.co.za>
In-reply-to

Content
New patch attach (termination-trackjobs3.patch). > Hmm, a few notes. I have a bunch of nitpicks, but those > can wait for a later iteration. (Just one style nit: I > noticed a few unneeded whitespace changes... please try > not to do that, as it makes the patch harder to read.) Yeah, nitpicks can wait. We need a satisfactory solution first. I forgot about the whitespace, the reason is that the patch was started from the previous trackjobs patch. > - Am I correct that you handle a crashed worker > by aborting all running jobs? No. The job's result is marked with the WorkerLostError, the process is replaced by a new one, and the pool continue to be functional. >- If you're going to the effort of ACKing, why not record >the mapping of tasks to workers so you can be more selective in your >termination? I does have access to that. There's ApplyResult.worker_pids(). It doesn't terminate anything, it just clean up after whatever terminated. The MapResult could very well discard the job as a whole, but my patch doesn't do that (at least not yet). > Otherwise, what does the ACKing do towards fixing this particular > issue? It's what lets us find out what PID is processing the job. (It also happens to be a required feature to reliably take advantage of external ack semantics (like in AMQP), and also used by my job timeout patch to know when a job was started, and then it shows to be useful in this problem. > - I think in the final version you'd need to introduce some > interthread locking, because otherwise you're going to have weird race > conditions. I haven't thought too hard about whether you can > get away with just catching unexpected exceptions, but it's > probably better to do the locking. Where is this required? > - I'm getting hangs infrequently enough to make debugging annoying, > and I don't have time to track down the bug right now. Try this: for i in 1 2 3 4 5; ./python.exe test.regrtest -v test_multiprocessing it should show up quickly enough (at least on os x) > Why don't you strip out any changes that are not needed (e.g. AFAICT, > the ACK logic), make sure there aren't weird race conditions, > and if we start converging on a patch that looks right from a high > level we can try to make it work on all the corner case? See the updated patch. I can't remove the ACK, but I removed the accept_callback, as it's not strictly needed to solve this problem.

New patch attach (termination-trackjobs3.patch).

> Hmm, a few notes. I have a bunch of nitpicks, but those
> can wait for a later iteration. (Just one style nit: I
> noticed a few unneeded whitespace changes... please try
> not to do that, as it makes the patch harder to read.)

Yeah, nitpicks can wait. We need a satisfactory solution first.
I forgot about the whitespace, the reason is that the patch was started
from the previous trackjobs patch.

> - Am I correct that you handle a crashed worker
> by aborting all running jobs?

No. The job's result is marked with the WorkerLostError, the
process is replaced by a new one, and the pool continue to be
functional.

>- If you're going to the effort of ACKing, why not record
>the mapping of tasks to workers so you can be more selective in your >termination?

I does have access to that. There's ApplyResult.worker_pids().
It doesn't terminate anything, it just clean up after whatever terminated. The MapResult could very well discard the job as a whole,
but my patch doesn't do that (at least not yet).

> Otherwise, what does the ACKing do towards fixing this particular
> issue?

It's what lets us find out what PID is processing the job. (It also
happens to be a required feature to reliably take advantage of
external ack semantics (like in AMQP), and also used by my job timeout
patch to know when a job was started, and then it shows to be useful
in this problem.

> - I think in the final version you'd need to introduce some
> interthread locking, because otherwise you're going to have weird race > conditions. I haven't thought too hard about whether you can
> get away with just catching unexpected exceptions, but it's
> probably better to do the locking.

Where is this required?

> - I'm getting hangs infrequently enough to make debugging annoying,
> and I don't have time to track down the bug right now.

Try this:

for i in 1 2 3 4 5; ./python.exe test.regrtest -v test_multiprocessing

it should show up quickly enough (at least on os x)

> Why don't you strip out any changes that are not needed (e.g. AFAICT, > the ACK logic), make sure there aren't weird race conditions,
> and if we start converging on a patch that looks right from a high > level we can try to make it work on all the corner case?

See the updated patch. I can't remove the ACK, but I removed the accept_callback, as it's not strictly needed to solve this problem.

History
Date	User	Action	Args
2010-08-27 18:47:45	asksol	set	recipients: + asksol, jnoller, vlasovskikh, gdb, Albert.Strasheim
2010-08-27 18:47:45	asksol	set	messageid: <1282934865.24.0.138816152346.issue9205@psf.upfronthosting.co.za>
2010-08-27 18:47:43	asksol	link	issue9205 messages
2010-08-27 18:47:43	asksol	create