classification
Title: subprocess.Popen behaves incorrect when moved in process tree
Type: behavior Stage:
Components: Library (Lib) Versions: Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Andre Merzky, gregory.p.smith, r.david.murray
Priority: normal Keywords: patch

Created on 2015-08-14 02:55 by Andre Merzky, last changed 2015-09-12 23:36 by Andre Merzky.

Files
File name Uploaded Description Edit
test_mp.py Andre Merzky, 2015-08-14 02:55 test
subprocess.py.diff Andre Merzky, 2015-09-12 23:36
Messages (9)
msg248553 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2015-08-14 02:55
- create a class which is a subclass of multiprocessing.Process ('A')
 - in its __init__ create new thread ('B') and share a queue with it
 - in A's run() method, run 'C=subprocess.Popen(args="/bin/false")'
 - push 'C' though the queue to 'B'
 - call 'C.pull()' --> returns 0

Apart from returning 0, the pull will also return immediately, even if the task is long running.  The task does not die -- 'ps' shows it is well alive.

I assume that the underlying reason is that 'C' is moved sideways in the process tree, and the wait is happening in a thread which is not the parent of C.  I assume (or rather guess, really) that the system level waitpid call raises a 'ECHILD' (see wait(2)), but maybe that is misinterpreted as 'process gone'?

I append a test script which shows different combinations of process spawner and watcher classes.  All of them should report an exit code of '1' (as all run /bin/false), or should raise an error.  None should report an exit code of 0 -- but some do.

PS.: I implore you not to argue if the above setup makes sense -- it probably does not.  However, it took significant work to condense a real problem into that small excerpt, and it is not a full representation of our application stack.  I am not interested in discussing alternative approaches: we have those, and I can live with the error not being fixed.

#!/usr/bin/env python

from subprocess      import Popen
from threading       import Thread  as T
from multiprocessing import Process as P
import multiprocessing as mp

class A(P):

    def __init__(self):

        P.__init__(self)

        self.q = mp.Queue()
        def b(q):
            C = q.get()
            exit_code = C.poll()
            print "exit code: %s" % exit_code
        B = T(target = b, args=[self.q])
        B.start ()

    def run(self):
        C = Popen(args  = '/bin/false')
        self.q.put(C)

a = A()
a.start()
a.join()
msg248583 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-08-14 12:48
I'll let someone else analyze this in detail if they want to, but I'll just note that mixing multiprocessing and threads is not a good idea and will lead to all sorts of strangeness.  Especially if you are using the unix default of fork for multiprocessing.
msg248592 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2015-08-14 15:48
As mentioned in the PS, I understand that the approach might be questionable.  But (a) the attached test shows the problem also for watcher *processes*, not threads, and (b) an error should be raised in unsupported uses, not a silent, unexpected behavior which mimics success.
msg248640 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2015-08-15 13:22
Looking a little further, it seems indeed to be a problem with ignoring SIGCHLD.  The behavior has been introduced with [1] at [2] AFAICS, which is a response to issue15756 [3].  IMHO, that issue should have been resolved with raising an exception instead of assuming that the child exited successfully (neither is true in this case, not the 'exited' nor the 'successfully').

[1] https://hg.python.org/cpython/rev/484c50bf445c/
[2] https://github.com/python/cpython/blob/2.7/Lib/subprocess.py#L1370
[3] http://bugs.python.org/issue15756
msg250094 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2015-09-07 10:15
Hi again, 

can I do anything to help moving this forward?

Thanks, Andre.
msg250141 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-09-08 00:14
Not really.  Give GPS a couple more weeks to respond, and then maybe bring it up on python-dev.
msg250267 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-09-08 23:46
My gut feeling is to say "don't do that" when it comes to passing subprocess instances between processes (ie: pickling and unpickling them) and expecting them to work as if nothing had changed...  You already acknowledge that this is a strange thing for an application to do and that you have a workaround in your application.

BUT: It does looks like we are doing something a bit weird here with the waitpid errno.ECHILD exception.  However letting this bubble up to the application may, at this point, cause new bugs in code that isn't expecting it so I'm not sure we should change that in any circumstances. :/

FWIW there is also a comment at the end of the related issue1731717 (for Popen.wait() rather than .poll()) with a suggestion to ponder (though not directly related to this issue, if it is still relevant).
msg250539 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2015-09-12 22:33
Yes, I have a workaround (and even a clean solution) in my code.  My interest in this ticket is more academic than anything else :)

Thanks for the pointer to issue1731717.  While I am not sure which 'comment at the end' you exactly refer to, the whole discussion provides some more insight on why SIGCHLD is handled the way it is, so that was interesting.  

I agree that changing the behavior in a way which is unexpected for existing applications is something one wants to avoid, generally.  I can't judge if it is worth to break existing code to get more correctness in a corner case -- depends on how much (and what kind of) code relies on it, which I have no idea about.

One option to minimize change and improve correctness might be to keep track of the parent process.  So one would keep self.parent=os.getpid() along with self.pid.  In the implementation of _internal_poll one can then check if self.parent==os.getpid() still holds, and raise an ECHILD or EINVAL otherwise.  That would catch the pickle/unpickle across processes case (I don't know Python well enough to see if there are easier ways to check if a class instance is passed across process boundaries).

The above would still not be fully POSIX (it ignores process groups which would allow to wait on non-direct descendants), but going down that route would probably almost result in a reimplementation of what libc does...
msg250543 - (view) Author: Andre Merzky (Andre Merzky) * Date: 2015-09-12 23:34
This is patch is meant to be illustrative rather than functional (but it works in the limited set of cases I tested).
History
Date User Action Args
2015-09-12 23:36:00Andre Merzkysetfiles: + subprocess.py.diff
2015-09-12 23:35:47Andre Merzkysetfiles: - subprocess.py.diff
2015-09-12 23:34:17Andre Merzkysetfiles: + subprocess.py.diff
keywords: + patch
messages: + msg250543
2015-09-12 22:33:01Andre Merzkysetmessages: + msg250539
2015-09-08 23:46:35gregory.p.smithsetmessages: + msg250267
2015-09-08 00:14:51r.david.murraysetmessages: + msg250141
2015-09-07 10:15:05Andre Merzkysetmessages: + msg250094
2015-08-15 13:33:26r.david.murraysetnosy: + gregory.p.smith
2015-08-15 13:22:01Andre Merzkysetmessages: + msg248640
2015-08-14 15:48:22Andre Merzkysetmessages: + msg248592
2015-08-14 12:48:05r.david.murraysetnosy: + r.david.murray
messages: + msg248583
2015-08-14 02:55:39Andre Merzkycreate