Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not immediately SIGKILL subprocess child processes upon ^C #70130

Closed
MikePomraning mannequin opened this issue Dec 25, 2015 · 19 comments
Closed

Do not immediately SIGKILL subprocess child processes upon ^C #70130

MikePomraning mannequin opened this issue Dec 25, 2015 · 19 comments
Assignees
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@MikePomraning
Copy link
Mannequin

MikePomraning mannequin commented Dec 25, 2015

BPO 25942
Nosy @gpshead, @vstinner, @vadmium, @stuarteberg
PRs
  • bpo-25942: Added new parameter to subprocess.run(): cleanup_timeout #4283
  • bpo-25942: make subprocess more graceful on ^C #5026
  • Files
  • subprocess-call-py344-kill-only-on-timeout.patch: Patch to restrict subprocess.call SIGKILLs to timeouts only
  • second-wait.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2018-01-30.05:31:57.862>
    created_at = <Date 2015-12-25.01:15:08.206>
    labels = ['3.7', 'type-feature', 'library']
    title = 'Do not immediately SIGKILL subprocess child processes upon ^C'
    updated_at = <Date 2018-01-30.15:16:26.838>
    user = 'https://bugs.python.org/MikePomraning'

    bugs.python.org fields:

    activity = <Date 2018-01-30.15:16:26.838>
    actor = 'stuarteberg'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2018-01-30.05:31:57.862>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2015-12-25.01:15:08.206>
    creator = 'Mike Pomraning'
    dependencies = []
    files = ['41407', '42483']
    hgrepos = []
    issue_num = 25942
    keywords = ['patch']
    message_count = 19.0
    messages = ['256973', '257118', '257163', '257283', '257289', '257293', '262871', '263260', '263279', '263302', '263308', '263420', '263551', '263743', '305619', '305641', '309115', '311230', '311234']
    nosy_count = 7.0
    nosy_names = ['gregory.p.smith', 'vstinner', 'SilentGhost', 'martin.panter', 'Mike Pomraning', 'rpcope1', 'stuarteberg']
    pr_nums = ['4283', '5026']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue25942'
    versions = ['Python 3.7']

    @MikePomraning
    Copy link
    Mannequin Author

    MikePomraning mannequin commented Dec 25, 2015

    Python 3.3 introduces timeout support in subprocess.call, implemented by sending a SIGKILL if the Popen.wait is interrupted by a TimeoutExpired exception.

    However, the "except" clause is too broad, and will, for instance, trigger on a KeyboardInterrupt. For practical purposes, this means that sending a Ctrl-C to a python program before 3.3 sent a SIGINT to both the parent and subprocess.call()d child, whereas under 3.3+ sends a SIGINT _and_ a SIGKILL to the child. The child will not be able to clean up appropriately.

    For a real world example of this, see http://stackoverflow.com/q/34458583/132382

    The fix is, I think, simply changing the clause to "except TimeoutExpired". At least, that works for me. See attached patch.

    @MikePomraning MikePomraning mannequin added type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir labels Dec 25, 2015
    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Dec 28, 2015

    The code was introduced to solve bpo-12494, so I'm adding Victor to weigh in.

    @MikePomraning
    Copy link
    Mannequin Author

    MikePomraning mannequin commented Dec 29, 2015

    If I understand correctly, the _try_wait mechanics (or 3.5's syscall behavior) already handle EINTR the way we way: ignore it and try wait()ing again.

    So, this patch would kill only on a timeout, and never on another error like Ctrl-C, a UserDefinedTimeoutException from a signal handler, etc.

    That's probably the lesser of two evils, the other being a SIGKILL against an arbitrary child process. Better to document that a non-timeout-parameter interruption to subprocess.call will separate the parent from its child, than to hard kill arbitrary programs when a polite SIGINT was intended.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 1, 2016

    Doesn’t this scenario apply equally to run(), or check_output() in 3.4?

    I suspect the explicit p.wait() call is not needed either. The context manager should already be calling wait().

    One possible problem that I can think of: if you set a timeout, then interrupt the call with KeyboardInterrupt or similar, the context manager will now wait without without a timeout. Demo:

    # Hit Ctrl+C before the 3 s timeout, and it will delay 10 s
    call('trap "" INT && sleep 10', shell=True, timeout=3)

    @vstinner
    Copy link
    Member

    vstinner commented Jan 1, 2016

    The issue explained in 12494 is that the caller doesn't have access to the
    subprocess.Popen object. I disagree to not kill the process when an
    exception is raised, even KeyboardInterrupt.

    I also disagree to say that we kill an "arbitrary" process. IMHO it's part
    of the API that the process is killed with SIGKILL on error.

    Maybe we need to flag to send SIGTERM on exception and then wait N wait
    until the child exited, or send SIGKILL after the timeout. Maybe it's
    overkill and such API should be developed in third party modules.

    Anyway, not sending any signal on exction is not a good idea. We must read
    the exit status to avoid zombi process. It's not a matter of sending a
    signal but of reading the exit status.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 1, 2016

    The reported problem is when no timeout is given. Perhaps it would be sufficient to:

    1. Kill the child if any exception happens when a timeout is enforced
    2. Never kill the child when there is no timeout

    If a timeout is specified, the current code is good enough, but if no timeout is specified, maybe we should just do the equivalent of:

    with Popen(...) as p:
        return p.wait()
        # If interrupted, the context manager will wait again. If the interruption is due to a terminal-wide SIGINT, the child may also have been interrupted.

    For comparison, the Posix system() function is supposed to ignore SIGINT and SIGQUIT (i.e. signals from the terminal). See the Gnu implementation: <https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/posix/system.c\>.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 4, 2016

    Even if we can’t agree on any behaviour change, I think it might be worth documenting how these functions behave on exceptions (interrupts) other than TimeoutExpired. Currently all I can find is “If the timeout expires, the child process will be killed and waited for.” I think this could be expanded to also say what happens if the parent is interrupted by a signal such as KeyboardInterrupt:

    • Current behaviour: Immediately kill child (i.e. timeout expiry is not special)

    • Previous behaviour: Return without waiting for child, which will become a zombie

    • Mike’s proposal: Wait indefinitely for child without killing it, which could defeat the purpose of the timeout, especially if the child ignores or does not receive the same signal as the parent

    @vstinner
    Copy link
    Member

    Again, the problem is that the exception exits from the function which owns the last reference to the Popen object. If the Popen is left alive when you exit the function, you create a zombi process, you can leave open pipes, etc.

    It's unclear to me if CTRL+c sends SIGTERM to all processes or only a few of them (only the parent process?). Even if SIGTERM is sent to all processes, a child process can decide to completly ignore it, or can block in a deadlock or whatever.

    Giving a few seconds to the child process to wait until it ends is not easy because it's hard to choose an arbitrary timeout. If the timeout is too low, you kill the child process (SIGKILL) before it flushes files. If the timeout is too long, the parent process is blocked too long when the child process is really blocked.

    I suggest to keep the current behaviour by default.

    If you really want to give time to the child process, I suggest to add a *new* optional parameter . For example ctrlc_timeout=5.0 to send SIGTERM and then wait 5 seconds.

    I don't know if the parent process must always send a SIGTERM to the child process or not. One signal or two don't have the same behaviour. It's possible to explicitly send a SIGTERM to only one process using the kill command.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 12, 2016

    I don’t know how it works on Windows, but on Unix in most cases the parent and child will share a controlling terminal. Pressing Ctrl+C in the terminal will broadcast SIGINT to all processes, parent and child. That is probably why os.system() ignores SIGINT.

    I doubt the usefulness of building in extra timeouts to send SIGTERM and SIGKILL. If the user really cares that much, they can probably design their own timeout mechanism. That is why I suggested above to treat the non-timeout mode differently.

    @MikePomraning
    Copy link
    Mannequin Author

    MikePomraning mannequin commented Apr 13, 2016

    Yes, standard UNIX terminal behavior is to map Ctrl-C to a SIGINT sent to the foreground process group, so that every member of a pipeline (e.g.) or hidden helper children processes can be terminated by the interactive user and have the chance to clean up.

    Handling a child process behind a convenience interface, like system() or subprocess.call(), is inherently a bit tricky when things go wrong.

    My expectation for .call() would be that it behave something like os.system() (or the C library system() for that matter) and _not_ be interrupted by conventional signals. That the EINTR be "swallowed" and the p.wait() resumed, as _try_wait() does already. That way a user timeout= does what we want, but a Ctrl-C has familiar semantics.

    Yes, it will be possible for a coder to contrive to throw some other exception during the wait() ... in that case we should close the pipes but _not_ kill and reap the child. There will be zombies. Zombies are better than SIGKILLing a 3rd-party process that perhaps needs a graceful shutdown.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 13, 2016

    When no timeout is specified, these are the options as I see them:

    1. SIGKILL child immediately on the first KeyboardInterrupt (Victor’s behaviour since 3.3)

    2. Give up and leave a zombie after the first KeyboardInterrupt (pre-3.3 behaviour)

    3. Wait again after first KeyboardInterrupt, and leave a zombie after the second one (Mike’s patch)

    4. Ignore SIGINT so that by default no KeyboardInterrupt will happen, like C’s system()

    5. Start a timeout after the first KeyboardInterrupt (Victor’s suggestion)

    Here is my proposal, taking into account Victor’s desire to never leave a zombie, and Mike’s desire to let the child handle SIGINT in its own time: After the first KeyboardInterrupt or other exception, wait() a second time, and only use SIGKILL if the second wait() is interrupted. It’s a bit complicated, but I think this would solve everyone’s concerns raised so far:

    def call(*popenargs, timeout=None, **kwargs):
        p = Popen(*popenargs, **kwargs)
        try:
            if timeout is None:
                try:
                    return p.wait()
                except:
                    p.wait()  # Let the child handle SIGINT
                    raise
            else:
                return p.wait(timeout=timeout)
        except:
            p.kill()  # Last resort to avoid leaving a zombie
            p.wait()
            raise

    @MikePomraning
    Copy link
    Mannequin Author

    MikePomraning mannequin commented Apr 14, 2016

    #2 and #4 are the only predictable and palatable options, I think. Ignore the patch that started this issue.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 16, 2016

    I don’t think Victor likes #2 because of the zombie. I would be interested in #4 (one of the documented purposes of subprocess is to replace os.system), but it might need careful consideration and discussion. Ignoring signals is such a significant change I think it would have to be a new feature for 3.6+ only.

    Mike/Victor, what do you think of my proposal (call it #6) about waiting a second time before resorting to SIGKILL? Posting a patch which implements this.

    @MikePomraning
    Copy link
    Mannequin Author

    MikePomraning mannequin commented Apr 19, 2016

    Re: #2, I'd rather have a zombie than a hard kill on a child whose code I perhaps don't control. Zombies are a fact of life (er, a fact of undeath?) in UNIX process management, and are the historical and IMHO expected behavior.

    @gpshead gpshead self-assigned this Nov 5, 2017
    @gpshead gpshead added the 3.7 (EOL) end of life label Nov 5, 2017
    @vadmium
    Copy link
    Member

    vadmium commented Nov 6, 2017

    #4283 adds a secondary timeout, which defaults to 1 s when there is no main timeout. But this seems complicated and arbitrary. As I understand, the main use case discussed here was waiting without a timeout for a child that exits soon after the interrupt. But are there any practical use cases or demand for:

    • Limiting the wait time after the interrupt (despite no timeout before the interrupt)?
    • Customizing this additional timeout?

    @vstinner
    Copy link
    Member

    vstinner commented Nov 6, 2017

    I changed the issue title to "Add a new optional cleanup_timeout parameter to subprocess.call()" to make it more positive and update it to the currently proposed change ;-)

    @vstinner vstinner changed the title subprocess.call SIGKILLs too liberally Add a new optional cleanup_timeout parameter to subprocess.call() Nov 6, 2017
    @vstinner vstinner added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Nov 6, 2017
    @gpshead
    Copy link
    Member

    gpshead commented Dec 28, 2017

    you'll notice I added an alternate PR. I don't like the complication of adding yet another knob (cleanup_timeout) to subprocesses already giant API surface. It will rarely be used.

    My PR tries to take a practical approach: Just wait a little while (arbitrary value of little chosen in the code) for the child after receiving SIGINT before reraising the exception and triggering a .kill() matching existing behavior.

    The one controversial thing in my PR (which could be undone, it is independent of the other changes) is that I also modify the context manager __exit__ behavior to not do an infinite wait() upon KeyboardInterrupt. This means context managers of Popen _will_ complete potentially leaving a dangling process around (which our existing __del__ will pick up and put in the internal subprocess._active list). Relatively harmless, but a change none-the-less.

    I went that far to try and better match the Python 2.7 and 3.2 behavior: On SIGINT our process sees the KeyboardInterrupt "right away" (not quite as instantaneously here given the wait timeouts, but close enough for interactive command line tool ^C happiness).

    It seems like an improvement all around and is IMNSHO less complicated.

    @gpshead
    Copy link
    Member

    gpshead commented Jan 30, 2018

    New changeset f4d644f by Gregory P. Smith in branch 'master':
    bpo-25942: make subprocess more graceful on ^C (GH-5026)
    f4d644f

    @gpshead
    Copy link
    Member

    gpshead commented Jan 30, 2018

    I went with my change to give the child process a small amount of time to cleanup by default. Not perfect, but this should be more similar to the Python <=3.2 behavior. Lets see if any issues crop up during the 3.7 betas.

    @gpshead gpshead closed this as completed Jan 30, 2018
    @gpshead gpshead changed the title Add a new optional cleanup_timeout parameter to subprocess.call() Do not immediately SIGKILL subprocess child processes upon ^C Jan 30, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants