Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Popen.communicate() breaks when child closes its side of pipe but not exits #79363

Closed
and800 mannequin opened this issue Nov 7, 2018 · 9 comments
Closed

Popen.communicate() breaks when child closes its side of pipe but not exits #79363

and800 mannequin opened this issue Nov 7, 2018 · 9 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@and800
Copy link
Mannequin

and800 mannequin commented Nov 7, 2018

BPO 35182
Nosy @gpshead, @MojoVampire, @and800, @miss-islington
PRs
  • bpo-35182: fix communicate() crash after child closes its pipes #17020
  • [2.7] bpo-35182: fix communicate() crash after child closes its pipes (GH-17020) #17023
  • bpo-35182: fix communicate() crash after child closes its pipes #18117
  • [3.8] bpo-35182: fix communicate() crash after child closes its pipes (GH-18117) #18148
  • [3.7] bpo-35182: fix communicate() crash after child closes its pipes (GH-18117) #18151
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2020-01-23.23:33:29.317>
    created_at = <Date 2018-11-07.10:23:30.466>
    labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
    title = 'Popen.communicate() breaks when child closes its side of pipe but not exits'
    updated_at = <Date 2020-01-23.23:33:29.310>
    user = 'https://github.com/and800'

    bugs.python.org fields:

    activity = <Date 2020-01-23.23:33:29.310>
    actor = 'gregory.p.smith'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2020-01-23.23:33:29.317>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2018-11-07.10:23:30.466>
    creator = 'and800'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 35182
    keywords = ['patch']
    message_count = 9.0
    messages = ['329412', '329435', '329436', '355760', '360520', '360521', '360584', '360585', '360586']
    nosy_count = 4.0
    nosy_names = ['gregory.p.smith', 'josh.r', 'and800', 'miss-islington']
    pr_nums = ['17020', '17023', '18117', '18148', '18151']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue35182'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @and800
    Copy link
    Mannequin Author

    and800 mannequin commented Nov 7, 2018

    When communicate() is called in a loop, it crashes when the child process has already closed any piped standard stream, but still continues to be running.

    How this happens:

    1. the parent waits for the child events inside communicate() call
    2. the child closes its side of any attached pipes long before exiting (in my case there is some complex c++ application which had messed with its termination)
    3. communicate() receives an epoll event, tries to read/write, receives SIGPIPE (for stdin) or EOF (for stdout), decides to close corresponding file descriptors from its side
    4. communicate() waits for the death of the child, but a timeout is fired
    5. parent handles timeout exception and calls communicate() again
    6. an exception is raised when communicate() tries to register closed file in epoll

    I think there may be a simple solution: before registering file descriptors in epoll, we may check whether any of them is already closed, and don't register it in that case.

    Here is a simple reproducible example, ran on Linux 4.15.0-1021-aws x86_64:

    import subprocess
    
    child = subprocess.Popen(
        ['/usr/local/bin/python3.7', '-c', 'import os, time; os.close(1), time.sleep(30)'],
        stdout=subprocess.PIPE,
    )
    
    while True:
        try:
            child.communicate(timeout=3)
            break
        except subprocess.TimeoutExpired:
            # do something useful here
            pass

    Here is a stacktrace:

    Traceback (most recent call last):
      File "test.py", line 10, in <module>
        child.communicate(timeout=3)
      File "/usr/local/lib/python3.7/subprocess.py", line 933, in communicate
        stdout, stderr = self._communicate(input, endtime, timeout)
      File "/usr/local/lib/python3.7/subprocess.py", line 1666, in _communicate
        selector.register(self.stdout, selectors.EVENT_READ)
      File "/usr/local/lib/python3.7/selectors.py", line 352, in register
        key = super().register(fileobj, events, data)
      File "/usr/local/lib/python3.7/selectors.py", line 238, in register
        key = SelectorKey(fileobj, self._fileobj_lookup(fileobj), events, data)
      File "/usr/local/lib/python3.7/selectors.py", line 225, in _fileobj_lookup
        return _fileobj_to_fd(fileobj)
      File "/usr/local/lib/python3.7/selectors.py", line 40, in _fileobj_to_fd
        "{!r}".format(fileobj)) from None
    ValueError: Invalid file object: <_io.BufferedReader name=3>

    @and800 and800 mannequin added type-crash A hard crash of the interpreter, possibly with a core dump 3.7 (EOL) end of life 3.8 only security fixes type-bug An unexpected behavior, bug, or error stdlib Python modules in the Lib dir and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Nov 7, 2018
    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Nov 7, 2018

    Sounds like the solution you'd want here is to just change each if check in _communicate, so instead of:

        if self.stdout:
            selector.register(self.stdout, selectors.EVENT_READ)
        if self.stderr:
            selector.register(self.stderr, selectors.EVENT_READ)

    it does:

        if self.stdout and not self.stdout.closed:
            selector.register(self.stdout, selectors.EVENT_READ)
        if self.stderr and not self.stderr.closed:
            selector.register(self.stderr, selectors.EVENT_READ)

    The `if self.stdin and input:` would also have to change. Right now it's buggy in a related, but far more complex way. Specifically if you call it with input the first time:

    1. If some of the input is sent but not all, and the second time you call communicate you rely on the (undocumented, but necessary for consistency) input caching and don't pass input at all, it won't register the stdin handle for read (and in fact, will explicitly close the stdin handle), and the remaining cached data won't be sent. If you try to pass some other non-empty input, it just ignores it and sends whatever remains in the cache (and fails out as in the stdout/stderr case if the data in the cache was sent completely before the timeout).

    2. If all of the input was sent on the first call, you *must* pass input=None, or you'll die trying to register self.stdin with the selector

    The fix for this would be to either:

    1. Follow the pattern for self.stdout/stderr (adding "and not self.stdin.closed"), and explicitly document that repeated calls to communicate must pass the exact same input each time (and optionally validate this in the _save_input function, which as of right now just ignores the input if a cache already exists); if input is passed the first time, incompletely transmitted, and not passed the second time, the code will error as in the OP's case, but it will have violated the documented requirements (ideally the error would be a little more clear though)

    or

    1. Change the code so populating the cache (if not already populated) is the first step, and replace all subsequent references to input with references to self._input (for setup tests, also checking if self._input_offset >= len(self._input), so it doesn't register for notifications on self.stdin if all the input has been sent), so it becomes legal to pass input=None on a second call and rely on the first call to communicate caching it. It would still ignore new input values on the subsequent calls, but at least it would behave in a sane way (not closing sys.stdin despite having unsent cached data, then producing a confusing error that is several steps removed from the actual problem)

    Either way, the caching behavior for input should be properly documented; we clearly specify that output is preserved after a timeout and retrying communicate ("If the process does not terminate after timeout seconds, a TimeoutExpired exception will be raised. Catching this exception and retrying communication will not lose any output."), but we don't say anything about input, and right now, the behavior is the somewhat odd and hard to express:

    "Retrying a call to communicate when the original call was passed non-None/non-empty input requires subsequent call(s) to pass non-None, non-empty input. The input on said subsequent calls is otherwise ignored; only the unsent remainder of the original input is sent. Also, it will just fail completely if you pass non-empty input and it turns out the original input was sent completely on the previous call, in which case you *must* call it with input=None."

    It might also be worth changing the selectors module to raise a more obvious exception when register is passed a closed file-like object, but given it only requires non-integer fileobjs to have a .fileno() method, adding a requirement for a "closed" attribute/property could break other code.

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Nov 7, 2018

    Hmm... Correction to my previous post. communicate itself has a test for:

    "if self._communication_started and input:"

    that raises an error if it passes, so the second call to communicate can only be passed None/empty input. And _communicate only explicitly closes self.stdin when input is falsy and _communication_started is False, so the required behavior right now is:

    1. First call *may* pass input
    2. Second call must not pass (non-empty) input under any circumstance

    So I think we're actually okay on the code for stdin, but it would be a good idea to document that input *must* be None on all but the first call, and that the input passed to the first call is cached such that as long as at least one call to communicate completes without a TimeoutError (and the stdin isn't explicitly closed), it will all be sent.

    Sorry for the noise; I should have rechecked communicate itself, not just _communicate.

    @and800
    Copy link
    Mannequin Author

    and800 mannequin commented Oct 31, 2019

    @josh.r but you’re correct regarding cached data that isn’t sent on subsequent communicate() calls. If the child consumes the input too slowly, and timeout occurs before sending all input, the remaining part will be lost.

    Maybe it is not a bug, but it’s quite a confusing behavior, and I think it should be mentioned in the doc.

    @gpshead
    Copy link
    Member

    gpshead commented Jan 22, 2020

    New changeset d3ae95e by Gregory P. Smith (Alex Rebert) in branch 'master':
    bpo-35182: fix communicate() crash after child closes its pipes (GH-17020) (GH-18117)
    d3ae95e

    @gpshead
    Copy link
    Member

    gpshead commented Jan 22, 2020

    backport automation appears unhappy at the moment. I'm keeping this open and assigned to me to manually run cherry_picker on this for 3.8 and 3.7 (if still open for non-security fixes).

    @gpshead gpshead self-assigned this Jan 22, 2020
    @miss-islington
    Copy link
    Contributor

    New changeset 5654f83 by Miss Islington (bot) (Alex Rebert) in branch '3.8':
    [3.8] bpo-35182: fix communicate() crash after child closes its pipes (GH-18117) (GH-18148)
    5654f83

    @miss-islington
    Copy link
    Contributor

    New changeset 61b3484 by Miss Islington (bot) (Alex Rebert) in branch '3.7':
    [3.7] bpo-35182: fix communicate() crash after child closes its pipes (GH-18117) (GH-18151)
    61b3484

    @gpshead
    Copy link
    Member

    gpshead commented Jan 23, 2020

    thanks everyone!

    @gpshead gpshead added the 3.9 only security fixes label Jan 23, 2020
    @gpshead gpshead closed this as completed Jan 23, 2020
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants