classification
Title: Popen.terminate fails with ProcessLookupError under certain conditions
Type: behavior Stage: commit review
Components: Library (Lib) Versions: Python 3.10, Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: Alexander Overvoorde, FFY00, gregory.p.smith, miss-islington, oconnor663
Priority: normal Keywords: patch

Created on 2020-05-07 20:01 by Alexander Overvoorde, last changed 2020-12-03 16:32 by oconnor663. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 20010 merged FFY00, 2020-05-08 21:20
PR 23439 merged miss-islington, 2020-11-21 09:22
Messages (10)
msg368370 - (view) Author: Alexander Overvoorde (Alexander Overvoorde) Date: 2020-05-07 20:01
The following program frequently raises a ProcessLookupError exception when calling proc.terminate():

```
import threading
import subprocess
import multiprocessing

proc = subprocess.Popen(["sh", "-c", "exit 0"])

q = multiprocessing.Queue()

def communicate_thread(proc):
    proc.communicate()

t = threading.Thread(target=communicate_thread, args=(proc,))
t.start()

proc.terminate()

t.join()
```

I'm reproducing this with Python 3.8.2 on Arch Linux by saving the script and rapidly executing it like this:

$ bash -e -c "while true; do python3 test.py; done

The (unused) multiprocessing.Queue seems to play a role here because the problem vanishes when removing that one line.
msg368389 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2020-05-07 23:35
This is the backtrace I get:

Traceback (most recent call last):
  File "/home/anubis/test/multiprocessing-error.py", line 16, in <module>
    proc.terminate()
  File "/home/anubis/git/cpython/Lib/subprocess.py", line 2069, in terminate
    self.send_signal(signal.SIGTERM)
  File "/home/anubis/git/cpython/Lib/subprocess.py", line 2064, in send_signal
    os.kill(self.pid, sig)
ProcessLookupError: [Errno 3] No such process

Is yours the same? This is expected, the process exited before proc.terminate().

You should wrap proc.terminate() in a try..except block:

try:
    proc.terminate()
except ProcessLookupError:
    pass

I am not sure we want to suppress this.
msg368462 - (view) Author: Alexander Overvoorde (Alexander Overvoorde) Date: 2020-05-08 20:53
I'm not sure that it is expected since Popen.send_signal does contain the following check:

```
def send_signal(self, sig):
    """Send a signal to the process."""
    # Skip signalling a process that we know has already died.
    if self.returncode is None:
        os.kill(self.pid, sig)
```

Additionally, the following program does not raise a ProcessLookupError despite the program already having exited:

```
import subprocess
import time

proc = subprocess.Popen(["sh", "-c", "exit 0"])

time.sleep(5)

proc.terminate()
```
msg368466 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2020-05-08 21:16
This is a simple time-of-check - time-of-action issue, which is why I suggested that it shouldn't be fixed. I was not aware send_signal did have this check, which tells us it is supposed to be suported(?). Anyway, to fix it we need to catch the exception and check for errno == 3 so that we can ignore it.

Optimally we would want to have an atomic operation here, but no such thing exists. There is still the very faint possibility that after your process exits a new process will take its id and we kill it instead.

We should keep the returncode check and just ignore the exception when errno == 3. This is the best option.
msg368470 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2020-05-08 21:26
I submitted a patch. As explained above, this only mitigates the time-of-check/time-of-action issue in case the process doesn't exist, it *is not* a proper fix for the issue. But don't see any way to properly fix it.
msg368492 - (view) Author: Alexander Overvoorde (Alexander Overvoorde) Date: 2020-05-09 00:34
I understand that it's not a perfect solution, but at least it's a little bit closer. Thanks for your patch :)
msg381535 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-11-21 09:22
New changeset 01a202ab6b0ded546e47073db6498262086c52e9 by Filipe Laíns in branch 'master':
bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. (GH-20010)
https://github.com/python/cpython/commit/01a202ab6b0ded546e47073db6498262086c52e9
msg381537 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-11-21 09:26
Thanks for the patch!

PRs are in or on their way in for 3.10 and 3.9.

The 3.8 auto-backport failed, if anyone wants it on a future 3.8.x please follow up with a manual cherry pick to make a PR for the 3.8 branch.
msg381538 - (view) Author: miss-islington (miss-islington) Date: 2020-11-21 09:46
New changeset 713b4bbd97392acc492a4fefb2d07ae61fb7b75d by Miss Islington (bot) in branch '3.9':
bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. (GH-20010)
https://github.com/python/cpython/commit/713b4bbd97392acc492a4fefb2d07ae61fb7b75d
msg382427 - (view) Author: Jack O'Connor (oconnor663) * Date: 2020-12-03 16:32
I'm late to the party, but I want to explain what's going on here in case it's helpful to folks. The issue you're seeing here has to do with whether a child processs has been "reaped". (Windows is different from Unix here, because the parent keeps an open handle to the child, so this is mostly a Unix thing.) In short, when a child exits, it leaves a "zombie" process whose only job is to hold some metadata and keep the child's PID reserved.  When the parent calls wait/waitpid/waitid or similar, that zombie process is cleaned up. That means that waiting has important correctness properties apart from just blocking the parent -- signaling after wait returns is unsafe, and forgetting to wait also leaks kernel resources.

Here's a short example demonstrating this:

```
  import signal                                                                                                                                                                                
  import subprocess                                                                                                                                                                            
  import time                                                                                                                                                                                  
                                                                                                                                                                                               
  # Start a child process and sleep a little bit so that we know it's exited.                                                                                                                              
  child = subprocess.Popen(["true"])                                                                                                                                                           
  time.sleep(1)                                                                                                                                                                                
                                                                                                                                                                                               
  # Signal it. Even though it's definitely exited, this is not an error.                                                                                                                                  
  os.kill(child.pid, signal.SIGKILL)                                                                                                                                                           
  print("signaling before waiting works fine")                                                                                                                                                 
                                                                                                                                                                                               
  # Now wait on it. We could also use os.waitpid or os.waitid here. This reaps                                                                                                                 
  # the zombie child.                                                                                                                                                                          
  child.wait()                                                                                                                                                                                 
                                                                                                                                                                                               
  # Try to signal it again. This raises ProcessLookupError, because the child's                                                                                                                
  # PID has been freed. But note that Popen.kill() would be a no-op here,
  # because it knows the child has already been waited on.                                                                                                                                                    
  os.kill(child.pid, signal.SIGKILL)                                                                                                                                                           
```

With that in mind, the original behavior with communicate() that started this bug is expected. The docs say that communicate() "waits for process to terminate and sets the returncode attribute." That means internally it calls waitpid, so your terminate() thread is racing against process exit. Catching the exception thrown by terminate() will hide the problem, but the underlying race condition means your program might end up killing an unrelated process that just happens to reuse the same PID at the wrong time. Doing this properly requires using waitid(WNOWAIT), which is...tricky.
History
Date User Action Args
2020-12-03 16:32:27oconnor663setnosy: + oconnor663
messages: + msg382427
2020-11-21 09:46:25miss-islingtonsetmessages: + msg381538
2020-11-21 09:26:37gregory.p.smithsetstatus: open -> closed
versions: - Python 3.8
messages: + msg381537

resolution: fixed
stage: patch review -> commit review
2020-11-21 09:22:21miss-islingtonsetnosy: + miss-islington
pull_requests: + pull_request22331
2020-11-21 09:22:11gregory.p.smithsetmessages: + msg381535
2020-11-21 08:31:38gregory.p.smithsetassignee: gregory.p.smith

nosy: + gregory.p.smith
versions: + Python 3.9, Python 3.10
2020-05-09 00:34:35Alexander Overvoordesetmessages: + msg368492
2020-05-08 21:26:34FFY00setmessages: + msg368470
2020-05-08 21:20:58FFY00setkeywords: + patch
stage: patch review
pull_requests: + pull_request19321
2020-05-08 21:16:48FFY00setmessages: + msg368466
2020-05-08 20:53:32Alexander Overvoordesetmessages: + msg368462
2020-05-07 23:35:40FFY00setnosy: + FFY00
messages: + msg368389
2020-05-07 20:01:24Alexander Overvoordecreate