
Author oconnor663
Recipients Alexander Overvoorde, FFY00, gregory.p.smith, miss-islington, oconnor663
Date 2020-12-03.16:32:26
Message-id <1607013147.06.0.648027742453.issue40550@roundup.psfhosted.org>
Content
I'm late to the party, but I want to explain what's going on here in case it's helpful to folks. The issue you're seeing has to do with whether a child process has been "reaped". (Windows is different from Unix here, because the parent keeps an open handle to the child, so this is mostly a Unix thing.) In short, when a child exits, it leaves a "zombie" process whose only job is to hold some metadata and keep the child's PID reserved. When the parent calls wait/waitpid/waitid or similar, that zombie process is cleaned up and the PID becomes available for reuse. That means that waiting has important correctness properties apart from just blocking the parent -- signaling after wait returns is unsafe, and forgetting to wait also leaks kernel resources.

Here's a short example demonstrating this:

```
import os
import signal
import subprocess
import time

# Start a child process and sleep a little bit so that we know it's exited.
child = subprocess.Popen(["true"])
time.sleep(1)

# Signal it. Even though it's definitely exited, this is not an error,
# because the unreaped zombie still holds the PID.
os.kill(child.pid, signal.SIGKILL)
print("signaling before waiting works fine")

# Now wait on it. We could also use os.waitpid or os.waitid here. This reaps
# the zombie child.
child.wait()

# Try to signal it again. This raises ProcessLookupError, because the child's
# PID has been freed. But note that Popen.kill() would be a no-op here,
# because it knows the child has already been waited on.
os.kill(child.pid, signal.SIGKILL)
```

With that in mind, the original behavior with communicate() that started this bug is expected. The docs say that communicate() "waits for process to terminate and sets the returncode attribute." That means internally it calls waitpid, so your terminate() thread is racing against process exit. Catching the exception thrown by terminate() will hide the problem, but the underlying race condition means your program might end up killing an unrelated process that just happens to reuse the same PID at the wrong time. Doing this properly requires using waitid(WNOWAIT), which blocks until the child exits but leaves the zombie (and therefore the PID) in place, and which is...tricky.
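
For the curious, here's a rough sketch of what the waitid(WNOWAIT) approach could look like. This is only an illustration under some assumptions: it needs a platform where os.waitid is available (Linux, for example), the SharedChild class and its method names are made up for this example and are not part of subprocess, and it's meant for a single waiting thread plus any number of killing threads. The idea is that the waiter blocks without reaping, and the actual reap and any signal are serialized behind one lock, so a signal is only ever sent while the PID is still reserved.

```python
import os
import signal
import subprocess
import threading


class SharedChild:
    """Hypothetical wrapper, not part of the subprocess module."""

    def __init__(self, args):
        self._child = subprocess.Popen(args)
        self._lock = threading.Lock()
        self._reaped = False

    def wait(self):
        try:
            # Block until the child exits, but leave the zombie in place
            # (WNOWAIT), so the PID stays reserved while we sleep here.
            os.waitid(os.P_PID, self._child.pid, os.WEXITED | os.WNOWAIT)
        except ChildProcessError:
            # The child was already reaped by an earlier wait() call.
            pass
        with self._lock:
            # The child has exited, so this reap returns right away. It
            # happens under the lock, so it can't interleave with kill().
            self._child.wait()
            self._reaped = True
        return self._child.returncode

    def kill(self):
        with self._lock:
            # Only signal while the PID is known to be ours: either the
            # child is still running or it's an unreaped zombie.
            if not self._reaped:
                os.kill(self._child.pid, signal.SIGKILL)
```

One thread can sit in wait() while another calls kill() at any time. If kill() wins the race, the child dies and wait() returns; if the child has already been reaped, kill() does nothing instead of signaling a possibly recycled PID.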