Issue40550
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2020-05-07 20:01 by Alexander Overvoorde, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 20010 | merged | FFY00, 2020-05-08 21:20 | |
PR 23439 | merged | miss-islington, 2020-11-21 09:22 |
Messages (10) | |||
---|---|---|---|
msg368370 - (view) | Author: Alexander Overvoorde (Alexander Overvoorde) | Date: 2020-05-07 20:01 | |
The following program frequently raises a ProcessLookupError exception when calling proc.terminate(): ``` import threading import subprocess import multiprocessing proc = subprocess.Popen(["sh", "-c", "exit 0"]) q = multiprocessing.Queue() def communicate_thread(proc): proc.communicate() t = threading.Thread(target=communicate_thread, args=(proc,)) t.start() proc.terminate() t.join() ``` I'm reproducing this with Python 3.8.2 on Arch Linux by saving the script and rapidly executing it like this: $ bash -e -c "while true; do python3 test.py; done The (unused) multiprocessing.Queue seems to play a role here because the problem vanishes when removing that one line. |
|||
msg368389 - (view) | Author: Filipe Laíns (FFY00) * ![]() |
Date: 2020-05-07 23:35 | |
This is the backtrace I get: Traceback (most recent call last): File "/home/anubis/test/multiprocessing-error.py", line 16, in <module> proc.terminate() File "/home/anubis/git/cpython/Lib/subprocess.py", line 2069, in terminate self.send_signal(signal.SIGTERM) File "/home/anubis/git/cpython/Lib/subprocess.py", line 2064, in send_signal os.kill(self.pid, sig) ProcessLookupError: [Errno 3] No such process Is yours the same? This is expected, the process exited before proc.terminate(). You should wrap proc.terminate() in a try..except block: try: proc.terminate() except ProcessLookupError: pass I am not sure we want to suppress this. |
|||
msg368462 - (view) | Author: Alexander Overvoorde (Alexander Overvoorde) | Date: 2020-05-08 20:53 | |
I'm not sure that it is expected since Popen.send_signal does contain the following check: ``` def send_signal(self, sig): """Send a signal to the process.""" # Skip signalling a process that we know has already died. if self.returncode is None: os.kill(self.pid, sig) ``` Additionally, the following program does not raise a ProcessLookupError despite the program already having exited: ``` import subprocess import time proc = subprocess.Popen(["sh", "-c", "exit 0"]) time.sleep(5) proc.terminate() ``` |
|||
msg368466 - (view) | Author: Filipe Laíns (FFY00) * ![]() |
Date: 2020-05-08 21:16 | |
This is a simple time-of-check - time-of-action issue, which is why I suggested that it shouldn't be fixed. I was not aware send_signal did have this check, which tells us it is supposed to be suported(?). Anyway, to fix it we need to catch the exception and check for errno == 3 so that we can ignore it. Optimally we would want to have an atomic operation here, but no such thing exists. There is still the very faint possibility that after your process exits a new process will take its id and we kill it instead. We should keep the returncode check and just ignore the exception when errno == 3. This is the best option. |
|||
msg368470 - (view) | Author: Filipe Laíns (FFY00) * ![]() |
Date: 2020-05-08 21:26 | |
I submitted a patch. As explained above, this only mitigates the time-of-check/time-of-action issue in case the process doesn't exist, it *is not* a proper fix for the issue. But don't see any way to properly fix it. |
|||
msg368492 - (view) | Author: Alexander Overvoorde (Alexander Overvoorde) | Date: 2020-05-09 00:34 | |
I understand that it's not a perfect solution, but at least it's a little bit closer. Thanks for your patch :) |
|||
msg381535 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2020-11-21 09:22 | |
New changeset 01a202ab6b0ded546e47073db6498262086c52e9 by Filipe Laíns in branch 'master': bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. (GH-20010) https://github.com/python/cpython/commit/01a202ab6b0ded546e47073db6498262086c52e9 |
|||
msg381537 - (view) | Author: Gregory P. Smith (gregory.p.smith) * ![]() |
Date: 2020-11-21 09:26 | |
Thanks for the patch! PRs are in or on their way in for 3.10 and 3.9. The 3.8 auto-backport failed, if anyone wants it on a future 3.8.x please follow up with a manual cherry pick to make a PR for the 3.8 branch. |
|||
msg381538 - (view) | Author: miss-islington (miss-islington) | Date: 2020-11-21 09:46 | |
New changeset 713b4bbd97392acc492a4fefb2d07ae61fb7b75d by Miss Islington (bot) in branch '3.9': bpo-40550: Fix time-of-check/time-of-action issue in subprocess.Popen.send_signal. (GH-20010) https://github.com/python/cpython/commit/713b4bbd97392acc492a4fefb2d07ae61fb7b75d |
|||
msg382427 - (view) | Author: Jack O'Connor (oconnor663) * | Date: 2020-12-03 16:32 | |
I'm late to the party, but I want to explain what's going on here in case it's helpful to folks. The issue you're seeing here has to do with whether a child processs has been "reaped". (Windows is different from Unix here, because the parent keeps an open handle to the child, so this is mostly a Unix thing.) In short, when a child exits, it leaves a "zombie" process whose only job is to hold some metadata and keep the child's PID reserved. When the parent calls wait/waitpid/waitid or similar, that zombie process is cleaned up. That means that waiting has important correctness properties apart from just blocking the parent -- signaling after wait returns is unsafe, and forgetting to wait also leaks kernel resources. Here's a short example demonstrating this: ``` import signal import subprocess import time # Start a child process and sleep a little bit so that we know it's exited. child = subprocess.Popen(["true"]) time.sleep(1) # Signal it. Even though it's definitely exited, this is not an error. os.kill(child.pid, signal.SIGKILL) print("signaling before waiting works fine") # Now wait on it. We could also use os.waitpid or os.waitid here. This reaps # the zombie child. child.wait() # Try to signal it again. This raises ProcessLookupError, because the child's # PID has been freed. But note that Popen.kill() would be a no-op here, # because it knows the child has already been waited on. os.kill(child.pid, signal.SIGKILL) ``` With that in mind, the original behavior with communicate() that started this bug is expected. The docs say that communicate() "waits for process to terminate and sets the returncode attribute." That means internally it calls waitpid, so your terminate() thread is racing against process exit. Catching the exception thrown by terminate() will hide the problem, but the underlying race condition means your program might end up killing an unrelated process that just happens to reuse the same PID at the wrong time. Doing this properly requires using waitid(WNOWAIT), which is...tricky. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:30 | admin | set | github: 84730 |
2020-12-03 16:32:27 | oconnor663 | set | nosy:
+ oconnor663 messages: + msg382427 |
2020-11-21 09:46:25 | miss-islington | set | messages: + msg381538 |
2020-11-21 09:26:37 | gregory.p.smith | set | status: open -> closed versions: - Python 3.8 messages: + msg381537 resolution: fixed stage: patch review -> commit review |
2020-11-21 09:22:21 | miss-islington | set | nosy:
+ miss-islington pull_requests: + pull_request22331 |
2020-11-21 09:22:11 | gregory.p.smith | set | messages: + msg381535 |
2020-11-21 08:31:38 | gregory.p.smith | set | assignee: gregory.p.smith nosy: + gregory.p.smith versions: + Python 3.9, Python 3.10 |
2020-05-09 00:34:35 | Alexander Overvoorde | set | messages: + msg368492 |
2020-05-08 21:26:34 | FFY00 | set | messages: + msg368470 |
2020-05-08 21:20:58 | FFY00 | set | keywords:
+ patch stage: patch review pull_requests: + pull_request19321 |
2020-05-08 21:16:48 | FFY00 | set | messages: + msg368466 |
2020-05-08 20:53:32 | Alexander Overvoorde | set | messages: + msg368462 |
2020-05-07 23:35:40 | FFY00 | set | nosy:
+ FFY00 messages: + msg368389 |
2020-05-07 20:01:24 | Alexander Overvoorde | create |