subprocess.Popen._cleanup() "The handle is invalid" error when some old process is gone #81561
subprocess keeps the list of active processes in its _active list [1]. Every new Popen object [2] polls that list via _cleanup(), and if some old process is already gone and its handle has become invalid, _internal_poll() [3] fails with OSError: [WinError 6] The handle is invalid.

[1] https://github.com/python/cpython/blob/v3.8.0b1/Lib/subprocess.py#L715
[2] https://github.com/python/cpython/blob/v3.8.0b1/Lib/subprocess.py#L1310
[3] https://github.com/python/cpython/blob/v3.8.0b1/Lib/subprocess.py#L1710 |
See bpo-36067 for a related discussion. The _active list and _cleanup function are not required in Windows. When a process exits, it gets rundown to free its handle table and virtual memory. Only the kernel object remains, which is kept alive by pointer and handle references to it that can query information such as the exit status code. As soon as the last reference is closed, the Process object is automatically reaped. It doesn't have to be waited on. |
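This behavior can be sketched from Python on any platform: the exit status stays retrievable after the child has exited, because the kernel record (the Process object on Windows, the zombie entry on Unix) outlives the process itself. A minimal illustration (not from the thread):

```python
import subprocess
import sys
import time

# Spawn a child that exits immediately with status 3.
p = subprocess.Popen([sys.executable, "-c", "raise SystemExit(3)"])
time.sleep(1.0)  # the child is long gone by now; only the kernel record remains

# wait() still retrieves the exit status from that record.
print(p.wait())  # -> 3
```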
Hi Eryk! Thanks for a swift reply! So the current plan for fixing it is:
Please let me know if I'm missing something. Also, is anyone working on this, or should I prepare a PR? Thanks, |
_active only gets appended to in __del__. We can skip the entire body of __del__. Also, calling _cleanup can be skipped in __init__. _internal_poll is required for poll().
ERROR_INVALID_HANDLE should not be ignored. terminate() is doing the right thing by not masking ERROR_INVALID_HANDLE. The only concern is the case where our handle has been closed elsewhere (not by subprocess) and the handle value was subsequently reused as a handle for another process to which we have terminate access. This is a bad handle at a logical, application level, but it's valid and accessible for TerminateProcess. It was suggested to initially get the pid and process startup time in order to validate the call, but Alexey Izbyshev is right that it's not worth complicating the code in subprocess to address an uncommon bug in particular application or library code. |
That sounds like a bad idea. We should avoid using a handle if it can become invalid out of our control. A handle might be recycled, no? |
Thanks for pointing that out! I've submitted a tiny patch to skip the _active list and _cleanup() on Windows. |
Linux (for example) has the same design: the kernel doesn't keep a "full process" alive, but a lightweight structure holding just the exit status for its parent process. That's the concept of a "zombie process". One issue on Linux is that the zombie keeps its pid in use until the parent reads the child's exit status, and Linux pids are limited to 32768 by default.

Now I'm not sure that I understand what this means on Windows. The subprocess module uses a magic Handle object which calls CloseHandle(handle) in its __del__() method. I dislike relying on destructors: if an object is kept alive by a reference cycle, it's never released, and CloseHandle() isn't called. So code spawning processes should wait until subprocesses complete to ensure that CloseHandle() is called, no? Except that Popen._internal_poll() doesn't clear the handle even after the process completes. If Popen.__del__() doesn't add active processes to the _active list, it should at least explicitly call CloseHandle(), no? |
The handle can deliberately live beyond the process, but it does not have to. If it is closed early, the kernel object will be freed when no handles remain, which will be at process exit. So it's a classic __exit__/__del__ case, where both are needed if you want deterministic resource disposal. But it is in no way tied to the life of the child process, so waiting first is optional. |
Windows allocates Process and Thread IDs out of a kernel handle table, which can grow to about 2**24 entries (more than 16 million). So the practical resource limit for inactive Process and Thread objects is available memory, not running out of PID/TID values.
In Unix, the zombie remains visible in the task list (marked as <defunct> in Linux), but in Windows an exited process is removed from the Process Manager's active list, so it's no longer visible to users. Also, a Process object is reaped as soon as the last reference to it is closed, since clearly no one needs it anymore.
We could call self._handle.Close() in _wait(), right after calling GetExitCodeProcess(self._handle). With this change, __exit__ will ensure that _handle gets closed in a deterministic context. Code that needs the handle indefinitely can call _handle.Detach() before exiting the with-statement context, but that should rarely be necessary.

I don't understand emitting a resource warning in Popen.__del__ if a process hasn't been waited on to completion beforehand (i.e. self.returncode is None). If a script wants to be strict about this, it can use a with statement, which is documented to wait on the process. I do understand emitting a resource warning in Popen.__del__ if self._internal_poll() is None. In this case, in Unix only for now, the process gets added to the _active list to try to avoid leaking a zombie. The list gets polled in subprocess._cleanup, which is called in Popen.__init__. Shouldn't _cleanup also be registered as an atexit function?

There should be a way to indicate that a Popen instance is intended to continue running detached from our process, so scripts don't have to ignore an irrelevant resource warning. |
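For reference, the deterministic path already exists at the Python level: Popen.__exit__ waits on the child and closes its standard streams, so cleanup happens at a well-defined point instead of in a garbage-collector-driven __del__. A small illustration:

```python
import subprocess
import sys

# __exit__ waits for the child and closes its standard streams, so
# cleanup happens here, deterministically, not in a later __del__.
with subprocess.Popen([sys.executable, "-c", "print('done')"],
                      stdout=subprocess.PIPE) as p:
    out, _ = p.communicate()

print(out.decode().strip())  # -> done
print(p.returncode)          # -> 0
print(p.stdout.closed)       # -> True
```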
I created bpo-37410 and PR 14391 to close the handle when the process completes. IMHO this change is easy and safe, it can be backported to Python 3.7 and 3.8.
My point is that if you want to "detach" a process, it must be *explicit*. In my experience, processes are "leaked" implicitly: by mistake. In multiprocessing it can be subtle: you don't manipulate subprocess.Popen directly. Users must not access private attributes (Popen._handle). After I added the ResourceWarning, I created bpo-27068 "Add a detach() method to subprocess.Popen". But I only created it in reply to a request from Martin Panter; I didn't use this feature myself. At some point, I decided to just close the issue. Maybe it's time to reopen it.
The purpose is to help developers find bugs in their code. Many are not aware that programs continue to run in the background and that leaking zombie processes is an issue on Unix. The process is deliberately not polled, to be able to emit the warning in more cases; again, to ease debugging. The warning is ignored by default. |
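That default filtering is easy to observe. The sketch below forces the ResourceWarning filter on, drops a still-running Popen, and checks that the warning fired. It relies on CPython's reference counting to run __del__ immediately, and then kills the leaked child so the example cleans up after itself:

```python
import os
import signal
import subprocess
import sys
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ResourceWarning)
    p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    pid = p.pid
    del p  # finalized while the child is still running -> ResourceWarning

print(any(issubclass(w.category, ResourceWarning) for w in caught))  # -> True

os.kill(pid, signal.SIGTERM)  # don't actually leak the child
```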
It emits an incorrect warning if the process has already exited: "subprocess %s is still running". This can be rectified. Here's my current understanding:

```python
def __del__(self, _maxsize=sys.maxsize, _warn=warnings.warn):
    if not self._child_created or self.returncode is not None:
        return
    # In Unix, not reading the subprocess exit status creates a zombie
    # process, which is only destroyed at the parent Python process exit.
    # In Windows, no one else should have a reference to our _handle, so
    # it should get finalized and thus closed, but we use the same warning
    # in order to consistently educate developers.
    if self._internal_poll(_deadstate=_maxsize) is not None:
        _warn("subprocess %s was implicitly finalized" % self.pid,
              ResourceWarning, source=self)
    else:
        _warn("subprocess %s is still running" % self.pid,
              ResourceWarning, source=self)
    # Keep this instance alive until we can wait on it, if needed.
    if _active is not None:
        _active.append(self)
```
|
Sorry, I'm lost :-( More and more different issues are being discussed here. Would it be possible, please, to keep this issue focused on the _active list and _cleanup() on Windows? From what I read, removing/disabling _active and _cleanup() makes sense on Windows. I already created bpo-37410 to call CloseHandle() earlier. Eryk: please open a separate issue if you want to change the ResourceWarning message. |
Here is a concrete example of the ResourceWarning when debugging/stressing multiprocessing code; I interrupt (CTRL+c) test_resource_tracker while it's running:

```
vstinner@apu$ ./python -m test test_multiprocessing_spawn --fail-env-changed -v -m test_resource_tracker
^C
/home/vstinner/prog/python/master/Lib/subprocess.py:917: ResourceWarning: subprocess 18576 is still running
== Tests result: INTERRUPTED ==
1 test omitted:
Total duration: 304 ms
```
|
I merged PR 14360 into master. Are Python 2.7, 3.7 and 3.8 impacted by the bug as well? Should we backport the change, or would it be too dangerous? |
In my experience, any change is dangerous :-) The initial bug doesn't contain any reproducer. I'm talking about:
It's the first time that I've seen such a bug report. It seems like subprocess works for most people :-) Can you elaborate on how to reproduce this issue, to justify a backport? |
IIRC it's the second time I've seen this issue reported. It should be rare. It should only occur if some other code in our process or another process closes our _handle value. It's not our bug. Here's an example in 3.7:

```python
>>> p = subprocess.Popen('python -c "input()"', stdin=subprocess.PIPE)
>>> del p
>>> subprocess._active
[<subprocess.Popen object at 0x0000015E5C94E160>]
>>> subprocess._active[0]._handle.Close()
```

The process is active, but something closed our handle. The next time Popen.__init__ calls _cleanup, _internal_poll raises OSError due to the invalid handle:

```python
>>> subprocess.call('python -V')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python37\lib\subprocess.py", line 323, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Program Files\Python37\lib\subprocess.py", line 664, in __init__
    _cleanup()
  File "C:\Program Files\Python37\lib\subprocess.py", line 228, in _cleanup
    res = inst._internal_poll(_deadstate=sys.maxsize)
  File "C:\Program Files\Python37\lib\subprocess.py", line 1216, in _internal_poll
    if _WaitForSingleObject(self._handle, 0) == _WAIT_OBJECT_0:
OSError: [WinError 6] The handle is invalid
```

I wouldn't want _internal_poll to silence this error, but maybe it could be translated into a warning. That said, since there's no need to wait on a process in Windows, there's no need for __del__ to insert a running process into a list of active processes, in which case there's nothing for _cleanup to do. Given we're in __del__, our private _handle is about to be closed, at least in CPython. Other implementations should wait() or use a with statement, in order to explicitly close the process handle. The latter isn't implemented currently, but you opened bpo-37410 to address this. |
Why would you do that? You should not access the private _active list, nor access the private _handle attribute. I understand that it's a way to trigger such bug, but is it possible to trigger this bug without accessing any private attribute?
I disagree with that. It's very bad if suddenly the handle becomes invalid for no obvious reason. It's better to get a hard error (exception) in such a case. |
The example is just emulating a problem from someone else's code that's closing our handle. Typically this situation occurs because the code is holding onto a handle value for a kernel object (File, Section, Job, Process, Thread, Event, etc) that got closed. The handle value eventually gets reused, such as for our _handle.
I agree, but the exception breaks Popen.__init__. It would have to be ignored or translated to a warning somewhere if we continue to poll a list of active processes every time __init__() is called. But since the latter is unnecessary in Windows, I suggested just skipping it. |
Without accessing private attributes, I don't see how someone could discover the private handle. So for me, it's more of a serious bug in the application, no? Blindly closing random handles doesn't sound like a good idea to me.
That's a side effect of blindly closing random handles. |
Say a library calls CreateEventW and gets handle 32. It passes this handle to some other library, which uses the event and closes the handle when it no longer needs it. But due to a miscommunication in the documentation, the first library thinks the handle remains open. Now handle 32 is free for reuse, but the library doesn't know this. subprocess.Popen subsequently calls CreateProcessW and gets handle 32. Later on, the library closes handle 32, making it invalid, at least until it gets assigned to some other kernel object. |
So yeah, it's a severe bug in an application. An application should not close the wrong handle by mistake :-) Thanks for the explanation. So it would be "nice" to backport the change to reduce the impact of such application bugs, but it's not really a bug in Python itself. Python cannot fully protect developers from such a class of bugs. On Unix, there is a similar bug with applications trying to close a file descriptor which is already closed. It can lead to very severe bugs (crashes): |
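The Unix variant is easy to reproduce with plain file descriptors, since POSIX hands out the lowest available fd numbers and a freed number is quickly recycled. This sketch (an illustration, not code from the thread) shows a double close silently hitting whoever received the recycled number:

```python
import errno
import os

r, w = os.pipe()
os.close(r)             # component A closes its fd: fine so far
r2, w2 = os.pipe()      # the freed number is typically recycled right here

try:
    os.close(r)         # A closes "its" fd a second time by mistake
    victim_hit = True   # no error: we almost certainly just closed r2, the
                        # new pipe's read end, with no diagnostic at all
except OSError:
    victim_hit = False  # EBADF: the number was not recycled, so the double
                        # close at least failed loudly

if victim_hit:
    try:
        os.read(r2, 1)
    except OSError as e:
        print("victim fd silently closed:", e.errno == errno.EBADF)

# tidy up the remaining descriptors
os.close(w)
os.close(w2)
if not victim_hit:
    os.close(r2)
```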
I believe this to be a bug in the standard library instead of solely being the result of an application error. When a Popen instance is finalized by the garbage collector, the internal handle is also finalized and closed despite the instance being put on the active list. This results in _cleanup throwing because the handle can no longer be used. I've attached a small reproduction. |
I see there is a merged patch for this (#14360); is it possible to get it tagged for backport to 3.7? I haven't seen the 3.7.5 tag drop yet, so I wanted to recommend it be included; it's giving us fits. On our Windows 10 dev boxes and Win 2016 servers, emanuel's repro fails on 3.5.4, 3.6.8, 3.7.4, and 3.8.0b4, but apparently not on 2.7 (I think we tested 2.7.12). Per vstinner's request, I agree and would like to see it get a 3.7 and 3.8 backport. Many thanks! |
I triggered automatic backports to see how they go. Doesn't necessarily mean we'll merge them without looking more closely, but may as well check out what our automated tools suggest. |
Right, that's a more practical case where this bug still occurs on Python 3.8 and older. I would suggest fixing the applications, which should get a ResourceWarning in this case. Chip Lynch wrote me an email saying that his team is impacted by the bug on Windows with Python 3.7. I tried to write a simpler patch ignoring ERROR_INVALID_HANDLE, but then I read this issue's discussion again: Eryk Sun and I agree that ignoring ERROR_INVALID_HANDLE is a bad idea; it can be worse, silencing a real bug. So I now agree to backport the fix from master to the 3.7 and 3.8 branches. I prefer to leave 2.7 unchanged even though it's affected: I don't want to take the risk of introducing another regression in 2.7. Moreover, subprocess.Popen has a __del__() method, and Python 2.7 handles reference cycles differently than Python 3: it doesn't implement finalizers (PEP 442). While Python 2.7 might be affected, it should be affected differently. |
Thanks Ruslan Kuprieiev for the bug report and the fix. Thanks Eryk Sun (as usual!) for the great analysis of this issue. Ok, the bug should now be fixed in 3.7, 3.8 and master branches. I chose to not fix Python 2.7: see my previous comment. See bpo-37410 for the follow-up: "[Windows] subprocess: close the handle when the process completes". |
Steve Dower: "I triggered automatic backports to see how they go. Doesn't necessarily mean we'll merge them without looking more closely, but may as well check out what our automated tools suggest." Thanks, I merged the backports. |
Thanks. I hadn't considered that. It can lead to unusual behavior like this when __del__ keeps an object alive. The object can't assume that referenced objects haven't already been finalized. All of the subsequently required instance methods should take this into account. For example, Popen._internal_poll could set the return code to -1 if self._handle.closed is true. That's at a higher level, and much more reasonable, than trying to handle ERROR_INVALID_HANDLE, which could mask bugs. |
Hello, is there a way to work around this bug for Python 3.6? It seems that the fix was only backported to 3.7, and we were not planning on dropping Python 3.6 support quite yet in our project. |
Sadly, 3.6 no longer gets bug fixes, only security fixes:
Yes, I understand that. I was wondering if there was a workaround we could use so we don't have to drop 3.6. |