This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: regrtest didn't respect the timeout when running test_subprocess on AMD64 Windows11 3.x
Type: Stage: patch review
Components: Tests Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, jeremy.kloth, jkloth, vstinner
Priority: normal Keywords: patch

Created on 2022-02-10 23:09 by vstinner, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description Edit
process.py jkloth, 2022-02-13 04:29 Reproducer script for test_timeout()
Pull Requests
URL Status Linked Edit
PR 32079 open jkloth, 2022-03-23 17:36
PR 32081 open jkloth, 2022-03-23 19:22
Messages (9)
msg413028 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-10 23:09
regrtest was run with --timeout 900 on AMD64 Windows11 3.x: timeout confirmed by "(timeout: 15 min, worker timeout: 20 min)" log. But then test_subprocss was only stopped after "4 hour 55 min".

If the regrtest main process is able to display an update 2x per minutes (every 30 sec), it should be able to stop the test worker process (test_subprocess) after 20 minutes. How is it possible that the process took so long?

There are multiple guards:

* (1) in the worker process: _runtest() calls faulthandler.dump_traceback_later(ns.timeout, exit=True)
* (2) libregrtest/runtest_mp.py: TestWorkerProcess._run_process() thread uses popen.communicate(timeout=self.timeout)
* (3) faulthandler.dump_traceback_later(MAIN_PROCESS_TIMEOUT, exit=True): kill the parent process if it is blocked for longer than 5 minutes

Guards (1) and (2) didn't work.

Maybe the parent process should implement a 4th guard using the 20 minute timeout: almost 5 hours is way longer than 20 minutes!


C:\buildbot\3.x.kloth-win11\build>"C:\buildbot\3.x.kloth-win11\build\PCbuild\amd64\python_d.exe"  -u -Wd -E -bb -m test  -uall -rwW --slowest --timeout 1200 --fail-env-changed -j1 -j2 --junit-xml test-results.xml -j40 --timeout 900
== CPython 3.11.0a5+ (main, Feb 10 2022, 04:03:24) [MSC v.1930 64 bit (AMD64)]
== Windows-10-10.0.22000-SP0 little-endian
== cwd: C:\buildbot\3.x.kloth-win11\build\build\test_python_5732�
== CPU count: 32
== encodings: locale=cp1252, FS=utf-8
Using random seed 6320493
0:00:00 Run tests in parallel using 40 child processes (timeout: 15 min, worker timeout: 20 min)
(...)
0:03:13 load avg: 0.76 [431/432] test_multiprocessing_spawn passed (3 min 13 sec) -- running: test_subprocess (3 min 11 sec)
0:03:43 load avg: 0.46 running: test_subprocess (3 min 41 sec)
(...)
4:53:17 load avg: 0.00 running: test_subprocess (4 hour 53 min)
4:53:47 load avg: 0.00 running: test_subprocess (4 hour 53 min)
4:54:17 load avg: 0.09 running: test_subprocess (4 hour 54 min)
4:54:47 load avg: 0.35 running: test_subprocess (4 hour 54 min)
4:55:17 load avg: 0.48 running: test_subprocess (4 hour 55 min)
4:55:46 load avg: 0.50 [432/432/1] test_subprocess timed out (4 hour 55 min) (4 hour 55 min)

== Tests result: FAILURE ==

397 tests OK.

10 slowest tests:
- test_subprocess: 4 hour 55 min
- test_multiprocessing_spawn: 3 min 13 sec
- test_concurrent_futures: 2 min 46 sec
- test_peg_generator: 2 min 32 sec
- test_compileall: 1 min 34 sec
- test_unparse: 1 min 31 sec
- test_distutils: 1 min 23 sec
- test_asyncio: 1 min 22 sec
- test_tokenize: 1 min 8 sec
- test_io: 1 min 5 sec

1 test failed:
    test_subprocess
msg413033 - (view) Author: Jeremy Kloth (jkloth) * Date: 2022-02-11 01:03
The test only completed once I purposefully terminated the offending Python process.  The only identifying information I noticed was the command-line of `-c "while True: pass"`, indicating it was stuck in either
test_call_timeout() or test_timeout() in test_subprocess.py.

Something to note is that Windows does not, by default, have a concept of process trees whereas terminating a parent automatically kills the children.  Eryk Sun may have additional ideas on how this desired behavior could be accomplished.
msg413104 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-11 22:48
> test_call_timeout() or test_timeout() in test_subprocess.py.

These tests don't override the standard files, and they only spawn a single child with no descendants. I don't see why this would hang. It shouldn't be a problem with leaked pipe handles (see bpo-43346). It probably will need to be diagnosed by attaching a debugger, or offline with a dump file.

> process trees whereas terminating a parent automatically kills the children

One can use a job object to manage a child process and all of its descendants, including resource usage and termination. A process can belong to multiple job objects in Windows 8+, which is required by Python 3.9+.

For reliability, the child has to be created in a suspended state via CREATE_SUSPENDED. It can be resumed with ResumeThread() after adding it to the job with AssignProcessToJobObject().

You can try to terminate a job cleanly, which is similar in effect to sending SIGTERM to a process group in POSIX. In Windows, this has to be approached differently for console vs graphical processes.

To handle console apps, assuming the child inherits the current console, spawn it as a new process group via creationflags=subprocess.CREATE_NEW_PROCESS_GROUP. You can request an exit by sending a Ctrl+Break event to the group via os.kill(p.pid, signal.CTRL_BREAK_EVENT) [1]. The request might be ignored, but typically the default handler is called, which calls ExitProcess().

To handle GUI apps, assuming the child inherits the current desktop (usually "WinSta0\Default"), first enumerate the top-level and message-only windows on the current desktop via EnumWindows() and FindWindowExW(). Use GetWindowThreadProcessId() to filter the list to include only windows that belong to the job. Post WM_CLOSE to each window in the job. A process might ignore a request to close. It could keep the window open or continue running in the background.

After an internal timeout, you can call TerminateJobObject() to kill any process in the job that remains alive. This is a forced and abrupt termination, which is similar to sending SIGKILL to a process group in POSIX.

---

[1] This usage of os.kill() is what we're stuck with. Rightfully, we should be using os.killpg(p.pid, signal.SIGBREAK) or os.kill(-p.pid, signal.SIGBREAK) (note the negative pid value).
msg413165 - (view) Author: Jeremy Kloth (jkloth) * Date: 2022-02-13 04:29
I've been able locally to reproduce the test_subprocess hang.  The responsible function is subprocess.run().  The test case, test_timeout(), uses a small timeout value (0.0001), which, when given enough load, can cause the run() call to hang.

A judicious use of prints in subprocess.py, reveals that the timeout passed to wait() ends up being negative.  That value, once cast to a DWORD, ultimately causes a very long wait (0xfffffff2, in my testing).

It does seem that only the Windows Popen._wait() cannot handle negative timeout values, so the fix should be as simple as coercing the timeout values to >= 0.
msg413172 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-13 08:21
> the fix should be as simple as coercing the timeout values to >= 0.

Popen._remaining_time() should return max(endtime - _time(), 0).

Popen._wait() should raise OverflowError if the timeout is too large for the implementation. In Windows, the upper limit in milliseconds is `_winapi.INFINITE - 1` (about 49.7 days). It's important to only allow the timeout in milliseconds to be _winapi.INFINITE when `timeout is None`.

The DWORD converter in _winapi needs to subclass unsigned_long_converter. The current implementation based on the legacy format unit "k" is too lenient. Negative values and values that are too large should fail. I updated it to use the following definition:

    class DWORD_converter(unsigned_long_converter):
        type = 'DWORD'

This produces the following improved results:

    >>> h = _winapi.GetCurrentProcess()
    >>> _winapi.WaitForSingleObject(h, _winapi.INFINITE + 1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OverflowError: Python int too large to convert to C unsigned long

    >>> _winapi.WaitForSingleObject(h, -1)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: value must be positive
msg413189 - (view) Author: Jeremy Kloth (jeremy.kloth) Date: 2022-02-13 16:42
> > the fix should be as simple as coercing the timeout values to >= 0.
>
> Popen._remaining_time() should return max(endtime - _time(), 0).

That was my first initial instinct as well, however, that change would
also affect more of the Popen behavior and need a much more thorough
investigation of the POSIX side of Popen.

> Popen._wait() should raise OverflowError if the timeout is too large for the implementation. In Windows, the upper limit in milliseconds is `_winapi.INFINITE - 1` (about 49.7 days). It's important to only allow the timeout in milliseconds to be _winapi.INFINITE when `timeout is None`.

I agree.

> The DWORD converter in _winapi needs to subclass unsigned_long_converter. The current implementation based on the legacy format unit "k" is too lenient. Negative values and values that are too large should fail.

Whilst I agree this is a correct solution, I fear the potential
3rd-party breakage alone should bump this to its own issue.

I believe that this then leads to the following action items for this issue:
1) modify Windows Popen._wait() to raise on out of bounds values [< 0
or >= INFINITE]
2) cap Popen._remaining_time() return value to >= 0
3) change _winapi DWORD converter be unsigned long
4) use Job objects to group Windows processes for termination

Have I missed anything?  I should be able to knock out PRs for these today.
-- 
Jeremy Kloth
msg413215 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-14 09:27
> I fear the potential 3rd-party breakage alone should bump
> this to its own issue.

The change to the DWORD converter in _winapi should only be in 3.11+. If this causes problems for other projects, they're probably depending on undefined behavior in the standard library. No third-party package or application should use _winapi directly.

> 1) modify Windows Popen._wait() to raise on out of bounds 
> values [< 0 or >= INFINITE]

I think raising ValueError would be best at this level. For example:

            if timeout is None:
                timeout_millis = _winapi.INFINITE
            else:
                timeout_millis = int(timeout * 1000)
                if timeout_millis >= _winapi.INFINITE:
                    raise ValueError("'timeout' exceeds the platform limit")
                if timeout_millis < 0:
                    raise ValueError("'timeout' must be non-negative")
msg413219 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-02-14 10:41
> 4) use Job objects to group Windows processes for termination

I think a separate issue should be created for this enhancement.

_winapi wrappers would be needed for CreateJobObjectW(), SetInformationJobObject(), AssignProcessToJobObject(), TerminatejobObject(), and ResumeThread(), plus the constant CREATE_SUSPENDED. I'd also prefer to make related changes to send_signal(), which would require GetConsoleProcessList() and GenerateConsoleCtrlEvent().

A new Popen option would be needed to configure whether the job allows descendants to break away via the process creation flag CREATE_BREAKAWAY_FROM_JOB. This should be allowed by default.

---

send_signal(): SIGKILL, SIGTERM, SIBREAK, SIGINT

Support Popen.kill(group=False) and Popen.terminate(group=False) on all platforms as Popen.send_signal(signal.SIGKILL, group=group) and Popen.send_signal(signal.SIGTERM, group=group).

The Universal CRT (ucrt) in Windows doesn't define SIGKILL. Even when it's not defined by the platform, signal.SIGKILL should always be defined, preferably as 9 (unused by ucrt), else as an unused value in the range up to NSIG, else as NSIG + 1.

The `group` keyword-only option broadens the scope to the process group or job. A process is a group leader if it was created with the flag CREATE_NEW_PROCESS_GROUP (save self._creationflags) and its process ID is in GetConsoleProcessList().

For SIGKILL, always use forced termination. For SIGTERM, use forced termination either if `group` is false or if `group` is true and the process is not a group leader. To force termination, call TerminateJobObject(self._job_handle, 1) if `group` is true, else TerminateProcess(self._handle, 1). 

For SIGTERM, SIGBREAK, and SIGINT, call GenerateConsoleCtrlEvent() if `group` is true and the process is a group leader. For SIGTERM and SIGBREAK, send CTRL_BREAK_EVENT. For SIGINT, send CTRL_C_EVENT. 

Behavior to deprecate:

When `group` is false and `sig` is CTRL_C_EVENT (0) or CTRL_BREAK_EVENT (1), send_signal() always calls os.kill(). This legacy behavior tries to handle a process as if it's a process group, and it combines POSIX kill() with Windows API constants. subprocess should distance itself from the fundamental problems with os.kill() in Windows.
msg413649 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-21 11:49
> It does seem that only the Windows Popen._wait() cannot handle negative timeout values, so the fix should be as simple as coercing the timeout values to >= 0.

Oh. This function should maybe raise an exception if the timeout is negative, and ther caller must replace negative timeout with zero.

> A judicious use of prints in subprocess.py, reveals that the timeout passed to wait() ends up being negative.  That value, once cast to a DWORD, ultimately causes a very long wait (0xfffffff2, in my testing).

This sounds dangerous and must be fixed. Python must not convert negative values to very large positive values.
History
Date User Action Args
2022-04-11 14:59:56adminsetgithub: 90872
2022-03-23 19:22:09jklothsetpull_requests: + pull_request30168
2022-03-23 17:36:36jklothsetkeywords: + patch
stage: patch review
pull_requests: + pull_request30167
2022-02-21 11:49:24vstinnersetmessages: + msg413649
2022-02-14 10:41:10eryksunsetmessages: + msg413219
2022-02-14 09:27:18eryksunsetmessages: + msg413215
2022-02-13 16:42:30jeremy.klothsetnosy: + jeremy.kloth
messages: + msg413189
2022-02-13 08:21:29eryksunsetmessages: + msg413172
2022-02-13 04:29:46jklothsetfiles: + process.py

messages: + msg413165
2022-02-11 22:48:17eryksunsetmessages: + msg413104
2022-02-11 01:03:57jklothsetnosy: + jkloth, eryksun
messages: + msg413033
2022-02-10 23:10:14vstinnersettitle: regrtest didn't respect the timeout on AMD64 Windows11 3.x -> regrtest didn't respect the timeout when running test_subprocess on AMD64 Windows11 3.x
2022-02-10 23:09:17vstinnercreate