New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regrtest timeout for subprocesses: regrtest -jN --timeout=SECONDS #81712
Comments
Reported by Victor Stinner. regrtest has a --timeout parameter which calls
|
Thanks Joannah! That's a nice enhancement which should help to stabilize buildbots even more. The next step is bpo-37711: "regrtest: re-run failed tests in subprocesses". |
It seems that an issue still exists. https://buildbot.python.org/all/#/builders/130/builds/1050 has been running for nearly 9 hours at this point. I can leave it "stuck" if there is some diagnostics that would be beneficial. But I would prefer not to have this impeding other testing for too long. |
I just tested the 3.7 branch manually: regrtest works as expected. I'm using this local changes: diff --git a/Lib/test/libregrtest/runtest.py b/Lib/test/libregrtest/runtest.py
index e7dce180cb..7d996d6918 100644
--- a/Lib/test/libregrtest/runtest.py
+++ b/Lib/test/libregrtest/runtest.py
@@ -119,7 +119,7 @@ def _runtest(ns, test_name):
use_timeout = (ns.timeout is not None)
if use_timeout:
- faulthandler.dump_traceback_later(ns.timeout, exit=True)
+ pass #faulthandler.dump_traceback_later(ns.timeout, exit=True)
start_time = time.perf_counter()
try:
diff --git a/Lib/test/test_os.py b/Lib/test/test_os.py
index 8032da0530..e21399c840 100644
--- a/Lib/test/test_os.py
+++ b/Lib/test/test_os.py
@@ -89,6 +89,7 @@ class FileTests(unittest.TestCase):
tearDown = setUp
def test_access(self):
+ import time; time.sleep(3600)
f = os.open(support.TESTFN, os.O_CREAT|os.O_RDWR)
os.close(f)
self.assertTrue(os.access(support.TESTFN, os.W_OK)) Then I ran: vstinner@apu$ ./python -m test -j2 test_os -m test_access --timeout=3 == Tests result: FAILURE == 1 test failed: Total duration: 4 sec 545 ms As you can see, test_os is killed after 4.5 seconds, as expected (it uses timeout x 1.5, since faulthandler is supposed to be first watchdog with timeout seconds). -- Maybe something is wrong with test_concurrent_futures on Windows. Maybe subprocess.Popen.kill() is unable to kill the process immediately for an unknown reason? Or maybe subprocess.Popen.communicate(timeout=self.timeout) didn't fail with TimeoutExpired? Sadly, libregrtest is quite when it kills a worker process because of timeout, since the following output should be enough to understand what happened. But here the worker was not killed, or failed to be killed, or something else... |
Can you check the process hierarchy? I would like to know how many worker processes are still running under the main regrtest process. I expect to see exactly one. I don't know how to investigate more on such issue on Windows. |
There are no child processes of the main test process. There are 5 |
(I reopen the issue.) "There are no child processes of the main test process. There are 5 Oh wow, that's really strange. I would expect that regrtest main process has one worker process: "test_concurrent_futures". https://buildbot.python.org/all/#/builders/130/builds/1050 C:\buildbot.python.org\3.7.kloth-win64\build>"C:\buildbot.python.org\3.7.kloth-win64\build\PCbuild\amd64\python_d.exe" -u -Wd -E -bb -m test -uall -rwW --slowest --timeout 1200 --fail-env-changed -j1 -j2 -j4 --timeout 900 |
I wrote PR 15345 to enhance regrtest timeout: more logs, more timeouts. |
Additional logging on failures is always welcome. Might I suggest that, in this case, regrtest treats this action as a hard fail so as to not got lost in the other transient failures (test_asyncio). By that I mean, either to not re-run, or to still be "failed" even after a successful re-run as this test_concurrent mishap is rare enough that I doubt it would happen twice in one build. Now for a brain dump. The process tree at this time: python.exe (buildbot)
[*] Windows doesn't really have the concept of a process tree
Given that typepref.exe also prevents completion, I believe the problem is in how we are spawning subprocesses. Maybe a process handle is being inherited by the spawned processes preventing it from fully terminating? I mention the process handle as it was listed as an open handle in Process Explorer for the regrtest process even though it had been terminated prior. |
I don't understand why you get 2 typepref.exe process: only the main process should spawn one. Worker processes don't need it and should not spawn it. |
Oh, interesting: test_regrtest.test_multiprocessing_timeout() failed with a timeout on x86 Windows7 3.x. This build tested commit d33e46d which is older than my "bpo-37531: Enhance regrtest multiprocess timeout (GH-15345)" change (older than commit de2d9ee). https://buildbot.python.org/all/#/builders/58/builds/2889 running: test_regrtest (32 min 8 sec) Thread 0x00000e68 (most recent call first): == Tests result: FAILURE == |
This buildbot looks slow: the 3 latest builds (2886, 2887, 2888) took around 1 hour. Example: 10 slowest tests:
Total duration: 1 hour 4 min |
I happened to catch a stuck build prior to the process being killed: https://buildbot.python.org/all/#/builders//builds/2887 In short, the PR doesn't change the problem. The regrtest main will wait indefinitely on the successfully killed process. I have some ideas to try, but creating a reproducer is not exactly easy... |
Oh, I found an issue with regrtest and Windows processes. On Windows, when a process dies (crash or whatever), Popen.communicate() continues to hang in the parent if the died process has at least one child process which is still alive. Test attached kill_timeout.py: Calling Popen.communicate() hangs even if the child process has been killed, whereas wait() completes immediately. On Windows, Popen.communicate() is implemented with threads calling this function on each pipe: def _readerthread(self, fh, buffer):
buffer.append(fh.read())
fh.close() It seems like self.stdout.read() hangs even after the child process has been killed. |
The issue can be reproduced with regrtest:
I also applied the following changes to regrtest: diff --git a/Lib/test/libregrtest/runtest.py b/Lib/test/libregrtest/runtest.py use_timeout = (ns.timeout is not None)
if use_timeout:
- faulthandler.dump_traceback_later(ns.timeout, exit=True)
+ pass #faulthandler.dump_traceback_later(ns.timeout, exit=True) start_time = time.perf_counter()
try:
diff --git a/Lib/test/libregrtest/runtest_mp.py b/Lib/test/libregrtest/runtest_mp.py
index c22479b797..51352497ae 100644
--- a/Lib/test/libregrtest/runtest_mp.py
+++ b/Lib/test/libregrtest/runtest_mp.py
@@ -19,10 +19,10 @@ from test.libregrtest.utils import format_duration # Display the running tests if nothing happened last N seconds # Time to wait until a worker completes: should be immediate def must_stop(result, ns):
@@ -305,7 +305,7 @@ class MultiprocessRunner:
self.pending = MultiprocessIterator(self.regrtest.tests)
if self.ns.timeout is not None:
self.worker_timeout = self.ns.timeout * 1.5
- self.main_timeout = self.ns.timeout * 2.0
+ self.main_timeout = self.ns.timeout * 10.0
else:
self.worker_timeout = None
self.main_timeout = None |
This is an issue when the standard handles are inherited or duplicated to a grandchild process, and so on. In the case of Popen(sys.executable), the system is duplicating the standard handles implicitly because sys.executable is a console process (assuming it's python[_d].exe), among other criteria. An additional reference on the write side prevents the pipe from closing. fh.read() in the grandparent won't return as long as there's a writer that could potentially write more data. |
Is there a way to workaround that? |
For Windows, subprocess could have a _read_all(file) method that special cases a pipe. The read loop for a pipe would check whether the child has exited. Then call _winapi.PeekNamedPipe on the handle (from get_osfhandle), and do a raw read of the available bytes. If the child has exited or PeekNamedPipe fails (EPIPE), then break, join the partial reads, decode and translate newlines if it's text mode, and return. |
Alternatively, instead of special casing the file type and spinning on PeekNamedPipe, the workaround could be based on a multiple-object wait that includes the child process handle. In this case, communicate() would always call _communicate in Windows, regardless of the timeout or number of pipes -- because simplistically calling either self.stdout.read() or self.stderr.read() could hang. The implementation would be moderately complicated. If we stop waiting on the reader threads because the process exited, we can give the threads a short time to finish reading and close the files -- maybe 250 ms is enough. But if they haven't exited at this time, we can't simply raise a TimeoutExpired exception if the call hasn't actually timed out. To finish the _communicate call, we would have to cancel the pending reads and join() the threads. We can force a reader thread to exit by canceling the read via WINAPI CancelIoEx. However, _readerthread has to be modified to support this. It could be implemented as a loop that calls _winapi.ReadFile to read the output in chunks that get appended to the buffer list. The read loop would break for either ERROR_BROKEN_PIPE or ERROR_OPERATION_ABORTED (canceled). The final step in _communicate would be to concatenate the partial reads. If it's text mode, it would also have to decode the bytes and translate newlines. The POSIX implementation of _communicate has to do this, so we already have a _translate_newlines method for this case. Note that _winapi.WaitForMultipleObjects is interruptible in the main thread via Ctrl+C, which is a bonus improvement since Thread.join() can't be interrupted in Windows. |
Eryk Sun
Well, maybe we could do something to enhance regrtest, but the initial bug described in this issue should be fixed by my commit de2d9ee which added a timeout to the second Popen.communicate() call, but also to the Popen.wait() call (moreover, "with popen:" is no longer used). Both functions use a timeout of 30 seconds which should be short enough to prevent the main regrtest process to block for hours. When a process is killed and stdout.read() never completes, the stdout.read() thread will likely never completes. But I consider that it's a separated bug that should be addressed in a separated issue. Jeremy Kloth:
Again, my commit de2d9ee should fix this behavior. -- I'm not sure why test_regrtest.test_multiprocessing_timeout() failed with a timeout on x86 Windows7 3.x; see msg350060. Maybe my commit 767434c will give more information if the test fails again on this buildbot. I will backport regrtest changes to other 3.7 and 3.8 branches. I wait a few more days to see if buildbots are stable with latest regrtest changes. |
Crap, test_regrtest.test_multiprocessing_timeout() hangs sometimes (randomly), and then support.temp_cwd() fails on rmtree() with "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process". running: test_regrtest (14 min 24 sec) Thread 0x00000a30 (most recent call first): (...) 1 re-run test: Total duration: 25 min 20 sec
Tests result: FAILURE then SUCCESS
Traceback (most recent call last):
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1013, in temp_dir
yield path
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1065, in temp_cwd
yield cwd_dir
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 628, in main
self._main(tests, kwargs)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 690, in _main
sys.exit(0)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 330, in _force_run
return func(*args)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\buildbot.python.org\\3.x.kloth-win64\\build\\build\\test_python_4376\\test_python_worker_5136'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 192, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\__main__.py", line 2, in <module>
main()
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 695, in main
Regrtest().main(tests=tests, **kwargs)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 628, in main
self._main(tests, kwargs)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1065, in temp_cwd
yield cwd_dir
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\contextlib.py", line 131, in __exit__
self.gen.throw(type, value, traceback)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1018, in temp_dir
rmtree(path)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 452, in rmtree
_rmtree(path)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 393, in _rmtree
_waitfor(_rmtree_inner, path, waitall=True)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 341, in _waitfor
func(pathname)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 390, in _rmtree_inner
_force_run(fullname, os.rmdir, fullname)
File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 336, in _force_run
return func(*args)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\buildbot.python.org\\3.x.kloth-win64\\build\\build\\test_python_4376\\test_python_worker_5136' |
Well, the kill timeout doesn't seem to be working, at least completely: https://buildbot.python.org/all/#/builders/40/builds/3012 The worker process has been killed (line 562), but regrtest is still waiting. |
Another day, another stuck test_concurrent_futures... https://buildbot.python.org/all/#/builders/40/builds/3030 The test process is again killed (line 568) but the processes from the multiprocess pool are still alive. Once I manually kill those pool processes, regrtest resumes as it should. |
See also bpo-37424. |
https://buildbot.python.org/all/#/builders/58/builds/2987 1:06:55 load avg: 3.39 [343/419/1] test_regrtest crashed (Exit code 1) Thread 0x00000b24 (most recent call first): |
I created bpo-38207 "subprocess: On Windows, Popen.kill() + Popen.communicate() is blocking until all processes using the pipe close the pipe". |
Oh wait, test_regrtest.test_multiprocessing_timeout() now also hangs on FreeBSD? :-( AMD64 FreeBSD CURRENT Shared 3.8: running: test_regrtest (24 min 52 sec) |
It seems like the new regrtest design doesn't work as I expected. Calling popen.kill() + popen.stdout.close() + popen.stderr.close() from a thread B does not always interrupt popen.communicate() in thread A. See https://bugs.python.org/issue38207#msg352729 for an example on Linux where communicate() still blocks until all child processes complete. I see different options:
asyncio subprocess uses overlapped operations which can be cancelled. So maybe it isn't affected by bpo-38207. |
Oh, regrtest main process was killed by faulthandler watchdog: it was blocked in _get_result() for longer than 1 minute. Extract of _get_result(): use_faulthandler = (self.ns.timeout is not None)
timeout = PROGRESS_UPDATE
while True:
if use_faulthandler:
faulthandler.dump_traceback_later(timeout * 2.0, exit=True)
# display progress
running = get_running(self.workers)
if running and not self.ns.pgo:
self.log('running: %s' % ', '.join(running)) self.ns.timeout is supposed to be 1500. I don't understand why regrtest got a timeout of 60 seconds? https://buildbot.python.org/all/#/builders/168/builds/1583 LD_LIBRARY_PATH=/usr/home/buildbot/python/3.x.koobs-freebsd-current/build ./python ./Tools/scripts/run_tests.py -j 1 -u all -W --slowest --fail-env-changed --timeout=1500 -j2 -j4 Thread 0x0000000802d2fd00 (most recent call first): Thread 0x0000000802d2f800 (most recent call first): Thread 0x0000000802d30200 (most recent call first): Thread 0x0000000800acc000 (most recent call first): |
Oh, that comes from "timeout = PROGRESS_UPDATE" with "PROGRESS_UPDATE = 30.0". But it means that the main process was stuck for longer than 60 seconds!? The main process is supposed to write an update "running: ..." every 30 seconds. This buildbot is very slow. Maybe 30 and 60 seconds limits are too low? |
The initial issue has been fixed in branches 3.7, 3.8 and master. I close the issue. Thanks to everybody who help to make these changes possible and helped to debug regressions. It's not perfect, regrtest has to workaround bpo-38207 bug, but at least regrtest should no longer runs for several hours or days.
The main regrtest process now kills a worker process running longer than timeout seconds. In practice, it's timeout x 1.5, to give time to the faulthandler watchdog to kill the process. If regrtest hits bpo-38207 bug, regrtest takes 2 min 30 sec to exit in the worst case.
I changed this timeout from 1 minute to 5 minutes.
I propose to continue the discussion in bpo-38207 "subprocess: On Windows, Popen.kill() + Popen.communicate() is blocking until all processes using the pipe close the pipe". I implemented workarounds in regrtest so regrtest doesn't block *forever*: it only waits 30 seconds until a worker thread completes, and I added a timeout of 2 minutes on Python exit of the main regrtest process to prevent blocking in threading._shutdown(), because of the blocked threads.
I didn't check on my Windows VM if I have the issue. If someone sees more than one typeperf.exe process, please open a separated issue. Note: typeperf.exe is the correct spelling (typepref.exe is a typo ;-)). |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: