classification
Title: Fix regrtest timeout for subprocesses: regrtest -jN --timeout=SECONDS
Type: crash Stage: patch review
Components: Tests Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, jeremy.kloth, jkloth, nanjekyejoannah, vstinner
Priority: normal Keywords: patch

Created on 2019-07-09 19:34 by nanjekyejoannah, last changed 2019-09-19 16:40 by vstinner.

Files
File name Uploaded Description Edit
kill_timeout.py vstinner, 2019-08-22 10:44
test_kill.py vstinner, 2019-08-22 10:48
Pull Requests
URL Status Linked Edit
PR 14679 closed nanjekyejoannah, 2019-07-10 12:16
PR 15072 merged vstinner, 2019-08-14 10:46
PR 15279 merged vstinner, 2019-08-14 13:56
PR 15280 merged vstinner, 2019-08-14 14:04
PR 15345 merged vstinner, 2019-08-20 17:21
PR 15419 merged vstinner, 2019-08-23 10:00
PR 15871 merged vstinner, 2019-09-10 15:07
PR 15874 merged vstinner, 2019-09-10 15:11
PR 16220 merged vstinner, 2019-09-17 07:22
PR 16223 merged miss-islington, 2019-09-17 10:15
PR 16224 merged miss-islington, 2019-09-17 10:15
PR 16247 merged vstinner, 2019-09-17 23:35
PR 16282 merged vstinner, 2019-09-19 10:06
PR 16285 merged vstinner, 2019-09-19 14:41
PR 16289 merged vstinner, 2019-09-19 16:19
Messages (42)
msg347576 - (view) Author: Joannah Nanjekye (nanjekyejoannah) * (Python triager) Date: 2019-07-09 19:34
Reported by Victor Stinner.

regrtest has a --timeout parameter which calls
faulthandler.dump_traceback_later(timeout), but sometimes it's not
enough to kill a test. regrtest should kill a worker process if it
runs longer than --timeout seconds.

* https://bugs.python.org/issue37313 : test_concurrent_futures ran for
25 hours whereas regrtest was run using --timeout 900
* regrtest ran for 4 days whereas it was run with --timeout=900:
https://mail.python.org/pipermail/python-buildbots/2019-June/000285.html
msg349689 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 12:18
New changeset b0c8369c603633f445ccbb5ca7a8742145ff9eec by Victor Stinner in branch 'master':
bpo-37531: Fix regrtest timeout for subprocesses (GH-15072)
https://github.com/python/cpython/commit/b0c8369c603633f445ccbb5ca7a8742145ff9eec
msg349703 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 14:30
New changeset 93bee6a8d878ff952e245614c567c7c6bb7a0398 by Victor Stinner in branch '3.7':
[3.7] bpo-37531: Fix regrtest timeout for subprocesses (GH-15072) (GH-15280)
https://github.com/python/cpython/commit/93bee6a8d878ff952e245614c567c7c6bb7a0398
msg349704 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 14:31
New changeset d85c5670ff1611202a25c9e0967148e72c114de9 by Victor Stinner in branch '3.8':
[3.8] bpo-37531: Fix regrtest timeout for subprocesses (GH-15072) (GH-15279)
https://github.com/python/cpython/commit/d85c5670ff1611202a25c9e0967148e72c114de9
msg349708 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 14:44
Thanks Joannah! That's a nice enhancement which should help to stabilize buildbots even more.

The next step is bpo-37711: "regrtest: re-run failed tests in subprocesses".
msg350009 - (view) Author: Jeremy Kloth (jkloth) * Date: 2019-08-20 14:57
It seems that an issue still exists.

https://buildbot.python.org/all/#/builders/130/builds/1050

has been running for nearly 9 hours at this point.  I can leave it "stuck" if there is some diagnostics that would be beneficial.  But I would prefer not to have this impeding other testing for too long.
msg350012 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-20 15:33
I just tested the 3.7 branch manually: regrtest works as expected.

I'm using this local changes:

diff --git a/Lib/test/libregrtest/runtest.py b/Lib/test/libregrtest/runtest.py
index e7dce180cb..7d996d6918 100644
--- a/Lib/test/libregrtest/runtest.py
+++ b/Lib/test/libregrtest/runtest.py
@@ -119,7 +119,7 @@ def _runtest(ns, test_name):
 
     use_timeout = (ns.timeout is not None)
     if use_timeout:
-        faulthandler.dump_traceback_later(ns.timeout, exit=True)
+        pass #faulthandler.dump_traceback_later(ns.timeout, exit=True)
 
     start_time = time.perf_counter()
     try:
diff --git a/Lib/test/test_os.py b/Lib/test/test_os.py
index 8032da0530..e21399c840 100644
--- a/Lib/test/test_os.py
+++ b/Lib/test/test_os.py
@@ -89,6 +89,7 @@ class FileTests(unittest.TestCase):
     tearDown = setUp
 
     def test_access(self):
+        import time; time.sleep(3600)
         f = os.open(support.TESTFN, os.O_CREAT|os.O_RDWR)
         os.close(f)
         self.assertTrue(os.access(support.TESTFN, os.W_OK))


Then I ran:

vstinner@apu$ ./python -m test -j2 test_os -m test_access --timeout=3
Run tests in parallel using 2 child processes
0:00:04 load avg: 0.59 [1/1/1] test_os timed out (4 sec 512 ms)

== Tests result: FAILURE ==

1 test failed:
    test_os

Total duration: 4 sec 545 ms
Tests result: FAILURE

As you can see, test_os is killed after 4.5 seconds, as expected (it uses timeout x 1.5, since faulthandler is supposed to be first watchdog with timeout seconds).

--

Maybe something is wrong with test_concurrent_futures on Windows. Maybe subprocess.Popen.kill() is unable to kill the process immediately for an unknown reason?

Or maybe subprocess.Popen.communicate(timeout=self.timeout) didn't fail with TimeoutExpired?

Sadly, libregrtest is quite when it kills a worker process because of timeout, since the following output should be enough to understand what happened. But here the worker was not killed, or failed to be killed, or something else...
msg350014 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-20 15:39
> I can leave it "stuck" if there is some diagnostics that would be beneficial.

Can you check the process hierarchy? I would like to know how many worker processes are still running under the main regrtest process. I expect to see exactly one. I don't know how to investigate more on such issue on Windows.
msg350017 - (view) Author: Jeremy Kloth (jeremy.kloth) Date: 2019-08-20 15:55
> Can you check the process hierarchy? I would like to know how many worker processes are still running under the main regrtest process. I expect to see exactly one. I don't know how to investigate more on such issue on Windows.

There are no child processes of the main test process.  There are 5
orphaned processes from multiprocessing.  Their parent process is gone
(assuming the parent_pid argument is correct).  There is also an
orphaned typeperf.exe process, probably also from the killed process.
msg350020 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-20 16:04
(I reopen the issue.)

"There are no child processes of the main test process.  There are 5
orphaned processes from multiprocessing.  Their parent process is gone
(assuming the parent_pid argument is correct)."

Oh wow, that's really strange. I would expect that regrtest main process has one worker process: "test_concurrent_futures".

https://buildbot.python.org/all/#/builders/130/builds/1050

C:\buildbot.python.org\3.7.kloth-win64\build>"C:\buildbot.python.org\3.7.kloth-win64\build\PCbuild\amd64\python_d.exe"  -u -Wd -E -bb -m test  -uall -rwW --slowest --timeout 1200 --fail-env-changed -j1 -j2 -j4 --timeout 900 
== CPython 3.7.4+ (heads/3.7:02c1457, Aug 20 2019, 00:03:57) [MSC v.1900 64 bit (AMD64)]
== Windows-7-6.1.7601-SP1 little-endian
== cwd: C:\buildbot.python.org\3.7.kloth-win64\build\build\test_python_5524
== CPU count: 4
== encodings: locale=cp1252, FS=utf-8
Using random seed 1058030
Run tests in parallel using 4 child processes
0:00:01 load avg: 0.00 [  1/416] test_call passed
0:00:02 load avg: 0.00 [  2/416] test_genexps passed
...
0:12:46 load avg: 2.30 [412/416] test_threaded_import passed -- running: test_concurrent_futures (8 min 42 sec)
0:12:47 load avg: 2.30 [413/416] test_email passed -- running: test_concurrent_futures (8 min 43 sec)
0:12:50 load avg: 2.19 [414/416] test_bytes passed -- running: test_concurrent_futures (8 min 46 sec)
0:12:55 load avg: 2.02 [415/416] test_timeout passed -- running: test_concurrent_futures (8 min 52 sec)
running: test_concurrent_futures (9 min 22 sec)
running: test_concurrent_futures (9 min 52 sec)
running: test_concurrent_futures (10 min 22 sec)
...
running: test_concurrent_futures (9 hour 51 min)
running: test_concurrent_futures (9 hour 51 min)
running: test_concurrent_futures (9 hour 52 min)
running: test_concurrent_futures (9 hour 52 min)
running: test_concurrent_futures (9 hour 53 min)
msg350023 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-20 17:26
I wrote PR 15345 to enhance regrtest timeout: more logs, more timeouts.
msg350044 - (view) Author: Jeremy Kloth (jkloth) * Date: 2019-08-21 02:39
Additional logging on failures is always welcome.  Might I suggest that, in this case, regrtest treats this action as a hard fail so as to not got lost in the other transient failures (test_asyncio).

By that I mean, either to not re-run, or to still be "failed" even after a successful re-run as this test_concurrent mishap is rare enough that I doubt it would happen twice in one build.

Now for a brain dump.

The process tree at this time:

python.exe (buildbot)
. cmd.exe (test step)
.. python_d.exe (regrtest main)
... typepref.exe (load)
... <already terminated process>
.... typepref.exe (load for test_concurrent_futures)
.... python_d.exe (multiprocessing pool)
.... python_d.exe (multiprocessing pool)
.... python_d.exe (multiprocessing pool)
.... python_d.exe (multiprocessing pool)
.... python_d.exe (multiprocessing pool)


1. To see how process termination played out, I killed the main regrtest process.  This did not let the buildbot complete, but just sit there without any output.  Also the typepref.exe process (and the failed multiprocessing pool processes) still lingered.

2. I then terminated the children* of the already terminated process.  No change to the buildbot.

 [*] Windows doesn't really have the concept of a process tree

3. I then terminated the final typepref.exe process.  The buildbot finally finished its test step.

Given that typepref.exe also prevents completion, I believe the problem is in how we are spawning subprocesses.  Maybe a process handle is being inherited by the spawned processes preventing it from fully terminating?

I mention the process handle as it was listed as an open handle in Process Explorer for the regrtest process even though it had been terminated prior.
msg350057 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 09:57
> .... typepref.exe (load for test_concurrent_futures)

I don't understand why you get 2 typepref.exe process: only the main process should spawn one. Worker processes don't need it and should not spawn it.
msg350058 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 09:59
New changeset de2d9eed8bc628533e1628b843cc4c7a5010f6e5 by Victor Stinner in branch 'master':
bpo-37531: Enhance regrtest multiprocess timeout (GH-15345)
https://github.com/python/cpython/commit/de2d9eed8bc628533e1628b843cc4c7a5010f6e5
msg350060 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 10:21
Oh, interesting: test_regrtest.test_multiprocessing_timeout() failed with a timeout on x86 Windows7 3.x.

This build tested commit d33e46d17d33f9b918846982c02ddc17d897c9bc which is older than my "bpo-37531: Enhance regrtest multiprocess timeout (GH-15345)" change (older than commit de2d9eed8bc628533e1628b843cc4c7a5010f6e5).

https://buildbot.python.org/all/#/builders/58/builds/2889

running: test_regrtest (32 min 8 sec)
running: test_regrtest (32 min 38 sec)
running: test_regrtest (33 min 8 sec)
running: test_regrtest (33 min 38 sec)
running: test_regrtest (34 min 8 sec)
running: test_regrtest (34 min 38 sec)
1:19:18 load avg: 1.19 [419/419/1] test_regrtest crashed (Exit code 1)
Timeout (0:35:00)!
Thread 0x00000ddc (most recent call first):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1353 in _readerthread
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 876 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 938 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 896 in _bootstrap

Thread 0x00000e68 (most recent call first):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1033 in _wait_for_tstate_lock
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1017 in join
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1382 in _communicate
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1015 in communicate
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 491 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 504 in run_command
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 529 in run_python
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 680 in run_tests
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 1178 in test_multiprocessing_timeout
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\case.py", line 611 in _callTestMethod
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\case.py", line 654 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\case.py", line 714 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 122 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 84 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 122 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 84 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 122 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 84 in __call__
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\runner.py", line 176 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\support\__init__.py", line 1996 in _run_suite
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\support\__init__.py", line 2092 in run_unittest
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 209 in _test_module
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 234 in _runtest_inner2
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 270 in _runtest_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 140 in _runtest
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 193 in runtest
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest_mp.py", line 66 in run_tests_worker
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\main.py", line 642 in _main
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\main.py", line 628 in main
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\main.py", line 695 in main
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\regrtest.py", line 43 in _main
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\regrtest.py", line 47 in <module>
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\runpy.py", line 85 in _run_code

== Tests result: FAILURE ==
msg350061 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-21 10:24
> x86 Windows7 3.x
> https://buildbot.python.org/all/#/builders/58/builds/2889

This buildbot looks slow: the 3 latest builds (2886, 2887, 2888) took around 1 hour. Example:

10 slowest tests:
- test_multiprocessing_spawn: 15 min 31 sec
- test_zipfile: 8 min 45 sec
- test_concurrent_futures: 6 min 32 sec
- test_pickle: 4 min 7 sec
- test_regrtest: 3 min 47 sec
- test_lzma: 3 min 13 sec
- test_tarfile: 3 min 9 sec
- test_capi: 2 min 54 sec
- test_asyncio: 2 min 49 sec
- test_venv: 1 min 57 sec

Total duration: 1 hour 4 min
msg350132 - (view) Author: Jeremy Kloth (jkloth) * Date: 2019-08-21 23:22
I happened to catch a stuck build prior to the process being killed:

https://buildbot.python.org/all/#/builders//builds/2887

In short, the PR doesn't change the problem.  The regrtest main will wait indefinitely on the successfully killed process.

I have some ideas to try, but creating a reproducer is not exactly easy...
msg350181 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-22 10:44
Oh, I found an issue with regrtest and Windows processes.

On Windows, when a process dies (crash or whatever), Popen.communicate() continues to hang in the parent if the died process has at least one child process which is still alive.

Test attached kill_timeout.py:
---
vstinner@WIN C:\vstinner\python\master>python x.py
Running Debug|x64 interpreter...
parent pid 1600
child1 pid 2184
child2 pid 4516
communicate(): timeout (1)
child1 killed
wait() returned: returncode 1
communicate() timeout (2)
---

Calling Popen.communicate() hangs even if the child process has been killed, whereas wait() completes immediately.

On Windows, Popen.communicate() is implemented with threads calling this function on each pipe:

        def _readerthread(self, fh, buffer):
            buffer.append(fh.read())
            fh.close()

It seems like self.stdout.read() hangs even after the child process has been killed.
msg350183 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-22 10:48
The issue can be reproduced with regrtest:
* Copy attached test_kill.py to Lib/test/
* Run: python -m test test_kill -v -j2 --timeout=5

I also applied the following changes to regrtest:

diff --git a/Lib/test/libregrtest/runtest.py b/Lib/test/libregrtest/runtest.py
index e7dce180cb..7d996d6918 100644
--- a/Lib/test/libregrtest/runtest.py
+++ b/Lib/test/libregrtest/runtest.py
@@ -119,7 +119,7 @@ def _runtest(ns, test_name):

     use_timeout = (ns.timeout is not None)
     if use_timeout:
-        faulthandler.dump_traceback_later(ns.timeout, exit=True)
+        pass #faulthandler.dump_traceback_later(ns.timeout, exit=True)

     start_time = time.perf_counter()
     try:
diff --git a/Lib/test/libregrtest/runtest_mp.py b/Lib/test/libregrtest/runtest_mp.py
index c22479b797..51352497ae 100644
--- a/Lib/test/libregrtest/runtest_mp.py
+++ b/Lib/test/libregrtest/runtest_mp.py
@@ -19,10 +19,10 @@ from test.libregrtest.utils import format_duration


 # Display the running tests if nothing happened last N seconds
-PROGRESS_UPDATE = 30.0   # seconds
+PROGRESS_UPDATE = 1.0   # seconds

 # Time to wait until a worker completes: should be immediate
-JOIN_TIMEOUT = 30.0   # seconds
+JOIN_TIMEOUT = 5.0   # seconds


 def must_stop(result, ns):
@@ -305,7 +305,7 @@ class MultiprocessRunner:
         self.pending = MultiprocessIterator(self.regrtest.tests)
         if self.ns.timeout is not None:
             self.worker_timeout = self.ns.timeout * 1.5
-            self.main_timeout = self.ns.timeout * 2.0
+            self.main_timeout = self.ns.timeout * 10.0
         else:
             self.worker_timeout = None
             self.main_timeout = None
msg350191 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-22 15:00
> It seems like self.stdout.read() hangs even after the child process 
> has been killed.

This is an issue when the standard handles are inherited or duplicated to a grandchild process, and so on. In the case of Popen(sys.executable), the system is duplicating the standard handles implicitly because sys.executable is a console process (assuming it's python[_d].exe), among other criteria. An additional reference on the write side prevents the pipe from closing. fh.read() in the grandparent won't return as long as there's a writer that could potentially write more data.
msg350233 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-22 22:12
> This is an issue when the standard handles are inherited or duplicated to a grandchild process, and so on. In the case of Popen(sys.executable), the system is duplicating the standard handles implicitly because sys.executable is a console process (assuming it's python[_d].exe), among other criteria. An additional reference on the write side prevents the pipe from closing. fh.read() in the grandparent won't return as long as there's a writer that could potentially write more data.

Is there a way to workaround that?
msg350246 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-23 03:04
> Is there a way to workaround that?

For Windows, subprocess could have a _read_all(file) method that special cases a pipe. The read loop for a pipe would check whether the child has exited. Then call _winapi.PeekNamedPipe on the handle (from get_osfhandle), and do a raw read of the available bytes. If the child has exited or PeekNamedPipe fails (EPIPE), then break, join the partial reads, decode and translate newlines if it's text mode, and return.
msg350272 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-23 10:27
New changeset 767434c39c8f3c6a8af1b3282d8382ccf809fe21 by Victor Stinner in branch 'master':
bpo-37531: Fix regrtest _timedout() function on timeout (GH-15419)
https://github.com/python/cpython/commit/767434c39c8f3c6a8af1b3282d8382ccf809fe21
msg350329 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-08-23 19:19
Alternatively, instead of special casing the file type and spinning on PeekNamedPipe, the workaround could be based on a multiple-object wait that includes the child process handle. In this case, communicate() would always call _communicate in Windows, regardless of the timeout or number of pipes -- because simplistically calling either self.stdout.read() or self.stderr.read() could hang. 

The implementation would be moderately complicated. If we stop waiting on the reader threads because the process exited, we can give the threads a short time to finish reading and close the files -- maybe 250 ms is enough. But if they haven't exited at this time, we can't simply raise a TimeoutExpired exception if the call hasn't actually timed out.  To finish the _communicate call, we would have to cancel the pending reads and join() the threads.

We can force a reader thread to exit by canceling the read via WINAPI CancelIoEx. However, _readerthread has to be modified to support this. It could be implemented as a loop that calls _winapi.ReadFile to read the output in chunks that get appended to the buffer list. The read loop would break for either ERROR_BROKEN_PIPE or ERROR_OPERATION_ABORTED (canceled). 

The final step in _communicate would be to concatenate the partial reads. If it's text mode, it would also have to decode the bytes and translate newlines. The POSIX implementation of _communicate has to do this, so we already have a _translate_newlines method for this case.

Note that _winapi.WaitForMultipleObjects is interruptible in the main thread via Ctrl+C, which is a bonus improvement since Thread.join() can't be interrupted in Windows.
msg350578 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-26 22:13
Eryk Sun 
> Alternatively, instead of special casing the file type and spinning on PeekNamedPipe, the workaround could be based on a multiple-object wait that includes the child process handle. (...)

Well, maybe we could do something to enhance regrtest, but the initial bug described in this issue should be fixed by my commit de2d9eed8bc628533e1628b843cc4c7a5010f6e5 which added a timeout to the second Popen.communicate() call, but also to the Popen.wait() call (moreover, "with popen:" is no longer used). Both functions use a timeout of 30 seconds which should be short enough to prevent the main regrtest process to block for hours.

When a process is killed and stdout.read() never completes, the stdout.read() thread will likely never completes. But I consider that it's a separated bug that should be addressed in a separated issue.

Jeremy Kloth:
> In short, the PR doesn't change the problem.  The regrtest main will wait indefinitely on the successfully killed process.

Again, my commit de2d9eed8bc628533e1628b843cc4c7a5010f6e5 should fix this behavior.

--

I'm not sure why test_regrtest.test_multiprocessing_timeout() failed with a timeout on x86 Windows7 3.x; see msg350060.

Maybe my commit 767434c39c8f3c6a8af1b3282d8382ccf809fe21 will give more information if the test fails again on this buildbot.

I will backport regrtest changes to other 3.7 and 3.8 branches. I wait a few more days to see if buildbots are stable with latest regrtest changes.
msg351399 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-09 10:01
Crap, test_regrtest.test_multiprocessing_timeout() hangs sometimes (randomly), and then support.temp_cwd() fails on rmtree() with "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process".


running: test_regrtest (14 min 24 sec)
running: test_regrtest (14 min 54 sec)
0:24:35 load avg: 2.02 [419/419/1] test_regrtest crashed (Exit code 1)
Timeout (0:15:00)!
Thread 0x000017d0 (most recent call first):
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 1353 in _readerthread
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 882 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 944 in _bootstrap_inner
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 902 in _bootstrap

Thread 0x00000a30 (most recent call first):
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 1039 in _wait_for_tstate_lock
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 1023 in join
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 1382 in _communicate
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 1015 in communicate
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 491 in run
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 504 in run_command
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 529 in run_python
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 680 in run_tests
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 1178 in test_multiprocessing_timeout

(...)

1 re-run test:
    test_regrtest

Total duration: 25 min 20 sec
Tests result: FAILURE then SUCCESS
Traceback (most recent call last):
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1013, in temp_dir
    yield path
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1065, in temp_cwd
    yield cwd_dir
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 628, in main
    self._main(tests, kwargs)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 690, in _main
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 330, in _force_run
    return func(*args)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\buildbot.python.org\\3.x.kloth-win64\\build\\build\\test_python_4376\\test_python_worker_5136'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\__main__.py", line 2, in <module>
    main()
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 695, in main
    Regrtest().main(tests=tests, **kwargs)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 628, in main
    self._main(tests, kwargs)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1065, in temp_cwd
    yield cwd_dir
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1018, in temp_dir
    rmtree(path)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 452, in rmtree
    _rmtree(path)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 393, in _rmtree
    _waitfor(_rmtree_inner, path, waitall=True)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 341, in _waitfor
    func(pathname)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 390, in _rmtree_inner
    _force_run(fullname, os.rmdir, fullname)
  File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 336, in _force_run
    return func(*args)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\buildbot.python.org\\3.x.kloth-win64\\build\\build\\test_python_4376\\test_python_worker_5136'
msg351517 - (view) Author: Jeremy Kloth (jkloth) * Date: 2019-09-09 16:18
Well, the kill timeout doesn't seem to be working, at least completely:

https://buildbot.python.org/all/#/builders/40/builds/3012

The worker process has been killed (line 562), but regrtest is still waiting.
msg351713 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-10 15:54
New changeset 494b61aeec8e583755c8ff3a147a58657353bbc3 by Victor Stinner in branch '3.7':
[3.7] bpo-37531: Enhance regrtest multiprocess timeout (GH-15345) (GH-15874)
https://github.com/python/cpython/commit/494b61aeec8e583755c8ff3a147a58657353bbc3
msg351714 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-10 15:54
New changeset d42a4fdc630c54352701a466a9e512bee68b5c48 by Victor Stinner in branch '3.8':
bpo-37531: Enhance regrtest multiprocess timeout (GH-15345) (GH-15871)
https://github.com/python/cpython/commit/d42a4fdc630c54352701a466a9e512bee68b5c48
msg351716 - (view) Author: Jeremy Kloth (jkloth) * Date: 2019-09-10 15:56
Another day, another stuck test_concurrent_futures...

https://buildbot.python.org/all/#/builders/40/builds/3030

The test process is again killed (line 568) but the processes from the multiprocess pool are still alive.  Once I manually kill those pool processes, regrtest resumes as it should.
msg351815 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-11 10:55
See also bpo-37424.
msg352273 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-13 09:54
https://buildbot.python.org/all/#/builders/58/builds/2987

1:06:55 load avg: 3.39 [343/419/1] test_regrtest crashed (Exit code 1)
Timeout (0:35:00)!
Thread 0x00000cdc (most recent call first):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1366 in _readerthread
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 882 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 944 in _bootstrap_inner
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 902 in _bootstrap

Thread 0x00000b24 (most recent call first):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1039 in _wait_for_tstate_lock
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1023 in join
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1395 in _communicate
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1024 in communicate
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 491 in run
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 504 in run_command
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 529 in run_python
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 680 in run_tests
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 1178 in test_multiprocessing_timeout
  ...
msg352619 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-17 08:08
New changeset 46b0b81220a23bc4aee5ba3ba67e8cf1b5df7960 by Victor Stinner in branch 'master':
bpo-37531: regrtest main process uses shorter timeout (GH-16220)
https://github.com/python/cpython/commit/46b0b81220a23bc4aee5ba3ba67e8cf1b5df7960
msg352633 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-17 12:34
New changeset 6591b4bbb1c0b9c26b99e4b2dba1e5cc8546732d by Victor Stinner (Miss Islington (bot)) in branch '3.7':
bpo-37531: regrtest main process uses shorter timeout (GH-16220) (GH-16223)
https://github.com/python/cpython/commit/6591b4bbb1c0b9c26b99e4b2dba1e5cc8546732d
msg352632 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-17 12:34
New changeset 5f1590d5e679f4dd0b611ef54ae512f137e78f1b by Victor Stinner (Miss Islington (bot)) in branch '3.8':
bpo-37531: regrtest main process uses shorter timeout (GH-16220) (GH-16224)
https://github.com/python/cpython/commit/5f1590d5e679f4dd0b611ef54ae512f137e78f1b
msg352674 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-17 23:31
I created bpo-38207 "subprocess: On Windows, Popen.kill() + Popen.communicate() is blocking until all processes using the pipe close the pipe".
msg352679 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-17 23:58
New changeset b9877cd2cc47b6f3512c171814c4f630286279b9 by Victor Stinner in branch 'master':
bpo-37531: Skip test_regrtest.test_multiprocessing_timeout() on Windows (GH-16247)
https://github.com/python/cpython/commit/b9877cd2cc47b6f3512c171814c4f630286279b9
msg352726 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-18 13:30
Oh wait, test_regrtest.test_multiprocessing_timeout() now also hangs on FreeBSD? :-(

AMD64 FreeBSD CURRENT Shared 3.8:
https://buildbot.python.org/all/#/builders/212/builds/226

running: test_regrtest (24 min 52 sec)
0:32:54 load avg: 0.26 [423/423/1] test_regrtest crashed (Exit code 1)
Timeout (0:25:00)!
Thread 0x0000000800ac0000 (most recent call first):
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/selectors.py", line 415 in select
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/subprocess.py", line 1866 in _communicate
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/subprocess.py", line 1024 in communicate
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/subprocess.py", line 491 in run
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 504 in run_command
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 529 in run_python
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 676 in run_tests
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 1174 in test_multiprocessing_timeout
  File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/unittest/case.py", line 633 in _callTestMethod
msg352730 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-18 14:26
It seems like the new regrtest design doesn't work as I expected.

Calling popen.kill() + popen.stdout.close() + popen.stderr.close() from a thread B does not always interrupt popen.communicate() in thread A.

See https://bugs.python.org/issue38207#msg352729 for an example on Linux where communicate() still blocks until all child processes complete.

I see different options:

* Revert changes to restore regrtest old design which didn't have these new issues

* Find a way to fix bpo-38207 on all platforms: be able to interrupt communicate() as soon as the process is killed and/or when all pipes are closed.

* Call communicate() with shorter timeout to workaround the blocking communicate() issue (bpo-38207).

* Maybe experiment asyncio which supports asynchronous subprocess.

asyncio subprocess uses overlapped operations which can be cancelled. So maybe it isn't affected by bpo-38207.
msg352796 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-19 14:38
New changeset b2dd2dd6e9212ba6b03885b998736a180ef283d6 by Victor Stinner in branch 'master':
bpo-37531: Skip test_regrtest.test_multiprocessing_timeout() on all platforms (GH-16282)
https://github.com/python/cpython/commit/b2dd2dd6e9212ba6b03885b998736a180ef283d6
msg352804 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-19 16:05
New changeset fb7746d5d10ec4a34198da672018ba15f5667079 by Victor Stinner in branch '3.8':
bpo-37531: sync regrtest with master branch (GH-16285)
https://github.com/python/cpython/commit/fb7746d5d10ec4a34198da672018ba15f5667079
msg352806 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-19 16:40
New changeset 5e1400a6bcbb3350a6665176980a2b8343075c63 by Victor Stinner in branch '3.7':
bpo-37531: sync regrtest with master branch (GH-16285) (GH-16289)
https://github.com/python/cpython/commit/5e1400a6bcbb3350a6665176980a2b8343075c63
History
Date User Action Args
2019-09-19 16:40:50vstinnersetmessages: + msg352806
2019-09-19 16:19:03vstinnersetpull_requests: + pull_request15874
2019-09-19 16:05:12vstinnersetmessages: + msg352804
2019-09-19 14:41:15vstinnersetpull_requests: + pull_request15870
2019-09-19 14:38:23vstinnersetmessages: + msg352796
2019-09-19 10:06:00vstinnersetpull_requests: + pull_request15867
2019-09-18 14:26:01vstinnersetmessages: + msg352730
2019-09-18 13:30:40vstinnersetmessages: + msg352726
2019-09-17 23:58:32vstinnersetmessages: + msg352679
2019-09-17 23:35:04vstinnersetpull_requests: + pull_request15843
2019-09-17 23:31:22vstinnersetmessages: + msg352674
2019-09-17 12:34:17vstinnersetmessages: + msg352632
2019-09-17 12:34:17vstinnersetmessages: + msg352633
2019-09-17 10:15:43miss-islingtonsetpull_requests: + pull_request15824
2019-09-17 10:15:36miss-islingtonsetpull_requests: + pull_request15823
2019-09-17 08:08:30vstinnersetmessages: + msg352619
2019-09-17 07:22:02vstinnersetpull_requests: + pull_request15821
2019-09-13 09:54:35vstinnersetmessages: + msg352273
2019-09-11 10:55:34vstinnersetmessages: + msg351815
2019-09-10 15:56:56jklothsetmessages: + msg351716
2019-09-10 15:54:54vstinnersetmessages: + msg351714
2019-09-10 15:54:43vstinnersetmessages: + msg351713
2019-09-10 15:11:07vstinnersetpull_requests: + pull_request15514
2019-09-10 15:07:08vstinnersetpull_requests: + pull_request15513
2019-09-09 16:18:18jklothsetmessages: + msg351517
2019-09-09 10:01:14vstinnersetmessages: + msg351399
2019-08-26 22:13:36vstinnersetmessages: + msg350578
2019-08-23 19:19:06eryksunsetmessages: + msg350329
2019-08-23 10:27:27vstinnersetmessages: + msg350272
2019-08-23 10:00:53vstinnersetpull_requests: + pull_request15119
2019-08-23 03:04:25eryksunsetmessages: + msg350246
2019-08-22 22:12:36vstinnersetmessages: + msg350233
2019-08-22 15:00:41eryksunsetmessages: + msg350191
2019-08-22 10:48:57vstinnersetfiles: + test_kill.py

messages: + msg350183
2019-08-22 10:44:50vstinnersetfiles: + kill_timeout.py

messages: + msg350181
2019-08-21 23:22:52jklothsetmessages: + msg350132
2019-08-21 10:24:01vstinnersetmessages: + msg350061
2019-08-21 10:21:13vstinnersetmessages: + msg350060
2019-08-21 09:59:23vstinnersetmessages: + msg350058
2019-08-21 09:57:15vstinnersetmessages: + msg350057
2019-08-21 02:39:36jklothsetnosy: + eryksun
messages: + msg350044
2019-08-20 17:26:08vstinnersetmessages: + msg350023
2019-08-20 17:21:28vstinnersetstage: resolved -> patch review
pull_requests: + pull_request15062
2019-08-20 16:04:04vstinnersetstatus: closed -> open
resolution: fixed ->
messages: + msg350020
2019-08-20 15:55:09jeremy.klothsetnosy: + jeremy.kloth
messages: + msg350017
2019-08-20 15:39:55vstinnersetmessages: + msg350014
2019-08-20 15:33:04vstinnersetmessages: + msg350012
2019-08-20 14:57:04jklothsetmessages: + msg350009
2019-08-14 14:44:39vstinnersetstatus: open -> closed
stage: patch review -> resolved
resolution: fixed
components: + Tests
versions: + Python 3.7, Python 3.8
2019-08-14 14:44:09vstinnersetmessages: + msg349708
2019-08-14 14:31:35vstinnersetmessages: + msg349704
2019-08-14 14:30:59vstinnersetmessages: + msg349703
2019-08-14 14:04:08vstinnersetpull_requests: + pull_request15004
2019-08-14 13:56:14vstinnersetpull_requests: + pull_request15002
2019-08-14 12:18:57vstinnersetmessages: + msg349689
2019-08-14 10:46:31vstinnersetpull_requests: + pull_request14996
2019-07-30 01:20:12jklothsetnosy: + jkloth
2019-07-29 23:48:28vstinnerlinkissue37313 superseder
2019-07-10 12:16:31nanjekyejoannahsetkeywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request14491
2019-07-09 19:34:22nanjekyejoannahcreate