Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regrtest timeout for subprocesses: regrtest -jN --timeout=SECONDS #81712

Closed
nanjekyejoannah opened this issue Jul 9, 2019 · 51 comments
Closed
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes tests Tests in the Lib/test dir type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@nanjekyejoannah
Copy link
Member

BPO 37531
Nosy @vstinner, @jkloth, @koobs, @eryksun, @miss-islington, @nanjekyejoannah
PRs
  • bpo-37531: Fix regrtest timeout for subprocesses #14679
  • bpo-37531: Fix regrtest timeout for subprocesses #15072
  • [3.8] bpo-37531: Fix regrtest timeout for subprocesses (GH-15072) #15279
  • [3.7] bpo-37531: Fix regrtest timeout for subprocesses (GH-15072) #15280
  • bpo-37531: Enhance regrtest multiprocess timeout #15345
  • bpo-37531: Fix regrtest _timedout() function on timeout #15419
  • [3.8] bpo-37531: Enhance regrtest multiprocess timeout (GH-15345) #15871
  • [3.7] bpo-37531: Enhance regrtest multiprocess timeout (GH-15345) #15874
  • bpo-37531: regrtest main process uses shorter timeout #16220
  • [3.7] bpo-37531: regrtest main process uses shorter timeout (GH-16220) #16223
  • [3.8] bpo-37531: regrtest main process uses shorter timeout (GH-16220) #16224
  • bpo-37531: Skip test_regrtest.test_multiprocessing_timeout() on Windows #16247
  • bpo-37531: Skip test_regrtest.test_multiprocessing_timeout() on all platforms #16282
  • [3.8] bpo-37531: sync regrtest with master branch #16285
  • [3.7] bpo-37531: sync regrtest with master branch (GH-16285) #16289
  • bpo-37531: regrtest ignores output on timeout #16659
  • [3.8] bpo-37531: regrtest ignores output on timeout (GH-16659) #16676
  • [3.7] bpo-37531: regrtest ignores output on timeout (GH-16659) #16677
  • bpo-37531: regrtest now catchs ProcessLookupError #16827
  • Files
  • kill_timeout.py
  • test_kill.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-10-09.15:06:10.652>
    created_at = <Date 2019-07-09.19:34:22.877>
    labels = ['3.8', '3.7', 'tests', '3.9', 'type-crash']
    title = 'Fix regrtest timeout for subprocesses: regrtest -jN --timeout=SECONDS'
    updated_at = <Date 2020-04-14.23:05:24.157>
    user = 'https://github.com/nanjekyejoannah'

    bugs.python.org fields:

    activity = <Date 2020-04-14.23:05:24.157>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-10-09.15:06:10.652>
    closer = 'vstinner'
    components = ['Tests']
    creation = <Date 2019-07-09.19:34:22.877>
    creator = 'nanjekyejoannah'
    dependencies = []
    files = ['48555', '48556']
    hgrepos = []
    issue_num = 37531
    keywords = ['patch']
    message_count = 51.0
    messages = ['347576', '349689', '349703', '349704', '349708', '350009', '350012', '350014', '350017', '350020', '350023', '350044', '350057', '350058', '350060', '350061', '350132', '350181', '350183', '350191', '350233', '350246', '350272', '350329', '350578', '351399', '351517', '351713', '351714', '351716', '351815', '352273', '352619', '352632', '352633', '352674', '352679', '352726', '352730', '352796', '352804', '352806', '354078', '354088', '354223', '354270', '354272', '354277', '354817', '366410', '366459']
    nosy_count = 7.0
    nosy_names = ['vstinner', 'jkloth', 'jeremy.kloth', 'koobs', 'eryksun', 'miss-islington', 'nanjekyejoannah']
    pr_nums = ['14679', '15072', '15279', '15280', '15345', '15419', '15871', '15874', '16220', '16223', '16224', '16247', '16282', '16285', '16289', '16659', '16676', '16677', '16827']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue37531'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @nanjekyejoannah
    Copy link
    Member Author

    Reported by Victor Stinner.

    regrtest has a --timeout parameter which calls
    faulthandler.dump_traceback_later(timeout), but sometimes it's not
    enough to kill a test. regrtest should kill a worker process if it
    runs longer than --timeout seconds.

    @nanjekyejoannah nanjekyejoannah added 3.9 only security fixes type-crash A hard crash of the interpreter, possibly with a core dump labels Jul 9, 2019
    @vstinner
    Copy link
    Member

    New changeset b0c8369 by Victor Stinner in branch 'master':
    bpo-37531: Fix regrtest timeout for subprocesses (GH-15072)
    b0c8369

    @vstinner
    Copy link
    Member

    New changeset 93bee6a by Victor Stinner in branch '3.7':
    [3.7] bpo-37531: Fix regrtest timeout for subprocesses (GH-15072) (GH-15280)
    93bee6a

    @vstinner
    Copy link
    Member

    New changeset d85c567 by Victor Stinner in branch '3.8':
    [3.8] bpo-37531: Fix regrtest timeout for subprocesses (GH-15072) (GH-15279)
    d85c567

    @vstinner
    Copy link
    Member

    Thanks Joannah! That's a nice enhancement which should help to stabilize buildbots even more.

    The next step is bpo-37711: "regrtest: re-run failed tests in subprocesses".

    @vstinner vstinner added 3.7 (EOL) end of life 3.8 only security fixes tests Tests in the Lib/test dir labels Aug 14, 2019
    @jkloth
    Copy link
    Contributor

    jkloth commented Aug 20, 2019

    It seems that an issue still exists.

    https://buildbot.python.org/all/#/builders/130/builds/1050

    has been running for nearly 9 hours at this point. I can leave it "stuck" if there is some diagnostics that would be beneficial. But I would prefer not to have this impeding other testing for too long.

    @vstinner
    Copy link
    Member

    I just tested the 3.7 branch manually: regrtest works as expected.

    I'm using this local changes:

    diff --git a/Lib/test/libregrtest/runtest.py b/Lib/test/libregrtest/runtest.py
    index e7dce180cb..7d996d6918 100644
    --- a/Lib/test/libregrtest/runtest.py
    +++ b/Lib/test/libregrtest/runtest.py
    @@ -119,7 +119,7 @@ def _runtest(ns, test_name):
     
         use_timeout = (ns.timeout is not None)
         if use_timeout:
    -        faulthandler.dump_traceback_later(ns.timeout, exit=True)
    +        pass #faulthandler.dump_traceback_later(ns.timeout, exit=True)
     
         start_time = time.perf_counter()
         try:
    diff --git a/Lib/test/test_os.py b/Lib/test/test_os.py
    index 8032da0530..e21399c840 100644
    --- a/Lib/test/test_os.py
    +++ b/Lib/test/test_os.py
    @@ -89,6 +89,7 @@ class FileTests(unittest.TestCase):
         tearDown = setUp
     
         def test_access(self):
    +        import time; time.sleep(3600)
             f = os.open(support.TESTFN, os.O_CREAT|os.O_RDWR)
             os.close(f)
             self.assertTrue(os.access(support.TESTFN, os.W_OK))

    Then I ran:

    vstinner@apu$ ./python -m test -j2 test_os -m test_access --timeout=3
    Run tests in parallel using 2 child processes
    0:00:04 load avg: 0.59 [1/1/1] test_os timed out (4 sec 512 ms)

    == Tests result: FAILURE ==

    1 test failed:
    test_os

    Total duration: 4 sec 545 ms
    Tests result: FAILURE

    As you can see, test_os is killed after 4.5 seconds, as expected (it uses timeout x 1.5, since faulthandler is supposed to be first watchdog with timeout seconds).

    --

    Maybe something is wrong with test_concurrent_futures on Windows. Maybe subprocess.Popen.kill() is unable to kill the process immediately for an unknown reason?

    Or maybe subprocess.Popen.communicate(timeout=self.timeout) didn't fail with TimeoutExpired?

    Sadly, libregrtest is quite when it kills a worker process because of timeout, since the following output should be enough to understand what happened. But here the worker was not killed, or failed to be killed, or something else...

    @vstinner
    Copy link
    Member

    I can leave it "stuck" if there is some diagnostics that would be beneficial.

    Can you check the process hierarchy? I would like to know how many worker processes are still running under the main regrtest process. I expect to see exactly one. I don't know how to investigate more on such issue on Windows.

    @jeremykloth
    Copy link
    Mannequin

    jeremykloth mannequin commented Aug 20, 2019

    Can you check the process hierarchy? I would like to know how many worker processes are still running under the main regrtest process. I expect to see exactly one. I don't know how to investigate more on such issue on Windows.

    There are no child processes of the main test process. There are 5
    orphaned processes from multiprocessing. Their parent process is gone
    (assuming the parent_pid argument is correct). There is also an
    orphaned typeperf.exe process, probably also from the killed process.

    @vstinner
    Copy link
    Member

    (I reopen the issue.)

    "There are no child processes of the main test process. There are 5
    orphaned processes from multiprocessing. Their parent process is gone
    (assuming the parent_pid argument is correct)."

    Oh wow, that's really strange. I would expect that regrtest main process has one worker process: "test_concurrent_futures".

    https://buildbot.python.org/all/#/builders/130/builds/1050

    C:\buildbot.python.org\3.7.kloth-win64\build>"C:\buildbot.python.org\3.7.kloth-win64\build\PCbuild\amd64\python_d.exe" -u -Wd -E -bb -m test -uall -rwW --slowest --timeout 1200 --fail-env-changed -j1 -j2 -j4 --timeout 900
    == CPython 3.7.4+ (heads/3.7:02c1457, Aug 20 2019, 00:03:57) [MSC v.1900 64 bit (AMD64)]
    == Windows-7-6.1.7601-SP1 little-endian
    == cwd: C:\buildbot.python.org\3.7.kloth-win64\build\build\test_python_5524
    == CPU count: 4
    == encodings: locale=cp1252, FS=utf-8
    Using random seed 1058030
    Run tests in parallel using 4 child processes
    0:00:01 load avg: 0.00 [ 1/416] test_call passed
    0:00:02 load avg: 0.00 [ 2/416] test_genexps passed
    ...
    0:12:46 load avg: 2.30 [412/416] test_threaded_import passed -- running: test_concurrent_futures (8 min 42 sec)
    0:12:47 load avg: 2.30 [413/416] test_email passed -- running: test_concurrent_futures (8 min 43 sec)
    0:12:50 load avg: 2.19 [414/416] test_bytes passed -- running: test_concurrent_futures (8 min 46 sec)
    0:12:55 load avg: 2.02 [415/416] test_timeout passed -- running: test_concurrent_futures (8 min 52 sec)
    running: test_concurrent_futures (9 min 22 sec)
    running: test_concurrent_futures (9 min 52 sec)
    running: test_concurrent_futures (10 min 22 sec)
    ...
    running: test_concurrent_futures (9 hour 51 min)
    running: test_concurrent_futures (9 hour 51 min)
    running: test_concurrent_futures (9 hour 52 min)
    running: test_concurrent_futures (9 hour 52 min)
    running: test_concurrent_futures (9 hour 53 min)

    @vstinner vstinner reopened this Aug 20, 2019
    @vstinner
    Copy link
    Member

    I wrote PR 15345 to enhance regrtest timeout: more logs, more timeouts.

    @jkloth
    Copy link
    Contributor

    jkloth commented Aug 21, 2019

    Additional logging on failures is always welcome. Might I suggest that, in this case, regrtest treats this action as a hard fail so as to not got lost in the other transient failures (test_asyncio).

    By that I mean, either to not re-run, or to still be "failed" even after a successful re-run as this test_concurrent mishap is rare enough that I doubt it would happen twice in one build.

    Now for a brain dump.

    The process tree at this time:

    python.exe (buildbot)
    . cmd.exe (test step)
    .. python_d.exe (regrtest main)
    ... typepref.exe (load)
    ... <already terminated process>
    .... typepref.exe (load for test_concurrent_futures)
    .... python_d.exe (multiprocessing pool)
    .... python_d.exe (multiprocessing pool)
    .... python_d.exe (multiprocessing pool)
    .... python_d.exe (multiprocessing pool)
    .... python_d.exe (multiprocessing pool)

    1. To see how process termination played out, I killed the main regrtest process. This did not let the buildbot complete, but just sit there without any output. Also the typepref.exe process (and the failed multiprocessing pool processes) still lingered.

    2. I then terminated the children* of the already terminated process. No change to the buildbot.

    [*] Windows doesn't really have the concept of a process tree

    1. I then terminated the final typepref.exe process. The buildbot finally finished its test step.

    Given that typepref.exe also prevents completion, I believe the problem is in how we are spawning subprocesses. Maybe a process handle is being inherited by the spawned processes preventing it from fully terminating?

    I mention the process handle as it was listed as an open handle in Process Explorer for the regrtest process even though it had been terminated prior.

    @vstinner
    Copy link
    Member

    .... typepref.exe (load for test_concurrent_futures)

    I don't understand why you get 2 typepref.exe process: only the main process should spawn one. Worker processes don't need it and should not spawn it.

    @vstinner
    Copy link
    Member

    New changeset de2d9ee by Victor Stinner in branch 'master':
    bpo-37531: Enhance regrtest multiprocess timeout (GH-15345)
    de2d9ee

    @vstinner
    Copy link
    Member

    Oh, interesting: test_regrtest.test_multiprocessing_timeout() failed with a timeout on x86 Windows7 3.x.

    This build tested commit d33e46d which is older than my "bpo-37531: Enhance regrtest multiprocess timeout (GH-15345)" change (older than commit de2d9ee).

    https://buildbot.python.org/all/#/builders/58/builds/2889

    running: test_regrtest (32 min 8 sec)
    running: test_regrtest (32 min 38 sec)
    running: test_regrtest (33 min 8 sec)
    running: test_regrtest (33 min 38 sec)
    running: test_regrtest (34 min 8 sec)
    running: test_regrtest (34 min 38 sec)
    1:19:18 load avg: 1.19 [419/419/1] test_regrtest crashed (Exit code 1)
    Timeout (0:35:00)!
    Thread 0x00000ddc (most recent call first):
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1353 in _readerthread
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 876 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 938 in _bootstrap_inner
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 896 in _bootstrap

    Thread 0x00000e68 (most recent call first):
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1033 in _wait_for_tstate_lock
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1017 in join
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1382 in _communicate
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1015 in communicate
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 491 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 504 in run_command
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 529 in run_python
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 680 in run_tests
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 1178 in test_multiprocessing_timeout
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\case.py", line 611 in _callTestMethod
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\case.py", line 654 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\case.py", line 714 in __call__
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 122 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 84 in __call__
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 122 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 84 in __call__
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 122 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\suite.py", line 84 in __call__
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\unittest\runner.py", line 176 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\support\init.py", line 1996 in _run_suite
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\support\init.py", line 2092 in run_unittest
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 209 in _test_module
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 234 in _runtest_inner2
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 270 in _runtest_inner
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 140 in _runtest
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest.py", line 193 in runtest
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\runtest_mp.py", line 66 in run_tests_worker
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\main.py", line 642 in _main
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\main.py", line 628 in main
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\libregrtest\main.py", line 695 in main
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\regrtest.py", line 43 in _main
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\regrtest.py", line 47 in <module>
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\runpy.py", line 85 in _run_code

    == Tests result: FAILURE ==

    @vstinner
    Copy link
    Member

    x86 Windows7 3.x
    https://buildbot.python.org/all/#/builders/58/builds/2889

    This buildbot looks slow: the 3 latest builds (2886, 2887, 2888) took around 1 hour. Example:

    10 slowest tests:

    • test_multiprocessing_spawn: 15 min 31 sec
    • test_zipfile: 8 min 45 sec
    • test_concurrent_futures: 6 min 32 sec
    • test_pickle: 4 min 7 sec
    • test_regrtest: 3 min 47 sec
    • test_lzma: 3 min 13 sec
    • test_tarfile: 3 min 9 sec
    • test_capi: 2 min 54 sec
    • test_asyncio: 2 min 49 sec
    • test_venv: 1 min 57 sec

    Total duration: 1 hour 4 min

    @jkloth
    Copy link
    Contributor

    jkloth commented Aug 21, 2019

    I happened to catch a stuck build prior to the process being killed:

    https://buildbot.python.org/all/#/builders//builds/2887

    In short, the PR doesn't change the problem. The regrtest main will wait indefinitely on the successfully killed process.

    I have some ideas to try, but creating a reproducer is not exactly easy...

    @vstinner
    Copy link
    Member

    Oh, I found an issue with regrtest and Windows processes.

    On Windows, when a process dies (crash or whatever), Popen.communicate() continues to hang in the parent if the died process has at least one child process which is still alive.

    Test attached kill_timeout.py:
    ---
    vstinner@WIN C:\vstinner\python\master>python x.py
    Running Debug|x64 interpreter...
    parent pid 1600
    child1 pid 2184
    child2 pid 4516
    communicate(): timeout (1)
    child1 killed
    wait() returned: returncode 1
    communicate() timeout (2)
    ---

    Calling Popen.communicate() hangs even if the child process has been killed, whereas wait() completes immediately.

    On Windows, Popen.communicate() is implemented with threads calling this function on each pipe:

            def _readerthread(self, fh, buffer):
                buffer.append(fh.read())
                fh.close()

    It seems like self.stdout.read() hangs even after the child process has been killed.

    @vstinner
    Copy link
    Member

    The issue can be reproduced with regrtest:

    • Copy attached test_kill.py to Lib/test/
    • Run: python -m test test_kill -v -j2 --timeout=5

    I also applied the following changes to regrtest:

    diff --git a/Lib/test/libregrtest/runtest.py b/Lib/test/libregrtest/runtest.py
    index e7dce180cb..7d996d6918 100644
    --- a/Lib/test/libregrtest/runtest.py
    +++ b/Lib/test/libregrtest/runtest.py
    @@ -119,7 +119,7 @@ def _runtest(ns, test_name):

         use_timeout = (ns.timeout is not None)
         if use_timeout:
    -        faulthandler.dump_traceback_later(ns.timeout, exit=True)
    +        pass #faulthandler.dump_traceback_later(ns.timeout, exit=True)
         start_time = time.perf_counter()
         try:
    diff --git a/Lib/test/libregrtest/runtest_mp.py b/Lib/test/libregrtest/runtest_mp.py
    index c22479b797..51352497ae 100644
    --- a/Lib/test/libregrtest/runtest_mp.py
    +++ b/Lib/test/libregrtest/runtest_mp.py
    @@ -19,10 +19,10 @@ from test.libregrtest.utils import format_duration

    # Display the running tests if nothing happened last N seconds
    -PROGRESS_UPDATE = 30.0 # seconds
    +PROGRESS_UPDATE = 1.0 # seconds

    # Time to wait until a worker completes: should be immediate
    -JOIN_TIMEOUT = 30.0 # seconds
    +JOIN_TIMEOUT = 5.0 # seconds

     def must_stop(result, ns):
    @@ -305,7 +305,7 @@ class MultiprocessRunner:
             self.pending = MultiprocessIterator(self.regrtest.tests)
             if self.ns.timeout is not None:
                 self.worker_timeout = self.ns.timeout * 1.5
    -            self.main_timeout = self.ns.timeout * 2.0
    +            self.main_timeout = self.ns.timeout * 10.0
             else:
                 self.worker_timeout = None
                 self.main_timeout = None

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 22, 2019

    It seems like self.stdout.read() hangs even after the child process
    has been killed.

    This is an issue when the standard handles are inherited or duplicated to a grandchild process, and so on. In the case of Popen(sys.executable), the system is duplicating the standard handles implicitly because sys.executable is a console process (assuming it's python[_d].exe), among other criteria. An additional reference on the write side prevents the pipe from closing. fh.read() in the grandparent won't return as long as there's a writer that could potentially write more data.

    @vstinner
    Copy link
    Member

    This is an issue when the standard handles are inherited or duplicated to a grandchild process, and so on. In the case of Popen(sys.executable), the system is duplicating the standard handles implicitly because sys.executable is a console process (assuming it's python[_d].exe), among other criteria. An additional reference on the write side prevents the pipe from closing. fh.read() in the grandparent won't return as long as there's a writer that could potentially write more data.

    Is there a way to workaround that?

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 23, 2019

    Is there a way to workaround that?

    For Windows, subprocess could have a _read_all(file) method that special cases a pipe. The read loop for a pipe would check whether the child has exited. Then call _winapi.PeekNamedPipe on the handle (from get_osfhandle), and do a raw read of the available bytes. If the child has exited or PeekNamedPipe fails (EPIPE), then break, join the partial reads, decode and translate newlines if it's text mode, and return.

    @vstinner
    Copy link
    Member

    New changeset 767434c by Victor Stinner in branch 'master':
    bpo-37531: Fix regrtest _timedout() function on timeout (GH-15419)
    767434c

    @eryksun
    Copy link
    Contributor

    eryksun commented Aug 23, 2019

    Alternatively, instead of special casing the file type and spinning on PeekNamedPipe, the workaround could be based on a multiple-object wait that includes the child process handle. In this case, communicate() would always call _communicate in Windows, regardless of the timeout or number of pipes -- because simplistically calling either self.stdout.read() or self.stderr.read() could hang.

    The implementation would be moderately complicated. If we stop waiting on the reader threads because the process exited, we can give the threads a short time to finish reading and close the files -- maybe 250 ms is enough. But if they haven't exited at this time, we can't simply raise a TimeoutExpired exception if the call hasn't actually timed out. To finish the _communicate call, we would have to cancel the pending reads and join() the threads.

    We can force a reader thread to exit by canceling the read via WINAPI CancelIoEx. However, _readerthread has to be modified to support this. It could be implemented as a loop that calls _winapi.ReadFile to read the output in chunks that get appended to the buffer list. The read loop would break for either ERROR_BROKEN_PIPE or ERROR_OPERATION_ABORTED (canceled).

    The final step in _communicate would be to concatenate the partial reads. If it's text mode, it would also have to decode the bytes and translate newlines. The POSIX implementation of _communicate has to do this, so we already have a _translate_newlines method for this case.

    Note that _winapi.WaitForMultipleObjects is interruptible in the main thread via Ctrl+C, which is a bonus improvement since Thread.join() can't be interrupted in Windows.

    @vstinner
    Copy link
    Member

    Eryk Sun

    Alternatively, instead of special casing the file type and spinning on PeekNamedPipe, the workaround could be based on a multiple-object wait that includes the child process handle. (...)

    Well, maybe we could do something to enhance regrtest, but the initial bug described in this issue should be fixed by my commit de2d9ee which added a timeout to the second Popen.communicate() call, but also to the Popen.wait() call (moreover, "with popen:" is no longer used). Both functions use a timeout of 30 seconds which should be short enough to prevent the main regrtest process to block for hours.

    When a process is killed and stdout.read() never completes, the stdout.read() thread will likely never completes. But I consider that it's a separated bug that should be addressed in a separated issue.

    Jeremy Kloth:

    In short, the PR doesn't change the problem. The regrtest main will wait indefinitely on the successfully killed process.

    Again, my commit de2d9ee should fix this behavior.

    --

    I'm not sure why test_regrtest.test_multiprocessing_timeout() failed with a timeout on x86 Windows7 3.x; see msg350060.

    Maybe my commit 767434c will give more information if the test fails again on this buildbot.

    I will backport regrtest changes to other 3.7 and 3.8 branches. I wait a few more days to see if buildbots are stable with latest regrtest changes.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 9, 2019

    Crap, test_regrtest.test_multiprocessing_timeout() hangs sometimes (randomly), and then support.temp_cwd() fails on rmtree() with "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process".

    running: test_regrtest (14 min 24 sec)
    running: test_regrtest (14 min 54 sec)
    0:24:35 load avg: 2.02 [419/419/1] test_regrtest crashed (Exit code 1)
    Timeout (0:15:00)!
    Thread 0x000017d0 (most recent call first):
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 1353 in _readerthread
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 882 in run
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 944 in _bootstrap_inner
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 902 in _bootstrap

    Thread 0x00000a30 (most recent call first):
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 1039 in _wait_for_tstate_lock
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\threading.py", line 1023 in join
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 1382 in _communicate
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 1015 in communicate
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\subprocess.py", line 491 in run
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 504 in run_command
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 529 in run_python
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 680 in run_tests
    File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\test_regrtest.py", line 1178 in test_multiprocessing_timeout

    (...)

    1 re-run test:
    test_regrtest

    Total duration: 25 min 20 sec
    Tests result: FAILURE then SUCCESS
    Traceback (most recent call last):
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1013, in temp_dir
        yield path
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1065, in temp_cwd
        yield cwd_dir
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 628, in main
        self._main(tests, kwargs)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 690, in _main
        sys.exit(0)
    SystemExit: 0
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 330, in _force_run
        return func(*args)
    PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\buildbot.python.org\\3.x.kloth-win64\\build\\build\\test_python_4376\\test_python_worker_5136'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 192, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\__main__.py", line 2, in <module>
        main()
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 695, in main
        Regrtest().main(tests=tests, **kwargs)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\libregrtest\main.py", line 628, in main
        self._main(tests, kwargs)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\contextlib.py", line 131, in __exit__
        self.gen.throw(type, value, traceback)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1065, in temp_cwd
        yield cwd_dir
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\contextlib.py", line 131, in __exit__
        self.gen.throw(type, value, traceback)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 1018, in temp_dir
        rmtree(path)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 452, in rmtree
        _rmtree(path)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 393, in _rmtree
        _waitfor(_rmtree_inner, path, waitall=True)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 341, in _waitfor
        func(pathname)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 390, in _rmtree_inner
        _force_run(fullname, os.rmdir, fullname)
      File "C:\buildbot.python.org\3.x.kloth-win64\build\lib\test\support\__init__.py", line 336, in _force_run
        return func(*args)
    PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\buildbot.python.org\\3.x.kloth-win64\\build\\build\\test_python_4376\\test_python_worker_5136'

    @jkloth
    Copy link
    Contributor

    jkloth commented Sep 9, 2019

    Well, the kill timeout doesn't seem to be working, at least completely:

    https://buildbot.python.org/all/#/builders/40/builds/3012

    The worker process has been killed (line 562), but regrtest is still waiting.

    @vstinner
    Copy link
    Member

    New changeset 494b61a by Victor Stinner in branch '3.7':
    [3.7] bpo-37531: Enhance regrtest multiprocess timeout (GH-15345) (GH-15874)
    494b61a

    @vstinner
    Copy link
    Member

    New changeset d42a4fd by Victor Stinner in branch '3.8':
    bpo-37531: Enhance regrtest multiprocess timeout (GH-15345) (GH-15871)
    d42a4fd

    @jkloth
    Copy link
    Contributor

    jkloth commented Sep 10, 2019

    Another day, another stuck test_concurrent_futures...

    https://buildbot.python.org/all/#/builders/40/builds/3030

    The test process is again killed (line 568) but the processes from the multiprocess pool are still alive. Once I manually kill those pool processes, regrtest resumes as it should.

    @vstinner
    Copy link
    Member

    See also bpo-37424.

    @vstinner
    Copy link
    Member

    https://buildbot.python.org/all/#/builders/58/builds/2987

    1:06:55 load avg: 3.39 [343/419/1] test_regrtest crashed (Exit code 1)
    Timeout (0:35:00)!
    Thread 0x00000cdc (most recent call first):
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1366 in _readerthread
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 882 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 944 in _bootstrap_inner
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 902 in _bootstrap

    Thread 0x00000b24 (most recent call first):
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1039 in _wait_for_tstate_lock
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\threading.py", line 1023 in join
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1395 in _communicate
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 1024 in communicate
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\subprocess.py", line 491 in run
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 504 in run_command
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 529 in run_python
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 680 in run_tests
    File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows7\build\lib\test\test_regrtest.py", line 1178 in test_multiprocessing_timeout
    ...

    @vstinner
    Copy link
    Member

    New changeset 46b0b81 by Victor Stinner in branch 'master':
    bpo-37531: regrtest main process uses shorter timeout (GH-16220)
    46b0b81

    @vstinner
    Copy link
    Member

    New changeset 5f1590d by Victor Stinner (Miss Islington (bot)) in branch '3.8':
    bpo-37531: regrtest main process uses shorter timeout (GH-16220) (GH-16224)
    5f1590d

    @vstinner
    Copy link
    Member

    New changeset 6591b4b by Victor Stinner (Miss Islington (bot)) in branch '3.7':
    bpo-37531: regrtest main process uses shorter timeout (GH-16220) (GH-16223)
    6591b4b

    @vstinner
    Copy link
    Member

    I created bpo-38207 "subprocess: On Windows, Popen.kill() + Popen.communicate() is blocking until all processes using the pipe close the pipe".

    @vstinner
    Copy link
    Member

    New changeset b9877cd by Victor Stinner in branch 'master':
    bpo-37531: Skip test_regrtest.test_multiprocessing_timeout() on Windows (GH-16247)
    b9877cd

    @vstinner
    Copy link
    Member

    Oh wait, test_regrtest.test_multiprocessing_timeout() now also hangs on FreeBSD? :-(

    AMD64 FreeBSD CURRENT Shared 3.8:
    https://buildbot.python.org/all/#/builders/212/builds/226

    running: test_regrtest (24 min 52 sec)
    0:32:54 load avg: 0.26 [423/423/1] test_regrtest crashed (Exit code 1)
    Timeout (0:25:00)!
    Thread 0x0000000800ac0000 (most recent call first):
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/selectors.py", line 415 in select
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/subprocess.py", line 1866 in _communicate
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/subprocess.py", line 1024 in communicate
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/subprocess.py", line 491 in run
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 504 in run_command
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 529 in run_python
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 676 in run_tests
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/test/test_regrtest.py", line 1174 in test_multiprocessing_timeout
    File "/usr/home/buildbot/python/3.8.koobs-freebsd-current/build/Lib/unittest/case.py", line 633 in _callTestMethod

    @vstinner
    Copy link
    Member

    It seems like the new regrtest design doesn't work as I expected.

    Calling popen.kill() + popen.stdout.close() + popen.stderr.close() from a thread B does not always interrupt popen.communicate() in thread A.

    See https://bugs.python.org/issue38207#msg352729 for an example on Linux where communicate() still blocks until all child processes complete.

    I see different options:

    • Revert changes to restore regrtest old design which didn't have these new issues

    • Find a way to fix bpo-38207 on all platforms: be able to interrupt communicate() as soon as the process is killed and/or when all pipes are closed.

    • Call communicate() with shorter timeout to workaround the blocking communicate() issue (bpo-38207).

    • Maybe experiment asyncio which supports asynchronous subprocess.

    asyncio subprocess uses overlapped operations which can be cancelled. So maybe it isn't affected by bpo-38207.

    @vstinner
    Copy link
    Member

    New changeset b2dd2dd by Victor Stinner in branch 'master':
    bpo-37531: Skip test_regrtest.test_multiprocessing_timeout() on all platforms (GH-16282)
    b2dd2dd

    @vstinner
    Copy link
    Member

    New changeset fb7746d by Victor Stinner in branch '3.8':
    bpo-37531: sync regrtest with master branch (GH-16285)
    fb7746d

    @vstinner
    Copy link
    Member

    New changeset 5e1400a by Victor Stinner in branch '3.7':
    bpo-37531: sync regrtest with master branch (GH-16285) (GH-16289)
    5e1400a

    @vstinner
    Copy link
    Member

    vstinner commented Oct 7, 2019

    Oh, regrtest main process was killed by faulthandler watchdog: it was blocked in _get_result() for longer than 1 minute.

    Extract of _get_result():

            use_faulthandler = (self.ns.timeout is not None)
            timeout = PROGRESS_UPDATE
            while True:
                if use_faulthandler:
                    faulthandler.dump_traceback_later(timeout * 2.0, exit=True)
            # wait for a thread
            try:
                return self.output.get(timeout=timeout)
            except queue.Empty:
                pass
    
                # display progress
                running = get_running(self.workers)
                if running and not self.ns.pgo:
                    self.log('running: %s' % ', '.join(running))

    self.ns.timeout is supposed to be 1500.

    I don't understand why regrtest got a timeout of 60 seconds?

    https://buildbot.python.org/all/#/builders/168/builds/1583

    LD_LIBRARY_PATH=/usr/home/buildbot/python/3.x.koobs-freebsd-current/build ./python ./Tools/scripts/run_tests.py -j 1 -u all -W --slowest --fail-env-changed --timeout=1500 -j2 -j4
    == CPython 3.9.0a0 (heads/master:ed8efd8e2c, Oct 7 2019, 21:38:27) [Clang 8.0.1 (tags/RELEASE_801/final 366581)]
    == FreeBSD-13.0-CURRENT-amd64-64bit-ELF little-endian
    == cwd: /usr/home/buildbot/python/3.x.koobs-freebsd-current/build/build/test_python_67537
    == CPU count: 4
    == encodings: locale=UTF-8, FS=utf-8
    Using random seed 1307598
    0:00:00 load avg: 3.41 Run tests in parallel using 4 child processes
    (...)
    0:00:11 load avg: 3.55 [ 12/419] test_extcall passed
    0:00:13 load avg: 3.55 [ 13/419] test_pydoc passed
    0:00:14 load avg: 3.55 [ 14/419] test_devpoll skipped
    test_devpoll skipped -- test works only on Solaris OS family
    0:00:15 load avg: 3.55 [ 15/419] test_class passed
    0:00:17 load avg: 3.50 [ 16/419] test_ntpath passed
    Timeout (0:01:00)!
    Thread 0x0000000802d30c00 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/selectors.py", line 415 in select
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1964 in _communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1113 in communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 199 in _run_process
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 228 in _runtest
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 265 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 944 in _bootstrap_inner
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 902 in _bootstrap

    Thread 0x0000000802d2fd00 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/selectors.py", line 415 in select
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1964 in _communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1113 in communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 199 in _run_process
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 228 in _runtest
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 265 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 944 in _bootstrap_inner
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 902 in _bootstrap

    Thread 0x0000000802d2f800 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/selectors.py", line 415 in select
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1964 in _communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1113 in communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 199 in _run_process
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 228 in _runtest
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 265 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 944 in _bootstrap_inner
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 902 in _bootstrap

    Thread 0x0000000802d30200 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/selectors.py", line 415 in select
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1964 in _communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/subprocess.py", line 1113 in communicate
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 199 in _run_process
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 228 in _runtest
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 265 in run
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 944 in _bootstrap_inner
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 902 in _bootstrap

    Thread 0x0000000800acc000 (most recent call first):
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/threading.py", line 307 in wait
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/queue.py", line 179 in get
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 362 in _get_result
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 413 in run_tests
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/runtest_mp.py", line 434 in run_tests_multiprocess
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 513 in run_tests
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 687 in _main
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 634 in main
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/libregrtest/main.py", line 712 in main
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/test/main.py", line 2 in <module>
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/runpy.py", line 85 in _run_code
    File "/usr/home/buildbot/python/3.x.koobs-freebsd-current/build/Lib/runpy.py", line 192 in _run_module_as_main
    *** Error code 1

    @vstinner
    Copy link
    Member

    vstinner commented Oct 7, 2019

    I don't understand why regrtest got a timeout of 60 seconds?

    Oh, that comes from "timeout = PROGRESS_UPDATE" with "PROGRESS_UPDATE = 30.0".

    But it means that the main process was stuck for longer than 60 seconds!?

    The main process is supposed to write an update "running: ..." every 30 seconds.

    This buildbot is very slow. Maybe 30 and 60 seconds limits are too low?

    @vstinner
    Copy link
    Member

    vstinner commented Oct 8, 2019

    New changeset 0ec618a by Victor Stinner in branch 'master':
    bpo-37531: regrtest ignores output on timeout (GH-16659)
    0ec618a

    @miss-islington
    Copy link
    Contributor

    New changeset 598bfa4 by Miss Islington (bot) in branch '3.7':
    bpo-37531: regrtest ignores output on timeout (GH-16659)
    598bfa4

    @miss-islington
    Copy link
    Contributor

    New changeset 359a197 by Miss Islington (bot) in branch '3.8':
    bpo-37531: regrtest ignores output on timeout (GH-16659)
    359a197

    @vstinner
    Copy link
    Member

    vstinner commented Oct 9, 2019

    The initial issue has been fixed in branches 3.7, 3.8 and master. I close the issue. Thanks to everybody who help to make these changes possible and helped to debug regressions.

    It's not perfect, regrtest has to workaround bpo-38207 bug, but at least regrtest should no longer runs for several hours or days.

    regrtest should kill a worker process if it runs longer than --timeout seconds.

    The main regrtest process now kills a worker process running longer than timeout seconds. In practice, it's timeout x 1.5, to give time to the faulthandler watchdog to kill the process.

    If regrtest hits bpo-38207 bug, regrtest takes 2 min 30 sec to exit in the worst case.

    But it means that the main process was stuck for longer than 60 seconds!? The main process is supposed to write an update "running: ..." every 30 seconds. This buildbot is very slow. Maybe 30 and 60 seconds limits are too low?

    I changed this timeout from 1 minute to 5 minutes.

    Calling popen.kill() + popen.stdout.close() + popen.stderr.close() from a thread B does not always interrupt popen.communicate() in thread A.

    I propose to continue the discussion in bpo-38207 "subprocess: On Windows, Popen.kill() + Popen.communicate() is blocking until all processes using the pipe close the pipe".

    I implemented workarounds in regrtest so regrtest doesn't block *forever*: it only waits 30 seconds until a worker thread completes, and I added a timeout of 2 minutes on Python exit of the main regrtest process to prevent blocking in threading._shutdown(), because of the blocked threads.

    I don't understand why you get 2 typeperf.exe process: only the main process should spawn one. Worker processes don't need it and should not spawn it.

    I didn't check on my Windows VM if I have the issue. If someone sees more than one typeperf.exe process, please open a separated issue.

    Note: typeperf.exe is the correct spelling (typepref.exe is a typo ;-)).

    @vstinner vstinner closed this as completed Oct 9, 2019
    @vstinner
    Copy link
    Member

    New changeset a661392 by Victor Stinner in branch 'master':
    bpo-37531: regrtest now catchs ProcessLookupError (GH-16827)
    a661392

    @vstinner
    Copy link
    Member

    New changeset 67b8a1f by Victor Stinner in branch '3.8':
    [3.8] Update libregrtest from master (GH-19516)
    67b8a1f

    @vstinner
    Copy link
    Member

    New changeset b894b66 by Victor Stinner in branch '3.7':
    Update libregrtest from master (GH-19517)
    b894b66

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes tests Tests in the Lib/test dir type-crash A hard crash of the interpreter, possibly with a core dump
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants