classification
Title: regrtest: test suite completes but Tests Result is not displayed and the process hangs
Type: Stage: resolved
Components: Tests Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: pablogsal, vstinner
Priority: normal Keywords:

Created on 2019-09-16 19:31 by vstinner, last changed 2019-10-01 10:30 by vstinner. This issue is now closed.

Messages (5)
msg352574 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-16 19:31
https://buildbot.python.org/all/#/builders/225/builds/239

...
0:21:16 load avg: 2.54 [422/423] test_venv passed (1 min 31 sec) -- running: test_tools (1 min 27 sec)
running: test_tools (1 min 57 sec)
running: test_tools (2 min 27 sec)
running: test_tools (2 min 57 sec)
running: test_tools (3 min 27 sec)
0:23:42 load avg: 1.48 [423/423] test_tools passed (3 min 52 sec)
command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2 -j4 ${BUILDBOT_TESTOPTS}', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=2628.905564


It may be a regression caused by bpo-37531.
msg352581 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-16 20:20
> command timed out: 1200 seconds without output running ['make', 'buildbottest', 'TESTOPTS=-j2 -j4 ${BUILDBOT_TESTOPTS}', 'TESTPYTHONOPTS=', 'TESTTIMEOUT=900'], attempting to kill

faulthandler is supposed to display the traceback where the test hangs, but for the main process, libregrtest uses a delay of timeout x 2 seconds:

            self.worker_timeout = self.ns.timeout * 1.5
            self.main_timeout = self.ns.timeout * 2.0

Here timeout=900 ("TESTTIMEOUT=900"), so main_timeout = 1800 seconds (30 minutes), whereas buildbot timeout is 1200 seconds (20 minutes) :-(

In fact, main_timeout can be *way* shorter: the main process calls faulthandler.dump_traceback_later(self.main_timeout, exit=True) every max(PROGRESS_UPDATE, PROGRESS_MIN_TIME) seconds (or more often): every 30 seconds (of more often).

So main_process could be max(PROGRESS_UPDATE, PROGRESS_MIN_TIME) x 2 = 1 minute, instead of 30 minutes.
msg352582 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-16 20:21
> https://buildbot.python.org/all/#/builders/225/builds/239

This bug only occurred once, the next build (240) was fine.
msg352621 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-09-17 09:22
> https://buildbot.python.org/all/#/builders/225/builds/239

I forgot to mention the buildbot name: AMD64 FreeBSD 10-STABLE Non-Debug 3.8.

> It may be a regression caused by bpo-37531.
> (...)
> So main_process could be max(PROGRESS_UPDATE, PROGRESS_MIN_TIME) x 2 = 1 minute, instead of 30 minutes.

I pushed a change for that:

New changeset 46b0b81220a23bc4aee5ba3ba67e8cf1b5df7960 by Victor Stinner in branch 'master':
bpo-37531: regrtest main process uses shorter timeout (GH-16220)
https://github.com/python/cpython/commit/46b0b81220a23bc4aee5ba3ba67e8cf1b5df7960
msg353661 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-01 10:30
I pushed different regrtest bugfixes. The situation should now be less worse. I close the issue.
History
Date User Action Args
2019-10-01 10:30:23vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg353661

stage: resolved
2019-09-17 09:22:42vstinnersetmessages: + msg352621
2019-09-16 20:21:00vstinnersetmessages: + msg352582
2019-09-16 20:20:36vstinnersetmessages: + msg352581
2019-09-16 19:31:55vstinnercreate