classification
Title: regrtest: use process groups
Type: Stage: resolved
Components: Tests Versions: Python 3.9
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: db3l, vstinner
Priority: normal Keywords: patch

Created on 2019-10-16 22:46 by vstinner, last changed 2020-04-14 23:05 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 16829 merged vstinner, 2019-10-16 23:04
Messages (12)
msg354818 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-16 22:46
Problem.

When regrtest is interrupted by CTRL+c and regrtest multiprocessing code (-jN command line option) is used, regrtest randomly fails to stop come TestWorkerProcess threads.

The problem is that only the direct child process is kill by SIGKILL. If the test spawns a grandchild process, this one will not be killed.

Moreover, the grandchild process inherits the child process stdout and stderr. But regrtest mltiprocessing uses pipes for the child process stdout and stderr.

At the end, Popen.communicate() hangs randomly in the main regrtest process (in a TestWorkerProcess thread) until the child process *and* the grandchild process completes. Except that the main regrtest does not directly kill the grandchild process.

=> see bpo-38207


Solution.

I propose to:

(1) use Popen() with start_new_session=True: a worker process calls setsid() to create a new process group.
(2) Don't kill worker processes, but kill the process groups of worker processes.

Attached PR implements this solution.
msg354819 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-16 22:52
Example of the problem: sometimes, interrupting a multiprocessing test hangs. regrtest fails to interrupt a TestWorkerProcess thread.

$ ./python -m test test_multiprocessing_fork -j10 -F --timeout=60 --slowest
0:00:00 load avg: 1.09 Run tests in parallel using 10 child processes
^C
Kill <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=2.3 sec>
Kill <TestWorkerProcess #2 running test=test_multiprocessing_fork pid=4503 time=2.3 sec>
Kill <TestWorkerProcess #3 running test=test_multiprocessing_fork pid=4506 time=2.3 sec>
Kill <TestWorkerProcess #4 running test=test_multiprocessing_fork pid=4505 time=2.3 sec>
Kill <TestWorkerProcess #5 running test=test_multiprocessing_fork pid=4510 time=2.3 sec>
Kill <TestWorkerProcess #6 running test=test_multiprocessing_fork pid=4509 time=2.3 sec>
Kill <TestWorkerProcess #7 running test=test_multiprocessing_fork pid=4513 time=2.3 sec>
Kill <TestWorkerProcess #8 running test=test_multiprocessing_fork pid=4514 time=2.3 sec>
Kill <TestWorkerProcess #9 running test=test_multiprocessing_fork pid=4516 time=2.3 sec>
Kill <TestWorkerProcess #10 running test=test_multiprocessing_fork pid=4517 time=2.3 sec>
0:00:03 load avg: 1.88 Waiting for <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=3.3 sec> thread for 1.0 sec
0:00:04 load avg: 1.88 Waiting for <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=4.3 sec> thread for 2.0 sec
0:00:05 load avg: 1.73 Waiting for <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=5.3 sec> thread for 3.0 sec
(...)


With my PR, I cannot reproduce the "Waiting for ..." issue anymore.
msg354821 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-16 23:08
See also bpo-18969: "test suite: enable faulthandler timeout in assert_python". This issue proposes to use killpg() to send a signal to all Python processes spawned by regrtest: trigger faulthandler to dump the Python traceback of all Python threads.
msg354822 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-16 23:10
See also bpo-12413 "make faulthandler dump traceback of child processes" which looks like a duplicate of bpo-18969.
msg354899 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-18 13:49
New changeset ecb035cd14c11521276343397151929a94018a22 by Victor Stinner in branch 'master':
bpo-38502: regrtest uses process groups if available (GH-16829)
https://github.com/python/cpython/commit/ecb035cd14c11521276343397151929a94018a22
msg354900 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-18 13:52
I will wait at least one day to see how buildbots like this change before backporting it to 3.7 and 3.8.
msg354923 - (view) Author: David Bolen (db3l) Date: 2019-10-18 22:18
I don't know for sure that this is the cause but both 3.x builds following this commit on my bolen-ubuntu worker (Ubuntu 18.04.3) have had test_pty crash in the first attempt, with the retry succeeding.  For example https://buildbot.python.org/all/#/builders/141/builds/2679

In spot checking, the same behavior seems to be occurring on many other non-Windows builders as well, presumably those for which it gets used.

There doesn't seem to be anything useful produced by the first crash.  Just an entry for "test_pty crashed (Exit code -1)" in the test log.
msg354926 - (view) Author: David Bolen (db3l) Date: 2019-10-18 23:01
I can recreate this manually by running regrtest.py against test_pty.  Crashes with any "-j#" option, but fine when run sequentially.  Removing the process group change avoids the crash.

With the process group change in place, the trigger point appears to be the final "os.close(master_fd)" in PtyTest.test_basic.  Commenting that out avoids the crash.
msg355060 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-10-21 11:39
> I don't know for sure that this is the cause but both 3.x builds following this commit on my bolen-ubuntu worker (Ubuntu 18.04.3) have had test_pty crash in the first attempt, with the retry succeeding.  For example https://buildbot.python.org/all/#/builders/141/builds/2679

I created bpo-38547: "test_pty fails when using setsid()".
msg359504 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-07 12:18
The test_pty has been fixed.

regrtest now calls setsid() when using multiprocessing (-jN), but only in the main branch. It's a nice feature, but I'm not comfortable about backporting it yet. Maybe if tests become too unstable on other branches, we may backport the feature to 3.7 and 3.8.
msg366409 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-14 17:51
New changeset 67b8a1f0f0f78ec38b8626fa9f5b2f5a55c17e15 by Victor Stinner in branch '3.8':
[3.8] Update libregrtest from master (GH-19516)
https://github.com/python/cpython/commit/67b8a1f0f0f78ec38b8626fa9f5b2f5a55c17e15
msg366458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-14 23:05
New changeset b894b669c98cc365b84cbb8d20f531f1d0686f59 by Victor Stinner in branch '3.7':
Update libregrtest from master (GH-19517)
https://github.com/python/cpython/commit/b894b669c98cc365b84cbb8d20f531f1d0686f59
History
Date User Action Args
2020-04-14 23:05:24vstinnersetmessages: + msg366458
2020-04-14 17:51:25vstinnersetmessages: + msg366409
2020-01-07 12:18:19vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg359504

stage: patch review -> resolved
2019-10-21 11:39:13vstinnersetmessages: + msg355060
2019-10-18 23:01:11db3lsetmessages: + msg354926
2019-10-18 22:18:49db3lsetnosy: + db3l
messages: + msg354923
2019-10-18 13:52:42vstinnersetmessages: + msg354900
2019-10-18 13:49:11vstinnersetmessages: + msg354899
2019-10-16 23:10:25vstinnersetmessages: + msg354822
2019-10-16 23:08:15vstinnersetmessages: + msg354821
2019-10-16 23:04:46vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request16378
2019-10-16 22:52:23vstinnersetmessages: + msg354819
2019-10-16 22:46:46vstinnercreate