Issue38502
Created on 2019-10-16 22:46 by vstinner, last changed 2020-04-14 23:05 by vstinner. This issue is now closed.
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 16829 | merged | vstinner, 2019-10-16 23:04 |
Messages (12) | |||
---|---|---|---|
msg354818 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-16 22:46 | |
Problem. When regrtest is interrupted by CTRL+c and regrtest multiprocessing code (-jN command line option) is used, regrtest randomly fails to stop come TestWorkerProcess threads. The problem is that only the direct child process is kill by SIGKILL. If the test spawns a grandchild process, this one will not be killed. Moreover, the grandchild process inherits the child process stdout and stderr. But regrtest mltiprocessing uses pipes for the child process stdout and stderr. At the end, Popen.communicate() hangs randomly in the main regrtest process (in a TestWorkerProcess thread) until the child process *and* the grandchild process completes. Except that the main regrtest does not directly kill the grandchild process. => see bpo-38207 Solution. I propose to: (1) use Popen() with start_new_session=True: a worker process calls setsid() to create a new process group. (2) Don't kill worker processes, but kill the process groups of worker processes. Attached PR implements this solution. |
|||
msg354819 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-16 22:52 | |
Example of the problem: sometimes, interrupting a multiprocessing test hangs. regrtest fails to interrupt a TestWorkerProcess thread. $ ./python -m test test_multiprocessing_fork -j10 -F --timeout=60 --slowest 0:00:00 load avg: 1.09 Run tests in parallel using 10 child processes ^C Kill <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=2.3 sec> Kill <TestWorkerProcess #2 running test=test_multiprocessing_fork pid=4503 time=2.3 sec> Kill <TestWorkerProcess #3 running test=test_multiprocessing_fork pid=4506 time=2.3 sec> Kill <TestWorkerProcess #4 running test=test_multiprocessing_fork pid=4505 time=2.3 sec> Kill <TestWorkerProcess #5 running test=test_multiprocessing_fork pid=4510 time=2.3 sec> Kill <TestWorkerProcess #6 running test=test_multiprocessing_fork pid=4509 time=2.3 sec> Kill <TestWorkerProcess #7 running test=test_multiprocessing_fork pid=4513 time=2.3 sec> Kill <TestWorkerProcess #8 running test=test_multiprocessing_fork pid=4514 time=2.3 sec> Kill <TestWorkerProcess #9 running test=test_multiprocessing_fork pid=4516 time=2.3 sec> Kill <TestWorkerProcess #10 running test=test_multiprocessing_fork pid=4517 time=2.3 sec> 0:00:03 load avg: 1.88 Waiting for <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=3.3 sec> thread for 1.0 sec 0:00:04 load avg: 1.88 Waiting for <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=4.3 sec> thread for 2.0 sec 0:00:05 load avg: 1.73 Waiting for <TestWorkerProcess #1 running test=test_multiprocessing_fork pid=4504 time=5.3 sec> thread for 3.0 sec (...) With my PR, I cannot reproduce the "Waiting for ..." issue anymore. |
|||
msg354821 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-16 23:08 | |
See also bpo-18969: "test suite: enable faulthandler timeout in assert_python". This issue proposes to use killpg() to send a signal to all Python processes spawned by regrtest: trigger faulthandler to dump the Python traceback of all Python threads. |
|||
msg354822 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-16 23:10 | |
See also bpo-12413 "make faulthandler dump traceback of child processes" which looks like a duplicate of bpo-18969. |
|||
msg354899 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-18 13:49 | |
New changeset ecb035cd14c11521276343397151929a94018a22 by Victor Stinner in branch 'master': bpo-38502: regrtest uses process groups if available (GH-16829) https://github.com/python/cpython/commit/ecb035cd14c11521276343397151929a94018a22 |
|||
msg354900 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-18 13:52 | |
I will wait at least one day to see how buildbots like this change before backporting it to 3.7 and 3.8. |
|||
msg354923 - (view) | Author: David Bolen (db3l) | Date: 2019-10-18 22:18 | |
I don't know for sure that this is the cause but both 3.x builds following this commit on my bolen-ubuntu worker (Ubuntu 18.04.3) have had test_pty crash in the first attempt, with the retry succeeding. For example https://buildbot.python.org/all/#/builders/141/builds/2679 In spot checking, the same behavior seems to be occurring on many other non-Windows builders as well, presumably those for which it gets used. There doesn't seem to be anything useful produced by the first crash. Just an entry for "test_pty crashed (Exit code -1)" in the test log. |
|||
msg354926 - (view) | Author: David Bolen (db3l) | Date: 2019-10-18 23:01 | |
I can recreate this manually by running regrtest.py against test_pty. Crashes with any "-j#" option, but fine when run sequentially. Removing the process group change avoids the crash. With the process group change in place, the trigger point appears to be the final "os.close(master_fd)" in PtyTest.test_basic. Commenting that out avoids the crash. |
|||
msg355060 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2019-10-21 11:39 | |
> I don't know for sure that this is the cause but both 3.x builds following this commit on my bolen-ubuntu worker (Ubuntu 18.04.3) have had test_pty crash in the first attempt, with the retry succeeding. For example https://buildbot.python.org/all/#/builders/141/builds/2679 I created bpo-38547: "test_pty fails when using setsid()". |
|||
msg359504 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-01-07 12:18 | |
The test_pty has been fixed. regrtest now calls setsid() when using multiprocessing (-jN), but only in the main branch. It's a nice feature, but I'm not comfortable about backporting it yet. Maybe if tests become too unstable on other branches, we may backport the feature to 3.7 and 3.8. |
|||
msg366409 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-04-14 17:51 | |
New changeset 67b8a1f0f0f78ec38b8626fa9f5b2f5a55c17e15 by Victor Stinner in branch '3.8': [3.8] Update libregrtest from master (GH-19516) https://github.com/python/cpython/commit/67b8a1f0f0f78ec38b8626fa9f5b2f5a55c17e15 |
|||
msg366458 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-04-14 23:05 | |
New changeset b894b669c98cc365b84cbb8d20f531f1d0686f59 by Victor Stinner in branch '3.7': Update libregrtest from master (GH-19517) https://github.com/python/cpython/commit/b894b669c98cc365b84cbb8d20f531f1d0686f59 |
History | |||
---|---|---|---|
Date | User | Action | Args |
2020-04-14 23:05:24 | vstinner | set | messages: + msg366458 |
2020-04-14 17:51:25 | vstinner | set | messages: + msg366409 |
2020-01-07 12:18:19 | vstinner | set | status: open -> closed resolution: fixed messages: + msg359504 stage: patch review -> resolved |
2019-10-21 11:39:13 | vstinner | set | messages: + msg355060 |
2019-10-18 23:01:11 | db3l | set | messages: + msg354926 |
2019-10-18 22:18:49 | db3l | set | nosy:
+ db3l messages: + msg354923 |
2019-10-18 13:52:42 | vstinner | set | messages: + msg354900 |
2019-10-18 13:49:11 | vstinner | set | messages: + msg354899 |
2019-10-16 23:10:25 | vstinner | set | messages: + msg354822 |
2019-10-16 23:08:15 | vstinner | set | messages: + msg354821 |
2019-10-16 23:04:46 | vstinner | set | keywords:
+ patch stage: patch review pull_requests: + pull_request16378 |
2019-10-16 22:52:23 | vstinner | set | messages: + msg354819 |
2019-10-16 22:46:46 | vstinner | create |