classification
Title: [buildbot] test_asyncio hangs (killed after 3 hours) on Refleak buildbots
Type: Stage: resolved
Components: asyncio, Tests Versions: Python 3.11, Python 3.10
process
Status: closed Resolution: duplicate
Dependencies: Superseder: asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x)
View: 38323
Assigned To: Nosy List: asvetlov, chris.jerdonek, erlendaasland, pablogsal, shreyanavigyan, vstinner, yselivanov
Priority: normal Keywords:

Created on 2021-05-12 09:40 by vstinner, last changed 2021-06-09 21:25 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
wsl_log.txt shreyanavigyan, 2021-06-04 10:30
Messages (19)
msg393491 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-05-12 09:40
Buildbot kills the "test" step because it takes longer than 3 hours, whereas Python regrtest gets a timeout of 3 hour 15 min.

Refleak buildbots should get a longer timeout for the "test" step.

PPC64LE Fedora Rawhide Refleaks 3.10:
https://buildbot.python.org/all/#/builders/603/builds/7

./python  ./Tools/scripts/run_tests.py -j 1 -u all -W --slowest --fail-env-changed --timeout=11700 -R 3:3 -u-cpu -j10 
WARNING: Disable --verbose3 because it's incompatible with --huntrleaks: see http://bugs.python.org/issue27103
== CPython 3.10.0b1+ (heads/3.10:9a0e65c8e0, May 10 2021, 00:01:01) [GCC 11.1.1 20210428 (Red Hat 11.1.1-1)]
== Linux-5.7.0-0.rc0.git2.1.fc33.ppc64le-ppc64le-with-glibc2.33.9000 little-endian
== cwd: /home/buildbot/buildarea/3.10.cstratak-fedora-rawhide-ppc64le.refleak/build/build/test_python_1529753æ
== CPU count: 8
== encodings: locale=UTF-8, FS=utf-8
Using random seed 1661030
0:00:00 load avg: 3.25 Run tests in parallel using 10 child processes (timeout: 3 hour 15 min, worker timeout: 3 hour 20 min)
0:00:00 load avg: 3.25 [  1/427] test_tix skipped (resource denied)
(...)
2:58:49 load avg: 0.00 running: test_asyncio (2 hour 58 min)
2:59:19 load avg: 0.00 running: test_asyncio (2 hour 58 min)
2:59:49 load avg: 0.00 running: test_asyncio (2 hour 59 min)

command timed out: 10800 seconds elapsed running [b'make', b'buildbottest', b'TESTOPTS=-R 3:3 -u-cpu -j10 ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=11700'], attempting to kill
msg393492 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-05-12 09:44
It seems to be a real bug in test_asyncio, since in other builds on the same buildbot worker, test_asyncio takes less than 30 minutes:

- build 8: test_asyncio: 27 min 48 sec
- build 5: test_asyncio: 21 min 44 sec
- build 2: test_asyncio: 20 min 42 sec
msg393493 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-05-12 09:45
I bet that the hang is a duplicate of the old bpo-38323 "asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x)" that I reported 2 years ago and is not fixed yet.
msg394923 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-02 15:31
> Refleak buildbots should get a longer timeout for the "test" step.

I changed buildbot Test step timeout from 3 hours to 4 hours:

* https://github.com/python/buildmaster-config/commit/9e0c812694d6fa599b4c8890045ed006fe7c1f6b
* https://github.com/python/buildmaster-config/pull/252
msg395035 - (view) Author: Shreyan Avigyan (shreyanavigyan) * Date: 2021-06-03 19:54
Victor, this issue is reproducible on Linux. I reproduced it on my WSL. And I've identified the issue is in test_close_kill_running. The patch for resolving the crash in bpo-38323 can be the cause of the newly introduced hang. test_close_kill_running is running and passing and again it is running all over again infinitely. And extending the buildbot timer may not help since it seems it's running infinitely. Even if we look at the buildbot log there is no sign of crash but suddenly buildbot says "timeout!".
msg395036 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-03 19:55
> Victor, this issue is reproducible on Linux. I reproduced it on my WSL.

How did you managed? I ran test_asyncio for almost 5 hours with -R in my linux machine and could not reproduce?
msg395037 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-03 19:56
>  And extending the buildbot timer may not help since it seems it's running infinitely

Extending the bot time was done so we can use faulthandler to identify the test, because buildbot was cancelling the whole build before that :)
msg395041 - (view) Author: Shreyan Avigyan (shreyanavigyan) * Date: 2021-06-03 20:02
> How did you managed? I ran test_asyncio for almost 5 hours with -R in my linux machine and could not reproduce?

Now it's not again not reproducing. Maybe I used some wrong flag. But I'll try to debug this issue.
msg395042 - (view) Author: Shreyan Avigyan (shreyanavigyan) * Date: 2021-06-03 20:05
I've also been trying to debug it for the whole day. Not sure what's causing this. (Why does all error occur in Fedora buildbots? If I remember correctly, another test also failed there.")
msg395048 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-03 21:08
> And I've identified the issue is in test_close_kill_running. The patch for resolving the crash in bpo-38323 can be the cause of the newly introduced hang

Also, how did you reached this conclusion?
msg395068 - (view) Author: Shreyan Avigyan (shreyanavigyan) * Date: 2021-06-04 06:27
The test result I got kind of seemed that the problem was in test_close_kill_running but it maybe that I used wrong flag (maybe -F?). And it is also weird that the error is occurring only in Fedora buildbot and not on others. Also the logs are not showing which test it was executing when the timeout occured.
msg395070 - (view) Author: Shreyan Avigyan (shreyanavigyan) * Date: 2021-06-04 10:30
I ran the test_suite on commit f3fa63ec75fdbb4a08a10957a5c631bf0c4a5970 in my WSL 1 Ubuntu and got some unexpected results. There are huge reference leaks in the tests and test_asyncio, test_subprocess, test_socket, etc., failed. I forked the branch a while ago so the HEAD is on f3fa63ec75fdbb4a08a10957a5c631bf0c4a5970. Don't know if these are fixed already but reported them here since there were no commits for any issue based on test_asyncio, test_subprocess recently. And most surprisingly I got a timeout error also but it was for "2:03:30 load avg: 0.52 running: test_ssl (1 hour 54 min), test_largefile (1 hour 18 min), test_io (1 hour 30 min)". I attached the log for full details. (I use WSL 1 so note it's written ./python.exe instead of ./python)
msg395091 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-04 15:27
Pablo:
> Extending the bot time was done so we can use faulthandler to identify the test, because buildbot was cancelling the whole build before that :)

Right. I increased buildbot timeout to check if buildbots hang because of bpo-38323 or if it's a new regression.

bpo-38323 is not fixed yet, I can still easily reproduce the hang on Linux is less than 1 minute:
https://bugs.python.org/issue38323#msg395090
msg395093 - (view) Author: Shreyan Avigyan (shreyanavigyan) * Date: 2021-06-04 15:48
> bpo-38323 is not fixed yet, I can still easily reproduce the hang on Linux is less than 1 minute:
https://bugs.python.org/issue38323#msg395090

Yes, with the commands you provided in the https://bugs.python.org/issue38323#msg395090 I also reproduced the crash on my WSL. And any suggestions on the other test failures in the log I attached? I don't know if they are fixed already in the latest commits though.
msg395094 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-04 16:01
> Yes, with the commands you provided in the https://bugs.python.org/issue38323#msg395090 I also reproduced the crash on my WSL. And any suggestions on the other test failures in the log I attached? I don't know if they are fixed already in the latest commits though.

Your logs don't show this error. This error is test_asyncio hanging, but in your logs the test fails (multiple test do, actually).
msg395095 - (view) Author: Shreyan Avigyan (shreyanavigyan) * Date: 2021-06-04 16:05
> Your logs don't show this error. This error is test_asyncio hanging, but in your logs the test fails (multiple test do, actually).

I didn't include this error. I was just asking if we could do anything about the multiple test failures (especially test_asyncio failure).
msg395101 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-04 16:36
> I was just asking if we could do anything about the multiple test failures (especially test_asyncio failure).

We don't see any failures as the ones in your logs in the buildbots or otherwise, so seems something wrong in your system or only affecting your system.

> I didn't include this error. 

Sorry Shreyan, but unfortunately I find your messages very confusing. You seem to be mixing several different problems and your logs don't seem to be directly related to the issue at hand. Please, in the future, add some context to your messages if possible.
msg395135 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-04 22:52
I bet that this issue is simply a duplicate of bpo-38323, since I can still (easily) reproduce it locally.

I fixed the buildbot configuration, so if the bug happens again on Refleak buildbots, at least we should know which test method hangs. If it's not bpo-38323, I suggest to open a new more specific issue about this test method.
msg395480 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-09 21:25
> https://github.com/python/buildmaster-config/commit/9e0c812694d6fa599b4c8890045ed006fe7c1f6b

aarch64 Fedora Rawhide Refleaks 3.8:
https://buildbot.python.org/all/#/builders/401/builds/47

This change worked as expected: buildbot now uses a timeout of 4 hours (14400 seconds).

At the first run, test_asyncio was blocked in test_close_kill_running() of test_asyncio/test_subprocess.py. From the traceback, it's not possible to be sure that the test uses MultiLoopChildWatcher. But it smells a lot like bpo-38323.

bpo-38323 was recently fixed in 3.9, 3.10 and main branches (by skipping MultiLoopChildWatcher tets, since this class has a known race condition). Sadly, the 3.8 branch no longer accepts bugfixes.

Logs:

0:00:00 load avg: 2.85 Run tests in parallel using 10 child processes (timeout: 3 hour 15 min, worker timeout: 3 hour 20 min)
(...)
3:21:15 load avg: 0.04 [423/423/1] test_asyncio crashed (Exit code 1)
beginning 6 repetitions
123456
.....Timeout (3:15:00)!
Thread 0x0000ffffaecbcd50 (most recent call first):
  File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/selectors.py", line 468 in select
  File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/asyncio/base_events.py", line 1823 in _run_once
  File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/asyncio/base_events.py", line 570 in run_forever
  File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/asyncio/base_events.py", line 603 in run_until_complete
  File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running
(...)
test_ensure_future_task (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_error_in_call_soon (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_exception_traceback (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_gather_shield (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_get_coro (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_get_stack (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_iscoroutinefunction (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
command timed out: 14400 seconds elapsed running [b'make', b'buildbottest', b'TESTOPTS=-R 3:3 -u-cpu -j10 ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=11700'], attempting to kill
.....make: *** [Makefile:1173: buildbottest] Terminated
process killed by signal 15
program finished with exit code -1
elapsedTime=14400.032773
test_log_destroyed_pending_task (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ...
History
Date User Action Args
2021-06-09 21:25:27vstinnersetmessages: + msg395480
2021-06-04 22:52:58vstinnersetstatus: open -> closed
superseder: asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x)
messages: + msg395135

resolution: duplicate
stage: resolved
2021-06-04 16:36:45pablogsalsetmessages: + msg395101
2021-06-04 16:05:11shreyanavigyansetmessages: + msg395095
2021-06-04 16:01:29pablogsalsetmessages: + msg395094
2021-06-04 15:48:50shreyanavigyansetmessages: + msg395093
2021-06-04 15:27:16vstinnersetmessages: + msg395091
2021-06-04 10:37:52chris.jerdoneksetnosy: + chris.jerdonek
2021-06-04 10:30:59shreyanavigyansetfiles: + wsl_log.txt

messages: + msg395070
2021-06-04 06:27:57shreyanavigyansetmessages: + msg395068
2021-06-03 21:08:25pablogsalsetmessages: + msg395048
2021-06-03 20:05:16shreyanavigyansetmessages: + msg395042
2021-06-03 20:02:58shreyanavigyansetmessages: + msg395041
2021-06-03 19:56:34pablogsalsetmessages: + msg395037
2021-06-03 19:55:49pablogsalsetmessages: + msg395036
2021-06-03 19:54:14shreyanavigyansetnosy: + shreyanavigyan
messages: + msg395035
2021-06-02 15:31:11vstinnersetmessages: + msg394923
2021-05-22 19:32:44erlendaaslandsetnosy: + erlendaasland
2021-05-12 09:46:42vstinnersettitle: [buildbot] test_asyncio hangs after 3 hours on Refleak buildbots -> [buildbot] test_asyncio hangs (killed after 3 hours) on Refleak buildbots
2021-05-12 09:45:55vstinnersetmessages: + msg393493
2021-05-12 09:44:25vstinnersetmessages: + msg393492
2021-05-12 09:40:34vstinnercreate