msg393491 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2021-05-12 09:40 |
Buildbot kills the "test" step because it takes longer than 3 hours, whereas Python regrtest gets a timeout of 3 hour 15 min.
Refleak buildbots should get a longer timeout for the "test" step.
PPC64LE Fedora Rawhide Refleaks 3.10:
https://buildbot.python.org/all/#/builders/603/builds/7
./python ./Tools/scripts/run_tests.py -j 1 -u all -W --slowest --fail-env-changed --timeout=11700 -R 3:3 -u-cpu -j10
WARNING: Disable --verbose3 because it's incompatible with --huntrleaks: see http://bugs.python.org/issue27103
== CPython 3.10.0b1+ (heads/3.10:9a0e65c8e0, May 10 2021, 00:01:01) [GCC 11.1.1 20210428 (Red Hat 11.1.1-1)]
== Linux-5.7.0-0.rc0.git2.1.fc33.ppc64le-ppc64le-with-glibc2.33.9000 little-endian
== cwd: /home/buildbot/buildarea/3.10.cstratak-fedora-rawhide-ppc64le.refleak/build/build/test_python_1529753æ
== CPU count: 8
== encodings: locale=UTF-8, FS=utf-8
Using random seed 1661030
0:00:00 load avg: 3.25 Run tests in parallel using 10 child processes (timeout: 3 hour 15 min, worker timeout: 3 hour 20 min)
0:00:00 load avg: 3.25 [ 1/427] test_tix skipped (resource denied)
(...)
2:58:49 load avg: 0.00 running: test_asyncio (2 hour 58 min)
2:59:19 load avg: 0.00 running: test_asyncio (2 hour 58 min)
2:59:49 load avg: 0.00 running: test_asyncio (2 hour 59 min)
command timed out: 10800 seconds elapsed running [b'make', b'buildbottest', b'TESTOPTS=-R 3:3 -u-cpu -j10 ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=11700'], attempting to kill
|
msg393492 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2021-05-12 09:44 |
It seems to be a real bug in test_asyncio, since in other builds on the same buildbot worker, test_asyncio takes less than 30 minutes:
- build 8: test_asyncio: 27 min 48 sec
- build 5: test_asyncio: 21 min 44 sec
- build 2: test_asyncio: 20 min 42 sec
|
msg393493 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2021-05-12 09:45 |
I bet that the hang is a duplicate of the old bpo-38323 "asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x)" that I reported 2 years ago and is not fixed yet.
|
msg394923 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2021-06-02 15:31 |
> Refleak buildbots should get a longer timeout for the "test" step.
I changed buildbot Test step timeout from 3 hours to 4 hours:
* https://github.com/python/buildmaster-config/commit/9e0c812694d6fa599b4c8890045ed006fe7c1f6b
* https://github.com/python/buildmaster-config/pull/252
|
msg395035 - (view) |
Author: Shreyan Avigyan (shreyanavigyan) * |
Date: 2021-06-03 19:54 |
Victor, this issue is reproducible on Linux. I reproduced it on my WSL. And I've identified the issue is in test_close_kill_running. The patch for resolving the crash in bpo-38323 can be the cause of the newly introduced hang. test_close_kill_running is running and passing and again it is running all over again infinitely. And extending the buildbot timer may not help since it seems it's running infinitely. Even if we look at the buildbot log there is no sign of crash but suddenly buildbot says "timeout!".
|
msg395036 - (view) |
Author: Pablo Galindo Salgado (pablogsal) * |
Date: 2021-06-03 19:55 |
> Victor, this issue is reproducible on Linux. I reproduced it on my WSL.
How did you managed? I ran test_asyncio for almost 5 hours with -R in my linux machine and could not reproduce?
|
msg395037 - (view) |
Author: Pablo Galindo Salgado (pablogsal) * |
Date: 2021-06-03 19:56 |
> And extending the buildbot timer may not help since it seems it's running infinitely
Extending the bot time was done so we can use faulthandler to identify the test, because buildbot was cancelling the whole build before that :)
|
msg395041 - (view) |
Author: Shreyan Avigyan (shreyanavigyan) * |
Date: 2021-06-03 20:02 |
> How did you managed? I ran test_asyncio for almost 5 hours with -R in my linux machine and could not reproduce?
Now it's not again not reproducing. Maybe I used some wrong flag. But I'll try to debug this issue.
|
msg395042 - (view) |
Author: Shreyan Avigyan (shreyanavigyan) * |
Date: 2021-06-03 20:05 |
I've also been trying to debug it for the whole day. Not sure what's causing this. (Why does all error occur in Fedora buildbots? If I remember correctly, another test also failed there.")
|
msg395048 - (view) |
Author: Pablo Galindo Salgado (pablogsal) * |
Date: 2021-06-03 21:08 |
> And I've identified the issue is in test_close_kill_running. The patch for resolving the crash in bpo-38323 can be the cause of the newly introduced hang
Also, how did you reached this conclusion?
|
msg395068 - (view) |
Author: Shreyan Avigyan (shreyanavigyan) * |
Date: 2021-06-04 06:27 |
The test result I got kind of seemed that the problem was in test_close_kill_running but it maybe that I used wrong flag (maybe -F?). And it is also weird that the error is occurring only in Fedora buildbot and not on others. Also the logs are not showing which test it was executing when the timeout occured.
|
msg395070 - (view) |
Author: Shreyan Avigyan (shreyanavigyan) * |
Date: 2021-06-04 10:30 |
I ran the test_suite on commit f3fa63ec75fdbb4a08a10957a5c631bf0c4a5970 in my WSL 1 Ubuntu and got some unexpected results. There are huge reference leaks in the tests and test_asyncio, test_subprocess, test_socket, etc., failed. I forked the branch a while ago so the HEAD is on f3fa63ec75fdbb4a08a10957a5c631bf0c4a5970. Don't know if these are fixed already but reported them here since there were no commits for any issue based on test_asyncio, test_subprocess recently. And most surprisingly I got a timeout error also but it was for "2:03:30 load avg: 0.52 running: test_ssl (1 hour 54 min), test_largefile (1 hour 18 min), test_io (1 hour 30 min)". I attached the log for full details. (I use WSL 1 so note it's written ./python.exe instead of ./python)
|
msg395091 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2021-06-04 15:27 |
Pablo:
> Extending the bot time was done so we can use faulthandler to identify the test, because buildbot was cancelling the whole build before that :)
Right. I increased buildbot timeout to check if buildbots hang because of bpo-38323 or if it's a new regression.
bpo-38323 is not fixed yet, I can still easily reproduce the hang on Linux is less than 1 minute:
https://bugs.python.org/issue38323#msg395090
|
msg395093 - (view) |
Author: Shreyan Avigyan (shreyanavigyan) * |
Date: 2021-06-04 15:48 |
> bpo-38323 is not fixed yet, I can still easily reproduce the hang on Linux is less than 1 minute:
https://bugs.python.org/issue38323#msg395090
Yes, with the commands you provided in the https://bugs.python.org/issue38323#msg395090 I also reproduced the crash on my WSL. And any suggestions on the other test failures in the log I attached? I don't know if they are fixed already in the latest commits though.
|
msg395094 - (view) |
Author: Pablo Galindo Salgado (pablogsal) * |
Date: 2021-06-04 16:01 |
> Yes, with the commands you provided in the https://bugs.python.org/issue38323#msg395090 I also reproduced the crash on my WSL. And any suggestions on the other test failures in the log I attached? I don't know if they are fixed already in the latest commits though.
Your logs don't show this error. This error is test_asyncio hanging, but in your logs the test fails (multiple test do, actually).
|
msg395095 - (view) |
Author: Shreyan Avigyan (shreyanavigyan) * |
Date: 2021-06-04 16:05 |
> Your logs don't show this error. This error is test_asyncio hanging, but in your logs the test fails (multiple test do, actually).
I didn't include this error. I was just asking if we could do anything about the multiple test failures (especially test_asyncio failure).
|
msg395101 - (view) |
Author: Pablo Galindo Salgado (pablogsal) * |
Date: 2021-06-04 16:36 |
> I was just asking if we could do anything about the multiple test failures (especially test_asyncio failure).
We don't see any failures as the ones in your logs in the buildbots or otherwise, so seems something wrong in your system or only affecting your system.
> I didn't include this error.
Sorry Shreyan, but unfortunately I find your messages very confusing. You seem to be mixing several different problems and your logs don't seem to be directly related to the issue at hand. Please, in the future, add some context to your messages if possible.
|
msg395135 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2021-06-04 22:52 |
I bet that this issue is simply a duplicate of bpo-38323, since I can still (easily) reproduce it locally.
I fixed the buildbot configuration, so if the bug happens again on Refleak buildbots, at least we should know which test method hangs. If it's not bpo-38323, I suggest to open a new more specific issue about this test method.
|
msg395480 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2021-06-09 21:25 |
> https://github.com/python/buildmaster-config/commit/9e0c812694d6fa599b4c8890045ed006fe7c1f6b
aarch64 Fedora Rawhide Refleaks 3.8:
https://buildbot.python.org/all/#/builders/401/builds/47
This change worked as expected: buildbot now uses a timeout of 4 hours (14400 seconds).
At the first run, test_asyncio was blocked in test_close_kill_running() of test_asyncio/test_subprocess.py. From the traceback, it's not possible to be sure that the test uses MultiLoopChildWatcher. But it smells a lot like bpo-38323.
bpo-38323 was recently fixed in 3.9, 3.10 and main branches (by skipping MultiLoopChildWatcher tets, since this class has a known race condition). Sadly, the 3.8 branch no longer accepts bugfixes.
Logs:
0:00:00 load avg: 2.85 Run tests in parallel using 10 child processes (timeout: 3 hour 15 min, worker timeout: 3 hour 20 min)
(...)
3:21:15 load avg: 0.04 [423/423/1] test_asyncio crashed (Exit code 1)
beginning 6 repetitions
123456
.....Timeout (3:15:00)!
Thread 0x0000ffffaecbcd50 (most recent call first):
File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/selectors.py", line 468 in select
File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/asyncio/base_events.py", line 1823 in _run_once
File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/asyncio/base_events.py", line 570 in run_forever
File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/asyncio/base_events.py", line 603 in run_until_complete
File "/home/buildbot/buildarea/3.8.cstratak-fedora-rawhide-aarch64.refleak/build/Lib/test/test_asyncio/test_subprocess.py", line 484 in test_close_kill_running
(...)
test_ensure_future_task (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_error_in_call_soon (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_exception_traceback (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_gather_shield (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_get_coro (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_get_stack (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
test_iscoroutinefunction (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ... ok
command timed out: 14400 seconds elapsed running [b'make', b'buildbottest', b'TESTOPTS=-R 3:3 -u-cpu -j10 ${BUILDBOT_TESTOPTS}', b'TESTPYTHONOPTS=', b'TESTTIMEOUT=11700'], attempting to kill
.....make: *** [Makefile:1173: buildbottest] Terminated
process killed by signal 15
program finished with exit code -1
elapsedTime=14400.032773
test_log_destroyed_pending_task (test.test_asyncio.test_tasks.PyTask_CFuture_Tests) ...
|
|
Date |
User |
Action |
Args |
2022-04-11 14:59:45 | admin | set | github: 88278 |
2021-06-09 21:25:27 | vstinner | set | messages:
+ msg395480 |
2021-06-04 22:52:58 | vstinner | set | status: open -> closed superseder: asyncio: MultiLoopWatcher has a race condition (test_asyncio: test_close_kill_running() hangs on AMD64 RHEL7 Refleaks 3.x) messages:
+ msg395135
resolution: duplicate stage: resolved |
2021-06-04 16:36:45 | pablogsal | set | messages:
+ msg395101 |
2021-06-04 16:05:11 | shreyanavigyan | set | messages:
+ msg395095 |
2021-06-04 16:01:29 | pablogsal | set | messages:
+ msg395094 |
2021-06-04 15:48:50 | shreyanavigyan | set | messages:
+ msg395093 |
2021-06-04 15:27:16 | vstinner | set | messages:
+ msg395091 |
2021-06-04 10:37:52 | chris.jerdonek | set | nosy:
+ chris.jerdonek
|
2021-06-04 10:30:59 | shreyanavigyan | set | files:
+ wsl_log.txt
messages:
+ msg395070 |
2021-06-04 06:27:57 | shreyanavigyan | set | messages:
+ msg395068 |
2021-06-03 21:08:25 | pablogsal | set | messages:
+ msg395048 |
2021-06-03 20:05:16 | shreyanavigyan | set | messages:
+ msg395042 |
2021-06-03 20:02:58 | shreyanavigyan | set | messages:
+ msg395041 |
2021-06-03 19:56:34 | pablogsal | set | messages:
+ msg395037 |
2021-06-03 19:55:49 | pablogsal | set | messages:
+ msg395036 |
2021-06-03 19:54:14 | shreyanavigyan | set | nosy:
+ shreyanavigyan messages:
+ msg395035
|
2021-06-02 15:31:11 | vstinner | set | messages:
+ msg394923 |
2021-05-22 19:32:44 | erlendaasland | set | nosy:
+ erlendaasland
|
2021-05-12 09:46:42 | vstinner | set | title: [buildbot] test_asyncio hangs after 3 hours on Refleak buildbots -> [buildbot] test_asyncio hangs (killed after 3 hours) on Refleak buildbots |
2021-05-12 09:45:55 | vstinner | set | messages:
+ msg393493 |
2021-05-12 09:44:25 | vstinner | set | messages:
+ msg393492 |
2021-05-12 09:40:34 | vstinner | create | |