classification
Title: test_compile killed by SIGKILL on AMD64 Ubuntu 3.x (Linux OOM Killer)
Type:
Stage: resolved
Components: Tests
Versions: Python 3.11
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: corona10, eric.snow, erlendaasland, orsenthil, pablogsal, vstinner
Priority: normal
Keywords:

Created on 2021-06-09 08:06 by vstinner, last changed 2021-06-10 10:08 by pablogsal. This issue is now closed.

Messages (13)
msg395390 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-09 08:06
test_compile and test_multiprocessing_forkserver crashed with segfault (SIGSEGV) on AMD64 Ubuntu 3.x:
https://buildbot.python.org/all/#/builders/708/builds/31

It *seems* like test_compile.test_stack_overflow() crashed, but the log is not reliable so I cannot confirm.

According to buildbot, the responsible change is:
"bpo-43693: Un-revert commit f3fa63e. (#26609)(10 hours ago)"
https://github.com/python/cpython/commit/3e1c7167d86a2a928cdcb659094aa10bb5550c4c

So Eric, can you please investigate the change? If nobody is available to fix the buildbot, I suggest reverting the change.


Python was built in debug mode with:

./configure --prefix '$(PWD)/target' --with-pydebug
make all


test.pythoninfo:

CC.version: gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0
os.uname: posix.uname_result(sysname='Linux', nodename='doxy.learntosolveit.com', release='5.11.0-18-generic', version='#19-Ubuntu SMP Fri May 7 14:22:03 UTC 2021', machine='x86_64')
platform.platform: Linux-5.11.0-18-generic-x86_64-with-glibc2.33
sys.thread_info: sys.thread_info(name='pthread', lock='semaphore', version='NPTL 2.33')


Logs:

./python  ./Tools/scripts/run_tests.py -j 1 -u all -W --slowest --fail-env-changed --timeout=900 -j2 --junit-xml test-results.xml 
== CPython 3.11.0a0 (heads/main:3e1c7167d8, Jun 8 2021, 22:09:42) [GCC 10.3.0]
== Linux-5.11.0-18-generic-x86_64-with-glibc2.33 little-endian
== cwd: /home/buildbot/buildarea/3.x.skumaran-ubuntu-x86_64/build/build/test_python_1439770æ
== CPU count: 1
== encodings: locale=UTF-8, FS=utf-8
Using random seed 5059550
0:00:00 load avg: 0.97 Run tests in parallel using 2 child processes (timeout: 15 min, worker timeout: 20 min)
(...)
0:00:43 load avg: 2.22 running: test_compile (34.7 sec), test_signal (30.8 sec)
0:01:12 load avg: 3.84 [ 13/427/1] test_compile crashed (Exit code -9) -- running: test_signal (59.6 sec)
(...)
0:06:26 load avg: 1.84 running: test_concurrent_futures (42.0 sec), test_multiprocessing_forkserver (30.0 sec)
0:06:56 load avg: 3.91 running: test_concurrent_futures (1 min 12 sec), test_multiprocessing_forkserver (1 min)
0:07:26 load avg: 5.47 running: test_concurrent_futures (1 min 42 sec), test_multiprocessing_forkserver (1 min 30 sec)
0:07:58 load avg: 5.93 running: test_concurrent_futures (2 min 13 sec), test_multiprocessing_forkserver (2 min 2 sec)
0:08:30 load avg: 5.73 running: test_concurrent_futures (2 min 44 sec), test_multiprocessing_forkserver (2 min 33 sec)
0:08:48 load avg: 4.62 [ 85/427/2] test_multiprocessing_forkserver crashed (Exit code -9) -- running: test_concurrent_futures (3 min 3 sec)
(...)
2 tests failed:
    test_compile test_multiprocessing_forkserver
(...)
0:27:56 load avg: 1.28 Re-running test_compile in verbose mode
test_and (test.test_compile.TestExpressionStackSize) ... ok
(...)
test_sequence_unpacking_error (test.test_compile.TestSpecifics) ... ok
test_single_statement (test.test_compile.TestSpecifics) ... ok
test_stack_overflow (test.test_compile.TestSpecifics) ... 
make: *** [Makefile:1256: buildbottest] Killed
program finished with exit code 2
elapsedTime=1684.973552
msg395391 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-09 08:06
See also bpo-44348 "test_exceptions.ExceptionTests.test_recursion_in_except_handler stack overflow on Windows debug builds".
msg395408 - Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-09 11:08
I don't think that's a segfault. It seems that the process was killed, no? Also, the buildbot is green, so this is not happening in the latest builds.
msg395419 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-09 13:15
> I don't think that's a segfault. It seems that the process was killed, no? Also, the buildbot is green, so this is not happening in the latest builds.

* (1) 0:01:12, test_compile child process was killed by signal -9
* (2) 0:08:48, test_multiprocessing_forkserver child process was killed by signal -9
* (3) 0:27:56, test_compile main process was killed (unknown signal... I bet on signal -9, SIGSEGV)

Maybe it was a manual action, but it sounds like a strange coincidence that 3 processes were killed in the same build, and it wasn't at the same time.
msg395422 - Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-09 14:23
But SIGSEGV is signal 11, not -9
msg395425 - Author: Erlend E. Aasland (erlendaasland) * (Python triager) Date: 2021-06-09 14:42
Isn't this just an (explicit) SIGKILL? The _exit code_ seems to be -9, not the signal number.
msg395441 - Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-09 17:17
I am quite sure this is not a segmentation fault, Victor.
msg395442 - Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021-06-09 17:18
We'll wait for more builds, but for now the buildbot is green so I think this should be closed and reopened if we see it again.
msg395455 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-09 20:02
Oh right, exit code -9 means killed by SIGKILL, it does not mean killed by SIGSEGV. Sorry about the confusion.

How can a process be killed by SIGKILL? Can it be related to the Linux OOM Killer?

Senthil: Would you mind having a look at the server logs to see if there is anything suspicious?
msg395456 - Author: Erlend E. Aasland (erlendaasland) * (Python triager) Date: 2021-06-09 20:05
Oh, right, there is of course a connection between the exit code and the signal number. Thanks for the reminder :)
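
For reference, the relation between the exit code and the signal number can be checked with a small sketch. It assumes a POSIX system, where the subprocess module reports a child killed by signal N as return code -N. The helper names describe_returncode and run_self_killing_child are made up for this illustration and are not part of regrtest.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_self_killing_child() -> int:
    """Spawn a child Python that sends SIGKILL to itself; return its code."""
    code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    return subprocess.run([sys.executable, "-c", code]).returncode


if __name__ == "__main__":
    # The child dies from SIGKILL, as a process hit by the OOM Killer would.
    print(describe_returncode(run_self_killing_child()))
    # A segfault would show up as a different negative code.
    print(describe_returncode(-signal.SIGSEGV))
```

On Linux, SIGKILL is 9 and SIGSEGV is 11, so "Exit code -9" in the regrtest output points at SIGKILL rather than at a segfault.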
msg395457 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2021-06-09 20:06
Yes, this was related to the Linux OOM Killer. The agent went down
shortly after this. Either multiple parallel jobs might have led to OOM
or something else. I will see if logs provide more information.
msg395483 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-06-09 21:32
> Yes, this was related to the Linux OOM Killer.

Oh ok. Maybe you should give more memory to your worker, or you should spawn fewer jobs in parallel (-j1 instead of -j2). Or you should disable other services which eat memory.

How much memory does it have?
msg395488 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2021-06-09 22:20
> Maybe you should give more memory to your worker, or you should spawn fewer jobs in parallel

It was related to the high number of jobs on that particular agent, which resulted in an OOM kill from the Linux kernel - https://pastebin.com/559H4ksa

The machine has 1 GB of RAM, but I realize that it has only 1 CPU. (This seems not optimal; a minimum of 2 CPUs seems to be the recommendation - https://devguide.python.org/buildworker/)

I will change it to run fewer jobs in parallel and disable some services which are not used, and then we can see whether it happens again.

For this, I would rather side with an agent resource issue than a compiler issue. Sorry for that.
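
As a rough illustration of the sizing problem, here is a minimal sketch that derives a conservative -j value from available memory and CPU count. It assumes a Linux /proc/meminfo with a MemAvailable line. The 512 MiB per-worker budget and the function names are hypothetical and chosen only for the example; they are not figures from this worker, the devguide, or regrtest.

```python
import os
import re

# Hypothetical per-worker memory budget, for illustration only.
ASSUMED_MIB_PER_WORKER = 512


def mem_available_mib(meminfo_text: str) -> int:
    """Return the MemAvailable value from /proc/meminfo-style text, in MiB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found")
    return int(match.group(1)) // 1024


def suggest_jobs(available_mib: int, cpu_count: int,
                 mib_per_worker: int = ASSUMED_MIB_PER_WORKER) -> int:
    """Cap parallel jobs by both CPU count and memory budget, never below 1."""
    return max(1, min(cpu_count, available_mib // mib_per_worker))


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        available = mem_available_mib(f.read())
    cpus = os.cpu_count() or 1
    print(f"-j{suggest_jobs(available, cpus)}")
```

Under these assumptions, a machine with 1 CPU always ends up at -j1, whatever its memory.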


--- 

I also noticed a number of unsuccessful SSH attempts on the server (today) - https://pastebin.com/ab0EKDuF


The agent probably became unreachable due to this, and I rebooted the agent from the cloud console so that I could log in and see what might have happened.
History
Date                 User           Action  Args
2021-06-10 10:08:10  pablogsal      set     status: open -> closed; resolution: not a bug; stage: resolved
2021-06-09 22:20:06  orsenthil      set     messages: + msg395488
2021-06-09 21:32:48  vstinner       set     messages: + msg395483; title: test_compile killed by SIGKILL on AMD64 Ubuntu 3.x -> test_compile killed by SIGKILL on AMD64 Ubuntu 3.x (Linux OOM Killer)
2021-06-09 20:06:48  orsenthil      set     messages: + msg395457
2021-06-09 20:05:36  erlendaasland  set     messages: + msg395456
2021-06-09 20:02:24  vstinner       set     nosy: + orsenthil; messages: + msg395455; title: test_compile segfault on AMD64 Ubuntu 3.x -> test_compile killed by SIGKILL on AMD64 Ubuntu 3.x
2021-06-09 17:18:19  pablogsal      set     messages: + msg395442
2021-06-09 17:17:09  pablogsal      set     messages: + msg395441
2021-06-09 14:58:12  corona10       set     nosy: + corona10
2021-06-09 14:42:39  erlendaasland  set     messages: + msg395425
2021-06-09 14:23:38  pablogsal      set     messages: + msg395422
2021-06-09 13:15:17  vstinner       set     messages: + msg395419
2021-06-09 11:08:01  pablogsal      set     messages: + msg395408
2021-06-09 08:15:32  erlendaasland  set     nosy: + erlendaasland
2021-06-09 08:06:47  vstinner       set     messages: + msg395391
2021-06-09 08:06:22  vstinner       create