test_compile killed by SIGKILL on AMD64 Ubuntu 3.x (Linux OOM Killer) #88526

Closed
vstinner opened this issue Jun 9, 2021 · 13 comments
Labels
3.11 (only security fixes), tests (Tests in the Lib/test dir)

Comments

@vstinner
Member

vstinner commented Jun 9, 2021

BPO 44360
Nosy @orsenthil, @vstinner, @ericsnowcurrently, @corona10, @pablogsal, @erlend-aasland

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2021-06-10.10:08:10.046>
created_at = <Date 2021-06-09.08:06:22.863>
labels = ['invalid', 'tests', '3.11']
title = 'test_compile killed by SIGKILL on AMD64 Ubuntu 3.x (Linux OOM Killer)'
updated_at = <Date 2021-06-10.10:08:10.045>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2021-06-10.10:08:10.045>
actor = 'pablogsal'
assignee = 'none'
closed = True
closed_date = <Date 2021-06-10.10:08:10.046>
closer = 'pablogsal'
components = ['Tests']
creation = <Date 2021-06-09.08:06:22.863>
creator = 'vstinner'
dependencies = []
files = []
hgrepos = []
issue_num = 44360
keywords = []
message_count = 13.0
messages = ['395390', '395391', '395408', '395419', '395422', '395425', '395441', '395442', '395455', '395456', '395457', '395483', '395488']
nosy_count = 6.0
nosy_names = ['orsenthil', 'vstinner', 'eric.snow', 'corona10', 'pablogsal', 'erlendaasland']
pr_nums = []
priority = 'normal'
resolution = 'not a bug'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue44360'
versions = ['Python 3.11']

@vstinner
Member Author

vstinner commented Jun 9, 2021

test_compile and test_multiprocessing_forkserver crashed with segfault (SIGSEGV) on AMD64 Ubuntu 3.x:
https://buildbot.python.org/all/#/builders/708/builds/31

It *seems* like test_compile.test_stack_overflow() crashed, but the log is not reliable so I cannot confirm.

According to buildbot, the responsible change is:
"bpo-43693: Un-revert commit f3fa63e. (bpo-26609)(10 hours ago)"
3e1c716

So Eric, can you please investigate the change? If nobody is available to fix the buildbot, I suggest reverting the change.

Python was built in debug mode with:

./configure --prefix '$(PWD)/target' --with-pydebug
make all

test.pythoninfo:

CC.version: gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0
os.uname: posix.uname_result(sysname='Linux', nodename='doxy.learntosolveit.com', release='5.11.0-18-generic', version='#19-Ubuntu SMP Fri May 7 14:22:03 UTC 2021', machine='x86_64')
platform.platform: Linux-5.11.0-18-generic-x86_64-with-glibc2.33
sys.thread_info: sys.thread_info(name='pthread', lock='semaphore', version='NPTL 2.33')

Logs:

./python ./Tools/scripts/run_tests.py -j 1 -u all -W --slowest --fail-env-changed --timeout=900 -j2 --junit-xml test-results.xml
== CPython 3.11.0a0 (heads/main:3e1c7167d8, Jun 8 2021, 22:09:42) [GCC 10.3.0]
== Linux-5.11.0-18-generic-x86_64-with-glibc2.33 little-endian
== cwd: /home/buildbot/buildarea/3.x.skumaran-ubuntu-x86_64/build/build/test_python_1439770æ
== CPU count: 1
== encodings: locale=UTF-8, FS=utf-8
Using random seed 5059550
0:00:00 load avg: 0.97 Run tests in parallel using 2 child processes (timeout: 15 min, worker timeout: 20 min)
(...)
0:00:43 load avg: 2.22 running: test_compile (34.7 sec), test_signal (30.8 sec)
0:01:12 load avg: 3.84 [ 13/427/1] test_compile crashed (Exit code -9) -- running: test_signal (59.6 sec)
(...)
0:06:26 load avg: 1.84 running: test_concurrent_futures (42.0 sec), test_multiprocessing_forkserver (30.0 sec)
0:06:56 load avg: 3.91 running: test_concurrent_futures (1 min 12 sec), test_multiprocessing_forkserver (1 min)
0:07:26 load avg: 5.47 running: test_concurrent_futures (1 min 42 sec), test_multiprocessing_forkserver (1 min 30 sec)
0:07:58 load avg: 5.93 running: test_concurrent_futures (2 min 13 sec), test_multiprocessing_forkserver (2 min 2 sec)
0:08:30 load avg: 5.73 running: test_concurrent_futures (2 min 44 sec), test_multiprocessing_forkserver (2 min 33 sec)
0:08:48 load avg: 4.62 [ 85/427/2] test_multiprocessing_forkserver crashed (Exit code -9) -- running: test_concurrent_futures (3 min 3 sec)
(...)
2 tests failed:
test_compile test_multiprocessing_forkserver
(...)
0:27:56 load avg: 1.28 Re-running test_compile in verbose mode
test_and (test.test_compile.TestExpressionStackSize) ... ok
(...)
test_sequence_unpacking_error (test.test_compile.TestSpecifics) ... ok
test_single_statement (test.test_compile.TestSpecifics) ... ok
test_stack_overflow (test.test_compile.TestSpecifics) ...
make: *** [Makefile:1256: buildbottest] Killed
program finished with exit code 2
elapsedTime=1684.973552

@vstinner added the 3.11 (only security fixes) and tests (Tests in the Lib/test dir) labels on Jun 9, 2021
@vstinner
Member Author

vstinner commented Jun 9, 2021

See also bpo-44348 "test_exceptions.ExceptionTests.test_recursion_in_except_handler stack overflow on Windows debug builds".

@pablogsal
Member

I don't think that's a segfault. It seems that the process was killed, no? Also, the buildbot is green, so this is not happening in the latest builds.

@vstinner
Member Author

vstinner commented Jun 9, 2021

I don't think that's a segfault. It seems that the process was killed, no? Also, the buildbot is green, so this is not happening in the latest builds.

  • (1) 0:01:12, test_compile child process was killed by signal -9
  • (2) 0:08:48, test_multiprocessing_forkserver child process was killed by signal -9
  • (3) 0:27:56, test_compile main process was killed (unknown signal... I bet on signal -9, SIGSEGV)

Maybe it was a manual action, but it sounds like a strange coincidence that 3 processes were killed in the same build, and it wasn't at the same time.

@pablogsal
Member

But SIGSEGV is signal 11, not -9

@erlend-aasland
Contributor

Isn't this just an (explicit) SIGKILL? The _exit code_ seems to be -9, not the signal number.

@pablogsal
Member

I am quite sure this is not a segmentation fault, Victor.

@pablogsal
Member

We'll wait for more builds, but for now the buildbot is green so I think this should be closed and reopened if we see it again.

@vstinner
Member Author

vstinner commented Jun 9, 2021

Oh right, exit code -9 means killed by SIGKILL, it does not mean killed by SIGSEGV. Sorry about the confusion.

How can a process be killed by SIGKILL? Can it be related to the Linux OOM Killer?

Senthil: Would you mind having a look at the server logs to see if there is anything suspicious?

@vstinner changed the title from "test_compile segfault on AMD64 Ubuntu 3.x" to "test_compile killed by SIGKILL on AMD64 Ubuntu 3.x" on Jun 9, 2021
@erlend-aasland
Contributor

Oh, right, there is of course a connection between the exit code and the signal number. Thanks for the reminder :)
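
For readers following along, the relationship between the exit code and the signal number discussed above can be checked with a small sketch. It assumes a POSIX system; `describe_exit` is a hypothetical helper name, and the self-killing child only stands in for a worker killed by the OOM Killer. It is not how regrtest itself reports crashes.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a short description.

    Hypothetical helper: on POSIX, subprocess reports a negative
    return code -N when the child was terminated by signal N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


# Child that sends SIGKILL to itself, standing in for an OOM-killed worker.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
proc = subprocess.run([sys.executable, "-c", child_code])

print(proc.returncode, "->", describe_exit(proc.returncode))
# SIGSEGV has its own, different signal number.
print("SIGSEGV is signal", int(signal.SIGSEGV))
```

So "Exit code -9" in the regrtest log is consistent with SIGKILL (signal 9), while a segfault would normally show up as -11 on Linux.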

@orsenthil
Member

Yes, this was related to the Linux OOM Killer. The agent went down
shortly after this. Either multiple parallel jobs might have led to OOM
or something else. I will see if logs provide more information.

@vstinner
Member Author

vstinner commented Jun 9, 2021

Yes, this was related to the Linux OOM Killer.

Oh ok. Maybe you should give more memory to your worker, or spawn fewer jobs in parallel (-j1 instead of -j2). Or you could disable other services which eat memory.

How much memory does it have?
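
As a rough illustration of the "fewer jobs in parallel" suggestion, a worker could derive its `-j` value from available memory and CPU count. This is only a sketch: `pick_worker_count`, `read_mem_available_bytes` and the 512 MiB per-job budget are assumptions made for the example, not values used by regrtest or by the buildbot configuration.

```python
import os

# Assumed memory budget per test job; purely illustrative.
PER_WORKER_BYTES = 512 * 1024 * 1024


def pick_worker_count(mem_available_bytes: int, cpu_count: int,
                      per_worker_bytes: int = PER_WORKER_BYTES) -> int:
    """Return a job count bounded by both CPUs and memory (at least 1)."""
    by_memory = mem_available_bytes // per_worker_bytes
    return max(1, min(cpu_count, by_memory))


def read_mem_available_bytes(path: str = "/proc/meminfo"):
    """Read MemAvailable on Linux; return None if it cannot be determined."""
    try:
        with open(path, encoding="ascii") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1]) * 1024  # value is in kB
    except OSError:
        pass
    return None


mem = read_mem_available_bytes()
if mem is not None:
    jobs = pick_worker_count(mem, os.cpu_count() or 1)
    print(f"suggested: -j{jobs}")
```

With these assumed numbers, a machine with 1 CPU ends up at `-j1` regardless of how much memory is free, because the CPU count is the tighter bound.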

@vstinner changed the title from "test_compile killed by SIGKILL on AMD64 Ubuntu 3.x" to "test_compile killed by SIGKILL on AMD64 Ubuntu 3.x (Linux OOM Killer)" on Jun 9, 2021
@orsenthil
Member

Maybe you should give more memory to your worker, or spawn fewer jobs in parallel

It was related to a high number of jobs on that particular agent, which resulted in an OOM kill from the Linux kernel - https://pastebin.com/559H4ksa

The machine has 1 GB of RAM, but I realize that it has only 1 CPU. (This does not seem optimal; a minimum of 2 CPUs seems to be the recommendation - https://devguide.python.org/buildworker/)

I will change it to run fewer jobs in parallel and disable some services which are not used, and we can check again.

For this, I would rather side with an agent resource issue than a compiler issue. Sorry for that.


I also noticed a number of unsuccessful SSH attempts on the server (today) - https://pastebin.com/ab0EKDuF

The agent probably became unreachable because of this, and I rebooted the agent from the cloud console so that I could log in and see what might have happened.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022