test_compile killed by SIGKILL on AMD64 Ubuntu 3.x (Linux OOM Killer) #88526
test_compile and test_multiprocessing_forkserver crashed with a segfault (SIGSEGV) on AMD64 Ubuntu 3.x. It *seems* like test_compile.test_stack_overflow() crashed, but the log is not reliable, so I cannot confirm.

According to buildbot, the responsible change is:

So Eric, can you please investigate the change? If nobody is available to fix the buildbot, I suggest reverting the change.

Python was built in debug mode with:

./configure --prefix '$(PWD)/target' --with-pydebug

test.pythoninfo:

CC.version: gcc (Ubuntu 10.3.0-1ubuntu1) 10.3.0

Logs:

./python ./Tools/scripts/run_tests.py -j 1 -u all -W --slowest --fail-env-changed --timeout=900 -j2 --junit-xml test-results.xml
See also bpo-44348 "test_exceptions.ExceptionTests.test_recursion_in_except_handler stack overflow on Windows debug builds".
I don't think that's a segfault. It looks like the process was killed, no? Also, the buildbot is green, so this is not happening in the latest builds.
Maybe it was a manual action, but it sounds like a strange coincidence that 3 processes were killed in the same build, and not at the same time.
But SIGSEGV is signal 11, not -9.
Isn't this just an (explicit) SIGKILL? The _exit code_ seems to be -9, not the signal number.
I am quite sure this is not a segmentation fault, Victor.
We'll wait for more builds, but for now the buildbot is green, so I think this should be closed and reopened if we see it again.
Oh right, exit code -9 means killed by SIGKILL; it does not mean killed by SIGSEGV. Sorry about the confusion. How can a process be killed by SIGKILL? Can it be related to the Linux OOM Killer? Senthil: would you mind having a look at the server logs to see if anything looks suspicious?
Oh, right, there is of course a connection between the exit code and the signal number. Thanks for the reminder :)
Yes, this was related to the Linux OOM Killer. The agent went down.
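The exit-code convention discussed above can be checked from Python itself. A minimal sketch, assuming a POSIX system (signal.SIGKILL does not exist on Windows): subprocess reports a child terminated by signal N as returncode -N, so -9 decodes to SIGKILL and -11 to SIGSEGV. The helper name describe_returncode is hypothetical and not part of the CPython test suite.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "terminated by signal N" as returncode -N.
        sig = signal.Signals(-returncode)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Start a child that would sleep for a minute, then SIGKILL it,
    # which is the same signal the Linux OOM killer sends.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGKILL)
    child.wait()
    print(child.returncode, describe_returncode(child.returncode))
```

On Linux, running it should print -9 followed by the SIGKILL description, which is the same -9 that the buildbot log reported.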
Oh ok. Maybe you should give more memory to your worker, or spawn fewer jobs in parallel (-j1 instead of -j2), or disable other services that eat memory. How much memory does it have?
It was caused by a high number of jobs on that particular agent, which resulted in an OOM kill by the Linux kernel - https://pastebin.com/559H4ksa

The machine has 1 GB of RAM, but I realized that it has only 1 CPU (this seems suboptimal; a minimum of 2 CPUs seems to be the recommendation - https://devguide.python.org/buildworker/). I will change it to run fewer jobs in parallel and disable some unused services, and we can see how it goes. In this case, I would attribute the failure to an agent resource issue rather than a compiler issue. Sorry for that.

I also noticed a number of unsuccessful SSH attempts on the server (today) - https://pastebin.com/ab0EKDuF. The agent probably became unreachable because of this, so I rebooted it from the cloud console in order to log in and see what had happened.
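The advice to run fewer jobs on a small worker amounts to capping -j by both CPU count and memory. A rough sketch, assuming a Linux worker and a hypothetical budget of 512 MB per test job (an illustrative figure, not one taken from the devguide or from this thread); the function names are hypothetical as well. With 1 CPU and 1 GB of RAM, as on this agent, it suggests -j1.

```python
import os

# Hypothetical per-job memory budget: an assumption for illustration only,
# not a figure published by CPython or the devguide.
ASSUMED_MB_PER_JOB = 512


def suggested_jobs(cpus: int, total_mb: int, mb_per_job: int = ASSUMED_MB_PER_JOB) -> int:
    """Cap parallel test jobs by both CPU count and RAM, never going below 1."""
    by_memory = total_mb // mb_per_job
    return max(1, min(cpus, by_memory))


def physical_memory_mb() -> int:
    # These sysconf names exist on Linux; other platforms may not provide them.
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // (1024 * 1024)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    print(f"-j{suggested_jobs(cpus, physical_memory_mb())}")
```

Taking the minimum of the two limits means a machine with many CPUs but little RAM is still held back, which is the situation where the OOM killer tends to step in.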