classification
Title: Buildbot reliability
Type: behavior
Stage: resolved
Components: Tests
Versions:
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, ned.deily, pitrou, skrah, vstinner
Priority: normal
Keywords: buildbot

Created on 2011-04-30 06:43 by skrah, last changed 2011-05-07 09:06 by skrah. This issue is now closed.

Files
File name Uploaded Description
freebsd-amd64-log.txt skrah, 2011-05-02 18:23
Messages (10)
msg134839 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-04-30 06:43
The FreeBSD-AMD64 bot exhibits sporadic hanging in unspecific places.
FreeBSD is running under kvm in the background. When the hanging occurs,
the virtual machine uses 100% CPU and I can't log in via ssh, so I have
to kill the kvm process.

The fact that the ssh login fails if a user process is misbehaving
seems like a FreeBSD/kvm issue to me. However, this problem did not
occur when I set up the bot a couple of weeks ago.


I've started a series of older revision builds to see if anything
recent causes this.
msg134890 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-04-30 23:15
> The FreeBSD-AMD64 bot exhibits sporadic hanging in unspecific places.

You can try a shorter regrtest timeout. Edit Lib/test/regrtest.py near:

    if hasattr(faulthandler, 'dump_tracebacks_later'):
        timeout = 60*60

(or use the --timeout option of regrtest.py)
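
As a rough illustration of what that timeout does, here is a minimal standalone sketch of a faulthandler watchdog. It assumes a released Python 3.3 or later, where the function is named `dump_traceback_later` (the `dump_tracebacks_later` spelling above is the pre-release name). `run_with_watchdog` is a hypothetical helper for this example, not part of regrtest.

```python
import faulthandler
import sys


def run_with_watchdog(func, timeout, log=sys.stderr):
    """Run func(); if it is still running after `timeout` seconds,
    faulthandler dumps the traceback of every thread to `log`.

    Hypothetical helper, for illustration only.
    """
    # exit=False: only report the hang and let the call keep running.
    faulthandler.dump_traceback_later(timeout, file=log, exit=False)
    try:
        return func()
    finally:
        # Disarm the watchdog once the call has returned.
        faulthandler.cancel_dump_traceback_later()
```

Calling `run_with_watchdog(some_test, timeout=60 * 60)` would mirror the one-hour default shown above; a much smaller value makes a hang show up sooner.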

If you have access to a terminal (using ssh), you can also register a signal handler that dumps the traceback: edit regrtest.py to add "import signal; faulthandler.register(signal.SIGUSR1, all_threads=True)" after "faulthandler.enable()". Then use "kill -USR1 pid" to dump the traceback.

Or the problem is an infinite loop while dumping the traceback because of a timeout :-D In this case, disable the timeout using the --timeout=0 option of regrtest.py.
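
The SIGUSR1 suggestion above can be sketched outside regrtest as follows. This is only an illustration: `install_traceback_signal` is a hypothetical helper name, and `faulthandler.register()` is not available on Windows.

```python
import faulthandler
import signal
import sys


def install_traceback_signal(log=sys.stderr, signum=signal.SIGUSR1):
    """Dump the traceback of every thread to `log` each time `signum` arrives.

    Hypothetical helper. `log` must be a real file with a file descriptor,
    because faulthandler writes to it directly from the signal handler.
    """
    faulthandler.register(signum, file=log, all_threads=True)
```

After calling `install_traceback_signal()` early in the process, "kill -USR1 pid" from a second terminal prints where each thread currently is, without stopping the process.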
msg134901 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-01 06:03
Thanks Victor, I can try some of that.

Could this also be a problem with the buildbot software or a networking
problem? The Ubuntu PPC bot might have the same issue. There the tests
appear to have finished, but the clean step never starts:

http://www.python.org/dev/buildbot/all/builders/PPC%20Ubuntu%203.1/builds/387/steps/test/logs/stdio
http://www.python.org/dev/buildbot/all/builders/PPC%20Ubuntu%203.1/builds/387
msg134922 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2011-05-01 19:36
That might be another instance of this:

   http://thread.gmane.org/gmane.comp.python.devel/123698

You might want to bring this up on python-dev.
msg134997 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-02 18:23
Going through the logs, this indeed looks like a buildbot software
issue to me. I attach the logs that correspond to this incident:

http://www.python.org/dev/buildbot/all/builders/AMD64%20FreeBSD%208.2%203.2/builds/85

After ...

2011-04-30 01:10:56+0200 [Broker,client]   closing stdin
2011-04-30 01:10:56+0200 [Broker,client]   using PTY: False

... normally you should see:

... [-] command finished with signal None, exit code 0, elapsedTime:


But nothing more was logged until I restarted the bot.
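
The pattern described above (a command starts, but its "command finished" line never appears) can be searched for mechanically. The following is a minimal sketch: `find_unfinished_commands` is a hypothetical helper, and it assumes that "using PTY:" marks the start of a command and "command finished with signal" marks its end, as in the excerpts quoted here.

```python
def find_unfinished_commands(lines):
    """Return the 'using PTY:' log lines that were never followed by a
    'command finished' line (hypothetical helper, assumed markers)."""
    unfinished = []
    pending = None  # most recent command start still waiting for its finish line
    for line in lines:
        line = line.rstrip("\n")
        if "using PTY:" in line:
            if pending is not None:
                # A new command started before the previous one finished.
                unfinished.append(pending)
            pending = line
        elif "command finished with signal" in line:
            pending = None
    if pending is not None:
        # The log ends while a command is still outstanding.
        unfinished.append(pending)
    return unfinished
```

Passing an open log file (or any iterable of lines) returns the start lines of the commands that hung. A command that is still legitimately running when the log is read is reported as well, so the timestamps need a manual look.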
msg135084 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-03 22:15
Another instance:

2011-05-03 20:18:08+0200 [Broker,client]   closing stdin
2011-05-03 20:18:08+0200 [Broker,client]   using PTY: False
2011-05-03 20:20:38+0200 [-] sending app-level keepalive


Again this is missing:

... [-] command finished with signal None, exit code 0, elapsedTime:


Also, as we speak the Ubuntu PPC bot is hanging as well:


http://www.python.org/dev/buildbot/all/builders/PPC%20Ubuntu%202.7/builds/386/steps/test/logs/stdio



Antoine, do you have access to the server logs for the relevant
times? My bot is on CEST.
msg135085 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2011-05-03 22:40
My Ubuntu PPC server is having hardware problems.  It will just intermittently shut off.  I've reset the SMU and the PRAM, vacuumed out the guts, reseated the RAM, pulled any possibly problematic 3rd party boards, and it still crashes.  I was watching the syslog and it didn't look like a thermal shutdown, though it acted like that.  The only thing I can think of is a power supply problem, so I'm going to see if I can find an inexpensive replacement.  In the meantime, this machine will be offline for a couple of weeks at least.
msg135174 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-05 07:10
The FreeBSD bot had these error messages in the log files:

1) kernel: swap_pager: indefinite wait buffer: device
2) Approaching the limit on PV entries, consider increasing either the vm.pmap.shpgperproc or the vm.pmap.pv_entry_max sysctl.


I set up the bot from scratch with these changes:

a) Use a swap partition (2 GB) instead of a swap file (2 GB).

b) Use these sysctls:

     kern.ipc.shm_use_phys=1
     vm.pmap.shpgperproc=4096
     vm.pmap.pv_entry_max=16777216

c) Use self-compiled Python2.7 instead of the system Python2.6.
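
To confirm that the values from b) are actually in effect after a rebuild, a small check along these lines could be used. It is a sketch only: `read_sysctl` and `mismatches` are hypothetical helper names, and it assumes the `sysctl` command is on the PATH.

```python
import subprocess

# Values taken from item b) above.
WANTED = {
    "kern.ipc.shm_use_phys": "1",
    "vm.pmap.shpgperproc": "4096",
    "vm.pmap.pv_entry_max": "16777216",
}


def read_sysctl(name):
    """Return the current value of one sysctl as a string (runs `sysctl -n`)."""
    result = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def mismatches(wanted, read=read_sysctl):
    """Return {name: (wanted, actual)} for every sysctl that differs."""
    diffs = {}
    for name, value in wanted.items():
        actual = read(name)
        if actual != value:
            diffs[name] = (value, actual)
    return diffs
```

On the bot, `mismatches(WANTED)` returning an empty dict means all three settings match. The `read` parameter exists only so the comparison can be exercised without a FreeBSD machine.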


Let's see how that works out. Error 1) is bad; perhaps FreeBSD
does not play well with the qcow2 image format under high load.
msg135175 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-05 07:36
On second thought, I don't want to debug possible qcow2 issues, so
I made another change:

d) Use raw format for the image.
msg135421 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-05-07 09:06
I think the FreeBSD bot changes are working out fine. The Ubuntu-PPC
issues were unrelated, so I'm closing this.
History
Date User Action Args
2011-05-07 09:06:55 skrah set status: open -> closed; messages: + msg135421; keywords: + buildbot; resolution: fixed; stage: resolved
2011-05-05 07:36:05 skrah set messages: + msg135175
2011-05-05 07:10:24 skrah set messages: + msg135174
2011-05-03 22:40:58 barry set messages: + msg135085
2011-05-03 22:17:09 skrah set nosy: + barry
2011-05-03 22:15:40 skrah set messages: + msg135084; title: FreeBSD-AMD64 bot sporadic hanging -> Buildbot reliability
2011-05-02 18:23:41 skrah set files: + freebsd-amd64-log.txt; messages: + msg134997
2011-05-01 19:36:41 ned.deily set nosy: + ned.deily; messages: + msg134922
2011-05-01 06:03:43 skrah set messages: + msg134901
2011-04-30 23:15:44 vstinner set nosy: + vstinner; messages: + msg134890
2011-04-30 06:43:10 skrah create