Issue41642
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2020-08-26 09:29 by xtreak, last changed 2022-04-11 14:59 by admin.
Messages (23) | |||
---|---|---|---|
msg375925 - (view) | Author: Karthikeyan Singaravelan (xtreak) * | Date: 2020-08-26 09:29 | |
It seems many of the RHEL and Fedora builds fail due to disk space https://buildbot.python.org/all/#/builders/185/builds/2 ./configure: line 2382: cannot create temp file for here-document: No space left on device ./configure: line 2394: cannot create temp file for here-document: No space left on device ./configure: line 2429: cannot create temp file for here-document: No space left on device ./configure: line 2591: cannot create temp file for here-document: No space left on device ./configure: line 2595: cannot create temp file for here-document: No space left on device ./configure: line 2599: cannot create temp file for here-document: No space left on device ./configure: line 2603: cannot create temp file for here-document: No space left on device ./configure: line 2607: cannot create temp file for here-document: No space left on device ./configure: line 2611: cannot create temp file for here-document: No space left on device |
|||
msg375926 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-26 09:33 | |
> It seems many of the RHEL and Fedora builds fail due to disk space These workers have different owners and so need to reach different people. We should list all impacted workers. > https://buildbot.python.org/all/#/builders/185/builds/2 AMD64 RHEL8 3.x is the worker: cstratak-RHEL8-x86_64. |
|||
msg375928 - (view) | Author: Charalampos Stratakis (cstratak) * | Date: 2020-08-26 10:57 | |
There is an issue which I discovered after I returned from holidays, basically the buildbot-worker keeps getting disconnected from master, so builds start and end abruptly, retaining some artifacts. The next second it tried again with the same results, eventually filling the hard disk with the artifacts. Might be due to an updated package, but I've yet to discover what the issue is. |
|||
msg375930 - (view) | Author: Charalampos Stratakis (cstratak) * | Date: 2020-08-26 13:04 | |
There were almost 10GB of remnant cc* files in /tmp from the compilers used, which I presume were also the temporary artifacts which remained there after the disconnects. Cleaned those up and rebooted the RHEL8 x86_64 buildbot. |
|||
msg375931 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-26 13:51 | |
python-builder-rawhide had its /tmp partition full of temporary "ccXXXX.XXX" files. Before: /tmp was full at 100% (3.9 GB). After sudo rm -f /tmp/cc*, only 52 KB are used (1%). I'm not sure why gcc/clang left so many temporary files :-/ There are many large (22 MB) assembly files (.s). |
|||
msg375934 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-26 14:58 | |
Statistics on partition which are the most full. Fedora Rawhide x86-64 is ok: /dev/mapper/vg_root_python--builder--rawhide.osci.io-root 14G 5,4G 7,6G 42% / /dev/mapper/vg_root_python--builder--rawhide.osci.io-home 36G 24G 11G 70% /home Fedora Stable x86-64 is ok: /dev/mapper/vg_root_python--builder2--rawhide.osci.io-root 14G 7,7G 5,2G 60% / /dev/mapper/vg_root_python--builder2--rawhide.osci.io-home 36G 23G 12G 67% /home RHEL8 x86-64 is ok: /dev/mapper/vg_root_python--builder--rhel8.osci.io-root 14G 3,5G 9,5G 27% / /dev/mapper/vg_root_python--builder--rhel8.osci.io-home 36G 9,7G 24G 29% /home RHEL7 x86-64 is ok: /dev/mapper/vg_root_python--builder--rhel7.osci.io-root 7,6G 3,6G 3,7G 49% / /dev/mapper/vg_root_python--builder--rhel7.osci.io-home 22G 15G 5,9G 71% /home RHEL8 FIPS x86-64 is ok: /dev/mapper/vg_root_python--builder--rhel8--fips.osci.io-root 15G 2.8G 12G 20% / /dev/mapper/vg_root_python--builder--rhel8--fips.osci.io-home 34G 3.7G 29G 12% /home Fedora Rawhide AArch64 is ok: /dev/mapper/fedora-root 44G 26G 19G 58% / tmpfs 16G 436K 16G 1% /tmp Fedora Stable AArch64 is ok: /dev/mapper/fedora-root 44G 33G 11G 76% / tmpfs 16G 1,6G 15G 11% /tmp RHEL7 AArch64 is ok: /dev/mapper/rhel-root 44G 15G 30G 33% / RHEL8 AArch64 had like 22 GB in /tmp, I removed them. It's now better: * before: /dev/mapper/rhel-root 44G 41G 3.3G 93% / * after: /dev/mapper/rhel-root 44G 16G 28G 36% / Fedora Stable ppc64le /tmp contained 1 GB of temporay files. I removed them. Before: /dev/mapper/fedora-root 45G 29G 17G 63% / tmpfs 4.0G 1.1G 3.0G 27% /tmp Fedora Rawhide ppc64le is ok: /dev/mapper/fedora-root 45G 27G 19G 59% / tmpfs 4.0G 384K 4.0G 1% /tmp RHEL7 ppc64le is ok: /dev/mapper/rhel-root 45G 19G 27G 42% / RHEL8 ppc64le had 22 GB of old files in /tmp: removed, rebooted. Before: /dev/mapper/rhel-root 45G 41G 4.2G 91% / |
|||
msg376006 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-27 16:29 | |
27 buildbot workers are detached. They are are detached every minute! Affected workers: aixtools-aix-power6 billenstein-macos bolen-ubuntu bolen-windows10 bolen-windows7 cstratak-RHEL7-aarch64 cstratak-RHEL7-ppc64le cstratak-RHEL7-x86_64 cstratak-RHEL8-fips-x86_64 cstratak-RHEL8-ppc64le cstratak-fedora-stable-ppc64le edelsohn-aix-ppc64 edelsohn-debian-z edelsohn-fedora-ppc64 edelsohn-fedora-rawhide-z edelsohn-fedora-z edelsohn-rhel-z edelsohn-rhel8-z edelsohn-sles-z gps-clang-ubsan gps-debian-profile-opt gps-raspbian isidentical-centos-power8 kloth-win64 koobs-freebsd-564d ware-alpine ware-win81-release Output of: grep Worker.detach twistd.log|sed -e 's!.*(!!g;s!)!!g'|sort -u|uniq Example of server logs: 2020-08-27 15:33:49+0000 [Broker,33580,10.132.169.156] Worker.detached(billenstein-macos) 2020-08-27 15:34:43+0000 [Broker,33608,10.132.169.156] Worker.detached(billenstein-macos) 2020-08-27 15:35:40+0000 [Broker,33635,10.132.169.157] Worker.detached(billenstein-macos) (...) |
|||
msg376007 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-27 16:30 | |
I closed bpo-41648 "edelsohn-* buildbot worker failing with: No space left on device" as a duplicate of this issue. |
|||
msg376008 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2020-08-27 16:31 | |
I have found a large number of un-removed files in /tmp. Things seem to function better with Buildbots running older 0.x "buildslave" as opposed to newer "builtbot-worker" instances. |
|||
msg376009 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-27 16:36 | |
> I have found a large number of un-removed files in /tmp. Right. I found many /tmp/ccXXXX.XXX and /tmp/tmpXXXXX files. Around 20 GB of these files! Maybe using passing "-pipe" to gcc/clang would avoid the /tmp/ccXXXX.XXX files when a build is interrupted. For example, I saw assembly files (.s) of around 20 MB. I don't know what are the /tmp/tmpXXXXX files. I'm disappointed that in 2020, buildbot has no safe way to ensure that all created files are removed at the end of a build. chroot, containers, etc. are effecient way to ensure that everything is removed at the end of a build. |
|||
msg376010 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-27 16:38 | |
On the worker (client) side, I see many "lost remote step" every 1 to 3 minutes. Example with the PPC64LE Fedora Stable (cstratak-fedora-stable-ppc64le) worker: 2020-08-27 01:30:09-0400 [Broker,client] lost remote step 2020-08-27 01:31:57-0400 [Broker,client] lost remote step 2020-08-27 01:34:34-0400 [Broker,client] lost remote step 2020-08-27 01:36:29-0400 [Broker,client] lost remote step 2020-08-27 01:38:37-0400 [Broker,client] lost remote step (...) |
|||
msg376011 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-27 17:18 | |
The buildbot server migrated to a new machine and is now behind a load balancer. tcp/80 (buildbot web page, HTTP) and tcp/9020 (used by buildbot workers) are both behind the load balancer. Maybe the load balancer closes TCP connections which are idle for 60 seconds? Buildbot workers have a TCP keepalive option of 1 hour (3600 seconds) by default: https://docs.buildbot.net/latest/manual/configuration/workers.html#master-worker-tcp-keepalive Pablo told me that his worker uses a keepalive of 2 minutes (120 seconds). |
|||
msg376022 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-28 08:32 | |
> The buildbot server migrated to a new machine and is now behind a load balancer. tcp/80 (buildbot web page, HTTP) and tcp/9020 (used by buildbot workers) are both behind the load balancer. > (...) > Buildbot workers have a TCP keepalive option of 1 hour (3600 seconds) by default (...) Ernest confirmed that there are edge load balancers for the PSF infra in DigitalOcean. He updated the load balancers to offer a full 24 hour timeout on buildbot TCP connections. (Yesterday around 17:30 UTC.) It seems like it doesn't fix the issue. Example in server logs: (...) 2020-08-28 08:21:55+0000 [Broker,50831,10.132.169.157] Worker.detached(koobs-freebsd-564d) 2020-08-28 08:23:25+0000 [Broker,50856,10.132.169.157] Worker.detached(koobs-freebsd-564d) 2020-08-28 08:24:55+0000 [Broker,50881,10.132.169.157] Worker.detached(koobs-freebsd-564d) 2020-08-28 08:26:26+0000 [Broker,50906,10.132.169.157] Worker.detached(koobs-freebsd-564d) 2020-08-28 08:27:56+0000 [Broker,50931,10.132.169.156] Worker.detached(koobs-freebsd-564d) (...) |
|||
msg376023 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-28 08:45 | |
I added keepalive_interval=60 parameter to Worker() in the server configuration: https://github.com/python/buildmaster-config/commit/2d28a4cfe77a3e206028613524a1e938801a1655 |
|||
msg376026 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-28 09:41 | |
There are multiple errors in the buildbot server logs. I'm not sure if it's related or not. 2020-08-28 09:16:25+0000 [-] while invoking <bound method HttpStatusPushBase.buildStarted of <buildbot.reporters.github.GitHubStatusPush object at 0x7f661dd58850>> Traceback (most recent call last): File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks result = g.send(result) File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/reporters/http.py", line 80, in getMoreInfoAndSend yield self.send(build) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator return _cancellableInlineCallbacks(gen) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks _inlineCallbacks(None, g, status) --- <exception caught here> --- File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/reporters/http.py", line 80, in getMoreInfoAndSend yield self.send(build) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks result = g.send(result) File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/reporters/github.py", line 133, in send m = re.search(r"refs/pull/([0-9]*)/merge", branch) File "/usr/lib/python3.8/re.py", line 201, in search return _compile(pattern, flags).search(string) builtins.TypeError: expected string or bytes-like object Error when stopping the server. It seems like this one is just that a client tries to reconnect whereas the server is down: 2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] BuildMaster is stopped 2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] invalid login from unknown user 'ware-win81-release' 2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] Peer will receive following PB traceback: 2020-08-28 09:30:43+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] Unhandled Error Traceback (most recent call last): File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 460, in callback self._startRunCallbacks(result) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks self._runCallbacks() File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1475, in gotResult _inlineCallbacks(r, g, status) --- <exception caught here> --- File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks result = g.send(result) File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/pbmanager.py", line 217, in requestAvatarId eventually(self.master.initLock.release) builtins.AttributeError: 'NoneType' object has no attribute 'initLock' |
|||
msg376027 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-08-28 09:50 | |
On the server side, it seems like the "edelsohn-rhel8-z" worker is detached because its TCP connection is closed, only 87 seconds after the worker was attached. I added some debug traces: 2020-08-28 09:44:02+0000 [Broker,2,10.132.169.156] worker 'edelsohn-rhel8-z' attaching from IPv4Address(type='TCP', host='10.132.169.156', port=56234) 2020-08-28 09:44:02+0000 [Broker,2,10.132.169.156] Got workerinfo from 'edelsohn-rhel8-z' (...) 2020-08-28 09:45:29+0000 [Broker,2,10.132.169.156] @ detached: worker=<Worker 'edelsohn-rhel8-z'> 2020-08-28 09:45:29+0000 [Broker,2,10.132.169.156] @ detached: TB: File "<string>", line 1, in <module> File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/scripts/twistd.py", line 31, in run app.run(runApp, ServerOptions) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 674, in run runApp(config) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/scripts/twistd.py", line 25, in runApp runner.run() File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 385, in run self.postApplication() File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/scripts/_twistd_unix.py", line 268, in postApplication self.startReactor(None, self.oldstdout, self.oldstderr) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 398, in startReactor runReactorWithLogging( File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/application/app.py", line 312, in runReactorWithLogging reactor.run() File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/base.py", line 1283, in run self.mainLoop() File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/base.py", line 1295, in mainLoop self.doIteration(t) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/epollreactor.py", line 235, in doPoll log.callWithLogger(selectable, _drdw, selectable, fd, event) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/log.py", line 103, in callWithLogger return callWithContext({"system": lp}, func, *args, **kw) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/log.py", line 86, in callWithContext return context.call({ILogContext: newCtx}, func, *args, **kw) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/context.py", line 122, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/python/context.py", line 85, in callWithContext return func(*args,**kw) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 627, in _doReadOrWrite self._disconnectSelectable(selectable, why, inRead) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/posixbase.py", line 252, in _disconnectSelectable selectable.readConnectionLost(f) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/tcp.py", line 307, in readConnectionLost self.connectionLost(reason) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/internet/tcp.py", line 327, in connectionLost protocol.connectionLost(reason) File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/spread/pb.py", line 717, in connectionLost notifier() File "/srv/buildbot/venv/lib/python3.8/site-packages/twisted/spread/pb.py", line 1572, in maybeLogout fn() File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/pbmanager.py", line 190, in <lambda> return (pb.IPerspective, persp, lambda: persp.detached(mind)) File "/srv/buildbot/venv/lib/python3.8/site-packages/buildbot/worker/protocols/pb.py", line 155, in detached import traceback, io; out = io.StringIO(); traceback.print_stack(file=out); tb = out.getvalue() It doesn't say if the client closed the connection on purpose, or if the load balancer closed an inactive connection. |
|||
msg376038 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2020-08-28 14:05 | |
I can provide some information from the logs of one of the buildbots, or change a parameter. Let me know. |
|||
msg376230 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-09-02 12:23 | |
> I can provide some information from the logs of one of the buildbots, or change a parameter. Let me know. David: can you please change the buildbot client configuration to use "buildbot-api.python.org" host name? This host name doesn't go through the PSF load balancer. It seems like TCP connection issues are coming from the load balancer. Also, please reduce the keepalive to 60 seconds (keepalive_interval). |
|||
msg376234 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-09-02 13:08 | |
Charris, Pablo and me identified that TCP connections are closed by the load balancer on some buildbot workers. When the "buildbot.python.org" host name is used, TCP connections (tcp port 9020) go through a load balancer. Ernest exposed the TCP port 9020 directly to the Internet (without the load balancer) using a new host name: "buildbot-api.python.org". Buildbot workers should be updated to use "buildbot-api.python.org". I also suggest to use a keepalive of 60 seconds, rather than 600 seconds. If your worker got impacted the this issue, I strongly advice you to clean up manually the temporary directory (/tmp). When a worker was disconnected, the build was interrupted without removing temporary files. On some workers, we got around 20 GB of temporary files in /tmp: "ccXXXX" files and "tmpXXXX" files. I guess that some files are coming from the compiler, some other from the Python test suite. I updated the buildbot client configuration of the 9 workers operated by Red Hat: Fedora Rawhide x64-86 Fedora Stable x64-86 RHEL8 x64-86 RHEL7 x64-86 RHEL8 FIPS x86-64 Fedora Rawhide AArch64 Fedora Stable AArch64 RHEL 8 ppc64le RHEL 7 ppc64le On our owners, I used the following commands: systemctl stop buildbot-worker.service du -sh /tmp; rm -f /tmp/{cc,tmp}*; du -sh /tmp sed -i -e "s/buildmaster_host = 'buildbot.python.org'/buildmaster_host = 'buildbot-api.python.org'/;s/keepalive = .*/keepalive = 60/" /home/buildbot/buildarea/buildbot.tac; grep -E '(host|keepalive) =' /home/buildbot/buildarea/buildbot.tac systemctl start buildbot-worker.service systemctl status buildbot-worker.service |
|||
msg376235 - (view) | Author: STINNER Victor (vstinner) * | Date: 2020-09-02 13:11 | |
Ah, I also updated: Fedora Stable ppc64le Fedora Rawhide ppc64le |
|||
msg376238 - (view) | Author: David Edelsohn (David.Edelsohn) * | Date: 2020-09-02 14:14 | |
I have updated edelsohn-aix-ppc64 edelsohn-debian-z edelsohn-fedora-ppc64 edelsohn-fedora-rawhide-z edelsohn-fedora-z edelsohn-rhel-z edelsohn-rhel8-z edelsohn-sles-z aixtools-aix-power6 |
|||
msg376258 - (view) | Author: Gregory P. Smith (gregory.p.smith) * | Date: 2020-09-02 19:06 | |
The gps-* bots have been updated. |
|||
msg376259 - (view) | Author: Pablo Galindo Salgado (pablogsal) * | Date: 2020-09-02 19:11 | |
All the pablogsal-* buildbots have been updated |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:35 | admin | set | github: 85808 |
2020-09-02 19:11:07 | pablogsal | set | messages: + msg376259 |
2020-09-02 19:06:07 | gregory.p.smith | set | nosy:
+ gregory.p.smith messages: + msg376258 |
2020-09-02 14:14:12 | David.Edelsohn | set | messages: + msg376238 |
2020-09-02 13:11:46 | vstinner | set | messages: + msg376235 |
2020-09-02 13:08:13 | vstinner | set | messages: + msg376234 |
2020-09-02 12:23:16 | vstinner | set | messages: + msg376230 |
2020-08-28 14:05:01 | David.Edelsohn | set | messages: + msg376038 |
2020-08-28 09:50:46 | vstinner | set | messages: + msg376027 |
2020-08-28 09:41:23 | vstinner | set | messages: + msg376026 |
2020-08-28 08:45:09 | vstinner | set | messages: + msg376023 |
2020-08-28 08:32:35 | vstinner | set | messages:
+ msg376022 title: RHEL and fedora buildbots fail due to disk space error -> Buildbot: workers detached every minute and "no space left on device" issue |
2020-08-27 17:18:13 | vstinner | set | messages: + msg376011 |
2020-08-27 16:38:01 | vstinner | set | messages: + msg376010 |
2020-08-27 16:36:33 | vstinner | set | messages: + msg376009 |
2020-08-27 16:31:40 | David.Edelsohn | set | nosy:
+ David.Edelsohn messages: + msg376008 title: Buildbot: workers detached every minute and "no space left on device" issue -> RHEL and fedora buildbots fail due to disk space error |
2020-08-27 16:30:08 | vstinner | set | messages: + msg376007 |
2020-08-27 16:29:45 | vstinner | set | messages:
+ msg376006 title: RHEL and fedora buildbots fail due to disk space error -> Buildbot: workers detached every minute and "no space left on device" issue |
2020-08-27 16:26:35 | vstinner | link | issue41648 superseder |
2020-08-26 14:58:37 | vstinner | set | messages: + msg375934 |
2020-08-26 13:51:54 | vstinner | set | messages: + msg375931 |
2020-08-26 13:04:00 | cstratak | set | messages: + msg375930 |
2020-08-26 10:57:33 | cstratak | set | messages: + msg375928 |
2020-08-26 09:33:36 | vstinner | set | messages: + msg375926 |
2020-08-26 09:29:54 | xtreak | create |