This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

Classification
Title: selectors test_interrupted_retry is flaky
Type: behavior
Stage: resolved
Components: Tests
Versions: Python 3.4

Process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: neologix
Nosy List: christian.heimes, neologix, pitrou, r.david.murray
Priority: normal
Keywords:

Created on 2013-10-25 08:06 by christian.heimes, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (13)
msg201225 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-25 08:06
The test seems to be unstable on some buildbots:

http://buildbot.python.org/all/builders/x86%20Gentoo%20Non-Debug%203.x/builds/5271/steps/test/logs/stdio

======================================================================
FAIL: test_interrupted_retry (test.test_selectors.EpollSelectorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_selectors.py", line 299, in test_interrupted_retry
    self.assertLess(time() - t, 2.5)
AssertionError: 105.86884237639606 not less than 2.5

----------------------------------------------------------------------
Ran 58 tests in 127.113s
msg201237 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 13:39
This failure doesn't make sense: whether SIGALRM is delivered or not, epoll() should time out after 2 seconds, not 105s.
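
For reference, a minimal sketch of what the failing test exercises, reconstructed from the traceback above (the socketpair setup and no-op handler are assumptions, not the verbatim test code; POSIX-only):

import selectors
import signal
import socket
from time import monotonic as time

sel = selectors.DefaultSelector()
rd, wr = socket.socketpair()
sel.register(rd, selectors.EVENT_READ)

# Deliver SIGALRM while select() is blocked; the selector is expected to
# retry after the interruption and still honour the overall 2 s timeout.
signal.signal(signal.SIGALRM, lambda signum, frame: None)
signal.alarm(1)

t = time()
sel.select(timeout=2)   # no data ever arrives, so this should return after ~2 s
elapsed = time() - t
assert elapsed < 2.5, elapsed   # the buildbot reported ~105.87 s here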

What's more, here's another test_subprocess failure:
"""
======================================================================
ERROR: test_wait_timeout (test.test_subprocess.ProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_subprocess.py", line 993, in test_wait_timeout
    self.assertEqual(p.wait(timeout=3), 0)
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/subprocess.py", line 1514, in wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/var/lib/buildslave/3.x.murray-gentoo-wide/build/python', '-c', 'import time; time.sleep(0.3)']' timed out after 3 seconds
"""

Even stranger, a test_asyncio failure:
"""
======================================================================
FAIL: test_interrupted_retry (test.test_selectors.EpollSelectorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_selectors.py", line 299, in test_interrupted_retry
    self.assertLess(time() - t, 2.5)
AssertionError: 105.86884237639606 not less than 2.5
"""

asyncio uses selectors as its backend, so let's grant that the test can fail, but what's really surprising is that the timeout is *exactly* the same, down to the last digit.

Something fishy is going on: David, did something change recently on that buildbot?
msg201238 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-25 14:05
Flaky VM clock perhaps?
msg201240 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 14:44
> Flaky VM clock perhaps?

Yeah, I thought about that, but here we see *exactly* the same value for
two time intervals (measured with time.monotonic()), computed over
different test suite runs. I'd have expected something more erratic...
msg201241 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-25 14:49
The only change on the buildbot for some months now was increasing the size of the in-memory /tmp filesystem, which I did yesterday.  That also involved a reboot, though, and it is conceivable that there were software changes made since the last reboot and before this one.  I don't *think* there were, but I can't promise it.  No kernel changes, though, for sure.
msg201242 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 15:22
Could you try running the test in a loop under strace ("strace -ttTf")?
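
Something along these lines, for example (a sketch; the iteration count and output paths are illustrative, and -ttTf makes strace print timestamps and syscall durations while following forks):

import subprocess

for i in range(50):
    # Run only the flaky test, tracing each run to its own file.
    subprocess.call(
        ["strace", "-ttTf", "-o", "trace.%d" % i,
         "./python", "-m", "test", "test_selectors",
         "-m", "test_interrupted_retry"])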
msg201678 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-29 21:35
Here's another failure:
"""
======================================================================
FAIL: test_interpreter_shutdown (test.test_concurrent_futures.ProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_concurrent_futures.py", line 77, in tearDown
    self.assertLess(dt, 60, "synchronization issue: test lasted too long")
AssertionError: 108.88635897636414 not less than 60 : synchronization issue: test lasted too long
"""

Many timeout-related tests are failing randomly on this buildbot, so I
assume that's really a buildbot issue (is it a VM?).
msg201679 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 21:45
Yes, it is a Linux-VServer VM. I would think we would want tests not to fail in VMs, though, in this day and age.
msg201681 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 21:47
Also, unless people haven't been reporting other errors, these buildbots have been stable up until recently, with the exception of the multiprocessing hangs that plague all the buildbots as far as I know.  (My impression could be mistaken, though.)
msg201682 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-29 21:55
> Also, unless people haven't been reporting other errors, these buildbots have been stable up until recently

Well, the fact that several tests which were working just fine (like
test_concurrent_futures and test_subprocess) recently started failing
randomly is a strong hint that the culprit is the buildbot. (I mean, the
test_selectors failure just calls epoll_wait(): there is a known overflow
for epoll_wait()'s timeout, but for large timeout values, not 2s.)
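
For concreteness, that overflow comes from converting the timeout to milliseconds for a C int, so (assuming the usual 32-bit int) only very large timeouts are affected:

# epoll_wait() takes its timeout as a C int in milliseconds.
INT_MAX = 2**31 - 1           # typical 32-bit C int
print(INT_MAX // 1000)        # ~2147483 seconds, i.e. about 24.8 days
print(2 * 1000 <= INT_MAX)    # True: a 2 s timeout is nowhere near the limit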

This wouldn't be the first time I see clock issues on a virtual machine...
msg201685 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 22:13
What could cause a VM clock problem on a machine that hasn't been modified, when the VMs have been rebooted recently (i.e., it isn't stale state inside the VM itself)? Anything you can suggest I look at?
msg201687 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 22:20
I'm going to reboot the host just to see if that makes any difference.  It's been up for 105 days.
msg203652 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-21 16:21
Looking at the current buildbot history I don't see these errors any more.  I'm not sure how frequent they were, but I think I'll mark this as resolved and we can reopen it if we see it again.
History
Date                 User              Action  Args
2022-04-11 14:57:52  admin             set     github: 63585
2013-11-21 16:21:32  r.david.murray    set     status: open -> closed; resolution: not a bug; messages: + msg203652; stage: needs patch -> resolved
2013-10-29 22:20:47  r.david.murray    set     messages: + msg201687
2013-10-29 22:13:17  r.david.murray    set     messages: + msg201685
2013-10-29 21:55:12  neologix          set     messages: + msg201682
2013-10-29 21:47:42  r.david.murray    set     messages: + msg201681
2013-10-29 21:45:19  r.david.murray    set     messages: + msg201679
2013-10-29 21:35:56  neologix          set     messages: + msg201678
2013-10-25 15:22:05  neologix          set     messages: + msg201242
2013-10-25 14:49:15  r.david.murray    set     messages: + msg201241
2013-10-25 14:44:48  neologix          set     messages: + msg201240
2013-10-25 14:05:25  pitrou            set     nosy: + pitrou; messages: + msg201238
2013-10-25 13:39:51  neologix          set     nosy: + r.david.murray; messages: + msg201237
2013-10-25 08:06:31  christian.heimes  create