This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

Classification
Title: selectors test_interrupted_retry is flaky
Type: behavior
Stage: resolved
Components: Tests
Versions: Python 3.4

Process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: neologix
Nosy List: christian.heimes, neologix, pitrou, r.david.murray
Priority: normal
Keywords:

Created on 2013-10-25 08:06 by christian.heimes, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (13)
msg201225 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-10-25 08:06
The test seems to be unstable on some buildbots:

http://buildbot.python.org/all/builders/x86%20Gentoo%20Non-Debug%203.x/builds/5271/steps/test/logs/stdio

======================================================================
FAIL: test_interrupted_retry (test.test_selectors.EpollSelectorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_selectors.py", line 299, in test_interrupted_retry
    self.assertLess(time() - t, 2.5)
AssertionError: 105.86884237639606 not less than 2.5

----------------------------------------------------------------------
Ran 58 tests in 127.113s
msg201237 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 13:39
This failure doesn't make sense: whether SIGALRM is delivered or not, epoll() should time out after 2 seconds, not 105s.
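
For reference, a minimal sketch of what the failing test exercises, reconstructed from the traceback above (the socketpair setup and no-op handler are assumptions, not the verbatim test code; POSIX-only):

import selectors
import signal
import socket
from time import monotonic as time

sel = selectors.DefaultSelector()
rd, wr = socket.socketpair()
sel.register(rd, selectors.EVENT_READ)

# Deliver SIGALRM while select() is blocked; the selector is expected to
# retry after the interruption and still honour the overall 2 s timeout.
signal.signal(signal.SIGALRM, lambda signum, frame: None)
signal.alarm(1)

t = time()
sel.select(timeout=2)   # no data ever arrives, so this should return after ~2 s
elapsed = time() - t
assert elapsed < 2.5, elapsed   # the buildbot reported ~105.87 s here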

What's more, here's another test_subprocess failure:
"""
======================================================================
ERROR: test_wait_timeout (test.test_subprocess.ProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_subprocess.py", line 993, in test_wait_timeout
    self.assertEqual(p.wait(timeout=3), 0)
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/subprocess.py", line 1514, in wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['/var/lib/buildslave/3.x.murray-gentoo-wide/build/python', '-c', 'import time; time.sleep(0.3)']' timed out after 3 seconds
"""

Even stranger, a test_asyncio failure:
"""
======================================================================
FAIL: test_interrupted_retry (test.test_selectors.EpollSelectorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_selectors.py", line 299, in test_interrupted_retry
    self.assertLess(time() - t, 2.5)
AssertionError: 105.86884237639606 not less than 2.5
"""

asyncio uses selectors as its backend, so let's grant that the test can fail, but what's really surprising is that the timeout is *exactly* the same, down to the last digit.

Something fishy is going on: David, did something change recently on that buildbot?
msg201238 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-25 14:05
Flaky VM clock perhaps?
msg201240 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 14:44
> Flaky VM clock perhaps?

Yeah, I thought about that, but here we see *exactly* the same value for
two time intervals (measured with time.monotonic()), computed over
different test suite runs. I'd have expected something more erratic...
msg201241 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-25 14:49
The only change on the buildbot for some months now was increasing the size of the in-memory /tmp filesystem, which I did yesterday.  That also involved a reboot, though, and it is conceivable that there were software changes made since the last reboot and before this one.  I don't *think* there were, but I can't promise it.  No kernel changes, though, for sure.
msg201242 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-25 15:22
Could you try running the test in a loop under strace ("strace -ttTf")?
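
Something along these lines, for example (a sketch; the iteration count and output paths are illustrative, and -ttTf makes strace print timestamps and syscall durations while following forks):

import subprocess

for i in range(50):
    # Run only the flaky test, tracing each run to its own file.
    subprocess.call(
        ["strace", "-ttTf", "-o", "trace.%d" % i,
         "./python", "-m", "test", "test_selectors",
         "-m", "test_interrupted_retry"])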
msg201678 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-29 21:35
Here's another failure:
"""
======================================================================
FAIL: test_interpreter_shutdown (test.test_concurrent_futures.ProcessPoolShutdownTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/buildslave/3.x.murray-gentoo-wide/build/Lib/test/test_concurrent_futures.py", line 77, in tearDown
    self.assertLess(dt, 60, "synchronization issue: test lasted too long")
AssertionError: 108.88635897636414 not less than 60 : synchronization issue: test lasted too long
"""

Many timeout-related tests are failing randomly on this buildbot, so I
assume that's really a buildbot issue (is it a VM?).
msg201679 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 21:45
Yes, it is a Linux-VServer VM. I would think we would want tests not to fail in VMs, though, in this day and age.
msg201681 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 21:47
Also, unless people haven't been reporting other errors, these buildbots have been stable up until recently, with the exception of the multiprocessing hangs that plague all the buildbots as far as I know.  (My impression could be mistaken, though.)
msg201682 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-10-29 21:55
> Also, unless people haven't been reporting other errors, these buildbots have been stable up until recently

Well, the fact that several tests which were working just fine (like
test_concurrent_futures and test_subprocess) recently started failing
randomly is a strong hint that the culprit is the buildbot. (I mean, the
test_selectors failure just calls epoll_wait(): there is a known overflow
for epoll_wait()'s timeout, but for large timeout values, not 2s.)
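
For concreteness, that overflow comes from converting the timeout to milliseconds for a C int, so (assuming the usual 32-bit int) only very large timeouts are affected:

# epoll_wait() takes its timeout as a C int in milliseconds.
INT_MAX = 2**31 - 1           # typical 32-bit C int
print(INT_MAX // 1000)        # ~2147483 seconds, i.e. about 24.8 days
print(2 * 1000 <= INT_MAX)    # True: a 2 s timeout is nowhere near the limit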

This wouldn't be the first time I see clock issues on a virtual machine...
msg201685 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 22:13
What could cause a VM clock problem on a machine that hasn't been modified, when the VMs have been rebooted recently (i.e., it isn't stale state inside the VM itself)? Anything you can suggest I look at?
msg201687 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-10-29 22:20
I'm going to reboot the host just to see if that makes any difference.  It's been up for 105 days.
msg203652 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-11-21 16:21
Looking at the current buildbot history I don't see these errors any more.  I'm not sure how frequent they were, but I think I'll mark this as resolved and we can reopen it if we see it again.
History
Date                 User              Action  Args
2022-04-11 14:57:52  admin             set     github: 63585
2013-11-21 16:21:32  r.david.murray    set     status: open -> closed; resolution: not a bug; messages: + msg203652; stage: needs patch -> resolved
2013-10-29 22:20:47  r.david.murray    set     messages: + msg201687
2013-10-29 22:13:17  r.david.murray    set     messages: + msg201685
2013-10-29 21:55:12  neologix          set     messages: + msg201682
2013-10-29 21:47:42  r.david.murray    set     messages: + msg201681
2013-10-29 21:45:19  r.david.murray    set     messages: + msg201679
2013-10-29 21:35:56  neologix          set     messages: + msg201678
2013-10-25 15:22:05  neologix          set     messages: + msg201242
2013-10-25 14:49:15  r.david.murray    set     messages: + msg201241
2013-10-25 14:44:48  neologix          set     messages: + msg201240
2013-10-25 14:05:25  pitrou            set     nosy: + pitrou; messages: + msg201238
2013-10-25 13:39:51  neologix          set     nosy: + r.david.murray; messages: + msg201237
2013-10-25 08:06:31  christian.heimes  create