classification
Title: Windows 10 buildbot: 15 min timeout on test_mmap.test_large_filesize()
Type:
Stage:
Components: Tests
Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: db3l, paul.moore, pdgoins, steve.dower, tim.golden, vstinner, zach.ware
Priority: deferred blocker
Keywords:

Created on 2018-04-25 14:24 by vstinner, last changed 2018-05-17 22:52 by pdgoins.

Messages (11)
msg315736 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-04-25 14:24
Windows10 3.x:

http://buildbot.python.org/all/#/builders/3/builds/817

Timeout (0:15:00)!
Thread 0x00001e5c (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_mmap.py", line 754 in _make_test_file
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_mmap.py", line 769 in test_large_filesize
  ...
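
For context on why this particular test can blow a 15-minute no-output window: the sketch below is a hypothetical stand-in (not the actual test_mmap code) for what a large-file mmap size test typically does - create a multi-gigabyte file by seeking far past EOF and writing a single byte, then map it and check the reported size. Depending on the filesystem, extending the file this way can force the gap to be zero-filled on disk, so the file-creation step is dominated by disk throughput, exactly the kind of work that gets dramatically slower after a host migration.

import mmap
import os
import tempfile
import unittest

class LargeMmapSketch(unittest.TestCase):
    """Hypothetical illustration only -- not the CPython test_mmap code."""

    # Sizes in the multi-GiB range are what make these tests slow on weak I/O.
    SIZE = 4 * 1024 * 1024 * 1024  # 4 GiB

    def test_large_filesize_sketch(self):
        with tempfile.NamedTemporaryFile(delete=False) as f:
            name = f.name
            # Seek past EOF and write one byte so the file's reported size is SIZE.
            # Depending on the filesystem, the gap may be zero-filled on disk,
            # which is where most of the wall-clock time goes.
            f.seek(self.SIZE - 1)
            f.write(b"\x00")
        try:
            with open(name, "rb") as f, \
                 mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                self.assertEqual(m.size(), self.SIZE)
        finally:
            os.unlink(name)

if __name__ == "__main__":
    unittest.main()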
msg315738 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-04-25 14:27
The test also fails on Windows10 3.6 and Windows10 3.7.
msg315739 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-04-25 14:30
Similar bug on AMD64 Windows8 3.x buildbot:

Timeout (0:15:00)!
Thread 0x00001014 (most recent call first):
  File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_mmap.py", line 754 in _make_test_file
  File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_mmap.py", line 764 in test_large_offset

Same failure on AMD64 Windows8 3.6 and AMD64 Windows8 3.7.
msg316239 - Author: Zachary Ware (zach.ware) * (Python committer) Date: 2018-05-06 18:33
David, has anything gone wonky on those builders lately?  Should we consider passing `-u-largefile` on them?
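
For reference, excluding the largefile resource works because these tests guard themselves with it: when the resource is not enabled for a regrtest run, the guard raises ResourceDenied (a SkipTest subclass) and the test is reported as skipped instead of running. The requires() call below is real test.support API; the surrounding test class, method, and message are a hypothetical sketch of the pattern, not the actual test_mmap code.

import unittest
from test import support

class LargeFileDependentTests(unittest.TestCase):
    """Hypothetical sketch of the resource-guard pattern used by large-file tests."""

    def setUp(self):
        # Under regrtest, this raises test.support.ResourceDenied (a subclass of
        # unittest.SkipTest) unless the 'largefile' resource is enabled via -u.
        # When the test file is run directly, requires() is a no-op.
        support.requires(
            'largefile',
            'test requires several GiB of free disk space and can be very slow')

    def test_something_large(self):
        self.assertTrue(True)  # placeholder body for the sketch

So passing `-u-largefile` on top of the buildbots' existing resource flags, as suggested above, should turn each such guard into a skip without touching the tests themselves.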
msg316245 - Author: David Bolen (db3l) Date: 2018-05-06 21:25
Nothing for my part, but there was an Azure "maintenance" on 4/16 where both the Win8 and Win10 buildbots were migrated to "newer" hardware (the first change to their physical machines in a really long time).  No word on why.  The VM machine class was unchanged, and the tests run on a local disk.  But perhaps there has been a performance or other impact (Spectre changes?).

That seems to line up in terms of timing: for example, Win10 3.x build 795 and Win8 3.x build 831 both ran on the morning of 4/16 and were the first failures.

Oddly, I'm having trouble finding a reference to test_mmap in older Win10 3.x build logs, but on Win8 3.x builds it was definitely one of the slower tests before that point, although only in the 4-5 minute range.

So I'm not sure.  Assuming it was the "upgrade", and given how persistent the failure has been since, I suppose it's safer to assume this is the new normal for that platform - if skipping the particular test(s) would work, I say go for it, at least for now.

-- David
msg316251 - Author: Paul Goins (pdgoins) * Date: 2018-05-07 04:50
For what it's worth as a newbie here, I'll add my 2 cents.  (This is partially copied from my mail on python-dev.)

The one thing I did notice between the last passing builds and the current builds is that, under the "10 slowest tests" header, test_io has shot up in time to complete:

3.6: from 2 min 13 sec (passing) to 9 min 25 sec (tests timing out)
3.7: from 2 min 10 sec (passing) to 9 min 26 sec (tests timing out)
3.x: from 3 min 39 sec (passing) to 9 min 17 sec (tests timing out)

i.e. roughly 2.5x to 4x longer to run this test suite.

I did a local sanity test of that particular test suite, comparing the last passing changeset from 3.x with the HEAD of master, and I saw no difference in performance.  Thus, my bet is on a build-agent-related issue such as running low on disk space.  The Azure changes mentioned also seem like a possible culprit.
msg316252 - Author: David Bolen (db3l) Date: 2018-05-07 05:39
Disk space seems unlikely as a reason - the local disk is 60GB on each buildbot, of which 50GB+ is generally free.  So that shouldn't be a problem (the WinXP/Win7 buildbots operate with far less, like 5-10GB).

I had also considered that perhaps something went wrong with the temp folder setting during the reboot so it ended up on the system disk (network) rather than the local disk, but it looks fine, and I see build and temporary files all created on the local disk.

I believe some of the worst overheads for Spectre patches were found in I/O intensive workloads.  Interestingly, neither buildbot processor reports PCID support, which might make it worse (though I don't know for sure if Azure exposes those flags to guests).

Of course, it could just be some other change in the new physical machines in some way.  There have been no other guest changes for a while (last was a build tool change in January).

Whatever it is, it certainly seems correlated to the maintenance, and appears to be consistent after that point.

-- David
msg316764 - Author: Paul Goins (pdgoins) * Date: 2018-05-16 05:37
If we consider this to be the new normal...  What about increasing the timeout to accommodate it?  From what I could gather, builds were running around 13 minutes or so before the changes, so the timeout wasn't much above actual exec time to begin with.  Maybe bump the timeout up to 20, maybe even 30 minutes?  (Granted, I don't know the intent of the timeout: whether it was meant to catch sudden performance regressions like this caused by adverse code changes, or whether 15 minutes simply "sounded good" and was set accordingly.)

I know that for my own projects I would be reluctant to suggest the above; it seems better to dig in and understand why there was such a jump in exec time on test_io.  But without access to the build agents or an environment handy that reproduces the issue, this is the best suggestion I can come up with to move this towards resolution.
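
On the mechanics of that timeout: the "Timeout (0:15:00)!" dump in msg315736 has the shape of faulthandler's watchdog output, which regrtest can arm via its --timeout option, so raising the limit would presumably come down to passing a larger value wherever the buildbots set it (exactly where that 900-second figure lives is an assumption here, not something confirmed in this thread). A minimal standalone sketch of the mechanism:

import faulthandler
import time

# Arm a watchdog: if the process is still running after TIMEOUT seconds,
# faulthandler dumps every thread's traceback (a "Timeout (h:mm:ss)!" banner
# followed by "Thread 0x... (most recent call first):" frames) and, with
# exit=True, terminates the process.
TIMEOUT = 5  # seconds; a buildbot-style run would use something like 900 or more

faulthandler.dump_traceback_later(TIMEOUT, exit=True)

try:
    time.sleep(TIMEOUT + 5)   # stand-in for a test that runs too long
finally:
    faulthandler.cancel_dump_traceback_later()  # disarm if we finish in time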
msg316970 - Author: Paul Goins (pdgoins) * Date: 2018-05-17 17:51
Also, to be clear, I understand what David is saying and don't disagree at all.  And if this really is Spectre related, it may be difficult to "go back" and test a before case, so my last comment about "digging in" may be rather moot.

Does tweaking the timeout sound right (or good enough)?  Or are there other things we ought to try first?
msg316986 - Author: David Bolen (db3l) Date: 2018-05-17 20:50
A longer timeout might be another workaround, but for myself, I tend to favor Zachary's original suggestion of eliminating largefile tests for the moment as simplest and most robust.  It would also reduce the overall percentage of test time currently spent on a small number of tests.

Time (and budget) permitting, I do hope to try some tests on newer Azure machine types to see if they behave differently, which might be a longer term option.
msg316990 - Author: Paul Goins (pdgoins) * Date: 2018-05-17 22:52
I don't know enough about those tests and how important they are considered.  My only concern would be with the increased risk of undetected breakage caused by removing them from CI, but if people think the risk is negligible and/or acceptable, it's fine by me.

Anything to change in git, or is this purely build agent stuff at this point?
History
Date                 User       Action  Args
2018-05-17 22:52:17  pdgoins    set     messages: + msg316990
2018-05-17 20:50:40  db3l       set     messages: + msg316986
2018-05-17 17:51:11  pdgoins    set     messages: + msg316970
2018-05-16 05:37:24  pdgoins    set     messages: + msg316764
2018-05-07 05:39:58  db3l       set     messages: + msg316252
2018-05-07 04:50:29  pdgoins    set     nosy: + pdgoins; messages: + msg316251
2018-05-06 21:25:49  db3l       set     messages: + msg316245
2018-05-06 18:33:45  zach.ware  set     nosy: + db3l; messages: + msg316239
2018-05-02 06:32:57  ned.deily  set     priority: normal -> deferred blocker; nosy: + paul.moore, tim.golden, zach.ware, steve.dower
2018-04-25 14:30:42  vstinner   set     messages: + msg315739; versions: + Python 3.6, Python 3.7
2018-04-25 14:27:59  vstinner   set     messages: + msg315738
2018-04-25 14:24:55  vstinner   create