Classification
Title:        Windows 10 buildbot: 15 min timeout on test_mmap.test_large_filesize()
Type:                               Stage: resolved
Components:   Tests                 Versions: Python 3.8, Python 3.7, Python 3.6

Process
Status:       closed                Resolution: fixed
Dependencies:                       Superseder:
Assigned To:                        Nosy List: db3l, ned.deily, paul.moore, pdgoins, steve.dower, tim.golden, vstinner, zach.ware
Priority:     high                  Keywords:

Created on 2018-04-25 14:24 by vstinner, last changed 2018-05-29 21:34 by vstinner. This issue is now closed.

Messages (19)
msg315736 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-04-25 14:24
Windows10 3.x:

http://buildbot.python.org/all/#/builders/3/builds/817

Timeout (0:15:00)!
Thread 0x00001e5c (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_mmap.py", line 754 in _make_test_file
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_mmap.py", line 769 in test_large_filesize
  ...
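(The "Timeout (0:15:00)!" banner and per-thread stacks in logs like the one above come from faulthandler's watchdog timer, which regrtest arms around each test. A minimal sketch of the mechanism — the timeout value here is illustrative, not the buildbots' 15 minutes, and exit=False is used so this demo keeps running after the dump:)

```python
# Minimal sketch of the watchdog behind the "Timeout (...)!" dumps above
# (timeout value is illustrative, not the buildbots' 15 min).
import faulthandler
import time

# Arm a timer: if the process is still running when it fires, every
# thread's traceback is dumped to stderr under a "Timeout (h:mm:ss)!" header.
faulthandler.dump_traceback_later(timeout=1.0, exit=False)

time.sleep(1.5)  # stand-in for a test that outlives the watchdog

faulthandler.cancel_dump_traceback_later()  # disarm once the "test" ends
```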
msg315738 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-04-25 14:27
The test also fails on Windows10 3.6 and Windows10 3.7.
msg315739 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-04-25 14:30
Similar bug on AMD64 Windows8 3.x buildbot:

Timeout (0:15:00)!
Thread 0x00001014 (most recent call first):
  File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_mmap.py", line 754 in _make_test_file
  File "D:\buildarea\3.x.bolen-windows8\build\lib\test\test_mmap.py", line 764 in test_large_offset

Same failure on AMD64 Windows8 3.6 and AMD64 Windows8 3.7.
msg316239 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2018-05-06 18:33
David, has anything gone wonky on those builders lately?  Should we consider passing `-u-largefile` on them?
msg316245 - (view) Author: David Bolen (db3l) Date: 2018-05-06 21:25
Nothing for my part, but there was an Azure "maintenance" on 4/16 where both the Win8 and Win10 buildbots were migrated to "newer" hardware (the first change to their physical machine for a really long time).  No word on why.  The VM machine class was unchanged, and the tests run on a local disk.  But perhaps there has been a performance or other impact (Spectre changes?).

That seems to line up in terms of timing, for example with Win10 3.x build 795 and Win8 3.x build 831 both running the AM of 4/16 with the first failures.

Oddly, I'm having trouble finding a reference to test_mmap in older Win10 3.x build logs, but on Win8 3.x builds it was definitely one of the slower tests before that point, although only in the 4-5 minute range.

So not sure.  Assuming it was the "upgrade" and given the persistence of the failure since, I suppose it's safer to assume this is a new normal for that platform - if skipping the particular test(s) would work, I say go for it at least for now.

-- David
msg316251 - (view) Author: Paul Goins (pdgoins) * Date: 2018-05-07 04:50
For what it's worth as a newbie here, I'll add my 2 cents.  (This is partially copied from my mail on python-dev.)

The one thing which I did notice between the last builds which passed and the current builds is, under the "10 slowest tests" header, test_io has shot up in terms of time to complete:

3.6: from 2 min 13 sec (passing) to 9 min 25 sec (tests timing out)
3.7: from 2 min 10 sec (passing) to 9 min 26 sec (tests timing out)
3.x: from 3 min 39 sec (passing) to 9 min 17 sec (tests timing out)

i.e. roughly three to four times longer to run this test suite.

I did a local sanity test of that particular test suite, comparing the last passing changeset from 3.x with the HEAD of master, and I saw no difference in performance.  Thus, my bet is on a build agent related issue such as running low on disk space.  The Azure changes mentioned also seem a possible culprit.
msg316252 - (view) Author: David Bolen (db3l) Date: 2018-05-07 05:39
Disk space seems unlikely as a reason - the local disk is 60GB on each buildbot, of which at least 50GB+ is generally free.  So that shouldn't be a problem (the WinXP/Win7 buildbots operate with far less, like 5-10GB).

I had also considered that perhaps something went wrong with the temp folder setting during the reboot so it ended up on the system disk (network) rather than the local disk, but it looks fine, and I see build and temporary files all created on the local disk.

I believe some of the worst overheads for Spectre patches were found in I/O intensive workloads.  Interestingly, neither buildbot processor reports PCID support, which might make it worse (though I don't know for sure if Azure exposes those flags to guests).

Of course, it could just be some other change in the new physical machines in some way.  There have been no other guest changes for a while (last was a build tool change in January).

Whatever it is, it certainly seems correlated to the maintenance, and appears to be consistent after that point.

-- David
msg316764 - (view) Author: Paul Goins (pdgoins) * Date: 2018-05-16 05:37
If we consider this to be the new normal...  What about increasing the timeout to accommodate?  From what I could gather, builds were running around 13 minutes or so before the changes, so the timeout wasn't much above actual exec time to begin with.  Maybe bump the timeout up to 20, maybe even 30 minutes?  (Granted, I don't know the intent of the timeout: whether it was intended to catch sudden performance regressions like this from adverse code changes, or whether 15 minutes simply "sounded good" and was set accordingly.)

I know that for my own projects I would be reluctant to suggest the above; it seems better to dig in and understand why there was such a jump in exec time on test_io.  But without having access to the build agents nor having an environment handy which reproduces the issue, this is the best suggestion I can come up with to move this towards resolution.
msg316970 - (view) Author: Paul Goins (pdgoins) * Date: 2018-05-17 17:51
Also, to be clear, I understand what David is saying and don't disagree at all.  And if this really is Spectre related, it may be difficult to "go back" and test a before case, so my last comment about "digging in" may be rather moot.

Does tweaking the timeout sound right (or good enough)?  Or are there other things we ought to try first?
msg316986 - (view) Author: David Bolen (db3l) Date: 2018-05-17 20:50
A longer timeout might be another workaround, but for myself, I tend to favor Zachary's original suggestion of eliminating largefile tests for the moment as simplest and most robust.  It would also reduce the overall percentage of test time currently spent on a small number of tests.

Time (and budget) permitting, I do hope to try some tests on newer Azure machine types to see if they behave differently, which might be a longer term option.
msg316990 - (view) Author: Paul Goins (pdgoins) * Date: 2018-05-17 22:52
I don't know enough about those tests and how important they are considered.  My only concern would be with the increased risk of undetected breakage caused by removing them from CI, but if people think the risk is negligible and/or acceptable, it's fine by me.

Anything to change in git, or is this purely build agent stuff at this point?
msg317349 - (view) Author: David Bolen (db3l) Date: 2018-05-22 23:25
I have migrated the win8 and win10 builders to a different machine type on Azure, which seems to have restored more reasonable performance for the tests, at least in the first set of builds.  Assuming that continues to hold true, it should resolve this issue.
msg317461 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-23 21:47
test_mmap.test_large_offset() failed on x86 Windows7 3.7:

http://buildbot.python.org/all/#/builders/111/builds/265

David Bolen: do you prefer to upgrade the VM to a faster account on Azure, or change the timeout?


0:41:01 [144/415/1] test_mmap crashed (Exit code 1) -- running: test_concurrent_futures (287 sec)
Timeout (0:15:00)!
Thread 0x00000b58 (most recent call first):
  File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_mmap.py", line 754 in _make_test_file
  File "D:\cygwin\home\db3l\buildarea\3.7.bolen-windows7\build\lib\test\test_mmap.py", line 764 in test_large_offset
  (...)
msg317466 - (view) Author: David Bolen (db3l) Date: 2018-05-23 22:14
The win7 builder isn't on Azure, so changing host type isn't an option at the moment.  For my part, I'd prefer removing the largefile tests for that one rather than increasing timeout.  The tests generate a huge amount of I/O, so my guess is it's just not practical in the current builder environment.

I should note that unlike the win8/win10 builders, there have been no VM or host changes for the win7 builder in quite a while, so if these failures are recent, they can't necessarily be explained by test environment changes.  With that said, the win7 builder is certainly resource constrained in terms of CPU and I/O so perhaps the tests were always marginal.  Or perhaps the use of -j2 is exacerbating the situation depending on which pair of tests overlap with test_mmap.
msg317467 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-05-23 22:16
Thanks, David.  Yes, I think disabling "largefile" tests on that buildbot makes the most sense.  We're working on it.
msg317522 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-05-24 03:50
Update: we flubbed the first attempt at altering the Windows7 buildbot config and, at the moment, I'm waiting for the fix for that to get manually copied to the buildbot.
msg317587 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2018-05-24 16:26
The Windows7 buildbot now runs the test suite with the "largefile" resource disabled and no longer times out on test_mmap.  So that immediate problem is now resolved.  Before closing the issue, it would be good to discuss whether we should do the same for some of the other Windows buildbots (perhaps allow "largefile" only on the buildbot with the most capacity?) and/or disable "largefile" on all Windows CI tests (if not done already).
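(For background, the "largefile" gate works through test.support's resource check: the large-file tests call support.requires('largefile'), which consults the resource set regrtest builds from -u. A simplified sketch — the list contents below are illustrative, not the buildbots' actual configuration:)

```python
# Sketch of test.support's resource gating (simplified; illustrative values).
from test import support

# Outside regrtest, use_resources is None, so every resource reads as enabled.
assert support.is_resource_enabled('largefile')

# Running "python -m test -uall,-largefile" leaves regrtest's resource set
# without 'largefile'; simulate that here with an illustrative list:
support.use_resources = ['network', 'decimal']
assert not support.is_resource_enabled('largefile')
assert support.is_resource_enabled('network')
print("largefile gated off")
```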
msg317606 - (view) Author: David Bolen (db3l) Date: 2018-05-24 19:16
For my buildbots, I suspect win8/win10 should be ok at this point.  Among any other changes their local disk appears to be an SSD now, which makes a big difference.  I observed steady 250-300MB/s write I/O for the duration of the mmap test, for example, so no wonder that test highlights any I/O bottlenecks.  

Those two buildbots are independent so it's more a question of running the tests at all versus just on one.

On the other hand, win7/winxp compete on the same host, with traditional disks.  So including winxp as a slow windows build together with win7 couldn't hurt, although as 2.7 only, winxp runs far fewer tests nowadays.
msg318098 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-29 21:34
Some buildbots migrated to faster Azure models. Tests are now run without the largefile test resource on the other buildbots. It seems like the issue is gone. Thanks to everyone who helped fix this bug!
History
Date                 User        Action  Args
2018-05-29 21:34:04  vstinner    set     status: open -> closed
                                         resolution: fixed
                                         messages: + msg318098
                                         stage: resolved
2018-05-24 19:16:16  db3l        set     messages: + msg317606
2018-05-24 16:26:34  ned.deily   set     priority: deferred blocker -> high
                                         messages: + msg317587
2018-05-24 03:50:15  ned.deily   set     messages: + msg317522
2018-05-23 22:16:30  ned.deily   set     nosy: + ned.deily
                                         messages: + msg317467
2018-05-23 22:14:22  db3l        set     messages: + msg317466
2018-05-23 21:47:48  vstinner    set     messages: + msg317461
2018-05-22 23:25:11  db3l        set     messages: + msg317349
2018-05-17 22:52:17  pdgoins     set     messages: + msg316990
2018-05-17 20:50:40  db3l        set     messages: + msg316986
2018-05-17 17:51:11  pdgoins     set     messages: + msg316970
2018-05-16 05:37:24  pdgoins     set     messages: + msg316764
2018-05-07 05:39:58  db3l        set     messages: + msg316252
2018-05-07 04:50:29  pdgoins     set     nosy: + pdgoins
                                         messages: + msg316251
2018-05-06 21:25:49  db3l        set     messages: + msg316245
2018-05-06 18:33:45  zach.ware   set     nosy: + db3l
                                         messages: + msg316239
2018-05-02 06:32:57  ned.deily   set     priority: normal -> deferred blocker
                                         nosy: + paul.moore, tim.golden, zach.ware, steve.dower
2018-04-25 14:30:42  vstinner    set     messages: + msg315739
                                         versions: + Python 3.6, Python 3.7
2018-04-25 14:27:59  vstinner    set     messages: + msg315738
2018-04-25 14:24:55  vstinner    create