This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: AMD64 Windows10 3.x crash with Windows fatal exception: stack overflow
Type: Stage: resolved
Components: Windows Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, db3l, gvanrossum, neonene, pablogsal, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: release blocker Keywords: patch

Created on 2021-02-19 20:26 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 24739 merged db3l, 2021-03-04 01:58
PR 32023 merged neonene, 2022-03-21 07:09
Messages (16)
msg387348 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-02-19 20:26
AMD64 Windows10 3.x, build 856:
https://buildbot.python.org/all/#/builders/146/builds/856

17 tests failed:
    test_exceptions test_fileio test_io test_isinstance test_json
    test_lib2to3 test_logging test_pickle test_pickletools
    test_plistlib test_richcmp test_runpy test_sys test_threading
    test_traceback test_typing test_xml_etree

It may be a regression caused by:
"bpo-43166: Disable ceval.c optimisations for Windows debug builds (GH-24485)"
https://github.com/python/cpython/commit/b74396c3167cc780f01309148db02709bc37b432

The latest successful build was 11 days ago (Feb 9):
https://buildbot.python.org/all/#/builders/146/builds/819

----

0:01:46 load avg: 14.25 [ 61/426/1] test_xml_etree crashed (Exit code 3221225725) -- running: test_tokenize (46.4 sec), test_largefile (1 min 10 sec)
Windows fatal exception: stack overflow

Current thread 0x0000110c (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\xml\etree\ElementTree.py", line 178 in __repr__
  File "D:\buildarea\3.x.bolen-windows10\build\lib\xml\etree\ElementTree.py", line 178 in __repr__
  File "D:\buildarea\3.x.bolen-windows10\build\lib\xml\etree\ElementTree.py", line 178 in __repr__
  ...

======================================================================
FAIL: test_recursion_limit (test.test_threading.ThreadingExceptionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_threading.py", line 1203, in test_recursion_limit
    self.assertEqual(p.returncode, 0, "Unexpected error: " + stderr.decode())
AssertionError: 3221225725 != 0 : Unexpected error: 

0:05:09 load avg: 29.65 [140/426/3] test_traceback crashed (Exit code 3221225725) -- running: test_concurrent_futures (2 min 6 sec), test_mmap (2 min)
Windows fatal exception: stack overflow

Current thread 0x00000118 (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_traceback.py", line 1154 in f
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_traceback.py", line 1156 in f
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_traceback.py", line 1156 in f
  ...

0:06:14 load avg: 29.15 [151/426/4] test_lib2to3 crashed (Exit code 3221225725) -- running: test_multiprocessing_spawn (31.7 sec), test_capi (1 min 3 sec), test_mmap (3 min 4 sec)
Windows fatal exception: stack overflow

Current thread 0x00001b90 (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\lib2to3\pytree.py", line 496 in generate_matches
  File "D:\buildarea\3.x.bolen-windows10\build\lib\lib2to3\pytree.py", line 845 in generate_matches
  ...
msg387351 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-02-19 20:40
Looks like someone increased the size of locals in the ceval functions.

If any new buffers have been added recently, moving the processing into a helper function or using a single buffer instead of allocating them in separate if/case blocks can help reduce it.
msg387352 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-02-19 20:46
If someone wants to measure the stack memory usage per function call, _testcapi.stack_pointer() can be used: see bpo-30866.
msg387816 - (view) Author: David Bolen (db3l) * Date: 2021-02-28 20:04
I don't think it's actually any change in ceval per se, or any new buffers, just how the compiler code generation has changed due to this commit.

Based on some local testing, the triggering issue is the exclusion of the optimization pragma entirely in debug mode.  It appears that the existing code required the pragma to be enabled to stay within the available stack space.

A quick reproduction is running test_pickletools.  It fails abruptly and quickly, at least on the worker, without the pragma, but runs to completion if enabled.

Perhaps that aspect of the commit should be reverted, maybe with a separate flag for debugging ceval (but not normal testing)?  Alternatively, is there a way to increase the stack in debug mode to compensate?  Though I guess that would risk missing a release-build only stack issue.
msg387825 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-01 01:05
Thanks David!

So the crash only happens in debug mode?

PEP 651 promises to take reduce the C stack usage in the future. Therefore, I would be okay with rolling this back for the time being.
If we do roll it back, maybe add a comment to ceval.c explaining
that even in debug mode this file is optimized?

(The change to the pragma can stay, those two letters have been unused for a decade.)
msg387826 - (view) Author: David Bolen (db3l) * Date: 2021-03-01 01:39
I hadn't tested release mode earlier, since the commit only removed the pragma in debug builds, but I just built a release build with the commit in place on the worker and it seems fine.

So barring stack changes (enlarging or improving usage) it appears the optimize pragma is required on Windows, at least in order to pass the current test suite.
msg387840 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-01 04:41
Thanks! That's new info, at least, it wasn't clear to me from Victor's
original comment that this was for a debug build (even though as you say it
can be deduced from context).

Do you feel like submitting a PR along the lines I suggested?
msg387853 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-03-01 11:04
PEP 651 would prevent any crashes, although some of the tests might still fail with a StackOverflowException, if they weren't expecting a RecursionError.
msg387943 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-03-02 17:26
We already have a separate recursion limit for debug builds on Windows, so it's probably best all around to reduce that (if it's only recursion tests that are failing). That way, we won't forget to fix up debugging later if/when we fix recursion.
msg387966 - (view) Author: David Bolen (db3l) * Date: 2021-03-02 20:43
Steve, where is that configured?  If reducing that further would resolve the crashes while retaining ceval debugging, maybe that's a reasonable trade-off, though based on my testing, reverting still seems simpler.

Right now the debug build on the buildbot appears use the standard recursion limit of 1000, and some quick grep-fu didn't find a clear spot (either internally or just for the tests) where it would be reduced.  Could perhaps how it's done be something that doesn't influence buildbot builds?

The failing tests do all appear to be recursion tests of one form or another.  Based on further testing, we'd need to have a recursion limit of 290 or below to get most of the test suite to pass.  Presumably something like 250 for a bit of head room.

However, lowering the limit appears to cover only most of the failures, not all.  Some tests directly play with the limits (like test_exceptions.test_recursion_in_except_handler and test_sys.test_recursionlimit_recovery) and still get into trouble, aborting the process, no matter how low the default recursion limit is.  so those tests also need changes to pass.  In addition, lowering to 290 created one new failure - test_compile.test_extended_arg now fails with a recursion error.  It seems to need a limit of about 900 to pass; I guess the test could temporarily reset or something but that seems especially kludgy.

Out of curiousity, are these these failures not occurring elsewhere?  It seems like something that would be happening in general (at least for anyone compiling debug builds).  I thought of checking CI builds, but they appear to use release mode, so would be unaffected.
msg387978 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-03-02 22:30
Hmm, yeah, it seems like the debug-specific definition has gone. Wonder when that happened.

In that case, go ahead and revert, but I'd suggest doing it by just removing the new Py_DEBUG condition rather than undoing all the changes. Should be nothing wrong with the rest of it.

IIRC, PR builds run under release because the compile-time increase is more than made up by the test time decrease.
msg387983 - (view) Author: David Bolen (db3l) * Date: 2021-03-02 23:11
Yes, that was the idea (revert the PyDEBUG condition only); it sounds like everyone is on the same page.

I've been doing buildbot duties only for a while (no code contributions since long before the current process), so there may be a short delay while I figure out this newfangled PR thing...
msg388072 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-04 03:09
New changeset 131d5516409791b170b09a6ef8ed8463c9b09015 by db3l in branch 'master':
bpo-43271: Re-enable ceval.c optimizations for Windows debug builds (GH-24739)
https://github.com/python/cpython/commit/131d5516409791b170b09a6ef8ed8463c9b09015
msg388073 - (view) Author: David Bolen (db3l) * Date: 2021-03-04 03:30
The win10 buildbot is green again, so I think this can be closed.
msg388074 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-04 04:21
Thanks for all your research and the PR!
msg388289 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-08 19:28
Thanks for the fix David Bolen ;-)
History
Date User Action Args
2022-04-11 14:59:41adminsetgithub: 87437
2022-03-21 07:09:09neonenesetnosy: + neonene, pablogsal

pull_requests: + pull_request30112
2021-03-08 19:28:19vstinnersetmessages: + msg388289
2021-03-04 04:21:30gvanrossumsetstatus: open -> closed
resolution: fixed
messages: + msg388074

stage: patch review -> resolved
2021-03-04 03:30:07db3lsetmessages: + msg388073
2021-03-04 03:09:56gvanrossumsetmessages: + msg388072
2021-03-04 01:58:32db3lsetkeywords: + patch
stage: patch review
pull_requests: + pull_request23510
2021-03-02 23:11:16db3lsetmessages: + msg387983
2021-03-02 22:30:39steve.dowersetmessages: + msg387978
2021-03-02 20:43:45db3lsetmessages: + msg387966
2021-03-02 17:26:29steve.dowersetmessages: + msg387943
2021-03-01 11:04:29Mark.Shannonsetmessages: + msg387853
2021-03-01 04:41:20gvanrossumsetmessages: + msg387840
2021-03-01 01:39:40db3lsetmessages: + msg387826
2021-03-01 01:05:03gvanrossumsetmessages: + msg387825
2021-02-28 20:04:32db3lsetnosy: + db3l
messages: + msg387816
2021-02-19 23:38:46pablogsalsetpriority: normal -> release blocker
2021-02-19 22:51:04gvanrossumsetnosy: + gvanrossum, Mark.Shannon
2021-02-19 20:46:42vstinnersetmessages: + msg387352
2021-02-19 20:40:21steve.dowersetmessages: + msg387351
2021-02-19 20:26:54vstinnercreate