classification
Title: AMD64 Windows10 3.x crash with Windows fatal exception: stack overflow
Type: Stage:
Components: Windows Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, db3l, gvanrossum, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: release blocker Keywords:

Created on 2021-02-19 20:26 by vstinner, last changed 2021-03-01 11:04 by Mark.Shannon.

Messages (8)
msg387348 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-02-19 20:26
AMD64 Windows10 3.x, build 856:
https://buildbot.python.org/all/#/builders/146/builds/856

17 tests failed:
    test_exceptions test_fileio test_io test_isinstance test_json
    test_lib2to3 test_logging test_pickle test_pickletools
    test_plistlib test_richcmp test_runpy test_sys test_threading
    test_traceback test_typing test_xml_etree

It may be a regression caused by:
"bpo-43166: Disable ceval.c optimisations for Windows debug builds (GH-24485)"
https://github.com/python/cpython/commit/b74396c3167cc780f01309148db02709bc37b432

The latest successful build was 11 days ago (Feb 9):
https://buildbot.python.org/all/#/builders/146/builds/819

----

0:01:46 load avg: 14.25 [ 61/426/1] test_xml_etree crashed (Exit code 3221225725) -- running: test_tokenize (46.4 sec), test_largefile (1 min 10 sec)
Windows fatal exception: stack overflow

Current thread 0x0000110c (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\xml\etree\ElementTree.py", line 178 in __repr__
  File "D:\buildarea\3.x.bolen-windows10\build\lib\xml\etree\ElementTree.py", line 178 in __repr__
  File "D:\buildarea\3.x.bolen-windows10\build\lib\xml\etree\ElementTree.py", line 178 in __repr__
  ...

======================================================================
FAIL: test_recursion_limit (test.test_threading.ThreadingExceptionTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_threading.py", line 1203, in test_recursion_limit
    self.assertEqual(p.returncode, 0, "Unexpected error: " + stderr.decode())
AssertionError: 3221225725 != 0 : Unexpected error: 

0:05:09 load avg: 29.65 [140/426/3] test_traceback crashed (Exit code 3221225725) -- running: test_concurrent_futures (2 min 6 sec), test_mmap (2 min)
Windows fatal exception: stack overflow

Current thread 0x00000118 (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_traceback.py", line 1154 in f
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_traceback.py", line 1156 in f
  File "D:\buildarea\3.x.bolen-windows10\build\lib\test\test_traceback.py", line 1156 in f
  ...

0:06:14 load avg: 29.15 [151/426/4] test_lib2to3 crashed (Exit code 3221225725) -- running: test_multiprocessing_spawn (31.7 sec), test_capi (1 min 3 sec), test_mmap (3 min 4 sec)
Windows fatal exception: stack overflow

Current thread 0x00001b90 (most recent call first):
  File "D:\buildarea\3.x.bolen-windows10\build\lib\lib2to3\pytree.py", line 496 in generate_matches
  File "D:\buildarea\3.x.bolen-windows10\build\lib\lib2to3\pytree.py", line 845 in generate_matches
  ...
msg387351 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2021-02-19 20:40
Looks like someone increased the size of locals in the ceval functions.

If any new buffers have been added recently, moving the processing into a helper function or using a single buffer instead of allocating them in separate if/case blocks can help reduce it.
msg387352 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-02-19 20:46
If someone wants to measure the stack memory usage per function call, _testcapi.stack_pointer() can be used: see bpo-30866.
msg387816 - (view) Author: David Bolen (db3l) Date: 2021-02-28 20:04
I don't think it's actually any change in ceval per se, or any new buffers, just how the compiler code generation has changed due to this commit.

Based on some local testing, the triggering issue is the exclusion of the optimization pragma entirely in debug mode.  It appears that the existing code required the pragma to be enabled to stay within the available stack space.

A quick reproduction is running test_pickletools.  It fails abruptly and quickly, at least on the worker, without the pragma, but runs to completion if enabled.

Perhaps that aspect of the commit should be reverted, maybe with a separate flag for debugging ceval (but not normal testing)?  Alternatively, is there a way to increase the stack in debug mode to compensate?  Though I guess that would risk missing a release-build only stack issue.
msg387825 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-01 01:05
Thanks David!

So the crash only happens in debug mode?

PEP 651 promises to take reduce the C stack usage in the future. Therefore, I would be okay with rolling this back for the time being.
If we do roll it back, maybe add a comment to ceval.c explaining
that even in debug mode this file is optimized?

(The change to the pragma can stay, those two letters have been unused for a decade.)
msg387826 - (view) Author: David Bolen (db3l) Date: 2021-03-01 01:39
I hadn't tested release mode earlier, since the commit only removed the pragma in debug builds, but I just built a release build with the commit in place on the worker and it seems fine.

So barring stack changes (enlarging or improving usage) it appears the optimize pragma is required on Windows, at least in order to pass the current test suite.
msg387840 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-01 04:41
Thanks! That's new info, at least, it wasn't clear to me from Victor's
original comment that this was for a debug build (even though as you say it
can be deduced from context).

Do you feel like submitting a PR along the lines I suggested?
msg387853 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-03-01 11:04
PEP 651 would prevent any crashes, although some of the tests might still fail with a StackOverflowException, if they weren't expecting a RecursionError.
History
Date User Action Args
2021-03-01 11:04:29Mark.Shannonsetmessages: + msg387853
2021-03-01 04:41:20gvanrossumsetmessages: + msg387840
2021-03-01 01:39:40db3lsetmessages: + msg387826
2021-03-01 01:05:03gvanrossumsetmessages: + msg387825
2021-02-28 20:04:32db3lsetnosy: + db3l
messages: + msg387816
2021-02-19 23:38:46pablogsalsetpriority: normal -> release blocker
2021-02-19 22:51:04gvanrossumsetnosy: + gvanrossum, Mark.Shannon
2021-02-19 20:46:42vstinnersetmessages: + msg387352
2021-02-19 20:40:21steve.dowersetmessages: + msg387351
2021-02-19 20:26:54vstinnercreate