classification
Title: Block stack size for frame objects should be dynamically sizable
Type: enhancement Stage: patch review
Components: Interpreter Core Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, gregory.p.smith, gvanrossum, python-dev, serhiy.storchaka, tomkpz
Priority: normal Keywords: patch

Created on 2021-01-13 01:54 by tomkpz, last changed 2021-03-20 23:23 by gregory.p.smith.

Pull Requests
URL Status Linked
PR 24204 closed python-dev, 2021-01-13 02:07
Messages (9)
msg384991 - (view) Author: Thomas Anderson (tomkpz) * Date: 2021-01-13 01:54
Currently the block stack size is hardcoded to 20.  Similar to how the value stack is dynamically sizable, we should make the block stack dynamically sizable.  This will reduce space on average (since the typical number of blocks for a function is well below 20) and allow code generators to emit more deeply nested code.  Note: the motivation is not necessarily to reduce memory usage, but to make L1 cache misses less likely for stack objects.
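On CPython 3.10 the hardcoded limit surfaces as a compile-time SyntaxError once nesting exceeds 20 blocks. A minimal sketch to observe it (the `nested_try_source` helper is invented for illustration; the exact limit and message are version-dependent):

```python
# Sketch: demonstrate the hardcoded block-stack limit (CO_MAXBLOCKS == 20
# in CPython 3.10) by compiling progressively deeper try/finally nesting.

def nested_try_source(depth):
    """Build source text containing `depth` nested try/finally statements."""
    lines = []
    for i in range(depth):
        lines.append("    " * i + "try:")
    lines.append("    " * depth + "pass")
    for i in range(depth - 1, -1, -1):
        lines.append("    " * i + "finally:")
        lines.append("    " * (i + 1) + "pass")
    return "\n".join(lines) + "\n"

compile(nested_try_source(5), "<demo>", "exec")  # well under the limit: fine
try:
    compile(nested_try_source(30), "<demo>", "exec")  # exceeds the limit
except SyntaxError as exc:
    print(exc.msg)  # on 3.10: "too many statically nested blocks"
```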
msg385111 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-01-15 13:37
Reducing the size of the frame object seems like a worthwhile goal, but what's the point in increasing the maximum block stack?
msg385120 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-01-15 17:06
Getting rid of hardcoded limits is good. And at first glance the proposed PR looks good (some minor details can be discussed).

Although there is a different approach to solving the same problem. The stack of blocks is used to set handlers for exceptions. For example, when entering a try block, the interpreter pushes a handler that points to the except and/or finally clauses; when entering a with block, it pushes a handler that calls __exit__; etc. The stack of handlers changes at run time, and this was once the only solution to the problem. But after the reorganization of the bytecode in the latest Python versions (mainly by Mark), it is now possible to determine the handlers for every instruction statically, at compile time. Instead of a stack of blocks we would have a table of handler addresses. It is a more efficient approach, and it is the one used by modern C++ compilers. The only technical issue is a compact and platform-independent representation of the table (the uncompressed table would be larger than the code itself, but most entries are repeated and are usually small integers).
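A toy sketch of the table-based scheme described above (all names and the table layout are invented for illustration; a real implementation would use a compact, compressed encoding):

```python
# Hypothetical handler table: for each half-open range of instruction
# offsets, record the offset of the exception handler that covers it.
# This replaces pushing/popping handler blocks at run time.
handler_table = [
    # (start, end, handler_offset)
    (0, 10, 40),   # offsets 0-9 are covered by the handler at offset 40
    (10, 20, 60),  # offsets 10-19 by the handler at offset 60
]

def find_handler(offset):
    # When an exception is raised, look up the current instruction offset
    # in the static table instead of popping a runtime block stack.
    for start, end, handler in handler_table:
        if start <= offset < end:
            return handler
    return None  # no entry: the exception propagates to the caller
```

Because the table is computed at compile time, entering a try block would cost nothing at run time; the price is paid only when an exception is actually raised.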

It would make PR 24204 unneeded, so I suggest waiting some time before reviewing it.
msg385125 - (view) Author: Thomas Anderson (tomkpz) * Date: 2021-01-15 19:08
> Reducing the size of the frame object seems like a worthwhile goal, but what's the point in increasing the maximum block stack?

No point for humans, but it may be useful for code generators.
msg385409 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2021-01-21 11:31
I see no advantage of getting rid of the limit of 20.

No one ever gets near 20 deep in practice.
Given that the limit has been there for so long, it is likely that some tooling expects the depth of try statements to be limited.

Why would a code generator need to nest try statements so deeply? I'm curious.
msg388697 - (view) Author: Thomas Anderson (tomkpz) * Date: 2021-03-15 00:56
IIRC, some transpilers for functional languages create deeply nested code, in particular for things like Haskell's do notation.

Anyway, when I wrote the PR, it was initially to reduce the frame size.  Then once I had dynamic block stack sizing working, I realized there was no longer a need to keep the limit of 20 blocks.  It was just compile.c that had the artificial limit, so I removed it.  I can add the limit back in the PR, but I'm not sure what benefit that would give.
msg388698 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-15 01:26
I don't see what putting the limit back in this PR would give, but I do question that we need more block nesting, and this PR causes a lot of code churn (including a new API to create frames). It would be more convincing if you could actually point to a code generator that would benefit, rather than hand-waving "code generators might use it."

I'm also not sure that we'll see measurable benefits in terms of memory access locality. Have you tried to benchmark this? Even a micro-benchmark aiming to show there is *some* effect would be helpful.
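A starting point for such a micro-benchmark might look like this (a rough sketch, not a rigorous measurement; the absolute numbers vary by Python version and platform):

```python
import sys
import timeit

# On 3.10 the fixed 20-entry block stack is embedded in every frame object,
# so it contributes directly to each frame's size.
def grab_frame():
    return sys._getframe()

size = sys.getsizeof(grab_frame())
print("frame size:", size, "bytes")  # version- and platform-dependent

# Crude probe: time calls that do and don't push handler blocks.
def with_blocks():
    try:
        try:
            pass
        finally:
            pass
    finally:
        pass

def without_blocks():
    pass

t_with = timeit.timeit(with_blocks, number=200_000)
t_without = timeit.timeit(without_blocks, number=200_000)
print(f"with blocks: {t_with:.4f}s, without: {t_without:.4f}s")
```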

At the same time I'm intrigued by Serhiy's idea, which could well reduce the size of the bytecode and the cost of instruction decoding by avoiding all dynamic block stack manipulation.

There are also other ideas floating about for improving memory locality related to the frame stack, e.g. putting the stack frames in an array instead of a linked list.
msg388715 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-03-15 09:23
> There are also other ideas floating about for improving memory locality related to the frame stack, e.g. putting the stack frames in an array instead of a linked list.

Would it work with generators and coroutines?
msg388747 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021-03-15 14:51
Yes, if we keep the top of the stack and the current frame in separate variables.

--Guido (mobile)
History
Date User Action Args
2021-03-20 23:23:36  gregory.p.smith  set  nosy: + gregory.p.smith
2021-03-15 14:51:39  gvanrossum  set  messages: + msg388747
2021-03-15 09:23:58  serhiy.storchaka  set  messages: + msg388715
2021-03-15 01:26:20  gvanrossum  set  messages: + msg388698
2021-03-15 00:56:32  tomkpz  set  messages: + msg388697
2021-01-21 11:31:27  Mark.Shannon  set  messages: + msg385409
2021-01-15 19:08:40  tomkpz  set  messages: + msg385125
2021-01-15 17:06:16  serhiy.storchaka  set  nosy: + gvanrossum, serhiy.storchaka; messages: + msg385120
2021-01-15 13:37:26  Mark.Shannon  set  nosy: + Mark.Shannon; messages: + msg385111
2021-01-13 02:07:40  python-dev  set  keywords: + patch; nosy: + python-dev; pull_requests: + pull_request23029; stage: patch review
2021-01-13 01:54:09  tomkpz  create