classification
Title: Make multiple passes of the peephole optimizer until bytecode cannot be optimized further
Type: Stage: patch review
Components: Interpreter Core Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: tim.peters Nosy List: brett.cannon, pablogsal, rhettinger, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2019-06-14 01:58 by pablogsal, last changed 2019-06-15 21:11 by rhettinger.

Pull Requests
URL Status Linked Edit
PR 14068 open pablogsal, 2019-06-14 01:58
Messages (9)
msg345543 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-06-14 01:58
The peephole optimizer currently does not optimize the result of its own optimization when its possible. For example, consider the following code:

    if a:
        if b:
            if (c or d):
                foo()
        else:
            bar()
    else:
        baz()

The bytecode for this is:

  2           0 LOAD_GLOBAL              0 (a)
              2 POP_JUMP_IF_FALSE       32

  4           4 LOAD_GLOBAL              1 (b)
              6 POP_JUMP_IF_FALSE       24

  5           8 LOAD_GLOBAL              2 (c)
             10 POP_JUMP_IF_TRUE        16

  6          12 LOAD_GLOBAL              3 (d)
             14 POP_JUMP_IF_FALSE       30

  7     >>   16 LOAD_GLOBAL              4 (foo)
             18 CALL_FUNCTION            0
             20 POP_TOP
             22 JUMP_ABSOLUTE           38

  9     >>   24 LOAD_GLOBAL              5 (bar)
             26 CALL_FUNCTION            0
             28 POP_TOP
        >>   30 JUMP_FORWARD             6 (to 38)

 11     >>   32 LOAD_GLOBAL              6 (baz)
             34 CALL_FUNCTION            0
             36 POP_TOP
        >>   38 LOAD_CONST               0 (None)
             40 RETURN_VALUE

Notice that the 14 POP_JUMP_IF_FALSE 30 jumps to another jump (30 JUMP_FORWARD 6 (to 38)). If we repeat the optimizations until the resulting bytecode does not change more we get:

  2           0 LOAD_GLOBAL              0 (a)
              2 POP_JUMP_IF_FALSE       32

  4           4 LOAD_GLOBAL              1 (b)
              6 POP_JUMP_IF_FALSE       24

  5           8 LOAD_GLOBAL              2 (c)
             10 POP_JUMP_IF_TRUE        16

  6          12 LOAD_GLOBAL              3 (d)

  5          14 POP_JUMP_IF_FALSE       38

  7     >>   16 LOAD_GLOBAL              4 (foo)
             18 CALL_FUNCTION            0
             20 POP_TOP
             22 JUMP_ABSOLUTE           38

  9     >>   24 LOAD_GLOBAL              5 (bar)
             26 CALL_FUNCTION            0
             28 POP_TOP
             30 JUMP_FORWARD             6 (to 38)

 11     >>   32 LOAD_GLOBAL              6 (baz)
             34 CALL_FUNCTION            0
             36 POP_TOP
        >>   38 LOAD_CONST              

Notice that in this example the original instruction now is (14 POP_JUMP_IF_FALSE  38).
msg345638 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-06-14 21:40
FWIW, when the peephole optimizer was originally accepted, it was partly on the condition that we kept it simple and fast.  Perhaps, the sentiment has changed, perhaps not.
msg345701 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-15 16:41
We should handle this with care and prove that the optimization loop is finite.
msg345705 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-06-15 16:58
We can add a #define for a maximum numbers of iterations. The pyperformance test suite shows speed ups of 2-3% without PGO/LTO. I will ran more concise experiments including PGO but even with a max cap of iterations (10 in my experiments) we get nice improvements without much more complexity.
msg345706 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-06-15 17:01
Probably is not that easy, but a naive thought respecting loops:

As the peephole optimizer only produces code that is the same length or smaller and it works by feeling noops and deleting them, the only loop would be code that is of the same length but different, but that cannot happen by adding noops and removing them, right?
msg345707 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-06-15 17:45
Tim, do you have any thoughts on this?

In the past, there was an implicit red-line that we would not cross even if a performance improvement was possible.  Given that that was a long time ago and that the newer devs are both more aggressive and more fearless, we ought to re-examine whether the red-line should be made explicit or whether it no longer applies.

For this particular patch, a source of discomfort is that the optimization step takes longer (due to multiple passes).  So, we sometimes get faster wordcode but that time it takes to get there is longer.
msg345710 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-06-15 17:52
>So, we sometimes get faster wordcode but that time it takes to get there is longer.

Is worth noting that the (very small) overhead to obtain the wordcode happens once but the benefits of the resulting opcode will be felt every time you execute it.
msg345713 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-06-15 19:53
I'm familiar with the trade-offs (I'm the original designer of this code), but it is possible to use that reasoning to talk yourself into thinking that any optimization to code generation is worth it.  

The speed of the compile does matter as well (think about uses of exec() in dataclasses for example).  Also, the level of complexity matters as well.  The last time we introduced recursive passes for part of the peepholer (in constant folding), it took years to get it fully debugged and it caused a number of bug reports with respect to total compilation time.

FWIW, Guido was never sold on the merits of having a peephole optimizer and the tool spent much of its life on the verge of being ripped-out.  To accommodate his wishes and our design instincts, we agreed at the outset to keep this "dirt simple".  (Put another way, for over a decade every developer who looked at this resisted going down this obvious line of development).

That said, I'll leave it to Tim to decide.  He helped talked Guido into letting me add this and he is the one who first proposed setting tight limits on how far we would go.

(One last thought, be sure to coordinate with Serhiy who is actively transferring responsibilities upstream to the AST level manipulations so they can be analyzed at a higher level.)
msg345715 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-15 20:49
I afraid that the time of the peephole optimization is nothing in comparison with the cost of deduplicating constants in compile.c.
History
Date User Action Args
2019-06-15 21:11:32rhettingersetassignee: tim.peters
2019-06-15 20:49:08serhiy.storchakasetmessages: + msg345715
2019-06-15 19:53:55rhettingersetmessages: + msg345713
2019-06-15 17:52:48pablogsalsetmessages: + msg345710
2019-06-15 17:45:31rhettingersetnosy: + tim.peters
messages: + msg345707
2019-06-15 17:01:59pablogsalsetmessages: + msg345706
2019-06-15 16:58:59pablogsalsetmessages: + msg345705
2019-06-15 16:41:48serhiy.storchakasetmessages: + msg345701
2019-06-14 21:40:08rhettingersetnosy: + rhettinger
messages: + msg345638
2019-06-14 18:10:51brett.cannonsetnosy: + brett.cannon
2019-06-14 02:00:32pablogsalsetnosy: + serhiy.storchaka
2019-06-14 01:58:49pablogsalsetkeywords: + patch
stage: patch review
pull_requests: + pull_request13927
2019-06-14 01:58:30pablogsalcreate