
speedup for / while / if with better bytecode #46711

Closed

pitrou opened this issue Mar 22, 2008 · 33 comments

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), pending (The issue will be closed if no feedback is provided), performance (Performance or resource usage)

Comments


pitrou commented Mar 22, 2008

BPO 2459
Nosy @rhettinger, @gpshead, @jcea, @pitrou, @giampaolo, @tiran, @djc, @phsilva
Files
  • predict_loop.diff
  • for_iter.patch
  • trunk-opt-loop.patch: for_iter.patch merged with python/issues-test-cpython#4715
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2008-03-22.22:25:03.988>
    labels = ['interpreter-core', 'performance']
    title = 'speedup for / while / if with better bytecode'
    updated_at = <Date 2014-06-22.12:04:07.520>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2014-06-22.12:04:07.520>
    actor = 'BreamoreBoy'
    assignee = 'jyasskin'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2008-03-22.22:25:03.988>
    creator = 'pitrou'
    dependencies = []
    files = ['9877', '12923', '13227']
    hgrepos = []
    issue_num = 2459
    keywords = ['patch']
    message_count = 31.0
    messages = ['64343', '64359', '64364', '64367', '64383', '64506', '64537', '64572', '64573', '64577', '64599', '64603', '66027', '68687', '80591', '80593', '80598', '80995', '81029', '81958', '81962', '81976', '82556', '82557', '83003', '83012', '83022', '83088', '106883', '192575', '221249']
    nosy_count = 15.0
    nosy_names = ['nnorwitz', 'collinwinter', 'rhettinger', 'gregory.p.smith', 'jcea', 'tzot', 'pitrou', 'giampaolo.rodola', 'christian.heimes', 'jyasskin', 'stutzbach', 'lauromoura', 'djc', 'phsilva', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'languishing'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue2459'
    versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

    pitrou added the interpreter-core (Objects, Python, Grammar, and Parser dirs) and performance (Performance or resource usage) labels Mar 22, 2008

    pitrou commented Mar 22, 2008

    This is a preliminary patch to speed up for and while loops (it will
    also be able to speed up list comps and gen comps, if I get to do it).
    The patch works by putting the loop condition test at the end of the
    loop, which removes one JUMP_ABSOLUTE opcode from the critical path.

    For this, two new opcodes are introduced:

    • FOR_ITER2 is the same as FOR_ITER except that it does an /absolute/
      jump if the iterator is /not/ exhausted (when other uses of FOR_ITER are
      replaced with FOR_ITER2, we can of course restore the original naming)
    • JUMP_ABS_IF_TRUE /pops/ the top of the stack and does an absolute jump
      if its value is true
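
    To make the transformation concrete, here is a small runnable sketch
    (my own illustration, not code from the patch; the "after" layout is
    inferred from the description above):

        import dis

        def f(seq):
            for x in seq:
                pass

        # Stock CPython emits the loop test at the top, roughly:
        #     GET_ITER; start: FOR_ITER (to done); STORE_FAST x;
        #     JUMP_ABSOLUTE (to start); done: ...
        # so every iteration pays for an extra JUMP_ABSOLUTE. With the
        # patch, an unconditional jump runs once on entry and FOR_ITER2
        # sits at the bottom, jumping back while the iterator is NOT
        # exhausted:
        #     GET_ITER; JUMP_ABSOLUTE (to test); body: STORE_FAST x;
        #     test: FOR_ITER2 (to body); done: ...
        dis.dis(f)   # compare the stock output against the sketch above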

    Some micro-benchmarks:

    ./python -m timeit "for x in xrange(10000): pass"
    Before: 1000 loops, best of 3: 782 usec per loop
    After: 1000 loops, best of 3: 412 usec per loop

    ./python -m timeit "x=100" "while x: x -= 1"
    Before: 10000 loops, best of 3: 22.1 usec per loop
    After: 100000 loops, best of 3: 16.6 usec per loop

    ./python Tools/pybench/pybench.py -t ForLoops
    Before: 365ms per round
    After: 234ms per round

    Also, pystone gets 5% faster (from 43300 to 45800).

    Now for the less shiny things:

    1. I'm having problems modifying the pure Python compiler module. For
      some reason it seems to have stricter requirements than compile.c
      itself does (!); I get an assertion error in
      compiler.pyassem.FlowGraph.fixupOrderHonorNext as soon as a loop gets
      non-trivial.

    2. Line numbers probably need to be fixed. The lnotab format may even
      have to be adapted in order to accommodate non-monotonically increasing
      line numbers.

    Is there some interest in this patch? If yes, I'd like to have your
    input on the two bullet points above :)


    pitrou commented Mar 23, 2008

    This new patch includes surgery to the compiler package (especially
    flowgraph trickery) in order to make it work with the new opcodes. I
    think my changes are sane but since the package seems basically
    untested, unmaintained and almost unused, there may be some glitches.
    However, test_compiler passes.

    (test_dis will need to be updated for the new opcodes, not a big deal)


    pitrou commented Mar 23, 2008

    Removed latest patch, it was half-baked.


    pitrou commented Mar 23, 2008

    This new patch should be ok. The block ordering algorithm in
    compiler.pyassem looks entirely clean now, to the extent that the
    previous "fixup" hacks have been disabled.

    Attaching loops3.py.


    pitrou commented Mar 23, 2008

    loops4.patch adds a mechanism to avoid blocking signal catching in empty
    loops (such as "for x in it: pass" or "while x: pass"). Much of the
    speedup is still retained.

    ./python -m timeit "for x in xrange(10000): pass"
    Before: 1000 loops, best of 3: 737 usec per loop
    After: 1000 loops, best of 3: 438 usec per loop

    ./python -m timeit "x=100" "while x: x -= 1"
    Before: 10000 loops, best of 3: 21.7 usec per loop
    After: 100000 loops, best of 3: 16.6 usec per loop

    ./python Tools/pybench/pybench.py -t ForLoops
    Before: 364ms per round
    After: 242ms per round
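
    A quick way to check the behaviour this mechanism protects (my own
    Unix-only snippet, not part of the patch): deliver a SIGINT while the
    interpreter spins in a bodyless loop and make sure KeyboardInterrupt
    still gets through.

        import os, signal, threading

        # Fire SIGINT at our own process after one second, from a helper thread.
        threading.Timer(1.0, lambda: os.kill(os.getpid(), signal.SIGINT)).start()
        try:
            while True:   # the kind of empty loop that could starve the checks
                pass
        except KeyboardInterrupt:
            print("loop was still interruptible")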


    pitrou commented Mar 25, 2008

    By the way, the compiler package fix has been isolated and cleaned up as
    part of bpo-2472. Maybe it would be nice to start with that one.


    pitrou commented Mar 26, 2008

    This new patch also updates the code generation for list comprehensions.
    Here are some micro-benchmarks:

    ./python -m timeit -s "l = range(100)" "[x for x in l]"
    Before: 100000 loops, best of 3: 14.9 usec per loop
    After: 100000 loops, best of 3: 12.5 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if x]"
    Before: 10000 loops, best of 3: 24.1 usec per loop
    After: 100000 loops, best of 3: 18.8 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if not x]"
    Before: 100000 loops, best of 3: 15.5 usec per loop
    After: 100000 loops, best of 3: 11.9 usec per loop

    Please note that this patch is orthogonal to Neal's patch in bpo-2183, so
    the two combined should be quite interesting for the list comprehensions
    bytecode.


    pitrou commented Mar 26, 2008

    This new patch completes the bytecode modifications. For/while loops as
    well as list comprehensions and generator expressions are a bit faster
    now. Also, as a side effect I've introduced a speed improvement for "if"
    statements and expressions...

    Some micro-benchmarks (completing the ones already given above):

    ./python Tools/pybench/pybench.py -t IfThenElse
    Before: 167ms per round
    After: 136ms per round

    ./python -m timeit -s "y=range(100)" "sum(x for x in y)"
    Before: 10000 loops, best of 3: 20.4 usec per loop
    After: 100000 loops, best of 3: 17.9 usec per loop

    ./python -m timeit -s "y=range(100)" "sum(x for x in y if x)"
    Before: 10000 loops, best of 3: 28.5 usec per loop
    After: 10000 loops, best of 3: 23.3 usec per loop

    ./python -m timeit -s "y=range(100)" "sum(x for x in y if not x)"
    Before: 100000 loops, best of 3: 16.4 usec per loop
    After: 100000 loops, best of 3: 12.1 usec per loop

    ./python -m timeit -s "x,y,z=1,2,3" "x if y else z"
    Before: 1000000 loops, best of 3: 0.218 usec per loop
    After: 10000000 loops, best of 3: 0.159 usec per loop

    A couple of tests seem to be failing in obscure ways in the test suite;
    I'll try to examine them. Most of the test suite runs fine, though.


    pitrou commented Mar 26, 2008

    Ok, the fix for the bizarre failures was really simple. Now the only
    failing tests are in test_trace (because it makes assumptions regarding
    the bytecode that aren't true anymore, I'll have to adapt the tests).


    nnorwitz commented Mar 27, 2008

    Antoine, I hope to look at this patch eventually. Unfortunately, there
    are too many build/test problems that need to be resolved before the
    next release. If you can help out with those, I will be able to review
    this patch sooner.


    arigo commented Mar 27, 2008

    Can you see if this simpler patch also gives speed-ups?
    (predict_loop.diff)


    pitrou commented Mar 27, 2008

    Armin, your patch gives a speed-up for "for" loops and comprehensions,
    although a bit less. Also, it doesn't speed up "while" loops and "if"
    statements at all. For some reason it also appears to make pystone a
    bit slower. Here are some micro-benchmarks:

    ./python -m timeit "for x in xrange(10000): pass"
    Before: 1000 loops, best of 3: 758 usec per loop
    After: 1000 loops, best of 3: 483 usec per loop

    ./python -m timeit "x=100" "while x: x -= 1"
    Before: 10000 loops, best of 3: 21.8 usec per loop
    After: 10000 loops, best of 3: 21.6 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l]"
    Before: 100000 loops, best of 3: 14.9 usec per loop
    After: 100000 loops, best of 3: 13.3 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if x]"
    Before: 10000 loops, best of 3: 23.9 usec per loop
    After: 10000 loops, best of 3: 22.3 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if not x]"
    Before: 100000 loops, best of 3: 15.8 usec per loop
    After: 100000 loops, best of 3: 13.9 usec per loop

    ./python Tools/pybench/pybench.py -t IfThenElse
    Before: 164ms per round
    After: 166ms per round


    pitrou commented Apr 30, 2008

    Finally I had to slightly change the lnotab format to have the right
    tracing semantics: the change is that line number increments are now
    signed bytes rather than unsigned.
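
    For the curious, here is a minimal sketch of decoding the changed
    lnotab (the helper name is mine, not from the patch, and it is written
    against Python 3 bytes objects for brevity). The only difference from
    the stock format is that the line delta in each (addr_delta,
    line_delta) pair is read as a signed byte:

        def decode_lnotab(code):
            # Yield (bytecode offset, line number) pairs from code.co_lnotab.
            addr, line = 0, code.co_firstlineno
            tab = code.co_lnotab
            for i in range(0, len(tab), 2):
                addr_delta, line_delta = tab[i], tab[i + 1]
                if line_delta >= 0x80:     # proposed change: treat the delta as
                    line_delta -= 0x100    # signed, so 0xFF means -1 (backward)
                addr += addr_delta
                line += line_delta
                yield addr, line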

    Still, there is a small change in tracing behaviour (see test_trace.py):
    the for statement is traced twice in the first loop iteration, that's
    because the iterator object is retrieved (GET_ITER) at the beginning of
    the loop while the next iterator value is fetched (FOR_ITER) at the end
    of the loop. I don't believe this is very painful.

    All in all, the whole test suite now passes fine. The performance
    numbers are the same as before.

    pitrou changed the title from "speedup loops with better bytecode" to "speedup for / while / if with better bytecode" Apr 30, 2008

    tzot commented Jun 24, 2008

    A pointer to previous (minor) research:

    http://groups.google.com/group/comp.lang.python/browse_frm/thread/72505e3cb6d9cb1a/e486759f06ec4ee5

    especially after Terry Reedy's post

    rhettinger commented:

    Reminder, make sure we can still break out of a "while 1: pass".


    pitrou commented Jan 26, 2009

    > Reminder, make sure we can still break out of a "while 1: pass".

    Yes, the patch takes care of that.


    pitrou commented Jan 26, 2009

    The patches don't apply cleanly anymore, I'll regenerate a new one.


    pitrou commented Feb 2, 2009

    Here is an updated patch against trunk.
    To be honest I don't think it has a lot of future:

    • it changes the lnotab format (to allow negative offsets)
    • it rewrites the buggy block reordering code in the pure Python
      compiler package, but nobody seems interested in maintaining that
      package (see bpo-2472).

    Perhaps we should just declare the experiment failed and switch to
    something else.

    rhettinger commented:

    I would like to see this go forward. It looks promising.


    collinwinter commented Feb 13, 2009

    I don't see the changes to the lnotab format being a roadblock; just
    mention it in NEWS. Likewise, the pure-Python compiler package shouldn't
    be a high priority; your changes to that package look good enough.

    I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64,
    compiling Python as 64-bit):
    Django templates (render a 150x150 table 100 times):
    Min: 0.595 -> 0.589: 0.94% faster
    Avg: 0.599 -> 0.591: 1.30% faster

    Spitfire templates (render a 1000x1000 table 100 times):
    Min: 0.751 -> 0.729: 2.98% faster
    Avg: 0.753 -> 0.730: 3.09% faster

    None of the apps I've benchmarked are negatively impacted. I only have
    two minor comments. Please commit this.

    Review comments:

    • The changes to Python/compile.c:compiler_if(), compiler_for(),
      compiler_while() have some indentation issues (tabs and such).
    • Functions like

          def foo():
              while True:
                  pass

      have a useless JUMP_FORWARD 0 instruction, but I don't think it's worth
      teaching the peepholer to remove them since it doesn't happen in other
      circumstances (that I can tell).


    pitrou commented Feb 13, 2009

    Hello Collin,

    Thanks for taking a look.

    I don't see the changes to the lnotab format being a roadblock; just
    mention it in NEWS. Likewise, the pure-Python compiler package shouldn't
    be a high priority; your changes to that package look good enough.

    Well, I have good news: the fixes to the pure Python compiler have been
    accepted and committed by Neil Schemenauer in r69373.

    I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64,
    compiling Python as 64-bit):
    Django templates (render a 150x150 table 100 times):
    Min: 0.595 -> 0.589: 0.94% faster
    Avg: 0.599 -> 0.591: 1.30% faster

    Spitfire templates (render a 1000x1000 table 100 times):
    Min: 0.751 -> 0.729: 2.98% faster
    Avg: 0.753 -> 0.730: 3.09% faster

    Not a tremendous speedup but not totally insignificant either.
    (I see you like Spitfire :-))

    None of the apps I've benchmarked are negatively impacted. I only have
    two minor comments. Please commit this.

    Before committing I want to know what to do with the new jump opcodes,
    with respect to the alternative proposal I've made in bpo-4715.
    Ideally, I should fold the bpo-4715 patch back into the present patch,
    since I think the bpo-4715 approach is more thought out.


    collinwinter commented Feb 13, 2009

    On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote:

    > Hello Collin,
    >
    > Thanks for taking a look.
    >
    >> I don't see the changes to the lnotab format being a roadblock; just
    >> mention it in NEWS. Likewise, the pure-Python compiler package shouldn't
    >> be a high priority; your changes to that package look good enough.
    >
    > Well, I have good news: the fixes to the pure Python compiler have been
    > accepted and committed by Neil Schemenauer in r69373.

    Yeah, I saw that. Fantastic.

    >> I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64,
    >> compiling Python as 64-bit):
    >> Django templates (render a 150x150 table 100 times):
    >> Min: 0.595 -> 0.589: 0.94% faster
    >> Avg: 0.599 -> 0.591: 1.30% faster
    >>
    >> Spitfire templates (render a 1000x1000 table 100 times):
    >> Min: 0.751 -> 0.729: 2.98% faster
    >> Avg: 0.753 -> 0.730: 3.09% faster
    >
    > Not a tremendous speedup but not totally insignificant either.
    > (I see you like Spitfire :-))

    Well, Spitfire and Django represent very different ways of
    implementing a template system, so I like to measure both when doing
    application benchmarks. Template systems tend to be heavy CPU
    consumers for webapps, which is why I include them.

    >> None of the apps I've benchmarked are negatively impacted. I only have
    >> two minor comments. Please commit this.
    >
    > Before committing I want to know what to do with the new jump opcodes,
    > with respect to the alternative proposal I've made in bpo-4715.
    > Ideally, I should fold the bpo-4715 patch back into the present patch,
    > since I think the bpo-4715 approach is more thought out.

    That sounds good, especially since Jeffrey and I have already reviewed bpo-4715.


    collinwinter commented Feb 21, 2009

    On Fri, Feb 13, 2009 at 3:23 PM, Collin Winter <collinw@gmail.com> wrote:

    > On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote:
    >> Before committing I want to know what to do with the new jump opcodes,
    >> with respect to the alternative proposal I've made in bpo-4715.
    >> Ideally, I should fold the bpo-4715 patch back into the present patch,
    >> since I think the bpo-4715 approach is more thought out.
    >
    > That sounds good, especially since Jeffrey and I have already reviewed bpo-4715.

    If you don't have the bandwidth to integrate 4715 into this patch, I
    can do that, if you want.

    Collin


    pitrou commented Feb 21, 2009

    > If you don't have the bandwidth to integrate 4715 into this patch, I
    > can do that, if you want.

    Collin, that would be very nice of you. You could also apply Jeffrey's
    naming suggestions in bpo-4715.

    Thanks!

    Antoine.


    jyasskin commented Mar 2, 2009

    I've updated for_iter.patch to the latest trunk, merging in bpo-4715.
    I also changed tracing a bit so that the first line of a loop doesn't
    get traced twice in the first iteration, and added to test_dis to check
    that decreasing line numbers work there.

    Review at http://codereview.appspot.com/20103 if you like.

    Performance:

    32-bit gcc-4.3 Intel Core2:
    2to3:
    25.09 -> 24.63: 1.87% faster

    Django:
    Min: 0.614 -> 0.609: 0.86% faster
    Avg: 0.616 -> 0.635: 3.09% slower

    Pickle: (cPickle)
    Min: 1.647 -> 1.651: 0.21% slower
    Avg: 1.650 -> 1.656: 0.39% slower

    PyBench:
    Min: 5341 -> 5273: 1.29% faster
    Avg: 5450 -> 5397: 0.98% faster

    SlowPickle: (pickle)
    Min: 1.384 -> 1.341: 3.13% faster
    Avg: 1.385 -> 1.343: 3.08% faster

    Spitfire:
    Min: 0.773 -> 0.690: 11.97% faster
    Avg: 0.776 -> 0.695: 11.62% faster
    Spitfire re-run:
    Min: 0.740 -> 0.693: 6.81% faster
    Avg: 0.744 -> 0.695: 6.93% faster

    SlowUnpickle: (pickle)
    Min: 0.645 -> 0.668: 3.37% slower
    Avg: 0.646 -> 0.670: 3.59% slower
    SlowUnpickle re-run:
    Min: 0.645 -> 0.660: 2.31% slower
    Avg: 0.645 -> 0.661: 2.32% slower

    Unpickle: (cPickle)
    Min: 1.015 -> 1.006: 0.89% faster
    Avg: 1.021 -> 1.009: 1.16% faster

    64-bit gcc-4.3 Intel Core2
    2to3:
    22.31 -> 21.97: 1.57% faster

    Django:
    Min: 0.577 -> 0.564: 2.29% faster
    Avg: 0.579 -> 0.566: 2.20% faster

    Pickle:
    Min: 1.162 -> 1.178: 1.35% slower
    Avg: 1.166 -> 1.183: 1.37% slower

    PyBench:
    Min: 4498 -> 4193: 7.27% faster
    Avg: 4586 -> 4276: 7.25% faster

    SlowPickle:
    Min: 1.212 -> 1.133: 6.92% faster
    Avg: 1.213 -> 1.135: 6.92% faster

    Spitfire:
    Min: 0.631 -> 0.617: 2.32% faster
    Avg: 0.632 -> 0.621: 1.75% faster

    SlowUnpickle:
    Min: 0.575 -> 0.551: 4.31% faster
    Avg: 0.576 -> 0.553: 4.14% faster

    Unpickle:
    Min: 0.708 -> 0.722: 1.88% slower
    Avg: 0.745 -> 0.736: 1.20% faster

    jyasskin assigned pitrou Mar 2, 2009

    pitrou commented Mar 2, 2009

    > I've updated for_iter.patch to the latest trunk, merging in bpo-4715.
    > I also changed tracing a bit so that the first line of a loop doesn't
    > get traced twice in the first iteration, and added to test_dis to check
    > that decreasing line numbers work there.

    Thanks a lot!

    By the way, why do you bench cPickle? Does your test call Python code
    significantly?

    Overall, the results look positive although not overwhelming.


    jyasskin commented Mar 2, 2009

    No particular reason for cPickle. It sometimes shows when we've caused
    problems by increasing the code size, and shows the size of any random
    effects that the compiler causes by moving code around. Could you
    double-check the patch to see if I did anything dumb?


    jyasskin commented Mar 3, 2009

    Hold off on reviewing this. There's one bug around the peepholer not
    turning itself off when line numbers skip by more than 127, and another
    around the traceback generator still assuming line numbers are unsigned.
    I'll post another patch when those are fixed and tested. I may not get
    to it for a day or two, so someone else is welcome to update the patch
    instead. :)

    jyasskin assigned himself and unassigned pitrou Mar 3, 2009

    djc commented Jun 2, 2010

    Is this still relevant / will it get some love in the future?


    tiran commented Jul 7, 2013

    Is this enhancement still relevant?

    tiran added the stale label (Stale PR or inactive for long period of time) Jul 7, 2013

    BreamoreBoy commented Jun 22, 2014

    As a lot of work has gone into this, it saddens me to see it languishing. Surely, if Python performance is to be improved, the bytecode for conditionals and loops is one of the places, if not the place, to do it? Are there any names missing from the nosy list that ought to be there?

    ezio-melotti transferred this issue from another repository Apr 10, 2022
    iritkatriel commented:

    This experiment was abandoned over a decade ago and things have moved on since then. Unless someone objects I will close the issue.

    iritkatriel added the pending label (The issue will be closed if no feedback is provided) Sep 1, 2022

    pitrou commented Sep 1, 2022

    Definitely outdated :-)

    pitrou closed this as completed Sep 1, 2022