
speedup for / while / if with better bytecode #46711

Closed

pitrou opened this issue Mar 22, 2008 · 33 comments

Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), pending (The issue will be closed if no feedback is provided), performance (Performance or resource usage)

Comments


pitrou commented Mar 22, 2008

BPO 2459
Nosy @rhettinger, @gpshead, @jcea, @pitrou, @giampaolo, @tiran, @djc, @phsilva
Files
  • predict_loop.diff
  • for_iter.patch
  • trunk-opt-loop.patch: for_iter.patch merged with python/issues-test-cpython#4715
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2008-03-22.22:25:03.988>
    labels = ['interpreter-core', 'performance']
    title = 'speedup for / while / if with better bytecode'
    updated_at = <Date 2014-06-22.12:04:07.520>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2014-06-22.12:04:07.520>
    actor = 'BreamoreBoy'
    assignee = 'jyasskin'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2008-03-22.22:25:03.988>
    creator = 'pitrou'
    dependencies = []
    files = ['9877', '12923', '13227']
    hgrepos = []
    issue_num = 2459
    keywords = ['patch']
    message_count = 31.0
    messages = ['64343', '64359', '64364', '64367', '64383', '64506', '64537', '64572', '64573', '64577', '64599', '64603', '66027', '68687', '80591', '80593', '80598', '80995', '81029', '81958', '81962', '81976', '82556', '82557', '83003', '83012', '83022', '83088', '106883', '192575', '221249']
    nosy_count = 15.0
    nosy_names = ['nnorwitz', 'collinwinter', 'rhettinger', 'gregory.p.smith', 'jcea', 'tzot', 'pitrou', 'giampaolo.rodola', 'christian.heimes', 'jyasskin', 'stutzbach', 'lauromoura', 'djc', 'phsilva', 'BreamoreBoy']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'languishing'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue2459'
    versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

    pitrou added the interpreter-core (Objects, Python, Grammar, and Parser dirs) and performance (Performance or resource usage) labels Mar 22, 2008

    pitrou commented Mar 22, 2008

    This is a preliminary patch to speed up for and while loops (it will
    also be able to speed up list comps and gen comps, if I get to do it).
    The patch works by putting the loop condition test at the end of the
    loop, which removes one JUMP_ABSOLUTE opcode from the critical path.

    For this, two new opcodes are introduced:

    • FOR_ITER2 is the same as FOR_ITER except that it does an /absolute/
      jump if the iterator is /not/ exhausted (when other uses of FOR_ITER are
      replaced with FOR_ITER2, we can of course restore the original naming)
    • JUMP_ABS_IF_TRUE /pops/ the top of the stack and does an absolute jump
      if its value is true
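
    To make the transformation concrete, here is a small runnable sketch
    (my own illustration, not code from the patch; the "after" layout is
    inferred from the description above):

        import dis

        def f(seq):
            for x in seq:
                pass

        # Stock CPython emits the loop test at the top, roughly:
        #     GET_ITER; start: FOR_ITER (to done); STORE_FAST x;
        #     JUMP_ABSOLUTE (to start); done: ...
        # so every iteration pays for an extra JUMP_ABSOLUTE. With the
        # patch, an unconditional jump runs once on entry and FOR_ITER2
        # sits at the bottom, jumping back while the iterator is NOT
        # exhausted:
        #     GET_ITER; JUMP_ABSOLUTE (to test); body: STORE_FAST x;
        #     test: FOR_ITER2 (to body); done: ...
        dis.dis(f)   # compare the stock output against the sketch above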

    Some micro-benchmarks:

    ./python -m timeit "for x in xrange(10000): pass"
    Before: 1000 loops, best of 3: 782 usec per loop
    After: 1000 loops, best of 3: 412 usec per loop

    ./python -m timeit "x=100" "while x: x -= 1"
    Before: 10000 loops, best of 3: 22.1 usec per loop
    After: 100000 loops, best of 3: 16.6 usec per loop

    ./python Tools/pybench/pybench.py -t ForLoops
    Before: 365ms per round
    After: 234ms per round

    Also, pystone gets 5% faster (from 43300 to 45800).

    Now for the less shiny things:

    1. I'm having problems modifying the pure Python compiler module. For
      some reason it seems to have stricter requirements than compile.c
      itself does (!); I get an assertion error in
      compiler.pyassem.FlowGraph.fixupOrderHonorNext as soon as a loop gets
      non-trivial.

    2. Line numbers probably need to be fixed. The lnotab format may even
      have to be adapted in order to accommodate non-monotonically increasing
      line numbers.

    Is there some interest in this patch? If yes, I'd like to have your
    input on the two bullet points above :)


    pitrou commented Mar 23, 2008

    This new patch includes surgery to the compiler package (especially
    flowgraph trickery) in order to make it work with the new opcodes. I
    think my changes are sane but since the package seems basically
    untested, unmaintained and almost unused, there may be some glitches.
    However, test_compiler passes.

    (test_dis will need to be updated for the new opcodes, not a big deal)


    pitrou commented Mar 23, 2008

    Removed latest patch, it was half-baked.


    pitrou commented Mar 23, 2008

    This new patch should be ok. The block ordering algorithm in
    compiler.pyassem looks entirely clean now, to the extent that the
    previous "fixup" hacks have been disabled.

    Attaching loops3.py.


    pitrou commented Mar 23, 2008

    loops4.patch adds a mechanism to avoid blocking signal catching in empty
    loops (such as "for x in it: pass" or "while x: pass"). Much of the
    speedup is still retained.

    ./python -m timeit "for x in xrange(10000): pass"
    Before: 1000 loops, best of 3: 737 usec per loop
    After: 1000 loops, best of 3: 438 usec per loop

    ./python -m timeit "x=100" "while x: x -= 1"
    Before: 10000 loops, best of 3: 21.7 usec per loop
    After: 100000 loops, best of 3: 16.6 usec per loop

    ./python Tools/pybench/pybench.py -t ForLoops
    Before: 364ms per round
    After: 242ms per round
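
    A quick way to check the behaviour this mechanism protects (my own
    Unix-only snippet, not part of the patch): deliver a SIGINT while the
    interpreter spins in a bodyless loop and make sure KeyboardInterrupt
    still gets through.

        import os, signal, threading

        # Fire SIGINT at our own process after one second, from a helper thread.
        threading.Timer(1.0, lambda: os.kill(os.getpid(), signal.SIGINT)).start()
        try:
            while True:   # the kind of empty loop that could starve the checks
                pass
        except KeyboardInterrupt:
            print("loop was still interruptible")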


    pitrou commented Mar 25, 2008

    By the way, the compiler package fix has been isolated and cleaned up as
    part of bpo-2472. Maybe it would be nice to start with that one.


    pitrou commented Mar 26, 2008

    This new patch also updates the code generation for list comprehensions.
    Here are some micro-benchmarks:

    ./python -m timeit -s "l = range(100)" "[x for x in l]"
    Before: 100000 loops, best of 3: 14.9 usec per loop
    After: 100000 loops, best of 3: 12.5 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if x]"
    Before: 10000 loops, best of 3: 24.1 usec per loop
    After: 100000 loops, best of 3: 18.8 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if not x]"
    Before: 100000 loops, best of 3: 15.5 usec per loop
    After: 100000 loops, best of 3: 11.9 usec per loop

    Please note that this patch is orthogonal to Neal's patch in bpo-2183, so
    the two combined should be quite interesting for the list comprehensions
    bytecode.


    pitrou commented Mar 26, 2008

    This new patch completes the bytecode modifications. For/while loops as
    well as list comprehensions and generator expressions are a bit faster
    now. Also, as a side effect I've introduced a speed improvement for "if"
    statements and expressions...

    Some micro-benchmarks (completing the ones already given above):

    ./python Tools/pybench/pybench.py -t IfThenElse
    Before: 167ms per round
    After: 136ms per round

    ./python -m timeit -s "y=range(100)" "sum(x for x in y)"
    Before: 10000 loops, best of 3: 20.4 usec per loop
    After: 100000 loops, best of 3: 17.9 usec per loop

    ./python -m timeit -s "y=range(100)" "sum(x for x in y if x)"
    Before: 10000 loops, best of 3: 28.5 usec per loop
    After: 10000 loops, best of 3: 23.3 usec per loop

    ./python -m timeit -s "y=range(100)" "sum(x for x in y if not x)"
    Before: 100000 loops, best of 3: 16.4 usec per loop
    After: 100000 loops, best of 3: 12.1 usec per loop

    ./python -m timeit -s "x,y,z=1,2,3" "x if y else z"
    Before: 1000000 loops, best of 3: 0.218 usec per loop
    After: 10000000 loops, best of 3: 0.159 usec per loop

    A couple of tests seem to be failing in obscure ways in the test suite;
    I'll try to examine them. Most of the test suite runs fine, though.


    pitrou commented Mar 26, 2008

    Ok, the fix for the bizarre failures was really simple. Now the only
    failing tests are in test_trace (because it makes assumptions regarding
    the bytecode that aren't true anymore, I'll have to adapt the tests).


    nnorwitz commented Mar 27, 2008

    Antoine, I hope to look at this patch eventually. Unfortunately, there
    are too many build/test problems that need to be resolved before the
    next release. If you can help out with those, I will be able to review
    this patch sooner.


    arigo commented Mar 27, 2008

    Can you see if this simpler patch also gives speed-ups?
    (predict_loop.diff)


    pitrou commented Mar 27, 2008

    Armin, your patch gives a speed-up for "for" loops and comprehensions,
    although a bit less. Also, it doesn't speed up "while" loops and "if"
    statements at all. For some reason it also appears to make pystone a
    bit slower. Here are some micro-benchmarks:

    ./python -m timeit "for x in xrange(10000): pass"
    Before: 1000 loops, best of 3: 758 usec per loop
    After: 1000 loops, best of 3: 483 usec per loop

    ./python -m timeit "x=100" "while x: x -= 1"
    Before: 10000 loops, best of 3: 21.8 usec per loop
    After: 10000 loops, best of 3: 21.6 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l]"
    Before: 100000 loops, best of 3: 14.9 usec per loop
    After: 100000 loops, best of 3: 13.3 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if x]"
    Before: 10000 loops, best of 3: 23.9 usec per loop
    After: 10000 loops, best of 3: 22.3 usec per loop

    ./python -m timeit -s "l = range(100)" "[x for x in l if not x]"
    Before: 100000 loops, best of 3: 15.8 usec per loop
    After: 100000 loops, best of 3: 13.9 usec per loop

    ./python Tools/pybench/pybench.py -t IfThenElse
    Before: 164ms per round
    After: 166ms per round


    pitrou commented Apr 30, 2008

    Finally I had to slightly change the lnotab format to have the right
    tracing semantics: the change is that line number increments are now
    signed bytes rather than unsigned.
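
    For the curious, here is a minimal sketch of decoding the changed
    lnotab (the helper name is mine, not from the patch, and it is written
    against Python 3 bytes objects for brevity). The only difference from
    the stock format is that the line delta in each (addr_delta,
    line_delta) pair is read as a signed byte:

        def decode_lnotab(code):
            # Yield (bytecode offset, line number) pairs from code.co_lnotab.
            addr, line = 0, code.co_firstlineno
            tab = code.co_lnotab
            for i in range(0, len(tab), 2):
                addr_delta, line_delta = tab[i], tab[i + 1]
                if line_delta >= 0x80:     # proposed change: treat the delta as
                    line_delta -= 0x100    # signed, so 0xFF means -1 (backward)
                addr += addr_delta
                line += line_delta
                yield addr, line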

    Still, there is a small change in tracing behaviour (see test_trace.py):
    the for statement is traced twice in the first loop iteration, that's
    because the iterator object is retrieved (GET_ITER) at the beginning of
    the loop while the next iterator value is fetched (FOR_ITER) at the end
    of the loop. I don't believe this is very painful.

    All in all, the whole test suite now passes fine. The performance
    numbers are the same as before.

    pitrou changed the title from "speedup loops with better bytecode" to "speedup for / while / if with better bytecode" Apr 30, 2008

    tzot commented Jun 24, 2008

    A pointer to previous (minor) research:

    http://groups.google.com/group/comp.lang.python/browse_frm/thread/72505e3cb6d9cb1a/e486759f06ec4ee5

    especially after Terry Reedy's post

    rhettinger commented:

    Reminder, make sure we can still break out of a "while 1: pass".


    pitrou commented Jan 26, 2009

    > Reminder, make sure we can still break out of a "while 1: pass".

    Yes, the patch takes care of that.


    pitrou commented Jan 26, 2009

    The patches don't apply cleanly anymore, I'll regenerate a new one.


    pitrou commented Feb 2, 2009

    Here is an updated patch against trunk.
    To be honest I don't think it has a lot of future:

    • it changes the lnotab format (to allow negative offsets)
    • it rewrites the buggy block reordering code in the pure Python
      compiler package, but nobody seems interested in maintaining that
      package (see bpo-2472).

    Perhaps we should just declare the experiment failed and switch to
    something else.

    rhettinger commented:

    I would like to see this go forward. It looks promising.


    collinwinter commented Feb 13, 2009

    I don't see the changes to the lnotab format being a roadblock; just
    mention it in NEWS. Likewise, the pure-Python compiler package shouldn't
    be a high priority; your changes to that package look good enough.

    I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64,
    compiling Python as 64-bit):
    Django templates (render a 150x150 table 100 times):
    Min: 0.595 -> 0.589: 0.94% faster
    Avg: 0.599 -> 0.591: 1.30% faster

    Spitfire templates (render a 1000x1000 table 100 times):
    Min: 0.751 -> 0.729: 2.98% faster
    Avg: 0.753 -> 0.730: 3.09% faster

    None of the apps I've benchmarked are negatively impacted. I only have
    two minor comments. Please commit this.

    Review comments:

    • The changes to Python/compile.c:compiler_if(), compiler_for(),
      compiler_while() have some indentation issues (tabs and such).
    • Functions like

          def foo():
              while True:
                  pass

      have a useless JUMP_FORWARD 0 instruction, but I don't think it's worth
      teaching the peepholer to remove them since it doesn't happen in other
      circumstances (that I can tell).


    pitrou commented Feb 13, 2009

    Hello Collin,

    Thanks for taking a look.

    I don't see the changes to the lnotab format being a roadblock; just
    mention it in NEWS. Likewise, the pure-Python compiler package shouldn't
    be a high priority; your changes to that package look good enough.

    Well, I have good news: the fixes to the pure Python compiler have been
    accepted and committed by Neil Schemenauer in r69373.

    I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64,
    compiling Python as 64-bit):
    Django templates (render a 150x150 table 100 times):
    Min: 0.595 -> 0.589: 0.94% faster
    Avg: 0.599 -> 0.591: 1.30% faster

    Spitfire templates (render a 1000x1000 table 100 times):
    Min: 0.751 -> 0.729: 2.98% faster
    Avg: 0.753 -> 0.730: 3.09% faster

    Not a tremendous speedup but not totally insignificant either.
    (I see you like Spitfire :-))

    None of the apps I've benchmarked are negatively impacted. I only have
    two minor comments. Please commit this.

    Before committing I want to know what to do with the new jump opcodes,
    with respect to the alternative proposal I've made in bpo-4715.
    Ideally, I should fold the bpo-4715 patch back into the present patch,
    since I think the bpo-4715 approach is more thought out.


    collinwinter commented Feb 13, 2009

    On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote:

    > Hello Collin,
    >
    > Thanks for taking a look.
    >
    >> I don't see the changes to the lnotab format being a roadblock; just
    >> mention it in NEWS. Likewise, the pure-Python compiler package shouldn't
    >> be a high priority; your changes to that package look good enough.
    >
    > Well, I have good news: the fixes to the pure Python compiler have been
    > accepted and committed by Neil Schemenauer in r69373.

    Yeah, I saw that. Fantastic.

    >> I'm seeing encouraging speed-ups out of this (with gcc 4.3.1 x86_64,
    >> compiling Python as 64-bit):
    >> Django templates (render a 150x150 table 100 times):
    >> Min: 0.595 -> 0.589: 0.94% faster
    >> Avg: 0.599 -> 0.591: 1.30% faster
    >>
    >> Spitfire templates (render a 1000x1000 table 100 times):
    >> Min: 0.751 -> 0.729: 2.98% faster
    >> Avg: 0.753 -> 0.730: 3.09% faster
    >
    > Not a tremendous speedup but not totally insignificant either.
    > (I see you like Spitfire :-))

    Well, Spitfire and Django represent very different ways of
    implementing a template system, so I like to measure both when doing
    application benchmarks. Template systems tend to be heavy CPU
    consumers for webapps, which is why I include them.

    >> None of the apps I've benchmarked are negatively impacted. I only have
    >> two minor comments. Please commit this.
    >
    > Before committing I want to know what to do with the new jump opcodes,
    > with respect to the alternative proposal I've made in bpo-4715.
    > Ideally, I should fold the bpo-4715 patch back into the present patch,
    > since I think the bpo-4715 approach is more thought out.

    That sounds good, especially since Jeffrey and I have already reviewed bpo-4715.


    collinwinter commented Feb 21, 2009

    On Fri, Feb 13, 2009 at 3:23 PM, Collin Winter <collinw@gmail.com> wrote:

    > On Fri, Feb 13, 2009 at 10:37 AM, Antoine Pitrou <report@bugs.python.org> wrote:
    >> Before committing I want to know what to do with the new jump opcodes,
    >> with respect to the alternative proposal I've made in bpo-4715.
    >> Ideally, I should fold the bpo-4715 patch back into the present patch,
    >> since I think the bpo-4715 approach is more thought out.
    >
    > That sounds good, especially since Jeffrey and I have already reviewed bpo-4715.

    If you don't have the bandwidth to integrate 4715 into this patch, I
    can do that, if you want.

    Collin


    pitrou commented Feb 21, 2009

    > If you don't have the bandwidth to integrate 4715 into this patch, I
    > can do that, if you want.

    Collin, that would be very nice of you. You could also apply Jeffrey's
    naming suggestions in bpo-4715.

    Thanks!

    Antoine.


    jyasskin commented Mar 2, 2009

    I've updated for_iter.patch to the latest trunk, merging in bpo-4715.
    I also changed tracing a bit so that the first line of a loop doesn't
    get traced twice in the first iteration, and added to test_dis to check
    that decreasing line numbers work there.

    Review at http://codereview.appspot.com/20103 if you like.

    Performance:

    32-bit gcc-4.3 Intel Core2:
    2to3:
    25.09 -> 24.63: 1.87% faster

    Django:
    Min: 0.614 -> 0.609: 0.86% faster
    Avg: 0.616 -> 0.635: 3.09% slower

    Pickle: (cPickle)
    Min: 1.647 -> 1.651: 0.21% slower
    Avg: 1.650 -> 1.656: 0.39% slower

    PyBench:
    Min: 5341 -> 5273: 1.29% faster
    Avg: 5450 -> 5397: 0.98% faster

    SlowPickle: (pickle)
    Min: 1.384 -> 1.341: 3.13% faster
    Avg: 1.385 -> 1.343: 3.08% faster

    Spitfire:
    Min: 0.773 -> 0.690: 11.97% faster
    Avg: 0.776 -> 0.695: 11.62% faster
    Spitfire re-run:
    Min: 0.740 -> 0.693: 6.81% faster
    Avg: 0.744 -> 0.695: 6.93% faster

    SlowUnpickle: (pickle)
    Min: 0.645 -> 0.668: 3.37% slower
    Avg: 0.646 -> 0.670: 3.59% slower
    SlowUnpickle re-run:
    Min: 0.645 -> 0.660: 2.31% slower
    Avg: 0.645 -> 0.661: 2.32% slower

    Unpickle: (cPickle)
    Min: 1.015 -> 1.006: 0.89% faster
    Avg: 1.021 -> 1.009: 1.16% faster

    64-bit gcc-4.3 Intel Core2
    2to3:
    22.31 -> 21.97: 1.57% faster

    Django:
    Min: 0.577 -> 0.564: 2.29% faster
    Avg: 0.579 -> 0.566: 2.20% faster

    Pickle:
    Min: 1.162 -> 1.178: 1.35% slower
    Avg: 1.166 -> 1.183: 1.37% slower

    PyBench:
    Min: 4498 -> 4193: 7.27% faster
    Avg: 4586 -> 4276: 7.25% faster

    SlowPickle:
    Min: 1.212 -> 1.133: 6.92% faster
    Avg: 1.213 -> 1.135: 6.92% faster

    Spitfire:
    Min: 0.631 -> 0.617: 2.32% faster
    Avg: 0.632 -> 0.621: 1.75% faster

    SlowUnpickle:
    Min: 0.575 -> 0.551: 4.31% faster
    Avg: 0.576 -> 0.553: 4.14% faster

    Unpickle:
    Min: 0.708 -> 0.722: 1.88% slower
    Avg: 0.745 -> 0.736: 1.20% faster

    jyasskin assigned pitrou Mar 2, 2009

    pitrou commented Mar 2, 2009

    > I've updated for_iter.patch to the latest trunk, merging in bpo-4715.
    > I also changed tracing a bit so that the first line of a loop doesn't
    > get traced twice in the first iteration, and added to test_dis to check
    > that decreasing line numbers work there.

    Thanks a lot!

    By the way, why do you bench cPickle? Does your test call Python code
    significantly?

    Overall, the results look positive although not overwhelming.


    jyasskin commented Mar 2, 2009

    No particular reason for cPickle. It sometimes shows when we've caused
    problems by increasing the code size, and shows the size of any random
    effects that the compiler causes by moving code around. Could you
    double-check the patch to see if I did anything dumb?


    jyasskin commented Mar 3, 2009

    Hold off on reviewing this. There's one bug around the peepholer not
    turning itself off when line numbers skip by more than 127, and another
    around the traceback generator still assuming line numbers are unsigned.
    I'll post another patch when those are fixed and tested. I may not get
    to it for a day or two, so someone else is welcome to update the patch
    instead. :)

    jyasskin assigned himself and unassigned pitrou Mar 3, 2009

    djc commented Jun 2, 2010

    Is this still relevant / will it get some love in the future?


    tiran commented Jul 7, 2013

    Is this enhancement still relevant?

    tiran added the stale label (Stale PR or inactive for long period of time) Jul 7, 2013

    BreamoreBoy commented Jun 22, 2014

    As a lot of work has gone into this, it saddens me to see it languishing. Surely, if Python performance is to be improved, the bytecode for conditionals and loops is one of the places, if not the place, to do it? Are there any names missing from the nosy list that ought to be there?

    ezio-melotti transferred this issue from another repository Apr 10, 2022
    iritkatriel commented:

    This experiment was abandoned over a decade ago and things have moved on since then. Unless someone objects I will close the issue.

    iritkatriel added the pending label (The issue will be closed if no feedback is provided) Sep 1, 2022

    pitrou commented Sep 1, 2022

    Definitely outdated :-)

    pitrou closed this as completed Sep 1, 2022