classification
Title: CALL_ATTR opcode
Type: Stage:
Components: Interpreter Core Versions: Python 2.3
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: arigo, gvanrossum, twouters
Priority: normal Keywords: patch

Created on 2003-03-25 23:16 by twouters, last changed 2006-03-20 21:25 by twouters. This issue is now closed.

Files
File name Uploaded Description Edit
call_attr.diff twouters, 2003-03-25 23:18 1st call_attr patch
call_attr2.diff twouters, 2003-04-17 23:41 2nd call_attr patch
call_attr4.diff twouters, 2003-04-18 22:48 4th call_attr patch (fixed)
call_attr5.diff twouters, 2003-05-02 12:53 5th call_attr patch (with CALL_PROFILE)
Messages (16)
msg43166 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-03-25 23:16
The result of the PyCore sprint of me and Brett: the CALL_ATTR opcode (LOAD_ATTR and CALL_FUNCTION combined) that skips the PyMethod creation and destruction for classic classes (but not newstyle classes, yet.)

The code is somewhat rough yet, it needs commenting, some renaming, and most importantly testing. It seems to work, however, and provides between a 35% and 5% speedup. (5% in 'average' code, up to 35% in instance method calls and instance creation alone.) It also needs to be updated to include newstyle classes. I will likely work on this on the flight home.
msg43167 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-03-25 23:18
Logged In: YES 
user_id=34209

attaching patch.
msg43168 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-04-17 23:41
Logged In: YES 
user_id=34209

Revised patch that includes avoiding the wrapping of
(Python) methods on newstyle classes as well as oldstyle
classes. The patch checks to see if a particular type uses
PyObject_GenericGetAttr, and if so, uses a near-copy of that
function to get an unwrapped PyFunction object.
PyCFunctionObject objects are not magically treated, and
fall back to the slow path... PyCFunction's descr's don't
have direct access to an 'unwrapped' version, they create
PyCFunctionObjectss on demand based on a PyCFunction -- a
function pointer.

Some simple testing (using timeit.py) suggests about a 20%
increase in speed for 'x.foo()' where 'x' is a newstyle
class instance. However, a real-world-ish benchmark (very
hard to find, for newstyle classes) suggests otherwise:
running 'timeit.py' on "pass" consistently takes about 3%
longer.

I'm certain the problem lies in the fact that the patch
doesn't consider PyCFunctions, which forces part of the slow
MRO lookup and instnace-dict checking to be re-done for
C-functions on newstyle types (of which there are a heck of
a lot.) Either handling PyMethodDescrs the same way as
PyFunctionObjects, or circumventing the slow path in some
way (by returning non-method-but-found objects) will
probably fix that.
msg43169 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-04-18 00:29
Logged In: YES 
user_id=6380

I would think that you should avoid the double lookups
somehow...  How difficult is that?
msg43170 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-04-18 00:42
Logged In: YES 
user_id=34209

Well, currently, the neutered getattr functions just bail
out and return NULL whenever they don't get what they
expect. I guess they go back to being 'full' getattr's, with
the exception that they 'return' two values: the retrieved
object, and a status indicator (to indicate found-method and
found-thing-but-not-method) Maybe the real getattr functions
should be implemented in terms of the neutered version then,
that would at least solve some maintenance issues :-)

But time enough for that tomorrow or this weekend (if the
weather doesn't keep being so terribly sunny and warm.)
msg43171 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-04-18 22:22
Logged In: YES 
user_id=34209

Alright, here is a re-worked patch, with a toggle to choose
between a blatant copy-paste and some refactoring; see below.

The patch works by creating a new opcode, CALL_ATTR, which
is used for all <expression>.<name>(<args>) occurances. What
<expression> and <args> are, is not important, they are
compiled separately.

The CALL_ATTR opcode implementation is optimized for two
special cases: one where <expression> resulted in an
(old-style) instance, and one where <expression> resulted in
an instance of a new-style type of which the tp_getattro is
PyObject_GenericGetAttr.

The PyInstance part is done by savagely short-cutting the
usual getattr dance for instances; if it sees anything but a
PyFunction, it will fall back to a slow path. The rationale
is that if X in 'X.spam(' is an old-style class, and that
expression is not going to raise an exception, it is very
rare for 'spam' to be anything but a PyFunction. Trying to
cope with all the idiosyncracies of would slow down the
common case too much.

The PyObject_GenericGetAttr version uses a slightly modified
version of PyObject_GenericGetAttr that, when finding a
descr of the desired name, doesn't call the 'descr_get'
method but returns a status flag. The caller (call_attr)
then decides based on the type of the descr whether to call
descr_get or not. It currently only specialcases
PyFunctions. PyCFunctions, PyStaticMethods and
PyClassMethods are tricky to specialcase and/or need some of
the massaging that descr_get would do. I have not yet looked
at other callable descr's.

I had initially rewritten PyObject_GenericGetAttr() to use
the modified version, but this appears to be a significant
performance hit in normal attribute retrieval (quite common,
on newstyle classes.) Likewise, Brett and I had refactored
the call_function part of call_attr and call_function into a
separate function, but that, too, was a big hit in the
common function-calling case. Unfortunately, not doing that
refactoring means a lot of copied code, so I included both
in the patch. It may be that the slow path can be optimized
by simplyfying the refactored parts so that the compiler
understands how to inline them (e.g. the stackpointer
fudging call_function/call_callable does.) 

The default is the ugly-but-fast way, define
CALL_ATTR_SLOW_BUT_PRETTY_PATH to use the slow(er) path. The
slow(er) path is enough slower to nullify the benefit of the
patch in most of the benchmarks I ran; the fast path is only
slightly slower in some areas (probably due to cache
dynamics) but faster in every other situations, including
unexpected areas (that's not cache dynamics, of course,
that's just coder brilliance. :-)

However, finding a good benchmark is near impossible. I
added some newstyle-classes tests to PyBench, but even
normal tests were giving bizarrely negative results.
Checking those results with small scripts of timeit.py
showed entirely different results. And when pybench reported
a total 2% slowdown in the 'slow path' new code, it managed
to report that about 5% faster, consistently. timeit.py is
more consistent, and helped me determine the 'slow path' was
really slowing things down. Calling an empty method with no
arguments is about 20% faster for newstyle classes and about
30% for oldstyle classes, according to timeit.py.

Still no test for call_attr though.

I would love for people to test the code, both paths, and
give me input. I also welcome ideas on handling more
descr's, I may have missed a few unwritten rules about them.
msg43172 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-04-18 22:48
Logged In: YES 
user_id=34209

Oops, that patch contained a C++-ism, ceval.c:3504 and 3505
needed to be swapped. Uploaded a new version.
msg43173 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-04-19 18:42
Logged In: YES 
user_id=6380

Alas, like others, I see a tiny slowdown on pystone (maybe 3
percent).

This is with the default version of patch version 4 (fixed).  
msg43174 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-04-21 22:48
Logged In: YES 
user_id=34209

Pystone is not likely to show much speedup, as it contains
exactly 2 instances of CALL_ATTR, only barely in the main
loop. However, it should not slow down because of CALL_ATTR
either; the two CALL_ATTRs are of the most optimized sort,
old-style instance methods, and none of the other code paths
have changed *at all* (in the fast-and-ugly mode of the
patch, which is the default.)

There are only two reasons I can think of that explain a
slower pystone: code cache and the switch statement layout.
This apparently does not influence my (somewhat
high-end-ish) test machines, but does yours (and others.)
Both are more or less outside my control. They might be
fixed by switch reordering or rearranging of the code so the
compiler optimizes it better, but that's very platform
specific and lacking a proper test-bed for that situation, I
can't do it.

Alternatively, there may be some funk with regards to
bytecode version. If bytecode wasn't properly regenerated, I
can imagine not seeing *any* speedup. Have you tried
something as simple as 

./python timeit.py -s 'class X:' -s '  def spam(self): pass'
-s 'x = X()' 'x.spam()'

? This gives a good 30% speedup on my home PC. Bytecode
problems wouldn't influence pystone though.


msg43175 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-04-22 12:20
Logged In: YES 
user_id=6380

OK, I did a more realistic benchmark: startup time of Zope3.
With more or less current CVS Python 2.3 (but not Raymond
H's bytecode optimizations), it took 3.52 seconds. With your
patch (and all .pyc files rebuilt) it took 3.47 seconds. 
That's about a percent and a half. (With Python 2.2 it took
4.08 seconds.)
msg43176 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-04-22 12:38
Logged In: YES 
user_id=6380

However, you should be proud of yourself nevertheless. On
the same Zope3 startup "benchmark", Raymond Hettinger's
bytecode optimizations scored 0% improvement.  Python -O
didn't make any difference either. So it's a touch
benchmark! (And yes, I did remove the .pyc/.pyo files each
time.) (And no, I can't test -OO because Zope uses the
docstrings.)
msg43177 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-04-23 11:43
Logged In: YES 
user_id=34209

I'd not be proud in any case; neither the concept nor most
of the implementation was my idea :-) But my questions still
stand; the 3% slowdown in pystone is a lot if all you see is
a 1.5% gain in Zope3 startup time, but I can't think of any
ways to fix that before next friday, or before 2.3 is
released. (The problem is either code size or switch length,
which can be 'fixed' by culling code, re-ordering the switch
again or changing the switch into a computed-goto.) And I'd
need to find me a machine with less cache to actually notice
the slowdown :-)

msg43178 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-04-23 18:11
Logged In: YES 
user_id=6380

A suggestion: put some static global counters in the new
code that count how many times each of the significant
branches are taken, and a little routine that prints out the
counts on exit. Then people can run various benchmarks and
report the numbers, and that may help you decide which paths
are worth more work...
msg43179 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2003-05-02 12:53
Logged In: YES 
user_id=34209

Well, I just re-used the call-profiling stuff that Jeremy (I
think) already put in, but I don't see much use for getting
people to run it. It shows the expected number of CALL_ATTR
uses in e.g. pystone, no unexpected overheads inside modules
or anything.

The problem with the code isn't mispredicted branches, but
general slowdown if *none* of the branches are taken. The
only result I can think of of having people run with
call-profiling on would be to find out we need to
special-case other common tp_getattro functions -- which
would result in more code and thus likely more general slowdown.

However, patch with CALL_PROFILEing atatched. The call
profile code changed somewhat, as this code returns a dict
of counters, instead of a tuple. I found a 21-element tuple
somewhat unwieldy :) I use a code snippet like:

    print "Call stats:"
    items =  sys.callstats().items()
    items = [(value, key) for key, value in items]
    items.sort()
    items.reverse()
    for value,key in items:
        print "%30s: %30s"%(key, value)

at the end of testing code (pystone, regrtest, etc.) Don't
forget to #define CALL_PROFILE somewhere.
msg43180 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2005-02-27 18:58
Logged In: YES 
user_id=4771

By the way, I didn't see discussed anywhere the fact that
the present optimization introduces a semantic change: in
'x.name(expr)' it evaluates 'expr' before calling
'x.__getattribute__("name")'.

If we don't care, fine.  If we do, it seems to me the whole
approach is doomed...
msg43181 - (view) Author: Thomas Wouters (twouters) * (Python committer) Date: 2006-03-20 21:25
Logged In: YES 
user_id=34209

Closing this ancient patch, I don't think it applies anymore ;)
History
Date User Action Args
2003-03-25 23:16:09twouterscreate