Issue709744
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2003-03-25 23:16 by twouters, last changed 2022-04-10 16:07 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
call_attr.diff | twouters, 2003-03-25 23:18 | 1st call_attr patch | ||
call_attr2.diff | twouters, 2003-04-17 23:41 | 2nd call_attr patch | ||
call_attr4.diff | twouters, 2003-04-18 22:48 | 4th call_attr patch (fixed) | ||
call_attr5.diff | twouters, 2003-05-02 12:53 | 5th call_attr patch (with CALL_PROFILE) |
Messages (16) | |||
---|---|---|---|
msg43166 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-03-25 23:16 | |
The result of the PyCore sprint of me and Brett: the CALL_ATTR opcode (LOAD_ATTR and CALL_FUNCTION combined) that skips the PyMethod creation and destruction for classic classes (but not newstyle classes, yet.) The code is somewhat rough yet, it needs commenting, some renaming, and most importantly testing. It seems to work, however, and provides between a 35% and 5% speedup. (5% in 'average' code, up to 35% in instance method calls and instance creation alone.) It also needs to be updated to include newstyle classes. I will likely work on this on the flight home. |
|||
msg43167 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-03-25 23:18 | |
Logged In: YES user_id=34209 attaching patch. |
|||
msg43168 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-04-17 23:41 | |
Logged In: YES user_id=34209 Revised patch that includes avoiding the wrapping of (Python) methods on newstyle classes as well as oldstyle classes. The patch checks to see if a particular type uses PyObject_GenericGetAttr, and if so, uses a near-copy of that function to get an unwrapped PyFunction object. PyCFunctionObject objects are not magically treated, and fall back to the slow path... PyCFunction's descr's don't have direct access to an 'unwrapped' version, they create PyCFunctionObjectss on demand based on a PyCFunction -- a function pointer. Some simple testing (using timeit.py) suggests about a 20% increase in speed for 'x.foo()' where 'x' is a newstyle class instance. However, a real-world-ish benchmark (very hard to find, for newstyle classes) suggests otherwise: running 'timeit.py' on "pass" consistently takes about 3% longer. I'm certain the problem lies in the fact that the patch doesn't consider PyCFunctions, which forces part of the slow MRO lookup and instnace-dict checking to be re-done for C-functions on newstyle types (of which there are a heck of a lot.) Either handling PyMethodDescrs the same way as PyFunctionObjects, or circumventing the slow path in some way (by returning non-method-but-found objects) will probably fix that. |
|||
msg43169 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2003-04-18 00:29 | |
Logged In: YES user_id=6380 I would think that you should avoid the double lookups somehow... How difficult is that? |
|||
msg43170 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-04-18 00:42 | |
Logged In: YES user_id=34209 Well, currently, the neutered getattr functions just bail out and return NULL whenever they don't get what they expect. I guess they go back to being 'full' getattr's, with the exception that they 'return' two values: the retrieved object, and a status indicator (to indicate found-method and found-thing-but-not-method) Maybe the real getattr functions should be implemented in terms of the neutered version then, that would at least solve some maintenance issues :-) But time enough for that tomorrow or this weekend (if the weather doesn't keep being so terribly sunny and warm.) |
|||
msg43171 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-04-18 22:22 | |
Logged In: YES user_id=34209 Alright, here is a re-worked patch, with a toggle to choose between a blatant copy-paste and some refactoring; see below. The patch works by creating a new opcode, CALL_ATTR, which is used for all <expression>.<name>(<args>) occurances. What <expression> and <args> are, is not important, they are compiled separately. The CALL_ATTR opcode implementation is optimized for two special cases: one where <expression> resulted in an (old-style) instance, and one where <expression> resulted in an instance of a new-style type of which the tp_getattro is PyObject_GenericGetAttr. The PyInstance part is done by savagely short-cutting the usual getattr dance for instances; if it sees anything but a PyFunction, it will fall back to a slow path. The rationale is that if X in 'X.spam(' is an old-style class, and that expression is not going to raise an exception, it is very rare for 'spam' to be anything but a PyFunction. Trying to cope with all the idiosyncracies of would slow down the common case too much. The PyObject_GenericGetAttr version uses a slightly modified version of PyObject_GenericGetAttr that, when finding a descr of the desired name, doesn't call the 'descr_get' method but returns a status flag. The caller (call_attr) then decides based on the type of the descr whether to call descr_get or not. It currently only specialcases PyFunctions. PyCFunctions, PyStaticMethods and PyClassMethods are tricky to specialcase and/or need some of the massaging that descr_get would do. I have not yet looked at other callable descr's. I had initially rewritten PyObject_GenericGetAttr() to use the modified version, but this appears to be a significant performance hit in normal attribute retrieval (quite common, on newstyle classes.) Likewise, Brett and I had refactored the call_function part of call_attr and call_function into a separate function, but that, too, was a big hit in the common function-calling case. Unfortunately, not doing that refactoring means a lot of copied code, so I included both in the patch. It may be that the slow path can be optimized by simplyfying the refactored parts so that the compiler understands how to inline them (e.g. the stackpointer fudging call_function/call_callable does.) The default is the ugly-but-fast way, define CALL_ATTR_SLOW_BUT_PRETTY_PATH to use the slow(er) path. The slow(er) path is enough slower to nullify the benefit of the patch in most of the benchmarks I ran; the fast path is only slightly slower in some areas (probably due to cache dynamics) but faster in every other situations, including unexpected areas (that's not cache dynamics, of course, that's just coder brilliance. :-) However, finding a good benchmark is near impossible. I added some newstyle-classes tests to PyBench, but even normal tests were giving bizarrely negative results. Checking those results with small scripts of timeit.py showed entirely different results. And when pybench reported a total 2% slowdown in the 'slow path' new code, it managed to report that about 5% faster, consistently. timeit.py is more consistent, and helped me determine the 'slow path' was really slowing things down. Calling an empty method with no arguments is about 20% faster for newstyle classes and about 30% for oldstyle classes, according to timeit.py. Still no test for call_attr though. I would love for people to test the code, both paths, and give me input. I also welcome ideas on handling more descr's, I may have missed a few unwritten rules about them. |
|||
msg43172 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-04-18 22:48 | |
Logged In: YES user_id=34209 Oops, that patch contained a C++-ism, ceval.c:3504 and 3505 needed to be swapped. Uploaded a new version. |
|||
msg43173 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2003-04-19 18:42 | |
Logged In: YES user_id=6380 Alas, like others, I see a tiny slowdown on pystone (maybe 3 percent). This is with the default version of patch version 4 (fixed). |
|||
msg43174 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-04-21 22:48 | |
Logged In: YES user_id=34209 Pystone is not likely to show much speedup, as it contains exactly 2 instances of CALL_ATTR, only barely in the main loop. However, it should not slow down because of CALL_ATTR either; the two CALL_ATTRs are of the most optimized sort, old-style instance methods, and none of the other code paths have changed *at all* (in the fast-and-ugly mode of the patch, which is the default.) There are only two reasons I can think of that explain a slower pystone: code cache and the switch statement layout. This apparently does not influence my (somewhat high-end-ish) test machines, but does yours (and others.) Both are more or less outside my control. They might be fixed by switch reordering or rearranging of the code so the compiler optimizes it better, but that's very platform specific and lacking a proper test-bed for that situation, I can't do it. Alternatively, there may be some funk with regards to bytecode version. If bytecode wasn't properly regenerated, I can imagine not seeing *any* speedup. Have you tried something as simple as ./python timeit.py -s 'class X:' -s ' def spam(self): pass' -s 'x = X()' 'x.spam()' ? This gives a good 30% speedup on my home PC. Bytecode problems wouldn't influence pystone though. |
|||
msg43175 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2003-04-22 12:20 | |
Logged In: YES user_id=6380 OK, I did a more realistic benchmark: startup time of Zope3. With more or less current CVS Python 2.3 (but not Raymond H's bytecode optimizations), it took 3.52 seconds. With your patch (and all .pyc files rebuilt) it took 3.47 seconds. That's about a percent and a half. (With Python 2.2 it took 4.08 seconds.) |
|||
msg43176 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2003-04-22 12:38 | |
Logged In: YES user_id=6380 However, you should be proud of yourself nevertheless. On the same Zope3 startup "benchmark", Raymond Hettinger's bytecode optimizations scored 0% improvement. Python -O didn't make any difference either. So it's a touch benchmark! (And yes, I did remove the .pyc/.pyo files each time.) (And no, I can't test -OO because Zope uses the docstrings.) |
|||
msg43177 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-04-23 11:43 | |
Logged In: YES user_id=34209 I'd not be proud in any case; neither the concept nor most of the implementation was my idea :-) But my questions still stand; the 3% slowdown in pystone is a lot if all you see is a 1.5% gain in Zope3 startup time, but I can't think of any ways to fix that before next friday, or before 2.3 is released. (The problem is either code size or switch length, which can be 'fixed' by culling code, re-ordering the switch again or changing the switch into a computed-goto.) And I'd need to find me a machine with less cache to actually notice the slowdown :-) |
|||
msg43178 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2003-04-23 18:11 | |
Logged In: YES user_id=6380 A suggestion: put some static global counters in the new code that count how many times each of the significant branches are taken, and a little routine that prints out the counts on exit. Then people can run various benchmarks and report the numbers, and that may help you decide which paths are worth more work... |
|||
msg43179 - (view) | Author: Thomas Wouters (twouters) * | Date: 2003-05-02 12:53 | |
Logged In: YES user_id=34209 Well, I just re-used the call-profiling stuff that Jeremy (I think) already put in, but I don't see much use for getting people to run it. It shows the expected number of CALL_ATTR uses in e.g. pystone, no unexpected overheads inside modules or anything. The problem with the code isn't mispredicted branches, but general slowdown if *none* of the branches are taken. The only result I can think of of having people run with call-profiling on would be to find out we need to special-case other common tp_getattro functions -- which would result in more code and thus likely more general slowdown. However, patch with CALL_PROFILEing atatched. The call profile code changed somewhat, as this code returns a dict of counters, instead of a tuple. I found a 21-element tuple somewhat unwieldy :) I use a code snippet like: print "Call stats:" items = sys.callstats().items() items = [(value, key) for key, value in items] items.sort() items.reverse() for value,key in items: print "%30s: %30s"%(key, value) at the end of testing code (pystone, regrtest, etc.) Don't forget to #define CALL_PROFILE somewhere. |
|||
msg43180 - (view) | Author: Armin Rigo (arigo) * | Date: 2005-02-27 18:58 | |
Logged In: YES user_id=4771 By the way, I didn't see discussed anywhere the fact that the present optimization introduces a semantic change: in 'x.name(expr)' it evaluates 'expr' before calling 'x.__getattribute__("name")'. If we don't care, fine. If we do, it seems to me the whole approach is doomed... |
|||
msg43181 - (view) | Author: Thomas Wouters (twouters) * | Date: 2006-03-20 21:25 | |
Logged In: YES user_id=34209 Closing this ancient patch, I don't think it applies anymore ;) |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:07:53 | admin | set | github: 38216 |
2003-03-25 23:16:09 | twouters | create |