Issue 26814: [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/71001

classification

Title:	[WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
Type:	performance	Stage:
Components:		Versions:	Python 3.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	jstasiak, larry, rhettinger, serhiy.storchaka, vstinner, yselivanov
Priority:	normal	Keywords:	patch

Created on 2016-04-21 08:57 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
call_stack.patch	vstinner, 2016-04-21 08:57		review
call_stack-2.patch	vstinner, 2016-04-21 10:42		review
call_stack-3.patch	vstinner, 2016-04-21 15:03		review
ad4a53ed1fbf.diff	vstinner, 2016-04-22 10:41		review
bench_fast.py	vstinner, 2016-04-22 11:40
bench_fast-2.py	vstinner, 2016-04-22 11:52
34456cce64bb.patch	vstinner, 2016-05-19 13:37		review

Repositories containing patches
https://hg.python.org/sandbox/fastcall

Messages (34)
msg263899 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-21 08:57
Attached patch adds the following new function:

   PyObject* _PyObject_CallStack(PyObject *func, PyObject **stack, int na, int nk);

where na is the number of positional arguments and nk is the number of (key, value) pairs stored in the stack.

Example of C code to call a function with one positional argument:

   PyObject *stack[1];
   stack[0] = arg;
   return _PyObject_CallStack(func, stack, 1, 0);

Simple, isn't it?

The difference with PyObject_Call() is that its API avoids the creation of a tuple and a dictionary to pass parameters to functions when possible. Currently, the temporary tuple and dict can be avoided when calling Python functions (nice, isn't it?) and C functions declared with METH_O (not the most common API, but many functions are declared like that).

The patch only modifies property_descr_set() to test the feature, but I'm sure that a lot of C code can be modified to use this new function to benefit from its optimization.

Should we make this new _PyObject_CallStack() function official: call it PyObject_CallStack() (without the underscore prefix), or experiment with it in CPython 3.6 and decide later to make it public? If it's kept private, it will require a large patch later to replace all calls to _PyObject_CallStack() with PyObject_CallStack() (strip the underscore prefix).

The next step is to add a new METH_STACK flag to pass parameters to C functions using a similar API (PyObject **stack, int na, int nk) and modify the argument clinic to use this new API.

Thanks to Larry Hasting who gave me the idea in a previous edition of Pycon US ;-)

This issue was created after the discussion on issue #26811, which is about an issue in a micro-optimization in property_descr_set() to avoid the creation of a tuple: it caches a private tuple inside property_descr_set().
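The (stack, na, nk) convention described in this message can be modelled in a few lines of Python. This is only an illustrative sketch and not the patch itself: call_stack and describe are hypothetical names, and the layout (positional arguments first, followed by keys and values stored alternately) is an assumption based on the description of na and nk above.

```python
def call_stack(func, stack, na, nk):
    """Hypothetical Python model of the proposed (stack, na, nk) convention.

    Assumed layout: `na` positional arguments, then `nk` (key, value)
    pairs stored flat as key0, value0, key1, value1, ...
    """
    if len(stack) != na + 2 * nk:
        raise ValueError("stack size does not match na and nk")
    positional = stack[:na]
    flat_pairs = stack[na:]
    keywords = {flat_pairs[i]: flat_pairs[i + 1] for i in range(0, 2 * nk, 2)}
    # A real fast path would hand the array straight to the callee; this
    # model only shows how the flat array is interpreted.
    return func(*positional, **keywords)


def describe(name, greeting="hello", punctuation="!"):
    return f"{greeting}, {name}{punctuation}"


# One positional argument, no keyword argument.
print(call_stack(describe, ["world"], 1, 0))
# One positional argument followed by two (key, value) pairs.
print(call_stack(describe, ["world", "greeting", "bye", "punctuation", "."], 1, 2))
```

The model still builds a list and a dict, so it says nothing about speed; it only makes the packing of the array concrete.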
msg263907 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-21 09:53
"Stack" in the function name looks a little confusing. I understand that this is related to the stack of bytecode interpreter, but this looks as raising pretty deep implementation detail. The way of packing positional and keyword arguments in the continuous array is not clear. Wouldn't be better to provide separate arguments for positional and keyword arguments? What is the performance effect of using this function? For example compare the performance of namedtuple's attribute access of current code, the code with with this patch, and unoptimized code in 3.4: ./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a" Is there any use of this function with keyword arguments?
msg263908 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-21 10:20
Microbenchmark on Python 3.6, best of 3 runs:

./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a"

* Python 3.6 unpatched: 0.968 usec
* call_stack.patch: 1.27 usec
* Python 3.6 with property_descr_get() of Python 3.4: 1.32 usec

"Python 3.6 with property_descr_get() of Python 3.4": replace the current optimization with "return PyObject_CallFunctionObjArgs(gs->prop_get, obj, NULL);".

Oh, in fact the tested code calls a property where the final function is operator.itemgetter(0). _PyObject_CallStack() creates a temporary tuple to call PyObject_Call(), which calls func->ob_type->tp_call, itemgetter_call().

Problem: the tp_call API uses (PyObject *args, PyObject *kwargs). It doesn't directly accept a stack (a C array of PyObject*). And it may be more difficult to modify tp_call.

In short, my patch disables the optimization on property with my current incomplete implementation.
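The same kind of measurement can be scripted with the timeit module instead of the command line. The sketch below is an approximation under stated assumptions: the loop count is an arbitrary choice (the command line lets timeit pick it), and Point and point are illustrative names.

```python
import timeit
from collections import namedtuple

Point = namedtuple("Point", "a b c")
point = Point(1, 2, 3)

# Repeat the attribute access inside one statement, as the command line
# does, so that loop overhead is spread over several accesses.
ACCESSES_PER_LOOP = 19
statement = "; ".join(["a.a"] * ACCESSES_PER_LOOP)

# Assumption: a fixed loop count keeps the run short.
LOOPS = 20_000

timer = timeit.Timer(statement, globals={"a": point})
# Keep the best of 11 repetitions, mirroring "-r 11".
best = min(timer.repeat(repeat=11, number=LOOPS)) / LOOPS

print(f"{best * 1e6:.3f} usec per loop ({ACCESSES_PER_LOOP} accesses each)")
```

Absolute numbers depend on the interpreter build and the machine, so only ratios between two builds measured the same way are meaningful.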
msg263909 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-21 10:28
See also issue23507. May be your function help to optimize filter(), map(), sorted()?
msg263910 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-21 10:42
call_stack-2.patch: A slightly more complete patch; it adds a tp_call_stack field to PyTypeObject and uses it in _PyObject_CallStack().

Updated microbenchmark on Python 3.6, best of 3 runs:

./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a"

* Python 3.6 unpatched: 0.968 usec
* call_stack.patch: 1.27 usec
* Python 3.6 with property_descr_get() of Python 3.4: 1.32 usec
* call_stack-2.patch: 0.664 usec

call_stack-2.patch makes this micro-benchmark 31% faster, not bad! It also makes calls to C functions almost 2x as fast if you replace current unoptimized calls with _PyObject_CallStack()!!

IMHO we should continue to experiment, making function calls 2x faster is worth it ;-)

Serhiy: "See also issue23507. May be your function help to optimize filter(), map(), sorted()?"

IMHO the API is generic enough to be usable in a lot of cases.

Serhiy: "Is there any use of this function with keyword arguments?"

Calling functions with keywords is probably the least common case for function calls in C code. But I would like to provide a fast function to call with keywords. Maybe we need two functions just to make the API cleaner? The difference would just be that "int nk" would be omitted?

I proposed an API (PyObject **stack, int na, int nk) based on the current code in Python/ceval.c. I'm not sure that it's the best API ever :-)

In fact, there is already PyObject_CallFunctionObjArgs() which can be modified to reuse internally _PyObject_CallStack(), and its API is maybe more convenient than my proposed API.
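The "31% faster" and "almost 2x" figures follow directly from the timings listed in this message. A small sketch of the arithmetic, where the helper names are illustrative:

```python
def percent_faster(before_usec, after_usec):
    """Relative reduction of the time per loop, as a percentage."""
    return (before_usec - after_usec) / before_usec * 100.0


def speedup(before_usec, after_usec):
    """How many times faster the new timing is."""
    return before_usec / after_usec


# Timings quoted in the message above (usec per loop).
unpatched = 0.968
py34_style_call = 1.32
call_stack_2 = 0.664

print(f"{percent_faster(unpatched, call_stack_2):.0f}% faster than unpatched Python 3.6")
print(f"{speedup(py34_style_call, call_stack_2):.2f}x as fast as the unoptimized call")
```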
msg263918 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-21 13:45
With call_stack-2.patch, attribute access in a namedtuple is only 25% slower than attribute access in an ordinary Python object! This is definitely worth continuing to experiment with!

But adding a new slot to PyTypeObject sets the bar too high. Try to use your function to speed up all cases mentioned in issue23507: sorted()/list.sort(), min() and max() with the key argument, filter(), map(), some iterators from itertools (groupby(), dropwhile(), takewhile(), accumulate(), filterfalse()), thin wrappers around special methods (round(), math.floor(), etc). Use it in wrappers around PyObject_Call() like PyObject_CallFunctionObjArgs(). Maybe this will have an effect even on some macrobenchmarks.
msg263920 - (view)	Author: Larry Hastings (larry) *	Date: 2016-04-21 14:24
Yes, I've been working on a patch to do this as well. I called the calling convention METH_RAW, to go alongside METH_ZERO, METH_O, etc. My calling convention was exactly the same as yours: PyObject *(PyObject *o, PyObject **stack, int na, int nk). I only had to modify two functions in ceval.c to support it: ext_do_call() and call_function().

And yes, the overarching goal was to have Argument Clinic generate custom argument parsing code for every function. Supporting the calling convention was the easy part; generating code was quite complicated. I believe I got a very simple version of it working at one point, supporting positional parameters only, with some optional arguments. Parsing arguments by hand gets very complicated indeed when you introduce keyword arguments.

I haven't touched this patch in most of a year. I hope to return to it someday. In the meantime it's fine by me if you add support for this and rewrite some functions by hand to use it.

p.s. My last name has two S's. If you continue to leave off one of them, I shall remove one from yours, Mr. TINNER.
msg263923 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-21 15:03
Since early microbenchmarks are promising, I wrote a more complete implementation which tries to use the fast path (avoid temporary tuple/dict) in all PyObject_Call() functions.

The next step would be to add a METH_FASTCALL flag. IMHO adding such a new flag requires enhancing Argument Clinic to be able to use it, at least when a function doesn't accept keyword parameters.

PyObject_CallFunction() & friends have a weird API: if called with the format string "O", the behaviour depends on whether the object parameter is a tuple or not. If it's a tuple, the tuple is unpacked. It's a little bit weird. I recall that it led to a bug in the implementation of generators in Python: issue #21209! Moreover, if the format string is "(...)", parentheses are ignored. If you want to call a function with one argument which is a tuple, you have to write "((...))". It's a little bit weird, but we cannot change that without breaking the (Python) world :-)

call_stack-3.patch:

* I renamed the main function to _PyObject_FastCall()
* I added PyObject_CallNoArg(): call a function with no parameter
* I added Py_VaBuildStack() and _Py_VaBuildStack_SizeT() helpers for PyObject_Call() functions using a format string
* I renamed the new slot to tp_fastcall

Nice change in the WITH_CLEANUP_START opcode (ceval.c):

-        /* XXX Not the fastest way to call it... */
-        res = PyObject_CallFunctionObjArgs(exit_func, exc, val, tb, NULL);
+        arg_stack[0] = exc;
+        arg_stack[1] = val;
+        arg_stack[2] = tb;
+        res = _PyObject_FastCall(exit_func, arg_stack, 3, 0);

I don't know if it's a common bytecode, nor if the change is really faster.
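The tuple-unpacking behaviour of PyObject_CallFunction() described above can be mimicked in pure Python. This is a hypothetical model, not the C implementation: call_function_model and count_args are made-up names, and built_value stands in for whatever object the format string would produce.

```python
def call_function_model(func, built_value):
    """Hypothetical model of the argument handling described above.

    `built_value` stands for the object built from the format string:
      "O" with object x    -> x itself
      "(OO)" with x, y     -> the tuple (x, y)
      "((OO))" with x, y   -> the tuple ((x, y),)
    """
    if isinstance(built_value, tuple):
        # A tuple is used directly as the argument tuple, i.e. unpacked.
        return func(*built_value)
    # Any other object is passed as the single argument.
    return func(built_value)


def count_args(*args):
    return len(args)


pair = (1, 2)
print(call_function_model(count_args, 42))       # "O" with a non-tuple object
print(call_function_model(count_args, pair))     # "O" with a tuple: unpacked
print(call_function_model(count_args, (pair,)))  # "((OO))": one tuple argument
```

In this model the same "O" call passes one argument or two depending only on the runtime type of the object, which is the surprise the message points out.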
msg263924 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-21 15:05
> I believe I got a very simple version of it working at one point, supporting positional parameters only, with some optional arguments.

Yeah, that would be a nice first step.

> p.s. My last name has two S's. If you continue to leave off one of them, I shall remove one from yours, Mr. TINNER.

Ooops, I'm sorry Guido Hastings :-(
msg263926 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-21 17:04
PyObject_Call*() implementations with _PyObject_FastCall() look much more complex than with PyObject_Call() (even not counting the additional complex functions in modsupport.c). And I'm not sure there is a benefit. Maybe for the first stage we can do without this.
msg263946 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-22 00:44
I created a repository. I will work there and run some experiments. It should give a better idea of the concrete performance. Once I have a better view of all the changes required to get the best performance everywhere, I will start a discussion to see which parts are worth it or not.

In my latest microbenchmarks, function calls (C/Python, mixed) are between 8% and 40% faster. I'm now running the CPython benchmark suite.
msg263995 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-22 11:10
Changes of my current implementation, ad4a53ed1fbf.diff.

The good thing is that all changes are internal (really?). Even if you don't modify your C extensions (nor your Python code), you should benefit from the new fast call in a lot of cases.

IMHO the trickiest part is the changes to PyTypeObject. Is it ok to add a new tp_fastcall slot? Should we add even more slots using the fast call convention like tp_fastnew and tp_fastinit? How should we handle the inheritance of types with that?

(*) Add 2 new public functions:

PyObject* PyObject_CallNoArg(PyObject *func);
PyObject* PyObject_CallArg1(PyObject *func, PyObject *arg);

(*) Add 1 new private function:

PyObject* _PyObject_FastCall(PyObject *func, PyObject **stack, int na, int nk);

_PyObject_FastCall() is the root of the new feature.

(*) type: add a new "tp_fastcall" field to the PyTypeObject structure.

It's unclear to me how inheritance is handled here. Maybe it's simply broken, but it's strange because it looks like it works :-) Maybe it's very rare that tp_call is overridden in a child class?

TODO: maybe reuse the "tp_call" field? (risk of major backward incompatibility...)

(*) slots: add a new "fastwrapper" field to the wrapperbase structure. Add a fast wrapper to all slots (really all? I should check).

I don't think that consumers of the C API are affected by this change, or maybe only a few projects.

TODO: maybe remove "fastwrapper" and reuse the "wrapper" field? (low risk of backward incompatibility?)

(*) Implement fast call for Python functions (_PyFunction_FastCall) and C functions (PyCFunction_FastCall)

(*) Add a new METH_FASTCALL calling convention for C functions. Right now, it is used for 4 builtin functions: sorted(), getattr(), iter(), next().

Argument Clinic should be modified to emit C code using this new fast calling convention.

(*) Implement fast call in the following functions (types):

- method()
- method_descriptor()
- wrapper_descriptor()
- method_wrapper()
- operator.itemgetter => used by collections.namedtuple to get an item by its name

(*) Modify PyObject_Call*() functions to reuse internally the fast call. "tp_fastcall" is preferred over "tp_call" (FIXME: is it really useful to do that?).

The following functions are able to avoid temporary tuple/dict without having to modify the code calling them:

- PyObject_CallFunction()
- PyObject_CallMethod(), _PyObject_CallMethodId()
- PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs()

It's not required to modify code using these functions to use the 3 new shiny functions (PyObject_CallNoArg, PyObject_CallArg1, _PyObject_FastCall). For example, replacing PyObject_CallFunctionObjArgs(func, NULL) with PyObject_CallNoArg(func) is just a micro-optimization, the tuple is already avoided. But PyObject_CallNoArg() should use less memory of the C stack and be a "little bit" faster.

(*) Add new helpers: new Include/pystack.h file, Py_VaBuildStack(), etc.

Please ignore unrelated changes.
msg263996 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-22 11:12
Related issue: issue #23507, "Tuple creation is too slow".
msg263999 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-22 11:40
Some microbenchmarks: bench_fast.py.

== Python 3.6 / Python 3.6 FASTCALL ==

----------------------------------+--------------+---------------
Tests                             | /tmp/default | /tmp/fastcall
----------------------------------+--------------+---------------
filter                            | 241 us (*)   | 166 us (-31%)
map                               | 205 us (*)   | 168 us (-18%)
sorted(list, key=lambda x: x)     | 242 us (*)   | 162 us (-33%)
sorted(list)                      | 27.7 us (*)  | 27.8 us
b=MyBytes(); bytes(b)             | 549 ns (*)   | 533 ns
namedtuple.attr                   | 2.03 us (*)  | 1.56 us (-23%)
object.__setattr__(obj, "x", 1)   | 347 ns (*)   | 218 ns (-37%)
object.__getattribute__(obj, "x") | 331 ns (*)   | 200 ns (-40%)
getattr(1, "real")                | 267 ns (*)   | 150 ns (-44%)
bounded_pymethod(1, 2)            | 193 ns (*)   | 190 ns
unbound_pymethod(obj, 1, 2)       | 195 ns (*)   | 192 ns
----------------------------------+--------------+---------------
Total                             | 719 us (*)   | 526 us (-27%)
----------------------------------+--------------+---------------

== Compare Python 3.4 / Python 3.6 / Python 3.6 FASTCALL ==

Common platform:
Timer: time.perf_counter
Python unicode implementation: PEP 393
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three
SCM: hg revision=abort: repository . not found! tag=abort: repository . not found! branch=abort: repository . not found! date=abort: no repository found in '/home/haypo/prog/python' (.hg not found)!
Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign /tmp/py34:
Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
Timer precision: 78 ns
Date: 2016-04-22 13:37:52

Platform of campaign /tmp/default:
Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer precision: 103 ns
Date: 2016-04-22 13:38:07

Platform of campaign /tmp/fastcall:
Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
Timer precision: 99 ns
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Date: 2016-04-22 13:38:21

----------------------------------+-------------+----------------+---------------
Tests                             | /tmp/py34   | /tmp/default   | /tmp/fastcall
----------------------------------+-------------+----------------+---------------
filter                            | 325 us (*)  | 241 us (-26%)  | 166 us (-49%)
map                               | 260 us (*)  | 205 us (-21%)  | 168 us (-35%)
sorted(list, key=lambda x: x)     | 354 us (*)  | 242 us (-32%)  | 162 us (-54%)
sorted(list)                      | 46.9 us (*) | 27.7 us (-41%) | 27.8 us (-41%)
b=MyBytes(); bytes(b)             | 839 ns (*)  | 549 ns (-35%)  | 533 ns (-36%)
namedtuple.attr                   | 4.51 us (*) | 2.03 us (-55%) | 1.56 us (-65%)
object.__setattr__(obj, "x", 1)   | 447 ns (*)  | 347 ns (-22%)  | 218 ns (-51%)
object.__getattribute__(obj, "x") | 401 ns (*)  | 331 ns (-17%)  | 200 ns (-50%)
getattr(1, "real")                | 236 ns (*)  | 267 ns (+13%)  | 150 ns (-36%)
bounded_pymethod(1, 2)            | 249 ns (*)  | 193 ns (-22%)  | 190 ns (-24%)
unbound_pymethod(obj, 1, 2)       | 251 ns (*)  | 195 ns (-22%)  | 192 ns (-23%)
----------------------------------+-------------+----------------+---------------
Total                             | 993 us (*)  | 719 us (-28%)  | 526 us (-47%)
----------------------------------+-------------+----------------+---------------
msg264003 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-22 11:52
For more fun, comparison between Python 2.7 / 3.4 / 3.6 / 3.6 FASTCALL.

----------------------------------+-------------+----------------+----------------+---------------
Tests                             | py27        | py34           | py36           | fast
----------------------------------+-------------+----------------+----------------+---------------
filter                            | 165 us (*)  | 318 us (+93%)  | 237 us (+43%)  | 165 us
map                               | 209 us (*)  | 258 us (+24%)  | 202 us         | 171 us (-18%)
sorted(list, key=lambda x: x)     | 272 us (*)  | 348 us (+28%)  | 237 us (-13%)  | 163 us (-40%)
sorted(list)                      | 33.7 us (*) | 47.8 us (+42%) | 27.3 us (-19%) | 27.7 us (-18%)
b=MyBytes(); bytes(b)             | 3.31 us (*) | 835 ns (-75%)  | 510 ns (-85%)  | 561 ns (-83%)
namedtuple.attr                   | 4.63 us (*) | 4.51 us        | 1.98 us (-57%) | 1.57 us (-66%)
object.__setattr__(obj, "x", 1)   | 463 ns (*)  | 440 ns         | 343 ns (-26%)  | 222 ns (-52%)
object.__getattribute__(obj, "x") | 323 ns (*)  | 396 ns (+23%)  | 316 ns         | 196 ns (-39%)
getattr(1, "real")                | 218 ns (*)  | 237 ns (+8%)   | 264 ns (+21%)  | 147 ns (-33%)
bounded_pymethod(1, 2)            | 213 ns (*)  | 244 ns (+14%)  | 194 ns (-9%)   | 188 ns (-12%)
unbound_pymethod(obj, 1, 2)       | 345 ns (*)  | 247 ns (-29%)  | 196 ns (-43%)  | 191 ns (-45%)
func()                            | 161 ns (*)  | 211 ns (+31%)  | 161 ns         | 157 ns
func(1, 2, 3)                     | 219 ns (*)  | 247 ns (+13%)  | 196 ns (-10%)  | 190 ns (-13%)
----------------------------------+-------------+----------------+----------------+---------------
Total                             | 689 us (*)  | 980 us (+42%)  | 707 us         | 531 us (-23%)
----------------------------------+-------------+----------------+----------------+---------------

I didn't know that Python 3.4 was so much slower than Python 2.7 on function calls!?

Note: Python 2.7 and Python 3.4 are system binaries (Fedora 22), whereas Python 3.6 and Python 3.6 FASTCALL are compiled manually.

Ignore "b=MyBytes(); bytes(b)", this benchmark is written for Python 3.

-- details:

Common platform:
Bits: int=32, long=64, long long=64, size_t=64, void*=64
Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Platform of campaign py27:
CFLAGS: -fno-strict-aliasing -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
Python unicode implementation: UCS-4
Timer precision: 954 ns
Python version: 2.7.10 (default, Sep 8 2015, 17:20:17) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Timer: time.time

Platform of campaign py34:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
Timer precision: 84 ns
Python unicode implementation: PEP 393
Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Timer: time.perf_counter

Platform of campaign py36:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Python unicode implementation: PEP 393
Timer: time.perf_counter

Platform of campaign fast:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Python unicode implementation: PEP 393
Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
msg264009 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-22 12:52
Could you compare filter(), map() and sorted() performance with your patch and with issue23507 patch?
msg264021 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-22 14:56
Results of the CPython benchmark suite on the revision 6c376e866330 of https://hg.python.org/sandbox/fastcall compared to CPython 3.6 at the revision 496e094f4734.

It's surprising that call_simple is 1.08x slower in fastcall. This slowdown is not acceptable and should be fixed. It probably explains why many other benchmarks are slower.

Fortunately, some benchmarks are faster, between 1.02x and 1.09x faster.

IMHO there are still performance issues in my current implementation that can and must be fixed. At least, we have a starting point to compare performances.

$ python3 -u perf.py ../default/python ../fastcall/python -b all
(...)
Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64
Total CPU cores: 8

[ slower ]

### 2to3 ###
6.859604 -> 6.985351: 1.02x slower

### call_method_slots ###
Min: 0.308846 -> 0.317780: 1.03x slower
Avg: 0.308902 -> 0.318667: 1.03x slower
Significant (t=-464.83)
Stddev: 0.00003 -> 0.00026: 9.8974x larger

### call_simple ###
Min: 0.232594 -> 0.251789: 1.08x slower
Avg: 0.232816 -> 0.252443: 1.08x slower
Significant (t=-911.97)
Stddev: 0.00024 -> 0.00011: 2.2373x smaller

### chaos ###
Min: 0.273084 -> 0.284790: 1.04x slower
Avg: 0.273951 -> 0.293177: 1.07x slower
Significant (t=-7.57)
Stddev: 0.00036 -> 0.01796: 49.9421x larger

### django_v3 ###
Min: 0.549604 -> 0.569982: 1.04x slower
Avg: 0.550557 -> 0.571038: 1.04x slower
Significant (t=-204.09)
Stddev: 0.00046 -> 0.00054: 1.1747x larger

### float ###
Min: 0.261939 -> 0.269224: 1.03x slower
Avg: 0.268475 -> 0.276515: 1.03x slower
Significant (t=-12.22)
Stddev: 0.00301 -> 0.00354: 1.1757x larger

### formatted_logging ###
Min: 0.325786 -> 0.334440: 1.03x slower
Avg: 0.326827 -> 0.335968: 1.03x slower
Significant (t=-34.44)
Stddev: 0.00129 -> 0.00136: 1.0503x larger

### mako_v2 ###
Min: 0.039642 -> 0.044765: 1.13x slower
Avg: 0.040251 -> 0.045562: 1.13x slower
Significant (t=-323.73)
Stddev: 0.00028 -> 0.00024: 1.1558x smaller

### meteor_contest ###
Min: 0.196589 -> 0.203667: 1.04x slower
Avg: 0.197497 -> 0.204782: 1.04x slower
Significant (t=-76.06)
Stddev: 0.00050 -> 0.00045: 1.1111x smaller

### nqueens ###
Min: 0.274664 -> 0.285866: 1.04x slower
Avg: 0.275285 -> 0.286774: 1.04x slower
Significant (t=-68.34)
Stddev: 0.00091 -> 0.00076: 1.2036x smaller

### pickle_list ###
Min: 0.262687 -> 0.269629: 1.03x slower
Avg: 0.263804 -> 0.270789: 1.03x slower
Significant (t=-50.14)
Stddev: 0.00070 -> 0.00070: 1.0004x larger

### raytrace ###
Min: 1.272960 -> 1.284516: 1.01x slower
Avg: 1.276398 -> 1.368574: 1.07x slower
Significant (t=-3.41)
Stddev: 0.00157 -> 0.19115: 122.0022x larger

### regex_compile ###
Min: 0.335753 -> 0.343820: 1.02x slower
Avg: 0.336273 -> 0.344894: 1.03x slower
Significant (t=-127.84)
Stddev: 0.00026 -> 0.00040: 1.5701x larger

### regex_effbot ###
Min: 0.048656 -> 0.050810: 1.04x slower
Avg: 0.048692 -> 0.051619: 1.06x slower
Significant (t=-69.92)
Stddev: 0.00002 -> 0.00030: 16.7793x larger

### silent_logging ###
Min: 0.069539 -> 0.071172: 1.02x slower
Avg: 0.069679 -> 0.071230: 1.02x slower
Significant (t=-124.08)
Stddev: 0.00009 -> 0.00002: 3.7073x smaller

### simple_logging ###
Min: 0.278439 -> 0.287736: 1.03x slower
Avg: 0.279504 -> 0.288811: 1.03x slower
Significant (t=-52.46)
Stddev: 0.00084 -> 0.00093: 1.1074x larger

### telco ###
Min: 0.012480 -> 0.013104: 1.05x slower
Avg: 0.012561 -> 0.013157: 1.05x slower
Significant (t=-100.42)
Stddev: 0.00004 -> 0.00002: 1.5881x smaller

### unpack_sequence ###
Min: 0.000047 -> 0.000048: 1.03x slower
Avg: 0.000047 -> 0.000048: 1.03x slower
Significant (t=-1170.16)
Stddev: 0.00000 -> 0.00000: 1.0749x larger

### unpickle_list ###
Min: 0.325310 -> 0.330080: 1.01x slower
Avg: 0.326484 -> 0.333974: 1.02x slower
Significant (t=-24.19)
Stddev: 0.00100 -> 0.00195: 1.9392x larger

[ faster ]

### chameleon_v2 ###
Min: 5.525575 -> 5.263668: 1.05x faster
Avg: 5.541444 -> 5.281893: 1.05x faster
Significant (t=85.79)
Stddev: 0.01107 -> 0.01831: 1.6539x larger

### etree_iterparse ###
Min: 0.212073 -> 0.197146: 1.08x faster
Avg: 0.215504 -> 0.200254: 1.08x faster
Significant (t=61.07)
Stddev: 0.00119 -> 0.00130: 1.0893x larger

### etree_parse ###
Min: 0.282983 -> 0.260390: 1.09x faster
Avg: 0.284333 -> 0.262758: 1.08x faster
Significant (t=77.34)
Stddev: 0.00102 -> 0.00169: 1.6628x larger

### etree_process ###
Min: 0.218953 -> 0.213683: 1.02x faster
Avg: 0.221036 -> 0.215280: 1.03x faster
Significant (t=25.98)
Stddev: 0.00114 -> 0.00108: 1.0580x smaller

### hexiom2 ###
Min: 122.001408 -> 118.967112: 1.03x faster
Avg: 122.108010 -> 119.110115: 1.03x faster
Significant (t=16.81)
Stddev: 0.15076 -> 0.20224: 1.3415x larger

### pathlib ###
Min: 0.088533 -> 0.084888: 1.04x faster
Avg: 0.088916 -> 0.085280: 1.04x faster
Significant (t=257.68)
Stddev: 0.00014 -> 0.00017: 1.1725x larger

The following not significant results are hidden, use -v to show them:
call_method, call_method_unknown, etree_generate, fannkuch, fastpickle, fastunpickle, go, json_dump_v2, json_load, nbody, normal_startup, pickle_dict, pidigits, regex_v8, richards, spectral_norm, startup_nosite, tornado_http.
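The "1.08x slower" and "1.09x faster" labels in this report can be reproduced from the Min timings. The sketch below assumes the ratio is the larger time divided by the smaller one, which is consistent with the figures quoted above; compare is an illustrative helper and not code taken from perf.py.

```python
def compare(old, new):
    """Describe a timing change in the style of the report lines above.

    Assumption: the ratio is larger time / smaller time, which matches
    the Min and Avg figures quoted in this report.
    """
    if new > old:
        return f"{new / old:.2f}x slower"
    if new < old:
        return f"{old / new:.2f}x faster"
    return "unchanged"


print("call_simple Min:", compare(0.232594, 0.251789))
print("etree_parse Min:", compare(0.282983, 0.260390))
```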
msg264098 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-24 06:37
I have collected statistics about the use of CALL_FUNCTION* opcodes in compiled code while running the CPython test suite. According to them, 99.4% of these emitted opcodes are the CALL_FUNCTION opcode, 89% of emitted CALL_FUNCTION opcodes have only positional arguments, and 98% of those have no more than 3 arguments.

That was about calls from Python code. All convenient C API functions (like PyObject_CallFunction and PyObject_CallFunctionObjArgs) used for direct calling in C code use only positional arguments.

Thus I think we need to optimize only cases of calling with small number (0-3) of positional arguments.
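A rough static analogue of these statistics can be gathered with the ast module. This is a sketch under a stated assumption: it counts call sites in source code (ast.Call nodes) rather than emitted CALL_FUNCTION* opcodes, so it is not the method used for the numbers above. The names call_stats and sample are illustrative.

```python
import ast


def call_stats(source):
    """Count call sites in `source` by shape.

    Returns (total calls, positional-only calls, positional-only calls
    with at most 3 arguments).
    """
    total = positional_only = small = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        total += 1
        has_star = any(isinstance(arg, ast.Starred) for arg in node.args)
        if node.keywords or has_star:
            continue  # keyword arguments, *args or **kwargs
        positional_only += 1
        if len(node.args) <= 3:
            small += 1
    return total, positional_only, small


sample = """
print(len(data))
sorted(data, key=str.lower)
max(1, 2, 3, 4)
func(*args)
"""

total, positional_only, small = call_stats(sample)
print(f"{total} calls, {positional_only} positional only, {small} with at most 3 arguments")
```

Running call_stats over the source of a large package gives a quick feeling for how dominant short positional calls are in a given code base.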
msg264101 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-24 07:15
> Thus I think we need to optimize only cases of calling with small number (0-3) of positional arguments.

My code is optimized for up to 10 positional arguments: with 0..10 arguments, the C stack is used to hold the array of PyObject*. For more arguments, an array is allocated in the heap memory.

+/* 10 positional parameters or 5 (key, value) pairs for keyword parameters.
+   40 bytes on 32-bit or 80 bytes on 64-bit. */
+# define _PyStack_SIZE 10

For keyword parameters, I don't know yet what is the best API (fastest API). Right now, I'm also using the same PyObject* array for positional and keyword arguments using "int nk", but maybe a dictionary is faster to combine keyword arguments and to parse keyword arguments.
msg264102 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-24 07:37
I think you can simplify the patch by dropping keyword argument support from fastcall. Then you can decrease _PyStack_SIZE to 4 (a larger size would serve only 1.7% of calls), and maybe refactor the code, since an array of 4 pointers consumes less C stack than an array of 10 pointers.
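The stack sizes under discussion are simple pointer arithmetic: the number of slots times the pointer size. A small sketch, where buffer_bytes is an illustrative helper and ctypes is only used to report the pointer size of the running interpreter:

```python
import ctypes

# Pointer size of the running interpreter: 4 on 32-bit, 8 on 64-bit builds.
POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)


def buffer_bytes(slots, pointer_size=POINTER_SIZE):
    """Bytes of C stack used by an array of `slots` object pointers."""
    return slots * pointer_size


for slots in (10, 4):
    print(f"{slots} slots: {buffer_bytes(slots, 4)} bytes on 32-bit, "
          f"{buffer_bytes(slots, 8)} bytes on 64-bit")
```

With 10 slots this gives the 40 and 80 bytes quoted in the patch comment; with 4 slots it gives 16 and 32 bytes.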
msg264518 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-29 20:35
Results of the CPython benchmark suite. Reference = default branch at rev 496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

I ran into many issues while trying to get reliable benchmark output:
* https://mail.python.org/pipermail/speed/2016-April/000329.html
* https://mail.python.org/pipermail/speed/2016-April/000341.html

The benchmark was run with CPU isolation. Both binaries were compiled with PGO+LTO.

Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64
Total CPU cores: 8

### call_method_slots ###
Min: 0.289704 -> 0.269634: 1.07x faster
Avg: 0.290149 -> 0.275953: 1.05x faster
Significant (t=162.17)
Stddev: 0.00019 -> 0.00150: 8.1176x larger

### call_method_unknown ###
Min: 0.275295 -> 0.302810: 1.10x slower
Avg: 0.280201 -> 0.309166: 1.10x slower
Significant (t=-200.65)
Stddev: 0.00161 -> 0.00191: 1.1909x larger

### call_simple ###
Min: 0.202163 -> 0.207939: 1.03x slower
Avg: 0.202332 -> 0.208662: 1.03x slower
Significant (t=-636.09)
Stddev: 0.00008 -> 0.00015: 2.0130x larger

### chameleon_v2 ###
Min: 4.349474 -> 3.901936: 1.11x faster
Avg: 4.377664 -> 3.942932: 1.11x faster
Significant (t=62.39)
Stddev: 0.01403 -> 0.06826: 4.8635x larger

### django_v3 ###
Min: 0.484456 -> 0.462013: 1.05x faster
Avg: 0.489186 -> 0.465189: 1.05x faster
Significant (t=53.10)
Stddev: 0.00415 -> 0.00180: 2.3096x smaller

### etree_generate ###
Min: 0.193538 -> 0.182069: 1.06x faster
Avg: 0.196306 -> 0.184403: 1.06x faster
Significant (t=65.94)
Stddev: 0.00140 -> 0.00115: 1.2181x smaller

### etree_iterparse ###
Min: 0.189955 -> 0.177583: 1.07x faster
Avg: 0.195268 -> 0.183411: 1.06x faster
Significant (t=27.04)
Stddev: 0.00316 -> 0.00304: 1.0386x smaller

### etree_process ###
Min: 0.166556 -> 0.158617: 1.05x faster
Avg: 0.168822 -> 0.160672: 1.05x faster
Significant (t=43.33)
Stddev: 0.00125 -> 0.00140: 1.1205x larger

### fannkuch ###
Min: 0.859842 -> 0.878412: 1.02x slower
Avg: 0.865138 -> 0.889188: 1.03x slower
Significant (t=-14.97)
Stddev: 0.00718 -> 0.01436: 2.0000x larger

### float ###
Min: 0.222095 -> 0.214706: 1.03x faster
Avg: 0.226273 -> 0.218210: 1.04x faster
Significant (t=21.61)
Stddev: 0.00307 -> 0.00212: 1.4469x smaller

### hexiom2 ###
Min: 100.489630 -> 94.765364: 1.06x faster
Avg: 101.204871 -> 94.885605: 1.07x faster
Significant (t=77.45)
Stddev: 0.25310 -> 0.05016: 5.0454x smaller

### meteor_contest ###
Min: 0.181076 -> 0.176904: 1.02x faster
Avg: 0.181759 -> 0.177783: 1.02x faster
Significant (t=43.68)
Stddev: 0.00061 -> 0.00067: 1.1041x larger

### nbody ###
Min: 0.208752 -> 0.217011: 1.04x slower
Avg: 0.211552 -> 0.219621: 1.04x slower
Significant (t=-69.45)
Stddev: 0.00080 -> 0.00084: 1.0526x larger

### pathlib ###
Min: 0.077121 -> 0.070698: 1.09x faster
Avg: 0.078310 -> 0.071958: 1.09x faster
Significant (t=133.39)
Stddev: 0.00069 -> 0.00081: 1.1735x larger

### pickle_dict ###
Min: 0.530379 -> 0.514363: 1.03x faster
Avg: 0.531325 -> 0.515902: 1.03x faster
Significant (t=154.33)
Stddev: 0.00086 -> 0.00050: 1.7213x smaller

### pickle_list ###
Min: 0.253445 -> 0.263959: 1.04x slower
Avg: 0.255362 -> 0.267402: 1.05x slower
Significant (t=-95.47)
Stddev: 0.00075 -> 0.00101: 1.3447x larger

### raytrace ###
Min: 1.071042 -> 1.030849: 1.04x faster
Avg: 1.076629 -> 1.109029: 1.03x slower
Significant (t=-3.93)
Stddev: 0.00199 -> 0.08246: 41.4609x larger

### regex_compile ###
Min: 0.286053 -> 0.273454: 1.05x faster
Avg: 0.287171 -> 0.274422: 1.05x faster
Significant (t=153.16)
Stddev: 0.00067 -> 0.00050: 1.3452x smaller

### regex_effbot ###
Min: 0.044186 -> 0.048192: 1.09x slower
Avg: 0.044336 -> 0.048513: 1.09x slower
Significant (t=-172.41)
Stddev: 0.00020 -> 0.00014: 1.4671x smaller

### richards ###
Min: 0.137456 -> 0.135029: 1.02x faster
Avg: 0.138993 -> 0.136028: 1.02x faster
Significant (t=20.35)
Stddev: 0.00116 -> 0.00088: 1.3247x smaller

### silent_logging ###
Min: 0.060288 -> 0.056344: 1.07x faster
Avg: 0.060380 -> 0.056518: 1.07x faster
Significant (t=310.27)
Stddev: 0.00011 -> 0.00005: 2.1029x smaller

### telco ###
Min: 0.010735 -> 0.010441: 1.03x faster
Avg: 0.010849 -> 0.010557: 1.03x faster
Significant (t=34.04)
Stddev: 0.00007 -> 0.00005: 1.3325x smaller

### unpickle_list ###
Min: 0.290750 -> 0.297958: 1.02x slower
Avg: 0.292741 -> 0.299419: 1.02x slower
Significant (t=-41.62)
Stddev: 0.00133 -> 0.00090: 1.4852x smaller

The following not significant results are hidden, use -v to show them:
2to3, call_method, chaos, etree_parse, fastpickle, fastunpickle, formatted_logging, go, json_dump_v2, json_load, mako_v2, normal_startup, nqueens, pidigits, regex_v8, simple_logging, spectral_norm, startup_nosite, tornado_http, unpack_sequence.
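For readers unfamiliar with the report format, the "N.NNx faster/slower" labels appear to be the ratio of the two timings, with the larger value on top. The sketch below reproduces two labels from the report above under that assumption; `speed_label` is a hypothetical helper written for this illustration, not a function from perf.py.

```python
def speed_label(base, patched):
    """Format a timing change like the report above (a lower time is better).

    Assumption: the label is simply the ratio of the two timings,
    rounded to two decimals. This is inferred from the numbers in the
    report, not taken from the perf.py source.
    """
    if patched <= base:
        return f"{base / patched:.2f}x faster"
    return f"{patched / base:.2f}x slower"


# Min timings of call_method_slots and call_method_unknown from the report.
print(speed_label(0.289704, 0.269634))
print(speed_label(0.275295, 0.302810))
```

The Stddev ratios in the report cannot be checked the same way from the printed values, since the displayed standard deviations are rounded to five decimals.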
msg264519 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-29 20:37
> Results of the CPython benchmark suite. Reference = default branch at rev 496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

Oh, I forgot to mention that I modified perf.py to run each benchmark using 10 fresh processes to test multiple random seeds for the randomized hash function, instead of testing a fixed seed (PYTHONHASHSEED=1). This change should reduce the noise in the benchmark results.

I ran the benchmark suite using --rigorous.

I will open a new issue later for my perf.py change.
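The idea of running each benchmark in several fresh processes with different hash seeds can be sketched as follows. This is a minimal illustration, not the actual perf.py change: `run_with_seeds` is a hypothetical helper, and only the `PYTHONHASHSEED` environment variable and the standard `subprocess` module are real.

```python
import os
import subprocess
import sys


def run_with_seeds(code, seeds):
    """Run `code` once per seed, each time in a fresh interpreter process.

    Each child gets its own PYTHONHASHSEED, so str/bytes hashing (and
    therefore dict and set layout) differs from one process to the next.
    """
    outputs = []
    for seed in seeds:
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        proc = subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        outputs.append(proc.stdout.strip())
    return outputs


if __name__ == "__main__":
    # Ten fresh processes, seeds 1..10; each prints the seed it was given.
    print(run_with_seeds("import os; print(os.environ['PYTHONHASHSEED'])",
                         range(1, 11)))
```

In a real benchmark run, `code` would be replaced by the benchmark command and the per-process timings would be aggregated afterwards.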
msg264525 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-29 21:43
Could you repeat the benchmarks on a different computer? Preferably with a different CPU or compiler.
msg264526 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-29 21:55
> Could you repeat the benchmarks on a different computer? Preferably with a different CPU or compiler.

Sorry, I don't really have the bandwidth to repeat the benchmarks. PGO+LTO compilation is slow, and running the benchmark suite in rigorous mode is very slow.

What do you expect from running the benchmark on a different computer?
msg264529 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-04-29 22:16
The results look like noise. Some tests became slower, others became faster. If results on a different machine show the same sets of slower and faster tests, then this is likely not noise.
msg264530 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-29 22:23
> The results look like noise.

As I wrote, it's really hard to get a reliable benchmark result. I did my best. See also the discussions about the CPython benchmark suite on the speed list: https://mail.python.org/pipermail/speed/

I'm not sure that you will get less noise on other computers. IMHO many benchmarks are simply "broken" (not reliable).
msg265856 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-05-19 13:30
Hi,

I made progress on my FASTCALL branch. I removed the tp_fastnew, tp_fastinit and tp_fastcall fields from PyTypeObject and replaced them with new type flags (ex: Py_TPFLAGS_FASTNEW) to avoid code duplication and reduce the memory footprint. Before, each function was simply duplicated.

This change is backward incompatible: it is no longer possible to call tp_new, tp_init and tp_call directly. I don't know yet if such a change would be acceptable in Python 3.6, nor if it is worth it.

I spent a lot of time on the CPython benchmark suite to check for performance regressions. In fact, I spent most of my time trying to understand why most benchmarks looked completely unstable. I have now tuned my system correctly and patched perf.py to get reliable benchmarks.

On the latest run of the benchmark suite, most benchmarks are faster! I have to investigate why 3 benchmarks are still slower. In the run, normal_startup was not significant, etree_parse was faster (instead of slower), but raytrace was already slower (but only 1.13x slower). It may be the "noise" of the PGO compilation. I already noticed that once: see the issue #27056 "pickle: constant propagation in _Unpickler_Read()".
Result of the benchmark suite:

slower (3):

* raytrace: 1.06x slower
* etree_parse: 1.03x slower
* normal_startup: 1.02x slower

faster (18):

* unpickle_list: 1.11x faster
* chameleon_v2: 1.09x faster
* etree_generate: 1.08x faster
* etree_process: 1.08x faster
* mako_v2: 1.06x faster
* call_method_unknown: 1.06x faster
* django_v3: 1.05x faster
* regex_compile: 1.05x faster
* etree_iterparse: 1.05x faster
* fastunpickle: 1.05x faster
* meteor_contest: 1.05x faster
* pickle_dict: 1.05x faster
* float: 1.04x faster
* pathlib: 1.04x faster
* silent_logging: 1.04x faster
* call_method: 1.03x faster
* json_dump_v2: 1.03x faster
* call_simple: 1.03x faster

not significant (21):

* 2to3
* call_method_slots
* chaos
* fannkuch
* fastpickle
* formatted_logging
* go
* json_load
* nbody
* nqueens
* pickle_list
* pidigits
* regex_effbot
* regex_v8
* richards
* simple_logging
* spectral_norm
* startup_nosite
* telco
* tornado_http
* unpack_sequence

I know that my patch is simply giant and cannot be merged like that. Since the performance is still promising, I plan to split my giant patch into smaller patches, easier to review. I will try to check that individual patches don't make Python slower. This work will take time.
msg265857 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-05-19 13:37
New patch: 34456cce64bb.patch

$ diffstat 34456cce64bb.patch
.hgignore | 3
Makefile.pre.in | 37
b/Doc/includes/shoddy.c | 2
b/Include/Python.h | 1
b/Include/abstract.h | 17
b/Include/descrobject.h | 14
b/Include/funcobject.h | 6
b/Include/methodobject.h | 6
b/Include/modsupport.h | 20
b/Include/object.h | 28
b/Lib/json/encoder.py | 1
b/Lib/test/test_extcall.py | 19
b/Lib/test/test_sys.py | 6
b/Modules/_collectionsmodule.c | 14
b/Modules/_csv.c | 15
b/Modules/_ctypes/_ctypes.c | 12
b/Modules/_ctypes/stgdict.c | 2
b/Modules/_datetimemodule.c | 47
b/Modules/_elementtree.c | 11
b/Modules/_functoolsmodule.c | 113 +-
b/Modules/_io/clinic/_iomodule.c.h | 8
b/Modules/_io/clinic/bufferedio.c.h | 42
b/Modules/_io/clinic/bytesio.c.h | 42
b/Modules/_io/clinic/fileio.c.h | 26
b/Modules/_io/clinic/iobase.c.h | 26
b/Modules/_io/clinic/stringio.c.h | 34
b/Modules/_io/clinic/textio.c.h | 40
b/Modules/_io/iobase.c | 4
b/Modules/_json.c | 24
b/Modules/_lsprof.c | 4
b/Modules/_operator.c | 11
b/Modules/_pickle.c | 106 -
b/Modules/_posixsubprocess.c | 15
b/Modules/_sre.c | 11
b/Modules/_ssl.c | 9
b/Modules/_testbuffer.c | 4
b/Modules/_testcapimodule.c | 4
b/Modules/_threadmodule.c | 32
b/Modules/_tkinter.c | 11
b/Modules/arraymodule.c | 29
b/Modules/cjkcodecs/clinic/multibytecodec.c.h | 50
b/Modules/clinic/_bz2module.c.h | 8
b/Modules/clinic/_codecsmodule.c.h | 318 +++--
b/Modules/clinic/_cryptmodule.c.h | 10
b/Modules/clinic/_datetimemodule.c.h | 8
b/Modules/clinic/_dbmmodule.c.h | 26
b/Modules/clinic/_elementtree.c.h | 86 -
b/Modules/clinic/_gdbmmodule.c.h | 26
b/Modules/clinic/_lzmamodule.c.h | 16
b/Modules/clinic/_opcode.c.h | 10
b/Modules/clinic/_pickle.c.h | 34
b/Modules/clinic/_sre.c.h | 124 +-
b/Modules/clinic/_ssl.c.h | 74 -
b/Modules/clinic/_tkinter.c.h | 50
b/Modules/clinic/_winapi.c.h | 124 +-
b/Modules/clinic/arraymodule.c.h | 34
b/Modules/clinic/audioop.c.h | 210 ++-
b/Modules/clinic/binascii.c.h | 36
b/Modules/clinic/cmathmodule.c.h | 24
b/Modules/clinic/fcntlmodule.c.h | 34
b/Modules/clinic/grpmodule.c.h | 14
b/Modules/clinic/md5module.c.h | 8
b/Modules/clinic/posixmodule.c.h | 642 ++++++-----
b/Modules/clinic/pyexpat.c.h | 32
b/Modules/clinic/sha1module.c.h | 8
b/Modules/clinic/sha256module.c.h | 14
b/Modules/clinic/sha512module.c.h | 14
b/Modules/clinic/signalmodule.c.h | 50
b/Modules/clinic/unicodedata.c.h | 42
b/Modules/clinic/zlibmodule.c.h | 68 -
b/Modules/itertoolsmodule.c | 20
b/Modules/main.c | 2
b/Modules/pyexpat.c | 3
b/Modules/signalmodule.c | 9
b/Modules/xxsubtype.c | 4
b/Objects/abstract.c | 403 ++++---
b/Objects/bytesobject.c | 2
b/Objects/classobject.c | 36
b/Objects/clinic/bytearrayobject.c.h | 90 -
b/Objects/clinic/bytesobject.c.h | 66 -
b/Objects/clinic/dictobject.c.h | 10
b/Objects/clinic/unicodeobject.c.h | 10
b/Objects/descrobject.c | 162 +-
b/Objects/dictobject.c | 26
b/Objects/enumobject.c | 8
b/Objects/exceptions.c | 91 +
b/Objects/fileobject.c | 29
b/Objects/floatobject.c | 25
b/Objects/funcobject.c | 77 -
b/Objects/genobject.c | 2
b/Objects/iterobject.c | 6
b/Objects/listobject.c | 20
b/Objects/longobject.c | 40
b/Objects/methodobject.c | 139 ++
b/Objects/object.c | 4
b/Objects/odictobject.c | 2
b/Objects/rangeobject.c | 12
b/Objects/tupleobject.c | 21
b/Objects/typeobject.c | 1463 +++++++++++++++++++-------
b/Objects/unicodeobject.c | 58 -
b/Objects/weakrefobject.c | 22
b/PC/clinic/msvcrtmodule.c.h | 42
b/PC/clinic/winreg.c.h | 128 +-
b/PC/clinic/winsound.c.h | 26
b/PCbuild/pythoncore.vcxproj | 4
b/Parser/tokenizer.c | 7
b/Python/ast.c | 31
b/Python/bltinmodule.c | 173 +--
b/Python/ceval.c | 591 +++++++++-
b/Python/clinic/bltinmodule.c.h | 104 +
b/Python/clinic/import.c.h | 18
b/Python/codecs.c | 17
b/Python/errors.c | 105 -
b/Python/getargs.c | 284 ++++-
b/Python/import.c | 27
b/Python/modsupport.c | 244 +++-
b/Python/pythonrun.c | 10
b/Python/sysmodule.c | 32
b/Tools/clinic/clinic.py | 115 +-
pystack.c | 288 +++++
pystack.h | 64 +
121 files changed, 5420 insertions(+), 2802 deletions(-)
msg265859 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-05-19 13:38
Status of my FASTCALL implementation (34456cce64bb.patch):

* Add METH_FASTCALL calling convention to C functions, similar to METH_VARARGS|METH_KEYWORDS
* Clinic uses METH_FASTCALL when possible (it may use METH_FASTCALL for all cases in the future)
* Add new C functions:
  - _PyObject_FastCall(func, stack, nargs, kwds): root of the FASTCALL branch
  - PyObject_CallNoArg(func)
  - PyObject_CallArg1(func, arg)
* Add new type flags changing the calling conventions of tp_new, tp_init and tp_call:
  - Py_TPFLAGS_FASTNEW
  - Py_TPFLAGS_FASTINIT
  - Py_TPFLAGS_FASTCALL
* Backward incompatible change of the Py_TPFLAGS_FASTNEW and Py_TPFLAGS_FASTINIT flags: calling type->tp_new() and type->tp_init() explicitly is now a bug and is likely to crash, since the calling convention can now be FASTCALL.
* New _PyType_CallNew() and _PyType_CallInit() functions to call tp_new and tp_init of a type. Functions which called tp_new and tp_init directly were patched.
* New helper functions to parse function arguments:
  - PyArg_ParseStack()
  - PyArg_ParseStackAndKeywords()
  - PyArg_UnpackStack()
* New Py_Build functions:
  - Py_BuildStack()
  - Py_VaBuildStack()
* New _PyStack API to handle a stack:
  - _PyStack_Alloc(), _PyStack_Free(), _PyStack_Copy()
  - _PyStack_FromTuple()
  - _PyStack_FromBorrowedTuple()
  - _PyStack_AsTuple(), _PyStack_AsTupleSlice()
  - ...
* Many changes were done in the typeobject.c file to handle FASTCALL and the new type flags, to handle flags correctly when a new type is created, etc.
* ceval.c: add _PyFunction_FastCall() function (somehow, I only exposed existing code)

A large part of the patch changes existing code to use the new calling convention in many functions of many modules. Some changes were generated by the Argument Clinic. IMHO the best would be to use Argument Clinic in more places, rather than patching the code manually.
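To see why calling a slot directly becomes unsafe once a type flag can change its calling convention, here is a toy Python model of the dispatch. It is only an illustration: the real code is C, and `FLAG_FASTCALL`, `Callee` and `call` are hypothetical names that do not mirror CPython's structures or the patch itself.

```python
# Toy model only. FLAG_FASTCALL stands in for a type flag such as
# Py_TPFLAGS_FASTCALL; nothing here is CPython's real data layout.
FLAG_FASTCALL = 1 << 0


class Callee:
    """A callable slot plus flags saying which convention the slot expects."""

    def __init__(self, slot, flags=0):
        self.slot = slot
        self.flags = flags


def call(callee, stack, nargs, kwds=None):
    """Single entry point that picks the convention from the flag.

    This plays the role that helpers like _PyType_CallNew() are described
    as playing in the message above (an assumption about their purpose).
    """
    if callee.flags & FLAG_FASTCALL:
        # Fast convention: hand over the argument array as-is, no tuple built.
        return callee.slot(stack, nargs, kwds)
    # Classic convention: pack a temporary tuple and dict for the callee.
    return callee.slot(tuple(stack[:nargs]), kwds or {})


def classic_add(args, kwargs):
    return args[0] + args[1]


def fast_add(stack, nargs, kwds):
    return stack[0] + stack[1]


print(call(Callee(classic_add), [1, 2], 2))
print(call(Callee(fast_add, FLAG_FASTCALL), [1, 2], 2))
```

Calling `Callee(fast_add, FLAG_FASTCALL).slot((1, 2), {})` directly would pass the wrong number and kind of arguments, which is the Python-level analogue of the crash described for direct tp_new and tp_init calls.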
msg265887 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-05-19 19:51
> Result of the benchmark suite:
>
> slower (3):
>
> * raytrace: 1.06x slower
> * etree_parse: 1.03x slower
> * normal_startup: 1.02x slower

Hum, I recompiled the patched Python, again with PGO+LTO, and ran the same benchmark with the same command. In short, I replayed exactly the same scenario. And... Only raytrace remains slower, etree_parse and normal_startup moved to the "not significant" list.

The difference in the benchmark result doesn't come from the benchmark. For example, I ran the normal_startup benchmark again 3 times: I got the same result 3 times.

### normal_startup ###
Avg: 0.295168 +/- 0.000991 -> 0.294926 +/- 0.00048: 1.00x faster
Not significant

### normal_startup ###
Avg: 0.294871 +/- 0.000606 -> 0.294883 +/- 0.00072: 1.00x slower
Not significant

### normal_startup ###
Avg: 0.295096 +/- 0.000706 -> 0.294967 +/- 0.00068: 1.00x faster
Not significant

IMHO the difference comes from the data collected by PGO.
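The comparison between the two builds can be written down as a small set computation, which is also the check suggested earlier in the thread: a slowdown that shows up in every run is more likely real than one that comes and goes. The sketch uses only the "slower" lists reported in this thread; `stable_and_unstable` is a hypothetical helper.

```python
def stable_and_unstable(first, second):
    """Split benchmark names into those flagged in both runs and in only one."""
    stable = first & second    # flagged slower in both runs
    unstable = first ^ second  # flagged slower in exactly one run
    return stable, unstable


# "slower" lists from the two PGO+LTO builds of the same patched source.
run1_slower = {"raytrace", "etree_parse", "normal_startup"}
run2_slower = {"raytrace"}

stable, unstable = stable_and_unstable(run1_slower, run2_slower)
print("consistently slower:", sorted(stable))
print("changed between builds:", sorted(unstable))
```

Two runs are a very small sample, so this only separates candidates for investigation from likely build-to-build noise.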
msg265896 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-05-19 21:03
> In short, I replayed exactly the same scenario. And... Only raytrace remains slower, (...)

Oh, it looks like the reference binary calls the garbage collector less frequently than the patched Python. In the patched Python, generation 2 collections are needed, whereas the reference binary needs no generation 2 collection.
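One way to observe this kind of difference from Python is to read the per-generation counters of the `gc` module before and after a workload. The message does not say how the observation was made, so this is only a sketch of one possible method; `gen2_collections` is a hypothetical helper, while `gc.get_stats()` is the standard library call.

```python
import gc


def gen2_collections(workload):
    """Return how many generation-2 collections ran while workload() executed.

    gc.get_stats() returns one dict per generation; index 2 is the oldest
    generation, and its "collections" entry is a running counter.
    """
    before = gc.get_stats()[2]["collections"]
    workload()
    after = gc.get_stats()[2]["collections"]
    return after - before


def allocate_cycles():
    # Build many small reference cycles so the collector has work to do.
    for _ in range(100000):
        a = []
        a.append(a)


print("generation 2 collections:", gen2_collections(allocate_cycles))
```

Running the same workload under the reference and the patched binaries and comparing the two counts would show whether one of them triggers full collections that the other avoids.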
msg265938 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-05-20 12:05
> unpickle_list: 1.11x faster

This result was unfair: my fastcall branch contained the optimization of the issue #27056. I just pushed this optimization into the default branch.

I ran the benchmark again: the result is now "not significant", as expected. The benchmark is a microbenchmark testing C functions of Modules/_pickle.c, so it doesn't really rely on the performance of (C or Python) function calls.
msg266359 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-05-25 14:05
I fixed even more issues with my setup to run benchmarks. Results should be even more reliable. Moreover, I fixed multiple reference leaks in the code which introduced performance regressions. I started to write articles to explain how to run stable benchmarks:

* https://haypo.github.io/journey-to-stable-benchmark-system.html
* https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
* https://haypo.github.io/journey-to-stable-benchmark-average.html

Summary of benchmarks at the revision e6f3bf996c01:

Faster (25):

- pickle_list: 1.29x faster
- etree_generate: 1.22x faster
- pickle_dict: 1.19x faster
- etree_process: 1.16x faster
- mako_v2: 1.13x faster
- telco: 1.09x faster
- raytrace: 1.08x faster
- etree_iterparse: 1.08x faster
- regex_compile: 1.07x faster
- json_dump_v2: 1.07x faster
- etree_parse: 1.06x faster
- regex_v8: 1.05x faster
- call_method_unknown: 1.05x faster
- chameleon_v2: 1.05x faster
- fastunpickle: 1.04x faster
- django_v3: 1.04x faster
- chaos: 1.04x faster
- 2to3: 1.03x faster
- pathlib: 1.03x faster
- unpickle_list: 1.03x faster
- json_load: 1.03x faster
- fannkuch: 1.03x faster
- call_method: 1.02x faster
- unpack_sequence: 1.02x faster
- call_method_slots: 1.02x faster

Slower (4):

- regex_effbot: 1.08x slower
- nbody: 1.08x slower
- spectral_norm: 1.07x slower
- nqueens: 1.06x slower

Not significant (13):

- tornado_http
- startup_nosite
- simple_logging
- silent_logging
- richards
- pidigits
- normal_startup
- meteor_contest
- go
- formatted_logging
- float
- fastpickle
- call_simple

I'm now investigating why 4 benchmarks are slower.

Note: I'm still using my patched CPython benchmark suite to get more stable benchmarks. I will send patches upstream later.
msg274124 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-09-01 13:15
I split the giant patch into smaller patches that are easier to review. The first part (_PyObject_FastCall, _PyObject_FastCallDict) is already merged. Other issues were opened to implement the full feature. I am now closing this issue.

History
Date	User	Action	Args
2022-04-11 14:58:29	admin	set	github: 71001
2016-09-01 13:15:25	vstinner	set	status: open -> closed resolution: fixed messages: + msg274124
2016-05-25 14:05:19	vstinner	set	messages: + msg266359
2016-05-20 12:05:27	vstinner	set	messages: + msg265938
2016-05-19 21:03:51	vstinner	set	messages: + msg265896
2016-05-19 19:51:46	vstinner	set	messages: + msg265887
2016-05-19 13:38:54	vstinner	set	messages: + msg265859
2016-05-19 13:38:19	vstinner	set	files: + 34456cce64bb.patch messages: + msg265857
2016-05-19 13:36:19	vstinner	set	files: - 34456cce64bb.diff
2016-05-19 13:35:17	vstinner	set	files: + 34456cce64bb.diff
2016-05-19 13:30:46	vstinner	set	messages: + msg265856
2016-05-09 22:55:14	jstasiak	set	nosy: + jstasiak
2016-04-29 22:23:44	vstinner	set	messages: + msg264530
2016-04-29 22:16:35	serhiy.storchaka	set	messages: + msg264529
2016-04-29 21:55:12	vstinner	set	messages: + msg264526
2016-04-29 21:43:53	serhiy.storchaka	set	messages: + msg264525
2016-04-29 20:37:52	vstinner	set	messages: + msg264519
2016-04-29 20:35:56	vstinner	set	messages: + msg264518
2016-04-24 07:37:58	serhiy.storchaka	set	messages: + msg264102
2016-04-24 07:15:35	vstinner	set	messages: + msg264101
2016-04-24 06:37:35	serhiy.storchaka	set	messages: + msg264098
2016-04-22 14:56:57	vstinner	set	messages: + msg264021
2016-04-22 12:52:39	serhiy.storchaka	set	messages: + msg264009
2016-04-22 11:52:19	vstinner	set	files: + bench_fast-2.py messages: + msg264003
2016-04-22 11:40:11	vstinner	set	files: + bench_fast.py messages: + msg263999
2016-04-22 11:12:30	vstinner	set	messages: + msg263996
2016-04-22 11:10:16	vstinner	set	messages: + msg263995
2016-04-22 10:41:52	vstinner	set	files: + ad4a53ed1fbf.diff
2016-04-22 00:44:11	vstinner	set	messages: + msg263946 title: Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments -> [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
2016-04-22 00:41:28	vstinner	set	hgrepos: + hgrepo342
2016-04-21 17:04:54	serhiy.storchaka	set	messages: + msg263926
2016-04-21 15:05:13	vstinner	set	messages: + msg263924
2016-04-21 15:03:09	vstinner	set	files: + call_stack-3.patch messages: + msg263923 title: Add a new _PyObject_CallStack() function which avoids the creation of a tuple or dict for arguments -> Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
2016-04-21 14:24:16	larry	set	messages: + msg263920
2016-04-21 13:45:49	serhiy.storchaka	set	messages: + msg263918
2016-04-21 10:42:27	vstinner	set	files: + call_stack-2.patch messages: + msg263910
2016-04-21 10:28:27	serhiy.storchaka	set	messages: + msg263909
2016-04-21 10:20:50	vstinner	set	messages: + msg263908
2016-04-21 09:53:53	serhiy.storchaka	set	messages: + msg263907
2016-04-21 08:58:02	vstinner	set	nosy: + yselivanov
2016-04-21 08:57:21	vstinner	create