classification
Title: [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
Type: performance Stage:
Components: Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: jstasiak, larry, rhettinger, serhiy.storchaka, vstinner, yselivanov
Priority: normal Keywords: patch

Created on 2016-04-21 08:57 by vstinner, last changed 2016-09-01 13:15 by vstinner. This issue is now closed.

Files
File name Uploaded Description
call_stack.patch vstinner, 2016-04-21 08:57 review
call_stack-2.patch vstinner, 2016-04-21 10:42 review
call_stack-3.patch vstinner, 2016-04-21 15:03 review
ad4a53ed1fbf.diff vstinner, 2016-04-22 10:41 review
bench_fast.py vstinner, 2016-04-22 11:40
bench_fast-2.py vstinner, 2016-04-22 11:52
34456cce64bb.patch vstinner, 2016-05-19 13:37 review
Repositories containing patches
https://hg.python.org/sandbox/fastcall
Messages (34)
msg263899 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-21 08:57
Attached patch adds the following new function:

   PyObject* _PyObject_CallStack(PyObject *func,
                                 PyObject **stack, 
                                 int na, int nk);

where na is the number of positional arguments and nk is the number of (key, value) pairs of keyword arguments stored in the stack.

Example of C code to call a function with one positional argument:

    PyObject *stack[1];
    stack[0] = arg;
    return _PyObject_CallStack(func, stack, 1, 0);

Simple, isn't it?
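To make the argument layout concrete, here is a rough Python model of it. This is only a sketch: unpack_stack() and call_stack() are hypothetical names, and it assumes the array holds the na positional values first, followed by the nk (key, value) pairs as 2 * nk consecutive entries. It models the slow path that rebuilds the tuple and dict, which is exactly the work the fast path tries to skip.

```python
def unpack_stack(stack, na, nk):
    """Split a flat argument array into (args, kwargs).

    Assumed layout: na positional values, then nk (key, value)
    pairs stored as 2 * nk consecutive entries.
    """
    if len(stack) != na + 2 * nk:
        raise ValueError("stack size does not match na and nk")
    args = tuple(stack[:na])
    kwargs = {}
    # Walk the keyword part two entries at a time: key, then value.
    for i in range(na, na + 2 * nk, 2):
        kwargs[stack[i]] = stack[i + 1]
    return args, kwargs


def call_stack(func, stack, na, nk):
    """Slow-path model: rebuild the tuple and dict, then call func."""
    args, kwargs = unpack_stack(stack, na, nk)
    return func(*args, **kwargs)
```

For example, call_stack(divmod, [7, 2], 2, 0) passes two positional arguments and no keywords.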

The difference with PyObject_Call() is that its API avoids the creation of a tuple and a dictionary to pass parameters to functions when possible. Currently, the temporary tuple and dict can be avoided when calling Python functions (nice, isn't it?) and C functions declared with METH_O (not the most common API, but many functions are declared like that).

The patch only modifies property_descr_set() to test the feature, but I'm sure that *a lot* of C code can be modified to use this new function to benefit from its optimization.

Should we make this new _PyObject_CallStack() function official and call it PyObject_CallStack() (without the underscore prefix), or experiment with it in CPython 3.6 and decide later whether to make it public? If it stays private, it will require a large patch later to replace all calls to _PyObject_CallStack() with PyObject_CallStack() (strip the underscore prefix).

The next step is to add a new METH_STACK flag to pass parameters to C functions using a similar API (PyObject **stack, int na, int nk) and modify the argument clinic to use this new API.

Thanks to Larry Hasting who gave me the idea in a previous edition of Pycon US ;-)

This issue was created after the discussion on issue #26811 which is an issue in a micro-optimization in property_descr_set() to avoid the creation of a tuple: it caches a private tuple inside property_descr_set().
msg263907 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-21 09:53
"Stack" in the function name looks a little confusing. I understand that this is related to the stack of the bytecode interpreter, but this looks like exposing a pretty deep implementation detail. The way positional and keyword arguments are packed into the contiguous array is not clear. Wouldn't it be better to provide separate arguments for positional and keyword arguments?

What is the performance effect of using this function? For example, compare the performance of namedtuple's attribute access with the current code, the code with this patch, and the unoptimized code in 3.4:

    ./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a"

Is there any use of this function with keyword arguments?
msg263908 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-21 10:20
Microbenchmark on Python 3.6, best of 3 runs:

./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a"

* Python 3.6 unpatched: 0.968 usec
* call_stack.patch: 1.27 usec
* Python 3.6 with property_descr_get() of Python 3.4: 1.32 usec

"Python 3.6 with property_descr_get() of Python 3.4": replace the current optimization with "return PyObject_CallFunctionObjArgs(gs->prop_get, obj, NULL);".

Oh, in fact the tested code calls a property where the final function is operator.itemgetter(0). _PyObject_CallStack() creates a temporary tuple to call PyObject_Call() which calls func->ob_type->tp_call, itemgetter_call().
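For readers who have not looked at how namedtuple fields were defined at the time, a minimal hand-written stand-in looks like this. It is a sketch, not the real generated class: Point is a hypothetical example, and newer Python versions may implement the fields differently.

```python
from operator import itemgetter


class Point(tuple):
    """Hand-written stand-in for a namedtuple with fields x and y."""
    __slots__ = ()
    # Each field is a property whose getter is an itemgetter, so every
    # attribute access ends up calling itemgetter(index)(self).
    x = property(itemgetter(0))
    y = property(itemgetter(1))


p = Point((3, 4))
```

Reading p.x goes through the property getter, which is the call that the microbenchmark measures.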

Problem: tp_call API uses (PyObject *args, PyObject *kwargs). It doesn't accept directly a stack (a C array of PyObject*). And it may be more difficult to modify tp_call.

In short, my patch disables the optimization on property with my current incomplete implementation.
msg263909 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-21 10:28
See also issue23507. May be your function help to optimize filter(), map(), sorted()?
msg263910 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-21 10:42
call_stack-2.patch: A little bit more complete patch, it adds a tp_call_stack field to PyTypeObject and uses it in _PyObject_CallStack().

Updated microbenchmark on Python 3.6, best of 3 runs:

./python -m timeit -r 11 -s "from collections import namedtuple as n; a = n('n', 'a b c')(1, 2, 3)" -- "a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a; a.a"

* Python 3.6 unpatched: 0.968 usec
* call_stack.patch: 1.27 usec
* Python 3.6 with property_descr_get() of Python 3.4: 1.32 usec
* call_stack-2.patch: 0.664 usec

call_stack-2.patch makes this micro-benchmark 31% faster, not bad! It also makes calls to C functions almost 2x as fast if you replace current unoptimized calls with _PyObject_CallStack()!!
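The 31% figure can be checked from the timings listed above. A small worked example, where percent_faster() is just a helper name chosen for this sketch:

```python
def percent_faster(before, after):
    """Relative time saved, in percent (negative means slower)."""
    return (before - after) / before * 100.0


unpatched = 0.968   # usec, Python 3.6 unpatched
first_patch = 1.27  # usec, call_stack.patch
second_patch = 0.664  # usec, call_stack-2.patch

speedup = percent_faster(unpatched, second_patch)
regression = percent_faster(unpatched, first_patch)
```

With these numbers, speedup rounds to 31 and regression rounds to -31, so the first patch was about as much slower as the second one is faster.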

IMHO we should continue to experiment, making function calls 2x faster is worth it ;-)


Serhiy: "See also issue23507. May be your function help to optimize filter(), map(), sorted()?"

IMHO the API is generic enough to be usable in a lot of cases.


Serhiy: "Is there any use of this function with keyword arguments?"

Calling functions with keywords is probably the least common case for function calls in C code. But I would like to provide a fast function to call with keywords. Maybe we need two functions just to make the API cleaner? The difference would just be that "int nk" would be omitted?

I proposed an API (PyObject **stack, int na, int nk) based on the current code in Python/ceval.c. I'm not sure that it's the best API ever :-)

In fact, there is already PyObject_CallFunctionObjArgs() which can be modified to reuse internally _PyObject_CallStack(), and its API is maybe more convenient than my proposed API.
msg263918 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-21 13:45
With call_stack-2.patch, attribute access in a namedtuple is only 25% slower than attribute access in an ordinary Python object! This is definitely worth continuing to experiment with!

But adding a new slot to PyTypeObject sets the bar too high. Try to use your function to speed up all cases mentioned in issue23507: sorted()/list.sort(), min() and max() with the key argument, filter(), map(), some iterators from itertools (groupby(), dropwhile(), takewhile(), accumulate(), filterfalse()), thin wrappers around special methods (round(), math.floor(), etc). Use it in wrappers around PyObject_Call() like PyObject_CallFunctionObjArgs(). Maybe this will have an effect even on some macrobenchmarks.
msg263920 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2016-04-21 14:24
Yes, I've been working on a patch to do this as well.  I called the calling convention METH_RAW, to go alongside METH_ZERO METH_O etc.  My calling convention was exactly the same as yours: PyObject *(PyObject *o, PyObject **stack, int na, int nk).  I only had to modify two functions in ceval.c to support it: ext_do_call() and call_function().

And yes, the overarching goal was to have Argument Clinic generate custom argument parsing code for every function.  Supporting the calling convention was the easy part; generating code was quite complicated.  I believe I got a very simple version of it working at one point, supporting positional parameters only, with some optional arguments.  Parsing arguments by hand gets very complicated indeed when you introduce keyword arguments.

I haven't touched this patch in most of a year.  I hope to return to it someday.  In the meantime it's fine by me if you add support for this and rewrite some functions by hand to use it.

p.s. My last name has two S's.  If you continue to leave off one of them, I shall remove one from yours, Mr. TINNER.
msg263923 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-21 15:03
Since early microbenchmarks are promising, I wrote a more complete implementation which tries to use the fast path (avoid temporary tuple/dict) in all PyObject_Call*() functions.

The next step would be to add a METH_FASTCALL flag. IMHO adding such a new flag requires enhancing Argument Clinic to be able to use it, at least when a function doesn't accept keyword parameters.

PyObject_CallFunction() & friends have a weird API: if called with the format string "O", the behaviour depends on whether the object parameter is a tuple or not. If it's a tuple, the tuple is unpacked. It's a little bit weird. I recall that it led to a bug in the implementation of generators in Python: issue #21209! Moreover, if the format string is "(...)", the parentheses are ignored. If you want to call a function with one argument which is a tuple, you have to write "((...))". It's a little bit weird, but we cannot change that without breaking the (Python) world :-)
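The tuple special case can be modelled in a few lines of Python. This is only a sketch of the rule as described above, not the C implementation: call_function_model() and count_args() are hypothetical names, and built_value stands for whatever object the format string produced.

```python
def call_function_model(func, built_value):
    """Model of the rule: a tuple is unpacked, anything else is one argument."""
    if isinstance(built_value, tuple):
        args = built_value        # the tuple is used directly as the arguments
    else:
        args = (built_value,)     # any other object becomes a single argument
    return func(*args)


def count_args(*args):
    """Report how many positional arguments were received."""
    return len(args)
```

Under this model, passing the tuple (1, 2, 3) calls the function with three arguments, and passing ((1, 2, 3),) is what it takes to deliver the tuple itself as a single argument.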

call_stack-3.patch:

* I renamed the main function to _PyObject_FastCall()
* I added PyObject_CallNoArg(): call a function with no parameter
* I added Py_VaBuildStack() and _Py_VaBuildStack_SizeT() helpers for PyObject_Call*() functions using a format string
* I renamed the new slot to tp_fastcall

Nice change in the WITH_CLEANUP_START opcode (ceval.c):

-            /* XXX Not the fastest way to call it... */
-            res = PyObject_CallFunctionObjArgs(exit_func, exc, val, tb, NULL);

+            arg_stack[0] = exc;
+            arg_stack[1] = val;
+            arg_stack[2] = tb;
+            res = _PyObject_FastCall(exit_func, arg_stack, 3, 0);

I don't know if it's a common bytecode, nor if the change is really faster.
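For context, the exit_func call in that hunk is, as far as I can tell, the __exit__() call made when a with block finishes, always with three arguments. A small sketch showing what those three arguments look like from Python (Recorder is a hypothetical class for this example):

```python
class Recorder:
    """Context manager that records the arguments passed to __exit__()."""

    def __init__(self):
        self.exit_args = None

    def __enter__(self):
        return self

    def __exit__(self, exc, val, tb):
        self.exit_args = (exc, val, tb)
        return True  # suppress the exception so the example keeps running


ok = Recorder()
with ok:
    pass

failed = Recorder()
with failed:
    raise ValueError("boom")
```

On a normal exit all three arguments are None; on an exception they are the exception type, the instance and the traceback.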
msg263924 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-21 15:05
> I believe I got a very simple version of it working at one point, supporting positional parameters only, with some optional arguments.

Yeah, that would be a nice first step.

> p.s. My last name has two S's.  If you continue to leave off one of them, I shall remove one from yours, Mr. TINNER.

Ooops, I'm sorry Guido Hastings :-(
msg263926 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-21 17:04
PyObject_Call*() implementations with _PyObject_FastCall() look much more complex than with PyObject_Call() (even not counting the additional complex functions in modsupport.c). And I'm not sure there is a benefit. Maybe for the first stage we can do without this.
msg263946 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 00:44
I created a repository. I will work there and run some experiments. It will help to get a better idea of the concrete performance. When I have a better view of all the changes required to get the best performance everywhere, I will start a discussion to see which parts are worth it or not. In my latest microbenchmarks, function calls (C/Python, mixed) are between 8% and 40% faster. I'm now running the CPython benchmark suite.
msg263995 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 11:10
Changes of my current implementation, ad4a53ed1fbf.diff.

The good thing is that all changes are internal (really?). Even if you don't modify your C extensions (nor your Python code), you should benefit from the new fast call in *a lot* of cases.

IMHO the trickiest part is the changes to PyTypeObject. Is it ok to add a new tp_fastcall slot? Should we add even more slots using the fast call convention, like tp_fastnew and tp_fastinit? How should we handle the inheritance of types with that?


(*) Add 2 new public functions:

PyObject* PyObject_CallNoArg(PyObject *func);
PyObject* PyObject_CallArg1(PyObject *func, PyObject *arg);


(*) Add 1 new private function:

PyObject* _PyObject_FastCall(PyObject *func, PyObject **stack, int na, int nk);

_PyObject_FastCall() is the root of the new feature.


(*) type: add a new "tp_fastcall" field to the PyTypeObject structure.

It's unclear to me how inheritance is handled here. Maybe it's simply broken, but it's strange because it looks like it works :-) Maybe it's very rare that tp_call is overridden in a child class?

TODO: maybe reuse the "tp_call" field? (risk of major backward incompatibility...)


(*) slots: add a new "fastwrapper" field to the wrapperbase structure. Add a fast wrapper to all slots (really all? I should check).

I don't think that consumers of the C API are affected by this change, or maybe only a few projects.

TODO: maybe remove "fastwrapper" and reuse the "wrapper" field? (low risk of backward incompatibility?)


(*) Implement fast call for Python function (_PyFunction_FastCall) and C functions (PyCFunction_FastCall)


(*) Add a new METH_FASTCALL calling convention for C functions. Right now, it is used for 4 builtin functions: sorted(), getattr(), iter(), next().

Argument Clinic should be modified to emit C code using this new fast calling convention.


(*) Implement fast call in the following functions (types):

- method()
- method_descriptor()
- wrapper_descriptor()
- method_wrapper()
- operator.itemgetter => used by collections.namedtuple to get an item by its name


(*) Modify PyObject_Call*() functions to reuse the fast call internally. "tp_fastcall" is preferred over "tp_call" (FIXME: is it really useful to do that?).

The following functions are able to avoid temporary tuple/dict without having to modify the code calling them:

- PyObject_CallFunction()
- PyObject_CallMethod(), _PyObject_CallMethodId()
- PyObject_CallFunctionObjArgs(), PyObject_CallMethodObjArgs()

It's not required to modify code using these functions to use the 3 new shiny functions (PyObject_CallNoArg, PyObject_CallArg1, _PyObject_FastCall). For example, replacing PyObject_CallFunctionObjArgs(func, NULL) with PyObject_CallNoArg(func) is just a micro-optimization, the tuple is already avoided. But PyObject_CallNoArg() should use less memory of the C stack and be a "little bit" faster.


(*) Add new helpers: new Include/pystack.h file, Py_VaBuildStack(), etc.


Please ignore unrelated changes.
msg263996 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 11:12
Related issue: issue #23507, "Tuple creation is too slow".
msg263999 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 11:40
Some microbenchmarks: bench_fast.py.

== Python 3.6 / Python 3.6 FASTCALL ==

----------------------------------+--------------+---------------
Tests                             | /tmp/default |  /tmp/fastcall
----------------------------------+--------------+---------------
filter                            |   241 us (*) |  166 us (-31%)
map                               |   205 us (*) |  168 us (-18%)
sorted(list, key=lambda x: x)     |   242 us (*) |  162 us (-33%)
sorted(list)                      |  27.7 us (*) |        27.8 us
b=MyBytes(); bytes(b)             |   549 ns (*) |         533 ns
namedtuple.attr                   |  2.03 us (*) | 1.56 us (-23%)
object.__setattr__(obj, "x", 1)   |   347 ns (*) |  218 ns (-37%)
object.__getattribute__(obj, "x") |   331 ns (*) |  200 ns (-40%)
getattr(1, "real")                |   267 ns (*) |  150 ns (-44%)
bounded_pymethod(1, 2)            |   193 ns (*) |         190 ns
unbound_pymethod(obj, 1, 2)       |   195 ns (*) |         192 ns
----------------------------------+--------------+---------------
Total                             |   719 us (*) |  526 us (-27%)
----------------------------------+--------------+---------------
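For anyone who wants to reproduce this kind of measurement without the attached script, here is a minimal sketch using timeit. It is not bench_fast.py: the bench() helper, the data size and the iteration counts are assumptions picked for this example, and absolute timings will differ from the table.

```python
import timeit


def bench(stmt, setup="pass", number=200, repeat=3):
    """Return the best time per execution of stmt, in seconds."""
    timer = timeit.Timer(stmt, setup=setup)
    # Keep the minimum of several runs to reduce noise from the system.
    return min(timer.repeat(repeat=repeat, number=number)) / number


SETUP = "data = list(range(1000))"

results = {
    "filter": bench("list(filter(lambda x: x, data))", SETUP),
    "map": bench("list(map(lambda x: x, data))", SETUP),
    "sorted(list, key=lambda x: x)": bench("sorted(data, key=lambda x: x)", SETUP),
    "sorted(list)": bench("sorted(data)", SETUP),
}

for name, seconds in results.items():
    print("%-32s %8.1f us" % (name, seconds * 1e6))
```

Running the same script under two builds gives the two columns to compare.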


== Compare Python 3.4 / Python 3.6 / Python 3.6 FASTCALL ==

Common platform:
Timer: time.perf_counter
Python unicode implementation: PEP 393
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three
SCM: hg revision=abort: repository . not found! tag=abort: repository . not found! branch=abort: repository . not found! date=abort: no repository found in '/home/haypo/prog/python' (.hg not found)!
Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign /tmp/py34:
Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
Timer precision: 78 ns
Date: 2016-04-22 13:37:52

Platform of campaign /tmp/default:
Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Timer precision: 103 ns
Date: 2016-04-22 13:38:07

Platform of campaign /tmp/fastcall:
Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
Timer precision: 99 ns
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Date: 2016-04-22 13:38:21

----------------------------------+-------------+----------------+---------------
Tests                             |   /tmp/py34 |   /tmp/default |  /tmp/fastcall
----------------------------------+-------------+----------------+---------------
filter                            |  325 us (*) |  241 us (-26%) |  166 us (-49%)
map                               |  260 us (*) |  205 us (-21%) |  168 us (-35%)
sorted(list, key=lambda x: x)     |  354 us (*) |  242 us (-32%) |  162 us (-54%)
sorted(list)                      | 46.9 us (*) | 27.7 us (-41%) | 27.8 us (-41%)
b=MyBytes(); bytes(b)             |  839 ns (*) |  549 ns (-35%) |  533 ns (-36%)
namedtuple.attr                   | 4.51 us (*) | 2.03 us (-55%) | 1.56 us (-65%)
object.__setattr__(obj, "x", 1)   |  447 ns (*) |  347 ns (-22%) |  218 ns (-51%)
object.__getattribute__(obj, "x") |  401 ns (*) |  331 ns (-17%) |  200 ns (-50%)
getattr(1, "real")                |  236 ns (*) |  267 ns (+13%) |  150 ns (-36%)
bounded_pymethod(1, 2)            |  249 ns (*) |  193 ns (-22%) |  190 ns (-24%)
unbound_pymethod(obj, 1, 2)       |  251 ns (*) |  195 ns (-22%) |  192 ns (-23%)
----------------------------------+-------------+----------------+---------------
Total                             |  993 us (*) |  719 us (-28%) |  526 us (-47%)
----------------------------------+-------------+----------------+---------------
msg264003 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 11:52
For more fun, comparison between Python 2.7 / 3.4 / 3.6 / 3.6 FASTCALL.

----------------------------------+-------------+----------------+----------------+---------------
Tests                             |        py27 |           py34 |           py36 |           fast
----------------------------------+-------------+----------------+----------------+---------------
filter                            |  165 us (*) |  318 us (+93%) |  237 us (+43%) |         165 us
map                               |  209 us (*) |  258 us (+24%) |         202 us |  171 us (-18%)
sorted(list, key=lambda x: x)     |  272 us (*) |  348 us (+28%) |  237 us (-13%) |  163 us (-40%)
sorted(list)                      | 33.7 us (*) | 47.8 us (+42%) | 27.3 us (-19%) | 27.7 us (-18%)
b=MyBytes(); bytes(b)             | 3.31 us (*) |  835 ns (-75%) |  510 ns (-85%) |  561 ns (-83%)
namedtuple.attr                   | 4.63 us (*) |        4.51 us | 1.98 us (-57%) | 1.57 us (-66%)
object.__setattr__(obj, "x", 1)   |  463 ns (*) |         440 ns |  343 ns (-26%) |  222 ns (-52%)
object.__getattribute__(obj, "x") |  323 ns (*) |  396 ns (+23%) |         316 ns |  196 ns (-39%)
getattr(1, "real")                |  218 ns (*) |   237 ns (+8%) |  264 ns (+21%) |  147 ns (-33%)
bounded_pymethod(1, 2)            |  213 ns (*) |  244 ns (+14%) |   194 ns (-9%) |  188 ns (-12%)
unbound_pymethod(obj, 1, 2)       |  345 ns (*) |  247 ns (-29%) |  196 ns (-43%) |  191 ns (-45%)
func()                            |  161 ns (*) |  211 ns (+31%) |         161 ns |         157 ns
func(1, 2, 3)                     |  219 ns (*) |  247 ns (+13%) |  196 ns (-10%) |  190 ns (-13%)
----------------------------------+-------------+----------------+----------------+---------------
Total                             |  689 us (*) |  980 us (+42%) |         707 us |  531 us (-23%)
----------------------------------+-------------+----------------+----------------+---------------


I didn't know that Python 3.4 was so much slower than Python 2.7 on function calls!?

Note: Python 2.7 and Python 3.4 are system binaries (Fedora 22), whereas Python 3.6 and Python 3.6 FASTCALL are compiled manually.

Ignore "b=MyBytes(); bytes(b)", this benchmark is written for Python 3.

--

details:

Common platform:
Bits: int=32, long=64, long long=64, size_t=64, void*=64
Platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz

Platform of campaign py27:
CFLAGS: -fno-strict-aliasing -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
Python unicode implementation: UCS-4
Timer precision: 954 ns
Python version: 2.7.10 (default, Sep 8 2015, 17:20:17) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Timer: time.time

Platform of campaign py34:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CFLAGS: -Wno-unused-result -DDYNAMIC_ANNOTATIONS_ENABLED=1 -DNDEBUG -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv
Timer precision: 84 ns
Python unicode implementation: PEP 393
Python version: 3.4.3 (default, Jun 29 2015, 12:16:01) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)]
Timer: time.perf_counter

Platform of campaign py36:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python version: 3.6.0a0 (default:496e094f4734, Apr 22 2016, 02:18:13) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Python unicode implementation: PEP 393
Timer: time.perf_counter

Platform of campaign fast:
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
CFLAGS: -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Python unicode implementation: PEP 393
Python version: 3.6.0a0 (default:ad4a53ed1fbf, Apr 22 2016, 12:42:15) [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]
msg264009 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-22 12:52
Could you compare filter(), map() and sorted() performance with your patch and with issue23507 patch?
msg264021 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 14:56
Results of the CPython benchmark suite on the revision 6c376e866330 of  https://hg.python.org/sandbox/fastcall compared to CPython 3.6 at the revision 496e094f4734.

It's surprising that call_simple is 1.08x slower in fastcall. This slowdown is not acceptable and should be fixed. It probably explains why many other benchmarks are slower.

Fortunately, some benchmarks are faster, between 1.02x and 1.09x faster.

IMHO there are still performance issues in my current implementation that can and must be fixed. At least, we have a starting point to compare performances.
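To read the "N.NNx faster/slower" figures in the report: judging from the numbers themselves, the factor appears to be the ratio of the larger time to the smaller one. A small sketch under that assumption (describe() is a hypothetical helper, not part of perf.py):

```python
def describe(before, after):
    """Format a before -> after timing pair as a speed factor."""
    if after < before:
        return "%.2fx faster" % (before / after)
    return "%.2fx slower" % (after / before)


# Minimum timings taken from the report.
call_simple = describe(0.232594, 0.251789)
chameleon_v2 = describe(5.525575, 5.263668)
etree_parse = describe(0.282983, 0.260390)
```

These three pairs reproduce the factors printed for call_simple, chameleon_v2 and etree_parse.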


$ python3 -u perf.py ../default/python ../fastcall/python -b all
(...)
Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64
Total CPU cores: 8

[ slower ]

### 2to3 ###
6.859604 -> 6.985351: 1.02x slower

### call_method_slots ###
Min: 0.308846 -> 0.317780: 1.03x slower
Avg: 0.308902 -> 0.318667: 1.03x slower
Significant (t=-464.83)
Stddev: 0.00003 -> 0.00026: 9.8974x larger

### call_simple ###
Min: 0.232594 -> 0.251789: 1.08x slower
Avg: 0.232816 -> 0.252443: 1.08x slower
Significant (t=-911.97)
Stddev: 0.00024 -> 0.00011: 2.2373x smaller

### chaos ###
Min: 0.273084 -> 0.284790: 1.04x slower
Avg: 0.273951 -> 0.293177: 1.07x slower
Significant (t=-7.57)
Stddev: 0.00036 -> 0.01796: 49.9421x larger

### django_v3 ###
Min: 0.549604 -> 0.569982: 1.04x slower
Avg: 0.550557 -> 0.571038: 1.04x slower
Significant (t=-204.09)
Stddev: 0.00046 -> 0.00054: 1.1747x larger

### float ###
Min: 0.261939 -> 0.269224: 1.03x slower
Avg: 0.268475 -> 0.276515: 1.03x slower
Significant (t=-12.22)
Stddev: 0.00301 -> 0.00354: 1.1757x larger

### formatted_logging ###
Min: 0.325786 -> 0.334440: 1.03x slower
Avg: 0.326827 -> 0.335968: 1.03x slower
Significant (t=-34.44)
Stddev: 0.00129 -> 0.00136: 1.0503x larger

### mako_v2 ###
Min: 0.039642 -> 0.044765: 1.13x slower
Avg: 0.040251 -> 0.045562: 1.13x slower
Significant (t=-323.73)
Stddev: 0.00028 -> 0.00024: 1.1558x smaller

### meteor_contest ###
Min: 0.196589 -> 0.203667: 1.04x slower
Avg: 0.197497 -> 0.204782: 1.04x slower
Significant (t=-76.06)
Stddev: 0.00050 -> 0.00045: 1.1111x smaller

### nqueens ###
Min: 0.274664 -> 0.285866: 1.04x slower
Avg: 0.275285 -> 0.286774: 1.04x slower
Significant (t=-68.34)
Stddev: 0.00091 -> 0.00076: 1.2036x smaller

### pickle_list ###
Min: 0.262687 -> 0.269629: 1.03x slower
Avg: 0.263804 -> 0.270789: 1.03x slower
Significant (t=-50.14)
Stddev: 0.00070 -> 0.00070: 1.0004x larger

### raytrace ###
Min: 1.272960 -> 1.284516: 1.01x slower
Avg: 1.276398 -> 1.368574: 1.07x slower
Significant (t=-3.41)
Stddev: 0.00157 -> 0.19115: 122.0022x larger

### regex_compile ###
Min: 0.335753 -> 0.343820: 1.02x slower
Avg: 0.336273 -> 0.344894: 1.03x slower
Significant (t=-127.84)
Stddev: 0.00026 -> 0.00040: 1.5701x larger

### regex_effbot ###
Min: 0.048656 -> 0.050810: 1.04x slower
Avg: 0.048692 -> 0.051619: 1.06x slower
Significant (t=-69.92)
Stddev: 0.00002 -> 0.00030: 16.7793x larger

### silent_logging ###
Min: 0.069539 -> 0.071172: 1.02x slower
Avg: 0.069679 -> 0.071230: 1.02x slower
Significant (t=-124.08)
Stddev: 0.00009 -> 0.00002: 3.7073x smaller

### simple_logging ###
Min: 0.278439 -> 0.287736: 1.03x slower
Avg: 0.279504 -> 0.288811: 1.03x slower
Significant (t=-52.46)
Stddev: 0.00084 -> 0.00093: 1.1074x larger

### telco ###
Min: 0.012480 -> 0.013104: 1.05x slower
Avg: 0.012561 -> 0.013157: 1.05x slower
Significant (t=-100.42)
Stddev: 0.00004 -> 0.00002: 1.5881x smaller

### unpack_sequence ###
Min: 0.000047 -> 0.000048: 1.03x slower
Avg: 0.000047 -> 0.000048: 1.03x slower
Significant (t=-1170.16)
Stddev: 0.00000 -> 0.00000: 1.0749x larger

### unpickle_list ###
Min: 0.325310 -> 0.330080: 1.01x slower
Avg: 0.326484 -> 0.333974: 1.02x slower
Significant (t=-24.19)
Stddev: 0.00100 -> 0.00195: 1.9392x larger

[ faster ]

### chameleon_v2 ###
Min: 5.525575 -> 5.263668: 1.05x faster
Avg: 5.541444 -> 5.281893: 1.05x faster
Significant (t=85.79)
Stddev: 0.01107 -> 0.01831: 1.6539x larger

### etree_iterparse ###
Min: 0.212073 -> 0.197146: 1.08x faster
Avg: 0.215504 -> 0.200254: 1.08x faster
Significant (t=61.07)
Stddev: 0.00119 -> 0.00130: 1.0893x larger

### etree_parse ###
Min: 0.282983 -> 0.260390: 1.09x faster
Avg: 0.284333 -> 0.262758: 1.08x faster
Significant (t=77.34)
Stddev: 0.00102 -> 0.00169: 1.6628x larger

### etree_process ###
Min: 0.218953 -> 0.213683: 1.02x faster
Avg: 0.221036 -> 0.215280: 1.03x faster
Significant (t=25.98)
Stddev: 0.00114 -> 0.00108: 1.0580x smaller

### hexiom2 ###
Min: 122.001408 -> 118.967112: 1.03x faster
Avg: 122.108010 -> 119.110115: 1.03x faster
Significant (t=16.81)
Stddev: 0.15076 -> 0.20224: 1.3415x larger

### pathlib ###
Min: 0.088533 -> 0.084888: 1.04x faster
Avg: 0.088916 -> 0.085280: 1.04x faster
Significant (t=257.68)
Stddev: 0.00014 -> 0.00017: 1.1725x larger


The following not significant results are hidden, use -v to show them:
call_method, call_method_unknown, etree_generate, fannkuch, fastpickle, fastunpickle, go, json_dump_v2, json_load, nbody, normal_startup, pickle_dict, pidigits, regex_v8, richards, spectral_norm, startup_nosite, tornado_http.
msg264098 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-24 06:37
I have collected statistics about the use of CALL_FUNCTION* opcodes in compiled code while running the CPython test suite. According to it, 99.4% of emitted opcodes are the CALL_FUNCTION opcode, 89% of emitted CALL_FUNCTION opcodes have only positional arguments, and 98% of them have no more than 3 arguments.

That was about calls from Python code. All convenient C API functions (like PyObject_CallFunction and PyObject_CallFunctionObjArgs) used for direct calling in C code use only positional arguments.

Thus I think we need to optimize only cases of calling with small number (0-3) of positional arguments.
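A rough way to get similar numbers for any piece of source code is to count call expressions with the ast module. This is only an approximation of the statistics above, since it counts calls in the source rather than emitted opcodes; call_stats() and SAMPLE are hypothetical names for this sketch.

```python
import ast


def call_stats(source):
    """Return (all calls, positional-only calls, positional-only calls with <= 3 args)."""
    calls = [node for node in ast.walk(ast.parse(source))
             if isinstance(node, ast.Call)]

    def positional_only(call):
        # No keyword arguments (including **kwargs) and no *args unpacking.
        return (not call.keywords
                and not any(isinstance(arg, ast.Starred) for arg in call.args))

    positional = [call for call in calls if positional_only(call)]
    small = [call for call in positional if len(call.args) <= 3]
    return len(calls), len(positional), len(small)


SAMPLE = """
print(len(items))
sorted(items, key=str)
max(1, 2, 3, 4)
f()
"""

stats = call_stats(SAMPLE)
```

In SAMPLE there are 5 calls, 4 of them use only positional arguments, and 3 of those pass at most 3 arguments.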
msg264101 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-24 07:15
> Thus I think we need to optimize only cases of calling with small number (0-3) of positional arguments.

My code is optimized to up to 10 positional arguments: with 0..10 arguments, the C stack is used to hold the array of PyObject*. For more arguments, an array is allocated in the heap memory.

+   /* 10 positional parameters or 5 (key, value) pairs for keyword parameters.
+      40 bytes on 32-bit or 80 bytes on 64-bit. */
+#  define _PyStack_SIZE 10

For keyword parameters, I don't know yet what is the best API (fastest API). Right now, I'm also using the same PyObject** array for positional and keyword arguments using "int nk", but maybe a dictionary is faster to combine keyword arguments and to parse keyword arguments.
msg264102 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-24 07:37
I think you can simplify the patch by dropping keyword arguments support from fastcall. Then you can decrease _PyStack_SIZE to 4 (a larger size will serve only 1.7% of calls), and maybe refactor the code, since an array of 4 pointers consumes less C stack than an array of 10 pointers.
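The C stack cost of the two sizes discussed here is simple arithmetic: number of slots times the pointer size. A quick sketch, where stack_bytes() is a helper name made up for this example and ctypes is only used to query the pointer size of the running interpreter:

```python
import ctypes


def stack_bytes(slots, pointer_size):
    """Bytes needed for an on-stack array of `slots` object pointers."""
    return slots * pointer_size


# Pointer size of the current build: 4 on 32-bit, 8 on 64-bit.
POINTER_SIZE = ctypes.sizeof(ctypes.c_void_p)

for slots in (10, 4):
    print("%2d slots: %d bytes" % (slots, stack_bytes(slots, POINTER_SIZE)))
```

With 10 slots this gives 40 bytes on 32-bit and 80 bytes on 64-bit, matching the comment in the patch; with 4 slots it is 16 and 32 bytes.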
msg264518 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-29 20:35
Results of the CPython benchmark suite. Reference = default branch at rev 496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

I had many issues getting reliable benchmark output:

* https://mail.python.org/pipermail/speed/2016-April/000329.html
* https://mail.python.org/pipermail/speed/2016-April/000341.html

The benchmark was run with CPU isolation. Both binaries were compiled with PGO+LTO.

Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64
Total CPU cores: 8

### call_method_slots ###
Min: 0.289704 -> 0.269634: 1.07x faster
Avg: 0.290149 -> 0.275953: 1.05x faster
Significant (t=162.17)
Stddev: 0.00019 -> 0.00150: 8.1176x larger

### call_method_unknown ###
Min: 0.275295 -> 0.302810: 1.10x slower
Avg: 0.280201 -> 0.309166: 1.10x slower
Significant (t=-200.65)
Stddev: 0.00161 -> 0.00191: 1.1909x larger

### call_simple ###
Min: 0.202163 -> 0.207939: 1.03x slower
Avg: 0.202332 -> 0.208662: 1.03x slower
Significant (t=-636.09)
Stddev: 0.00008 -> 0.00015: 2.0130x larger

### chameleon_v2 ###
Min: 4.349474 -> 3.901936: 1.11x faster
Avg: 4.377664 -> 3.942932: 1.11x faster
Significant (t=62.39)
Stddev: 0.01403 -> 0.06826: 4.8635x larger

### django_v3 ###
Min: 0.484456 -> 0.462013: 1.05x faster
Avg: 0.489186 -> 0.465189: 1.05x faster
Significant (t=53.10)
Stddev: 0.00415 -> 0.00180: 2.3096x smaller

### etree_generate ###
Min: 0.193538 -> 0.182069: 1.06x faster
Avg: 0.196306 -> 0.184403: 1.06x faster
Significant (t=65.94)
Stddev: 0.00140 -> 0.00115: 1.2181x smaller

### etree_iterparse ###
Min: 0.189955 -> 0.177583: 1.07x faster
Avg: 0.195268 -> 0.183411: 1.06x faster
Significant (t=27.04)
Stddev: 0.00316 -> 0.00304: 1.0386x smaller

### etree_process ###
Min: 0.166556 -> 0.158617: 1.05x faster
Avg: 0.168822 -> 0.160672: 1.05x faster
Significant (t=43.33)
Stddev: 0.00125 -> 0.00140: 1.1205x larger

### fannkuch ###
Min: 0.859842 -> 0.878412: 1.02x slower
Avg: 0.865138 -> 0.889188: 1.03x slower
Significant (t=-14.97)
Stddev: 0.00718 -> 0.01436: 2.0000x larger

### float ###
Min: 0.222095 -> 0.214706: 1.03x faster
Avg: 0.226273 -> 0.218210: 1.04x faster
Significant (t=21.61)
Stddev: 0.00307 -> 0.00212: 1.4469x smaller

### hexiom2 ###
Min: 100.489630 -> 94.765364: 1.06x faster
Avg: 101.204871 -> 94.885605: 1.07x faster
Significant (t=77.45)
Stddev: 0.25310 -> 0.05016: 5.0454x smaller

### meteor_contest ###
Min: 0.181076 -> 0.176904: 1.02x faster
Avg: 0.181759 -> 0.177783: 1.02x faster
Significant (t=43.68)
Stddev: 0.00061 -> 0.00067: 1.1041x larger

### nbody ###
Min: 0.208752 -> 0.217011: 1.04x slower
Avg: 0.211552 -> 0.219621: 1.04x slower
Significant (t=-69.45)
Stddev: 0.00080 -> 0.00084: 1.0526x larger

### pathlib ###
Min: 0.077121 -> 0.070698: 1.09x faster
Avg: 0.078310 -> 0.071958: 1.09x faster
Significant (t=133.39)
Stddev: 0.00069 -> 0.00081: 1.1735x larger

### pickle_dict ###
Min: 0.530379 -> 0.514363: 1.03x faster
Avg: 0.531325 -> 0.515902: 1.03x faster
Significant (t=154.33)
Stddev: 0.00086 -> 0.00050: 1.7213x smaller

### pickle_list ###
Min: 0.253445 -> 0.263959: 1.04x slower
Avg: 0.255362 -> 0.267402: 1.05x slower
Significant (t=-95.47)
Stddev: 0.00075 -> 0.00101: 1.3447x larger

### raytrace ###
Min: 1.071042 -> 1.030849: 1.04x faster
Avg: 1.076629 -> 1.109029: 1.03x slower
Significant (t=-3.93)
Stddev: 0.00199 -> 0.08246: 41.4609x larger

### regex_compile ###
Min: 0.286053 -> 0.273454: 1.05x faster
Avg: 0.287171 -> 0.274422: 1.05x faster
Significant (t=153.16)
Stddev: 0.00067 -> 0.00050: 1.3452x smaller

### regex_effbot ###
Min: 0.044186 -> 0.048192: 1.09x slower
Avg: 0.044336 -> 0.048513: 1.09x slower
Significant (t=-172.41)
Stddev: 0.00020 -> 0.00014: 1.4671x smaller

### richards ###
Min: 0.137456 -> 0.135029: 1.02x faster
Avg: 0.138993 -> 0.136028: 1.02x faster
Significant (t=20.35)
Stddev: 0.00116 -> 0.00088: 1.3247x smaller

### silent_logging ###
Min: 0.060288 -> 0.056344: 1.07x faster
Avg: 0.060380 -> 0.056518: 1.07x faster
Significant (t=310.27)
Stddev: 0.00011 -> 0.00005: 2.1029x smaller

### telco ###
Min: 0.010735 -> 0.010441: 1.03x faster
Avg: 0.010849 -> 0.010557: 1.03x faster
Significant (t=34.04)
Stddev: 0.00007 -> 0.00005: 1.3325x smaller

### unpickle_list ###
Min: 0.290750 -> 0.297958: 1.02x slower
Avg: 0.292741 -> 0.299419: 1.02x slower
Significant (t=-41.62)
Stddev: 0.00133 -> 0.00090: 1.4852x smaller

The following not significant results are hidden, use -v to show them:
2to3, call_method, chaos, etree_parse, fastpickle, fastunpickle, formatted_logging, go, json_dump_v2, json_load, mako_v2, normal_startup, nqueens, pidigits, regex_v8, simple_logging, spectral_norm, startup_nosite, tornado_http, unpack_sequence.
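
For readers of the report: the "x faster" and "x slower" figures are consistent with a plain ratio of the two timings, with the larger value divided by the smaller one. The sketch below reproduces a few Min lines from the report under that assumption; it is not taken from perf.py, and the "no change" case is my own choice.

```python
def describe_change(old, new):
    """Format a timing change as a ratio, in the style of the report."""
    if new < old:
        return f"{old / new:.2f}x faster"
    if new > old:
        return f"{new / old:.2f}x slower"
    return "no change"


print(describe_change(0.289704, 0.269634))  # call_method_slots Min: 1.07x faster
print(describe_change(0.275295, 0.302810))  # call_method_unknown Min: 1.10x slower
print(describe_change(4.349474, 3.901936))  # chameleon_v2 Min: 1.11x faster
```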
msg264519 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-29 20:37
> Results of the CPython benchmark suite. Reference = default branch at rev 496e094f4734, patched: fastcall fork at rev 2b4b7def2949.

Oh, I forgot to mention that I modified perf.py to run each benchmark using 10 fresh processes to test multiple random seeds for the randomized hash function, instead of testing a fixed seed (PYTHONHASHSEED=1). This change should reduce the noise in the benchmark results.

I ran the benchmark suite using --rigorous.

I will open a new issue later for my perf.py change.
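
The idea of running each benchmark in fresh processes with different hash seeds can be sketched as follows. This is not the perf.py change itself, only an illustration under the assumption that each run is a separate interpreter started with its own PYTHONHASHSEED; run_with_seeds is a hypothetical helper.

```python
import os
import subprocess
import sys


def run_with_seeds(code, seeds):
    """Run `code` once per seed in a fresh interpreter; return each stdout."""
    outputs = []
    for seed in seeds:
        # Copy the current environment and pin the hash seed for this run.
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        outputs.append(result.stdout.strip())
    return outputs


# The hash of a str depends on the seed, so each process sees different values.
print(run_with_seeds("print(hash('abc'))", [1, 2, 3]))
```

A real benchmark runner would time a workload in each process and then combine the samples from all runs.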
msg264525 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-29 21:43
Could you repeat the benchmarks on a different computer? Preferably with a different CPU or compiler.
msg264526 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-29 21:55
> Could you repeat the benchmarks on a different computer? Preferably with a different CPU or compiler.

Sorry, I don't really have the bandwidth to repeat the benchmarks. PGO+LTO compilation is slow and running the benchmark suite in rigorous mode is very slow.

What do you expect from running the benchmark on a different computer?
msg264529 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-29 22:16
The results look like noise. Some tests become slower, others become faster. If results on a different machine show the same sets of slower and faster tests, this is likely not noise.
msg264530 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-29 22:23
> The results look like noise.

As I wrote, it's really hard to get a reliable benchmark result. I did my best.

See also discussions about the CPython benchmark suite on the speed list:
https://mail.python.org/pipermail/speed/

I'm not sure that you will get less noise on other computers. IMHO many benchmarks are simply "broken" (not reliable).
msg265856 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-19 13:30
Hi,

I made progress on my FASTCALL branch. I removed the tp_fastnew, tp_fastinit and
tp_fastcall fields from PyTypeObject and replaced them with new type flags (ex:
Py_TPFLAGS_FASTNEW) to avoid code duplication and reduce the memory footprint.
Before, each function was simply duplicated.

This change introduces a backward incompatible change: it's no longer
possible to call tp_new, tp_init and tp_call directly. I don't know yet if such
a change would be acceptable in Python 3.6, nor if it is worth it.

I spent a lot of time on the CPython benchmark suite to check for
performance regressions. In fact, I spent most of my time trying to understand
why most benchmarks looked completely unstable. I have now tuned my
system correctly and patched perf.py to get reliable benchmarks.

On the latest run of the benchmark suite, most benchmarks are faster! I have to investigate why 3 benchmarks are still slower. In the run, normal_startup was not significant, etree_parse was faster (instead of slower), but raytrace was already slower (but only 1.13x slower). It may be the "noise" of the PGO compilation. I already noticed that once: see the issue #27056 "pickle: constant propagation in _Unpickler_Read()".

Result of the benchmark suite:

slower (3):

* raytrace: 1.06x slower
* etree_parse: 1.03x slower
* normal_startup: 1.02x slower

faster (18):

* unpickle_list: 1.11x faster
* chameleon_v2: 1.09x faster
* etree_generate: 1.08x faster
* etree_process: 1.08x faster
* mako_v2: 1.06x faster
* call_method_unknown: 1.06x faster
* django_v3: 1.05x faster
* regex_compile: 1.05x faster
* etree_iterparse: 1.05x faster
* fastunpickle: 1.05x faster
* meteor_contest: 1.05x faster
* pickle_dict: 1.05x faster
* float: 1.04x faster
* pathlib: 1.04x faster
* silent_logging: 1.04x faster
* call_method: 1.03x faster
* json_dump_v2: 1.03x faster
* call_simple: 1.03x faster

not significant (21):

* 2to3
* call_method_slots
* chaos
* fannkuch
* fastpickle
* formatted_logging
* go
* json_load
* nbody
* nqueens
* pickle_list
* pidigits
* regex_effbot
* regex_v8
* richards
* simple_logging
* spectral_norm
* startup_nosite
* telco
* tornado_http
* unpack_sequence

I know that my patch is simply giant and cannot be merged like that.

Since the performance is still promising, I plan to split my giant
patch into smaller patches, easier to review. I will try to check that
individual patches don't make Python slower. This work will take time.
msg265857 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-19 13:37
New patch: 34456cce64bb.patch

$ diffstat 34456cce64bb.patch 
 .hgignore                                     |    3 
 Makefile.pre.in                               |   37 
 b/Doc/includes/shoddy.c                       |    2 
 b/Include/Python.h                            |    1 
 b/Include/abstract.h                          |   17 
 b/Include/descrobject.h                       |   14 
 b/Include/funcobject.h                        |    6 
 b/Include/methodobject.h                      |    6 
 b/Include/modsupport.h                        |   20 
 b/Include/object.h                            |   28 
 b/Lib/json/encoder.py                         |    1 
 b/Lib/test/test_extcall.py                    |   19 
 b/Lib/test/test_sys.py                        |    6 
 b/Modules/_collectionsmodule.c                |   14 
 b/Modules/_csv.c                              |   15 
 b/Modules/_ctypes/_ctypes.c                   |   12 
 b/Modules/_ctypes/stgdict.c                   |    2 
 b/Modules/_datetimemodule.c                   |   47 
 b/Modules/_elementtree.c                      |   11 
 b/Modules/_functoolsmodule.c                  |  113 +-
 b/Modules/_io/clinic/_iomodule.c.h            |    8 
 b/Modules/_io/clinic/bufferedio.c.h           |   42 
 b/Modules/_io/clinic/bytesio.c.h              |   42 
 b/Modules/_io/clinic/fileio.c.h               |   26 
 b/Modules/_io/clinic/iobase.c.h               |   26 
 b/Modules/_io/clinic/stringio.c.h             |   34 
 b/Modules/_io/clinic/textio.c.h               |   40 
 b/Modules/_io/iobase.c                        |    4 
 b/Modules/_json.c                             |   24 
 b/Modules/_lsprof.c                           |    4 
 b/Modules/_operator.c                         |   11 
 b/Modules/_pickle.c                           |  106 -
 b/Modules/_posixsubprocess.c                  |   15 
 b/Modules/_sre.c                              |   11 
 b/Modules/_ssl.c                              |    9 
 b/Modules/_testbuffer.c                       |    4 
 b/Modules/_testcapimodule.c                   |    4 
 b/Modules/_threadmodule.c                     |   32 
 b/Modules/_tkinter.c                          |   11 
 b/Modules/arraymodule.c                       |   29 
 b/Modules/cjkcodecs/clinic/multibytecodec.c.h |   50 
 b/Modules/clinic/_bz2module.c.h               |    8 
 b/Modules/clinic/_codecsmodule.c.h            |  318 +++--
 b/Modules/clinic/_cryptmodule.c.h             |   10 
 b/Modules/clinic/_datetimemodule.c.h          |    8 
 b/Modules/clinic/_dbmmodule.c.h               |   26 
 b/Modules/clinic/_elementtree.c.h             |   86 -
 b/Modules/clinic/_gdbmmodule.c.h              |   26 
 b/Modules/clinic/_lzmamodule.c.h              |   16 
 b/Modules/clinic/_opcode.c.h                  |   10 
 b/Modules/clinic/_pickle.c.h                  |   34 
 b/Modules/clinic/_sre.c.h                     |  124 +-
 b/Modules/clinic/_ssl.c.h                     |   74 -
 b/Modules/clinic/_tkinter.c.h                 |   50 
 b/Modules/clinic/_winapi.c.h                  |  124 +-
 b/Modules/clinic/arraymodule.c.h              |   34 
 b/Modules/clinic/audioop.c.h                  |  210 ++-
 b/Modules/clinic/binascii.c.h                 |   36 
 b/Modules/clinic/cmathmodule.c.h              |   24 
 b/Modules/clinic/fcntlmodule.c.h              |   34 
 b/Modules/clinic/grpmodule.c.h                |   14 
 b/Modules/clinic/md5module.c.h                |    8 
 b/Modules/clinic/posixmodule.c.h              |  642 ++++++-----
 b/Modules/clinic/pyexpat.c.h                  |   32 
 b/Modules/clinic/sha1module.c.h               |    8 
 b/Modules/clinic/sha256module.c.h             |   14 
 b/Modules/clinic/sha512module.c.h             |   14 
 b/Modules/clinic/signalmodule.c.h             |   50 
 b/Modules/clinic/unicodedata.c.h              |   42 
 b/Modules/clinic/zlibmodule.c.h               |   68 -
 b/Modules/itertoolsmodule.c                   |   20 
 b/Modules/main.c                              |    2 
 b/Modules/pyexpat.c                           |    3 
 b/Modules/signalmodule.c                      |    9 
 b/Modules/xxsubtype.c                         |    4 
 b/Objects/abstract.c                          |  403 ++++---
 b/Objects/bytesobject.c                       |    2 
 b/Objects/classobject.c                       |   36 
 b/Objects/clinic/bytearrayobject.c.h          |   90 -
 b/Objects/clinic/bytesobject.c.h              |   66 -
 b/Objects/clinic/dictobject.c.h               |   10 
 b/Objects/clinic/unicodeobject.c.h            |   10 
 b/Objects/descrobject.c                       |  162 +-
 b/Objects/dictobject.c                        |   26 
 b/Objects/enumobject.c                        |    8 
 b/Objects/exceptions.c                        |   91 +
 b/Objects/fileobject.c                        |   29 
 b/Objects/floatobject.c                       |   25 
 b/Objects/funcobject.c                        |   77 -
 b/Objects/genobject.c                         |    2 
 b/Objects/iterobject.c                        |    6 
 b/Objects/listobject.c                        |   20 
 b/Objects/longobject.c                        |   40 
 b/Objects/methodobject.c                      |  139 ++
 b/Objects/object.c                            |    4 
 b/Objects/odictobject.c                       |    2 
 b/Objects/rangeobject.c                       |   12 
 b/Objects/tupleobject.c                       |   21 
 b/Objects/typeobject.c                        | 1463 +++++++++++++++++++-------
 b/Objects/unicodeobject.c                     |   58 -
 b/Objects/weakrefobject.c                     |   22 
 b/PC/clinic/msvcrtmodule.c.h                  |   42 
 b/PC/clinic/winreg.c.h                        |  128 +-
 b/PC/clinic/winsound.c.h                      |   26 
 b/PCbuild/pythoncore.vcxproj                  |    4 
 b/Parser/tokenizer.c                          |    7 
 b/Python/ast.c                                |   31 
 b/Python/bltinmodule.c                        |  173 +--
 b/Python/ceval.c                              |  591 +++++++++-
 b/Python/clinic/bltinmodule.c.h               |  104 +
 b/Python/clinic/import.c.h                    |   18 
 b/Python/codecs.c                             |   17 
 b/Python/errors.c                             |  105 -
 b/Python/getargs.c                            |  284 ++++-
 b/Python/import.c                             |   27 
 b/Python/modsupport.c                         |  244 +++-
 b/Python/pythonrun.c                          |   10 
 b/Python/sysmodule.c                          |   32 
 b/Tools/clinic/clinic.py                      |  115 +-
 pystack.c                                     |  288 +++++
 pystack.h                                     |   64 +
 121 files changed, 5420 insertions(+), 2802 deletions(-)
msg265859 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-19 13:38
Status of my FASTCALL implementation (34456cce64bb.patch):

* Add METH_FASTCALL calling convention to C functions, similar
  to METH_VARARGS|METH_KEYWORDS
* Clinic uses METH_FASTCALL when possible (it may use METH_FASTCALL
  for all cases in the future)
* Add new C functions:

  - _PyObject_FastCall(func, stack, nargs, kwds): root of the FASTCALL branch
  - PyObject_CallNoArg(func)
  - PyObject_CallArg1(func, arg)

* Add new type flags changing the calling conventions of tp_new, tp_init and
  tp_call:

  - Py_TPFLAGS_FASTNEW
  - Py_TPFLAGS_FASTINIT
  - Py_TPFLAGS_FASTCALL

* Backward incompatible change of Py_TPFLAGS_FASTNEW and Py_TPFLAGS_FASTINIT
  flags: calling explicitly type->tp_new() and type->tp_init() is now a bug
  and is likely to crash, since the calling convention can now be FASTCALL.

* New _PyType_CallNew() and _PyType_CallInit() functions to call tp_new and
  tp_init of a type. Functions which called tp_new and tp_init directly were
  patched.

* New helper functions to parse function arguments:

  - PyArg_ParseStack()
  - PyArg_ParseStackAndKeywords()
  - PyArg_UnpackStack()

* New Py_Build functions:

  - Py_BuildStack()
  - Py_VaBuildStack()

* New _PyStack API to handle a stack:

  - _PyStack_Alloc(), _PyStack_Free(), _PyStack_Copy()
  - _PyStack_FromTuple()
  - _PyStack_FromBorrowedTuple()
  - _PyStack_AsTuple(), _PyStack_AsTupleSlice()
  - ...

* Many changes were done in the typeobject.c file to handle FASTCALL, new
  type flags, handle correctly flags when a new type is created, etc.

* ceval.c: add _PyFunction_FastCall() function (somehow, I only exposed
  existing code)

A large part of the patch changes existing code to use the new calling
convention in many functions of many modules. Some changes were generated
by Argument Clinic. IMHO the best would be to use Argument Clinic in more
places, rather than patching the code manually.
msg265887 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-19 19:51
> Result of the benchmark suite:
>
> slower (3):
>
> * raytrace: 1.06x slower
> * etree_parse: 1.03x slower
> * normal_startup: 1.02x slower

Hum, I recompiled the patched Python, again with PGO+LTO, and ran the same benchmark with the same command. In short, I replayed exactly the same scenario. And... Only raytrace remains slower, etree_parse and normal_startup moved to the "not significant" list.

The difference in the benchmark result doesn't come from the benchmark. For example, I ran the normal_startup benchmark again 3 times: I got the same result 3 times.

### normal_startup ###
Avg: 0.295168 +/- 0.000991 -> 0.294926 +/- 0.00048: 1.00x faster
Not significant

### normal_startup ###
Avg: 0.294871 +/- 0.000606 -> 0.294883 +/- 0.00072: 1.00x slower
Not significant

### normal_startup ###
Avg: 0.295096 +/- 0.000706 -> 0.294967 +/- 0.00068: 1.00x faster
Not significant

IMHO the difference comes from the data collected by PGO.
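
As a crude sanity check on the three runs above, one can test whether the two averages differ by less than their combined spread. This is only a rough sketch and not the significance test used by the benchmark suite; within_noise is a hypothetical helper.

```python
def within_noise(old_avg, old_std, new_avg, new_std):
    """True when the averages differ by less than their combined spread."""
    return abs(new_avg - old_avg) < old_std + new_std


runs = [
    (0.295168, 0.000991, 0.294926, 0.00048),
    (0.294871, 0.000606, 0.294883, 0.00072),
    (0.295096, 0.000706, 0.294967, 0.00068),
]
for run in runs:
    print(within_noise(*run))
```

All three runs pass the check, which agrees with the "Not significant" verdict in the report.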
msg265896 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-19 21:03
> In short, I replayed exactly the same scenario. And... Only raytrace remains slower, (...)

Oh, it looks like the reference binary calls the garbage collector less frequently than the patched Python. In the patched Python, generation 2 collections are needed, whereas no generation 2 collection is needed with the reference binary.
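
One way to compare how often each generation is collected is to read gc.get_stats() before and after a workload. A minimal sketch; the workload below is a made-up example and not the raytrace benchmark.

```python
import gc


def collections_during(workload):
    """Return how many collections each GC generation ran during `workload()`."""
    before = [gen["collections"] for gen in gc.get_stats()]
    workload()
    after = [gen["collections"] for gen in gc.get_stats()]
    return [end - start for start, end in zip(before, after)]


def make_cycles(count=100_000):
    """Hypothetical workload: create reference cycles so the GC has work to do."""
    for _ in range(count):
        node = []
        node.append(node)


# One counter per generation: [gen 0, gen 1, gen 2].
print(collections_during(make_cycles))
```

Running the same workload on two binaries and comparing the last counter shows whether one of them triggers generation 2 collections that the other avoids.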
msg265938 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-20 12:05
> unpickle_list: 1.11x faster

This result was unfair: my fastcall branch contained the optimization of the issue #27056. I just pushed this optimization into the default branch.

I ran the benchmark again: the result is now "not significant", as expected, since the benchmark is a microbenchmark testing C functions of Modules/_pickle.c; it doesn't really rely on the performance of (C or Python) function calls.
msg266359 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-25 14:05
I fixed even more issues with my setup to run benchmarks. Results should be even more reliable. Moreover, I fixed multiple reference leaks in the code which introduced performance regressions. I started to write articles to explain how to run stable benchmarks:

* https://haypo.github.io/journey-to-stable-benchmark-system.html
* https://haypo.github.io/journey-to-stable-benchmark-deadcode.html
* https://haypo.github.io/journey-to-stable-benchmark-average.html

Summary of benchmarks at the revision e6f3bf996c01:

Faster (25):
- pickle_list: 1.29x faster
- etree_generate: 1.22x faster
- pickle_dict: 1.19x faster
- etree_process: 1.16x faster
- mako_v2: 1.13x faster
- telco: 1.09x faster
- raytrace: 1.08x faster
- etree_iterparse: 1.08x faster
- regex_compile: 1.07x faster
- json_dump_v2: 1.07x faster
- etree_parse: 1.06x faster
- regex_v8: 1.05x faster
- call_method_unknown: 1.05x faster
- chameleon_v2: 1.05x faster
- fastunpickle: 1.04x faster
- django_v3: 1.04x faster
- chaos: 1.04x faster
- 2to3: 1.03x faster
- pathlib: 1.03x faster
- unpickle_list: 1.03x faster
- json_load: 1.03x faster
- fannkuch: 1.03x faster
- call_method: 1.02x faster
- unpack_sequence: 1.02x faster
- call_method_slots: 1.02x faster

Slower (4):
- regex_effbot: 1.08x slower
- nbody: 1.08x slower
- spectral_norm: 1.07x slower
- nqueens: 1.06x slower

Not significant (13):
- tornado_http
- startup_nosite
- simple_logging
- silent_logging
- richards
- pidigits
- normal_startup
- meteor_contest
- go
- formatted_logging
- float
- fastpickle
- call_simple

I'm now investigating why 4 benchmarks are slower.

Note: I'm still using my patched CPython benchmark suite to get more stable benchmark. I will send patches upstream later.
msg274124 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-01 13:15
I split the giant patch into smaller patches that are easier to review. The first part (_PyObject_FastCall, _PyObject_FastCallDict) is already merged. Other issues were opened to implement the full feature. I am now closing this issue.
History
Date | User | Action | Args
2016-09-01 13:15:25 | vstinner | set | status: open -> closed; resolution: fixed; messages: + msg274124
2016-05-25 14:05:19 | vstinner | set | messages: + msg266359
2016-05-20 12:05:27 | vstinner | set | messages: + msg265938
2016-05-19 21:03:51 | vstinner | set | messages: + msg265896
2016-05-19 19:51:46 | vstinner | set | messages: + msg265887
2016-05-19 13:38:54 | vstinner | set | messages: + msg265859
2016-05-19 13:38:19 | vstinner | set | files: + 34456cce64bb.patch; messages: + msg265857
2016-05-19 13:36:19 | vstinner | set | files: - 34456cce64bb.diff
2016-05-19 13:35:17 | vstinner | set | files: + 34456cce64bb.diff
2016-05-19 13:30:46 | vstinner | set | messages: + msg265856
2016-05-09 22:55:14 | jstasiak | set | nosy: + jstasiak
2016-04-29 22:23:44 | vstinner | set | messages: + msg264530
2016-04-29 22:16:35 | serhiy.storchaka | set | messages: + msg264529
2016-04-29 21:55:12 | vstinner | set | messages: + msg264526
2016-04-29 21:43:53 | serhiy.storchaka | set | messages: + msg264525
2016-04-29 20:37:52 | vstinner | set | messages: + msg264519
2016-04-29 20:35:56 | vstinner | set | messages: + msg264518
2016-04-24 07:37:58 | serhiy.storchaka | set | messages: + msg264102
2016-04-24 07:15:35 | vstinner | set | messages: + msg264101
2016-04-24 06:37:35 | serhiy.storchaka | set | messages: + msg264098
2016-04-22 14:56:57 | vstinner | set | messages: + msg264021
2016-04-22 12:52:39 | serhiy.storchaka | set | messages: + msg264009
2016-04-22 11:52:19 | vstinner | set | files: + bench_fast-2.py; messages: + msg264003
2016-04-22 11:40:11 | vstinner | set | files: + bench_fast.py; messages: + msg263999
2016-04-22 11:12:30 | vstinner | set | messages: + msg263996
2016-04-22 11:10:16 | vstinner | set | messages: + msg263995
2016-04-22 10:41:52 | vstinner | set | files: + ad4a53ed1fbf.diff
2016-04-22 00:44:11 | vstinner | set | messages: + msg263946; title: Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments -> [WIP] Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
2016-04-22 00:41:28 | vstinner | set | hgrepos: + hgrepo342
2016-04-21 17:04:54 | serhiy.storchaka | set | messages: + msg263926
2016-04-21 15:05:13 | vstinner | set | messages: + msg263924
2016-04-21 15:03:09 | vstinner | set | files: + call_stack-3.patch; messages: + msg263923; title: Add a new _PyObject_CallStack() function which avoids the creation of a tuple or dict for arguments -> Add a new _PyObject_FastCall() function which avoids the creation of a tuple or dict for arguments
2016-04-21 14:24:16 | larry | set | messages: + msg263920
2016-04-21 13:45:49 | serhiy.storchaka | set | messages: + msg263918
2016-04-21 10:42:27 | vstinner | set | files: + call_stack-2.patch; messages: + msg263910
2016-04-21 10:28:27 | serhiy.storchaka | set | messages: + msg263909
2016-04-21 10:20:50 | vstinner | set | messages: + msg263908
2016-04-21 09:53:53 | serhiy.storchaka | set | messages: + msg263907
2016-04-21 08:58:02 | vstinner | set | nosy: + yselivanov
2016-04-21 08:57:21 | vstinner | create |