classification
Title: decimal: Use FASTCALL and/or Argument Clinic
Type: performance Stage:
Components: Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: skrah Nosy List: haypo, rhettinger, serhiy.storchaka, skrah
Priority: normal Keywords: patch

Created on 2017-01-17 16:38 by haypo, last changed 2017-02-06 16:08 by haypo.

Files
File name Uploaded Description Edit
decimal.patch haypo, 2017-01-17 16:38
decimal-2.patch haypo, 2017-01-27 15:31 review
Messages (16)
msg285665 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-17 16:38
I'm currently working on the isuse #29259: "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects". I used bm_telco of the performance benchmark suite to check which functions still require to create a temporary tuple for positional arguments and a temporary dict for keyword arguments. I found 3 remaining functions which have an impact on the result of the benchmark:

* print(): optimized by the issue #29296
* _struct.unpack(): I just created the issue #29300 "Modify the _struct module to use FASTCALL and Argument Clinic"
* _decimal.Decimal.quantize()

I would like to know if Stephan would be ok to modify the _decimal module to use FASTCALL. I know that recently he reverted changes to keep the same code base on Python 3.5, 3.6 and 3.7.

With 4 changes (tp_fastcall #29259, print #29296, unpack #29300 and this issue), bm_telco becomes 22% faster which is not negligible!

   20.9 ms +- 0.5 ms => 16.4 ms +- 0.5 ms

Attached decimal.patch patch is the strict minimum change to optimize bm_telco, but I would prefer to change all _decimal functions and methods using METH_VARARGS and METH_VARARGS|METH_KEYWORDS to convert them to METH_FASTCALL.

The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different.
msg285670 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-17 17:10
Oh wait, I'm not sure that attached patch has a significant impact on performances. It seems like the speedup mostly comes from the print patch:
http://bugs.python.org/issue29296#msg285668

But well, it is still interesting to use METH_FASTCALL in decimal ;-)
msg285671 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-17 17:12
See msg207652 in issue20177.
msg285673 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-01-17 17:29
This should not go in without Stephan Krah's approval.  This code is finely tuned and carefully arranged.
msg285680 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-17 20:42
Raymond Hettinger: "This should not go in without Stephan Krah's approval.  This code is finely tuned and carefully arrang"

Sure, that's why I opened this issue.


Serhiy Storchaka: "See msg207652 in issue20177."

Oh, I didn't know that Stefan Krah already gave his opinion.

Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.
msg286359 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-27 14:56
Hello, I'm back! :-)

I almost abandonned my the tp_fastcall change (#29259).

print() now uses FASTCALL (#29296), it already made bm_telco faster:
https://speed.python.org/timeline/#/?exe=5&ben=telco&env=1&revs=50&equid=off&quarts=on&extr=on
(see the speedup around January 16 at the right)

For the unpack module, I'm still working on my patch in the issue #29300, but it doesn't seem to have a significant impact on bm_telco.

So the remaining question is the usage of FASTCALL or Argument Clinic in the _decimal module.


> Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.

I ran a new benchmark and FASTCALL makes bm_telco benchmark 1.11x faster:

Median +- std dev: [ref] 19.4 ms +- 0.7 ms -> [decimal-2.patch] 17.5 ms +- 0.6 ms: 1.11x faster (-10%)

So I decided to reopen the issue.

Stefan: what do you think of using Argument Clinic and/or FASTCALL in _decimal? Is "bm_telco 1.11x faster" significant enough for you to justify such change?

"The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different."
msg286368 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-27 15:23
I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable. I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.
msg286369 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-27 15:31
Oops, I forgot to attach decimal-2.patch, here you have.


> I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable.

I wrote decimal-2.patch in a few minutes, when I was trying to find all functions still calling _PyStack_AsTuple() in my tp_fast{new,init,call} fork. I just wrote it to identify which parts of CPython can be optimized to make bm_telco faster.

I agree that Argument Clinic should be preferred over using directly METH_FASTCALL, especially for the _decimal module: Stefan already wrote that he wants to have the same code base on Python 3.5, 3.6 and 3.7.


> I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.

What is the status of Argument Clinic? Is it something "public" or not?

The status of _decimal is also unclear to me. Is it part of CPython or not? :-) I know that there is also a "backported" module for Python 2:

   https://pypi.python.org/pypi/cdecimal
   http://www.bytereef.org/mpdecimal/index.html

IMHO it's a great tool. Maybe it's time to make it usable outside CPython in Python 3.7? Or maybe we should wait until the remaining missing features are implemented? Issues #20291 and #29299 for example.

I'm now waiting for Stefan's feedback ;-)
msg286371 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2017-01-27 15:43
I'll take a look when I have the opportunity.

AC will not happen: It makes the module too large and unreadable.
msg286372 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-27 15:45
> What is the status of Argument Clinic? Is it something "public" or not?

No, it is for for internal CPython use only. It lacks some features (support 
of var-positional and var-keyword parameters, optional parameters without 
default value), the syntax of positional-only parameters is not officially 
accepted still, and future optimizations can require incompatible changes.

Only when all CPython builtins and extensions be converted to Argument Clinic,
PEP 457 (or an alternative) be accepted, Argument Clinic issues be resolved, 
we could say it stable enough. For now even Argument Clinic tests are broken.
msg286374 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-27 15:52
Stefan Krah:
> AC will not happen: It makes the module too large and unreadable.

Ah you dislike the additional [clinic input] sections?

It's kind of strange when you have to convert existing code, when once
the code uses AC, I prefer AC to separated documentation variables. It
helps to keep docstrings more up to date, and it helps to enhance the
API (ex: allow keywords, rename parameters to better names, etc.). It
also helps to make the documentation closer to the code, which is IMHO
a good thing :-)

IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are
"unreadable", and I'm happy to be able to hide them!

FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with
_PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This
object only decides keyword names once and is more efficient to parse
arguments. It explains partially the speedup. Only partially because
bm_telco only calls the .quantize() method, and it only uses
positional arguments (no keyword arguments) ;-)
msg286375 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2017-01-27 16:01
> STINNER Victor added the comment:
> > AC will not happen: It makes the module too large and unreadable.
> 
> Ah you dislike the additional [clinic input] sections?

Yes, they tear apart the code.  I stopped reading many C files because
of this.  Brett asked why several people don't review, this is actually
*one* of the reasons for me.

> It's kind of strange when you have to convert existing code, when once
> the code uses AC, I prefer AC to separated documentation variables. It
> helps to keep docstrings more up to date, and it helps to enhance the
> API (ex: allow keywords, rename parameters to better names, etc.). It
> also helps to make the documentation closer to the code, which is IMHO
> a good thing :-)

Apparently it works for several people, but not me.

> IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are
> "unreadable", and I'm happy to be able to hide them!
> 
> FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with
> _PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This
> object only decides keyword names once and is more efficient to parse
> arguments. It explains partially the speedup. Only partially because
> bm_telco only calls the .quantize() method, and it only uses
> positional arguments (no keyword arguments) ;-)

Okay, thanks!
msg286448 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-01-29 19:47
>> Ah you dislike the additional [clinic input] sections?
>
> Yes, they tear apart the code.  I stopped reading many C files because
> of this.

I concur with Stefan.  AC is very impactful on modules, greatly affecting their readability and maintainability.  The existing PyArg_ParseXXX() calls and their "kwlist" static variable are very easy to work with, easy to modify, and easy to teach (I cover them effortlessly in the Python classes I teach).  In contrast, AC is much harder to learn, harder to get right, doesn't fit some existing APIs, harder to change, and it greatly expands the number of lines of code.
msg287131 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2017-02-06 13:58
If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.
msg287133 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 14:01
> If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.

I do not understand why Serhiy keeps repeating that the API is going to change, I have no such plan. IMHO the FASTCALL API is now very well defined: (PyObject **args, Py_ssize_t nargs, PyObject *kwnames), and helper functions are now well tested.

We might optimize Argument Clinic further, but for decimal, it seems like it's a no-no and that direct usage of METH_FASTCALL is preferred.

So no, I don't plan or expect any API change.
msg287149 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-06 16:08
Oh ok, it seems like Serhiy wants to change METH_FASTCALL. I didn't know :-) He just opened the issue #29464: "Specialize FASTCALL for functions with positional-only parameters". Interesting idea.
History
Date User Action Args
2017-02-06 16:08:36hayposetmessages: + msg287149
2017-02-06 14:01:22hayposetmessages: + msg287133
2017-02-06 13:58:08skrahsetmessages: + msg287131
2017-01-29 19:47:51rhettingersetmessages: + msg286448
2017-01-27 16:01:24skrahsetmessages: + msg286375
2017-01-27 15:52:55hayposetmessages: + msg286374
2017-01-27 15:45:46serhiy.storchakasetmessages: + msg286372
2017-01-27 15:43:04skrahsetmessages: + msg286371
2017-01-27 15:31:32hayposetfiles: + decimal-2.patch

messages: + msg286369
2017-01-27 15:23:48serhiy.storchakasetmessages: + msg286368
2017-01-27 14:56:35hayposetstatus: closed -> open
resolution: rejected ->
messages: + msg286359
2017-01-17 20:42:06hayposetstatus: open -> closed
resolution: rejected
messages: + msg285680
2017-01-17 17:29:46rhettingersetassignee: skrah

messages: + msg285673
nosy: + rhettinger
2017-01-17 17:12:19serhiy.storchakasetmessages: + msg285671
2017-01-17 17:10:02hayposetmessages: + msg285670
2017-01-17 16:38:48hayposetnosy: + serhiy.storchaka
2017-01-17 16:38:44haypocreate