This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: decimal: Use FASTCALL and/or Argument Clinic
Type: performance Stage: resolved
Components: Versions: Python 3.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: skrah Nosy List: malin, rhettinger, serhiy.storchaka, skrah, vstinner
Priority: normal Keywords: patch

Created on 2017-01-17 16:38 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
decimal.patch vstinner, 2017-01-17 16:38
decimal-2.patch vstinner, 2017-01-27 15:31 review
Messages (23)
msg285665 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-17 16:38
I'm currently working on the isuse #29259: "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects". I used bm_telco of the performance benchmark suite to check which functions still require to create a temporary tuple for positional arguments and a temporary dict for keyword arguments. I found 3 remaining functions which have an impact on the result of the benchmark:

* print(): optimized by the issue #29296
* _struct.unpack(): I just created the issue #29300 "Modify the _struct module to use FASTCALL and Argument Clinic"
* _decimal.Decimal.quantize()

I would like to know if Stephan would be ok to modify the _decimal module to use FASTCALL. I know that recently he reverted changes to keep the same code base on Python 3.5, 3.6 and 3.7.

With 4 changes (tp_fastcall #29259, print #29296, unpack #29300 and this issue), bm_telco becomes 22% faster which is not negligible!

   20.9 ms +- 0.5 ms => 16.4 ms +- 0.5 ms

Attached decimal.patch patch is the strict minimum change to optimize bm_telco, but I would prefer to change all _decimal functions and methods using METH_VARARGS and METH_VARARGS|METH_KEYWORDS to convert them to METH_FASTCALL.

The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different.
msg285670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-17 17:10
Oh wait, I'm not sure that attached patch has a significant impact on performances. It seems like the speedup mostly comes from the print patch:
http://bugs.python.org/issue29296#msg285668

But well, it is still interesting to use METH_FASTCALL in decimal ;-)
msg285671 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-17 17:12
See msg207652 in issue20177.
msg285673 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-01-17 17:29
This should not go in without Stephan Krah's approval.  This code is finely tuned and carefully arranged.
msg285680 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-17 20:42
Raymond Hettinger: "This should not go in without Stephan Krah's approval.  This code is finely tuned and carefully arrang"

Sure, that's why I opened this issue.


Serhiy Storchaka: "See msg207652 in issue20177."

Oh, I didn't know that Stefan Krah already gave his opinion.

Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.
msg286359 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-27 14:56
Hello, I'm back! :-)

I almost abandonned my the tp_fastcall change (#29259).

print() now uses FASTCALL (#29296), it already made bm_telco faster:
https://speed.python.org/timeline/#/?exe=5&ben=telco&env=1&revs=50&equid=off&quarts=on&extr=on
(see the speedup around January 16 at the right)

For the unpack module, I'm still working on my patch in the issue #29300, but it doesn't seem to have a significant impact on bm_telco.

So the remaining question is the usage of FASTCALL or Argument Clinic in the _decimal module.


> Since the performance impact is unclear, I prefer to close this issue. I will come back if argument parsing is more clearly identified as a performance bottleneck.

I ran a new benchmark and FASTCALL makes bm_telco benchmark 1.11x faster:

Median +- std dev: [ref] 19.4 ms +- 0.7 ms -> [decimal-2.patch] 17.5 ms +- 0.6 ms: 1.11x faster (-10%)

So I decided to reopen the issue.

Stefan: what do you think of using Argument Clinic and/or FASTCALL in _decimal? Is "bm_telco 1.11x faster" significant enough for you to justify such change?

"The best would be to use Argument Clinic. AC exists since Python 3.5, so it should be possible to keep the same code base on Python 3.5-3.7, only generated code would be different."
msg286368 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-27 15:23
I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable. I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.
msg286369 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-27 15:31
Oops, I forgot to attach decimal-2.patch, here you have.


> I would dissuade from direct using of FASTCALL, but Argument Clinic is enough reliable.

I wrote decimal-2.patch in a few minutes, when I was trying to find all functions still calling _PyStack_AsTuple() in my tp_fast{new,init,call} fork. I just wrote it to identify which parts of CPython can be optimized to make bm_telco faster.

I agree that Argument Clinic should be preferred over using directly METH_FASTCALL, especially for the _decimal module: Stefan already wrote that he wants to have the same code base on Python 3.5, 3.6 and 3.7.


> I would recommend to use Argument Clinic unless Stefan supports external fork of the decimal module and is limited by public API.

What is the status of Argument Clinic? Is it something "public" or not?

The status of _decimal is also unclear to me. Is it part of CPython or not? :-) I know that there is also a "backported" module for Python 2:

   https://pypi.python.org/pypi/cdecimal
   http://www.bytereef.org/mpdecimal/index.html

IMHO it's a great tool. Maybe it's time to make it usable outside CPython in Python 3.7? Or maybe we should wait until the remaining missing features are implemented? Issues #20291 and #29299 for example.

I'm now waiting for Stefan's feedback ;-)
msg286371 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2017-01-27 15:43
I'll take a look when I have the opportunity.

AC will not happen: It makes the module too large and unreadable.
msg286372 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-27 15:45
> What is the status of Argument Clinic? Is it something "public" or not?

No, it is for for internal CPython use only. It lacks some features (support 
of var-positional and var-keyword parameters, optional parameters without 
default value), the syntax of positional-only parameters is not officially 
accepted still, and future optimizations can require incompatible changes.

Only when all CPython builtins and extensions be converted to Argument Clinic,
PEP 457 (or an alternative) be accepted, Argument Clinic issues be resolved, 
we could say it stable enough. For now even Argument Clinic tests are broken.
msg286374 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-27 15:52
Stefan Krah:
> AC will not happen: It makes the module too large and unreadable.

Ah you dislike the additional [clinic input] sections?

It's kind of strange when you have to convert existing code, when once
the code uses AC, I prefer AC to separated documentation variables. It
helps to keep docstrings more up to date, and it helps to enhance the
API (ex: allow keywords, rename parameters to better names, etc.). It
also helps to make the documentation closer to the code, which is IMHO
a good thing :-)

IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are
"unreadable", and I'm happy to be able to hide them!

FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with
_PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This
object only decides keyword names once and is more efficient to parse
arguments. It explains partially the speedup. Only partially because
bm_telco only calls the .quantize() method, and it only uses
positional arguments (no keyword arguments) ;-)
msg286375 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2017-01-27 16:01
> STINNER Victor added the comment:
> > AC will not happen: It makes the module too large and unreadable.
> 
> Ah you dislike the additional [clinic input] sections?

Yes, they tear apart the code.  I stopped reading many C files because
of this.  Brett asked why several people don't review, this is actually
*one* of the reasons for me.

> It's kind of strange when you have to convert existing code, when once
> the code uses AC, I prefer AC to separated documentation variables. It
> helps to keep docstrings more up to date, and it helps to enhance the
> API (ex: allow keywords, rename parameters to better names, etc.). It
> also helps to make the documentation closer to the code, which is IMHO
> a good thing :-)

Apparently it works for several people, but not me.

> IMHO the PyArg_ParseXXX() calls and their "kwlist" static variable are
> "unreadable", and I'm happy to be able to hide them!
> 
> FYI decimal-2.patch replaces PyArg_ParseTupleAndKeywords() with
> _PyArg_ParseStackAndKeywords() with static _PyArg_Parser object. This
> object only decides keyword names once and is more efficient to parse
> arguments. It explains partially the speedup. Only partially because
> bm_telco only calls the .quantize() method, and it only uses
> positional arguments (no keyword arguments) ;-)

Okay, thanks!
msg286448 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-01-29 19:47
>> Ah you dislike the additional [clinic input] sections?
>
> Yes, they tear apart the code.  I stopped reading many C files because
> of this.

I concur with Stefan.  AC is very impactful on modules, greatly affecting their readability and maintainability.  The existing PyArg_ParseXXX() calls and their "kwlist" static variable are very easy to work with, easy to modify, and easy to teach (I cover them effortlessly in the Python classes I teach).  In contrast, AC is much harder to learn, harder to get right, doesn't fit some existing APIs, harder to change, and it greatly expands the number of lines of code.
msg287131 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2017-02-06 13:58
If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.
msg287133 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-06 14:01
> If the API can still change (msg287083), I think I'll wait with this until shortly before 3.7 beta.

I do not understand why Serhiy keeps repeating that the API is going to change, I have no such plan. IMHO the FASTCALL API is now very well defined: (PyObject **args, Py_ssize_t nargs, PyObject *kwnames), and helper functions are now well tested.

We might optimize Argument Clinic further, but for decimal, it seems like it's a no-no and that direct usage of METH_FASTCALL is preferred.

So no, I don't plan or expect any API change.
msg287149 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-02-06 16:08
Oh ok, it seems like Serhiy wants to change METH_FASTCALL. I didn't know :-) He just opened the issue #29464: "Specialize FASTCALL for functions with positional-only parameters". Interesting idea.
msg297135 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-28 01:47
So Stefan: what do you want to do? Use FASTCALL or not? :-)

I would prefer to not wait until the beta if you are ok to use FASTCALL.

Note: it seems like bpo-29464 is going to be approved ;-)
msg297153 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2017-06-28 05:20
I think I'll wait until #29464 is committed and the API is considered frozen (see msg295176?).
msg318123 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-29 22:14
Sorry, I lost interest in this optimization. If someone wants to work on it, please open a new issue.
msg338430 - (view) Author: Ma Lin (malin) * Date: 2019-03-20 02:01
> AC will not happen: It makes the module too large and unreadable.

AC has great performance improvements these days: issue35582 and issue36127
It's quite worthy of reconsidering this decision.
msg338462 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2019-03-20 12:17
Thanks, but it is still not going to happen. Look at the increased code size
in e.g. blake2s_impl.c.h.


I want to know what is going on in the code.  Also, the performance
improvements are in argument parsing, but not when you have numerical
code like a * b, where a and b are already decimals.
msg338513 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-20 23:36
>  Also, the performance improvements are in argument parsing, but not when you have numerical code like a * b, where a and b are already decimals.

If the function call takes around 100 ns, the benefit of FASTCALL and the new more efficient function to parse arguments becomes quite significant. Some Decimal methods are that fast if not faster.

python3 -m perf timeit \
   -s 'import decimal; a=decimal.Decimal(1); b=decimal.Decimal(2); ctx=decimal.getcontext()' \
   'ctx.copy_sign(a, b)' \
   --duplicate=1024

=> Mean +- std dev: [ref] 148 ns +- 1 ns -> [fastcall] 98.9 ns +- 4.9 ns: 1.49x faster (-33%)

./python -m perf timeit \
   -s 'import decimal; a=decimal.Decimal(1); b=decimal.Decimal(2)' \
   'a.copy_sign(b)' \
   --duplicate=1024

=> Mean +- std dev: [ref] 123 ns +- 5 ns -> [fastcall] 86.5 ns +- 1.4 ns: 1.42x faster (-29%)

I wrote a quick & dirty change using directly METH_FASTCALL with _PyArg_ParseStackAndKeywords() (dec_mpd_qcopy_sign) and _PyArg_CheckPositional() (ctx_mpd_qcopy_sign). Using Argument Clinic may be more verbose, but it generates *even more* efficient code for functions accepting keyword arguments (like Decimal.copy_sign) and generate better docstring (at least, the signature for function parameters if no text is given).

Obviously, if your application only uses large numbers, the benefit will be invisible. But I guess that some people use decimal with "small" numbers ;-)

Note: Sure, you're right that operators like "a * b" wouldn't benefit from FASTCALL since they don't parse explicitly arguments. BINARY_MULTIPLY instruction gets directly 2 values from the Python stack and pass them to PyNumber_Multiply() with is defined to always take exactly 2 PyObject* in C.
msg338516 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-03-21 00:24
Hum, after reading again my previous, I'm not sure that my intent was clear. I'm fine with Stefan rejecting the optimization. He is the maintainer of decimal. I just wanted to comment what he said ;-)
History
Date User Action Args
2022-04-11 14:58:42adminsetgithub: 73487
2019-03-21 00:24:41vstinnersetmessages: + msg338516
2019-03-20 23:36:13vstinnersetmessages: + msg338513
2019-03-20 12:17:57skrahsetmessages: + msg338462
2019-03-20 02:01:04malinsetnosy: + malin
messages: + msg338430
2018-05-29 22:14:26vstinnersetstatus: open -> closed
resolution: out of date
messages: + msg318123

stage: resolved
2017-06-28 05:20:21skrahsetmessages: + msg297153
2017-06-28 01:47:33vstinnersetmessages: + msg297135
2017-02-06 16:08:36vstinnersetmessages: + msg287149
2017-02-06 14:01:22vstinnersetmessages: + msg287133
2017-02-06 13:58:08skrahsetmessages: + msg287131
2017-01-29 19:47:51rhettingersetmessages: + msg286448
2017-01-27 16:01:24skrahsetmessages: + msg286375
2017-01-27 15:52:55vstinnersetmessages: + msg286374
2017-01-27 15:45:46serhiy.storchakasetmessages: + msg286372
2017-01-27 15:43:04skrahsetmessages: + msg286371
2017-01-27 15:31:32vstinnersetfiles: + decimal-2.patch

messages: + msg286369
2017-01-27 15:23:48serhiy.storchakasetmessages: + msg286368
2017-01-27 14:56:35vstinnersetstatus: closed -> open
resolution: rejected -> (no value)
messages: + msg286359
2017-01-17 20:42:06vstinnersetstatus: open -> closed
resolution: rejected
messages: + msg285680
2017-01-17 17:29:46rhettingersetassignee: skrah

messages: + msg285673
nosy: + rhettinger
2017-01-17 17:12:19serhiy.storchakasetmessages: + msg285671
2017-01-17 17:10:02vstinnersetmessages: + msg285670
2017-01-17 16:38:48vstinnersetnosy: + serhiy.storchaka
2017-01-17 16:38:44vstinnercreate