Message 259695 - Python tracker

Author	vstinner
Recipients	Yury.Selivanov, casevh, josh.r, lemburg, mark.dickinson, pitrou, rhettinger, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Date	2016-02-05.22:37:27
Message-id	<1454711847.9.0.736666080255.issue21955@psf.upfronthosting.co.za>

Content

Serhiy Storchaka: "My patches were just samples. I'm glad that Yury incorporated the main idea and that this helps."

Oh, if even Serhiy prefers Yury's patches, I should read them again :-)

--

I read fastint5.patch one more time and I finally understood the following macros:

+#define NB_SLOT(slot) offsetof(PyNumberMethods, slot)
+#define NB_BINOP(nb_methods, slot) \
+    (*(binaryfunc*)(& ((char*)nb_methods)[NB_SLOT(slot)]))
+#define PY_LONG_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyLong_Type.tp_as_number, slot))(left, right)
+#define PY_FLOAT_CALL_BINOP(slot, left, right) \
+    (NB_BINOP(PyFloat_Type.tp_as_number, slot))(left, right)

In short, with these macros a+b calls long_add(a, b). On first read, I thought they cast objects to C long or C double (don't ask me why).
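
A minimal sketch of how such offsetof()-based macros locate a slot, using mock types rather than CPython's real structures. Obj, NumberMethods, mock_long_add(), mock_long_sub(), call_add() and call_sub() are hypothetical names for illustration; only the macro shape follows the patch excerpt above.

```c
#include <assert.h>
#include <stddef.h>  /* offsetof */

/* Hypothetical stand-ins for PyObject and PyNumberMethods (not CPython's layout). */
typedef struct { long value; } Obj;
typedef Obj *(*binaryfunc)(Obj *, Obj *);

typedef struct {
    binaryfunc nb_add;
    binaryfunc nb_subtract;
} NumberMethods;

/* Same shape as the patch macros, applied to the mock struct:
   take the byte offset of the slot, index into the table as raw bytes,
   then reinterpret that address as a pointer to a function pointer. */
#define NB_SLOT(slot) offsetof(NumberMethods, slot)
#define NB_BINOP(nb_methods, slot) \
    (*(binaryfunc *)(&((char *)(nb_methods))[NB_SLOT(slot)]))

static Obj result;  /* one static result keeps the sketch allocation-free */

static Obj *mock_long_add(Obj *a, Obj *b) {
    result.value = a->value + b->value;
    return &result;
}

static Obj *mock_long_sub(Obj *a, Obj *b) {
    result.value = a->value - b->value;
    return &result;
}

static NumberMethods mock_long_as_number = { mock_long_add, mock_long_sub };

long call_add(long x, long y) {
    Obj a = { x }, b = { y };
    binaryfunc f = NB_BINOP(&mock_long_as_number, nb_add);
    assert(f == mock_long_add);  /* the macro yields the function stored in the slot */
    return f(&a, &b)->value;
}

long call_sub(long x, long y) {
    Obj a = { x }, b = { y };
    return NB_BINOP(&mock_long_as_number, nb_subtract)(&a, &b)->value;
}
```

The operands are passed through unchanged as object pointers; the casts only apply to the method table, which is why nothing is converted to a C long or C double.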


I see one difference between fastint5.patch and fastintfloat_alt.patch: fastint5.patch resolves the address of long_add() at runtime, whereas fastintfloat_alt.patch gets a direct pointer to _PyLong_Add() at compile time. I expected a subtle speedup, but I'm unable to see it in benchmarks (again, both patches have the same speed).

The float path is simpler in fastint5.patch because it uses the same code whether right is a float or a long, but it adds more checks on the slow path. Neither patch seems to have a real impact on the slow path. Is it worth changing the second if to PyFloat_CheckExact() and then checking the type of right in the if body, to avoid other checks on the slow path?

(C checks look very cheap, so I think I have already answered my own question :-))

--

fastint5.patch optimizes a+b, a-b, a*b, a/b and a//b. Why not other operators? List of operators from my constant folding optimization in fatoptimizer:

* int, float: a+b, a-b, a*b, a/b, +x, -x, ~x, a//b, a%b, a**b
* int only: a<<b, a>>b, a&b, a|b, a^b

If we optimize a//b, I suggest also optimizing a%b to be consistent. For integers, a**b, a<<b and a>>b would make sense too. Coming from the C language, I would prefer a<<b and a>>b over a*2**k or a//2**k, since I expect better performance.
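
A minimal sketch of why a//b and a%b pair naturally, and of the shift equivalence, written on plain C longs. The helper names floor_div(), floor_mod(), mul_pow2() and div_pow2() are hypothetical and come from neither patch. Unlike Python ints, C longs can overflow, so a real fast path would need overflow checks and a fallback to the generic code.

```c
#include <assert.h>

/* Python-style floor division on C longs (b must be nonzero).
   C's / truncates toward zero, so step the quotient down by one
   when the remainder is nonzero and its sign differs from b's. */
long floor_div(long a, long b) {
    assert(b != 0);
    long q = a / b;
    long r = a % b;
    if (r != 0 && ((r < 0) != (b < 0))) {
        q -= 1;
    }
    return q;
}

/* Python-style modulo: the result takes the sign of b. */
long floor_mod(long a, long b) {
    assert(b != 0);
    long r = a % b;
    if (r != 0 && ((r < 0) != (b < 0))) {
        r += b;
    }
    return r;
}

/* For non-negative a and small k (no overflow), shifts match
   multiplication and floor division by 2**k. */
long mul_pow2(long a, unsigned k) { return a << k; }
long div_pow2(long a, unsigned k) { return a >> k; }
```

Both helpers start from the same C division and remainder, and together they satisfy a == floor_div(a, b) * b + floor_mod(a, b).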

For float, -x and +x may be common, but less so than a+b, a-b, a*b, a/b.

Well, what I'm trying to say: if we choose the fastintfloat_alt.patch design, we will have to expose a lot of new C functions in headers and duplicate a lot of code.

To support more than 4 operators, we need a macro.

If we use a macro, it's cheap (in terms of code maintenance) to use it for most or even all operators.
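
A minimal sketch of the maintenance argument, using an X-macro so that one definition expands into a function per operator. INT_BINOPS, DEFINE_FAST_BINOP and the fast_* names are hypothetical and are not taken from either patch. The sketch ignores overflow, which a real fast path would have to detect before falling back to the generic code.

```c
#include <assert.h>

/* Hypothetical operator list: (function suffix, C operator).
   Supporting another operator means adding one line here. */
#define INT_BINOPS(X) \
    X(add,  +)        \
    X(sub,  -)        \
    X(mul,  *)        \
    X(band, &)        \
    X(bor,  |)        \
    X(bxor, ^)

/* Expands into one small function per entry in the list.
   Signed overflow is not handled in this sketch. */
#define DEFINE_FAST_BINOP(name, op) \
    long fast_##name(long a, long b) { return a op b; }

INT_BINOPS(DEFINE_FAST_BINOP)

/* Small self-check showing the generated functions in use. */
void check_fast_binops(void) {
    assert(fast_add(2, 3) == 5);
    assert(fast_bxor(6, 3) == 5);
}
```

With this layout the per-operator cost is a single list entry, whereas exposing a dedicated C function per operator requires a header declaration and a near-duplicate body each time.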

--

> But I don't quite understand why it adds any gain. Is this just due to overhead of calling PyNumber_Add?

Hum, that's a good question.


> Then we should test with other compilers and with the LTO option.

There are projects (I don't recall the number) but I would prefer to handle LTO separately. Python supports platforms and compilers which don't implement LTO.


> fastint5.patch adds an overhead for type checks and increases the size of ceval loop. What is outweigh this overhead?

I have stopped guessing the speedup just by reading the code or a patch. I only trust benchmarks :-)

Advice: don't trust yourself! Only trust benchmarks.
