Message409609
I was surprised that
https://bugs.python.org/issue44376
managed to get i**2 to within a factor of 2 of i*i's speed. The overheads of running long_pow() at all are high! Don't overlook that initialization of stack variables at the start, like
PyLongObject *z = NULL; /* accumulated result */
isn't free - code has to be generated to force zeroes into those variables. The initialization of `table[]` alone requires code to fill 256 memory bytes with zeroes (down to 128 on current main branch). Nothing is free.
We can't sanely move the `table` initialization expense into the "giant k-ary window exponentiation" block either, because every bigint operation can fail ("out of memory"), and the macros for doing the common ones (MULT and REDUCE) can do "goto Error;", and that common exit code has no way to know what is or isn't initialized. We can't let it see uninitialized stack trash.
The exit code in turn has a string of things like
Py_DECREF(a);
Py_DECREF(b);
Py_XDECREF(c);
and those cost cycles too, including tests and branches.
So the real "outrage" to me is why x*x took 17.6 nsec for x == 10 in the original report. That's many times longer than the HW takes to do the actual multiply. Whether it's spelled x*x or x**2, we're overwhelmingly timing overheads. `pow()` has many because it's a kind of Swiss army knife doing all sorts of things; what's `x*x`'s excuse? ;-)
Date | User | Action | Args
2022-01-03 19:25:09 | tim.peters | set | recipients: + tim.peters, rhettinger, mark.dickinson, Dennis Sweeney, kj
2022-01-03 19:25:09 | tim.peters | set | messageid: <1641237909.84.0.850870612325.issue46020@roundup.psfhosted.org>
2022-01-03 19:25:09 | tim.peters | link | issue46020 messages
2022-01-03 19:25:09 | tim.peters | create |