This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: redundant iteration over digits in _PyLong_AsUnsignedLongMask
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Oren Milman, mark.dickinson, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2016-06-11 21:20 by Oren Milman, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
proposedPatches.diff Oren Milman, 2016-06-11 21:20 proposed patches diff file
patchedCPythonTestOutput.txt Oren Milman, 2016-06-11 21:20 test output of CPython with my patches (tested on my PC)
CPythonTestOutput.txt Oren Milman, 2016-06-11 21:21 test output of CPython without my patches (tested on my PC)
capiWrapperModule.c Oren Milman, 2016-06-17 12:35 a simple C extension I used for micro-benchmarking
badMicroBenchProposedPatches.diff Oren Milman, 2016-06-17 12:36 diff file of the proposed patches with the (overall) bad micro-benchmark results
proposedPatches.diff Oren Milman, 2016-06-17 12:37 proposed patches in Modules/_testcapimodule diff file
Pull Requests
URL Status Linked
PR 392 merged Oren Milman, 2017-03-02 09:06
Messages (8)
msg268274 - (view) Author: Oren Milman (Oren Milman) * Date: 2016-06-11 21:20
------------ current state ------------
1. In Objects/longobject.c in _PyLong_AsUnsignedLongMask, in case v is a multiple-digit int, _PyLong_AsUnsignedLongMask iterates over all of its digits (going from the most to the least significant digit) and does (for each digit) 'x = (x << PyLong_SHIFT) | v->ob_digit[i];'.
However, iterating over digits other than the 'ceil(sizeof(x) * 8.0 / PyLong_SHIFT)' least significant digits is redundant, because their bits would be shifted out of x anyway.

Looking at past changes, the iteration over all of the digits of a multiple-digit v has remained essentially the same since _PyLong_AsUnsignedLongMask was added, in revision 28652.

2. In Modules/_testcapimodule.c, the function comment of test_k_code, and an error string inside that function, contain mistakes.

Looking at past changes, these mistakes have been there since test_k_code was added, in revision 28652.


------------ proposed changes ------------
1. In _PyLong_AsUnsignedLongMask, in case v is a multiple-digit int, iterate only over the 'Py_MIN(i, sizeof(x) * 8 / PyLong_SHIFT + 1)' least significant digits.

Obviously, 'sizeof(x) * 8 / PyLong_SHIFT + 1' might be off by one in case CPython is compiled with a PyLong_SHIFT other than 15 or 30. I suspect it's overkill, but I added an assert, just in case.

2. Fix the function comment of test_k_code, and an error string inside that function.


------------ alternative changes ------------
1. I have also considered iterating only over the 'Py_MIN(i, (Py_ssize_t)ceil(sizeof(x) * 8.0 / PyLong_SHIFT))' least significant digits.
Even though this number of digits is guaranteed to be accurate, it cannot be calculated at compile time, so the optimization would probably end up as a net performance penalty.


------------ diff ------------
The proposed patches diff file is attached.


------------ tests ------------
I built the patched CPython for x86, and played with it a little. Everything seemed to work as usual. 

In addition, I ran 'python_d.exe -m test -j3' (on my 64-bit Windows 10) with and without the patches, and got essentially the same output.
The outputs of both runs are attached.
msg268581 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-14 19:33
The accurate number of digits is (sizeof(x) * 8 + PyLong_SHIFT - 1) / PyLong_SHIFT or (sizeof(x) * 8 - 1) / PyLong_SHIFT + 1.

There is a similar code in _PyLong_AsUnsignedLongLongMask().

The changes to Modules/_testcapimodule.c look unrelated to the implementation of _PyLong_AsUnsignedLongMask.

We need some micro-benchmarking results. The patch speeds up _PyLong_AsUnsignedLongLongMask() for very large integers, but in most cases this function is called for much smaller integers, and the patch adds an additional check. We should ensure that this doesn't slow down the function.
msg268586 - (view) Author: Oren Milman (Oren Milman) * Date: 2016-06-14 20:44
Ah, that's a cool alternative to dividing and taking the ceiling. I'll change my patch accordingly.

I'll also write the patch for _PyLong_AsUnsignedLongLongMask, and work on some micro-benchmarking for it and for _PyLong_AsUnsignedLongMask.

Indeed, _testcapimodule.c is not really related. Should I open a separate issue just for that?
(I am often worried about opening many small issues... Please let me know if I shouldn't be.)

Now that you mention it, a similar 'while (--i >= 0)' loop can be found in some other functions:
1. PyLong_AsLongAndOverflow
2. PyLong_AsSsize_t
3. PyLong_AsUnsignedLong
4. PyLong_AsSize_t
5. PyLong_AsLongLongAndOverflow
I suspect we could optimize these 5 by first determining whether the Python int has too many bits, instead of shifting and checking for an overflow on each iteration. 
I would definitely give it a try when I am done with _PyLong_AsUnsignedLongLongMask and _PyLong_AsUnsignedLongMask :)
msg268721 - (view) Author: Oren Milman (Oren Milman) * Date: 2016-06-17 12:35
I did some micro-benchmarking, and it looks like bad news for my patch.

I wrote a simple C extension in order to call PyLong_AsUnsignedLongMask and PyLong_AsUnsignedLongLongMask from Python code (attached).
Then I ran the following micro-benchmarks using both the default CPython branch and my patched CPython:
1. small ints:
python.exe -m timeit -s "import capiWrapper" "print('bug') if any(i != capiWrapper.asUnsignedLongMaskWrapper(i) for i in range(10 ** 6)) else None"
python.exe -m timeit -s "import capiWrapper" "print('bug') if any(i != capiWrapper.asUnsignedLongLongMaskWrapper(i) for i in range(10 ** 6)) else None"
with the following results:
    asUnsignedLongMaskWrapper:
        default: 312 msec, patched: 353 msec
    asUnsignedLongLongMaskWrapper:
        default: 319 msec, patched: 356 msec

2. big ints:
python.exe -m timeit -s "import capiWrapper; bigInt = 10 ** 1000" "print('bug') if any((i & 0xffffffff) != capiWrapper.asUnsignedLongMaskWrapper(i) for i in range(bigInt, bigInt + 10000)) else None"
python.exe -m timeit -s "import capiWrapper; bigInt = 10 ** 1000" "print('bug') if any((i & 0xffffffffffffffff) != capiWrapper.asUnsignedLongLongMaskWrapper(i) for i in range(bigInt, bigInt + 10000)) else None"
with the following results:
    when bigInt = 10 ** 1000:
        asUnsignedLongMaskWrapper:
            default: 23.1 msec, patched: 21.5 msec
        asUnsignedLongLongMaskWrapper:
            default: 24.1 msec, patched: 21.7 msec

    when bigInt = 10 ** 150:
        asUnsignedLongMaskWrapper:
            default: 7.72 msec, patched: 7.82 msec
        asUnsignedLongLongMaskWrapper:
            default: 8.03 msec, patched: 7.99 msec


To sum it up, my patch degrades performance for ints smaller than (approximately) 10 ** 150, and improves performance for bigger ints. 

It seems to me that it isn't worth it.
I attached the patches diff file anyway, for the sake of documentation.

==========================================================================

That leaves us with the proposed changes in Modules/_testcapimodule.c.
The diff file for these (hopefully my final diff file for this issue) is also attached.
msg288788 - (view) Author: Oren Milman (Oren Milman) * Date: 2017-03-02 09:13
I created a pull request (https://github.com/python/cpython/pull/392)
to fix the mistakes in _testcapimodule, but didn't mention this issue
in the pull request's title, as the issue mentioned these mistakes
only as a side note.
msg288791 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-02 10:02
> To sum it up, my patch degrades performance for ints smaller than (approximately) 10 ** 150, and improves performance for bigger ints. 

IMHO the Python language mostly handles integers smaller than one million, so the optimization is not a good idea :-) I hit the same issue when I tried to modify Python to use GMP internally for long integers.
msg288796 - (view) Author: Oren Milman (Oren Milman) * Date: 2017-03-02 11:42
The pull request was merged, so I guess we can close this issue.
msg288798 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-02 11:53
Ok, thanks.
History
Date User Action Args
2022-04-11 14:58:32 admin set github: 71485
2017-03-02 11:53:27 vstinner set status: open -> closed
resolution: fixed
messages: + msg288798

stage: resolved
2017-03-02 11:42:07 Oren Milman set messages: + msg288796
2017-03-02 10:02:55 vstinner set nosy: + vstinner
messages: + msg288791
2017-03-02 09:13:22 Oren Milman set messages: + msg288788
2017-03-02 09:06:25 Oren Milman set pull_requests: + pull_request323
2016-09-23 07:44:26 Oren Milman set versions: + Python 3.7, - Python 3.6
2016-06-17 12:37:39 Oren Milman set files: + proposedPatches.diff
2016-06-17 12:36:39 Oren Milman set files: + badMicroBenchProposedPatches.diff
2016-06-17 12:35:25 Oren Milman set files: + capiWrapperModule.c

messages: + msg268721
2016-06-14 20:44:50 Oren Milman set messages: + msg268586
2016-06-14 19:33:08 serhiy.storchaka set messages: + msg268581
versions: + Python 3.6
2016-06-11 21:21:49 Oren Milman set files: + CPythonTestOutput.txt
2016-06-11 21:21:11 serhiy.storchaka set nosy: + mark.dickinson, serhiy.storchaka
2016-06-11 21:20:55 Oren Milman set files: + patchedCPythonTestOutput.txt
2016-06-11 21:20:13 Oren Milman create