This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: optimize int division
Type: performance Stage: resolved
Components: Versions: Python 3.11
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: gregory.p.smith, mark.dickinson, rhettinger, tim.peters
Priority: normal Keywords: patch

Created on 2022-01-16 23:38 by gregory.p.smith, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 30626 closed gregory.p.smith, 2022-01-16 23:45
PR 30856 merged tim.peters, 2022-01-24 18:55
Messages (5)
msg410735 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-16 23:38
Based on a python-dev thread, we've come up with faster int division code for CPython's bignums.

https://mail.python.org/archives/list/python-dev@python.org/thread/ZICIMX5VFCX4IOFH5NUPVHCUJCQ4Q7QM/#NEUNFZU3TQU4CPTYZNF3WCN7DOJBBTK5

filing this issue for starters to attach a PR to.  details forthcoming.
msg410736 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-16 23:55
The PR was directly inspired by Mark Dickinson's code in the email thread directly using __asm__ to get the instruction he wanted.  There is usually a way to make the compiler actually do what you intend.  This appears to be it.

Interestingly, experimenting with small code snippets rather than the entire longobject.c on gotbolt.org to check various compilers output does not always yield as nice of a result.  (clang 11+ showed promise there, but this change benefits gcc equally as well in real world CPython microbenchmark timeit tests).  https://godbolt.org/z/63eWPczjx was my playground code.

```
$ ./b-clang13/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
1500000 loops, best of 5: 450 nsec per loop
$ ./b-clang13-new-basic-divrem1/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
1500000 loops, best of 5: 375 nsec per loop
$ ./b-gcc9/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
1500000 loops, best of 5: 448 nsec per loop
$ ./b-gcc9-new-basic-divrem1/python -m timeit -n 1500000 -s 'x = 10**1000; r=x//10; assert r == 10**999, r' 'x//17'
1500000 loops, best of 5: 370 nsec per loop
```

That's on an AMD zen3 (x86_64).  Also tested with other divisors, 17 is not specialized by the compiler.  These were not --enable-optimizations builds, though the results remain similar on those for non-specialized values as x//10 turns into when using -fprofile-values on gcc9.

Performance tests using other architectures forthcoming.

A pyperformance suite run on a benchmark-stable host is worthwhile. I don't actually expect this to show up as significant in most things there; we'll see.

The new code is not any more difficult to maintain than the previous code regardless.
msg410739 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-01-17 03:04
I tested my PR branch on 32-bit arm (raspbian bullseye) and the microbenchmark timing shows no change (within the noise across repeated runs).  Unsurprising as division is entirely different on 32-bit arm.

Raspbian uses armv6 for compatibility with the original rpi and rpi0.  armv6 does not have an integer division instruction. (how RISCy of it)  But that doesn't make a difference in this code as the final 32-bit arm ISA, armv7-a, only has a 32:32 divider.  (armv8 aka aarch64 is 64-bit and uses a UDIV as one would expect)

anyways, that satisfies me that it isn't making anything worse elsewhere.
msg411358 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2022-01-23 10:00
New changeset c7f20f1cc8c20654e5d539552604362feb9b0512 by Gregory P. Smith in branch 'main':
bpo-46406: Faster single digit int division. (#30626)
https://github.com/python/cpython/commit/c7f20f1cc8c20654e5d539552604362feb9b0512
msg411541 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-01-25 01:06
New changeset 7c26472d09548905d8c158b26b6a2b12de6cdc32 by Tim Peters in branch 'main':
bpo-46504: faster code for trial quotient in x_divrem() (GH-30856)
https://github.com/python/cpython/commit/7c26472d09548905d8c158b26b6a2b12de6cdc32
History
Date User Action Args
2022-04-11 14:59:54adminsetgithub: 90564
2022-01-25 01:06:10tim.peterssetmessages: + msg411541
2022-01-24 18:55:00tim.peterssetpull_requests: + pull_request29038
2022-01-23 10:01:22mark.dickinsonsetstatus: open -> closed
stage: patch review -> resolved
2022-01-23 10:00:45mark.dickinsonsetmessages: + msg411358
2022-01-17 04:27:35rhettingersetnosy: + tim.peters, rhettinger
2022-01-17 03:04:12gregory.p.smithsetmessages: + msg410739
2022-01-16 23:55:18gregory.p.smithsetnosy: + mark.dickinson
2022-01-16 23:55:01gregory.p.smithsetmessages: + msg410736
2022-01-16 23:45:41gregory.p.smithsetkeywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request28829
2022-01-16 23:38:29gregory.p.smithcreate