Issue 41902: Micro optimization for range.index if step is 1

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/86068

classification

Title:	Micro optimization for range.index if step is 1
Type:	performance	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.10

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	corona10	Nosy List:	ammar2, corona10, mark.dickinson, methane, pablogsal, serhiy.storchaka, shihai1991, vstinner
Priority:	normal	Keywords:	patch

Created on 2020-10-01 13:42 by corona10, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
bench_range_index.py	corona10, 2020-10-01 13:42
bench_range_step.py	corona10, 2020-10-01 13:43
bench_range_compute_item.py	corona10, 2020-10-02 04:27

Pull Requests
URL	Status	Linked	Edit
PR 22479	merged	corona10, 2020-10-01 13:49
PR 22480	closed	corona10, 2020-10-01 15:54
PR 22492	merged	corona10, 2020-10-02 04:33

Messages (18)
msg377752 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-01 13:42
if we declare range without setting step, we don't have to process divide operation. here is the benchmark result both setting step and without step, With my patch, there is no performance degrade with range.index when the step is not one and showing 19% enhancement when the step is the default value (1) . Mean +- std dev: [range_master] 1.11 us +- 0.01 us -> [range_opt] 896 ns +- 23 ns: 1.24x faster (-19%) Mean +- std dev: [range_step_master] 1.12 us +- 0.02 us -> [range_step_opt] 1.11 us +- 0.01 us: 1.01x faster (-1%)
msg377753 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-01 13:49
For optimization case, >>> a = range(0, 11111) >>> a.index(3) 3
msg377755 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-01 14:55
PR 22479 avoids calling PyNumber_FloorDivide(a, b) if b == 1 (if b == _PyLong_One): it makes range.index(a, 1) call 214 ns faster. I'm surprised that PyNumber_FloorDivide(a, 1) takes 214 ns. Does it spend most time in binary_op1()? Or PyNumber_FloorDivide()? long_div(a, 1) is quite simple: CHECK_BINOP(a, b); if (Py_ABS(Py_SIZE(a)) == 1 && Py_ABS(Py_SIZE(b)) == 1) { return fast_floor_div((PyLongObject)a, (PyLongObject)b); with: static PyObject * fast_floor_div(PyLongObject a, PyLongObject b) { sdigit left = a->ob_digit[0]; sdigit right = b->ob_digit[0]; sdigit div; if (Py_SIZE(a) == Py_SIZE(b)) { div = left / right; } else { div = -1 - (left - 1) / right; } return PyLong_FromLong(div); } Do we need another fast-path in long_div(a, b) when b == _PyLong_One? Just return a in this case.
msg377757 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-01 15:03
For practical use case >>> a = range(-2, 10) >>> a.index(2)
msg377759 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-01 15:10
About your benchmark, do you build Python with PGO+LTO optimizations? What is your OS and CPU? Do you use CPU isolation on Linux? It would be good to use PGO+LTO optimizations.
msg377762 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-01 15:20
> do you build Python with PGO+LTO optimizations Nope, but I will try to run the benchmark > What is your OS and CPU? Do you use CPU isolation on Linux? macOS, intel i9 with 8 cores
msg377765 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-01 15:39
For the record with optimization Mean +- std dev: [range_master] 112 ns +- 3 ns -> [range_opt] 99.3 ns +- 1.8 ns: 1.13x faster (-12%)
msg377770 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-01 16:21
PR 22479 and PR 22480 are different approaches. I (and Victor) want to check which approach might be better. PR 22480 would affect overall long division performance PR 22479 assumes that step=1 case is very often (e.g range(100), range(-100, 100))
msg377773 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2020-10-01 16:26
> Do we need another fast-path in long_div(a, b) when b == _PyLong_One? Just return a in this case. I'd much prefer not. Every extra fast path check costs time for the general case, and there's no strong reason to expect people to be dividing by one. The range code seems like the right place for this optimization, not the long-divide code.
msg377775 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-10-01 16:31
Mark: "I'd much prefer not. Every extra fast path check costs time for the general case, and there's no strong reason to expect people to be dividing by one. The range code seems like the right place for this optimization, not the long-divide code." In this case, I suggest to add a comment in long_div() to explain why there is no fast-path in long_div(). Otherwise, I am likely to suggest the exact same optimization in 6 months :-D I'm thinking at something similar to my comment in ceval.c: case TARGET(BINARY_ADD): { PyObject right = POP(); PyObject left = TOP(); PyObject sum; / NOTE(haypo): Please don't try to micro-optimize int+int on CPython using bytecode, it is simply worthless. See http://bugs.python.org/issue21955 and http://bugs.python.org/issue10044 for the discussion. In short, no patch shown any impact on a realistic benchmark, only a minor speedup on microbenchmarks. */ This comment is the outcome of 2 years of hard work by many developers :-D See bpo-21955.
msg377786 - (view)	Author: Ammar Askar (ammar2) *	Date: 2020-10-01 23:17
Out of curiosity, what is the motivation for this optimization? Is there reason to believe that range.index is performed often enough that it's worth the extra code and maintenance cost here?
msg377794 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-02 03:50
> s there reason to believe that range.index is performed often enough that it's worth the extra code and maintenance cost here? There are at least 3 reasons 1. pointer comparaition optimization quite common usecase in CPython codebase. e.x) list.count 2. if you create the range object, step = 1 which is singleton object is quite high percentage use case. - range(100) -> step = 1 - range(-100, 100) -> step =1 - range(-100, 100, 2) -> step != 1 3. fast path code does not cost high difficulty maintainence for this case but can bring 12% performance enhancement.
msg377795 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-02 04:27
I found another optimal case and this looks more practical usecase than range.index
msg377796 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-02 04:27
Mean +- std dev: [master-compute] 317 ns +- 3 ns -> [bpo-41902-compute] 287 ns +- 6 ns: 1.11x faster (-10%)
msg377799 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-02 06:55
@vstinner @pablo @mark On my local machine (without cpu isolation), PR 22480 does not affect performance issues. import pyperf runner = pyperf.Runner() runner.timeit(name="bench long divide", stmt=""" for i in range(1, 256): a = 10000 // i """) but I think that my benchmark does not cover the worst case. I need to set up the CPU isolation environment but my resource is limited. (I need a Linux machine with sufficient permission but not) PR 22479 and PR 22480 has two sides of assumption. PR 22479: PyNumber_FloorDivide is a heavy operation if the user thinks that this is an unnecessary operation it should be avoided. (In this case divide by one) PR 22480: PyNumber_FloorDivide should process handle unnecessary operations smartly. In conclusion, I'd like to +1 on mark's decision. - PR 22480: even though the divisor value is one, if the dividend is not qualified from PyLong_CheckExact it will not get an optimization path. So it will not cover all the 'divide by one' case and it can cause performance issues. - PR 22480: Always optimized if the user does not set a specific step value and this will be same effect on PR 22492.
msg377806 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2020-10-02 10:19
I would need to carefully look at the PRs to estimate the maintenance cost, but it seems to me initially that all these operations happen very rarely in regular code and probably will have no impact on macro benchmarks. In general, I dislike branches because they quickly can become the source of slight different semantics.
msg379178 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-21 01:29
New changeset 25492a5b59c5b74328278f195540e318ab87674f by Dong-hee Na in branch 'master': bpo-41902: Micro optimization for compute_item of range (GH-22492) https://github.com/python/cpython/commit/25492a5b59c5b74328278f195540e318ab87674f
msg379183 - (view)	Author: Dong-hee Na (corona10) *	Date: 2020-10-21 02:30
New changeset c0f22fb8b3006936757cebb959cee94e285bc503 by Dong-hee Na in branch 'master': bpo-41902: Micro optimization for range.index if step is 1 (GH-22479) https://github.com/python/cpython/commit/c0f22fb8b3006936757cebb959cee94e285bc503

History
Date	User	Action	Args
2022-04-11 14:59:36	admin	set	github: 86068
2020-10-21 02:39:25	corona10	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2020-10-21 02:30:09	corona10	set	messages: + msg379183
2020-10-21 01:29:21	methane	set	nosy: + methane messages: + msg379178
2020-10-02 15:21:18	shihai1991	set	nosy: + shihai1991
2020-10-02 10:19:34	pablogsal	set	messages: + msg377806
2020-10-02 06:55:18	corona10	set	messages: + msg377799
2020-10-02 04:33:45	corona10	set	pull_requests: + pull_request21509
2020-10-02 04:27:31	corona10	set	messages: + msg377796
2020-10-02 04:27:17	corona10	set	files: + bench_range_compute_item.py messages: + msg377795
2020-10-02 03:50:52	corona10	set	messages: + msg377794
2020-10-01 23:17:11	ammar2	set	nosy: + ammar2 messages: + msg377786
2020-10-01 16:31:55	vstinner	set	messages: + msg377775
2020-10-01 16:26:44	mark.dickinson	set	nosy: + mark.dickinson messages: + msg377773
2020-10-01 16:21:43	corona10	set	messages: + msg377770
2020-10-01 15:54:22	corona10	set	pull_requests: + pull_request21498
2020-10-01 15:39:18	corona10	set	messages: + msg377765
2020-10-01 15:20:24	corona10	set	messages: + msg377762
2020-10-01 15:10:33	vstinner	set	messages: + msg377759
2020-10-01 15:03:04	corona10	set	messages: + msg377757
2020-10-01 14:55:34	vstinner	set	messages: + msg377755
2020-10-01 13:49:49	corona10	set	messages: + msg377753
2020-10-01 13:49:07	corona10	set	keywords: + patch stage: patch review pull_requests: + pull_request21496
2020-10-01 13:43:05	corona10	set	files: + bench_range_step.py
2020-10-01 13:42:45	corona10	create