Issue45026
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2021-08-27 03:03 by serhiy.storchaka, last changed 2022-04-11 14:59 by admin.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
three-way-comparison.log | lukasz.langa, 2021-09-27 14:46 |
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 27986 | open | serhiy.storchaka, 2021-08-27 03:05 | |
PR 28176 | open | serhiy.storchaka, 2021-09-05 13:10 |
Messages (22) | |||
---|---|---|---|
msg400390 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-08-27 03:03 | |
The proposed PR provides more compact implementation of the range iterator. It consumes less memory and produces smaller pickles. It is presumably faster because it performs simpler arithmetic operations on iteration (no multiplications). |
|||
msg400392 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-08-27 03:16 | |
Currently the range iterator contains four integers: index, start, step and len. On iteration index is increased until achieve len, the result is calculated as start+index*step. In the proposed PR the range iterator contains only three integers: index was removed. Instead len counts the number of iterations that left, and start is increased by step on every iteration. Less memory, simpler calculation. Pickle no longer contains index, but __setstate__ is kept for compatibility. The only incompatible change is that calling __setstate__ repeatedly will have different effect. Currently __setstate__ with the same values does not have effect, with this PR it will advance iterator. But Python only calls __setstate__ once for just created object when unpickle or copy. |
|||
msg400396 - (view) | Author: Dennis Sweeney (Dennis Sweeney) * | Date: 2021-08-27 05:09 | |
Is it worth removing the len field as well and lazily using get_len_of_range() as needed? Then the hot function can look something like: static PyObject * rangeiter_next(rangeiterobject *r) { long result = r->start if (result < r->stop) { r->start += r->step; return PyLong_FromLong(result); } return NULL; } |
|||
msg400399 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-08-27 07:24 | |
step can be negative. So the condition should be more complex: ((r->stop > 0) ? (result < r->stop) : (result > r->stop)). And it would look much more complex for longrangeiterobject. |
|||
msg400528 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-08-29 12:52 | |
Microbenchmarks show some speed up: Iterating large range: $ ./python -m timeit -s 'r = range(0, 10**20, 3**35)' 'list(r)' Before: 2000 loops, best of 5: 199 usec per loop After: 2000 loops, best of 5: 113 usec per loop Unpickling: $ ./python -m timeit -s 'from pickle import dumps, loads; p = dumps([iter(range(i)) for i in range(1000)])' 'loads(p)' Before: 500 loops, best of 5: 476 usec per loop After: 500 loops, best of 5: 363 usec per loop I did not observe any difference in iterating small ranges and pickling. Smaller size in memory, smaller pickles. |
|||
msg401085 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-09-05 13:16 | |
PR 28176 implements idea proposed by Dennis in msg400396. It keeps the current and stop values instead of the initial start, index and initial length. I do not have data yet, but it is expected that it's iteration may be faster (for large integers), but __lenght_hint__ should be slower. |
|||
msg401411 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-08 17:23 | |
I like Dennis' idea and Serhiy's implementation in GH-28176. It's a bit of a larger change compared to GH-27986 but I think it's worth it: I expect iteration speed is more important than `len()` speed for range objects. |
|||
msg401419 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-09-08 18:32 | |
I have not benchmarked PR 28176 yet and do not know whether it have advantages over PR 27986 and how large. Slower __lenght_hint__ can make list(range(...)) slower for small ranges, but I do not know how small. |
|||
msg402139 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-09-18 21:32 | |
Iterating large integers: $ ./python -m pyperf timeit -s 'r = range(0, 10**20, 3**35)' 'for i in r: pass' baseline: Mean +- std dev: 223 us +- 10 us PR 27986: Mean +- std dev: 128 us +- 4 us PR 28176: Mean +- std dev: 99.0 us +- 3.7 us $ ./python -m pyperf timeit -s 'r = range(0, 10**20, 3**35)' 'list(r)' baseline: Mean +- std dev: 191 us +- 13 us PR 27986: Mean +- std dev: 107 us +- 7 us PR 28176: Mean +- std dev: 91.3 us +- 2.4 us Unpickling: $ ./python -m pyperf timeit -s 'from pickle import dumps, loads; p = dumps([iter(range(i)) for i in range(1000)])' 'loads(p)' baseline: Mean +- std dev: 535 us +- 29 us PR 27986: Mean +- std dev: 420 us +- 15 us PR 28176: Mean +- std dev: 418 us +- 17 us $ ./python -m pyperf timeit -s 'from pickle import dumps, loads; p = dumps([iter(range(i*10**10)) for i in range(1000)])' 'loads(p)' baseline: Mean +- std dev: 652 us +- 37 us PR 27986: Mean +- std dev: 530 us +- 43 us PR 28176: Mean +- std dev: 523 us +- 17 us Seems PR 28176 is slightly faster than PR 27986 in iterating long integers. |
|||
msg402335 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-21 17:39 | |
Looks like GH-28176 is faster across the board. One missing benchmark here is comparing `len()` speed as this was where the results will be reversed. It would be interesting to see to what extent. |
|||
msg402347 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-09-21 18:52 | |
length_hint(), not len(). Its cost is included in microbenchmark for list(), where it is followed by iterating 2000 items. Calling operator.length_hint() in Python: $ ./python -m pyperf timeit -s 'it = iter(range(1000)); from operator import length_hint' 'length_hint(it)' baseline: Mean +- std dev: 109 ns +- 6 ns PR 27986: Mean +- std dev: 109 ns +- 5 ns PR 28176: Mean +- std dev: 115 ns +- 5 ns $ ./python -m pyperf timeit -s 'it = iter(range(0, 10**20, 3**35)); from operator import length_hint' 'length_hint(it)' baseline: Mean +- std dev: 114 ns +- 6 ns PR 27986: Mean +- std dev: 95.6 ns +- 4.3 ns PR 28176: Mean +- std dev: 285 ns +- 13 ns Indirect call from C (it includes overhead for calling list() and iter() in Python): $ ./python -m pyperf timeit -s 'r = range(10**20, 10**20+1, 3**35)' 'list(iter(r))' baseline: Mean +- std dev: 331 ns +- 16 ns PR 27986: Mean +- std dev: 300 ns +- 16 ns PR 28176: Mean +- std dev: 391 ns +- 18 ns With few experiments I have found that PR 28176 is faster than PR 27986 for list(iter(range(...))) if a range is larger than 40-100 items. |
|||
msg402349 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2021-09-21 19:10 | |
May the best PR win. :-) |
|||
msg402452 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-22 16:13 | |
Since len timings for ranges of 100 items are negligible anyway, I personally still favor GH-28176 which is clearly faster during iteration. |
|||
msg402477 - (view) | Author: Dennis Sweeney (Dennis Sweeney) * | Date: 2021-09-23 03:19 | |
I benchmarked GH-27986 and GH-28176 on "for i in range(10000): pass" and found that GH-27986 was faster for this (likely most common) case of relatively small integers. Mean +- std dev: [main] 204 us +- 5 us -> [GH-27986] 194 us +- 4 us: 1.05x faster Mean +- std dev: [main] 204 us +- 5 us -> [GH-28176] 223 us +- 6 us: 1.09x slower It's possible to have different implementations for small/large integers, but IMO it's probably best to keep consistency and go with GH-27986. |
|||
msg402534 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-23 21:44 | |
Good point benchmarking small iterations too, Dennis. I missed that. Agreed then, GH-27986 looks like a winner. |
|||
msg402543 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-09-24 07:01 | |
I do not see any difference in iterating small integers: $ ./python -m pyperf timeit -s 'r = range(10000)' 'for i in r: pass' PR 27986: Mean +- std dev: 174 us +- 9 us PR 28176: Mean +- std dev: 172 us +- 10 us |
|||
msg402555 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-24 11:12 | |
Serhiy, right: looks like the difference stems from recreating the range objects, not from iteration. But Dennis' observation still stands: using `for i in range(...):` is a very popular idiom. If that becomes slower than the status quo, we will be making existing Python programs slower as well. |
|||
msg402639 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * | Date: 2021-09-25 18:00 | |
There is nothing in code that could explain a measureable difference in creating the range objects or the range object iterators. And indeed, it is in the range of the standard deviation, so it is non-existent. |
|||
msg402649 - (view) | Author: Dennis Sweeney (Dennis Sweeney) * | Date: 2021-09-25 21:32 | |
I did more benchmarks on my Windows laptop, and it seems the difference goes away after using PGO. The benchmarking program: ################################# from pyperf import Runner runner = Runner() for n in [10, 100, 1000, 10_000, 100_000]: runner.timeit(f"for i in range({n}): pass", stmt=f"for i in range({n}): pass") runner.timeit(f"for i in it_{n}: pass", setup=f"it = iter(range({n}))", stmt="for i in it: pass") runner.timeit(f"deque(it_{n})", setup=(f"from collections import deque; " f"it = iter(range({n}))"), stmt="deque(it, maxlen=0)") runner.timeit(f"list(iter(range({n})))", stmt=f"list(iter(range({n})))") ################################### The results (without PGO): PS C:\Users\sween\Source\Repos\cpython2\cpython> .\python.bat -m pyperf compare_to .\20e3149c175a24466c7d1c352f8ff2c11effc489.json .\cffa90a8b0057d7e7456571045f2fb7b9ceb426f.json -G Running Release|x64 interpreter... Slower (15): - list(iter(range(100))): 741 ns +- 13 ns -> 836 ns +- 11 ns: 1.13x slower - for i in range(100000): pass: 2.05 ms +- 0.05 ms -> 2.26 ms +- 0.06 ms: 1.10x slower - list(iter(range(1000))): 12.2 us +- 0.1 us -> 13.2 us +- 0.2 us: 1.08x slower - for i in range(10000): pass: 203 us +- 4 us -> 219 us +- 4 us: 1.08x slower - for i in range(100): pass: 1.18 us +- 0.02 us -> 1.27 us +- 0.03 us: 1.08x slower - for i in range(1000): pass: 18.1 us +- 0.3 us -> 19.5 us +- 0.3 us: 1.07x slower - list(iter(range(10000))): 145 us +- 7 us -> 152 us +- 2 us: 1.05x slower - list(iter(range(100000))): 1.98 ms +- 0.06 ms -> 2.06 ms +- 0.05 ms: 1.04x slower - for i in range(10): pass: 265 ns +- 9 ns -> 272 ns +- 8 ns: 1.03x slower - deque(it_1000): 324 ns +- 4 ns -> 332 ns +- 8 ns: 1.02x slower - deque(it_100000): 327 ns +- 5 ns -> 333 ns +- 7 ns: 1.02x slower - list(iter(range(10))): 357 ns +- 7 ns -> 363 ns +- 3 ns: 1.02x slower - deque(it_10): 325 ns +- 5 ns -> 330 ns +- 5 ns: 1.01x slower - deque(it_100): 325 ns +- 6 ns -> 329 ns +- 4 ns: 1.01x slower - deque(it_10000): 326 ns +- 7 ns -> 330 ns +- 4 ns: 1.01x slower Faster (2): - for i in it_10: pass: 26.0 ns +- 1.4 ns -> 25.3 ns +- 0.3 ns: 1.03x faster - for i in it_1000: pass: 25.7 ns +- 0.7 ns -> 25.3 ns +- 0.4 ns: 1.02x faster Benchmark hidden because not significant (3): for i in it_100: pass, for i in it_10000: pass, for i in it_100000: pass Geometric mean: 1.03x slower ################################### The results (with PGO): PS C:\Users\sween\Source\Repos\cpython2\cpython> .\python.bat -m pyperf compare_to .\PGO-20e3149c175a24466c7d1c352f8ff2c11effc489.json .\PGO-cffa90a8b0057d7e7456571045f2fb7b9ceb426f.json -G Running PGUpdate|x64 interpreter... Slower (7): - for i in it_100: pass: 20.3 ns +- 0.5 ns -> 21.3 ns +- 0.7 ns: 1.05x slower - for i in it_10000: pass: 20.4 ns +- 0.6 ns -> 21.4 ns +- 0.8 ns: 1.05x slower - for i in it_100000: pass: 20.5 ns +- 0.5 ns -> 21.4 ns +- 0.6 ns: 1.05x slower - for i in it_1000: pass: 20.3 ns +- 0.5 ns -> 21.2 ns +- 0.5 ns: 1.05x slower - for i in it_10: pass: 20.3 ns +- 0.5 ns -> 21.1 ns +- 0.5 ns: 1.04x slower - for i in range(10): pass: 214 ns +- 3 ns -> 219 ns +- 11 ns: 1.03x slower - deque(it_100000): 288 ns +- 5 ns -> 291 ns +- 12 ns: 1.01x slower Faster (7): - list(iter(range(10000))): 112 us +- 3 us -> 96.4 us +- 3.4 us: 1.16x faster - list(iter(range(1000))): 9.69 us +- 0.15 us -> 8.39 us +- 0.26 us: 1.15x faster - list(iter(range(100000))): 1.65 ms +- 0.04 ms -> 1.48 ms +- 0.04 ms: 1.11x faster - list(iter(range(100))): 663 ns +- 11 ns -> 623 ns +- 12 ns: 1.06x faster - for i in range(1000): pass: 14.6 us +- 0.5 us -> 14.2 us +- 0.3 us: 1.03x faster - for i in range(10000): pass: 162 us +- 2 us -> 159 us +- 3 us: 1.02x faster - for i in range(100000): pass: 1.64 ms +- 0.03 ms -> 1.62 ms +- 0.04 ms: 1.01x faster Benchmark hidden because not significant (6): deque(it_10), list(iter(range(10))), for i in range(100): pass, deque(it_100), deque(it_1000), deque(it_10000) Geometric mean: 1.01x faster |
|||
msg402708 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-27 13:25 | |
Dennis, run your benchmarks with --rigorous to avoid "Benchmark hidden because not significant". I note that the second and third benchmarks aren't useful as written because the iterators are exhausted after first repetition. I could see this in my results, note how the values don't rise with the iterator size: for i in it_10: pass: Mean +- std dev: 25.0 ns +- 0.3 ns for i in it_100: pass: Mean +- std dev: 25.1 ns +- 0.5 ns for i in it_1000: pass: Mean +- std dev: 25.0 ns +- 0.3 ns for i in it_10000: pass: Mean +- std dev: 25.0 ns +- 0.3 ns for i in it_100000: pass: Mean +- std dev: 25.6 ns +- 0.5 ns deque(it_10): Mean +- std dev: 334 ns +- 8 ns deque(it_100): Mean +- std dev: 338 ns +- 9 ns deque(it_1000): Mean +- std dev: 335 ns +- 9 ns deque(it_10000): Mean +- std dev: 336 ns +- 10 ns deque(it_100000): Mean +- std dev: 338 ns +- 11 ns When I modified those to recreate the iterator on every run, the story was much different. |
|||
msg402712 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-27 13:55 | |
Benchmarks for PGO builds on macOS 10.15 Catalina, Intel MBP 2018. Like in Dennis' case, 20e3149c175a24466c7d1c352f8ff2c11effc489 is GH-27986 and cffa90a8b0057d7e7456571045f2fb7b9ceb426f is GH-28176. The difference is that `it_` benchmarks create the iterator on each execution. In this case the explicit iterator versions of the for-loop are indistinguishable from the ones using `range()` directly. ################ ❯ python -m pyperf compare_to /tmp/20e3149c175a24466c7d1c352f8ff2c11effc489-2.json /tmp/cffa90a8b0057d7e7456571045f2fb7b9ceb426f-2.json -G Slower (11): - deque(it_100): 886 ns +- 22 ns -> 944 ns +- 12 ns: 1.07x slower - list(iter(range(100))): 856 ns +- 17 ns -> 882 ns +- 17 ns: 1.03x slower - for i in range(100000): pass: 2.20 ms +- 0.02 ms -> 2.26 ms +- 0.03 ms: 1.02x slower - for i in range(10000): pass: 219 us +- 1 us -> 223 us +- 5 us: 1.02x slower - for i in it_10000: pass: 219 us +- 1 us -> 223 us +- 5 us: 1.02x slower - for i in it_100000: pass: 2.20 ms +- 0.03 ms -> 2.24 ms +- 0.04 ms: 1.02x slower - for i in it_1000: pass: 20.1 us +- 0.1 us -> 20.4 us +- 0.4 us: 1.02x slower - for i in range(1000): pass: 20.2 us +- 0.4 us -> 20.5 us +- 0.3 us: 1.02x slower - for i in range(100): pass: 1.50 us +- 0.03 us -> 1.52 us +- 0.03 us: 1.01x slower - list(iter(range(10))): 317 ns +- 9 ns -> 320 ns +- 6 ns: 1.01x slower - for i in it_100: pass: 1.53 us +- 0.01 us -> 1.54 us +- 0.02 us: 1.01x slower Faster (8): - list(iter(range(100000))): 2.25 ms +- 0.05 ms -> 2.12 ms +- 0.03 ms: 1.06x faster - deque(it_10000): 145 us +- 2 us -> 142 us +- 1 us: 1.03x faster - list(iter(range(1000))): 12.6 us +- 0.2 us -> 12.3 us +- 0.1 us: 1.02x faster - deque(it_100000): 1.47 ms +- 0.01 ms -> 1.45 ms +- 0.02 ms: 1.02x faster - for i in it_10: pass: 309 ns +- 6 ns -> 304 ns +- 3 ns: 1.02x faster - list(iter(range(10000))): 147 us +- 2 us -> 145 us +- 2 us: 1.01x faster - deque(it_10): 544 ns +- 19 ns -> 537 ns +- 10 ns: 1.01x faster - deque(it_1000): 12.6 us +- 0.2 us -> 12.5 us +- 0.2 us: 1.01x faster Benchmark hidden because not significant (1): for i in range(10): pass Geometric mean: 1.00x slower ################ The results look like a wash here. Let me compare both to `main`. |
|||
msg402724 - (view) | Author: Łukasz Langa (lukasz.langa) * | Date: 2021-09-27 14:46 | |
Well, this is kind of disappointing on my end. I attach the full result of a three-way comparison between `main` at the time of Serhiy's last merge (3f8b23f8ddab75d9b77a3997d54e663187e12cc8) with GH-27986 (20e3149c175a24466c7d1c352f8ff2c11effc489) and GH-28176 (cffa90a8b0057d7e7456571045f2fb7b9ceb426f). The gist is this: Geometric mean ============== 20e3149c175a24466c7d1c352f8ff2c11effc489-2: 1.01x slower cffa90a8b0057d7e7456571045f2fb7b9ceb426f-2: 1.01x slower At least on my Macbook Pro, all PGO builds, looks like the status quo is on average faster than any of the candidate PRs. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:49 | admin | set | github: 89189 |
2021-09-27 14:46:46 | lukasz.langa | set | files:
+ three-way-comparison.log messages: + msg402724 |
2021-09-27 13:55:09 | lukasz.langa | set | messages: + msg402712 |
2021-09-27 13:25:07 | lukasz.langa | set | messages: + msg402708 |
2021-09-25 21:32:28 | Dennis Sweeney | set | messages: + msg402649 |
2021-09-25 18:00:25 | serhiy.storchaka | set | messages: + msg402639 |
2021-09-24 11:12:48 | lukasz.langa | set | messages: + msg402555 |
2021-09-24 07:01:34 | serhiy.storchaka | set | messages: + msg402543 |
2021-09-23 21:44:02 | lukasz.langa | set | messages: + msg402534 |
2021-09-23 03:19:36 | Dennis Sweeney | set | messages: + msg402477 |
2021-09-22 16:13:31 | lukasz.langa | set | messages: + msg402452 |
2021-09-21 19:10:46 | gvanrossum | set | messages: + msg402349 |
2021-09-21 18:52:37 | serhiy.storchaka | set | messages: + msg402347 |
2021-09-21 17:39:33 | lukasz.langa | set | messages: + msg402335 |
2021-09-19 15:35:19 | gvanrossum | set | nosy:
+ gvanrossum |
2021-09-18 21:32:08 | serhiy.storchaka | set | messages: + msg402139 |
2021-09-08 18:32:17 | serhiy.storchaka | set | messages: + msg401419 |
2021-09-08 17:23:18 | lukasz.langa | set | messages: + msg401411 |
2021-09-05 13:16:31 | serhiy.storchaka | set | messages: + msg401085 |
2021-09-05 13:10:21 | serhiy.storchaka | set | pull_requests: + pull_request26603 |
2021-08-29 12:52:49 | serhiy.storchaka | set | messages: + msg400528 |
2021-08-27 07:24:03 | serhiy.storchaka | set | messages: + msg400399 |
2021-08-27 05:09:59 | Dennis Sweeney | set | nosy:
+ Dennis Sweeney messages: + msg400396 |
2021-08-27 03:16:59 | serhiy.storchaka | set | messages: + msg400392 |
2021-08-27 03:05:58 | serhiy.storchaka | set | keywords:
+ patch stage: patch review pull_requests: + pull_request26433 |
2021-08-27 03:03:33 | serhiy.storchaka | create |