Issue 36551: Optimize list comprehensions with preallocate size and protect against overflow

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/80732

classification

Title:	Optimize list comprehensions with preallocate size and protect against overflow
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.8

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	anthonypjshaw	Nosy List:	Aaron Hall, anthony shaw, anthonypjshaw, methane, ncoghlan, pablogsal, ronaldoussoren, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2019-04-08 01:28 by anthony shaw, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 12718	closed	anthonypjshaw, 2019-04-08 01:28

Messages (26)
msg339586 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 01:28
List comprehensions currently create a series of opcodes inside a code object, the first of which is BUILD_LIST with an oparg of 0, effectively creating a zero-length list with a preallocated size of 0. If you're doing a simple list comprehension on an iterator, e.g. def foo(): a = iterable return [x for x in a] Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>: 3 0 BUILD_LIST 0 2 LOAD_FAST 0 (.0) >> 4 FOR_ITER 8 (to 14) 6 STORE_FAST 1 (x) 8 LOAD_FAST 1 (x) 10 LIST_APPEND 2 12 JUMP_ABSOLUTE 4 >> 14 RETURN_VALUE The list comprehension will do a list_resize on the 4, 8, 16, 25, 35, 46, 58, 72, 88th iterations, etc. This PR preallocates the list created in a list comprehension to the length of the iterator using PyObject_LengthHint(). It uses a new BUILD_LIST_PREALLOC opcode which builds a list with the allocated size of PyObject_LengthHint(co_varnames[oparg]). [x for x in iterable] compiles to: Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>: 3 0 BUILD_LIST_PREALLOC 0 2 LOAD_FAST 0 (.0) >> 4 FOR_ITER 8 (to 14) 6 STORE_FAST 1 (x) 8 LOAD_FAST 1 (x) 10 LIST_APPEND 2 12 JUMP_ABSOLUTE 4 >> 14 RETURN_VALUE If the comprehension has ifs, then it will use the existing BUILD_LIST opcode Testing using a range length of 10000 ./python.exe -m timeit "x=list(range(10000)); [y for y in x]" Gives 392us on the current 3.8 branch and 372us with this change (about 8-10% faster) the longer the iterable, the bigger the impact. This change also catches the issue that a very large iterator, like a range object : [a for a in range(2256)] Would cause the 3.8< interpreter to consume all memory and crash because there is no check against PY_SSIZE_MAX currently. With this change (assuming there is no if inside the comprehension) is now caught and thrown as an OverflowError: >>> [a for a in range(2256)] Traceback (most recent call last): File "<stdin>", line 1, in <module> File "<stdin>", line 1, in <listcomp> OverflowError: Python int too large to convert to C ssize_t
msg339594 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-04-08 07:30
The benefit is too small to add a new opcode.
msg339595 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 07:34
The opcode would not solely apply to this specific use case. I could seek another way of implementing the same behaviour without an additional opcode?
msg339612 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2019-04-08 11:02
This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.: [x**2 for x in range(LARGE_VALUE) if is_prime(x)] See also issue28940
msg339613 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-08 11:20
I agree with Serhiy. Benefit seems too small to add new opcode. > I could seek another way of implementing the same behaviour without an additional opcode? How about converting `[x for x in it]` to `[*it]` in AST?
msg339614 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-08 11:21
$ python3 -m timeit -s 'r=range(1000)' -- '[x for x in r]' 5000 loops, best of 5: 40 usec per loop $ python3 -m timeit -s 'r=range(1000)' -- '[*r]' 20000 loops, best of 5: 17.3 usec per loop
msg339617 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2019-04-08 11:34
This patch makes it slow for small iterators: Perf program: import perf runner = perf.Runner() runner.timeit("list_comp", stmt="[x for x in range(10)]", setup="") Current master: ❯ ./python.exe ../check.py ..................... list_comp: Mean +- std dev: 3.97 us +- 0.15 us PR 12718: ❯ ./python.exe ../check.py ..................... list_comp: Mean +- std dev: 4.57 us +- 0.17 us The overhead is very likely due to calling __length_hint__
msg339618 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 11:35
> How about converting `[x for x in it]` to `[it]` in AST? I should have been more explicit, this patch improves the performance of all list comprehensions that don’t have an if clause. Not just [x for x in y] but: d = {} # some sort of dictionary [f”{k} — {v}” for k, v in d.items()] a = iterable [val*2 for val in a] Would all use BUILD_LIST_PREALLOC and use a LengthHint. I can do another speed test for those other scenarios. Most of the stdlib packages have these sorts of list comps, including those in the default site.py.
msg339619 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-08 11:36
> I should have been more explicit, this patch improves the performance of all list comprehensions that don’t have an if clause. But in these cases, overhead of reallocation will be smaller than simple case.
msg339620 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 11:36
> This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.: The current implementation of list comprehensions raise neither a memoryerror or overflow error. They will consume all available memory and crash the interpreter. This patch raises an OverflowError before execution instead of just looping until memory heap exhaustion
msg339621 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2019-04-08 11:39
More benchmarks for slow iterators: import perf runner = perf.Runner() runner.timeit("list_comp", stmt="[x**2 for x in k]", setup="k=iter(list(range(10)))") Current master: ❯ ./python.exe ../check.py ..................... list_comp: Mean +- std dev: 924 ns +- 35 ns PR 12718: ❯ ./python.exe ../check.py ..................... list_comp: Mean +- std dev: 1.17 us +- 0.06 us
msg339622 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-08 11:45
> > This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.: > > The current implementation of list comprehensions raise neither a memoryerror or overflow error. They will consume all available memory and crash the interpreter. > > This patch raises an OverflowError before execution instead of just looping until memory heap exhaustion > Note PEP 424. """ __length_hint__ must return an integer (else a TypeError is raised) or NotImplemented, and is not required to be accurate. It may return a value that is either larger or smaller than the actual size of the container. """ it.__length_hint__ can return 2**1000 even if len(list(it))==0. In such case, current behavior works. And your patch will raise OverflowError.
msg339623 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 11:47
> This patch makes it slow for small iterators That is a one-off cost for the __length_hint__ of the range object specifically. Objects with a known length (lists, sets, tuples) would not have that overhead. I can run a more useful set of benchmarks against this. So the +0.6us would be the same for ranges 8-16. Then less for 16-25, then again for 25-35 as the removal of the reallocation process has a more significant factor for larger ranges.
msg339625 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 11:50
> In such case, current behavior works. And your patch will raise OverflowError. Try [x for x in range(21000)] in a REPL. It doesn’t raise anything, it tries to create a list that will eventually exceed PY_SIZE_MAX, but it only crashes once it reaches that iteration. This raises an OverflowError instead, the same way: len(range(21000)) raises an OverflowError
msg339626 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-08 11:54
> Try > > [x for x in range(21000)] > > in a REPL. It doesn’t raise anything, it tries to create a list that will eventually exceed PY_SIZE_MAX, but it only crashes once it reaches that iteration. It is expected behavior. > This raises an OverflowError instead, the same way: > len(range(21000)) > raises an OverflowError If your patch uses __length_hint__, it is bug. iterator will return 2**1000 for __length_hint__, but produce no item on iteration.
msg339627 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 12:08
> If your patch uses __length_hint__, it is bug. iterator will return 2**1000 for __length_hint__, but produce no item on iteration. It raises an OverflowError because of the goto https://github.com/python/cpython/pull/12718/files#diff-7f17c8d8448b7b6f90549035d2147a9fR2493 this could just as easily set size to 0. I put `goto error` given the opportunity to handle an expected fault gracefully.
msg339628 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-08 12:16
> If your patch uses __length_hint__, it is bug. I’m not sure I understand this comment, PEP424 says “This is useful for presizing containers when building from an iterable.“ This patch uses __length_hint__ to presize the list container for a list comprehension.
msg339630 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-08 12:33
"useful" doesn't mean "use it as-is". It is just a hint. It will be wrong. See here for list example: https://github.com/python/cpython/blob/7a0630c530121725136526a88c49589b54da6492/Objects/listobject.c#L929-L940
msg339631 - (view)	Author: Inada Naoki (methane) *	Date: 2019-04-08 12:42
I'm sorry. list_extend raises OverflowError too.
msg339636 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2019-04-08 13:09
> That is a one-off cost for the __length_hint__ of the range object specifically. Objects with a known length (lists, sets, tuples) would not have that overhead. That seems incorrect. This is not unique of range objects as it affects also objects with known lengths (like a list): import perf runner = perf.Runner() runner.timeit("list_comp", stmt="[x2 for x in k]", setup="k=list(range(10))") Current master: ❯ ./python.exe ../check.py -n 10 ..................... list_comp: Mean +- std dev: 3.82 us +- 0.13 us PR 12718: ❯ ./python.exe ../check.py -n 10 ..................... list_comp: Mean +- std dev: 4.38 us +- 0.16 us Check also my other benchmark with a list iterator ( iter(list(range(10))) ) or this one with a generator comp: import perf runner = perf.Runner() runner.timeit("list_comp", stmt="[x2 for x in it]", setup="k=list(range(10));it=(x for x in k)") Current master: ❯ ./python.exe ../check.py -n 10 ..................... list_comp: Mean +- std dev: 945 ns +- 27 ns PR 12718: ❯ ./python.exe ../check.py -n 10 ..................... list_comp: Mean +- std dev: 1.33 us +- 0.05 us
msg339640 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2019-04-08 13:50
I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :) While length_hint is allowed to be somewhat inaccurate, we do expect it to be at least vaguely accurate (otherwise it isn't very useful, and if it can be inaccurate enough to trigger OverflowError or MemoryError in cases that would otherwise work reasonably well, it would be better for a type not to implement it at all). While it would be nice to be able to avoid adding a new opcode, the problem is that the existing candidate opcodes (BUILD_LIST, BUILD_LIST_UNPACK) are both inflexible in what they do: - BUILD_LIST requires that the final list length be known at compile time - BUILD_LIST_UNPACK infers the final length from an object reference, but calls _PyList_Extend directly, so it combines the preallocation and the iterator consumption into a single step, and hence can't be used outside of iterable unpacking into a list At the same time, attempting to generalise either of them isn't desirable, since it would slow them down for their existing use cases, and be slower than a new opcode for this use case. The proposed BUILD_LIST_PREALLOC opcode splits the difference: it lets the compiler provide the interpreter with a hint as to how big the resulting list is expected to be. That said, you'd want to run the result through the benchmark suite rather than relying solely on microbenchmarks, as even though unfiltered "[some_operation_on_x for x in y]" comprehensions without nested loops or filter clauses are pretty common (more common than the relatively new "[*itr]" syntax), it's far less clear what the typical distribution in input lengths actually is, and how many memory allocations need to be avoided in order to offset the cost of the initial _PyObject_LengthHint call (as Pablo's small scale results show). (Note that in the _PyList_Extend code, there are preceding special cases for builtin lists and tuples that take those down a much faster path that avoids the _PyObject_LengthHint call entirely)
msg339684 - (view)	Author: Pablo Galindo Salgado (pablogsal) *	Date: 2019-04-08 23:31
Here are the updated results for the benchmark suite. The previous results (unlinked from the issue to reduce noise) were against an old version of the master branch. 2019-04-08_13-08-master-58721a903074.json.gz ============================================ Performance version: 0.7.0 Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28 Number of logical CPUs: 12 Start date: 2019-04-09 00:32:40.656576 End date: 2019-04-09 00:54:03.721798 12718-5f06333a4e49.json.gz ========================== Performance version: 0.7.0 Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28 Number of logical CPUs: 12 Start date: 2019-04-09 00:07:53.433511 End date: 2019-04-09 00:29:32.603022 -- CURRENT MASTER -- -- PR12718 -- +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| Benchmark \| 2019-04-08_13-08-master-58721a903074.json.gz \| 12718-5f06333a4e49.json.gz \| Change \| Significance \| +=========================+==============================================+============================+==============+========================+ \| 2to3 \| 455 ms \| 465 ms \| 1.02x slower \| Significant (t=-2.27) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| chaos \| 171 ms \| 180 ms \| 1.05x slower \| Significant (t=-5.92) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| crypto_pyaes \| 169 ms \| 177 ms \| 1.05x slower \| Significant (t=-6.06) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| deltablue \| 11.5 ms \| 11.9 ms \| 1.03x slower \| Significant (t=-3.28) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| django_template \| 175 ms \| 191 ms \| 1.09x slower \| Significant (t=-13.00) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| dulwich_log \| 127 ms \| 131 ms \| 1.03x slower \| Significant (t=-8.97) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| fannkuch \| 646 ms \| 665 ms \| 1.03x slower \| Significant (t=-12.37) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| float \| 165 ms \| 165 ms \| 1.00x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| go \| 386 ms \| 381 ms \| 1.01x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| hexiom \| 15.4 ms \| 15.5 ms \| 1.00x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| html5lib \| 130 ms \| 132 ms \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| json_dumps \| 18.4 ms \| 18.3 ms \| 1.00x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| json_loads \| 38.6 us \| 38.6 us \| 1.00x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| logging_format \| 15.0 us \| 15.6 us \| 1.04x slower \| Significant (t=-10.98) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| logging_silent \| 284 ns \| 287 ns \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| logging_simple \| 13.5 us \| 13.9 us \| 1.03x slower \| Significant (t=-8.03) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| mako \| 24.6 ms \| 24.5 ms \| 1.01x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| meteor_contest \| 157 ms \| 155 ms \| 1.01x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| nbody \| 196 ms \| 198 ms \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| nqueens \| 144 ms \| 147 ms \| 1.02x slower \| Significant (t=-9.68) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| pathlib \| 27.6 ms \| 28.4 ms \| 1.03x slower \| Significant (t=-5.24) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| pickle \| 14.1 us \| 14.2 us \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| pickle_dict \| 33.7 us \| 35.2 us \| 1.04x slower \| Significant (t=-13.28) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| pickle_list \| 5.22 us \| 5.38 us \| 1.03x slower \| Significant (t=-6.65) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| pickle_pure_python \| 681 us \| 719 us \| 1.05x slower \| Significant (t=-5.64) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| pidigits \| 217 ms \| 219 ms \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| python_startup \| 12.2 ms \| 12.5 ms \| 1.02x slower \| Significant (t=-9.34) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| python_startup_no_site \| 8.90 ms \| 8.92 ms \| 1.00x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| raytrace \| 789 ms \| 793 ms \| 1.00x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| regex_compile \| 267 ms \| 270 ms \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| regex_dna \| 221 ms \| 221 ms \| 1.00x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| regex_effbot \| 3.94 ms \| 3.98 ms \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| regex_v8 \| 30.4 ms \| 30.4 ms \| 1.00x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| richards \| 108 ms \| 108 ms \| 1.00x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| scimark_fft \| 525 ms \| 525 ms \| 1.00x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| scimark_lu \| 270 ms \| 271 ms \| 1.00x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| scimark_monte_carlo \| 160 ms \| 160 ms \| 1.00x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| scimark_sor \| 295 ms \| 305 ms \| 1.03x slower \| Significant (t=-9.82) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| scimark_sparse_mat_mult \| 6.63 ms \| 6.57 ms \| 1.01x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| spectral_norm \| 217 ms \| 219 ms \| 1.01x slower \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| sqlalchemy_declarative \| 239 ms \| 244 ms \| 1.02x slower \| Significant (t=-4.34) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| sqlalchemy_imperative \| 50.7 ms \| 52.0 ms \| 1.03x slower \| Significant (t=-3.84) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| sqlite_synth \| 4.20 us \| 4.14 us \| 1.01x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| sympy_expand \| 612 ms \| 637 ms \| 1.04x slower \| Significant (t=-8.36) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| sympy_integrate \| 27.5 ms \| 28.9 ms \| 1.05x slower \| Significant (t=-4.16) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| sympy_str \| 284 ms \| 299 ms \| 1.05x slower \| Significant (t=-6.97) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| sympy_sum \| 151 ms \| 156 ms \| 1.03x slower \| Significant (t=-4.37) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| telco \| 9.70 ms \| 9.64 ms \| 1.01x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| tornado_http \| 276 ms \| 287 ms \| 1.04x slower \| Significant (t=-5.75) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| unpack_sequence \| 64.4 ns \| 67.3 ns \| 1.05x slower \| Significant (t=-6.16) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| unpickle \| 22.8 us \| 22.7 us \| 1.01x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| unpickle_list \| 5.72 us \| 5.97 us \| 1.04x slower \| Significant (t=-6.05) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| unpickle_pure_python \| 510 us \| 527 us \| 1.03x slower \| Significant (t=-3.89) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| xml_etree_generate \| 150 ms \| 154 ms \| 1.02x slower \| Significant (t=-3.72) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| xml_etree_iterparse \| 145 ms \| 148 ms \| 1.02x slower \| Significant (t=-3.60) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| xml_etree_parse \| 208 ms \| 207 ms \| 1.00x faster \| Not significant \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+ \| xml_etree_process \| 121 ms \| 125 ms \| 1.04x slower \| Significant (t=-5.43) \| +-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
msg339699 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-09 05:35
Have just optimized some of the code and pushed another change as 69dce1c552. ran both master and 69dce1c552 using pyperformance with PGO: ➜ ~ python3.8 -m perf compare_to master.json 69dce1c552.json --table +-------------------------+---------+-----------------------------+ \| Benchmark \| master \| 69dce1c552 \| +=========================+=========+=============================+ \| 2to3 \| 432 ms \| 426 ms: 1.02x faster (-2%) \| +-------------------------+---------+-----------------------------+ \| chaos \| 157 ms \| 155 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| crypto_pyaes \| 154 ms \| 153 ms: 1.00x faster (-0%) \| +-------------------------+---------+-----------------------------+ \| dulwich_log \| 123 ms \| 124 ms: 1.00x slower (+0%) \| +-------------------------+---------+-----------------------------+ \| fannkuch \| 603 ms \| 600 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| float \| 153 ms \| 154 ms: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| go \| 323 ms \| 326 ms: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| hexiom \| 13.6 ms \| 13.5 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| json_dumps \| 18.1 ms \| 17.9 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| logging_format \| 13.2 us \| 13.8 us: 1.05x slower (+5%) \| +-------------------------+---------+-----------------------------+ \| logging_silent \| 266 ns \| 280 ns: 1.05x slower (+5%) \| +-------------------------+---------+-----------------------------+ \| logging_simple \| 12.4 us \| 13.1 us: 1.06x slower (+6%) \| +-------------------------+---------+-----------------------------+ \| meteor_contest \| 145 ms \| 132 ms: 1.10x faster (-9%) \| +-------------------------+---------+-----------------------------+ \| nbody \| 179 ms \| 172 ms: 1.04x faster (-4%) \| +-------------------------+---------+-----------------------------+ \| nqueens \| 138 ms \| 134 ms: 1.03x faster (-3%) \| +-------------------------+---------+-----------------------------+ \| pathlib \| 56.4 ms \| 55.6 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| pickle \| 15.0 us \| 15.4 us: 1.03x slower (+3%) \| +-------------------------+---------+-----------------------------+ \| pickle_pure_python \| 620 us \| 617 us: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| raytrace \| 696 ms \| 691 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| regex_compile \| 242 ms \| 243 ms: 1.00x slower (+0%) \| +-------------------------+---------+-----------------------------+ \| scimark_monte_carlo \| 140 ms \| 143 ms: 1.02x slower (+2%) \| +-------------------------+---------+-----------------------------+ \| scimark_sparse_mat_mult \| 5.90 ms \| 5.94 ms: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| spectral_norm \| 194 ms \| 196 ms: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| sympy_str \| 246 ms \| 245 ms: 1.00x faster (-0%) \| +-------------------------+---------+-----------------------------+ \| telco \| 8.42 ms \| 8.31 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| unpack_sequence \| 59.2 ns \| 59.7 ns: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| unpickle \| 21.2 us \| 21.4 us: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| unpickle_list \| 5.73 us \| 5.81 us: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| unpickle_pure_python \| 471 us \| 467 us: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ \| xml_etree_iterparse \| 142 ms \| 143 ms: 1.01x slower (+1%) \| +-------------------------+---------+-----------------------------+ \| xml_etree_generate \| 139 ms \| 137 ms: 1.02x faster (-2%) \| +-------------------------+---------+-----------------------------+ \| xml_etree_process \| 109 ms \| 108 ms: 1.01x faster (-1%) \| +-------------------------+---------+-----------------------------+ Not significant (21): deltablue; django_template; html5lib; json_loads; mako; pickle_dict; pickle_list; pidigits; python_startup; python_startup_no_site; regex_dna; regex_effbot; regex_v8; richards; scimark_fft; scimark_lu; scimark_sor; sympy_expand; sympy_integrate; sympy_sum; xml_etree_parse I'd like to look at the way range object LengthHint works, it looks like the path for those is not ideal and could use some optimization. Also, BUILD_LIST_PREALLOC uses the Iterator, not the actual object, so you can't use the much faster _HasLen and PyObject_Length(). I'm going to look at how __length_hint__ could be optimized for iterators that would make the smaller range cases more efficient. meteor_contest uses a lot of list comprehensions, so should show the impact for the patch.
msg339702 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-04-09 05:50
> I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :) And that optimization looks questionable to me. I tried to reduce an overhead for small lists, but this requires much more complex code and gives mixed results. I am -1 for this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with conditions) and even in this case it is small. It is possible to add a lot of other optimizations for other cases which will sped up them to 50% or 100%, but we do not do this, because every such optimization has a cost. It increases the amount of code which should be maintained and covered by tests, it adds small overhead in common cases to speed up an uncommon case, and increasing the code base can negatively affect surrounding code (just because the CPU cache and registers are used inappropriate and the compiler optimizes less important paths). In addition, while this change speed up list comprehensions for long list, it slows down them for short lists. Short lists are more common.
msg339709 - (view)	Author: anthony shaw (anthony shaw)	Date: 2019-04-09 07:04
>I am -1 for this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with conditions) and even in this case it is small. It is possible to add a lot of other optimizations for other cases which will sped up them to 50% or 100%, but we do not do this, because every such optimization has a cost. It increases the amount of code which should be maintained and covered by tests, it adds small overhead in common cases to speed up an uncommon case, and increasing the code base can negatively affect surrounding code (just because the CPU cache and registers are used inappropriate and the compiler optimizes less important paths). Understood, I had hoped this change would have a broader impact. The additional opcode is not ideal either. > In addition, while this change speed up list comprehensions for long list, it slows down them for short lists. Short lists are more common. I've been profiling this today, basically, this implementation receives the `list_iter`, `range_iter`, etc. There is no standard object model for an iterator's length, _PyObject_HasLen would return false because it neither implements tp_as_sequence nor, tp_as_mapping (rightly so). What this has uncovered (so hopefully there's some value from this whole experience!) is that __length_hint__ for iterators is _really_ inefficient. Take a list_iterator for example: PyObject_LengthHint will call, _PyObject_HasLen, which returns false, which then goes to call _PyObject_LookupSpecial, then _PyObject_CallNoArg, which calls listiter_len, which calls PyList_GET_SIZE which returns a Py_ssize_t, which is then converted to a PyLong via PyLong_FromSsize_t, which is then returned back to PyObject_LengthHint, which then PyLong_AsSsize_t is run to convert the PyLong back into a Py_ssize_t The Py_ssize_t is then finally returned to the caller! My conclusion was that the list comprehension should be initialized to the length of the target, before GET_ITER is run. This would remove the overhead for range objects, because you could simply call _PyObject_HasLen, which would return true for dict, list, tuple and set, but false for range objects (which is what you want). The issue is that GET_ITER is called outside the code object for the comprehension, so you'd have to pass an additional argument to the comprehension generator. This is way outside of my expertise, but the only way I can see to find a sizeable benefit, with minimal code and no edge cases which are slower. Thanks for your time
msg341975 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2019-05-09 09:06
We can return to this issue if make the invocation of __length_hint__ much much faster. For example by adding the tp_length_hint slot. But currently it is too large change and his has negative effects.

History
Date	User	Action	Args
2022-04-11 14:59:13	admin	set	github: 80732
2019-05-09 09:06:57	serhiy.storchaka	set	messages: + msg341975
2019-05-09 06:56:01	anthonypjshaw	set	status: open -> closed resolution: rejected stage: patch review -> resolved
2019-05-09 04:53:37	anthonypjshaw	set	assignee: anthonypjshaw nosy: + anthonypjshaw
2019-04-09 07:04:14	anthony shaw	set	messages: + msg339709
2019-04-09 05:50:57	serhiy.storchaka	set	messages: + msg339702
2019-04-09 05:35:20	anthony shaw	set	messages: + msg339699
2019-04-08 23:31:25	pablogsal	set	messages: + msg339684
2019-04-08 23:27:54	pablogsal	set	messages: - msg339665
2019-04-08 19:07:12	pablogsal	set	messages: + msg339665
2019-04-08 13:50:09	ncoghlan	set	messages: + msg339640
2019-04-08 13:09:33	pablogsal	set	messages: + msg339636
2019-04-08 13:08:01	pablogsal	set	messages: - msg339634
2019-04-08 13:05:45	pablogsal	set	nosy: + pablogsal messages: + msg339634
2019-04-08 12:42:21	methane	set	messages: + msg339631
2019-04-08 12:33:36	methane	set	messages: + msg339630
2019-04-08 12:16:14	anthony shaw	set	messages: + msg339628
2019-04-08 12:08:40	anthony shaw	set	messages: + msg339627
2019-04-08 11:54:52	methane	set	messages: + msg339626
2019-04-08 11:50:05	anthony shaw	set	messages: + msg339625
2019-04-08 11:47:48	anthony shaw	set	nosy: - pablogsal messages: + msg339623
2019-04-08 11:45:13	methane	set	messages: + msg339622
2019-04-08 11:39:43	pablogsal	set	nosy: + pablogsal messages: + msg339621
2019-04-08 11:36:57	anthony shaw	set	messages: + msg339620
2019-04-08 11:36:40	methane	set	messages: + msg339619
2019-04-08 11:35:08	anthony shaw	set	nosy: - pablogsal messages: + msg339618
2019-04-08 11:34:39	pablogsal	set	nosy: + pablogsal messages: + msg339617
2019-04-08 11:21:02	methane	set	messages: + msg339614
2019-04-08 11:20:11	methane	set	messages: + msg339613
2019-04-08 11:02:52	ronaldoussoren	set	nosy: + ronaldoussoren messages: + msg339612
2019-04-08 10:48:59	xtreak	set	nosy: + methane
2019-04-08 07:34:19	anthony shaw	set	messages: + msg339595
2019-04-08 07:30:32	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg339594
2019-04-08 05:48:28	anthony shaw	set	type: enhancement
2019-04-08 01:46:48	Aaron Hall	set	nosy: + Aaron Hall
2019-04-08 01:28:57	anthonypjshaw	set	keywords: + patch stage: patch review pull_requests: + pull_request12644
2019-04-08 01:28:21	anthony shaw	create