classification
Title: Optimize list comprehensions with preallocate size and protect against overflow
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.8
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: anthonypjshaw Nosy List: Aaron Hall, anthony shaw, anthonypjshaw, methane, ncoghlan, pablogsal, ronaldoussoren, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2019-04-08 01:28 by anthony shaw, last changed 2019-05-09 09:06 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12718 closed anthonypjshaw, 2019-04-08 01:28
Messages (26)
msg339586 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 01:28
List comprehensions currently create a series of opcodes inside a code object, the first of which is BUILD_LIST with an oparg of 0, effectively creating a zero-length list with a preallocated size of 0.

If you're doing a simple list comprehension on an iterator, e.g.

def foo():
  a = iterable
  return [x for x in a]

Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:
  3           0 BUILD_LIST               0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                 8 (to 14)
              6 STORE_FAST               1 (x)
              8 LOAD_FAST                1 (x)
             10 LIST_APPEND              2
             12 JUMP_ABSOLUTE            4
        >>   14 RETURN_VALUE

The list comprehension will do a list_resize on the 4, 8, 16, 25, 35, 46, 58, 72, 88th iterations, etc.

This PR preallocates the list created in a list comprehension to the length of the iterator using PyObject_LengthHint().

It uses a new BUILD_LIST_PREALLOC opcode which builds a list with the allocated size of PyObject_LengthHint(co_varnames[oparg]).

[x for x in iterable] compiles to:

Disassembly of <code object <listcomp> at 0x109db2c40, file "<stdin>", line 3>:
  3           0 BUILD_LIST_PREALLOC      0
              2 LOAD_FAST                0 (.0)
        >>    4 FOR_ITER                 8 (to 14)
              6 STORE_FAST               1 (x)
              8 LOAD_FAST                1 (x)
             10 LIST_APPEND              2
             12 JUMP_ABSOLUTE            4
        >>   14 RETURN_VALUE

If the comprehension has ifs, then it will use the existing BUILD_LIST opcode

Testing using a range length of 10000

./python.exe -m timeit "x=list(range(10000)); [y for y in x]"

Gives 392us on the current 3.8 branch
and 372us with this change (about 8-10% faster)

the longer the iterable, the bigger the impact.

This change also catches the issue that a very large iterator, like a range object :
[a for a in range(2**256)]

Would cause the 3.8< interpreter to consume all memory and crash because there is no check against PY_SSIZE_MAX currently.

With this change (assuming there is no if inside the comprehension) is now caught and thrown as an OverflowError:

>>> [a for a in range(2**256)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
OverflowError: Python int too large to convert to C ssize_t
msg339594 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-04-08 07:30
The benefit is too small to add a new opcode.
msg339595 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 07:34
The opcode would not solely apply to this specific use case.

I could seek another way of implementing the same behaviour without an additional opcode?
msg339612 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2019-04-08 11:02
This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

     [x**2 for x in range(LARGE_VALUE) if is_prime(x)]

See also issue28940
msg339613 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-08 11:20
I agree with Serhiy.  Benefit seems too small to add new opcode.

> I could seek another way of implementing the same behaviour without an additional opcode?

How about converting `[x for x in it]` to `[*it]` in AST?
msg339614 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-08 11:21
$ python3 -m timeit -s 'r=range(1000)' -- '[x for x in r]'
5000 loops, best of 5: 40 usec per loop

$ python3 -m timeit -s 'r=range(1000)' -- '[*r]'
20000 loops, best of 5: 17.3 usec per loop
msg339617 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-04-08 11:34
This patch makes it slow for small iterators:

Perf program:

import perf

runner = perf.Runner()
runner.timeit("list_comp",
               stmt="[x for x in range(10)]",
               setup="")


Current master:
❯ ./python.exe ../check.py
.....................
list_comp: Mean +- std dev: 3.97 us +- 0.15 us

PR 12718:
❯ ./python.exe ../check.py
.....................
list_comp: Mean +- std dev: 4.57 us +- 0.17 us

The overhead is very likely due to calling __length_hint__
msg339618 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 11:35
> How about converting `[x for x in it]` to `[*it]` in AST?

I should have been more explicit, this patch improves the performance of all list comprehensions that don’t have an if clause.

Not just
[x for x in y]

but:

d = {} # some sort of dictionary
[f”{k} — {v}” for k, v in d.items()]

a = iterable
[val**2 for val in a]

Would all use BUILD_LIST_PREALLOC and use a LengthHint.

I can do another speed test for those other scenarios. 

Most of the stdlib packages have these sorts of list comps, including those in the default site.py.
msg339619 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-08 11:36
> I should have been more explicit, this patch improves the performance of all list comprehensions that don’t have an if clause.

But in these cases, overhead of reallocation will be smaller than simple case.
msg339620 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 11:36
> This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:

The current implementation of list comprehensions raise neither a memoryerror or overflow error. They will consume all available memory and crash the interpreter.

This patch raises an OverflowError before execution instead of just looping until memory heap exhaustion
msg339621 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-04-08 11:39
More benchmarks for slow iterators:

import perf

runner = perf.Runner()
runner.timeit("list_comp",
               stmt="[x**2 for x in k]",
               setup="k=iter(list(range(10)))")


Current master:
❯ ./python.exe ../check.py
.....................
list_comp: Mean +- std dev: 924 ns +- 35 ns

PR 12718:
❯ ./python.exe ../check.py
.....................
list_comp: Mean +- std dev: 1.17 us +- 0.06 us
msg339622 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-08 11:45
> > This might cause a MemoryError when the __length_hint__ of the source returns a too large value, even when the actual size of the comprehension is smaller, e.g.:
>
> The current implementation of list comprehensions raise neither a memoryerror or overflow error. They will consume all available memory and crash the interpreter.
>
> This patch raises an OverflowError before execution instead of just looping until memory heap exhaustion
>

Note PEP 424.

"""
__length_hint__ must return an integer (else a TypeError is raised) or
NotImplemented, and is not required to be accurate. It may return a
value that is either larger or smaller than the actual size of the
container.
"""

it.__length_hint__ can return 2**1000 even if len(list(it))==0.

In such case, current behavior works.  And your patch will raise OverflowError.
msg339623 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 11:47
> This patch makes it slow for small iterators

That is a one-off cost for the __length_hint__ of the range object specifically.
Objects with a known length (lists, sets, tuples) would not have that overhead.

I can run a more useful set of benchmarks against this.

So the +0.6us would be the same for ranges 8-16. Then less for 16-25, then again for 25-35 as the removal of the reallocation process has a more significant factor for larger ranges.
msg339625 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 11:50
> In such case, current behavior works.  And your patch will raise OverflowError.

Try

[x for x in range(2**1000)] 

in a REPL. It doesn’t raise anything, it tries to create a list that will eventually exceed PY_SIZE_MAX, but it only crashes once it reaches that iteration.

This raises an OverflowError instead, the same way:
len(range(2**1000))
raises an OverflowError
msg339626 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-08 11:54
> Try
>
> [x for x in range(2**1000)]
>
> in a REPL. It doesn’t raise anything, it tries to create a list that will eventually exceed PY_SIZE_MAX, but it only crashes once it reaches that iteration.

It is expected behavior.

> This raises an OverflowError instead, the same way:
> len(range(2**1000))
> raises an OverflowError

If your patch uses __length_hint__, it is bug.
iterator will return 2**1000 for __length_hint__, but produce no item
on iteration.
msg339627 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 12:08
> If your patch uses __length_hint__, it is bug.
iterator will return 2**1000 for __length_hint__, but produce no item
on iteration.

It raises an OverflowError because of the goto
https://github.com/python/cpython/pull/12718/files#diff-7f17c8d8448b7b6f90549035d2147a9fR2493 this could just as easily set size to 0.

I put `goto error` given the opportunity to handle an expected fault gracefully.
msg339628 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-08 12:16
> If your patch uses __length_hint__, it is bug.

I’m not sure I understand this comment,

PEP424 says “This is useful for presizing containers when building from an iterable.“

This patch uses __length_hint__ to presize the list container for a list comprehension.
msg339630 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-08 12:33
"useful" doesn't mean "use it as-is".
It is just a hint.  It will be wrong.

See here for list example:
https://github.com/python/cpython/blob/7a0630c530121725136526a88c49589b54da6492/Objects/listobject.c#L929-L940
msg339631 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2019-04-08 12:42
I'm sorry.  list_extend raises OverflowError too.
msg339636 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-04-08 13:09
> That is a one-off cost for the __length_hint__ of the range object specifically.
Objects with a known length (lists, sets, tuples) would not have that overhead.

That seems incorrect. This is not unique of range objects as it affects also objects with known lengths (like a list):

import perf

runner = perf.Runner()
runner.timeit("list_comp",
               stmt="[x*2 for x in k]",
               setup="k=list(range(10))")


Current master:
❯ ./python.exe ../check.py -n 10
.....................
list_comp: Mean +- std dev: 3.82 us +- 0.13 us

PR 12718:
❯ ./python.exe ../check.py -n 10
.....................
list_comp: Mean +- std dev: 4.38 us +- 0.16 us


Check also my other benchmark with a list iterator ( iter(list(range(10))) ) or this one with a generator comp:

import perf

runner = perf.Runner()
runner.timeit("list_comp",
               stmt="[x*2 for x in it]",
               setup="k=list(range(10));it=(x for x in k)")


Current master:
❯ ./python.exe ../check.py -n 10
.....................
list_comp: Mean +- std dev: 945 ns +- 27 ns


PR 12718:
❯ ./python.exe ../check.py -n 10
.....................
list_comp: Mean +- std dev: 1.33 us +- 0.05 us
msg339640 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2019-04-08 13:50
I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :)

While length_hint is allowed to be somewhat inaccurate, we do expect it to be at least *vaguely* accurate (otherwise it isn't very useful, and if it can be inaccurate enough to trigger OverflowError or MemoryError in cases that would otherwise work reasonably well, it would be better for a type not to implement it at all).

While it would be nice to be able to avoid adding a new opcode, the problem is that the existing candidate opcodes (BUILD_LIST, BUILD_LIST_UNPACK) are both inflexible in what they do:

- BUILD_LIST requires that the final list length be known at compile time
- BUILD_LIST_UNPACK infers the final length from an object reference, but calls _PyList_Extend directly, so it combines the preallocation and the iterator consumption into a single step, and hence can't be used outside of iterable unpacking into a list

At the same time, attempting to generalise either of them isn't desirable, since it would slow them down for their existing use cases, and be slower than a new opcode for this use case.

The proposed BUILD_LIST_PREALLOC opcode splits the difference: it lets the compiler provide the interpreter with a *hint* as to how big the resulting list is expected to be.

That said, you'd want to run the result through the benchmark suite rather than relying solely on microbenchmarks, as even though unfiltered "[some_operation_on_x for x in y]" comprehensions without nested loops or filter clauses are pretty common (more common than the relatively new "[*itr]" syntax), it's far less clear what the typical distribution in input lengths actually is, and how many memory allocations need to be avoided in order to offset the cost of the initial _PyObject_LengthHint call (as Pablo's small scale results show).

(Note that in the _PyList_Extend code, there are preceding special cases for builtin lists and tuples that take those down a much faster path that avoids the _PyObject_LengthHint call entirely)
msg339684 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-04-08 23:31
Here are the updated results for the benchmark suite. The previous results (unlinked from the issue to reduce noise)
were against an old version of the master branch.

2019-04-08_13-08-master-58721a903074.json.gz
============================================

Performance version: 0.7.0
Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28
Number of logical CPUs: 12
Start date: 2019-04-09 00:32:40.656576
End date: 2019-04-09 00:54:03.721798

12718-5f06333a4e49.json.gz
==========================

Performance version: 0.7.0
Report on Linux-5.0.5-1-ARCH-x86_64-with-glibc2.28
Number of logical CPUs: 12
Start date: 2019-04-09 00:07:53.433511
End date: 2019-04-09 00:29:32.603022

                                      
                                        -- CURRENT MASTER --                    -- PR12718 --
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| Benchmark               | 2019-04-08_13-08-master-58721a903074.json.gz | 12718-5f06333a4e49.json.gz | Change       | Significance           |
+=========================+==============================================+============================+==============+========================+
| 2to3                    | 455 ms                                       | 465 ms                     | 1.02x slower | Significant (t=-2.27)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| chaos                   | 171 ms                                       | 180 ms                     | 1.05x slower | Significant (t=-5.92)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| crypto_pyaes            | 169 ms                                       | 177 ms                     | 1.05x slower | Significant (t=-6.06)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| deltablue               | 11.5 ms                                      | 11.9 ms                    | 1.03x slower | Significant (t=-3.28)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| django_template         | 175 ms                                       | 191 ms                     | 1.09x slower | Significant (t=-13.00) |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| dulwich_log             | 127 ms                                       | 131 ms                     | 1.03x slower | Significant (t=-8.97)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| fannkuch                | 646 ms                                       | 665 ms                     | 1.03x slower | Significant (t=-12.37) |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| float                   | 165 ms                                       | 165 ms                     | 1.00x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| go                      | 386 ms                                       | 381 ms                     | 1.01x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| hexiom                  | 15.4 ms                                      | 15.5 ms                    | 1.00x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| html5lib                | 130 ms                                       | 132 ms                     | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| json_dumps              | 18.4 ms                                      | 18.3 ms                    | 1.00x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| json_loads              | 38.6 us                                      | 38.6 us                    | 1.00x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| logging_format          | 15.0 us                                      | 15.6 us                    | 1.04x slower | Significant (t=-10.98) |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| logging_silent          | 284 ns                                       | 287 ns                     | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| logging_simple          | 13.5 us                                      | 13.9 us                    | 1.03x slower | Significant (t=-8.03)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| mako                    | 24.6 ms                                      | 24.5 ms                    | 1.01x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| meteor_contest          | 157 ms                                       | 155 ms                     | 1.01x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| nbody                   | 196 ms                                       | 198 ms                     | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| nqueens                 | 144 ms                                       | 147 ms                     | 1.02x slower | Significant (t=-9.68)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| pathlib                 | 27.6 ms                                      | 28.4 ms                    | 1.03x slower | Significant (t=-5.24)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| pickle                  | 14.1 us                                      | 14.2 us                    | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| pickle_dict             | 33.7 us                                      | 35.2 us                    | 1.04x slower | Significant (t=-13.28) |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| pickle_list             | 5.22 us                                      | 5.38 us                    | 1.03x slower | Significant (t=-6.65)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| pickle_pure_python      | 681 us                                       | 719 us                     | 1.05x slower | Significant (t=-5.64)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| pidigits                | 217 ms                                       | 219 ms                     | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| python_startup          | 12.2 ms                                      | 12.5 ms                    | 1.02x slower | Significant (t=-9.34)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| python_startup_no_site  | 8.90 ms                                      | 8.92 ms                    | 1.00x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| raytrace                | 789 ms                                       | 793 ms                     | 1.00x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| regex_compile           | 267 ms                                       | 270 ms                     | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| regex_dna               | 221 ms                                       | 221 ms                     | 1.00x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| regex_effbot            | 3.94 ms                                      | 3.98 ms                    | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| regex_v8                | 30.4 ms                                      | 30.4 ms                    | 1.00x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| richards                | 108 ms                                       | 108 ms                     | 1.00x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| scimark_fft             | 525 ms                                       | 525 ms                     | 1.00x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| scimark_lu              | 270 ms                                       | 271 ms                     | 1.00x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| scimark_monte_carlo     | 160 ms                                       | 160 ms                     | 1.00x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| scimark_sor             | 295 ms                                       | 305 ms                     | 1.03x slower | Significant (t=-9.82)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| scimark_sparse_mat_mult | 6.63 ms                                      | 6.57 ms                    | 1.01x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| spectral_norm           | 217 ms                                       | 219 ms                     | 1.01x slower | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| sqlalchemy_declarative  | 239 ms                                       | 244 ms                     | 1.02x slower | Significant (t=-4.34)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| sqlalchemy_imperative   | 50.7 ms                                      | 52.0 ms                    | 1.03x slower | Significant (t=-3.84)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| sqlite_synth            | 4.20 us                                      | 4.14 us                    | 1.01x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| sympy_expand            | 612 ms                                       | 637 ms                     | 1.04x slower | Significant (t=-8.36)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| sympy_integrate         | 27.5 ms                                      | 28.9 ms                    | 1.05x slower | Significant (t=-4.16)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| sympy_str               | 284 ms                                       | 299 ms                     | 1.05x slower | Significant (t=-6.97)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| sympy_sum               | 151 ms                                       | 156 ms                     | 1.03x slower | Significant (t=-4.37)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| telco                   | 9.70 ms                                      | 9.64 ms                    | 1.01x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| tornado_http            | 276 ms                                       | 287 ms                     | 1.04x slower | Significant (t=-5.75)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| unpack_sequence         | 64.4 ns                                      | 67.3 ns                    | 1.05x slower | Significant (t=-6.16)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| unpickle                | 22.8 us                                      | 22.7 us                    | 1.01x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| unpickle_list           | 5.72 us                                      | 5.97 us                    | 1.04x slower | Significant (t=-6.05)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| unpickle_pure_python    | 510 us                                       | 527 us                     | 1.03x slower | Significant (t=-3.89)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| xml_etree_generate      | 150 ms                                       | 154 ms                     | 1.02x slower | Significant (t=-3.72)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| xml_etree_iterparse     | 145 ms                                       | 148 ms                     | 1.02x slower | Significant (t=-3.60)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| xml_etree_parse         | 208 ms                                       | 207 ms                     | 1.00x faster | Not significant        |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
| xml_etree_process       | 121 ms                                       | 125 ms                     | 1.04x slower | Significant (t=-5.43)  |
+-------------------------+----------------------------------------------+----------------------------+--------------+------------------------+
msg339699 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-09 05:35
Have just optimized some of the code and pushed another change as 69dce1c552.

ran both master and 69dce1c552 using pyperformance with PGO:

➜  ~ python3.8 -m perf compare_to master.json 69dce1c552.json --table 
+-------------------------+---------+-----------------------------+
| Benchmark               | master  | 69dce1c552                  |
+=========================+=========+=============================+
| 2to3                    | 432 ms  | 426 ms: 1.02x faster (-2%)  |
+-------------------------+---------+-----------------------------+
| chaos                   | 157 ms  | 155 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| crypto_pyaes            | 154 ms  | 153 ms: 1.00x faster (-0%)  |
+-------------------------+---------+-----------------------------+
| dulwich_log             | 123 ms  | 124 ms: 1.00x slower (+0%)  |
+-------------------------+---------+-----------------------------+
| fannkuch                | 603 ms  | 600 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| float                   | 153 ms  | 154 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| go                      | 323 ms  | 326 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| hexiom                  | 13.6 ms | 13.5 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| json_dumps              | 18.1 ms | 17.9 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| logging_format          | 13.2 us | 13.8 us: 1.05x slower (+5%) |
+-------------------------+---------+-----------------------------+
| logging_silent          | 266 ns  | 280 ns: 1.05x slower (+5%)  |
+-------------------------+---------+-----------------------------+
| logging_simple          | 12.4 us | 13.1 us: 1.06x slower (+6%) |
+-------------------------+---------+-----------------------------+
| meteor_contest          | 145 ms  | 132 ms: 1.10x faster (-9%)  |
+-------------------------+---------+-----------------------------+
| nbody                   | 179 ms  | 172 ms: 1.04x faster (-4%)  |
+-------------------------+---------+-----------------------------+
| nqueens                 | 138 ms  | 134 ms: 1.03x faster (-3%)  |
+-------------------------+---------+-----------------------------+
| pathlib                 | 56.4 ms | 55.6 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| pickle                  | 15.0 us | 15.4 us: 1.03x slower (+3%) |
+-------------------------+---------+-----------------------------+
| pickle_pure_python      | 620 us  | 617 us: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| raytrace                | 696 ms  | 691 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| regex_compile           | 242 ms  | 243 ms: 1.00x slower (+0%)  |
+-------------------------+---------+-----------------------------+
| scimark_monte_carlo     | 140 ms  | 143 ms: 1.02x slower (+2%)  |
+-------------------------+---------+-----------------------------+
| scimark_sparse_mat_mult | 5.90 ms | 5.94 ms: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| spectral_norm           | 194 ms  | 196 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| sympy_str               | 246 ms  | 245 ms: 1.00x faster (-0%)  |
+-------------------------+---------+-----------------------------+
| telco                   | 8.42 ms | 8.31 ms: 1.01x faster (-1%) |
+-------------------------+---------+-----------------------------+
| unpack_sequence         | 59.2 ns | 59.7 ns: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| unpickle                | 21.2 us | 21.4 us: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| unpickle_list           | 5.73 us | 5.81 us: 1.01x slower (+1%) |
+-------------------------+---------+-----------------------------+
| unpickle_pure_python    | 471 us  | 467 us: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+
| xml_etree_iterparse     | 142 ms  | 143 ms: 1.01x slower (+1%)  |
+-------------------------+---------+-----------------------------+
| xml_etree_generate      | 139 ms  | 137 ms: 1.02x faster (-2%)  |
+-------------------------+---------+-----------------------------+
| xml_etree_process       | 109 ms  | 108 ms: 1.01x faster (-1%)  |
+-------------------------+---------+-----------------------------+

Not significant (21): deltablue; django_template; html5lib; json_loads; mako; pickle_dict; pickle_list; pidigits; python_startup; python_startup_no_site; regex_dna; regex_effbot; regex_v8; richards; scimark_fft; scimark_lu; scimark_sor; sympy_expand; sympy_integrate; sympy_sum; xml_etree_parse

I'd like to look at the way range object LengthHint works, it looks like the path for those is not ideal and could use some optimization. Also, BUILD_LIST_PREALLOC uses the Iterator, not the actual object, so you can't use the much faster _HasLen and PyObject_Length().

I'm going to look at how __length_hint__ could be optimized for iterators that would make the smaller range cases more efficient.

meteor_contest uses a lot of list comprehensions, so should show the impact for the patch.
msg339702 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-04-09 05:50
> I was going to note that the algorithm Anthony has pursued here is the same one we already use for the list constructor and list.extend(), but Inada-san already pointed that out :)

And that optimization looks questionable to me. I tried to reduce an overhead for small lists, but this requires much more complex code and gives mixed results.

I am -1 for this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with conditions) and even in this case it is small. It is possible to add a lot of other optimizations for other cases which will sped up them to 50% or 100%, but we do not do this, because every such optimization has a cost. It increases the amount of code which should be maintained and covered by tests, it adds small overhead in common cases to speed up an uncommon case, and increasing the code base can negatively affect surrounding code (just because the CPU cache and registers are used inappropriate and the compiler optimizes less important paths).

In addition, while this change speed up list comprehensions for long list, it slows down them for short lists. Short lists are more common.
msg339709 - (view) Author: anthony shaw (anthony shaw) Date: 2019-04-09 07:04
>I am -1 for this optimization because it affects only one particular case (neither other kinds of comprehensions, nor generator expressions, nor list comprehensions with 
 conditions) and even in this case it is small. It is possible to add a lot of other optimizations for other cases which will sped up them to 50% or 100%, but we do not do 
 this, because every such optimization has a cost. It increases the amount of code which should be maintained and covered by tests, it adds small overhead in common cases to 
 speed up an uncommon case, and increasing the code base can negatively affect surrounding code (just because the CPU cache and registers are used inappropriate and the 
 compiler optimizes less important paths).

Understood, I had hoped this change would have a broader impact. The additional opcode is not ideal either. 

> In addition, while this change speed up list comprehensions for long list, it slows down them for short lists. Short lists are more common.

I've been profiling this today, basically, this implementation receives the `list_iter`, `range_iter`, etc.

There is no standard object model for an iterator's length, _PyObject_HasLen would return false because it neither implements tp_as_sequence nor, tp_as_mapping (rightly so).

What this has uncovered (so hopefully there's some value from this whole experience!) is that __length_hint__ for iterators is _really_ inefficient. Take a list_iterator for example:

PyObject_LengthHint will call, _PyObject_HasLen, which returns false, which then goes to call
_PyObject_LookupSpecial, then 
_PyObject_CallNoArg, which calls 
listiter_len, which calls 
PyList_GET_SIZE which returns a Py_ssize_t, which is then converted to a PyLong via
PyLong_FromSsize_t, which is then returned back to PyObject_LengthHint, which then
PyLong_AsSsize_t is run to convert the PyLong back into a Py_ssize_t

The Py_ssize_t is then finally returned to the caller!

My conclusion was that the list comprehension should be initialized to the length of the target, before GET_ITER is run. This would remove the overhead for range objects, because you could simply call _PyObject_HasLen, which would return true for dict, list, tuple and set, but false for range objects (which is what you want).

The issue is that GET_ITER is called outside the code object for the comprehension, so you'd have to pass an additional argument to the comprehension generator.

This is way outside of my expertise, but the only way I can see to find a sizeable benefit, with minimal code and no edge cases which are slower.

Thanks for your time
msg341975 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-05-09 09:06
We can return to this issue if make the invocation of __length_hint__ much much faster. For example by adding the tp_length_hint slot. But currently it is too large change and his has negative effects.
History
Date User Action Args
2019-05-09 09:06:57serhiy.storchakasetmessages: + msg341975
2019-05-09 06:56:01anthonypjshawsetstatus: open -> closed
resolution: rejected
stage: patch review -> resolved
2019-05-09 04:53:37anthonypjshawsetassignee: anthonypjshaw

nosy: + anthonypjshaw
2019-04-09 07:04:14anthony shawsetmessages: + msg339709
2019-04-09 05:50:57serhiy.storchakasetmessages: + msg339702
2019-04-09 05:35:20anthony shawsetmessages: + msg339699
2019-04-08 23:31:25pablogsalsetmessages: + msg339684
2019-04-08 23:27:54pablogsalsetmessages: - msg339665
2019-04-08 19:07:12pablogsalsetmessages: + msg339665
2019-04-08 13:50:09ncoghlansetmessages: + msg339640
2019-04-08 13:09:33pablogsalsetmessages: + msg339636
2019-04-08 13:08:01pablogsalsetmessages: - msg339634
2019-04-08 13:05:45pablogsalsetnosy: + pablogsal
messages: + msg339634
2019-04-08 12:42:21methanesetmessages: + msg339631
2019-04-08 12:33:36methanesetmessages: + msg339630
2019-04-08 12:16:14anthony shawsetmessages: + msg339628
2019-04-08 12:08:40anthony shawsetmessages: + msg339627
2019-04-08 11:54:52methanesetmessages: + msg339626
2019-04-08 11:50:05anthony shawsetmessages: + msg339625
2019-04-08 11:47:48anthony shawsetnosy: - pablogsal
messages: + msg339623
2019-04-08 11:45:13methanesetmessages: + msg339622
2019-04-08 11:39:43pablogsalsetnosy: + pablogsal
messages: + msg339621
2019-04-08 11:36:57anthony shawsetmessages: + msg339620
2019-04-08 11:36:40methanesetmessages: + msg339619
2019-04-08 11:35:08anthony shawsetnosy: - pablogsal
messages: + msg339618
2019-04-08 11:34:39pablogsalsetnosy: + pablogsal
messages: + msg339617
2019-04-08 11:21:02methanesetmessages: + msg339614
2019-04-08 11:20:11methanesetmessages: + msg339613
2019-04-08 11:02:52ronaldoussorensetnosy: + ronaldoussoren
messages: + msg339612
2019-04-08 10:48:59xtreaksetnosy: + methane
2019-04-08 07:34:19anthony shawsetmessages: + msg339595
2019-04-08 07:30:32serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg339594
2019-04-08 05:48:28anthony shawsettype: enhancement
2019-04-08 01:46:48Aaron Hallsetnosy: + Aaron Hall
2019-04-08 01:28:57anthonypjshawsetkeywords: + patch
stage: patch review
pull_requests: + pull_request12644
2019-04-08 01:28:21anthony shawcreate