This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Add length_hint parameter to list, dict, set constructors to allow efficient presizing
Type: performance Stage:
Components: Versions: Python 3.4
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: alex, asvetlov, benjamin.peterson, brett.cannon, christian.heimes, giampaolo.rodola, gregory.p.smith, neologix, pitrou, rhettinger, scoder, serhiy.storchaka, trent, vstinner
Priority: normal Keywords: patch

Created on 2013-03-02 17:23 by alex, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
preallocate.patch christian.heimes, 2013-03-05 18:53 review
listresize.py christian.heimes, 2013-03-06 12:37
Messages (17)
msg183332 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2013-03-02 17:23
Following the length_hint PEP, we should expose this facility to end-python programmers. The semantics would be basically: the list has behavior identical to if you hadn't provided length_hint, except the VM is free to preallocate efficiently.

CPython (and PyPy) use this kind of logic internally all over the place when lists are to be returned (including, but not limited to, list() with a single argument!)

If this is considered to be a good idea I'll whip up a patch.
msg183339 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-03-02 18:39
Definitely needs to be discussed on python-ideas/python-dev.
msg183349 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-03-02 21:48
Guido has previously rejected any kind of user specified presizing parameter for dictionaries.  He felt that it was breaking the abstraction in a way that caused people to focus on implementation specific details.

Also, set() and dict() use presizing (and hash value reuse) only in the limited case of the input iterable being an exact dictionary or set.  Otherwise, it makes no assumptions about the input iterable and grows as needed.  For sets, even when the __len__ is known for a general iterable, we don't know how many unique elements are the input -- consider for example: set([None] * 1000000).  In typical use cases for sets, a user also doesn't know in advance how many unique elements will be present -- and to the extent they try to add estimation logic, it would only complicate their programs.
msg183353 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-03-03 01:15
I think it is going a bit too far in the direction of exposing low-level optimizations. Python is not C++. I also agree with Raymond that it's rather rare to construct read-only containers (in the sense that they are allocated once and for all).
msg183360 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2013-03-03 03:48
python-dev/ideas may be a better place to have this discussion, but basically if you're going to insist that stuff like this doesn't go into Python "because it isn't C++", people are going to have to write C++, and that makes me incredibly sad.

There's an obvious need for this, CPython uses optimizations like this internally, as does PyPy.
msg183361 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2013-03-03 03:52
I think the emphasis is on "internally" :).
msg183365 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-03-03 10:29
> There's an obvious need for this, CPython uses optimizations like this
> internally, as does PyPy.

I don't know if it's a need or if it's just "nice to have".
By the way, in the list case it also makes C code simpler, since once
the list is preallocated you just have to call PyList_SET_ITEM to
populate it, and there's no error return to worry about.

Also, lists are easy to pre-allocate in pure Python as well:

l = [None] * N
# populate
for i in range(N):
    l[i] = ...
msg183366 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2013-03-03 10:31
That strategy only works if you know the exact count, it fails if you only have an estimate (as __length_hint__ gives you).
msg183473 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-04 16:34
This unlikely will be useful. C code uses __length_hint__ because C code is fast enough and fill time can be comparable to resize time. But Python code will be one or two order slower and resize time is insignificant fraction of total time for any real code. You can implement your proposition and do the tests, and I doubt that they will show impressive results.
msg183536 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-03-05 17:13
My python-ideas posting about the same topic http://mail.python.org/pipermail/python-ideas/2013-March/019807.html
msg183545 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-03-05 18:53
Here is an experimental patch. The speedup is ... measurable.

$ ./python -m timeit -n1000 "l = []" "l.__preallocate__(10000)" "app = l.append" "for i in range(10000): app(i)" "l.__shrink__()"
1000 loops, best of 3: 3.68 msec per loop

$ ./python -m timeit -n1000 "l = []" "app = l.append" "for i in range(10000): app(i)"
1000 loops, best of 3: 3.75 msec per loop
msg183552 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-03-05 19:58
> Here is an experimental patch. The speedup is ... measurable.

It's likely to be more useful for dict and set, to avoid/limit rehashs.
Also, the allocation overhead depends on the implementation, I suspect the gain would be more important with PyPy.

FWIW, Java proposes this for lists and maps, but I'm not convinced exposing such tunables really makes sense for Python.
msg183558 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2013-03-05 22:27
The gain will be more noticeable the faster the Python implementation it is running under is. It is going to avoid logN relloc's in just about all implementations.  That CPython is relatively slow is not a justification to avoid adding the feature.

I like Christian's proposal of not having it be a constructor argument but having it be a __preallocate__ method on the object.  That keeps it out of the eyes of most users reading the docs (keeping the main interface clear) and lets us put it in a special section of the python docs.  Code that *needs* it could use it but most would not need to bother.

Python VMs would be free to have dummy do nothing implementations (ie: we should *never* guarantee that a python call to thing.__preallocate__(1000) followed immediately by a C extension module call that takes thing as input can assume anything about the underlying object... it needs to check the size or resize itself before using the unchecked _SET_ITEM macros, etc.)
msg183580 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2013-03-06 05:51
I agree with basically all counter-arguments that were brought up so far. While I would eventually like to use the available length hints in the special case of Cython's list comprehensions, I'm far from seeing a general applicability for this. The use cases are fairly limited, the (missing) guarantees are easy to misunderstand by users and alternatives exist for most use cases. Python code shouldn't be bothered with memory management.
msg183589 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2013-03-06 12:37
Gregory, thanks. :) 

By the way CPython's list type does more than log(N) resize ops:

$ ./listresize.py 10 100 1000 10000 100000
10 list.append() do 4 resize ops.
100 list.append() do 11 resize ops.
1000 list.append() do 28 resize ops.
10000 list.append() do 47 resize ops.
100000 list.append() do 66 resize ops.

I suspect that hash types (dict, set) could gain even more speed as it eliminates the need for rehashing. It's more costly than memcpy() inside realloc().

I understand the proposal as a power user tool. Most people don't need a pneumatic hammer, an ordinary hammer suffices. But in some cases you need a tool with more power. "CPython doesn't gain much from the new API, let's not add it to Python" isn't nice to other implementations that may benefit from it.
msg183592 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-03-06 14:14
> By the way CPython's list type does more than log(N) resize ops:

Obviously it's an asymptotic complexity, not a hard number.

> I understand the proposal as a power user tool. Most people don't
> need a pneumatic hammer, an ordinary hammer suffices. But in some
> cases you need a tool with more power. "CPython doesn't gain much
> from the new API, let's not add it to Python" isn't nice to other
> implementations that may benefit from it.

I'd still like to see concrete use cases, not micro-benchmarks.
msg217843 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-05-03 23:17
Based on all the negative comments, I'm closing this one.  It adds too much complication in return for dubious value.
History
Date User Action Args
2022-04-11 14:57:42adminsetgithub: 61540
2014-05-03 23:17:55rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg217843
2013-03-21 03:52:48trentsetnosy: + trent
2013-03-10 20:49:24asvetlovsetnosy: + asvetlov
2013-03-07 10:53:11giampaolo.rodolasetnosy: + giampaolo.rodola
2013-03-06 18:27:00vstinnersetnosy: + vstinner
2013-03-06 14:14:07pitrousetmessages: + msg183592
title: Add length_hint parameter to list, dict, set constructors to allow efficient presizing -> Add length_hint parameter to list, dict, set constructors to allow efficient presizing
2013-03-06 12:37:17christian.heimessetfiles: + listresize.py

messages: + msg183589
2013-03-06 05:51:12scodersetnosy: + scoder
messages: + msg183580
2013-03-05 22:27:59gregory.p.smithsetnosy: + gregory.p.smith
messages: + msg183558
2013-03-05 19:58:04neologixsetnosy: + neologix
messages: + msg183552
2013-03-05 18:53:35christian.heimessetfiles: + preallocate.patch
keywords: + patch
messages: + msg183545
2013-03-05 17:13:02christian.heimessetmessages: + msg183536
2013-03-05 17:09:48christian.heimessetnosy: + christian.heimes
2013-03-04 16:34:15serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg183473
2013-03-04 16:27:44brett.cannonsetnosy: + brett.cannon
2013-03-03 10:31:12alexsetmessages: + msg183366
2013-03-03 10:29:59pitrousetmessages: + msg183365
2013-03-03 03:52:04benjamin.petersonsetmessages: + msg183361
2013-03-03 03:48:41alexsetmessages: + msg183360
2013-03-03 01:15:43pitrousetnosy: + pitrou
messages: + msg183353
2013-03-02 21:48:38rhettingersetnosy: + rhettinger
messages: + msg183349
2013-03-02 18:39:56benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg183339
2013-03-02 17:23:39alexcreate