
classification
Title: Surprising list overallocation from .split()
Type: resource usage
Stage:
Components: Interpreter Core
Versions: Python 3.11

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: JelleZijlstra, tim.peters
Priority: normal
Keywords:

Created on 2022-03-11 22:37 by tim.peters, last changed 2022-04-11 14:59 by admin.

Messages (3)
msg414942 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-11 22:37
When looking into a StackOverflow question about surprisingly high memory use, I stumbled into this (under 3.10.1, Win64):

>>> import sys
>>> s = "1 2 3 4 5".split()
>>> s
['1', '2', '3', '4', '5']
>>> sys.getsizeof(s)
152
>>> _ - sys.getsizeof([])
96
>>> 96 / 8
12.0

That is, we allocated enough space in the list to store 12(!) elements, even though only 5 are used. Other ways of building a 5-element list I've tried overallocate by at most 3 slots:

>>> sys.getsizeof([ch for ch in "12345"])
120
>>> sys.getsizeof([1, 2, 3, 4, 5])
120

(and 120 - 56 = 64, room for 8 pointers)
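That arithmetic is easy to wrap in a tiny helper for comparing construction methods. A minimal sketch, assuming a 64-bit CPython build where each list slot holds an 8-byte pointer (the helper name is invented here; the two results just restate the numbers above):

>>> import sys
>>> def allocated_slots(lst):
...     # bytes beyond an empty list, divided by the 8-byte slot size
...     return (sys.getsizeof(lst) - sys.getsizeof([])) // 8
...
>>> allocated_slots("1 2 3 4 5".split())
12
>>> allocated_slots([1, 2, 3, 4, 5])
8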

Then there's this curiosity, which allocates space for exactly the 5 needed:

>>> sys.getsizeof(list(tuple("1 2 3 4 5".split())))
96

(and 96 - 56 = 40, room for the 5 pointers needed)

I don't expect this to be consistent, but allocating space for 12 when only 5 are needed is unreasonable. Even allocating space for 8 is pushing it ;-)
msg414958 - (view) Author: Jelle Zijlstra (JelleZijlstra) * (Python committer) Date: 2022-03-12 00:56
The value 12 is hardcoded here: https://github.com/python/cpython/blob/a89c29fbcc7e7e85848499443d819c3fab68c78a/Objects/stringlib/split.h#L14

The comment there says that this is because most .split() calls are on lines of human-readable text, which has about 11 words per line. I don't know if I believe that.
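If that hardcoded 12 really is the preallocation, it should be observable from pure Python by measuring split results of increasing length. A rough probe, under the same 8-byte-slot assumption as above (exact slot counts for the longer inputs depend on the interpreter's list growth pattern):

import sys

base = sys.getsizeof([])
for n in (1, 5, 12, 13, 20):
    parts = ("x " * n).split()   # an n-word string
    slots = (sys.getsizeof(parts) - base) // 8
    # if the hardcoded 12 is in effect, every n <= 12 should report 12 slots
    print(n, slots)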
msg414968 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-12 03:24
Well, that's annoying ;-) In context, the OP was saving a list of 10 million splits. So each overallocation by a single element burned 80 million bytes of RAM. Overallocating by 7 burned 560 million bytes.

Which is unusual. Usually a split result is short-lived, consumed once then thrown away.

OTOH, the overwhelming motivation for overallocating at all is to achieve O(1) amortized time after a long _sequence_ of appends, and split results typically aren't appended to at all. split() appears to be using it as a timing micro-optimization for tiny lists instead.

So, like I said, it's annoying ;-) For "small" lists, split() really shouldn't overallocate at all (because, as before, split results are rarely appended to). A compromise could be to save pointers to the first N (12, whatever) instances of the splitting string in a stack ("auto") vector, before any list object (or result string object) is created. If the input is exhausted before reaching N, fine, build a result out of exactly what was found. If there's more to do, build a result from the first N, and go on as currently, letting PyList_Append deal with it (overallocation is huge in percentage terms when the list is short, but not so much as the list gets longer).
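For illustration only, here is a rough Python model of that compromise (the real change would be C code in Objects/stringlib/split.h; the names split_no_overalloc and N are made up here, and it assumes a non-empty separator, the whitespace-run case being analogous):

N = 12  # buffer this many pieces before committing to a result list

def split_no_overalloc(s, sep):
    # Phase 1: collect up to N pieces without creating the result list
    # (modeling the proposed stack vector of split points).
    pieces = []
    start = 0
    while len(pieces) < N:
        i = s.find(sep, start)
        if i < 0:
            pieces.append(s[start:])
            # input exhausted early: exact-size copy, like the
            # list(tuple(...)) trick above
            return list(pieces)
        pieces.append(s[start:i])
        start = i + len(sep)
    # Phase 2: more than N pieces; start from the first N and let
    # ordinary append growth supply O(1) amortized behavior.
    result = list(pieces)
    while True:
        i = s.find(sep, start)
        if i < 0:
            result.append(s[start:])
            return result
        result.append(s[start:i])
        start = i + len(sep)

With N = 12 this behaves like the current code for long inputs, but a 5-piece result gets a list sized exactly 5.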
History
Date                 User           Action  Args
2022-04-11 14:59:57  admin          set     github: 91146
2022-03-12 03:24:49  tim.peters     set     messages: + msg414968
2022-03-12 00:56:51  JelleZijlstra  set     nosy: + JelleZijlstra; messages: + msg414958
2022-03-11 22:37:35  tim.peters     set     type: behavior -> resource usage
2022-03-11 22:37:20  tim.peters     create