
classification
Title: Surprising list overallocation from .split()
Type: resource usage
Stage:
Components: Interpreter Core
Versions: Python 3.11

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: JelleZijlstra, tim.peters
Priority: normal
Keywords:

Created on 2022-03-11 22:37 by tim.peters, last changed 2022-04-11 14:59 by admin.

Messages (3)
msg414942 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-11 22:37
When looking into a StackOverflow question about surprisingly high memory use, I stumbled into this (under 3.10.1, Win64):

>>> import sys
>>> s = "1 2 3 4 5".split()
>>> s
['1', '2', '3', '4', '5']
>>> sys.getsizeof(s)
152
>>> _ - sys.getsizeof([])
96
>>> 96 / 8
12.0

That is, we allocated enough space in the list to store 12(!) elements, even though only 5 are used. Other ways of building a 5-element list I've tried overallocate by at most 3 slots:

>>> sys.getsizeof([ch for ch in "12345"])
120
>>> sys.getsizeof([1, 2, 3, 4, 5])
120

(and 120 - 56 = 64, room for 8 pointers)
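That arithmetic is easy to wrap in a tiny helper for comparing construction methods. A minimal sketch, assuming a 64-bit CPython build where each list slot holds an 8-byte pointer (the helper name is invented here; the two results just restate the numbers above):

>>> import sys
>>> def allocated_slots(lst):
...     # bytes beyond an empty list, divided by the 8-byte slot size
...     return (sys.getsizeof(lst) - sys.getsizeof([])) // 8
...
>>> allocated_slots("1 2 3 4 5".split())
12
>>> allocated_slots([1, 2, 3, 4, 5])
8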

Then there's this curiosity, which allocates space for exactly the 5 needed:

>>> sys.getsizeof(list(tuple("1 2 3 4 5".split())))
96

(and 96 - 56 = 40, room for the 5 pointers needed)

I don't expect this to be consistent, but allocating space for 12 when only 5 are needed is unreasonable. Even allocating space for 8 is pushing it ;-)
msg414958 - (view) Author: Jelle Zijlstra (JelleZijlstra) * (Python committer) Date: 2022-03-12 00:56
The value 12 is hardcoded here: https://github.com/python/cpython/blob/a89c29fbcc7e7e85848499443d819c3fab68c78a/Objects/stringlib/split.h#L14

The comment there says that this is because most .split() calls are on lines of human-readable text, which has about 11 words per line. I don't know if I believe that.
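If that hardcoded 12 really is the preallocation, it should be observable from pure Python by measuring split results of increasing length. A rough probe, under the same 8-byte-slot assumption as above (exact slot counts for the longer inputs depend on the interpreter's list growth pattern):

import sys

base = sys.getsizeof([])
for n in (1, 5, 12, 13, 20):
    parts = ("x " * n).split()   # an n-word string
    slots = (sys.getsizeof(parts) - base) // 8
    # if the hardcoded 12 is in effect, every n <= 12 should report 12 slots
    print(n, slots)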
msg414968 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2022-03-12 03:24
Well, that's annoying ;-) In context, the OP was saving a list of 10 million splits. So each overallocation by a single element burned 80 million bytes of RAM. Overallocating by 7 burned 560 million bytes.

Which is unusual. Usually a split result is short-lived, consumed once then thrown away.

OTOH, the overwhelming motivation for overallocating at all is to achieve O(1) amortized time after a long _sequence_ of appends, and split results typically aren't appended to at all. split() appears to be using it as a timing micro-optimization for tiny lists instead.

So, like I said, it's annoying ;-) For "small" lists, split() really shouldn't overallocate at all (because, as before, split results are rarely appended to). A compromise could be to save pointers to the first N (12, whatever) instances of the splitting string in a stack ("auto") vector, before any list object (or result string object) is created. If the input is exhausted before reaching N, fine, build a result out of exactly what was found. If there's more to do, build a result from the first N, and go on as currently, letting PyList_Append deal with it (overallocation is huge in percentage terms when the list is short, but not so much as the list gets longer).
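For illustration only, here is a rough Python model of that compromise (the real change would be C code in Objects/stringlib/split.h; the names split_no_overalloc and N are made up here, and it assumes a non-empty separator, the whitespace-run case being analogous):

N = 12  # buffer this many pieces before committing to a result list

def split_no_overalloc(s, sep):
    # Phase 1: collect up to N pieces without creating the result list
    # (modeling the proposed stack vector of split points).
    pieces = []
    start = 0
    while len(pieces) < N:
        i = s.find(sep, start)
        if i < 0:
            pieces.append(s[start:])
            # input exhausted early: exact-size copy, like the
            # list(tuple(...)) trick above
            return list(pieces)
        pieces.append(s[start:i])
        start = i + len(sep)
    # Phase 2: more than N pieces; start from the first N and let
    # ordinary append growth supply O(1) amortized behavior.
    result = list(pieces)
    while True:
        i = s.find(sep, start)
        if i < 0:
            result.append(s[start:])
            return result
        result.append(s[start:i])
        start = i + len(sep)

With N = 12 this behaves like the current code for long inputs, but a 5-piece result gets a list sized exactly 5.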
History
Date                 User           Action  Args
2022-04-11 14:59:57  admin          set     github: 91146
2022-03-12 03:24:49  tim.peters     set     messages: + msg414968
2022-03-12 00:56:51  JelleZijlstra  set     nosy: + JelleZijlstra; messages: + msg414958
2022-03-11 22:37:35  tim.peters     set     type: behavior -> resource usage
2022-03-11 22:37:20  tim.peters     create