Issue 23971: dict(list) and dict.fromkeys() doesn't account for 2/3 fill ratio

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68159

classification

Title:	dict(list) and dict.fromkeys() doesn't account for 2/3 fill ratio
Type:	performance	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Matt.Mackall, benjamin.peterson, gregory.p.smith, josh.r, larry, mark.dickinson, marmoute, methane, mpm, python-dev, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2015-04-15 23:35 by larry, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (13)
msg241179 - (view)	Author: Larry Hastings (larry) *	Date: 2015-04-15 23:35
Matt Mackall brought this over to me today. If list "l" has one million pairs and you do dict(l) it creates the dict, then resizes it to a million elements, then starts adding name/value pairs. But since it's sized to a million elements, it gets beyond its "fill ratio" comfort level of 666k and resizes. And when you're up to a million elements, the resize is quite slow. It should have internally resized the dict correctly from the beginning such that no resize was necessary. The same is true for dict.fromkeys--passed in a list of 1m elements, it uses an initial size of 1m, not 1.5m. Attached is a sample patch to change the behavior of both these operations so no resizes are needed during the insertions. I haven't checked that it actually fixes the behavior, I just made the change and ran the regression test. (I'm oversubscribed here at the sprints, and am kind of rushing this diff out just to get the conversation rolling.) Benjamin: what do you think? Would it be appropriate to make a change like this in 2.7? Python 3 (or at least 3.5 trunk) is fancy here. It starts with the minimum size of a dict, then iterates until the new size is > 1.5*the number of elements. This means the dicts it creates are of the same size as they would be if we'd started with a minsize dict and inserted one element at a time. This might help minimize wear and tear on the small block allocator for smaller dicts. BTW, the internal function _PyDict_NewPresize should really have taken the fill ratio into account too. But, even though it's an internal-only function, it's probably too late to change its behavior.
msg241297 - (view)	Author: Josh Rosenberg (josh.r) *	Date: 2015-04-17 00:06
You shouldn't presize when the input is a potentially non-unique iterable. It makes sense to presize for inputs of type dict or set, but for a list, the assumption is that many duplicates will be present, since deduping is a common use case for dicts (less so with the advent of sets, but still common enough to support without massive overallocation of memory). In short, dict([(1, 2)] * 1000000) should not involve allocating GB of memory for the dict up front.
msg241345 - (view)	Author: Matt Mackall (Matt.Mackall)	Date: 2015-04-17 17:17
We already presize the output dict and have for ages. The question here is what size to use. In the current state, we use twice as much memory and CPU as necessary quite often because we end up spilling and growing... even though we apparently intentionally sized the object to fit. In the sparse case, if we allocate a 1GB dictionary and use two entries in it, we will have consumed 1GB of address space but 1kB of actual memory.
msg241445 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2015-04-18 19:10
Won't we always consume the memory thanks to a memset(newtable, 0, ...) https://hg.python.org/cpython/file/df28044b7e14/Objects/dictobject.c#l654 ? (also, i'm not sure if Windows is allocates mapped pages on demand as posix systems tend to)
msg241446 - (view)	Author: Gregory P. Smith (gregory.p.smith) *	Date: 2015-04-18 19:12
fwiw, as for 2.7 i personally don't think I would change its behavior around this at this point. make sure 3.5+ do something desirable. (my link to dictobject.c above is from 2.7)
msg243068 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-05-13 10:13
New changeset cd0706499812 by Raymond Hettinger in branch '2.7': Issue #23971: Fix underestimated presizing in dict.fromkeys() https://hg.python.org/cpython/rev/cd0706499812
msg243069 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-05-13 10:29
Fixed fromkeys() in Py2.7. Stills needs to be forward ported to 3.4/3.5.
msg243080 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-05-13 12:28
- if (dictresize(mp, Py_SIZE(seq))) { + if (dictresize(mp, Py_SIZE(seq) / 2 * 3)) { If Py_SIZE(seq) is 1, dictresize argument is 0. Why not wryte the expression as Py_SIZE(seq) * 3 / 2? It never overflows, because Py_SIZE(seq) is the size of allocated array of pointers.
msg243082 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2015-05-13 12:42
From the patch: - if (dictresize(mp, Py_SIZE(seq))) { + if (dictresize(mp, Py_SIZE(seq) / 2 * 3)) { Isn't there a risk of signed overflow here? The dictresize function has an `assert(minused >= 0)`, which is going to fail for values of `Py_SIZE(seq)` that are close to `PY_SSIZE_T_MAX` in the common case where signed overflow just wraps modulo the appropriate power of 2 (though it's undefined behaviour, so in theory it could do anything).
msg243085 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2015-05-13 12:45
> It never overflows, because Py_SIZE(seq) is the size of allocated array of pointers. Ah, good point.
msg243106 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-05-13 17:49
> If Py_SIZE(seq) is 1, dictresize argument is 0. That doesn't matter. The minimum dict size is 8. > Why not write Py_SIZE(seq) * 3 / 2? Because I prefer the / 2 * 3 style so I don't have to think about whether seq can overflow. Do you really feel like bike-shedding this?
msg243112 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-05-13 18:13
No, I just asking. If the minimum dict size is 8 and there is no special case for empty dicts, all is good to me. But note, that x / 2 * 3 overflows as well as x * 3 / 2.
msg283254 - (view)	Author: Inada Naoki (methane) *	Date: 2016-12-15 08:28
> Fixed fromkeys() in Py2.7. Stills needs to be forward ported to 3.4/3.5. dict.fromkeys() is fixed already in 3.6+. dict(l) doesn't presize the dict. If it's desirable, let's create a new issue.

History
Date	User	Action	Args
2022-04-11 14:58:15	admin	set	github: 68159
2016-12-15 08:28:54	methane	set	status: open -> closed nosy: + methane messages: + msg283254 resolution: fixed
2015-05-13 18:13:01	serhiy.storchaka	set	messages: + msg243112
2015-05-13 17:49:28	rhettinger	set	messages: + msg243106
2015-05-13 12:45:49	mark.dickinson	set	messages: + msg243085
2015-05-13 12:42:56	mark.dickinson	set	nosy: + mark.dickinson messages: + msg243082
2015-05-13 12:28:09	serhiy.storchaka	set	messages: + msg243080
2015-05-13 12:23:20	serhiy.storchaka	set	messages: - msg243078
2015-05-13 12:22:09	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg243078
2015-05-13 10:29:24	rhettinger	set	assignee: rhettinger -> messages: + msg243069
2015-05-13 10:13:37	python-dev	set	nosy: + python-dev messages: + msg243068
2015-04-18 19:12:18	gregory.p.smith	set	messages: + msg241446
2015-04-18 19:10:40	gregory.p.smith	set	nosy: + gregory.p.smith messages: + msg241445
2015-04-17 17:17:02	Matt.Mackall	set	nosy: + Matt.Mackall messages: + msg241345
2015-04-17 00:06:55	josh.r	set	nosy: + josh.r messages: + msg241297
2015-04-16 01:07:26	rhettinger	set	assignee: rhettinger
2015-04-16 00:05:44	serhiy.storchaka	set	nosy: + rhettinger type: behavior -> performance
2015-04-15 23:35:44	larry	create