classification
Title: dict(list) and dict.fromkeys() doesn't account for 2/3 fill ratio
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Matt.Mackall, benjamin.peterson, gregory.p.smith, josh.r, larry, mark.dickinson, marmoute, methane, mpm, python-dev, rhettinger, serhiy.storchaka
Priority: normal Keywords:

Created on 2015-04-15 23:35 by larry, last changed 2016-12-15 08:28 by methane. This issue is now closed.

Messages (13)
msg241179 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-04-15 23:35
Matt Mackall brought this over to me today.

If list "l" has one million pairs and you do
  dict(l)
it creates the dict, then resizes it to a million elements, then starts adding name/value pairs.  But since it's sized to a million elements, it gets beyond its "fill ratio" comfort level of 666k and resizes.  And when you're up to a million elements, the resize is quite slow.  It should have internally resized the dict correctly from the beginning such that no resize was necessary.

The same is true for dict.fromkeys--passed in a list of 1m elements, it uses an initial size of 1m, not 1.5m.

Attached is a sample patch to change the behavior of both these operations so no resizes are needed during the insertions.  I haven't checked that it actually fixes the behavior, I just made the change and ran the regression test.  (I'm oversubscribed here at the sprints, and am kind of rushing this diff out just to get the conversation rolling.)

Benjamin: what do you think?  Would it be appropriate to make a change like this in 2.7?

Python 3 (or at least 3.5 trunk) is fancy here.  It starts with the minimum size of a dict, then iterates until the new size is > 1.5*the  number of elements.  This means the dicts it creates are of the same size as they would be if we'd started with a minsize dict and inserted one element at a time.  This might help minimize wear and tear on the small block allocator for smaller dicts.

BTW, the internal function _PyDict_NewPresize should really have taken the fill ratio into account too.  But, even though it's an internal-only function, it's probably too late to change its behavior.
msg241297 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2015-04-17 00:06
You shouldn't presize when the input is a potentially non-unique iterable. It makes sense to presize for inputs of type dict or set, but for a list, the assumption is that many duplicates will be present, since deduping is a common use case for dicts (less so with the advent of sets, but still common enough to support without massive overallocation of memory).

In short, dict([(1, 2)] * 1000000) should not involve allocating GB of memory for the dict up front.
msg241345 - (view) Author: Matt Mackall (Matt.Mackall) Date: 2015-04-17 17:17
We already presize the output dict and have for ages. The question here is what size to use. In the current state, we use twice as much memory and CPU as necessary quite often because we end up spilling and growing... even though we apparently intentionally sized the object to fit.

In the sparse case, if we allocate a 1GB dictionary and use two entries in it, we will have consumed 1GB of address space but 1kB of actual memory.
msg241445 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-04-18 19:10
Won't we always consume the memory thanks to a memset(newtable, 0, ...) https://hg.python.org/cpython/file/df28044b7e14/Objects/dictobject.c#l654 ?

(also, i'm not sure if Windows is allocates mapped pages on demand as posix systems tend to)
msg241446 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2015-04-18 19:12
fwiw, as for 2.7 i personally don't think I would change its behavior around this at this point.  make sure 3.5+ do something desirable.  (my link to dictobject.c above is from 2.7)
msg243068 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-05-13 10:13
New changeset cd0706499812 by Raymond Hettinger in branch '2.7':
Issue #23971:  Fix underestimated presizing in dict.fromkeys()
https://hg.python.org/cpython/rev/cd0706499812
msg243069 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-13 10:29
Fixed fromkeys() in Py2.7.  Stills needs to be forward ported to 3.4/3.5.
msg243080 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-13 12:28
-            if (dictresize(mp, Py_SIZE(seq))) {
+            if (dictresize(mp, Py_SIZE(seq) / 2 * 3)) {


If Py_SIZE(seq) is 1, dictresize argument is 0.

Why not wryte the expression as Py_SIZE(seq) * 3 / 2? It never overflows, because Py_SIZE(seq) is the size of allocated array of pointers.
msg243082 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-05-13 12:42
From the patch:

-            if (dictresize(mp, Py_SIZE(seq))) {
+            if (dictresize(mp, Py_SIZE(seq) / 2 * 3)) {

Isn't there a risk of signed overflow here?  The dictresize function has an `assert(minused >= 0)`, which is going to fail for values of `Py_SIZE(seq)` that are close to `PY_SSIZE_T_MAX` in the common case where signed overflow just wraps modulo the appropriate power of 2 (though it's undefined behaviour, so in theory it *could* do anything).
msg243085 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015-05-13 12:45
> It never overflows, because Py_SIZE(seq) is the size of allocated array of pointers.

Ah, good point.
msg243106 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-13 17:49
> If Py_SIZE(seq) is 1, dictresize argument is 0.

That doesn't matter.  The minimum dict size is 8.

> Why not write Py_SIZE(seq) * 3 / 2?

Because I prefer the / 2 * 3 style so I don't have to think about whether seq can overflow.  Do you really feel like bike-shedding this?
msg243112 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-13 18:13
No, I just asking. If the minimum dict size is 8 and there is no special case for empty dicts, all is good to me.

But note, that x / 2 * 3 overflows as well as x * 3 / 2.
msg283254 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-12-15 08:28
> Fixed fromkeys() in Py2.7.  Stills needs to be forward ported to 3.4/3.5.

dict.fromkeys() is fixed already in 3.6+.

dict(l) doesn't presize the dict.  If it's desirable, let's create a new issue.
History
Date User Action Args
2016-12-15 08:28:54methanesetstatus: open -> closed

nosy: + methane
messages: + msg283254

resolution: fixed
2015-05-13 18:13:01serhiy.storchakasetmessages: + msg243112
2015-05-13 17:49:28rhettingersetmessages: + msg243106
2015-05-13 12:45:49mark.dickinsonsetmessages: + msg243085
2015-05-13 12:42:56mark.dickinsonsetnosy: + mark.dickinson
messages: + msg243082
2015-05-13 12:28:09serhiy.storchakasetmessages: + msg243080
2015-05-13 12:23:20serhiy.storchakasetmessages: - msg243078
2015-05-13 12:22:09serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg243078
2015-05-13 10:29:24rhettingersetassignee: rhettinger ->
messages: + msg243069
2015-05-13 10:13:37python-devsetnosy: + python-dev
messages: + msg243068
2015-04-18 19:12:18gregory.p.smithsetmessages: + msg241446
2015-04-18 19:10:40gregory.p.smithsetnosy: + gregory.p.smith
messages: + msg241445
2015-04-17 17:17:02Matt.Mackallsetnosy: + Matt.Mackall
messages: + msg241345
2015-04-17 00:06:55josh.rsetnosy: + josh.r
messages: + msg241297
2015-04-16 01:07:26rhettingersetassignee: rhettinger
2015-04-16 00:05:44serhiy.storchakasetnosy: + rhettinger
type: behavior -> performance
2015-04-15 23:35:44larrycreate