
dict.update allocates too much #72695

Closed
serhiy-storchaka opened this issue Oct 22, 2016 · 11 comments
Assignees
Labels
3.7 (EOL) end of life interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@serhiy-storchaka
Member

BPO 28509
Nosy @rhettinger, @benjaminp, @methane, @markshannon, @serhiy-storchaka, @zhangyangyu
Files
  • 28509-smaller-update.patch
  • 28509-smaller-update2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/methane'
    closed_at = <Date 2016-10-27.10:31:04.176>
    created_at = <Date 2016-10-22.19:13:37.906>
    labels = ['interpreter-core', '3.7', 'performance']
    title = 'dict.update allocates too much'
    updated_at = <Date 2017-10-23.11:18:07.738>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2017-10-23.11:18:07.738>
    actor = 'serhiy.storchaka'
    assignee = 'methane'
    closed = True
    closed_date = <Date 2016-10-27.10:31:04.176>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2016-10-22.19:13:37.906>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['45223', '45229']
    hgrepos = []
    issue_num = 28509
    keywords = ['patch']
    message_count = 11.0
    messages = ['279215', '279234', '279235', '279240', '279447', '279458', '279480', '279488', '279497', '279528', '279534']
    nosy_count = 7.0
    nosy_names = ['rhettinger', 'benjamin.peterson', 'methane', 'Mark.Shannon', 'python-dev', 'serhiy.storchaka', 'xiang.zhang']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue28509'
    versions = ['Python 3.6', 'Python 3.7']

    @serhiy-storchaka
    Member Author

    The size of a small key-sharing dictionary (PEP 412) can be larger than the size of a normal dictionary.

    Python 3.6:

    >>> import sys
    >>> def dictsizes(k):
    ...     d = {'a%d'%i: None for i in range(k)}
    ...     class C:
    ...         def __init__(self):
    ...             self.__dict__.update(d)
    ...     return sys.getsizeof(d), sys.getsizeof(C().__dict__)
    ... 
    >>> for i in range(20):
    ...     print(i, dictsizes(i))
    ... 
    0 (128, 60)
    1 (128, 60)
    2 (128, 60)
    3 (128, 60)
    4 (128, 60)
    5 (128, 60)
    6 (196, 196)
    7 (196, 196)
    8 (196, 344)
    9 (196, 344)
    10 (196, 344)
    11 (344, 344)
    12 (344, 344)
    13 (344, 344)
    14 (344, 344)
    15 (344, 344)
    16 (344, 628)
    17 (344, 628)
    18 (344, 628)
    19 (344, 628)

    Normal dictionaries of size 8-10 are more compact than corresponding key-sharing dictionaries.

    Python 3.5:

    >>> for i in range(20):
    ...     print(i, dictsizes(i))
    ... 
    0 (144, 48)
    1 (144, 48)
    2 (144, 48)
    3 (144, 48)
    4 (144, 240)
    5 (144, 240)
    6 (240, 240)
    7 (240, 240)
    8 (240, 432)
    9 (240, 432)
    10 (240, 432)
    11 (240, 432)
    12 (432, 432)
    13 (432, 432)
    14 (432, 432)
    15 (432, 432)
    16 (432, 816)
    17 (432, 816)
    18 (432, 816)
    19 (432, 816)

    In Python 3.5, normal dictionaries of sizes 4-5 and 8-11 are more compact than the corresponding key-sharing dictionaries. Note also that key-sharing dictionaries of sizes 0-3 consume more memory on 3.6 than on 3.5. I think we should use a more thrifty strategy for allocating key-sharing dictionaries.

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Oct 22, 2016
    @methane
    Member

    methane commented Oct 23, 2016

    0 (128, 60)
    1 (128, 60)
    2 (128, 60)
    3 (128, 60)
    4 (128, 60)
    5 (128, 60)

    The minimum dict keysize is 8, and its capacity is 5.

    6 (196, 196)

    The dict is resized. And since __dict__.update() caused the resizing,
    both are normal (combined) dicts.

    7 (196, 196)
    8 (196, 344)

    dict.update() causes the keysize to grow faster.

    Adding to a dict causes a resize when the number of items reaches
    2/3 of keysize. On the other hand, dict.update() targets having
    1/2 of keysize filled.

    In this case, keysize is 16 and 16 * 2 // 3 = 10. Since 8 < 10,
    adding an item to the dict doesn't increase its size. Since
    8 >= (16 / 2), dict.update() creates a dict having keysize == 32.

    (Note: my patch in http://bugs.python.org/issue28147 changes >= to >,
    so keysize == 16 when the number of items == 8.)
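    The two growth rules described above can be sketched as a tiny model. This is a sketch only: the helper names are hypothetical, and the 2/3 and 1/2 thresholds are taken from the explanation above, not from CPython's actual code.

```python
# Hypothetical model of the two growth thresholds described above.
# These helpers are illustrative, not CPython's real implementation.

def grows_on_insert(used, keysize):
    # Plain insertion resizes once the table reaches 2/3 of keysize.
    return used >= keysize * 2 // 3

def grows_on_update(used, keysize):
    # dict.update() pre-resizes once 1/2 of keysize is filled.
    return used >= keysize // 2

# With keysize == 16 and 8 items, as in the example above:
print(grows_on_insert(8, 16))  # False: 8 < 16 * 2 // 3 == 10
print(grows_on_update(8, 16))  # True:  8 >= 16 // 2 == 8
```

    This reproduces the asymmetry above: at 8 items a plain insert leaves keysize at 16, while dict.update() would already grow it to 32.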

    But, while checking this, I found another bug in dict_merge.

        /* Do one big resize at the start, rather than
         * incrementally resizing as we insert new items.  Expect
         * that there will be no (or few) overlapping keys.
         */
        if (mp->ma_keys->dk_usable * 3 < other->ma_used * 2)
            if (dictresize(mp, (mp->ma_used + other->ma_used)*2) != 0)
               return -1;
    

    dk_usable means "how many new items can be inserted without resizing".
    So this if statement should be:

        if (mp->ma_keys->dk_usable < other->ma_used)
    

    @methane
    Member

    methane commented Oct 23, 2016

    And I feel the current target size in dict_merge is a bit too large.

    When inserting a new item:

    • ma_used = dk_size*2 / 3 right before the keys grow
    • ma_used = dk_size / 3 right after the keys grow

    On the other hand, the current dict_merge creates:

    • ma_used = dk_size / 2 when all keys in the two dicts are distinct
    • ma_used = dk_size / 4 when all keys in the two dicts are the same

    If we change it to dictresize(mp, (mp->ma_used + other->ma_used)*3/2):

    • ma_used = dk_size*2 / 3 when all keys in the two dicts are distinct
    • ma_used = dk_size / 3 when all keys in the two dicts are the same

    I think this is more consistent.
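    The claim for the distinct-keys case can be checked numerically. This sketch assumes dictresize picks the smallest power of two strictly greater than the requested size, as CPython 3.6's dictresize sizing loop did; dict_newsize is a hypothetical helper name.

```python
PyDict_MINSIZE = 8

def dict_newsize(minused):
    # Smallest power of two strictly greater than minused,
    # modeled on CPython 3.6's dictresize sizing loop.
    newsize = PyDict_MINSIZE
    while newsize <= minused:
        newsize <<= 1
    return newsize

# With the proposed target (ma_used + other->ma_used) * 3 / 2,
# the resulting fill ratio never exceeds 2/3 of the new keysize,
# matching the threshold used by the plain insert path.
for used in range(1, 1000):
    dk_size = dict_newsize(used * 3 // 2)
    assert used / dk_size < 2 / 3
print("fill ratio stays below 2/3")
```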

    @serhiy-storchaka
    Member Author

    The dict is resized. And since __dict__.update() caused the resizing,
    both are normal (combined) dicts.

    Ah, my bad, I missed this point. The original issue now disappears. Thanks Naoki.

    But dict.update (and maybe insertion) could use a more economical allocation strategy.

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Oct 23, 2016
    @serhiy-storchaka serhiy-storchaka changed the title Key-sharing dictionaries can increase the memory consumption dict.update allocates too much Oct 23, 2016
    @methane
    Member

    methane commented Oct 25, 2016

    script:
    import sys

    for i in range(25):
        a = {}
        b = {j: j for j in range(i)}
        a.update(b)
        print(i, sys.getsizeof(a))

    before:
    0 256
    1 256
    2 256
    3 256
    4 256
    5 256
    6 384
    7 384
    8 664
    9 664
    10 664
    11 664
    12 664
    13 664
    14 664
    15 664
    16 1200
    17 1200
    18 1200
    19 1200
    20 1200
    21 1200
    22 1200
    23 1200
    24 1200

    patched:
    0 256
    1 256
    2 256
    3 256
    4 256
    5 256
    6 384
    7 384
    8 384
    9 384
    10 384
    11 664
    12 664
    13 664
    14 664
    15 664
    16 664
    17 664
    18 664
    19 664
    20 664
    21 664
    22 1200
    23 1200
    24 1200

    @serhiy-storchaka
    Member Author

    LGTM.

    Here is a script that produces more compact output. The first column is the number of items at which the dict is resized; the second column is the dict size before that resize.

    import sys
    p = 10000
    b = {}
    for i in range(10000):
        a = {}
        b[i] = i
        a.update(b)
        s = sys.getsizeof(a)
        if s > p:
            print(i, p)
        p = s

    Unpatched:
    5 128
    7 196
    15 344
    31 628
    63 1208
    127 2612
    255 5176
    511 10292
    1023 20536
    2047 41012
    4095 81976
    8191 163892

    Patched:
    5 128
    10 196
    21 344
    42 628
    85 1208
    170 2612
    341 5176
    682 10292
    1365 20536
    2730 41012
    5461 81976

    But I suggest that instead of the condition

    mp->ma_keys->dk_usable < other->ma_used
    

    we use the condition

    mp->ma_used + mp->ma_keys->dk_usable < other->ma_used
    

    If there are overlapping keys, this can avoid a resize. In the worst case one resize will happen in dictinsert().

    Yet another estimate is:

    USABLE_FRACTION(2 * mp->ma_keys->dk_size) < mp->ma_used + other->ma_used
    

    The dict size is at least doubled after a resize. There is no need for a preliminary resize if the final result is the same. The benefit is that if there are many overlapping keys, the final size can be ESTIMATE_SIZE(other->ma_used) instead of ESTIMATE_SIZE(mp->ma_used + other->ma_used).

    All these conditions can be combined (since they have different computational costs):

    mp->ma_keys->dk_usable < other->ma_used &&
    mp->ma_used + mp->ma_keys->dk_usable < other->ma_used &&
    USABLE_FRACTION(2 * mp->ma_keys->dk_size) < mp->ma_used + other->ma_used
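    A rough Python model of the combined test (the parameter names mirror the C struct fields; usable_fraction follows CPython's USABLE_FRACTION(n) macro, which is roughly 2n/3, and should_preresize is a hypothetical helper):

```python
def usable_fraction(n):
    # CPython's USABLE_FRACTION(n) is (n << 1) / 3, i.e. about 2n/3.
    return (n << 1) // 3

def should_preresize(mp_used, mp_usable, dk_size, other_used):
    # Model of the combined condition: cheapest checks first, and the
    # pre-resize is skipped whenever the merge could fit anyway.
    return (mp_usable < other_used and
            mp_used + mp_usable < other_used and
            usable_fraction(2 * dk_size) < mp_used + other_used)

# Empty dict (keysize 8, 5 usable slots) merging 100 distinct keys:
print(should_preresize(0, 5, 8, 100))  # True: must grow first
# Merging a 5-key dict with an identical one (no usable slots left):
print(should_preresize(5, 0, 8, 5))    # False: keys overlap, it fits
```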
    

    @methane
    Member

    methane commented Oct 26, 2016

    I feel that accepting one resize while merging is better.
    How about this?

        /* Do one big resize at the start, rather than incrementally
         * resizing.  At most one resize happens while merging.
         */
            if (USABLE_FRACTION(mp->ma_keys->dk_size) < other->ma_used) {
                assert(mp->ma_used < other->ma_used);
                if (dictresize(mp, ESTIMATE_SIZE(other->ma_used))) {
                   return -1;
                }
            }

    @rhettinger
    Contributor

    FWIW, I prefer the existing code be left as is. We've already greatly compacted the dictionaries. There is no need to be ultra-aggressive in shaving off every little corner. There is some advantage to having the dict be more sparse (fewer collisions, quicker insertion time for the update, quicker lookups etc).

    @methane
    Member

    methane commented Oct 26, 2016

    OK, I won't change it to allow an additional resize while merging after a pre-resize.
    But the current code has two problems:

    • A pre-resize happens even when resizing is not necessary (e.g. the two dicts have the same keys).
    • The pre-resize allocates too much memory, which doesn't make sense.

    The next patch (28509-smaller-update2.patch) seems better because:

    • When the pre-resize doesn't happen, at most one resize occurs while merging.
    • When the pre-resize happens, no resize occurs while merging.
    • It doesn't create a surprisingly sparse dict when the two dicts have the same keys.

    PoC code:

    import sys
    b = {}
    for i in range(16):
        b[i] = i
        a = b.copy()
        a.update(b)  # No need to resize
        print(i, sys.getsizeof(a), sys.getsizeof(b))

    Current:

    0 256 256
    1 256 256
    2 256 256
    3 664 256
    4 664 256
    5 384 384 # !!!
    6 664 384
    7 664 384
    8 664 384
    9 664 384
    10 664 664
    11 664 664
    12 1200 664
    13 1200 664
    14 1200 664
    15 1200 664

    With second patch:

    0 256 256
    1 256 256
    2 256 256
    3 256 256
    4 256 256
    5 384 384
    6 384 384
    7 384 384
    8 384 384
    9 384 384
    10 664 664
    11 664 664
    12 664 664
    13 664 664
    14 664 664
    15 664 664

    @serhiy-storchaka
    Member Author

    28509-smaller-update2.patch LGTM.

    Your idea in msg279480 looks worthwhile, but it needs benchmarking against different corner cases to check that it doesn't cause a regression. I think we will have enough time for that during the 3.7 development stage.

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 27, 2016

    New changeset 8c2615decd2e by INADA Naoki in branch '3.6':
    Issue bpo-28509: dict.update() no longer allocate unnecessary large memory
    https://hg.python.org/cpython/rev/8c2615decd2e

    New changeset deb3e5857d8c by INADA Naoki in branch 'default':
    Issue bpo-28509: dict.update() no longer allocate unnecessary large memory
    https://hg.python.org/cpython/rev/deb3e5857d8c

    @methane methane closed this as completed Oct 27, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022