dict.update allocates too much #72695
Comments
The size of a small key-sharing dictionary (PEP 412) can be larger than the size of a normal dictionary. Python 3.6:

>>> import sys
>>> def dictsizes(k):
...     d = {'a%d'%i: None for i in range(k)}
...     class C:
...         def __init__(self):
...             self.__dict__.update(d)
...     return sys.getsizeof(d), sys.getsizeof(C().__dict__)
...
>>> for i in range(20):
...     print(i, dictsizes(i))
...
0 (128, 60)
1 (128, 60)
2 (128, 60)
3 (128, 60)
4 (128, 60)
5 (128, 60)
6 (196, 196)
7 (196, 196)
8 (196, 344)
9 (196, 344)
10 (196, 344)
11 (344, 344)
12 (344, 344)
13 (344, 344)
14 (344, 344)
15 (344, 344)
16 (344, 628)
17 (344, 628)
18 (344, 628)
19 (344, 628)

Normal dictionaries of size 8-10 are more compact than the corresponding key-sharing dictionaries. Python 3.5:

>>> for i in range(20):
...     print(i, dictsizes(i))
...
0 (144, 48)
1 (144, 48)
2 (144, 48)
3 (144, 48)
4 (144, 240)
5 (144, 240)
6 (240, 240)
7 (240, 240)
8 (240, 432)
9 (240, 432)
10 (240, 432)
11 (240, 432)
12 (432, 432)
13 (432, 432)
14 (432, 432)
15 (432, 432)
16 (432, 816)
17 (432, 816)
18 (432, 816)
19 (432, 816)

In Python 3.5, normal dictionaries of size 4-5 and 8-11 are more compact than the corresponding key-sharing dictionaries. And note that key-sharing dictionaries of size 0-3 consume more memory on 3.6 than on 3.5. I think we should use a thriftier strategy for allocating key-sharing dictionaries. |
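The step pattern in the tables above can be reproduced with a small loop. This is a sketch, not part of the original report: exact byte counts vary by CPython version and build, but each jump in sys.getsizeof marks a resize of the hash table.

```python
import sys

# Track how a plain dict's reported size grows as items are inserted
# one at a time. Each jump between consecutive sizes marks a resize.
sizes = []
d = {}
for i in range(20):
    d['a%d' % i] = None
    sizes.append(sys.getsizeof(d))

for n, s in enumerate(sizes, 1):
    print(n, s)
```

On an insert-only workload the reported size never shrinks, so the resize points show up as a staircase.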
The minimum dict keysize is 8, and its capacity is 5.
The dict is resized, and since __dict__.update() caused the resizing,
dict.update() makes the keysize grow faster. Adding to a dict causes a resize when the number of items reaches 2/3 of the keysize. In this case, the keysize is 16 and 16 * 2 // 3 = 10. Since 8 < 10, adding (note: my patch in http://bugs.python.org/issue28147 changes >= to >). But while checking this, I found another bug in dict_merge.
dk_usable means "how many new items can be inserted without resizing".
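The 2/3 capacity rule can be modeled in a couple of lines. This is my sketch of CPython's USABLE_FRACTION macro from Objects/dictobject.c (the function name is mine); it reproduces the arithmetic quoted above.

```python
# Model of CPython's USABLE_FRACTION macro (Objects/dictobject.c):
# a table of keysize n holds at most 2n/3 items before a resize.
def usable_fraction(n):
    return (n << 1) // 3

# The minimum keysize is 8, so a fresh dict can hold 5 items.
print(usable_fraction(8))   # -> 5
# After growing to keysize 16, capacity is 16 * 2 // 3 = 10.
print(usable_fraction(16))  # -> 10
```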
|
And I feel the current target size in dict_merge is a bit large. When inserting a new item:
On the other hand, the current dict_merge creates:
If it is changed to dictresize(mp, (mp->ma_used + other->ma_used)*3/2),
I think this is more consistent. |
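The suggested (n*3)/2 target can be modeled in Python. estimate_size below mirrors the ESTIMATE_SIZE idea that appears later in this thread, and keysize_for rounds up to the power-of-two table size dictresize would pick; both helper names are mine, not CPython's.

```python
def estimate_size(n):
    # (n * 3 + 1) >> 1: reserve roughly 1.5x the expected item count,
    # mirroring the ESTIMATE_SIZE idea discussed in this thread.
    return (n * 3 + 1) >> 1

def keysize_for(minused):
    # dictresize rounds up to the smallest power-of-two keysize that
    # is >= the requested minimum, starting from the minimum size of 8.
    size = 8
    while size < minused:
        size <<= 1
    return size

# Merging dicts with 7 and 5 items: reserve ~18 slots, which rounds
# up to a keysize of 32.
print(keysize_for(estimate_size(7 + 5)))  # -> 32
```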
Ah, my bad, I missed this point. The original issue now disappears. Thanks, Naoki. But dict.update (and maybe inserting) could use a more economical allocation strategy. |
script:

import sys
for i in range(25):
    a = {}
    b = {j: j for j in range(i)}
    a.update(b)
    print(i, sys.getsizeof(a))

before: patched: |
LGTM. Here is a script that produces more compact output. The first column is a size after which the dict is resized.

import sys
p = 0
b = {}
for i in range(10000):
    a = {}
    b[i] = i
    a.update(b)
    s = sys.getsizeof(a)
    if s > p:
        print(i, p)
        p = s

Unpatched: Patched:

But I suggest instead the condition
use the condition
If there are overlapping keys, this can avoid a resizing. In the worst case, one resizing will happen in dictinsert(). Yes, one estimation is:
Dict size is at least doubled after resizing. There is no need to make a preliminary resizing if the final result is the same. The benefit is that if there are many overlapping keys, the final size can be ESTIMATE_SIZE(other->ma_used) instead of ESTIMATE_SIZE(mp->ma_used + other->ma_used). All these conditions can be combined (because they have different computational costs):
|
I feel that accepting one resize while merging is better.

/* Do one big resize at the start, rather than incrementally
 * resizing. At most one resize happens while merging.
 */
if (USABLE_FRACTION(mp->ma_keys->dk_size) < other->ma_used) {
    assert(mp->ma_used < other->ma_used);
    if (dictresize(mp, ESTIMATE_SIZE(other->ma_used))) {
        return -1;
    }
}
|
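The decision in the C snippet above can be mirrored in Python for clarity. This is a model under the assumptions stated earlier in the thread (minimum keysize 8, 2/3 usable fraction, 1.5x estimate); the function names are mine.

```python
def usable_fraction(n):
    # Max items a table of keysize n can hold before resizing.
    return (n << 1) // 3

def estimate_size(n):
    # Target passed to dictresize: room for ~1.5x n items.
    return (n * 3 + 1) >> 1

def needs_preresize(keysize, other_used):
    # Mirror of the C condition: resize up front only when the
    # current table cannot absorb the other dict's items.
    return usable_fraction(keysize) < other_used

# A fresh dict (keysize 8, capacity 5) merging 6 items must resize,
# and the resize target is estimate_size(6) = 9 slots:
print(needs_preresize(8, 6), estimate_size(6))  # -> True 9
# Merging 5 items fits in place, so no pre-resize is needed:
print(needs_preresize(8, 5))  # -> False
```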
FWIW, I prefer the existing code be left as is. We've already greatly compacted the dictionaries. There is no need to be ultra-aggressive in shaving off every little corner. There is some advantage to having the dict be more sparse (fewer collisions, quicker insertion time for the update, quicker lookups etc). |
OK, I won't change it to allow an additional resize while merging after the pre-resize.
Next patch (28509-smaller-update2.patch) seems better because:

PoC code:

import sys
b = {}
for i in range(16):
    b[i] = i
    a = b.copy()
    a.update(b)  # No need to resize
    print(i, sys.getsizeof(a), sys.getsizeof(b))

Current: 0 256 256 With second patch: 0 256 256 |
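The no-resize property in the PoC above can be checked directly: updating a dict with a copy of itself introduces no new keys, so its allocated size should not change. This is a sketch; absolute byte counts differ across CPython versions, so only the before/after comparison is asserted.

```python
import sys

# All keys of b already exist in a (a is a copy of b), so update()
# only overwrites values and never triggers a table resize.
b = {i: i for i in range(16)}
a = b.copy()
before = sys.getsizeof(a)
a.update(b)
after = sys.getsizeof(a)
print(before == after)  # -> True
```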
28509-smaller-update2.patch LGTM. Your idea in msg279480 looks worthwhile, but it needs benchmarking of different corner cases to check that it doesn't cause a regression. I think we will have enough time for this during the 3.7 development stage. |
New changeset 8c2615decd2e by INADA Naoki in branch '3.6': New changeset deb3e5857d8c by INADA Naoki in branch 'default': |