classification
Title: sizeof set after set_merge() is doubled from 3.5
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: cameron, inada.naoki, jgosmann, rhettinger, serhiy.storchaka
Priority: normal Keywords: 3.6regression

Created on 2017-03-30 18:09 by inada.naoki, last changed 2017-04-01 14:29 by inada.naoki. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 943 merged inada.naoki, 2017-04-01 06:40
PR 945 merged inada.naoki, 2017-04-01 08:32
Messages (12)
msg290868 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-03-30 18:09
(original thread is https://mail.python.org/pipermail/python-list/2017-March/720391.html)

https://github.com/python/cpython/commit/4897300276d870f99459c82b937f0ac22450f0b6

this commit doubles sizeof set object created by set_merge().
It is used by constructor of set and frozenset.

$ /usr/bin/python3
Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
[GCC 6.2.0 20160927] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> s = set(range(10))
>>> sys.getsizeof(frozenset(s))
736

$ python3
Python 3.6.0 (default, Dec 30 2016, 20:49:54)
[GCC 6.2.0 20161005] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import  sys
>>> s = set(range(10))
>>> sys.getsizeof(frozenset(s))
1248
msg290888 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-03-31 03:16
https://gist.github.com/methane/8faf12621cdb2166019bbcee65987e99
This patch fixes the regression.

But I think it is still larger than idiomatic.  See this code:

code:
    minused *= 2;
    /* Find the smallest table size > minused. */
    /* XXX speed-up with intrinsics */
    for (newsize = PySet_MINSIZE;
         newsize <= minused && newsize > 0;
         newsize <<= 1)
        ;

When original minused is X, newsize will be about 2X ~ 4X.

For set.add(), preserving extra space for further add() make sense.
But for set_merge(), intention is avoiding repeated resize while merging.
There may be not "further add()", especially for `frozenset(s)`.
So 30% ~ 60% seems better than 25% ~ 50%.

How do you think, Raymond?
msg290906 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-31 11:25
I'll look at this over the next week or two.  I don't really like the proposed patch at all but will carefully think through the speed/space trade-offs.
msg290908 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-31 12:18
Hmm, I wonder why I'm not seeing the same sizes you are seeing.

$ cat setsize.py
from sys import getsizeof
print( [getsizeof(frozenset(range(n))) for n in range(20)] )
$ python3.4 setsize.py
[224, 224, 224, 224, 224, 224, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736]
$ python3.5 setsize.py
[224, 224, 224, 224, 224, 224, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736]
$ python3.6 setsize.py
[224, 224, 224, 224, 224, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736, 736]
$ python3.4 --version
Python 3.4.4
$ python3.5 --version
Python 3.5.3
$ python3.6 --version
Python 3.6.1
msg290909 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-03-31 12:22
See set_update_internal().
https://github.com/python/cpython/blob/master/Objects/setobject.c#L969-L1016

This happens only when iterable is set or dict.

>>> import sys
>>> sys.getsizeof(set(range(10)))
736
>>> sys.getsizeof(set(set(range(10))))
1248
>>> sys.getsizeof(set(dict.fromkeys(range(10))))
1248
msg290911 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-31 12:50
$ ./python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(range(n))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )'

3.5:  [(0, 112), (6, 368), (22, 1136), (86, 4208), (342, 16496), (1366, 65648), (5462, 262256)]
3.6:  [(0, 112), (5, 368), (21, 1136), (85, 4208), (341, 16496), (1365, 65648), (5461, 262256)]
3.7:  [(0, 112), (5, 368), (19, 1136), (77, 4208), (307, 16496), (1229, 65648), (4915, 262256)]

$ ./python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(set(range(n)))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )'

3.5:  [(0, 112), (6, 240), (8, 368), (16, 624), (32, 1136), (64, 2160), (128, 4208), (256, 8304), (512, 16496), (1024, 32880), (2048, 65648), (4096, 131184)]
3.6:  [(0, 112), (5, 368), (8, 624), (16, 1136), (32, 2160), (64, 4208), (128, 8304), (256, 16496), (512, 32880), (1024, 65648), (2048, 131184), (4096, 262256)]
3.7:  [(0, 112), (5, 368), (8, 624), (16, 1136), (32, 2160), (64, 4208), (128, 8304), (256, 16496), (512, 32880), (1024, 65648), (2048, 131184), (4096, 262256)]

frozenset(range(n)) is slightly larger in 3.7 than in 3.6. It is 4 times larger for about 10% of sizes.

frozenset(set(range(n))) is 2 times larger in 3.6 than in 3.5 for all sizes >= 16.
msg290912 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-03-31 12:52
> frozenset(range(n)) is slightly larger in 3.7 than in 3.6. It is 4 times larger for about 10% of sizes.

This is intensional:
https://github.com/python/cpython/commit/5cd87a8d61246b0a6233bfb8503d4718b693cef0

load factor is reduced from 66% to 60%. (-10%)
msg290915 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-31 15:24
I think the best thing to do is to undo the refactoring in https://github.com/python/cpython/commit/4897300276d870f99459c82b937f0ac22450f0b6 .   It is was intended be neutral but did affect set_update_internal() for small sets.
msg290928 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-03-31 17:19
I agree.  Before thinking about rebalance between size and speed,
resolving regression is important.
msg290934 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-03-31 18:01
Do you want to prepare a PR for me?  I not yet set-up with the ways of Github.  Please limit the PR to just unwinding the refactoring in the simplest possible way..

If in the future you want to chat about speed/space trade-offs, we can do that offline.  I've spent years thinking about this, interacting with users, discussing with other devs, speaking on the topic, and working through use cases for sets.  The original reasons for the choices made largely are still true today.  I would be happy to walk you through the history (the tracker isn't a good place to do this, IRC would serve us better).
msg290970 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-04-01 08:20
New changeset e82cf8675bacd7a03de508ed11865fc2701dcef5 by INADA Naoki in branch 'master':
bpo-29949: Fix set memory usage regression (GH-943)
https://github.com/python/cpython/commit/e82cf8675bacd7a03de508ed11865fc2701dcef5
msg290984 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2017-04-01 14:29
New changeset efde51ad54c58353f25ff80c8d30dbee82ef33a3 by INADA Naoki in branch '3.6':
bpo-29949: Fix set memory usage regression (GH-945)
https://github.com/python/cpython/commit/efde51ad54c58353f25ff80c8d30dbee82ef33a3
History
Date User Action Args
2017-04-01 14:29:33inada.naokisetmessages: + msg290984
2017-04-01 11:53:48inada.naokisetstatus: open -> closed
resolution: fixed
stage: resolved
2017-04-01 08:32:42inada.naokisetpull_requests: + pull_request1127
2017-04-01 08:20:27inada.naokisetmessages: + msg290970
2017-04-01 06:40:30inada.naokisetpull_requests: + pull_request1125
2017-03-31 23:12:12cameronsetnosy: + cameron
2017-03-31 18:01:38rhettingersetmessages: + msg290934
2017-03-31 17:19:50inada.naokisetmessages: + msg290928
2017-03-31 15:24:31rhettingersetmessages: + msg290915
2017-03-31 12:52:51inada.naokisetmessages: + msg290912
2017-03-31 12:50:37serhiy.storchakasetmessages: + msg290911
2017-03-31 12:22:41inada.naokisetmessages: + msg290909
2017-03-31 12:18:05rhettingersetmessages: + msg290908
2017-03-31 11:25:48rhettingersetmessages: + msg290906
2017-03-31 11:11:54rhettingersetassignee: rhettinger
2017-03-31 03:16:56inada.naokisetmessages: + msg290888
2017-03-30 18:25:51jgosmannsetnosy: + jgosmann
2017-03-30 18:15:04serhiy.storchakasetnosy: + serhiy.storchaka
2017-03-30 18:09:13inada.naokicreate