classification
Title: More compact sets and frozensets created from sets
Type: resource usage Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: methane, rhettinger, serhiy.storchaka
Priority: normal Keywords:

Created on 2017-04-01 11:52 by serhiy.storchaka, last changed 2017-04-02 05:36 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 946 closed serhiy.storchaka, 2017-04-01 11:54
Messages (7)
msg290980 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-01 11:52
For now new set and frozenset objects can allocate 2 times larger table than necessary when are created from set or dict. For example if the size n of the original set is the power of two, resulting set will allocate the table of 4*n rather that 2*n. Up to 20% of new sets use twice more memory than necessary.

Proposed patch makes set_merge() allocating the table of size n*5/3 instead of n*2. This is the minimal size necessary for inserting all elements with fill ration <=60%.

$ ./python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(set(range(n)))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )'
Unpatched:  [(0, 112), (5, 240), (8, 368), (16, 624), (32, 1136), (64, 2160), (128, 4208), (256, 8304), (512, 16496), (1024, 32880), (2048, 65648), (4096, 131184)]
Patched:    [(0, 112), (5, 240), (9, 368), (19, 624), (38, 1136), (77, 2160), (153, 4208), (307, 8304), (614, 16496), (1229, 32880), (2457, 65648), (4915, 131184)]
msg290992 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-01 15:53
FWIW, I've long considered this to be a feature of set and dicts, it is the only way a user can affect sparseness.  For lookup only sets, the extra sparseness is a virtue (fewer collisions).  So, I would like to leave the current code in-place.  

With sets the goals are different from dicts.  I place less value on compactness and more value on fast membership testing.  And with dicts, the norm is that the looked-up key usually exists.  With sets, the goal isn't looking an element known to be in the set, but more about determining whether an element is in the set.  Accordingly, sets have to optimize for misses as well as hits.

In the future, please assign set feature requests to me so that I don't miss them.
msg290997 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-01 16:26
There is a program that uses 32 GB for sets. The patch could save up to 6 GB for it.

I understood your arguments when you lowered the maximal fill ration from 2/3 to 60%. But in that case the fill ratio is between 15% and 30%! Isn't it too small?
msg291020 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-04-02 04:05
I also feel set created from iterator is too large sometimes.
Especially, frozenset is used for "unordered tuple" dict key sometimes.

$ python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(range(n))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )'
[(0, 224), (5, 736), (21, 2272), (85, 8416), (341, 32992), (1365, 131296), (5461, 524512)]

$ python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(dict.fromkeys(range(n))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )'
[(0, 240), (6, 368), (11, 648), (22, 1184), (43, 2280), (86, 4704), (171, 9320), (342, 18528), (683, 36968), (1366, 73824), (2731, 147560), (5462, 295008)]

set size increases by *4 each resize.
This is code of set_add_entry():

    if ((size_t)so->fill*5 < mask*3)
        return 0;
    return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);

If we simply replace `so->used>50000 ? so->used*2 : so->used*4` with `so->used*2`, there may be not enough room when there are some dummy entries. (used + dummy = fill).

But how about replacing `*4` with `*3`?
In general, load factor after resize will be changed from 12.5% ~ 25% to 16.6% ~ 33.3%. It seems OK to me.
In case of no removal (e.g. constructor called with iterator), set size increases by *2 each time:

>>> for n in range(3, 20):
...     print(math.ceil(2 ** n * 3/5) * 3, 2 ** (n+1))
15 16
30 32
60 64
117 128
231 256
462 512
924 1024
1845 2048
3687 4096
7374 8192
14748 16384
29493 32768
58983 65536
117966 131072
235932 262144
471861 524288
943719 1048576
msg291021 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-04-02 04:15
You guys seem to have an enormous about of spare time and seem to feel to need to redesign every single thing you see.  Please slow down and allow me to breathe a bit.  I cannot keep up with a new "let's change x" every single day.  This set object has been a successful design for a very long time.  I like the balance that it has struck and do not want to constantly mess with its tuning or parameters.

I working on bigger ideas at the moment that will make all of this irrelevant but can't make progress with you two proposing micro changes on an almost daily basis.

Please pick some actual bugs to go fix or implement a feature request to propel the language forward.  Eating up 100% of my development time is crippling my ability to contribute.  And I've grown weary of revisiting old ground over and over.  Please move on.
msg291022 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-04-02 04:19
OK, sorry about bothering you.
msg291023 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-02 05:36
Okay. I was going to provide patches for other set's weirdness (for example a set can increase its size after removing elements), but with your reaction I see that this doesn't make sense. I wash my hands from fixing set.
History
Date User Action Args
2017-04-02 05:36:13serhiy.storchakasetmessages: + msg291023
2017-04-02 04:19:59methanesetmessages: + msg291022
2017-04-02 04:15:05rhettingersetmessages: + msg291021
2017-04-02 04:05:57methanesetmessages: + msg291020
2017-04-01 16:26:27serhiy.storchakasetmessages: + msg290997
2017-04-01 15:53:26rhettingersetstatus: open -> closed
messages: + msg290992

assignee: rhettinger
resolution: rejected
stage: patch review -> resolved
2017-04-01 11:54:22serhiy.storchakasetpull_requests: + pull_request1128
2017-04-01 11:52:41serhiy.storchakacreate