Issue29961
Created on 2017-04-01 11:52 by serhiy.storchaka, last changed 2017-04-02 05:36 by serhiy.storchaka. This issue is now closed.
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 946 | closed | serhiy.storchaka, 2017-04-01 11:54 |
Messages (7) | |||
---|---|---|---|
msg290980 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2017-04-01 11:52 | |
For now new set and frozenset objects can allocate 2 times larger table than necessary when are created from set or dict. For example if the size n of the original set is the power of two, resulting set will allocate the table of 4*n rather that 2*n. Up to 20% of new sets use twice more memory than necessary. Proposed patch makes set_merge() allocating the table of size n*5/3 instead of n*2. This is the minimal size necessary for inserting all elements with fill ration <=60%. $ ./python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(set(range(n)))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )' Unpatched: [(0, 112), (5, 240), (8, 368), (16, 624), (32, 1136), (64, 2160), (128, 4208), (256, 8304), (512, 16496), (1024, 32880), (2048, 65648), (4096, 131184)] Patched: [(0, 112), (5, 240), (9, 368), (19, 624), (38, 1136), (77, 2160), (153, 4208), (307, 8304), (614, 16496), (1229, 32880), (2457, 65648), (4915, 131184)] |
|||
msg290992 - (view) | Author: Raymond Hettinger (rhettinger) * ![]() |
Date: 2017-04-01 15:53 | |
FWIW, I've long considered this to be a feature of set and dicts, it is the only way a user can affect sparseness. For lookup only sets, the extra sparseness is a virtue (fewer collisions). So, I would like to leave the current code in-place. With sets the goals are different from dicts. I place less value on compactness and more value on fast membership testing. And with dicts, the norm is that the looked-up key usually exists. With sets, the goal isn't looking an element known to be in the set, but more about determining whether an element is in the set. Accordingly, sets have to optimize for misses as well as hits. In the future, please assign set feature requests to me so that I don't miss them. |
|||
msg290997 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2017-04-01 16:26 | |
There is a program that uses 32 GB for sets. The patch could save up to 6 GB for it. I understood your arguments when you lowered the maximal fill ration from 2/3 to 60%. But in that case the fill ratio is between 15% and 30%! Isn't it too small? |
|||
msg291020 - (view) | Author: Inada Naoki (methane) * ![]() |
Date: 2017-04-02 04:05 | |
I also feel set created from iterator is too large sometimes. Especially, frozenset is used for "unordered tuple" dict key sometimes. $ python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(frozenset(range(n))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )' [(0, 224), (5, 736), (21, 2272), (85, 8416), (341, 32992), (1365, 131296), (5461, 524512)] $ python -c 'N = 6000; from sys import getsizeof; s = [getsizeof(dict.fromkeys(range(n))) for n in range(N)]; print( [(n, s[n]) for n in range(N) if not n or s[n] != s[n-1]] )' [(0, 240), (6, 368), (11, 648), (22, 1184), (43, 2280), (86, 4704), (171, 9320), (342, 18528), (683, 36968), (1366, 73824), (2731, 147560), (5462, 295008)] set size increases by *4 each resize. This is code of set_add_entry(): if ((size_t)so->fill*5 < mask*3) return 0; return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4); If we simply replace `so->used>50000 ? so->used*2 : so->used*4` with `so->used*2`, there may be not enough room when there are some dummy entries. (used + dummy = fill). But how about replacing `*4` with `*3`? In general, load factor after resize will be changed from 12.5% ~ 25% to 16.6% ~ 33.3%. It seems OK to me. In case of no removal (e.g. constructor called with iterator), set size increases by *2 each time: >>> for n in range(3, 20): ... print(math.ceil(2 ** n * 3/5) * 3, 2 ** (n+1)) 15 16 30 32 60 64 117 128 231 256 462 512 924 1024 1845 2048 3687 4096 7374 8192 14748 16384 29493 32768 58983 65536 117966 131072 235932 262144 471861 524288 943719 1048576 |
|||
msg291021 - (view) | Author: Raymond Hettinger (rhettinger) * ![]() |
Date: 2017-04-02 04:15 | |
You guys seem to have an enormous about of spare time and seem to feel to need to redesign every single thing you see. Please slow down and allow me to breathe a bit. I cannot keep up with a new "let's change x" every single day. This set object has been a successful design for a very long time. I like the balance that it has struck and do not want to constantly mess with its tuning or parameters. I working on bigger ideas at the moment that will make all of this irrelevant but can't make progress with you two proposing micro changes on an almost daily basis. Please pick some actual bugs to go fix or implement a feature request to propel the language forward. Eating up 100% of my development time is crippling my ability to contribute. And I've grown weary of revisiting old ground over and over. Please move on. |
|||
msg291022 - (view) | Author: Inada Naoki (methane) * ![]() |
Date: 2017-04-02 04:19 | |
OK, sorry about bothering you. |
|||
msg291023 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * ![]() |
Date: 2017-04-02 05:36 | |
Okay. I was going to provide patches for other set's weirdness (for example a set can increase its size after removing elements), but with your reaction I see that this doesn't make sense. I wash my hands from fixing set. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2017-04-02 05:36:13 | serhiy.storchaka | set | messages: + msg291023 |
2017-04-02 04:19:59 | methane | set | messages: + msg291022 |
2017-04-02 04:15:05 | rhettinger | set | messages: + msg291021 |
2017-04-02 04:05:57 | methane | set | messages: + msg291020 |
2017-04-01 16:26:27 | serhiy.storchaka | set | messages: + msg290997 |
2017-04-01 15:53:26 | rhettinger | set | status: open -> closed messages: + msg290992 assignee: rhettinger resolution: rejected stage: patch review -> resolved |
2017-04-01 11:54:22 | serhiy.storchaka | set | pull_requests: + pull_request1128 |
2017-04-01 11:52:41 | serhiy.storchaka | create |