Speedup empty dict creation and reduce its memory usage #74226
dict.clear() makes the dict an empty key-sharing dict to reduce its size.

$ ./python.default
Python 3.7.0a0 (heads/master:6dfcc81, Apr 10 2017, 19:55:52)
[GCC 6.2.0 20161005] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> d = {}
>>> sys.getsizeof(d)
240
>>> d.clear()
>>> sys.getsizeof(d)
72
$ ./python.patched
Python 3.7.0a0 (heads/master-dirty:6dfcc81, Apr 11 2017, 18:11:02)
[GCC 6.2.0 20161005] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof({})
72 |
performance impact best case: worst case: |
Isn't the latter case the more common one? Creating an empty dict and then populating it. |
This is memory usage optimization, not performance optimization. |
Macro benchmark result:

$ ./python.default -m perf compare_to -G --min-speed=1 default.json patched.json
Slower (11):
- scimark_lu: 362 ms +- 13 ms -> 383 ms +- 22 ms: 1.06x slower (+6%)
- unpickle_pure_python: 882 us +- 18 us -> 924 us +- 18 us: 1.05x slower (+5%)
- regex_v8: 45.4 ms +- 0.6 ms -> 46.7 ms +- 3.1 ms: 1.03x slower (+3%)
- mako: 40.4 ms +- 0.4 ms -> 41.4 ms +- 0.4 ms: 1.03x slower (+3%)
- meteor_contest: 200 ms +- 1 ms -> 204 ms +- 2 ms: 1.02x slower (+2%)
- genshi_text: 88.8 ms +- 1.2 ms -> 90.1 ms +- 1.6 ms: 1.01x slower (+1%)
- scimark_monte_carlo: 255 ms +- 6 ms -> 258 ms +- 7 ms: 1.01x slower (+1%)
- richards: 176 ms +- 4 ms -> 178 ms +- 8 ms: 1.01x slower (+1%)
- pickle: 24.2 us +- 0.5 us -> 24.4 us +- 0.7 us: 1.01x slower (+1%)
- sympy_str: 438 ms +- 3 ms -> 442 ms +- 3 ms: 1.01x slower (+1%)
- genshi_xml: 196 ms +- 3 ms -> 198 ms +- 2 ms: 1.01x slower (+1%)
Faster (7):
Benchmark hidden because not significant (46) |
I mean creating a lone empty dict doesn't seem to make much sense. Although it saves memory, once the dict is populated it is resized and the memory usage comes back. And this makes PyDict_New() harder to understand. :-( |
Use "--duplicate 100" when making microbenchmarks for such fast operations. The overhead of iterating can be significant and comparable with the time of the operation. |
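The same pitfall can be illustrated with the stdlib timeit module (a rough sketch; absolute numbers depend on the machine). Duplicating the statement amortizes the per-iteration loop overhead, which is what perf's --duplicate option does:

```python
import timeit

# Naive timing: one "{}" per loop iteration, so loop bookkeeping is a
# large fraction of each measurement for such a fast operation.
single = timeit.timeit("{}", number=1_000_000)

# Duplicate the statement 100 times per iteration (like --duplicate 100)
# so the loop overhead is spread over 100 dict creations.
dup = timeit.timeit("{};" * 100, number=10_000)

# Both runs perform 1,000,000 dict creations in total.
print(f"naive per-op:      {single / 1_000_000 * 1e9:.1f} ns")
print(f"duplicated per-op: {dup / 1_000_000 * 1e9:.1f} ns")
```

On most machines the duplicated figure comes out noticeably smaller, showing how much of the naive number was loop overhead rather than the operation itself.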
While I think it's preferable that {} and d.clear() have the same memory footprint, I'll check the memory usage difference with the application I used in this ML thread. |
I've worked on an application (proprietary, unfortunately) that created a lot of empty dictionaries that only sometimes got populated. It involved sqlalchemy, but I don't remember if the dicts came from sqlalchemy itself or from the code that used it. That application did care about memory, and the shared-key dicts were a big benefit to it. |
Thank you for your reply. [1] https://patch-diff.githubusercontent.com/raw/python/cpython/pull/1080.patch |
Sorry, but I no longer have access to that application (I'm a consultant, and the owner is no longer a client). |
But sometimes it's not populated:

class A:
    def __init__(self, **kwargs):
        self._extra = kwargs

xa = [A() for _ in range(1000)]

So the problem is (a) how many empty dicts there are, and (b) how much memory this patch saves.
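The scenario above can be quantified with a small sketch. Exact byte counts are implementation details that vary by Python version and build; on the interpreter discussed earlier in this thread, a fresh empty dict was 240 bytes while an empty key-sharing dict was 72 bytes, so 1000 of them differ by roughly 168 KB:

```python
import sys

class A:
    def __init__(self, **kwargs):
        self._extra = kwargs  # usually stays empty

xa = [A() for _ in range(1000)]

# Rough lower bound on the memory held by the 1000 mostly-empty dicts
# (getsizeof reports only the dict object itself, not referenced items,
# which is exactly what we want here since the dicts are empty).
total = sum(sys.getsizeof(a._extra) for a in xa)
print(f"{len(xa)} empty dicts hold at least {total} bytes")
```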
Yes, but it is not new complexity because it is the same as d.clear(). |
I got the following result:

$ ./python.patched -m perf timeit --compare-to=./python.default --duplicate=100 -- '{}'
python.default: ..................... 203 ns +- 5 ns
python.patched: ..................... 97.1 ns +- 0.7 ns
Mean +- std dev: [python.default] 203 ns +- 5 ns -> [python.patched] 97.1 ns +- 0.7 ns: 2.09x faster (-52%)

$ ./python.patched -m perf timeit --compare-to=./python.default --duplicate=100 -- 'x={}; x[1]=1'
python.default: ..................... 494 ns +- 5 ns
python.patched: ..................... 592 ns +- 7 ns
Mean +- std dev: [python.default] 494 ns +- 5 ns -> [python.patched] 592 ns +- 7 ns: 1.20x slower (+20%)

Something seems wrong with resizing an empty dict; it shouldn't take that much time. |
Inada's patched version behaves differently when running this code: 'x = {}; x['a'] = 123'. In PyObject_SetItem, the patched version goes to this line: but the original version goes to: I think that's why the performance issue came out; still digging into why this happened. |
Forgive my words; I traced the wrong code. Sorry about that. |
I'm testing whether a fast path that detects if keys is

➜ cpython git:(compact_empty_dict) ✗ ./python.default -m perf compare_to -G --min-speed=1 default.json lpatch.json
Faster (8):
Benchmark hidden because not significant (42) |
For the record, legitimate case when many empty dicts are created, and few are populated, is the collections-free approach to defaultdict(dict):
For, say, 100 unique key1s, and 10,000 total key1/key2 pairs, you'd create 10,000 empty dicts, discarding 9,900 of them. Granted, collections.defaultdict(dict) is even better (avoids the 9,900 unused dicts entirely), but I see the setdefault pattern enough, usually with list or dict, that it's not totally unreasonable to account for it. |
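The pattern can be sketched with hypothetical key1/key2 data (names invented for illustration). The setdefault form creates an empty dict on every call and discards it whenever key1 has already been seen, which is exactly the "many empty dicts, few kept" workload:

```python
from collections import defaultdict

pairs = [("a", 1), ("a", 2), ("b", 3)]  # hypothetical key1/key2 data

# collections-free pattern: an empty dict is created on EVERY call,
# but only kept the first time a given key1 is seen.
plain = {}
for key1, key2 in pairs:
    plain.setdefault(key1, {})[key2] = True  # 3 dicts created, 2 kept

# defaultdict avoids creating the discarded dicts entirely.
shared = defaultdict(dict)
for key1, key2 in pairs:
    shared[key1][key2] = True

assert plain == dict(shared)
```

With 100 unique key1s and 10,000 pairs, the setdefault version creates 10,000 empty dicts and throws 9,900 away, while defaultdict creates exactly 100.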
I don't think this should have been done. Conceptually, there is no basis for presuming key-sharing for new empty dicts -- you can't know what they would share with. This patch essentially undoes the entire reason for having a pre-allocated minsize dict. If it were deemed the norm that applications typically had huge numbers of empty dicts that were never populated, then the correct solution would be a NULL pointer to the table field (as lists do).

FWIW, the macro benchmarks aren't very informative here because they don't exercise much of this code.

I think there is an over-prioritization of small space savings at the expense of the intended use cases for dicts. This patch forces every dict that actually gets used to convert back from a key-sharing dict, doing a memory allocation and memset(0) in the process. The whole point of the minsize dict was to avoid that cost in the common case of small dicts (i.e. passing keyword arguments). |
That would add a massive number of NULL checks and could cause bugs.
Note that there is _PyDict_NewPresized(). When kwargs is empty, this patch increases dict creation speed:

$ ./py.edict bm_edict.py --compare-to ./py.master
py.master: ..................... 192 ns +- 7 ns
py.edict: ..................... 175 ns +- 4 ns
Mean +- std dev: [py.master] 192 ns +- 7 ns -> [py.edict] 175 ns +- 4 ns: 1.10x faster (-9%)
py.master: ..................... 269 ns +- 4 ns

So I don't think the net performance of kwargs gets worse. |
Another micro benchmark:

$ ./py.edict.opt -m perf timeit --compare-to ./py.master.opt '{}' --duplicate=10
py.master.opt: ..................... 26.3 ns +- 0.5 ns
py.edict.opt: ..................... 13.0 ns +- 0.1 ns
Mean +- std dev: [py.master.opt] 26.3 ns +- 0.5 ns -> [py.edict.opt] 13.0 ns +- 0.1 ns: 2.02x faster (-51%)

$ ./py.edict.opt -m perf timeit --compare-to ./py.master.opt 'd={}; d["a"]=None' --duplicate=10
py.master.opt: ..................... 58.1 ns +- 0.9 ns
py.edict.opt: ..................... 64.1 ns +- 0.9 ns
Mean +- std dev: [py.master.opt] 58.1 ns +- 0.9 ns -> [py.edict.opt] 64.1 ns +- 0.9 ns: 1.10x slower (+10%)

Hmm, while a 2x faster temporary empty dict is nice, the 10% slower case needs to be mitigated. |
What about creating small dict using a dict display? E.g. {'a': None}. |
I hate to see you working so hard to make this work. IMO, the effort is futile. Dicts were intentionally designed to do a single memory allocation and memset(0) to be fully ready to store data. This PR is undoing that decision and will make Python worse rather than better.

The PR is optimizing for the wrong thing. The space used by empty dicts is something we care much less about than the cost of the first assignment. No "mitigations" are going to help. If there is a second malloc and we incur the cost of switching from sharing to non-sharing, that is additional work that will have to be done by every single dictionary that actually gets used. Nothing will get that lost work back.

FWIW, if we were going to optimize for the space used by empty dicts, the table pointer could just be set to NULL. That would be better than key sharing, which is not only slower but also conceptually wrong (outside the context of instance creation, dicts will never be shared).
While it is helpful to know that there is some possible application that would benefit, we don't optimize for the rare case; we optimize for the common case where dicts get used. A substantial fraction of the language is implemented using dicts. For the most part, we use NULL values when a dict isn't actually needed, so the normal case is that dicts do get used. |
In general, I agree with Raymond that this is likely to be counter-productive. But let's not guess, let's measure :) I expect that there are few live empty dicts at any time for most programs, in which case there is no point in any change that attempts to save memory for empty dicts. But I could be wrong; there may commonly be lots of live empty dicts. I should also add that dict.clear() uses a key-sharing dict to avoid allocation, because PyDict_Clear() is a void function, so there is no way to handle an allocation failure. |
Serhiy, for {'a': None} the dict is created using _PyDict_NewPresized() so this makes no difference. |
Some of my thoughts on this. Conceptually, I expected that clearing a normal dict would make it an empty normal dict. I presume that making it an empty shared-key dict instead is a matter of efficiency and code simplicity. If the 'anomaly' is to be corrected, changing .clear() would be an alternative.

The fact that this patch 'rescues' people who use .setdefault when collections.defaultdict would be better does not especially persuade me (msg291817). The dict method doc and docstring could refer to defaultdict for such situations.

In 3.8.0a2, empty sets, like empty dicts, are ready to add. Empty lists in a2 are *not*, so pre-allocation is not universal in CPython for mutable collections. |
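Terry's observation about which empty containers come "ready to add" can be checked with sys.getsizeof. This is only a sketch: the exact byte counts are implementation details that vary across CPython versions and platforms, but the relative pattern is visible:

```python
import sys

# Empty containers: dict and set report their pre-allocated footprint,
# while an empty list defers allocating its item buffer.
d, s, l = {}, set(), []
print(sys.getsizeof(d), sys.getsizeof(s), sys.getsizeof(l))

# The first append allocates (and over-allocates) the list's item
# buffer, so the reported size grows.
l.append(0)
print(sys.getsizeof(l))
```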
On Wed, Mar 13, 2019 at 4:46 AM Raymond Hettinger
Please note that this PR does not do that. From Python 3.6, dict uses two |
Benchmark result is here. https://github.com/methane/sandbox/tree/master/2019.01/empty-dict notes:
|
I do not like how much code is needed for such minor optimization. |
PR 12307 is +46 lines. The benefit is about 10 ns on the first insertion:

$ cpython/python.opt -m perf timeit --compare-to master/python.opt --python-names master:empty-dict2 --duplicate 100 'd={};d["a"]=1'
master: ..................... 62.6 ns +- 1.1 ns
empty-dict2: ..................... 52.5 ns +- 1.0 ns
Mean +- std dev: [master] 62.6 ns +- 1.1 ns -> [empty-dict2] 52.5 ns +- 1.0 ns: 1.19x faster (-16%)

While "1.19x faster" is significant, I agree that this optimization is "minor". But I came up with an idea: _PyDict_NewPresized(n) skips preallocation when n fits in the minsize table. With this idea, the "first insertion optimization" works, and it may additionally help branch prediction. The total diff is still +46 lines, but it affects more cases.

$ alias compare='cpython/python.opt -m perf timeit --compare-to master/python.opt --python-names master:empty-dict2'
$ compare -s 'def f(x, **kw): pass' --duplicate 10 -- 'f(4)' # No table
Mean +- std dev: [master] 65.4 ns +- 2.3 ns -> [empty-dict2] 64.5 ns +- 1.5 ns: 1.01x faster (-1%)
$ compare -s 'def f(x, **kw): pass' --duplicate 10 -- 'f(4, a=1)' # minsize table is allocated
Mean +- std dev: [master] 152 ns +- 3 ns -> [empty-dict2] 144 ns +- 4 ns: 1.05x faster (-5%)
$ compare -s 'def f(x, **kw): pass' --duplicate 10 -- 'f(4, a=1,b=2)'
Mean +- std dev: [master] 211 ns +- 3 ns -> [empty-dict2] 186 ns +- 3 ns: 1.13x faster (-12%)
$ compare -s 'def f(x, **kw): pass' --duplicate 10 -- 'f(4, a=1,b=2,c=3)'
Mean +- std dev: [master] 248 ns +- 3 ns -> [empty-dict2] 223 ns +- 3 ns: 1.11x faster (-10%)
$ compare -s 'def f(x, **kw): pass' --duplicate 10 -- 'f(4, a=1,b=2,c=3,d=4,e=5)'
Mean +- std dev: [master] 327 ns +- 6 ns -> [empty-dict2] 301 ns +- 6 ns: 1.09x faster (-8%)
$ compare -s 'def f(x, **kw): pass' --duplicate 10 -- 'f(4, a=1,b=2,c=3,d=4,e=5,f=6)' # minsize*2 table is allocated
Mean +- std dev: [master] 431 ns +- 8 ns -> [empty-dict2] 406 ns +- 8 ns: 1.06x faster (-6%)

Now I think PR 12307 is not so minor. Of course, the same idea can be applied to PR 12308. |
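The presizing idea above can be sketched in Python (fits_minsize is an illustrative stand-in; the real logic is C in Objects/dictobject.c). CPython dicts keep the table at most 2/3 full, so up to 5 entries fit in the 8-slot minimum table and presizing can be skipped for such small n:

```python
PyDict_MINSIZE = 8  # smallest hash table CPython allocates for a dict

def fits_minsize(n: int) -> bool:
    # 2/3 maximum load factor: an 8-slot table holds up to 5 entries.
    return n * 3 <= PyDict_MINSIZE * 2

# For these n, _PyDict_NewPresized(n) can return a table-less empty dict
# and let the first insertion allocate the minsize table as usual.
print([n for n in range(1, 9) if fits_minsize(n)])  # [1, 2, 3, 4, 5]
```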
I merged PR-12307 instead of PR-12308 because:
And I am closing this issue because there are no remaining drawbacks around first insertion. |