Issue 33205: GROWTH_RATE prevents dict shrinking

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/77386

classification

Title:	GROWTH_RATE prevents dict shrinking
Type:	resource usage	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.8, Python 3.7

process

Status:	open	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Mark.Shannon, Yury.Selivanov, brandtbucher, eric.snow, methane, miss-islington, rhettinger, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2018-04-02 11:55 by methane, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description
dict_rand.py	methane, 2018-04-02 11:59

Pull Requests
URL	Status	Linked
PR 6350	merged	methane, 2018-04-02 16:00
PR 6503	merged	miss-islington, 2018-04-17 06:54

Messages (8)
msg314806 - (view)	Author: Inada Naoki (methane) *	Date: 2018-04-02 11:55
GROWTH_RATE was changed from (used*2) to (used*2 + dk_size/2) in #17563, in Python 3.4.
The goal was to avoid resizing the dict in massive del/insert use cases, by increasing the chance of overwriting a DUMMY entry.

Since Python 3.6, there are no DUMMY entries. While there are dummy keys, we resize (repack) when the entries are full.
So there is no longer any benefit from the "possibility of overwriting dummy entry".
(Of course, there is still a benefit from a slower repacking rate for a new dict. So I don't propose changing it back to (used*2), but to (used*3).)

This GROWTH_RATE prevents the dict from being shrunk in insertion_resize().

For example, consider this dict:

>>> d = dict.fromkeys(range(10900))
>>> len(d)
10900
>>> sys.getsizeof(d)
295008
>>> for i in range(10900):
...     del d[i]
...
>>> len(d)
0
>>> sys.getsizeof(d)
295008

`del d[i]` doesn't shrink the dict. This is another issue (#32623).

The current dk_size is 16384 and the entries length is dk_size * 2 // 3 = 10922. So dictresize will be called when the next 923 entries are added.
The new dk_size is round_up_to_power_of_two(922 + 16384/2) = 16384. So the dict is not shrunk!

>>> for i in range(923):
...     d[i] = i
...
>>> sys.getsizeof(d)
295008

round_up_to_power_of_two(used + dk_size/2) means the dict is shrunk only when used == 0.

I propose changing GROWTH_RATE again, to `used*3` from `used*2 + dk_size/2`.

When a dict grows without deletions, dk_size is doubled on each resize, as in the current behavior.
When there are deletions, dk_size grows more aggressively than in Python 3.3 (used*2 -> used*3). And it allows the dict to shrink after massive deletions.

>>> import sys
>>> d = dict.fromkeys(range(10900))
>>> sys.getsizeof(d)
295008
>>> for i in range(10900):
...     del d[i]
...
>>> len(d)
0
>>> sys.getsizeof(d)
295008
>>> for i in range(923):
...     d[i] = i
...
>>> sys.getsizeof(d)
36968

I want to backport this change to Python 3.7. Although it's at beta3, the "dict can't be shrunk unless empty" behavior looks like a resource usage bug to me.
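The arithmetic in the message above can be checked with a small pure-Python model. This is only a sketch under stated assumptions: `round_up_to_power_of_two`, `old_growth_rate`, and `new_growth_rate` are hypothetical helper names that mirror the formulas quoted in the message (`used*2 + dk_size/2` and `used*3`), and the minimum table size of 8 is assumed. It models a single resize decision, not the full insertion sequence, so it does not reproduce the `sys.getsizeof` byte counts.

```python
def round_up_to_power_of_two(minsize, floor=8):
    """Smallest power of two >= minsize, never below `floor` (assumed minimum)."""
    size = floor
    while size < minsize:
        size <<= 1
    return size


def old_growth_rate(used, dk_size):
    # Formula before the proposed change: used*2 + dk_size/2
    return used * 2 + dk_size // 2


def new_growth_rate(used, dk_size):
    # Proposed formula: used*3 (dk_size is ignored, kept for a uniform signature)
    return used * 3


dk_size = 16384
used = 922  # live items at the moment of the resize, as in the message

old_size = round_up_to_power_of_two(old_growth_rate(used, dk_size))
new_size = round_up_to_power_of_two(new_growth_rate(used, dk_size))
print("old formula:", old_size)  # stays at 16384
print("new formula:", new_size)  # drops to 4096

# Under the old formula, which values of `used` would give a smaller table?
usable = dk_size * 2 // 3
shrinking = [u for u in range(usable + 1)
             if round_up_to_power_of_two(old_growth_rate(u, dk_size)) < dk_size]
print("old formula shrinks only for used in:", shrinking)  # [0]
```

In this model the old formula can never return less than dk_size/2 plus something positive unless used is zero, which is why the table only shrinks for an empty dict.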
msg314807 - (view)	Author: Inada Naoki (methane) *	Date: 2018-04-02 11:59
# cap2.json is master (used*2 + dk_size/2)
# used3.json is patched (used*3)
$ ./python-master -m perf compare_to cap2.json used3.json -G
Slower (2):
- rand_access(size=20): 2.67 ms +- 0.01 ms -> 2.80 ms +- 0.04 ms: 1.05x slower (+5%)
- rand_access(size=10): 2.70 ms +- 0.03 ms -> 2.80 ms +- 0.04 ms: 1.04x slower (+4%)

Faster (1):
- rand_access(size=50): 2.76 ms +- 0.06 ms -> 2.74 ms +- 0.02 ms: 1.01x faster (-1%)

Benchmark hidden because not significant (5): rand_access(size=2), rand_access(size=5), rand_access(size=100), rand_access(size=200), rand_access(size=500)
msg314977 - (view)	Author: Inada Naoki (methane) *	Date: 2018-04-05 11:11
@Mark.Shannon, @rhettinger What do you think about this?
msg315351 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2018-04-16 06:14
The capacity of the dict is 2/3 of its hashtable size: dk_usable < 2/3 * dk_size.

Currently the dict grows if dk_usable > 1/4 * dk_size, and preserves the size if dk_usable < 1/4 * dk_size. Note that it can grow twice if dk_usable > 1/2 * dk_size.

With the proposed change the dict will grow only if dk_usable > 1/3 * dk_size, preserve the size if 1/6 * dk_size < dk_usable < 1/3 * dk_size, and shrink if dk_usable < 1/6 * dk_size. After growing once it will not need to grow again until the number of items increases.

This LGTM.
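The thresholds for the single-step grow, preserve, and shrink cases can be illustrated with a short sketch. It assumes the quantity compared against dk_size is the number of live items (the message writes dk_usable), that the new table size is the smallest power of two not below GROWTH_RATE, and that the minimum size is 8. `classify`, `current_rate`, and `proposed_rate` are hypothetical names used only for this illustration.

```python
def round_up_to_power_of_two(minsize, floor=8):
    """Smallest power of two >= minsize, never below `floor` (assumed minimum)."""
    size = floor
    while size < minsize:
        size <<= 1
    return size


def current_rate(used, dk_size):
    # GROWTH_RATE before the change: used*2 + dk_size/2
    return used * 2 + dk_size // 2


def proposed_rate(used, dk_size):
    # GROWTH_RATE after the change: used*3
    return used * 3


def classify(used, dk_size, growth_rate):
    """Report whether a resize would grow, keep, or shrink the table."""
    new_size = round_up_to_power_of_two(growth_rate(used, dk_size))
    if new_size > dk_size:
        return "grow"
    if new_size < dk_size:
        return "shrink"
    return "keep"


dk_size = 1024  # 1/6, 1/4, 1/3 of dk_size are about 171, 256, 341
for used in (100, 200, 300, 400):
    print(used,
          "current:", classify(used, dk_size, current_rate),
          "proposed:", classify(used, dk_size, proposed_rate))
```

With dk_size = 1024, the current formula only ever keeps or grows for these inputs, while the proposed one shrinks at used = 100, keeps at 200 and 300, and grows at 400.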
msg315380 - (view)	Author: Inada Naoki (methane) *	Date: 2018-04-17 06:53
New changeset 5fbc511f56688654a05b9eba23d140318bb9b2d5 by INADA Naoki in branch 'master': bpo-33205: dict: Change GROWTH_RATE to `used*3` (GH-6350) https://github.com/python/cpython/commit/5fbc511f56688654a05b9eba23d140318bb9b2d5
msg315409 - (view)	Author: miss-islington (miss-islington)	Date: 2018-04-17 17:17
New changeset 902bb62d5af21526b68892a1032c63aa86ded247 by Miss Islington (bot) in branch '3.7': bpo-33205: dict: Change GROWTH_RATE to `used*3` (GH-6350) https://github.com/python/cpython/commit/902bb62d5af21526b68892a1032c63aa86ded247
msg411534 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2022-01-25 00:05
Should this have been "filled*3" rather than "used*3"?

The intent was to give a larger resize to dicts that had a lot of dummy entries and a smaller resize to dicts without deletions.
msg411716 - (view)	Author: Inada Naoki (methane) *	Date: 2022-01-26 08:54
We have not had fill since Python 3.6. There is `dk_nentries` instead. But when `insertion_resize()` is called, `dk_nentries` is equal to `USABLE_FRACTION(dk_size)` (dk_size is `1 << dk_log2_size` for now). So it is different from fill in the old dict.

I chose `dk_used*3` as GROWTH_RATE because it reserves more space when there are dummies than when there are none, as I described in the same comment:

> In case of dict growing without deletion, dk_size is doubled for each resize as current behavior.
> When there are deletion, dk_size is growing aggressively than Python 3.3 (used*2 -> used*3). And it allows dict shrinking after massive deletions.

For example, when current dk_size == 16 and USABLE_FRACTION(dk_size) == 10, the new dk_size is:

* used = 10 (dummy=0) -> 32 (31.25%)
* used = 9 (dummy=1) -> 32 (28.125%)
(snip)
* used = 6 (dummy=4) -> 32 (18.75%)
* used = 5 (dummy=5) -> 16 (31.25%)
* used = 4 (dummy=6) -> 16 (25%)
(snip)
* used = 2 (dummy=8) -> 8 (25%)

As you can see, the dict is more sparse when there are dummies than when there are none, except in the used=5/dummy=5 case.

There may be a little room for improvement, especially for the `used=5/dummy=5` case. But I am not sure it is worth using a more complex GROWTH_RATE than used*3.
Any good ideas?
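The table above, including the snipped rows, can be regenerated with a few lines of Python. This is a sketch that assumes the new dk_size is the smallest power of two not below used*3 (with an assumed minimum of 8) and that USABLE_FRACTION(n) is n*2//3, as stated earlier in the thread; `usable_fraction` and `round_up_to_power_of_two` are hypothetical helper names.

```python
def round_up_to_power_of_two(minsize, floor=8):
    """Smallest power of two >= minsize, never below `floor` (assumed minimum)."""
    size = floor
    while size < minsize:
        size <<= 1
    return size


def usable_fraction(dk_size):
    # Entries capacity for a given hashtable size: dk_size * 2 // 3
    return (dk_size << 1) // 3


dk_size = 16
usable = usable_fraction(dk_size)  # 10 for dk_size == 16

rows = []
for used in range(usable, 1, -1):
    dummy = usable - used                          # deleted slots still occupying entries
    new_size = round_up_to_power_of_two(used * 3)  # GROWTH_RATE = used*3
    load = 100 * used / new_size                   # fill percentage after the resize
    rows.append((used, dummy, new_size, load))
    print(f"* used = {used} (dummy={dummy}) -> {new_size} ({load:g}%)")
```

The rows for used = 8, 7, and 3, which the message snips, come out as 32 (25%), 32 (21.875%), and 16 (18.75%) under these assumptions.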

History
Date	User	Action	Args
2022-04-11 14:58:59	admin	set	github: 77386
2022-01-26 18:08:32	brandtbucher	set	nosy: + brandtbucher
2022-01-26 08:54:49	methane	set	messages: + msg411716
2022-01-26 05:23:37	rhettinger	set	status: closed -> open
2022-01-25 18:05:06	vstinner	set	nosy: - vstinner
2022-01-25 00:05:41	rhettinger	set	messages: + msg411534
2018-04-17 17:22:53	methane	set	resolution: fixed
2018-04-17 17:22:43	methane	set	status: open -> closed stage: patch review -> resolved
2018-04-17 17:17:28	miss-islington	set	nosy: + miss-islington messages: + msg315409
2018-04-17 06:54:43	miss-islington	set	pull_requests: + pull_request6198
2018-04-17 06:53:37	methane	set	messages: + msg315380
2018-04-16 06:14:27	serhiy.storchaka	set	messages: + msg315351
2018-04-05 11:11:02	methane	set	messages: + msg314977
2018-04-02 16:00:28	methane	set	keywords: + patch stage: patch review pull_requests: + pull_request6060
2018-04-02 14:24:24	serhiy.storchaka	set	nosy: + serhiy.storchaka
2018-04-02 11:59:24	methane	set	files: + dict_rand.py messages: + msg314807
2018-04-02 11:56:14	methane	link	issue32623 dependencies
2018-04-02 11:55:51	methane	create