Issue 24275: lookdict_* give up too soon

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68463

classification

Title:	lookdict_* give up too soon
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.10

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Jim.Jewett, Mark.Shannon, benjamin.peterson, h.venev, methane, python-dev, serhiy.storchaka, vaultah
Priority:	normal	Keywords:	patch

Created on 2015-05-24 05:08 by Jim.Jewett, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch	h.venev, 2021-01-28 18:54	patch
bench.py	h.venev, 2021-01-28 18:58
0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch	h.venev, 2021-01-31 20:34	patch with test

Pull Requests
URL	Status	Linked	Edit
PR 25186	merged	python-dev, 2021-04-04 16:55

Messages (15)
msg243969 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2015-05-24 05:08
The specialized lookdict_* variants replace themselves with the generic lookdict as soon as a non-unicode key is looked up. They could reasonably leave the replacement to insertdict (line 819, currently assert rather than a replacement), when a non-unicode key is actually inserted into the dict. While inserts are less common than (all lookups including insert), I see the main advantage as reducing the number of duplications of the replacement logic.
msg244014 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2015-05-25 04:31
You could have a non-unicode key that compares equal to a unicode string.
msg244024 - (view)	Author: Mark Shannon (Mark.Shannon) *	Date: 2015-05-25 09:50
I don't understand why this has been closed. I agree with Jim's analysis. Lookups do not change the dict and the choice of lookdict_* variant depends solely on the set of keys. In fact, lookdict_split doesn't replace itself, it merely calls look_dict, leaving maintaining the invariant to insertdict. Benjamin, could you please reopen this and mark it as needing a patch.
msg244026 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-05-25 10:09
It is not obvious that the patch is needed. If you have ready patch and good benchmark results, you could reopen the issue. Otherwise status quo wins.
msg385876 - (view)	Author: Hristo Venev (h.venev) *	Date: 2021-01-28 18:54
I've attached a patch. I will soon provide benchmark results.
msg385877 - (view)	Author: Hristo Venev (h.venev) *	Date: 2021-01-28 18:58
Benchmark program attached. 0. Creates a dict with str keys 1. str lookups 2. str subclass lookups on the same dict 3. str lookups on the same dict Results before patch: 0.9493069459404069 +- 0.004707371313935551 1.47313450980997 +- 0.01350596115630158 1.3181799192904144 +- 0.006550182814933545 Results after patch: 0.9498907704499289 +- 0.003721378313122522 1.4936510094799451 +- 0.057905386684185135 0.9494844124498195 +- 0.0029465760297623235
msg385994 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2021-01-30 20:41
This was originally "can be reopened if a patch is submitted" and Hristo Venev has now done so. Therefore, I am reopening.
msg385996 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2021-01-30 20:59
Based on Hristo's timing, it appears to be a clear win. A near-wash for truly string-only dicts that shouldn't be effected; a near-wash for looking up non-(exact-)strings, and a nearly 40% speedup for the target case of looking up but not inserting a non-string or string subclass, then looking up strings thereafter. Additional comments: Barring objections, I will promote from patch review to commit review when I've had a chance to look more closely. I don't have commit privs, but I think some of the others following this issue do. The test looks pretty good enough -- good enough that I wonder if I'm missing something on the parts that seem odd. It would be great if you either cleaned them up or commented to explain why: Why is the first key vx1, which seems, if anything, like a variable? Why not k1 or string_key? Why is the first key built up as vx='x'; vx += '1' instead of just k1="x1"? Using a str subclass in the test is a great idea, and you've created a truly minimal one. It would probably be good to also test with a non-string, like 3 or 42.0. I can't imagine this affecting things (unless you missed an eager lookdict demotion somewhere), but it would be good to have that path documented against regression. This seems like a test that could probably be rolled into a bigger testfile for the actual commit. I don't have the name of such an appropriate file at hand right now, but will try to find it on a deeper review.
msg385997 - (view)	Author: Hristo Venev (h.venev) *	Date: 2021-01-30 21:28
> Why is the first key built up as vx='x'; vx += '1' instead of just k1="x1"? I wanted to construct a key that is equal to, but not the same object as, `'x1'`. Consider this example: assert 'x1' is 'x1' spam = 'x1' assert spam is 'x1' eggs = 'x' eggs += '1' assert eggs == 'x1' assert eggs is not 'x1' assert sys.intern(eggs) is 'x1' When doing a dict lookup and the lookup key is the same object as a stored entry, `__eq__` is not called. Lookups are then significantly faster, maybe 20%. Consider this example: class EqTest: def __eq__(self, other): raise RuntimeError def __hash__(self): return id(self) adict = {} k1 = EqTest() k2 = EqTest() adict[k1] = 42 adict[k2] = 43 print(adict[k1], adict[k2]) Here `k1` is considered the same as `k1` and `k2` is considered the same as `k2`. However, `k1` and `k2` are considered distinct and never compared because they have different hashes. However, if we were to set `EqTest.__hash__ = lambda self: 42`, we'd get a RuntimeError when we try to set `adict[k2]` because it would get compared for equality with `k1`. Even if `__eq__` works, we can get some interesting behaviors. For example, when using multiple instances of `float('nan')` as keys. > Using a str subclass in the test is a great idea, and you've created a truly minimal one. It would probably be good to also test with a non-string, like 3 or 42.0. I can't imagine this affecting things (unless you missed an eager lookdict demotion somewhere), but it would be good to have that path documented against regression. I also tested a custom class that compares equal to strings. Other than being much slower, there weren't any significant differences. I also did some checks with int key lookups, which obviously fail with KeyError. They did not make the performance worse for the subsequent str lookups. I will try to make a proper test tomorrow.
msg386000 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2021-01-30 22:24
Please test also creating a large dict with string-only or non-string keys. Since you add a code in insertdict() it can potentially affects performance.
msg386040 - (view)	Author: Hristo Venev (h.venev) *	Date: 2021-01-31 20:34
I've attached a patch that should also contain a test. I also ran some benchmarks on dict creation/inserts. I couldn't notice any difference in performance.
msg387505 - (view)	Author: Inada Naoki (methane) *	Date: 2021-02-22 08:09
I'm +1 for this. Would you create a pull request on GitHub?
msg389933 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2021-03-31 21:12
What is the status on this? If you are losing interest, would you like someone else to turn your patch into a pull request?
msg389934 - (view)	Author: Hristo Venev (h.venev) *	Date: 2021-03-31 21:15
Sorry, I must have missed Inada Naoki's reply. I will try to send a pull request this weekend.
msg392273 - (view)	Author: Inada Naoki (methane) *	Date: 2021-04-29 02:06
New changeset 8557edbfa8f74514de82feea4c62f5963e4e0aa7 by Hristo Venev in branch 'master': bpo-24275: Don't downgrade unicode-only dicts to mixed on lookups (GH-25186) https://github.com/python/cpython/commit/8557edbfa8f74514de82feea4c62f5963e4e0aa7

History
Date	User	Action	Args
2022-04-11 14:58:17	admin	set	github: 68463
2021-04-29 02:06:31	methane	set	status: open -> closed resolution: remind -> fixed stage: patch review -> resolved
2021-04-29 02:06:14	methane	set	messages: + msg392273
2021-04-04 16:55:33	python-dev	set	nosy: + python-dev pull_requests: + pull_request23928
2021-03-31 21:15:15	h.venev	set	messages: + msg389934
2021-03-31 21:12:54	Jim.Jewett	set	messages: + msg389933
2021-02-22 08:09:04	methane	set	messages: + msg387505
2021-02-01 01:46:54	methane	set	nosy: + methane
2021-01-31 20:34:41	h.venev	set	files: + 0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch messages: + msg386040
2021-01-30 22:24:18	serhiy.storchaka	set	messages: + msg386000
2021-01-30 21:28:47	h.venev	set	messages: + msg385997
2021-01-30 20:59:25	Jim.Jewett	set	messages: + msg385996
2021-01-30 20:41:43	Jim.Jewett	set	status: closed -> open versions: + Python 3.10 messages: + msg385994 resolution: rejected -> remind stage: patch review
2021-01-28 18:58:11	h.venev	set	files: + bench.py messages: + msg385877
2021-01-28 18:54:14	h.venev	set	files: + 0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch nosy: + h.venev messages: + msg385876 keywords: + patch
2015-05-25 10:09:46	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg244026
2015-05-25 09:50:23	Mark.Shannon	set	nosy: + Mark.Shannon messages: + msg244024
2015-05-25 04:31:18	benjamin.peterson	set	status: open -> closed nosy: + benjamin.peterson messages: + msg244014 resolution: rejected
2015-05-24 06:08:44	vaultah	set	nosy: + vaultah
2015-05-24 05:08:29	Jim.Jewett	create