classification
Title: lookdict_* give up too soon
Type: enhancement Stage: patch review
Components: Interpreter Core Versions: Python 3.10
process
Status: open Resolution: remind
Dependencies: Superseder:
Assigned To: Nosy List: Jim.Jewett, Mark.Shannon, benjamin.peterson, h.venev, methane, serhiy.storchaka, vaultah
Priority: normal Keywords: patch

Created on 2015-05-24 05:08 by Jim.Jewett, last changed 2021-02-22 08:09 by methane.

Files
File name Uploaded Description Edit
0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch h.venev, 2021-01-28 18:54 patch
bench.py h.venev, 2021-01-28 18:58
0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch h.venev, 2021-01-31 20:34 patch with test
Messages (12)
msg243969 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2015-05-24 05:08
The specialized lookdict_* variants replace themselves with the generic lookdict as soon as a non-unicode key is looked up.  

They could reasonably leave the replacement to insertdict (line 819, currently assert rather than a replacement), when a non-unicode key is actually inserted into the dict. 

While inserts are less common than (all lookups including insert), I see the main advantage as reducing the number of duplications of the replacement logic.
msg244014 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2015-05-25 04:31
You could have a non-unicode key that compares equal to a unicode string.
msg244024 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2015-05-25 09:50
I don't understand why this has been closed.
I agree with Jim's analysis.

Lookups do not change the dict and the choice of lookdict_* variant depends solely on the set of keys.

In fact, lookdict_split *doesn't* replace itself, it merely calls look_dict, leaving maintaining the invariant to insertdict.

Benjamin, could you please reopen this and mark it as needing a patch.
msg244026 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-25 10:09
It is not obvious that the patch is needed. If you have ready patch and good benchmark results, you could reopen the issue. Otherwise status quo wins.
msg385876 - (view) Author: Hristo Venev (h.venev) * Date: 2021-01-28 18:54
I've attached a patch. I will soon provide benchmark results.
msg385877 - (view) Author: Hristo Venev (h.venev) * Date: 2021-01-28 18:58
Benchmark program attached.

0. Creates a dict with str keys
1. str lookups
2. str subclass lookups on the same dict
3. str lookups on the same dict

Results before patch:
0.9493069459404069 +- 0.004707371313935551
1.47313450980997 +- 0.01350596115630158
1.3181799192904144 +- 0.006550182814933545

Results after patch:
0.9498907704499289 +- 0.003721378313122522
1.4936510094799451 +- 0.057905386684185135
0.9494844124498195 +- 0.0029465760297623235
msg385994 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2021-01-30 20:41
This was originally "can be reopened if a patch is submitted" and Hristo Venev has now done so. Therefore, I am reopening.
msg385996 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2021-01-30 20:59
Based on Hristo's timing, it appears to be a clear win.  

A near-wash for truly string-only dicts that shouldn't be effected; a near-wash for looking up non-(exact-)strings, and a nearly 40% speedup for the target case of looking up but not inserting a non-string or string subclass, then looking up strings thereafter. 

Additional comments:

Barring objections, I will promote from patch review to commit review when I've had a chance to look more closely.  I don't have commit privs, but I think some of the others following this issue do.

The test looks pretty good enough -- good enough that I wonder if I'm missing something on the parts that seem odd.  It would be great if you either cleaned them up or commented to explain why:

Why is the first key vx1, which seems, if anything, like a variable? 
 Why not k1 or string_key?

Why is the first key built up as vx='x'; vx += '1' instead of just k1="x1"?

Using a str subclass in the test is a great idea, and you've created a truly minimal one.  It would probably be good to *also* test with a non-string, like 3 or 42.0.  I can't imagine this affecting things (unless you missed an eager lookdict demotion somewhere), but it would be good to have that path documented against regression.

This seems like a test that could probably be rolled into a bigger testfile for the actual commit.  I don't have the name of such an appropriate file at hand right now, but will try to find it on a deeper review.
msg385997 - (view) Author: Hristo Venev (h.venev) * Date: 2021-01-30 21:28
> Why is the first key built up as vx='x'; vx += '1' instead of just k1="x1"?

I wanted to construct a key that is equal to, but not the same object as, `'x1'`. Consider this example:

    assert 'x1' is 'x1'
    spam = 'x1'
    assert spam is 'x1'
    eggs = 'x'
    eggs += '1'
    assert eggs == 'x1'
    assert eggs is not 'x1'
    assert sys.intern(eggs) is 'x1'

When doing a dict lookup and the lookup key is the same object as a stored entry, `__eq__` is not called. Lookups are then significantly faster, maybe 20%.

Consider this example:

    class EqTest:
        def __eq__(self, other):
            raise RuntimeError
        def __hash__(self):
            return id(self)
    
    adict = {}
    k1 = EqTest()
    k2 = EqTest()
    
    adict[k1] = 42
    adict[k2] = 43
    print(adict[k1], adict[k2])

Here `k1` is considered the same as `k1` and `k2` is considered the same as `k2`. However, `k1` and `k2` are considered distinct and never compared because they have different hashes.

However, if we were to set `EqTest.__hash__ = lambda self: 42`, we'd get a RuntimeError when we try to set `adict[k2]` because it would get compared for equality with `k1`.

Even if `__eq__` works, we can get some interesting behaviors. For example, when using multiple instances of `float('nan')` as keys.

> Using a str subclass in the test is a great idea, and you've created a truly minimal one.  It would probably be good to *also* test with a non-string, like 3 or 42.0.  I can't imagine this affecting things (unless you missed an eager lookdict demotion somewhere), but it would be good to have that path documented against regression.

I also tested a custom class that compares equal to strings. Other than being much slower, there weren't any significant differences. I also did some checks with int key lookups, which obviously fail with KeyError. They did not make the performance worse for the subsequent str lookups.

I will try to make a proper test tomorrow.
msg386000 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-01-30 22:24
Please test also creating a large dict with string-only or non-string keys. Since you add a code in insertdict() it can potentially affects performance.
msg386040 - (view) Author: Hristo Venev (h.venev) * Date: 2021-01-31 20:34
I've attached a patch that should also contain a test.

I also ran some benchmarks on dict creation/inserts. I couldn't notice any difference in performance.
msg387505 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-02-22 08:09
I'm +1 for this. Would you create a pull request on GitHub?
History
Date User Action Args
2021-02-22 08:09:04methanesetmessages: + msg387505
2021-02-01 01:46:54methanesetnosy: + methane
2021-01-31 20:34:41h.venevsetfiles: + 0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch

messages: + msg386040
2021-01-30 22:24:18serhiy.storchakasetmessages: + msg386000
2021-01-30 21:28:47h.venevsetmessages: + msg385997
2021-01-30 20:59:25Jim.Jewettsetmessages: + msg385996
2021-01-30 20:41:43Jim.Jewettsetstatus: closed -> open
versions: + Python 3.10
messages: + msg385994

resolution: rejected -> remind
stage: patch review
2021-01-28 18:58:11h.venevsetfiles: + bench.py

messages: + msg385877
2021-01-28 18:54:14h.venevsetfiles: + 0001-Don-t-downgrade-unicode-only-dicts-to-mixed-on-non-u.patch

nosy: + h.venev
messages: + msg385876

keywords: + patch
2015-05-25 10:09:46serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg244026
2015-05-25 09:50:23Mark.Shannonsetnosy: + Mark.Shannon
messages: + msg244024
2015-05-25 04:31:18benjamin.petersonsetstatus: open -> closed

nosy: + benjamin.peterson
messages: + msg244014

resolution: rejected
2015-05-24 06:08:44vaultahsetnosy: + vaultah
2015-05-24 05:08:29Jim.Jewettcreate