classification
Title: dict: simplify lookup functions
Type: enhancement Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Dmitry Rubanovich, inada.naoki, pitrou, rhettinger, serhiy.storchaka, tim.peters, xgdomingo, xiang.zhang
Priority: normal Keywords: patch

Created on 2017-01-18 02:36 by inada.naoki, last changed 2017-08-05 11:51 by inada.naoki. This issue is now closed.

Files
File name Uploaded Description Edit
dict-refactor.patch inada.naoki, 2017-01-18 06:09 review
dict-refactoring-2.patch inada.naoki, 2017-01-26 17:53 review
dict-refactoring-3.patch inada.naoki, 2017-01-26 18:54 review
collisions_count.py Dmitry Rubanovich, 2017-06-15 06:57
Pull Requests
URL Status Linked Edit
PR 2273 merged inada.naoki, 2017-06-19 04:38
PR 2347 merged inada.naoki, 2017-06-23 12:50
PR 2407 merged inada.naoki, 2017-06-26 13:25
Messages (25)
msg285695 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-01-18 02:36
All lookdict* functions are implemented like pseudo language:

```
lookup()
if not collision:
    return result

while True:
    perturb_shift()
    lookup()
    if not collision:
        return result
```

This patch changes it as:

```
while True:
    lookup()
    if not collision:
        return result
    perturb_shift()
```

It removes 100 lines of code. Good. But how about performance?

When this patch is applied to 4a534c45bbf6:

```
$ ../python.patched -m perf compare_to default.json patched2.json -G --min-speed=2
Slower (4):
- xml_etree_generate: 271 ms +- 6 ms -> 283 ms +- 9 ms: 1.04x slower (+4%)
- nqueens: 263 ms +- 4 ms -> 272 ms +- 3 ms: 1.04x slower (+4%)
- scimark_monte_carlo: 272 ms +- 10 ms -> 280 ms +- 14 ms: 1.03x slower (+3%)
- scimark_lu: 435 ms +- 23 ms -> 446 ms +- 32 ms: 1.03x slower (+3%)

Faster (7):
- call_method: 15.2 ms +- 0.2 ms -> 14.7 ms +- 0.4 ms: 1.04x faster (-4%)
- call_simple: 14.4 ms +- 0.2 ms -> 13.9 ms +- 0.3 ms: 1.04x faster (-4%)
- xml_etree_iterparse: 227 ms +- 9 ms -> 219 ms +- 7 ms: 1.04x faster (-3%)
- scimark_sor: 527 ms +- 10 ms -> 510 ms +- 11 ms: 1.03x faster (-3%)
- call_method_slots: 14.7 ms +- 0.5 ms -> 14.3 ms +- 0.2 ms: 1.03x faster (-3%)
- genshi_text: 90.2 ms +- 1.1 ms -> 87.8 ms +- 1.1 ms: 1.03x faster (-3%)
- django_template: 403 ms +- 5 ms -> 394 ms +- 4 ms: 1.02x faster (-2%)

Benchmark hidden because not significant (53): 2to3, ...
```

And when this patch applied to 1a97b10cb420 :

```
$ ../python.patched -m perf compare_to default.json patched.json -G --min-speed=2                                                                                                               
Slower (6):
- call_simple: 13.5 ms +- 0.5 ms -> 14.4 ms +- 0.4 ms: 1.07x slower (+7%)
- xml_etree_generate: 270 ms +- 6 ms -> 287 ms +- 5 ms: 1.06x slower (+6%)
- xml_etree_process: 240 ms +- 6 ms -> 247 ms +- 4 ms: 1.03x slower (+3%)
- regex_compile: 429 ms +- 3 ms -> 440 ms +- 5 ms: 1.03x slower (+3%)
- call_method_unknown: 16.1 ms +- 0.2 ms -> 16.5 ms +- 0.3 ms: 1.02x slower (+2%)
- logging_simple: 31.2 us +- 0.4 us -> 32.0 us +- 0.3 us: 1.02x slower (+2%)

Faster (8):
- genshi_text: 90.6 ms +- 1.4 ms -> 87.6 ms +- 1.2 ms: 1.03x faster (-3%)
- scimark_sor: 513 ms +- 11 ms -> 497 ms +- 12 ms: 1.03x faster (-3%)
- genshi_xml: 200 ms +- 2 ms -> 194 ms +- 2 ms: 1.03x faster (-3%)
- unpickle_pure_python: 857 us +- 21 us -> 835 us +- 13 us: 1.03x faster (-3%)
- python_startup_no_site: 9.95 ms +- 0.02 ms -> 9.74 ms +- 0.02 ms: 1.02x faster (-2%)
- json_dumps: 29.7 ms +- 0.4 ms -> 29.1 ms +- 0.4 ms: 1.02x faster (-2%)
- xml_etree_iterparse: 225 ms +- 9 ms -> 220 ms +- 5 ms: 1.02x faster (-2%)
- chameleon: 31.1 ms +- 0.3 ms -> 30.5 ms +- 0.5 ms: 1.02x faster (-2%)

Benchmark hidden because not significant (50): 2to3, ...
```

I can't see any stable and significant performance regression.
I'll try to create some micro benchmarks.
msg285696 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2017-01-18 03:26
Ahh, I got the same idea before. But then I persuaded myself that the first "lookup()" was deliberately extracted from the loop since it's highly possible there is no collision and you can hit the result (empty or not) the first time.
msg285698 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-01-18 03:54
I worry you guys are rapidly sloshing around and churning code that was developed slowly and methodically over many years.  A lot of people have looked at an tweaked the dictionary code, but now it seems like there is a rewrite-everything festival.  When looking at potential changes, please consider that the people who came before you knew what they were doing.  If there is a pre-loop step that could have been folded into a single loop to save a little code, it isn't there because the people who wrote it are daft.  That expansion was made for a reason.

Even if you get an improved timing here or there, consider that it may vary from compiler to compiler, from os to os, and may vary widely across different kinds of applications.  Branch prediction and cache effect issues don't readily show up in simple timing loops.

Particularly as a new core dev, I recommend resisting the urge to rewrite everything you touch.  That will only destabilize Python.  In the past couple of years, two or three devs have churned an enormous amount of code.
msg285702 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-01-18 05:42
Raymond, I understand your worries.  I won't commit this until I do more precise survey.

But why I trying this is not only I find duplicated code.
I think benefit of this technique is reduced by historical reason.


In Python 2.7, there are only two lookup functions: lookdict and lookdict_string.
Both functions support dict containing dummy entry.

In the first comparing part in the lookdict_string():

    if (ep->me_key == NULL || ep->me_key == key)
        return ep;
    if (ep->me_key == dummy)
        freeslot = ep;
    else {
        if (ep->me_hash == hash && _PyString_Eq(ep->me_key, key))
            return ep;
        freeslot = NULL;
    }

And similar code in same function but in loop:

        if (ep->me_key == NULL)
            return freeslot == NULL ? ep : freeslot;
        if (ep->me_key == key
            || (ep->me_hash == hash
            && ep->me_key != dummy
            && _PyString_Eq(ep->me_key, key)))
            return ep;
        if (ep->me_key == dummy && freeslot == NULL)
            freeslot = ep;

First part can ignore freeslot variable is NULL or not.  But second lookup can't.
Since most lookup is done at first try, this technique may had significant effect.


But for now, we're having five lookdict functions (if we include lookdict_index).

Three of them (lookdict_unicode_nodummy, lookdict_split, and lookdict_index) cares only dict without dummies.
They don't have freeslot variable.  So first part and second part is almost same.
Performance benefit from this technique may be negligible.

About remaining two functions (lookdict_unicode and lookdict), this technique may have still effect.
But performance of them are not so important now, compared with Python 2.7.
They were used for all name lookup in Python 2.7, but they are fallback of lookdict_split and lookdict_unicode_nodummy now.

On the other hand, having 2.5x (2 -> 5) lookdict function means maintenance cost of duplicated code is increased.


At first, I'll start three functions which doesn't take care of dummies.
I think there are almost zero performance regression.
msg285703 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-01-18 06:09
I've attached wrong file.  This patch is first version I wanted to post.
It dedupe all five functions.
msg286320 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-01-26 16:24
Digging history, duplicated code is introduced here. (1997-01-17)

https://github.com/python/cpython/commit/99304174680d4c724476dad300ae7fc638842bf0#diff-2131209d0deb0e50c93a88ec6c7b0d52

	/* Optimizations based on observations by Jyrki Alakuijala
	   (paraphrased):
	   - This routine is very heavily used, so should be AFAP
	   (As Fast As Possible).
	   - Most of the time, the first try is a hit or a definite
	   miss; so postpone the calculation of incr until we know the
	   first try was a miss.
	   - Write the loop twice, so we can move the test for
	   freeslot==NULL out of the loop.
	   - Write the loop using pointer increments and comparisons
	   rather than using an integer loop index.
	   Note that it behooves the compiler to calculate the values
	   of incr*sizeof(*ep) outside the loops and use this in the
	   increment of ep.  I've reduced the number of register
	   variables to the two most obvious candidates.

At the time, there was a single lookdict function, and three tries.

* The comment said the first try was for skipping `inc` calculation.
  While `inc` had been removed already, I think this implies skipping
  freeslot==NULL test.

* After first try, there are two loop for skipping freeslot==NULL test
  until first dummy found.  This optimization looks gone.


As I said above, lookdict_unicode_nodummy and lookdict_split only search
from table without dummies.
And lookdict_unicode() and lookdict() are not so important lookmapping() was in 1997,
duplicated code only for skipping one freeslot==NULL doesn't make sense.

One possible optimization is removing freeslot completely. because:

* freeslot is used only when inserting. finding is more important.
* insertdict() can find dummy quickly, only looking dk_indices.

But before trying optimization, I suggest to remove duplicated code first.
Macro bench doesn't show significant difference at least.
msg286322 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-01-26 16:32
BTW, perturb shift uses (i << 2) + i, instead of i*5.
I feel it doesn't make sense in 21th century.  I'll change it too.
msg286323 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-01-26 17:02
I agreed that the duplication was made for reasons, but these reasons can be not valid or less weighty now, after a number of changes made last years like shared keys or compact dicts. An overhead of additional levels of indirection and complex code of dk_get_index() can be much larger than the benefit of code duplicating.

I think neither macrobenchmarks nor microbenchmarks would not show the true effect of this change. The code should be analyzed with more precise tools, similarly to how the set implementation was tuned. I think the patch should not be pushed without such analysis. Perhaps Raymond will found a time to do this.
msg286328 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-01-26 17:53
>  I think the patch should not be pushed without such analysis. Perhaps Raymond will found a time to do this.

Ok, I won't push until expert's LGTM.

> One possible optimization is removing freeslot completely. because:
> 
> * freeslot is used only when inserting. finding is more important.
> * insertdict() can find dummy quickly, only looking dk_indices.
>
> But before trying optimization,

I found this is not only optimization, but also refactoring.
There is function for finding insertion point already, and insertdict()
and PyDict_SetDefault() use it in some cases.
So this can be done without adding any code.

+ ../python.default -m perf compare_to default.json patched.json -G --min-speed=2
Slower (5):
- logging_silent: 736 ns +- 7 ns -> 758 ns +- 15 ns: 1.03x slower (+3%)
- unpickle: 32.2 us +- 0.4 us -> 33.0 us +- 0.3 us: 1.03x slower (+3%)
- unpickle_pure_python: 829 us +- 13 us -> 849 us +- 13 us: 1.02x slower (+2%)
- meteor_contest: 200 ms +- 2 ms -> 204 ms +- 2 ms: 1.02x slower (+2%)
- scimark_sor: 482 ms +- 8 ms -> 493 ms +- 9 ms: 1.02x slower (+2%)

Faster (6):
- unpack_sequence: 132 ns +- 2 ns -> 123 ns +- 0 ns: 1.07x faster (-7%)
- call_simple: 14.3 ms +- 0.5 ms -> 13.7 ms +- 0.3 ms: 1.05x faster (-5%)
- sqlite_synth: 10.3 us +- 0.3 us -> 9.91 us +- 0.21 us: 1.03x faster (-3%)
- mako: 40.7 ms +- 0.5 ms -> 39.8 ms +- 0.3 ms: 1.02x faster (-2%)
- xml_etree_parse: 319 ms +- 11 ms -> 311 ms +- 14 ms: 1.02x faster (-2%)
- chameleon: 30.4 ms +- 0.4 ms -> 29.8 ms +- 0.3 ms: 1.02x faster (-2%)

Benchmark hidden because not significant (53): ...
msg286331 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-01-26 18:54
dict-refactoring-3.patch:

$ ../python.default -m perf compare_to default.json patched2.json -G --min-speed=2
Slower (7):
- scimark_lu: 422 ms +- 35 ms -> 442 ms +- 11 ms: 1.05x slower (+5%)
- logging_silent: 736 ns +- 7 ns -> 761 ns +- 21 ns: 1.03x slower (+3%)
- scimark_sor: 482 ms +- 8 ms -> 494 ms +- 7 ms: 1.03x slower (+3%)
- meteor_contest: 200 ms +- 2 ms -> 205 ms +- 2 ms: 1.02x slower (+2%)
- unpickle: 32.2 us +- 0.4 us -> 32.9 us +- 0.5 us: 1.02x slower (+2%)
- unpickle_pure_python: 829 us +- 13 us -> 848 us +- 14 us: 1.02x slower (+2%)
- scimark_sparse_mat_mult: 8.71 ms +- 0.32 ms -> 8.89 ms +- 0.13 ms: 1.02x slower (+2%)

Faster (8):
- unpack_sequence: 132 ns +- 2 ns -> 123 ns +- 2 ns: 1.07x faster (-7%)
- call_simple: 14.3 ms +- 0.5 ms -> 13.4 ms +- 0.3 ms: 1.07x faster (-6%)
- call_method: 15.1 ms +- 0.1 ms -> 14.5 ms +- 0.2 ms: 1.04x faster (-4%)
- mako: 40.7 ms +- 0.5 ms -> 39.6 ms +- 0.5 ms: 1.03x faster (-3%)
- scimark_monte_carlo: 266 ms +- 7 ms -> 258 ms +- 6 ms: 1.03x faster (-3%)
- chameleon: 30.4 ms +- 0.4 ms -> 29.6 ms +- 0.4 ms: 1.03x faster (-3%)
- xml_etree_parse: 319 ms +- 11 ms -> 312 ms +- 15 ms: 1.02x faster (-2%)
- pickle_pure_python: 1.28 ms +- 0.03 ms -> 1.26 ms +- 0.02 ms: 1.02x faster (-2%)

# microbench

$ ./python -m perf timeit --compare-to=`pwd`/python.default -s 'r=range(1000)' -- '{k:k for k in r}'                                                                               
python.default: ..................... 60.0 us +- 0.3 us
python: ..................... 61.7 us +- 0.4 us

Median +- std dev: [python.default] 60.0 us +- 0.3 us -> [python] 61.7 us +- 0.4 us: 1.03x slower (+3%)

$ ./python -m perf timeit --compare-to=`pwd`/python.default -s 'ks=[str(k) for k in range(1000)]; d={k:k for k in ks}' -- 'for k in ks: d[k]'                                      
python.default: ..................... 37.1 us +- 0.9 us
python: ..................... 37.7 us +- 0.9 us

Median +- std dev: [python.default] 37.1 us +- 0.9 us -> [python] 37.7 us +- 0.9 us: 1.02x slower (+2%)

Hmm, 3% slower?
I'll rerun benchmarks with PGO+LTO build.
msg296070 - (view) Author: Dmitry Rubanovich (Dmitry Rubanovich) * Date: 2017-06-15 06:57
lookdict_index() (and the rest of the files in dictobject.c) are using unnecessarily complicated perturb mechanism.  And, in fact, it's slower than the simpler case.

Instead of this:

for (size_t perturb = hash;;) {
    perturb >>= PERTURB_SHIFT;
    i = mask & ((i << 2) + i + perturb + 1);
....

it should do this:

for (size_t perturb = hash;;) {
    i = mask & ((i << 1) + perturb + 1);
    perturb >>= PERTURB_SHIFT;
....

This would not only save an instruction (a minor issue), but it would also reduce collisions.

I've attached a file which calculates frequencies of collisions for demonstration purposes.  It shows that the calculation, as it stands right now, does not create a 1-1 mapping even on the 1st iteration through the loop.  Moving PERTURB_SHIFT to the line before the calculation does reduce the density of the collision space.  But using the calculation, which I proposed, eliminates collisions on the 1st iteration completely and reduces it on most subsequent iterations.
msg296071 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-15 07:11
Dmitry, please open a new issue for your proposition. It changes visible behavior, while Inada's patch doesn't do this.
msg296305 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-19 04:48
pr-2237 only changes lookdict_index() function.
The function is used from only dict_popitem().  So it's not performance critical part.

And microbench doesn't show significant performance changes:

$ ./python.default -m perf timeit -q -l 2000 -s 'd=dict.fromkeys(str(i) for i in range(2000))' -- 'd.popitem()'
Mean +- std dev: 96.1 ns +- 1.3 ns

$ ./python.patched -m perf timeit -q -l 2000 -s 'd=dict.fromkeys(str(i) for i in range(2000))' -- 'd.popitem()'
Mean +- std dev: 96.4 ns +- 1.3 ns
msg296680 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-23 06:22
New changeset 073ae487b3ff9001c69d530c7555ddaa530dee16 by INADA Naoki in branch 'master':
bpo-29304: simplify lookdict_index() function. (GH-2273)
https://github.com/python/cpython/commit/073ae487b3ff9001c69d530c7555ddaa530dee16
msg296702 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-06-23 11:32
From a purely human and social perspective, I agree that it's beneficial to minimize duplicated code.

From a performance perspective, I can see two possible consequences:
- either compilers are already smart enough to undo the code duplication, and generated code will remain the same => no performance difference
- or compilers are not able to undo the code duplication, and the reduced I-cache pressure may be beneficial on non-trivial workloads => potential improvement (though probably minor) with the reduced code size
msg296848 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-26 05:15
I suggest reading the thread I started here[1] before pursuing this:  it looks very likely that the entire collision resolution scheme should be replaced with one of the "double hashing" ones given there, a bona fide algorithmic improvement for small tables and pathological key sets.  Whether it actually runs faster remains a mystery ;-)  The loop guts change from a shift, three adds, and a mask (or a multiply, two adds, and a mask) to just one add and a mask.  But the post-first-probe pre-loop setup gets more expensive.

[1] https://mail.python.org/pipermail/python-ideas/2017-June/046143.html
msg296849 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2017-06-26 05:20
Oops!  I undercounted the shifts in the current scheme:  there's an additional shift for "perturb".  That doesn't exist in the "double hashing" alternatives.
msg296890 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-26 13:19
Antonie:

Thank you for you comment.
Actually speaking, size of instructions are reduced on amd64@gcc.

lookdict_unicode_nodummy and lookdict_split doesn't have any reason to have duplicated code anymore.  I'll do dedupe for them first.

lookdict and lookdict_unicode have one difference between code before loop and inner loop:

// before loop:
    if (ix == DKIX_DUMMY) {
        freeslot = i;

// inner loop:
        if (ix == DKIX_DUMMY) {
            if (freeslot == -1)
                freeslot = i;

Since lookdict_unicode may be used for namespace, I'll keep it for now.
But lookdict() is very unlikely to be used for namespace.  I'll dedupe it after checking some macro/micro benchmarks.
msg296891 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-26 13:22
Tim Peters:

Thanks for your suggestion.  But I want to focus on very simple code cleanup without any change to proving algorithm nor memory access pattern in this issue.

I'll reply about my thought about reducing collision on the ML.
msg297060 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-06-27 17:37
Finally, I removed `freeslot` and all lookdict*() functions are much simpler in GH-2407.

Macro bench doesn't show any significant performance change.
I'll try to write micro benchmarks.
msg299709 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-08-03 14:42
# microbench
https://gist.github.com/methane/db16224a99d12ad4ed170cf3bd45b819

Slower (2):
- fromkeys_str: 3.85 us +- 0.07 us -> 4.09 us +- 0.09 us: 1.06x slower (+6%)
- fromkeys_int: 4.24 us +- 0.13 us -> 4.48 us +- 0.16 us: 1.06x slower (+6%)

Faster (2):
- unicode_nodummy_6: 604 ns +- 37 ns -> 577 ns +- 8 ns: 1.05x faster (-4%)
- unicode_nodummy_5: 626 ns +- 31 ns -> 600 ns +- 21 ns: 1.04x faster (-4%)

Benchmark hidden because not significant (1): unicode_split

# pyperformance

Slower (10):                                                                  
- xml_etree_iterparse: 191 ms +- 5 ms -> 200 ms +- 5 ms: 1.05x slower (+5%)                                                            
- pickle_list: 7.80 us +- 0.19 us -> 8.08 us +- 0.24 us: 1.04x slower (+4%)
- unpack_sequence: 116 ns +- 5 ns -> 119 ns +- 4 ns: 1.03x slower (+3%)     
- meteor_contest: 183 ms +- 3 ms -> 187 ms +- 2 ms: 1.02x slower (+2%)      
- logging_silent: 330 ns +- 6 ns -> 336 ns +- 5 ns: 1.02x slower (+2%)
- sqlite_synth: 8.65 us +- 0.16 us -> 8.78 us +- 0.14 us: 1.02x slower (+2%)
- sympy_sum: 172 ms +- 1 ms -> 174 ms +- 1 ms: 1.01x slower (+1%)                                                                                                
- json_dumps: 27.0 ms +- 0.4 ms -> 27.3 ms +- 0.3 ms: 1.01x slower (+1%)                                                               
- pickle_dict: 65.1 us +- 1.9 us -> 65.8 us +- 1.7 us: 1.01x slower (+1%)                                                                                       
- pickle: 20.4 us +- 0.3 us -> 20.6 us +- 0.2 us: 1.01x slower (+1%)        
                                                                            
Faster (20):                                                                                                                                                            
- nbody: 287 ms +- 4 ms -> 231 ms +- 6 ms: 1.24x faster (-19%)                                                                                                          
- scimark_sparse_mat_mult: 8.94 ms +- 1.35 ms -> 8.26 ms +- 0.36 ms: 1.08x faster (-8%)                                                                         
- chaos: 248 ms +- 3 ms -> 229 ms +- 2 ms: 1.08x faster (-8%)              
- pickle_pure_python: 1.01 ms +- 0.02 ms -> 953 us +- 13 us: 1.06x faster (-5%)
- scimark_fft: 689 ms +- 22 ms -> 654 ms +- 14 ms: 1.05x faster (-5%)
- regex_v8: 41.5 ms +- 1.8 ms -> 39.8 ms +- 0.2 ms: 1.04x faster (-4%)                                                   
- xml_etree_parse: 279 ms +- 12 ms -> 271 ms +- 10 ms: 1.03x faster (-3%)
- chameleon: 23.2 ms +- 0.5 ms -> 22.6 ms +- 0.5 ms: 1.03x faster (-2%)     
- dulwich_log: 146 ms +- 2 ms -> 143 ms +- 1 ms: 1.02x faster (-2%)         
- unpickle_pure_python: 725 us +- 15 us -> 709 us +- 15 us: 1.02x faster (-2%)
- nqueens: 211 ms +- 3 ms -> 206 ms +- 2 ms: 1.02x faster (-2%)                                                                        
- json_loads: 56.6 us +- 2.6 us -> 55.7 us +- 0.4 us: 1.02x faster (-2%)  
- 2to3: 636 ms +- 3 ms -> 625 ms +- 4 ms: 1.02x faster (-2%)                
- richards: 148 ms +- 4 ms -> 145 ms +- 2 ms: 1.02x faster (-2%)            
- spectral_norm: 253 ms +- 9 ms -> 250 ms +- 5 ms: 1.01x faster (-1%)
- xml_etree_generate: 203 ms +- 5 ms -> 201 ms +- 4 ms: 1.01x faster (-1%)
- html5lib: 194 ms +- 6 ms -> 191 ms +- 6 ms: 1.01x faster (-1%)
- sqlalchemy_imperative: 55.0 ms +- 1.7 ms -> 54.3 ms +- 1.7 ms: 1.01x faster (-1%)
- sqlalchemy_declarative: 285 ms +- 3 ms -> 281 ms +- 3 ms: 1.01x faster (-1%)
- genshi_xml: 171 ms +- 2 ms -> 169 ms +- 2 ms: 1.01x faster (-1%)

Benchmark hidden because not significant (30): ...
msg299711 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-08-03 14:45
New changeset 778928b0c7aa438c282727535814d73df850693a by INADA Naoki in branch 'master':
bpo-29304: Simplify dict lookup functions (GH-2407)
https://github.com/python/cpython/commit/778928b0c7aa438c282727535814d73df850693a
msg299713 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-08-03 16:05
Thank you Inada!
msg299765 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2017-08-05 01:18
It would have been nice to have timed this on ARM chips or 32-bit builds.  I suspect that the previous code was mainly beneficial in register starved environments when it would have saved some memory accesses whenever there were an immediate first hit on a dictionary lookup (which is common).  Also, it is more likely to make a difference on larger dicts where the memory accesses aren't all in L1 cache.
msg299777 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2017-08-05 11:51
On i386 docker image, pyperformance:

 ./python -m perf compare_to before.json after.json -G --min-speed=2
Slower (4):
- regex_dna: 279 ms +- 1 ms -> 288 ms +- 3 ms: 1.03x slower (+3%)
- deltablue: 17.9 ms +- 0.3 ms -> 18.3 ms +- 1.0 ms: 1.03x slower (+3%)
- scimark_monte_carlo: 258 ms +- 4 ms -> 264 ms +- 4 ms: 1.02x slower (+2%)
- unpack_sequence: 115 ns +- 1 ns -> 117 ns +- 3 ns: 1.02x slower (+2%)

Faster (9):
- scimark_lu: 408 ms +- 16 ms -> 389 ms +- 13 ms: 1.05x faster (-5%)
- spectral_norm: 315 ms +- 3 ms -> 302 ms +- 2 ms: 1.04x faster (-4%)
- xml_etree_process: 235 ms +- 2 ms -> 228 ms +- 3 ms: 1.03x faster (-3%)
- telco: 19.8 ms +- 0.4 ms -> 19.3 ms +- 0.3 ms: 1.03x faster (-3%)
- scimark_sor: 487 ms +- 6 ms -> 474 ms +- 4 ms: 1.03x faster (-3%)
- django_template: 385 ms +- 3 ms -> 376 ms +- 3 ms: 1.02x faster (-2%)
- unpickle_pure_python: 934 us +- 13 us -> 913 us +- 11 us: 1.02x faster (-2%)
- pathlib: 68.2 ms +- 0.8 ms -> 66.7 ms +- 0.5 ms: 1.02x faster (-2%)
- xml_etree_generate: 277 ms +- 3 ms -> 271 ms +- 7 ms: 1.02x faster (-2%)

Benchmark hidden because not significant (44)

and microbench:

 ./python -m perf compare_to before-dict.json after-dict.json -G    
Slower (2):
- fromkeys_str: 5.12 us +- 0.10 us -> 5.27 us +- 0.09 us: 1.03x slower (+3%)
- unicode_split: 636 ns +- 11 ns -> 648 ns +- 16 ns: 1.02x slower (+2%)

Faster (2):
- unicode_nodummy_5: 837 ns +- 28 ns -> 818 ns +- 21 ns: 1.02x faster (-2%)
- unicode_nodummy_6: 824 ns +- 10 ns -> 808 ns +- 20 ns: 1.02x faster (-2%)

Benchmark hidden because not significant (1): fromkeys_int
History
Date User Action Args
2017-08-05 11:51:39inada.naokisetmessages: + msg299777
2017-08-05 01:18:56rhettingersetmessages: + msg299765
2017-08-03 16:05:42pitrousetmessages: + msg299713
2017-08-03 14:45:59inada.naokisetstatus: open -> closed
resolution: fixed
stage: resolved
2017-08-03 14:45:17inada.naokisetmessages: + msg299711
2017-08-03 14:42:32inada.naokisetmessages: + msg299709
2017-06-27 17:37:15inada.naokisetmessages: + msg297060
2017-06-26 13:25:19inada.naokisetpull_requests: + pull_request2455
2017-06-26 13:22:51inada.naokisetmessages: + msg296891
2017-06-26 13:19:50inada.naokisetmessages: + msg296890
2017-06-26 05:20:00tim.peterssetmessages: + msg296849
2017-06-26 05:15:15tim.peterssetnosy: + tim.peters
messages: + msg296848
2017-06-26 04:52:08xgdomingosetnosy: + xgdomingo
2017-06-23 12:50:23inada.naokisetpull_requests: + pull_request2402
2017-06-23 11:32:22pitrousetnosy: + pitrou
messages: + msg296702
2017-06-23 06:22:52inada.naokisetmessages: + msg296680
2017-06-19 04:48:49inada.naokisetmessages: + msg296305
title: dict: simplify lookup function -> dict: simplify lookup functions
2017-06-19 04:38:15inada.naokisetpull_requests: + pull_request2323
2017-06-15 07:11:53serhiy.storchakasetmessages: + msg296071
2017-06-15 06:57:07Dmitry Rubanovichsetfiles: + collisions_count.py

type: enhancement
components: + Interpreter Core
versions: + Python 3.7
nosy: + Dmitry Rubanovich

messages: + msg296070
2017-01-26 18:54:32inada.naokisetfiles: + dict-refactoring-3.patch

messages: + msg286331
2017-01-26 17:53:09inada.naokisetfiles: + dict-refactoring-2.patch

messages: + msg286328
2017-01-26 17:02:38serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg286323
2017-01-26 16:32:05inada.naokisetmessages: + msg286322
2017-01-26 16:24:26inada.naokisetmessages: + msg286320
2017-01-18 06:09:20inada.naokisetfiles: + dict-refactor.patch

messages: + msg285703
2017-01-18 06:06:31inada.naokisetfiles: - dictlook-refactoring.patch
2017-01-18 05:42:22inada.naokisetmessages: + msg285702
2017-01-18 03:54:41rhettingersetnosy: + rhettinger
messages: + msg285698
2017-01-18 03:26:45xiang.zhangsetnosy: + xiang.zhang
messages: + msg285696
2017-01-18 02:36:13inada.naokicreate