Classification
Title: Decorate hot functions using __attribute__((hot)) to optimize Python
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.7
Process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: haypo, inada.naoki, jcea, pitrou, python-dev, serhiy.storchaka, yselivanov
Priority: normal Keywords: patch

Created on 2016-11-05 00:29 by haypo, last changed 2017-05-18 00:42 by jcea. This issue is now closed.

Files
File name Uploaded Description Edit
hot_function.patch haypo, 2016-11-05 00:29 review
pgo.json.gz haypo, 2016-11-08 21:09
patch.json.gz haypo, 2016-11-08 21:09
hot3.patch haypo, 2016-11-15 14:21 review
Messages (34)
msg280097 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-05 00:29
When analyzing results of Python performance benchmarks, I noticed that call_method was 70% slower (!) between revisions 83877018ef97 (Oct 18) and 3e073e7b4460 (Oct 22), including these revisions, on the speed-python server.

On these revisions, the CPU L1 instruction cache is less efficient: 8% cache misses, whereas it was only 0.06% before and after these revisions.

Since the two mentioned revisions have no obvious impact on the call_method() benchmark, I understand that the performance difference is caused by a different layout of the machine code, maybe the exact location of functions.

IMO the best solution to such a compilation issue is to use PGO compilation. Problem: PGO doesn't work on Ubuntu 14.04, the OS used by speed-python (the server running benchmarks for http://speed.python.org/).

I propose to manually decorate the "hot" functions using the GCC __attribute__((hot)) attribute:
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes
(search for "hot")

Attached patch adds Py_HOT_FUNCTION and decorates the following functions:

* _PyEval_EvalFrameDefault()
* PyFrame_New()
* call_function()
* lookdict_unicode_nodummy()
* _PyFunction_FastCall()
* frame_dealloc()

These functions are the top 6 according to the Linux perf tool when running the call_simple benchmark of the performance project:

32.66%: _PyEval_EvalFrameDefault
13.09%: PyFrame_New
12.78%: call_function
12.24%: lookdict_unicode_nodummy
 9.85%: _PyFunction_FastCall
 8.47%: frame_dealloc
msg280105 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-05 09:07
I ran benchmarks. Globally, it seems like the impact of the patch is positive. regex_v8 and call_simple are slower, but these are microbenchmarks impacted by low-level details like the CPU L1 cache. Well, my patch was supposed to optimize CPython for call_simple :-/ I should maybe investigate a little bit more.


Performance comparison (performance 0.3.2):

haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G
Slower (6):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower
- call_simple: 12.6 ms +- 0.2 ms -> 13.2 ms +- 1.3 ms: 1.05x slower
- regex_effbot: 4.58 ms +- 0.07 ms -> 4.70 ms +- 0.05 ms: 1.03x slower
- sympy_integrate: 43.4 ms +- 0.3 ms -> 44.0 ms +- 0.2 ms: 1.01x slower
- nqueens: 239 ms +- 2 ms -> 241 ms +- 1 ms: 1.01x slower
- scimark_fft: 674 ms +- 12 ms -> 680 ms +- 75 ms: 1.01x slower

Faster (32):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster
- scimark_sor: 488 ms +- 27 ms -> 467 ms +- 10 ms: 1.05x faster
- sqlite_synth: 9.16 us +- 1.03 us -> 8.82 us +- 0.23 us: 1.04x faster
- scimark_lu: 485 ms +- 20 ms -> 469 ms +- 14 ms: 1.03x faster
- xml_etree_process: 226 ms +- 30 ms -> 219 ms +- 4 ms: 1.03x faster
- logging_simple: 29.7 us +- 0.4 us -> 28.9 us +- 0.3 us: 1.03x faster
- pickle_list: 7.99 us +- 0.88 us -> 7.78 us +- 0.05 us: 1.03x faster
- raytrace: 1.26 sec +- 0.08 sec -> 1.23 sec +- 0.01 sec: 1.03x faster
- sympy_expand: 995 ms +- 31 ms -> 971 ms +- 35 ms: 1.03x faster
- deltablue: 17.0 ms +- 0.1 ms -> 16.6 ms +- 0.2 ms: 1.02x faster
- call_method_slots: 16.0 ms +- 0.1 ms -> 15.6 ms +- 0.2 ms: 1.02x faster
- fannkuch: 983 ms +- 12 ms -> 962 ms +- 29 ms: 1.02x faster
- pickle_pure_python: 1.25 ms +- 0.14 ms -> 1.22 ms +- 0.01 ms: 1.02x faster
- logging_format: 34.0 us +- 0.3 us -> 33.4 us +- 1.5 us: 1.02x faster
- xml_etree_parse: 274 ms +- 9 ms -> 270 ms +- 5 ms: 1.02x faster
- sympy_str: 441 ms +- 3 ms -> 433 ms +- 3 ms: 1.02x faster
- genshi_text: 87.6 ms +- 9.2 ms -> 86.0 ms +- 1.4 ms: 1.02x faster
- genshi_xml: 187 ms +- 17 ms -> 184 ms +- 1 ms: 1.02x faster
- django_template: 376 ms +- 4 ms -> 370 ms +- 2 ms: 1.02x faster
- json_dumps: 27.1 ms +- 0.4 ms -> 26.7 ms +- 0.4 ms: 1.02x faster
- sqlalchemy_declarative: 295 ms +- 3 ms -> 291 ms +- 3 ms: 1.01x faster
- call_method_unknown: 18.1 ms +- 0.1 ms -> 17.8 ms +- 0.1 ms: 1.01x faster
- nbody: 218 ms +- 4 ms -> 216 ms +- 2 ms: 1.01x faster
- regex_dna: 250 ms +- 24 ms -> 247 ms +- 2 ms: 1.01x faster
- go: 573 ms +- 2 ms -> 566 ms +- 3 ms: 1.01x faster
- richards: 173 ms +- 4 ms -> 171 ms +- 4 ms: 1.01x faster
- python_startup: 24.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.00x faster
- regex_compile: 404 ms +- 6 ms -> 403 ms +- 5 ms: 1.00x faster
- dulwich_log: 143 ms +- 11 ms -> 143 ms +- 1 ms: 1.00x faster
- pidigits: 290 ms +- 1 ms -> 289 ms +- 0 ms: 1.00x faster
- pickle_dict: 58.3 us +- 6.5 us -> 58.3 us +- 0.7 us: 1.00x faster

Benchmark hidden because not significant (26): 2to3, call_method, chaos, crypto_pyaes, float, hexiom, html5lib, json_loads, logging_silent, mako, meteor_contest, pathlib, pickle, python_startup_no_site, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_imperative, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse

--

More readable output, only display differences >= 5%:

haypo@smithers$ python3 -m perf compare_to orig.json hot.json -G --min-speed=5
Slower (1):
- regex_v8: 40.6 ms +- 5.7 ms -> 47.1 ms +- 0.3 ms: 1.16x slower

Faster (2):
- scimark_monte_carlo: 255 ms +- 4 ms -> 234 ms +- 7 ms: 1.09x faster
- chameleon: 28.4 ms +- 3.1 ms -> 27.0 ms +- 0.4 ms: 1.05x faster

Benchmark hidden because not significant (61): 2to3, call_method, call_method_slots, call_method_unknown, call_simple, chaos, crypto_pyaes, deltablue, django_template, dulwich_log, fannkuch, float, genshi_text, genshi_xml, go, hexiom, html5lib, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, meteor_contest, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, raytrace, regex_compile, regex_dna, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, sympy_sum, telco, tornado_http, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_iterparse, xml_etree_parse, xml_etree_process
msg280106 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-05 09:08
Oh, I forgot to mention that I compiled Python with "./configure -C". The purpose of the patch is to optimize Python when LTO and/or PGO compilation are not explicitly used.
msg280108 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-11-05 09:59
Can you compare against a PGO build? Ubuntu 14.04 is old, and I don't think this is something we should worry about.

Overall I think this manual approach is really the wrong way to look at it. Compilers can do better than us.
msg280115 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-05 15:37
Antoine Pitrou added the comment:
> Can you compare against a PGO build?

Do you mean comparison between current Python with PGO and patched
Python without PGO?

The hot attribute is ignored by GCC when PGO compilation is used.

> Ubuntu 14.04 is old, and I don't think this is something we should worry about.

Well, it's a practical issue for me to run benchmarks for speed.python.org.

Moreover, I like the idea of getting a fast(er) Python even when no
advanced optimization techniques like LTO or PGO are used. At least,
it's common to build Python quickly using "./configure && make" for a
quick benchmark.
msg280116 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-05 16:14
> Moreover, I like the idea of getting a fast(er) Python even when no
> advanced optimization techniques like LTO or PGO are used.

Seconded.
msg280125 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-11-05 20:02
On 05/11/2016 at 16:37, STINNER Victor wrote:
> 
> Antoine Pitrou added the comment:
>> Can you compare against a PGO build?
> 
> Do you mean comparison between current Python with PGO and patched
> Python without PGO?

Yes.

>> Ubuntu 14.04 is old, and I don't think this is something we should worry about.
> 
> Well, it's a practical issue for me to run benchmarks for speed.python.org.

Why isn't the OS updated on that machine?
msg280126 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-05 22:53
Antoine Pitrou added the comment:
>> Do you mean comparison between current Python with PGO and patched
>> Python without PGO?
>
> Yes.

Oh ok, sure. I will try to run these 2 benchmarks.

>>> Ubuntu 14.04 is old, and I don't think this is something we should worry about.
>>
>> Well, it's a practical issue for me to run benchmarks for speed.python.org.
>
> Why isn't the OS updated on that machine?

I am not sure that I want to use PGO compilation to run benchmarks.
Last time I checked, I noticed performance differences between two
compilations. PGO compilation doesn't seem 100% deterministic.

Maybe PGO compilation is fine when you build Python to create a Linux
package. But to get reliable benchmarks, I'm not sure that it's a good
idea.

I'm still running benchmarks on Python recompiled many times using
different compiler options, to measure the impact of the compiler
options (especially LTO and/or PGO) on the benchmark stability.
msg280350 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-08 21:09
>> Do you mean comparison between current Python with PGO and patched
>> Python without PGO?
>
> Yes.

Ok, here you go. As expected, the PGO compilation is faster than the default compilation with my patch. PGO implements more optimizations than just __attribute__((hot)); it also optimizes branches, for example.

haypo@smithers$ python3 -m perf compare_to pgo.json.gz patch.json.gz -G --min-speed=5
Slower (56):
- regex_effbot: 4.30 ms +- 0.26 ms -> 5.77 ms +- 0.33 ms: 1.34x slower
- telco: 16.0 ms +- 1.1 ms -> 20.6 ms +- 0.4 ms: 1.29x slower
- xml_etree_process: 174 ms +- 15 ms -> 218 ms +- 29 ms: 1.25x slower
- xml_etree_generate: 205 ms +- 16 ms -> 254 ms +- 4 ms: 1.24x slower
- unpickle_list: 6.04 us +- 1.12 us -> 7.47 us +- 0.18 us: 1.24x slower
- call_simple: 10.6 ms +- 1.4 ms -> 13.1 ms +- 0.3 ms: 1.24x slower
- mako: 33.5 ms +- 0.3 ms -> 41.3 ms +- 0.9 ms: 1.23x slower
- pathlib: 37.0 ms +- 2.3 ms -> 44.7 ms +- 2.0 ms: 1.21x slower
- sqlite_synth: 7.56 us +- 0.20 us -> 8.97 us +- 0.18 us: 1.19x slower
- unpickle: 24.2 us +- 3.9 us -> 28.7 us +- 0.3 us: 1.18x slower
- chameleon: 23.4 ms +- 2.6 ms -> 27.4 ms +- 1.5 ms: 1.17x slower
- spectral_norm: 214 ms +- 7 ms -> 249 ms +- 9 ms: 1.17x slower
- nqueens: 210 ms +- 2 ms -> 244 ms +- 36 ms: 1.16x slower
- unpickle_pure_python: 717 us +- 10 us -> 831 us +- 66 us: 1.16x slower
- pickle: 18.7 us +- 4.3 us -> 21.6 us +- 3.3 us: 1.15x slower
- sympy_expand: 829 ms +- 39 ms -> 957 ms +- 28 ms: 1.15x slower
- genshi_text: 73.1 ms +- 3.2 ms -> 84.3 ms +- 1.1 ms: 1.15x slower
- pickle_list: 6.82 us +- 0.20 us -> 7.86 us +- 0.05 us: 1.15x slower
- sympy_str: 372 ms +- 28 ms -> 428 ms +- 3 ms: 1.15x slower
- xml_etree_parse: 231 ms +- 7 ms -> 266 ms +- 9 ms: 1.15x slower
- call_method_slots: 14.0 ms +- 1.3 ms -> 16.1 ms +- 1.2 ms: 1.15x slower
- sympy_sum: 169 ms +- 6 ms -> 194 ms +- 19 ms: 1.15x slower
- logging_format: 29.3 us +- 2.5 us -> 33.7 us +- 1.6 us: 1.15x slower
- logging_simple: 25.7 us +- 2.1 us -> 29.3 us +- 0.4 us: 1.14x slower
- genshi_xml: 159 ms +- 15 ms -> 182 ms +- 1 ms: 1.14x slower
- xml_etree_iterparse: 178 ms +- 3 ms -> 203 ms +- 5 ms: 1.14x slower
- pickle_pure_python: 1.06 ms +- 0.17 ms -> 1.21 ms +- 0.16 ms: 1.14x slower
- logging_silent: 618 ns +- 11 ns -> 705 ns +- 62 ns: 1.14x slower
- hexiom: 19.0 ms +- 0.2 ms -> 21.7 ms +- 0.2 ms: 1.14x slower
- html5lib: 184 ms +- 11 ms -> 209 ms +- 31 ms: 1.14x slower
- call_method: 14.3 ms +- 0.7 ms -> 16.3 ms +- 0.1 ms: 1.14x slower
- django_template: 324 ms +- 18 ms -> 368 ms +- 3 ms: 1.14x slower
- sympy_integrate: 37.9 ms +- 0.3 ms -> 43.0 ms +- 2.7 ms: 1.13x slower
- deltablue: 15.0 ms +- 2.0 ms -> 16.9 ms +- 1.0 ms: 1.12x slower
- call_method_unknown: 16.0 ms +- 0.4 ms -> 17.9 ms +- 0.2 ms: 1.12x slower
- 2to3: 611 ms +- 12 ms -> 677 ms +- 57 ms: 1.11x slower
- regex_compile: 300 ms +- 3 ms -> 332 ms +- 21 ms: 1.11x slower
- json_loads: 50.5 us +- 2.5 us -> 55.8 us +- 1.2 us: 1.10x slower
- unpack_sequence: 111 ns +- 5 ns -> 122 ns +- 1 ns: 1.10x slower
- pickle_dict: 53.2 us +- 3.7 us -> 58.1 us +- 3.7 us: 1.09x slower
- scimark_sor: 420 ms +- 60 ms -> 458 ms +- 12 ms: 1.09x slower
- scimark_lu: 398 ms +- 74 ms -> 434 ms +- 18 ms: 1.09x slower
- regex_dna: 227 ms +- 1 ms -> 247 ms +- 9 ms: 1.09x slower
- pidigits: 266 ms +- 33 ms -> 290 ms +- 10 ms: 1.09x slower
- chaos: 243 ms +- 2 ms -> 265 ms +- 3 ms: 1.09x slower
- crypto_pyaes: 197 ms +- 16 ms -> 215 ms +- 28 ms: 1.09x slower
- dulwich_log: 129 ms +- 15 ms -> 140 ms +- 8 ms: 1.08x slower
- sqlalchemy_imperative: 50.8 ms +- 0.9 ms -> 55.0 ms +- 1.8 ms: 1.08x slower
- meteor_contest: 173 ms +- 22 ms -> 187 ms +- 5 ms: 1.08x slower
- sqlalchemy_declarative: 268 ms +- 11 ms -> 290 ms +- 3 ms: 1.08x slower
- tornado_http: 335 ms +- 4 ms -> 361 ms +- 3 ms: 1.08x slower
- python_startup: 20.6 ms +- 0.6 ms -> 22.1 ms +- 0.9 ms: 1.08x slower
- python_startup_no_site: 8.37 ms +- 0.08 ms -> 9.00 ms +- 0.07 ms: 1.08x slower
- go: 518 ms +- 36 ms -> 557 ms +- 39 ms: 1.07x slower
- raytrace: 1.14 sec +- 0.08 sec -> 1.22 sec +- 0.02 sec: 1.07x slower
- scimark_fft: 594 ms +- 29 ms -> 627 ms +- 13 ms: 1.06x slower

Benchmark hidden because not significant (8): fannkuch, float, json_dumps, nbody, regex_v8, richards, scimark_monte_carlo, scimark_sparse_mat_mult
msg280556 - (view) Author: Roundup Robot (python-dev) Date: 2016-11-11 01:14
New changeset 59b91b4e9506 by Victor Stinner in branch 'default':
Issue #28618: Make hot functions using __attribute__((hot))
https://hg.python.org/cpython/rev/59b91b4e9506
msg280557 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-11 01:49
I tried different patches and ran many quick & dirty benchmarks.

I tried to use likely/unlikely macros (using GCC's __builtin_expect): the effect is not significant on the call_simple microbenchmark. I gave up on this part.

__attribute__((hot)) on a few Python core functions fixes the major slowdown on call_method on the revision 83877018ef97 (described in the initial message).

I noticed tiny differences when using __attribute__((hot)), a speedup in most cases. I sometimes noticed a slowdown, but a very small one (ex: 1%, but 1% on a microbenchmark doesn't mean anything).

I pushed my patch to try to keep stable performance when Python is not compiled with PGO.

If you would like to know more about the crazy effect of code placement in modern Intel CPUs, I suggest looking at the slides of this recent talk by an Intel engineer:
https://llvmdevelopersmeetingbay2016.sched.org/event/8YzY/causes-of-performance-instability-due-to-code-placement-in-x86
"Causes of Performance Swings Due to Code Placement in IA by Zia Ansari (Intel), November 2016"

--

About PGO or not PGO: this question is not simple, I suggest discussing it in a different place so as not to flood this issue ;-)

For my use case, I'm not convinced yet that PGO with our current build system produces reliable performance.

Not all Linux distributions compile Python using PGO: Fedora and RHEL don't, for example. Bugzilla for Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=613045

I guess that there are also some developers running benchmarks on Python compiled with "./configure && make". I'm trying to enhance the documentation and tools around Python benchmarks to advise developers to use LTO and/or PGO.
msg280568 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-11 09:10
Final result on speed-python:

haypo@speed-python$ python3 -m perf compare_to json_8nov/2016-11-10_15-39-default-8ebaa546a033.json 2016-11-11_02-13-default-59b91b4e9506.json -G

Slower (12):
- scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x slower
- nbody: 244 ms +- 2 ms -> 252 ms +- 4 ms: 1.03x slower
- json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower
- fannkuch: 1.07 sec +- 0.01 sec -> 1.09 sec +- 0.01 sec: 1.01x slower
- scimark_lu: 502 ms +- 19 ms -> 509 ms +- 12 ms: 1.01x slower
- chaos: 302 ms +- 3 ms -> 305 ms +- 3 ms: 1.01x slower
- xml_etree_iterparse: 224 ms +- 3 ms -> 226 ms +- 6 ms: 1.01x slower
- regex_dna: 299 ms +- 1 ms -> 300 ms +- 1 ms: 1.00x slower
- pickle_list: 9.21 us +- 0.33 us -> 9.24 us +- 0.56 us: 1.00x slower
- crypto_pyaes: 245 ms +- 1 ms -> 246 ms +- 2 ms: 1.00x slower
- meteor_contest: 219 ms +- 1 ms -> 219 ms +- 1 ms: 1.00x slower
- unpack_sequence: 128 ns +- 2 ns -> 128 ns +- 0 ns: 1.00x slower

Faster (39):
- logging_silent: 997 ns +- 40 ns -> 803 ns +- 13 ns: 1.24x faster
- regex_effbot: 6.16 ms +- 0.24 ms -> 5.17 ms +- 0.27 ms: 1.19x faster
- mako: 45.9 ms +- 0.7 ms -> 42.9 ms +- 0.6 ms: 1.07x faster
- xml_etree_process: 253 ms +- 4 ms -> 237 ms +- 4 ms: 1.07x faster
- call_simple: 13.9 ms +- 0.3 ms -> 13.1 ms +- 0.4 ms: 1.06x faster
- spectral_norm: 274 ms +- 2 ms -> 260 ms +- 2 ms: 1.05x faster
- xml_etree_generate: 300 ms +- 4 ms -> 285 ms +- 5 ms: 1.05x faster
- call_method_slots: 17.1 ms +- 0.2 ms -> 16.2 ms +- 0.3 ms: 1.05x faster
- telco: 21.8 ms +- 0.5 ms -> 20.7 ms +- 0.3 ms: 1.05x faster
- call_method: 17.3 ms +- 0.3 ms -> 16.5 ms +- 0.2 ms: 1.05x faster
- pickle_pure_python: 1.42 ms +- 0.02 ms -> 1.36 ms +- 0.03 ms: 1.04x faster
- pathlib: 51.9 ms +- 0.8 ms -> 50.6 ms +- 0.4 ms: 1.03x faster
- xml_etree_parse: 295 ms +- 8 ms -> 287 ms +- 7 ms: 1.03x faster
- chameleon: 31.0 ms +- 0.3 ms -> 30.2 ms +- 0.2 ms: 1.03x faster
- deltablue: 19.3 ms +- 0.3 ms -> 18.8 ms +- 0.2 ms: 1.02x faster
- django_template: 484 ms +- 4 ms -> 472 ms +- 5 ms: 1.02x faster
- call_method_unknown: 18.7 ms +- 0.2 ms -> 18.3 ms +- 0.2 ms: 1.02x faster
- html5lib: 261 ms +- 6 ms -> 256 ms +- 6 ms: 1.02x faster
- unpickle_pure_python: 973 us +- 12 us -> 954 us +- 15 us: 1.02x faster
- regex_v8: 47.6 ms +- 0.8 ms -> 46.7 ms +- 0.4 ms: 1.02x faster
- richards: 202 ms +- 4 ms -> 198 ms +- 5 ms: 1.02x faster
- logging_simple: 37.8 us +- 0.6 us -> 37.1 us +- 0.4 us: 1.02x faster
- sympy_integrate: 50.8 ms +- 0.9 ms -> 49.9 ms +- 1.4 ms: 1.02x faster
- dulwich_log: 189 ms +- 2 ms -> 186 ms +- 1 ms: 1.02x faster
- sqlalchemy_declarative: 343 ms +- 3 ms -> 339 ms +- 3 ms: 1.01x faster
- hexiom: 25.0 ms +- 0.1 ms -> 24.7 ms +- 0.1 ms: 1.01x faster
- logging_format: 44.6 us +- 0.6 us -> 44.1 us +- 0.6 us: 1.01x faster
- 2to3: 787 ms +- 4 ms -> 777 ms +- 4 ms: 1.01x faster
- tornado_http: 440 ms +- 4 ms -> 435 ms +- 4 ms: 1.01x faster
- json_dumps: 30.7 ms +- 0.4 ms -> 30.5 ms +- 0.3 ms: 1.01x faster
- go: 637 ms +- 10 ms -> 632 ms +- 8 ms: 1.01x faster
- regex_compile: 397 ms +- 2 ms -> 394 ms +- 3 ms: 1.01x faster
- nqueens: 266 ms +- 2 ms -> 264 ms +- 2 ms: 1.01x faster
- python_startup: 16.8 ms +- 0.0 ms -> 16.7 ms +- 0.0 ms: 1.01x faster
- python_startup_no_site: 9.91 ms +- 0.01 ms -> 9.86 ms +- 0.01 ms: 1.01x faster
- scimark_sor: 513 ms +- 13 ms -> 510 ms +- 8 ms: 1.01x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.40 sec +- 0.02 sec: 1.00x faster
- genshi_text: 95.2 ms +- 1.1 ms -> 94.7 ms +- 0.8 ms: 1.00x faster
- sympy_str: 529 ms +- 5 ms -> 528 ms +- 4 ms: 1.00x faster

Benchmark hidden because not significant (13): float, genshi_xml, pickle, pickle_dict, pidigits, scimark_fft, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_sum, unpickle, unpickle_list
msg280606 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-11 19:52
> - json_loads: 71.4 us +- 0.8 us -> 72.9 us +- 1.4 us: 1.02x slower

Hum, sadly this benchmark is still unstable after my change 59b91b4e9506 ("Mark hot functions using __attribute__((hot))", oops, I wanted to write Mark, not Make :-/).

This benchmark was around 63.4 us for many months, whereas it reached 72.9 us at rev 59b91b4e9506, and the new run (also using the hot attribute) went back to 63.0 us...

I understand that json_loads depends on the code placement of some other functions which are not currently marked with the hot attribute.

https://speed.python.org/timeline/#/?exe=4&ben=json_loads&env=1&revs=50&equid=off&quarts=on&extr=on
msg280607 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-11 19:58
> - scimark_sparse_mat_mult: 8.71 ms +- 0.19 ms -> 9.28 ms +- 0.12 ms: 1.07x slower

Same issue on this benchmark:

* average on one year: 8.8 ms
* peak at rev 59b91b4e9506: 9.3 ms
* run after rev 59b91b4e9506: 9.0 ms

The benchmark is unstable, but the difference is small, especially compared to the difference of call_method without the hot attribute.
msg280675 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-11-12 22:25
Can we commit this to 3.6 too?
msg280679 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-12 23:40
> Can we commit this to 3.6 too?

I worked on patches to try to optimize json_loads and regex_effbot as well, but it's still unclear to me how the hot attribute works, and I'm not 100% sure that using the attribute explicitly does not introduce a performance regression.

So I prefer to experiment with such a change in default right now.
msg280748 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2016-11-14 10:41
How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?
msg280764 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-14 12:23
INADA Naoki added the comment:
> How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

I don't understand the effect of the hot attribute well, so I suggest
running benchmarks and checking that it has a non-negligible effect on
benchmarks ;-)
msg280831 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2016-11-15 11:56
> I don't understand well the effect of the hot attribute

I compared the lookdict_unicode_nodummy assembly using `objdump -d dictobject.o`.
It looks exactly the same.

So I think the only difference is placement. Hot functions are put in the .text.hot section and the linker
groups hot functions together. This reduces the possibility of cache conflicts.

When compiling Python with PGO, we can see which functions are hot with objdump.

```
~/work/cpython/Objects$ objdump -tj .text.hot dictobject.o

dictobject.o:     file format elf64-x86-64

SYMBOL TABLE:
0000000000000000 l    d  .text.hot      0000000000000000 .text.hot
00000000000007a0 l     F .text.hot      0000000000000574 lookdict_unicode_nodummy
00000000000046d0 l     F .text.hot      00000000000000e8 free_keys_object
00000000000001c0 l     F .text.hot      0000000000000161 new_keys_object
00000000000003b0 l     F .text.hot      00000000000003e8 insertdict
0000000000001180 l     F .text.hot      000000000000081f dictresize
00000000000019a0 l     F .text.hot      0000000000000165 find_empty_slot.isra.0
0000000000002180 l     F .text.hot      00000000000005f1 lookdict
0000000000001b10 l     F .text.hot      00000000000000c2 unicode_eq
0000000000002780 l     F .text.hot      0000000000000184 dict_traverse
0000000000004c20 l     F .text.hot      00000000000005f7 lookdict_unicode
0000000000006b20 l     F .text.hot      0000000000000330 lookdict_split
...
```

The cold part of a hot function is placed in the .text.unlikely section.

```
$ objdump -t  dictobject.o  | grep lookdict
00000000000007a0 l     F .text.hot      0000000000000574 lookdict_unicode_nodummy
0000000000002180 l     F .text.hot      00000000000005f1 lookdict
000000000000013e l       .text.unlikely 0000000000000000 lookdict_unicode_nodummy.cold.6
0000000000000a38 l       .text.unlikely 0000000000000000 lookdict.cold.15
0000000000004c20 l     F .text.hot      00000000000005f7 lookdict_unicode
0000000000006b20 l     F .text.hot      0000000000000330 lookdict_split
0000000000001339 l       .text.unlikely 0000000000000000 lookdict_unicode.cold.28
0000000000001d01 l       .text.unlikely 0000000000000000 lookdict_split.cold.42
```

All lookdict* functions are put in the hot section, and all of the cold parts are 0 bytes.
It seems PGO puts all lookdict* functions in the hot section.

compiler info:
```
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
```
msg280832 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2016-11-15 12:04
> so I suggest to run benchmarks and check that it has a non negligible effect on benchmarks ;-)

When I added _Py_HOT_FUNCTION to lookdict_unicode, lookdict_unicode_nodummy and lookdict_split
(I can't measure L1 misses via `perf stat -d` because I use EC2 for benchmarks):

$ ~/local/python-master/bin/python3 -m perf compare_to -G all-master.json all-patched.json
Slower (28):
- pybench.CompareFloats: 106 ns +- 1 ns -> 112 ns +- 1 ns: 1.07x slower
- pybench.BuiltinFunctionCalls: 1.62 us +- 0.00 us -> 1.68 us +- 0.03 us: 1.04x slower
- pybench.CompareFloatsIntegers: 180 ns +- 3 ns -> 185 ns +- 3 ns: 1.03x slower
- sympy_sum: 163 ms +- 7 ms -> 167 ms +- 7 ms: 1.03x slower
- deltablue: 13.7 ms +- 0.4 ms -> 14.1 ms +- 0.5 ms: 1.02x slower
- pickle_list: 5.77 us +- 0.09 us -> 5.90 us +- 0.07 us: 1.02x slower
- pybench.PythonFunctionCalls: 1.20 us +- 0.02 us -> 1.22 us +- 0.02 us: 1.02x slower
- pybench.SpecialClassAttribute: 1.46 us +- 0.02 us -> 1.49 us +- 0.03 us: 1.02x slower
- pybench.TryRaiseExcept: 207 ns +- 4 ns -> 210 ns +- 0 ns: 1.02x slower
- pickle_pure_python: 868 us +- 18 us -> 882 us +- 16 us: 1.02x slower
- genshi_text: 56.0 ms +- 0.7 ms -> 56.8 ms +- 0.6 ms: 1.01x slower
- json_dumps: 19.5 ms +- 0.3 ms -> 19.8 ms +- 0.2 ms: 1.01x slower
- richards: 137 ms +- 3 ms -> 139 ms +- 2 ms: 1.01x slower
- sqlalchemy_declarative: 272 ms +- 4 ms -> 276 ms +- 3 ms: 1.01x slower
- pickle_dict: 43.5 us +- 0.4 us -> 44.1 us +- 0.2 us: 1.01x slower
- go: 436 ms +- 4 ms -> 441 ms +- 4 ms: 1.01x slower
- pybench.SecondImport: 2.52 us +- 0.04 us -> 2.55 us +- 0.07 us: 1.01x slower
- pybench.NormalClassAttribute: 1.46 us +- 0.02 us -> 1.47 us +- 0.02 us: 1.01x slower
- genshi_xml: 118 ms +- 2 ms -> 118 ms +- 3 ms: 1.01x slower
- pybench.UnicodePredicates: 75.8 ns +- 0.6 ns -> 76.2 ns +- 0.9 ns: 1.01x slower
- pybench.ListSlicing: 415 us +- 4 us -> 417 us +- 4 us: 1.01x slower
- scimark_fft: 494 ms +- 2 ms -> 496 ms +- 12 ms: 1.01x slower
- logging_format: 23.7 us +- 0.3 us -> 23.9 us +- 0.4 us: 1.00x slower
- chaos: 200 ms +- 1 ms -> 201 ms +- 1 ms: 1.00x slower
- pybench.StringPredicates: 509 ns +- 3 ns -> 511 ns +- 4 ns: 1.00x slower
- call_method: 13.6 ms +- 0.1 ms -> 13.7 ms +- 0.2 ms: 1.00x slower
- pybench.StringSlicing: 530 ns +- 3 ns -> 532 ns +- 8 ns: 1.00x slower
- pybench.SimpleLongArithmetic: 535 ns +- 2 ns -> 536 ns +- 4 ns: 1.00x slower

Faster (47):
- html5lib: 169 ms +- 7 ms -> 158 ms +- 6 ms: 1.07x faster
- pybench.ConcatUnicode: 57.3 ns +- 3.0 ns -> 55.8 ns +- 1.3 ns: 1.03x faster
- pybench.IfThenElse: 60.5 ns +- 1.0 ns -> 59.0 ns +- 0.7 ns: 1.02x faster
- logging_silent: 606 ns +- 14 ns -> 593 ns +- 13 ns: 1.02x faster
- scimark_lu: 411 ms +- 5 ms -> 404 ms +- 4 ms: 1.02x faster
- pathlib: 29.1 ms +- 0.3 ms -> 28.7 ms +- 0.5 ms: 1.02x faster
- pybench.CreateStringsWithConcat: 2.87 us +- 0.01 us -> 2.82 us +- 0.00 us: 1.02x faster
- pybench.DictCreation: 621 ns +- 10 ns -> 612 ns +- 8 ns: 1.01x faster
- meteor_contest: 167 ms +- 5 ms -> 164 ms +- 5 ms: 1.01x faster
- unpickle_pure_python: 656 us +- 19 us -> 647 us +- 9 us: 1.01x faster
- pybench.NestedForLoops: 20.2 ns +- 0.1 ns -> 20.0 ns +- 0.1 ns: 1.01x faster
- regex_effbot: 4.01 ms +- 0.07 ms -> 3.95 ms +- 0.06 ms: 1.01x faster
- pybench.CreateUnicodeWithConcat: 57.1 ns +- 0.2 ns -> 56.4 ns +- 0.2 ns: 1.01x faster
- chameleon: 18.3 ms +- 0.2 ms -> 18.0 ms +- 0.3 ms: 1.01x faster
- python_startup: 13.7 ms +- 0.1 ms -> 13.5 ms +- 0.1 ms: 1.01x faster
- pybench.SmallTuples: 967 ns +- 6 ns -> 955 ns +- 8 ns: 1.01x faster
- pybench.TryFinally: 200 ns +- 3 ns -> 198 ns +- 2 ns: 1.01x faster
- pybench.SimpleIntegerArithmetic: 425 ns +- 3 ns -> 420 ns +- 4 ns: 1.01x faster
- pybench.Recursion: 1.34 us +- 0.02 us -> 1.33 us +- 0.03 us: 1.01x faster
- pybench.SimpleIntFloatArithmetic: 424 ns +- 1 ns -> 420 ns +- 1 ns: 1.01x faster
- float: 222 ms +- 2 ms -> 220 ms +- 3 ms: 1.01x faster
- 2to3: 531 ms +- 4 ms -> 527 ms +- 5 ms: 1.01x faster
- python_startup_no_site: 8.30 ms +- 0.04 ms -> 8.23 ms +- 0.05 ms: 1.01x faster
- xml_etree_parse: 196 ms +- 5 ms -> 194 ms +- 2 ms: 1.01x faster
- pybench.ComplexPythonFunctionCalls: 794 ns +- 7 ns -> 788 ns +- 7 ns: 1.01x faster
- logging_simple: 20.4 us +- 0.2 us -> 20.3 us +- 0.4 us: 1.01x faster
- fannkuch: 795 ms +- 9 ms -> 790 ms +- 3 ms: 1.01x faster
- hexiom: 18.7 ms +- 0.1 ms -> 18.6 ms +- 0.1 ms: 1.01x faster
- regex_compile: 322 ms +- 9 ms -> 320 ms +- 8 ms: 1.01x faster
- mako: 36.0 ms +- 0.1 ms -> 35.8 ms +- 0.2 ms: 1.01x faster
- pybench.UnicodeProperties: 91.7 ns +- 0.9 ns -> 91.1 ns +- 0.8 ns: 1.01x faster
- pybench.SimpleComplexArithmetic: 577 ns +- 8 ns -> 573 ns +- 3 ns: 1.01x faster
- xml_etree_process: 147 ms +- 2 ms -> 146 ms +- 2 ms: 1.01x faster
- pybench.CompareUnicode: 22.4 ns +- 0.1 ns -> 22.2 ns +- 0.1 ns: 1.01x faster
- crypto_pyaes: 175 ms +- 1 ms -> 174 ms +- 1 ms: 1.01x faster
- unpickle_list: 5.43 us +- 0.04 us -> 5.41 us +- 0.02 us: 1.01x faster
- pybench.WithFinally: 257 ns +- 4 ns -> 256 ns +- 2 ns: 1.01x faster
- xml_etree_generate: 183 ms +- 2 ms -> 182 ms +- 2 ms: 1.00x faster
- pybench.WithRaiseExcept: 475 ns +- 4 ns -> 472 ns +- 6 ns: 1.00x faster
- pybench.SecondPackageImport: 2.85 us +- 0.08 us -> 2.84 us +- 0.09 us: 1.00x faster
- pybench.SimpleListManipulation: 444 ns +- 1 ns -> 442 ns +- 2 ns: 1.00x faster
- spectral_norm: 208 ms +- 2 ms -> 208 ms +- 1 ms: 1.00x faster
- pybench.ForLoops: 8.95 ns +- 0.19 ns -> 8.94 ns +- 0.01 ns: 1.00x faster
- scimark_sor: 371 ms +- 3 ms -> 371 ms +- 2 ms: 1.00x faster
- scimark_sparse_mat_mult: 5.61 ms +- 0.06 ms -> 5.61 ms +- 0.36 ms: 1.00x faster
- pybench.UnicodeMappings: 40.7 us +- 0.1 us -> 40.7 us +- 0.0 us: 1.00x faster
- pybench.CompareStrings: 22.2 ns +- 0.0 ns -> 22.2 ns +- 0.0 ns: 1.00x faster

Benchmark hidden because not significant (47): call_method_slots, call_method_unknown, call_simple, django_template, dulwich_log, json_loads, nbody, nqueens, pickle, pidigits, pybench.BuiltinMethodLookup, pybench.CompareIntegers, pybench.CompareInternedStrings, pybench.CompareLongs, pybench.ConcatStrings, pybench.CreateInstances, pybench.CreateNewInstances, pybench.DictWithFloatKeys, pybench.DictWithIntegerKeys, pybench.DictWithStringKeys, pybench.NestedListComprehensions, pybench.NormalInstanceAttribute, pybench.PythonMethodCalls, pybench.SecondSubmoduleImport, pybench.SimpleDictManipulation, pybench.SimpleFloatArithmetic, pybench.SimpleListComprehensions, pybench.SmallLists, pybench.SpecialInstanceAttribute, pybench.StringMappings, pybench.TryExcept, pybench.TupleSlicing, pybench.UnicodeSlicing, raytrace, regex_dna, regex_v8, scimark_monte_carlo, sqlalchemy_imperative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, telco, tornado_http, unpack_sequence, unpickle, xml_etree_iterparse
msg280844 - (view) Author: Roundup Robot (python-dev) Date: 2016-11-15 14:15
New changeset cfc956f13ce2 by Victor Stinner in branch 'default':
Issue #28618: Mark dict lookup functions as hot
https://hg.python.org/cpython/rev/cfc956f13ce2
msg280845 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-15 14:18
> How about marking lookdict_unicode and lookdict_unicode_nodummy as hot?

OK, your benchmark results don't look bad, so I marked the following functions as hot:

- lookdict
- lookdict_unicode
- lookdict_unicode_nodummy
- lookdict_split

It's common to see these functions in the top 3 of "perf report".
msg280846 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-15 14:21
hot3.patch: Mark additional functions as hot

* PyNumber_AsSsize_t()
* _PyUnicode_FromUCS1()
* json: scanstring_unicode()
* siphash24()
* sre_ucs1_match, sre_ucs2_match, sre_ucs4_match

I'm not sure about this patch. It's hard to get reliable benchmark results on microbenchmarks :-/ It's hard to tell whether a speedup comes from the hot attribute or from the compiler deciding on its own to change the code placement. Without the hot attribute, the code placement seems random.
msg280849 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-15 14:28
I wrote hot3.patch when trying to make the following benchmarks more reliable:

- logging_silent: rev 8ebaa546a033 is 20% slower than the average in 2016
- json_loads: rev 0bd618fe0639 is 30% slower and rev 8ebaa546a033 is
15% slower than the average in 2016
- regex_effbot: rev 573bc1f9900e (nov 7) takes 6.0 ms, rev
cf7711887b4a (nov 7) takes 5.2 ms, rev 8ebaa546a033 (nov 10) takes 6.1
ms, etc.
msg280853 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-15 14:40
> * json: scanstring_unicode()

This doesn't look wise. This is specific to a single extension module and perhaps to a single particular benchmark. Most Python code doesn't use json at all.

What is the top of "perf report"? How does this list intersect with the list of functions in the .text.hot section of a PGO build? Make several PGO builds (perhaps on different computers). Is the .text.hot section stable?
msg280859 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-15 15:42
> New changeset cfc956f13ce2 by Victor Stinner in branch 'default':
> Issue #28618: Mark dict lookup functions as hot
> https://hg.python.org/cpython/rev/cfc956f13ce2

Here are benchmark results on the speed-python server:

haypo@speed-python$ PYTHONPATH=~/perf python -m perf compare_to 2016-11-15_09-12-default-ac93d188ebd6.json 2016-11-15_15-13-default-cfc956f13ce2.json -G --min-speed=1
Slower (6):
- json_loads: 62.8 us +- 1.1 us -> 65.8 us +- 2.6 us: 1.05x slower
- nbody: 243 ms +- 2 ms -> 253 ms +- 6 ms: 1.04x slower
- mako: 42.7 ms +- 0.2 ms -> 43.5 ms +- 0.3 ms: 1.02x slower
- chameleon: 29.2 ms +- 0.3 ms -> 29.7 ms +- 0.2 ms: 1.02x slower
- spectral_norm: 261 ms +- 2 ms -> 266 ms +- 3 ms: 1.02x slower
- pickle: 26.6 us +- 0.4 us -> 27.0 us +- 0.4 us: 1.01x slower

Faster (26):
- xml_etree_generate: 290 ms +- 4 ms -> 275 ms +- 3 ms: 1.06x faster
- float: 306 ms +- 5 ms -> 292 ms +- 7 ms: 1.05x faster
- logging_simple: 37.7 us +- 0.4 us -> 36.1 us +- 0.4 us: 1.04x faster
- hexiom: 25.6 ms +- 0.1 ms -> 24.5 ms +- 0.1 ms: 1.04x faster
- regex_effbot: 6.11 ms +- 0.31 ms -> 5.88 ms +- 0.43 ms: 1.04x faster
- sympy_expand: 1.19 sec +- 0.02 sec -> 1.15 sec +- 0.01 sec: 1.04x faster
- telco: 21.5 ms +- 0.4 ms -> 20.8 ms +- 0.4 ms: 1.03x faster
- raytrace: 1.41 sec +- 0.02 sec -> 1.37 sec +- 0.02 sec: 1.03x faster
- scimark_sor: 512 ms +- 11 ms -> 500 ms +- 12 ms: 1.03x faster
- logging_format: 44.6 us +- 0.5 us -> 43.6 us +- 0.7 us: 1.02x faster
- sympy_str: 532 ms +- 4 ms -> 520 ms +- 4 ms: 1.02x faster
- fannkuch: 1.11 sec +- 0.01 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- django_template: 475 ms +- 5 ms -> 467 ms +- 6 ms: 1.02x faster
- chaos: 308 ms +- 2 ms -> 303 ms +- 3 ms: 1.02x faster
- xml_etree_process: 244 ms +- 4 ms -> 240 ms +- 4 ms: 1.02x faster
- xml_etree_iterparse: 225 ms +- 5 ms -> 221 ms +- 4 ms: 1.02x faster
- pathlib: 51.1 ms +- 0.5 ms -> 50.3 ms +- 0.5 ms: 1.02x faster
- sqlite_synth: 10.5 us +- 0.2 us -> 10.3 us +- 0.2 us: 1.01x faster
- dulwich_log: 186 ms +- 1 ms -> 184 ms +- 1 ms: 1.01x faster
- sqlalchemy_imperative: 72.5 ms +- 1.6 ms -> 71.5 ms +- 1.6 ms: 1.01x faster
- deltablue: 18.5 ms +- 0.3 ms -> 18.3 ms +- 0.2 ms: 1.01x faster
- tornado_http: 438 ms +- 5 ms -> 433 ms +- 5 ms: 1.01x faster
- json_dumps: 30.4 ms +- 0.4 ms -> 30.1 ms +- 0.4 ms: 1.01x faster
- genshi_xml: 212 ms +- 3 ms -> 210 ms +- 3 ms: 1.01x faster
- scimark_monte_carlo: 273 ms +- 5 ms -> 271 ms +- 5 ms: 1.01x faster
- call_simple: 13.3 ms +- 0.3 ms -> 13.2 ms +- 0.4 ms: 1.01x faster

Benchmark hidden because not significant (32): 2to3, call_method, call_method_slots, call_method_unknown, crypto_pyaes, genshi_text, go, html5lib, logging_silent, meteor_contest, nqueens, pickle_dict, pickle_list, pickle_pure_python, pidigits, python_startup, python_startup_no_site, regex_compile, regex_dna, regex_v8, richards, scimark_fft, scimark_lu, scimark_sparse_mat_mult, sqlalchemy_declarative, sympy_integrate, sympy_sum, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse
msg280860 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-15 15:50
Serhiy Storchaka:
>> * json: scanstring_unicode()
>
> This doesn't look wise. This is specific to single extension module and perhaps to single particular benchmark. Most Python code don't use json at all.

Well, I tried different things to make these benchmarks more stable. I didn't say that we should merge hot3.patch as it is :-) It's just an attempt.


> What is the top of "perf report"?

For json_loads, it's:

 14.99%  _json.cpython-37m-x86_64-linux-gnu.so  scanstring_unicode
  8.34%  python                                 _PyUnicode_FromUCS1
  8.32%  _json.cpython-37m-x86_64-linux-gnu.so  scan_once_unicode
  8.01%  python                                 lookdict_unicode_nodummy
  6.72%  python                                 siphash24
  4.45%  python                                 PyDict_SetItem
  4.26%  python                                 _PyObject_Malloc
  3.38%  python                                 _PyEval_EvalFrameDefault
  3.16%  python                                 _Py_HashBytes
  2.72%  python                                 PyUnicode_New
  2.36%  python                                 PyLong_FromString
  2.25%  python                                 _PyObject_Free
  2.02%  libc-2.19.so                           __memcpy_sse2_unaligned
  1.61%  python                                 PyDict_GetItem
  1.40%  python                                 dictresize
  1.24%  python                                 unicode_hash
  1.11%  libc-2.19.so                           _int_malloc
  1.07%  python                                 unicode_dealloc
  1.00%  python                                 free_keys_object

Result produced with:

   $ perf record ./python ~/performance/performance/benchmarks/bm_json_loads.py --worker -v -l 128 -w0 -n 100                                                                  
   $ perf report                          


> How does this list intersect with the list of functions in the .text.hot section of a PGO build?

I checked which functions are considered "hot" by a PGO build: I found more than 2,000 functions... I'm not interested in tagging so many functions with _Py_HOT_FUNCTION. I would prefer to tag only something like the top 10 or top 25 functions.

I don't know the recommendations for tagging functions as hot. I guess that what matters is the total size of the hot functions. Should it be smaller than the L2 cache? Smaller than the L3 cache? I'm talking about instructions, but data also shares these caches...


> Make several PGO builds (perhaps on different computers). Is the .text.hot section stable?

In my experience PGO builds don't provide stable performance, but I was never able to write an article on that because of so many bugs :-)
msg281459 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-22 10:30
FYI I wrote an article about this issue:
https://haypo.github.io/analysis-python-performance-issue.html

Sadly, it seems like I was just lucky when adding __attribute__((hot)) fixed the issue, because call_method is slow again!

* acde821520fc (Nov 21): 16.3 ms
* 2a14385710dc (Nov 22): 24.6 ms (+51%)
msg281463 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2016-11-22 11:07
Wow. It's sad that the tagged version is accidentally slow...

I want to reproduce it and check `perf record -e L1-icache-load-misses`.
But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counters.
msg281466 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-22 11:47
2016-11-22 12:07 GMT+01:00 INADA Naoki <report@bugs.python.org>:
> I want to reproduce it and check `perf record -e L1-icache-load-misses`.
> But IaaS (EC2, GCE, Azure VM) doesn't support CPU performance counter.

You don't need to go that far to check performance: just run
call_method and check timings. You need to compare multiple
revisions.

The speed.python.org Timeline helps to track performance, get an idea
of the "average performance", and detect spikes.
msg281467 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-22 11:50
Naoki: "Wow. It's sad that tagged version is accidentally slow..."

If you use PGO compilation, for example use "./configure
--enable-optimizations" as suggested by configure if you don't enable
the option, you don't get the issue.

I hope that most Linux distribution use PGO compilation. I'm quite
sure that it's the case for Ubuntu. I don't know for Fedora.
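For context, a from-source PGO (and optionally LTO) build of CPython looks roughly like this; a sketch using the flags that CPython's configure script documents:

```shell
# --enable-optimizations runs the PGO training workload and rebuilds
# with the collected profile; --with-lto adds link-time optimization.
./configure --enable-optimizations --with-lto
make -j"$(nproc)"
```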
msg281473 - (view) Author: INADA Naoki (inada.naoki) * (Python committer) Date: 2016-11-22 12:19
I set up Ubuntu 14.04 on Azure and built Python with neither PGO nor LTO.
But I failed to reproduce it.

@haypo, would you give me two binaries?

$ ~/local/py-2a143/bin/python3 -c 'import sys; print(sys.version)'
3.7.0a0 (default:2a14385710dc, Nov 22 2016, 12:02:34) 
[GCC 4.8.4]

$ ~/local/py-acde8/bin/python3 -c 'import sys; print(sys.version)'                                                                                    
3.7.0a0 (default:acde821520fc, Nov 22 2016, 11:31:16) 
[GCC 4.8.4]

$ ~/local/py-2a143/bin/python3 bm_call_method.py 
.....................
call_method: Median +- std dev: 16.1 ms +- 0.6 ms

$ ~/local/py-acde8/bin/python3 bm_call_method.py                                                                                                      
.....................
call_method: Median +- std dev: 16.1 ms +- 0.7 ms
msg281477 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-11-22 13:17
> But I failed to reproduce it.

Hey, performance issues related to code placement are a mysterious secret :-)
Nobody understands them :-D

The server running the benchmarks has an Intel Xeon CPU from 2011. It seems
like code placement issues matter more on this CPU than on my more
recent laptop or desktop PC.
msg286662 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-02-01 17:21
Victor: "FYI I wrote an article about this issue: https://haypo.github.io/analysis-python-performance-issue.html Sadly, it seems like I was just lucky when adding __attribute__((hot)) fixed the issue, because call_method is slow again!"

I upgraded the speed-python server (running the benchmarks) to Ubuntu 16.04 LTS to support PGO compilation. I removed all old benchmark results and ran the benchmarks again with LTO+PGO. It seems like benchmark results are much better now.

I'm no longer sure that _Py_HOT_FUNCTION is really useful for getting stable benchmarks, but it may help code placement a little bit. I don't think that it hurts, so I suggest keeping it. Since benchmarks were still unstable with _Py_HOT_FUNCTION, I'm not interested in continuing to tag more functions with _Py_HOT_FUNCTION. I will now focus on LTO+PGO for stable benchmarks, and ignore small performance differences when PGO is not used.

I close this issue now.
History
Date User Action Args
2017-05-18 00:42:46  jcea              set     nosy: + jcea
2017-02-01 17:21:32  haypo             set     status: open -> closed
                                               resolution: fixed
                                               messages: + msg286662
2016-11-22 13:17:27  haypo             set     messages: + msg281477
2016-11-22 12:19:34  inada.naoki       set     messages: + msg281473
2016-11-22 11:50:54  haypo             set     messages: + msg281467
2016-11-22 11:47:12  haypo             set     messages: + msg281466
2016-11-22 11:07:14  inada.naoki       set     messages: + msg281463
2016-11-22 10:30:19  haypo             set     messages: + msg281459
2016-11-15 15:50:33  haypo             set     messages: + msg280860
2016-11-15 15:42:10  haypo             set     messages: + msg280859
2016-11-15 14:40:01  serhiy.storchaka  set     messages: + msg280853
2016-11-15 14:28:34  haypo             set     messages: + msg280849
2016-11-15 14:21:57  haypo             set     files: + hot3.patch
                                               messages: + msg280846
2016-11-15 14:18:35  haypo             set     messages: + msg280845
2016-11-15 14:15:28  python-dev        set     messages: + msg280844
2016-11-15 12:04:06  inada.naoki       set     messages: + msg280832
2016-11-15 11:56:36  inada.naoki       set     messages: + msg280831
2016-11-14 12:23:38  haypo             set     messages: + msg280764
2016-11-14 10:41:11  inada.naoki       set     nosy: + inada.naoki
                                               messages: + msg280748
2016-11-12 23:40:38  haypo             set     messages: + msg280679
2016-11-12 22:25:21  yselivanov        set     nosy: + yselivanov
                                               messages: + msg280675
2016-11-11 19:58:32  haypo             set     messages: + msg280607
2016-11-11 19:52:54  haypo             set     messages: + msg280606
2016-11-11 09:10:40  haypo             set     messages: + msg280568
2016-11-11 01:49:03  haypo             set     messages: + msg280557
2016-11-11 01:14:23  python-dev        set     nosy: + python-dev
                                               messages: + msg280556
2016-11-08 21:09:38  haypo             set     files: + patch.json.gz
2016-11-08 21:09:28  haypo             set     files: + pgo.json.gz
                                               messages: + msg280350
2016-11-05 22:53:19  haypo             set     messages: + msg280126
2016-11-05 20:02:31  pitrou            set     messages: + msg280125
2016-11-05 16:14:07  serhiy.storchaka  set     nosy: + serhiy.storchaka
                                               messages: + msg280116
2016-11-05 15:37:45  haypo             set     messages: + msg280115
2016-11-05 09:59:20  pitrou            set     nosy: + pitrou
                                               messages: + msg280108
2016-11-05 09:08:51  haypo             set     messages: + msg280106
2016-11-05 09:07:45  haypo             set     messages: + msg280105
2016-11-05 00:29:04  haypo             create