classification
Title: PROFILE_TASK for PGO build is not a good workload
Type: performance Stage: patch review
Components: Build Versions: Python 3.9, Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: gregory.p.smith, inada.naoki, nascheme, pablogsal, steve.dower, tianon
Priority: normal Keywords: patch

Created on 2019-02-19 20:39 by nascheme, last changed 2019-07-19 17:14 by nascheme.

Pull Requests
URL Status Linked Edit
PR 14702 open nascheme, 2019-07-11 07:46
Messages (12)
msg336018 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-02-19 20:39
I was doing some 'perf' profiling runs of Python.  I decided to try running PROFILE_TASK to see what the profile looks like.  I was surprised that gc_collect dominated runtime:

  Children      Self  Symbol
+   93.93%     6.00%  [.] _PyEval_EvalFrameDefault                
+   76.19%     0.12%  [.] function_code_fastcall                  
+   70.65%     0.31%  [.] _PyMethodDef_RawFastCallKeywords        
+   63.24%     0.13%  [.] _PyCFunction_FastCallKeywords           
+   58.67%     0.36%  [.] _PyEval_EvalCodeWithName                
+   57.45%    23.84%  [.] collect                                 
+   52.89%     0.00%  [.] gc_collect                              
+   52.10%     0.08%  [.] _PyFunction_FastCallDict                
+   41.99%     0.02%  [.] _PyObject_Call_Prepend                  
+   36.37%     0.18%  [.] _PyFunction_FastCallKeywords            
+   20.94%     0.07%  [.] _PyObject_FastCallDict                  
+   19.64%     0.00%  [.] PyObject_Call                           
+   17.74%     0.11%  [.] _PyObject_FastCallKeywords              
+   12.45%     0.00%  [.] slot_tp_call                            
+   12.27%     4.05%  [.] dict_traverse                           
+   11.45%    11.04%  [.] visit_reachable                         
+   11.18%    10.76%  [.] visit_decref                            
+    9.65%     0.11%  [.] type_call                               
+    8.80%     0.83%  [.] func_traverse                           
+    7.78%     0.08%  [.] _PyMethodDescr_FastCallKeywords     

Part of the problem is that we run full cyclic GC for every test.  I.e. cleanup_test_droppings() calls gc.collect().  Maybe we could make these calls conditional on the --pgo flag of regtest.  Or, maybe we need to re-evaluate if running the unit test suite is the best way to generate PGO trace data.

Based on a tiny bit of further investigation, it looks like gc_collect() is getting called quite a lot of times, in addition to cleanup_test_droppings().  Maybe there is some low-hanging fruit here for optimization.  Full GC is pretty expensive.
msg336020 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2019-02-19 20:59
Also, the test suite exercises a lot of branches (like passing incorrect types to function to check errors) that will hurt the branch prediction that PGO generates.
msg347605 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-07-10 07:40
In my experience, people overthink what needs to go into a CPython profiling run.  Sure, our default PROFILE_TASK is rather unfortunate because it takes a very long time by including runs of super slow tests that won't meaningfully contribute profile data (multiprocessing, subprocess, concurrent_futures, heavy I/O, etc).

But despite somewhat popular belief, it is not actually a problem that the suite exercises other sides of branches, because by and large just by executing Python code at all and exercising C extension module code, it still acts like most any Python program and spends most of the time on the common critical paths - regardless tests that trigger specific number of executions of other paths.  Those executions pale in comparison to the ordinary ones in anywhere critical.

I don't recommend making any claim about something "harming" the profile without reliable data to prove it.

Feel free to tune what test.regrtest --pgo enables or disables by default.  But try to do it in a scientific data based manner rather than guessing.  Decreasing the total wall time for a default --enable-optimizations build would be a good thing for everyone, provided the resulting interpreter remains "effectively similar" in speed.  If you somehow manage to find something that actually speeds up the resulting interpreter, amazing!

https://github.com/python/pyperformance is your friend for real world performance measurement.  Patience is key.  The builds and benchmark runs are slow.

One thing --enable-optimizations does _not_ do today is enable link time optimization by default.  Various toolchain+platform versions were having problems successfully generating a working interpreter with LTO enabled.  If you want a potential large speed up in the interpreter, figuring out how to get link time optimization on top of the existing PGO working reliably, detecting when toolchains+platform combos where it will be reliable, and enabling it by default on such systems is _likely_ the largest possible gain still to be had.
msg347649 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-07-11 03:36
> Decreasing the total wall time for a default --enable-optimizations build would 
> be a good thing for everyone, provided the resulting interpreter remains 
> "effectively similar" in speed.  If you somehow manage to find something that
> actually speeds up the resulting interpreter, amazing!

I spent quite a lot of time making different PGO builds and comparing with pyperformance.  The current PGO task is *really* slow.  Just running the PROFILE_TASK takes 24 minutes on my decently fast PC.

Using this set of tests seems to work pretty well:

PROFILE_TASK=-m test.regrtest --pgo \
        test_collections \
        test_dataclasses \
        test_difflib \
        test_embed \
        test_float \
        test_functools \
        test_generators \
        test_int \
        test_itertools \
        test_json \
        test_logging \
        test_long \
        test_ordered_dict \
        test_pickle \
        test_pprint \
        test_re \
        test_set \
        test_statistics \
        test_struct \
        test_tabnanny \
        test_xml_etree

Instead of 24 minutes, the above task takes one and a half minutes.  pyperformance results seem largely unchanged.  Comparison below.  Tuning the tests to get the best pyperformance result is a bit dangerous and perhaps running the whole test suite is safer (i.e. we are not optimizing for specific benchmarks).  I didn't tweak the list too much.  I added test_int, test_long, test_struct and test_itertools as a result of my pyperformance runs.  Not too surprising those are important modules.

I think the set of tests above should do a pretty good job of covering the hot code paths in most Python programs.  So, maybe it is good enough given the massive speedup in build time.



+-------------------------+----------+------------------------------+
| Benchmark               | task-all | task-short                   |
+=========================+==========+==============================+
| 2to3                    | 311 ms   | 315 ms: 1.01x slower (+1%)   |
+-------------------------+----------+------------------------------+
| chaos                   | 111 ms   | 108 ms: 1.02x faster (-2%)   |
+-------------------------+----------+------------------------------+
| crypto_pyaes            | 114 ms   | 112 ms: 1.01x faster (-1%)   |
+-------------------------+----------+------------------------------+
| dulwich_log             | 78.0 ms  | 78.7 ms: 1.01x slower (+1%)  |
+-------------------------+----------+------------------------------+
| fannkuch                | 470 ms   | 452 ms: 1.04x faster (-4%)   |
+-------------------------+----------+------------------------------+
| float                   | 118 ms   | 117 ms: 1.01x faster (-1%)   |
+-------------------------+----------+------------------------------+
| go                      | 253 ms   | 255 ms: 1.01x slower (+1%)   |
+-------------------------+----------+------------------------------+
| json_dumps              | 12.5 ms  | 11.8 ms: 1.06x faster (-6%)  |
+-------------------------+----------+------------------------------+
| json_loads              | 26.3 us  | 25.4 us: 1.04x faster (-3%)  |
+-------------------------+----------+------------------------------+
| logging_format          | 9.53 us  | 9.66 us: 1.01x slower (+1%)  |
+-------------------------+----------+------------------------------+
| logging_silent          | 198 ns   | 196 ns: 1.01x faster (-1%)   |
+-------------------------+----------+------------------------------+
| mako                    | 15.2 ms  | 15.8 ms: 1.04x slower (+4%)  |
+-------------------------+----------+------------------------------+
| meteor_contest          | 98.2 ms  | 96.8 ms: 1.01x faster (-1%)  |
+-------------------------+----------+------------------------------+
| nbody                   | 135 ms   | 133 ms: 1.01x faster (-1%)   |
+-------------------------+----------+------------------------------+
| nqueens                 | 97.2 ms  | 96.6 ms: 1.01x faster (-1%)  |
+-------------------------+----------+------------------------------+
| pathlib                 | 19.4 ms  | 19.7 ms: 1.02x slower (+2%)  |
+-------------------------+----------+------------------------------+
| pickle                  | 8.10 us  | 9.07 us: 1.12x slower (+12%) |
+-------------------------+----------+------------------------------+
| pickle_dict             | 23.1 us  | 18.6 us: 1.25x faster (-20%) |
+-------------------------+----------+------------------------------+
| pickle_list             | 3.64 us  | 2.81 us: 1.30x faster (-23%) |
+-------------------------+----------+------------------------------+
| pickle_pure_python      | 470 us   | 460 us: 1.02x faster (-2%)   |
+-------------------------+----------+------------------------------+
| pidigits                | 169 ms   | 173 ms: 1.02x slower (+2%)   |
+-------------------------+----------+------------------------------+
| python_startup          | 7.94 ms  | 8.02 ms: 1.01x slower (+1%)  |
+-------------------------+----------+------------------------------+
| python_startup_no_site  | 5.44 ms  | 5.49 ms: 1.01x slower (+1%)  |
+-------------------------+----------+------------------------------+
| raytrace                | 495 ms   | 490 ms: 1.01x faster (-1%)   |
+-------------------------+----------+------------------------------+
| regex_dna               | 172 ms   | 179 ms: 1.04x slower (+4%)   |
+-------------------------+----------+------------------------------+
| regex_effbot            | 2.95 ms  | 2.85 ms: 1.04x faster (-3%)  |
+-------------------------+----------+------------------------------+
| regex_v8                | 20.7 ms  | 21.5 ms: 1.04x slower (+4%)  |
+-------------------------+----------+------------------------------+
| richards                | 68.9 ms  | 69.8 ms: 1.01x slower (+1%)  |
+-------------------------+----------+------------------------------+
| scimark_sparse_mat_mult | 4.57 ms  | 4.29 ms: 1.07x faster (-6%)  |
+-------------------------+----------+------------------------------+
| spectral_norm           | 134 ms   | 133 ms: 1.01x faster (-1%)   |
+-------------------------+----------+------------------------------+
| sqlalchemy_declarative  | 161 ms   | 163 ms: 1.01x slower (+1%)   |
+-------------------------+----------+------------------------------+
| sqlalchemy_imperative   | 30.6 ms  | 31.0 ms: 1.01x slower (+1%)  |
+-------------------------+----------+------------------------------+
| sqlite_synth            | 2.90 us  | 2.95 us: 1.02x slower (+2%)  |
+-------------------------+----------+------------------------------+
| sympy_expand            | 422 ms   | 418 ms: 1.01x faster (-1%)   |
+-------------------------+----------+------------------------------+
| sympy_integrate         | 19.0 ms  | 19.2 ms: 1.01x slower (+1%)  |
+-------------------------+----------+------------------------------+
| sympy_sum               | 89.6 ms  | 91.7 ms: 1.02x slower (+2%)  |
+-------------------------+----------+------------------------------+
| telco                   | 6.06 ms  | 6.28 ms: 1.04x slower (+4%)  |
+-------------------------+----------+------------------------------+
| tornado_http            | 178 ms   | 181 ms: 1.02x slower (+2%)   |
+-------------------------+----------+------------------------------+
| unpickle_list           | 3.97 us  | 3.78 us: 1.05x faster (-5%)  |
+-------------------------+----------+------------------------------+
| unpickle_pure_python    | 326 us   | 324 us: 1.00x faster (-0%)   |
+-------------------------+----------+------------------------------+
| xml_etree_generate      | 90.6 ms  | 91.0 ms: 1.00x slower (+0%)  |
+-------------------------+----------+------------------------------+
| xml_etree_process       | 72.0 ms  | 71.4 ms: 1.01x faster (-1%)  |
+-------------------------+----------+------------------------------+

Not significant (15): deltablue; django_template; hexiom; html5lib; logging_simple; regex_compile; scimark_fft; scimark_lu; scimark_monte_carlo; scimark_sor; sympy_str; unpack_sequence; unpickle; xml_etree_parse; xml_etree_iterparse
msg347650 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-07-11 04:13
yeah I pulled a similar looking list of tests for the PROFILE_TASK to suggest to the Docker Python image maintainers - see https://github.com/docker-library/python/pull/404 and https://github.com/docker-library/python/issues/160.

I haven't run comparisons of that vs the default overly long profile task, but a comparison against a non PGO enabled cpython 3.7 build shows the types of speedups i'd expect from a decent PGO build.

doing proper comparisons against a full build would take a lot more time than I have right now.

It'd be reasonable to change the default task to something that is faster to build so long as we don't appear to be making major sacrifices in speed somewhere.

(your benchmark table's pickle result changes are inconsistent and odd, but that benchmark may not be very meaningful as written, i haven't looked at it)
msg347660 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-07-11 08:16
I tweaked the list of unit tests a little more, trying to incorporate some from your Docker build settings.  Not sure what's going on with the pickle results.  Below are new pyperformance runs, comparing my PR to the "master" version it is based on.

I added the --with-profile-task configure option.  If someone really wants to wait for a long build to gain a tiny bit of extra speed, that makes it pretty easy to do so.

BTW, I tried making the --pgo disable most of the gc_collect() calls.  It doesn't actually make much difference in profile generation time so I discarded those changes.
msg347661 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-07-11 08:20
+-------------------------+-----------+-----------------------------+
| Benchmark               | task-all2 | task-short7                 |
+=========================+===========+=============================+
| 2to3                    | 304 ms    | 305 ms: 1.00x slower (+0%)  |
+-------------------------+-----------+-----------------------------+
| crypto_pyaes            | 109 ms    | 107 ms: 1.01x faster (-1%)  |
+-------------------------+-----------+-----------------------------+
| django_template         | 116 ms    | 109 ms: 1.06x faster (-6%)  |
+-------------------------+-----------+-----------------------------+
| dulwich_log             | 76.8 ms   | 77.0 ms: 1.00x slower (+0%) |
+-------------------------+-----------+-----------------------------+
| fannkuch                | 454 ms    | 449 ms: 1.01x faster (-1%)  |
+-------------------------+-----------+-----------------------------+
| float                   | 113 ms    | 112 ms: 1.02x faster (-2%)  |
+-------------------------+-----------+-----------------------------+
| hexiom                  | 9.50 ms   | 9.56 ms: 1.01x slower (+1%) |
+-------------------------+-----------+-----------------------------+
| html5lib                | 93.4 ms   | 95.4 ms: 1.02x slower (+2%) |
+-------------------------+-----------+-----------------------------+
| json_loads              | 24.7 us   | 26.5 us: 1.07x slower (+7%) |
+-------------------------+-----------+-----------------------------+
| logging_format          | 9.55 us   | 9.63 us: 1.01x slower (+1%) |
+-------------------------+-----------+-----------------------------+
| logging_simple          | 8.54 us   | 8.72 us: 1.02x slower (+2%) |
+-------------------------+-----------+-----------------------------+
| nbody                   | 131 ms    | 130 ms: 1.01x faster (-1%)  |
+-------------------------+-----------+-----------------------------+
| nqueens                 | 93.6 ms   | 92.8 ms: 1.01x faster (-1%) |
+-------------------------+-----------+-----------------------------+
| pathlib                 | 19.0 ms   | 19.1 ms: 1.01x slower (+1%) |
+-------------------------+-----------+-----------------------------+
| pickle_list             | 2.72 us   | 2.66 us: 1.02x faster (-2%) |
+-------------------------+-----------+-----------------------------+
| pidigits                | 168 ms    | 168 ms: 1.00x faster (-0%)  |
+-------------------------+-----------+-----------------------------+
| python_startup          | 7.89 ms   | 7.89 ms: 1.00x slower (+0%) |
+-------------------------+-----------+-----------------------------+
| python_startup_no_site  | 5.41 ms   | 5.41 ms: 1.00x slower (+0%) |
+-------------------------+-----------+-----------------------------+
| raytrace                | 470 ms    | 481 ms: 1.02x slower (+2%)  |
+-------------------------+-----------+-----------------------------+
| regex_compile           | 168 ms    | 169 ms: 1.01x slower (+1%)  |
+-------------------------+-----------+-----------------------------+
| regex_dna               | 168 ms    | 173 ms: 1.03x slower (+3%)  |
+-------------------------+-----------+-----------------------------+
| regex_effbot            | 2.91 ms   | 2.72 ms: 1.07x faster (-7%) |
+-------------------------+-----------+-----------------------------+
| regex_v8                | 20.4 ms   | 21.1 ms: 1.03x slower (+3%) |
+-------------------------+-----------+-----------------------------+
| richards                | 67.6 ms   | 66.9 ms: 1.01x faster (-1%) |
+-------------------------+-----------+-----------------------------+
| scimark_fft             | 341 ms    | 339 ms: 1.01x faster (-1%)  |
+-------------------------+-----------+-----------------------------+
| scimark_monte_carlo     | 101 ms    | 98.2 ms: 1.03x faster (-3%) |
+-------------------------+-----------+-----------------------------+
| scimark_sparse_mat_mult | 4.05 ms   | 3.97 ms: 1.02x faster (-2%) |
+-------------------------+-----------+-----------------------------+
| spectral_norm           | 128 ms    | 129 ms: 1.01x slower (+1%)  |
+-------------------------+-----------+-----------------------------+
| sqlalchemy_imperative   | 30.0 ms   | 30.4 ms: 1.01x slower (+1%) |
+-------------------------+-----------+-----------------------------+
| sympy_expand            | 408 ms    | 411 ms: 1.01x slower (+1%)  |
+-------------------------+-----------+-----------------------------+
| sympy_sum               | 90.1 ms   | 89.6 ms: 1.01x faster (-1%) |
+-------------------------+-----------+-----------------------------+
| sympy_str               | 182 ms    | 180 ms: 1.01x faster (-1%)  |
+-------------------------+-----------+-----------------------------+
| tornado_http            | 175 ms    | 179 ms: 1.03x slower (+3%)  |
+-------------------------+-----------+-----------------------------+
| unpack_sequence         | 50.0 ns   | 47.5 ns: 1.05x faster (-5%) |
+-------------------------+-----------+-----------------------------+
| unpickle                | 12.0 us   | 12.3 us: 1.02x slower (+2%) |
+-------------------------+-----------+-----------------------------+
| unpickle_list           | 3.78 us   | 3.89 us: 1.03x slower (+3%) |
+-------------------------+-----------+-----------------------------+
| unpickle_pure_python    | 320 us    | 322 us: 1.01x slower (+1%)  |
+-------------------------+-----------+-----------------------------+
| xml_etree_parse         | 128 ms    | 127 ms: 1.01x faster (-1%)  |
+-------------------------+-----------+-----------------------------+
| xml_etree_generate      | 88.2 ms   | 86.9 ms: 1.02x faster (-2%) |
+-------------------------+-----------+-----------------------------+
| xml_etree_process       | 68.8 ms   | 69.6 ms: 1.01x slower (+1%) |
+-------------------------+-----------+-----------------------------+

Not significant (17): chaos; deltablue; go; json_dumps; logging_silent; mako; meteor_contest; pickle; pickle_dict; pickle_pure_python; scimark_lu; scimark_sor; sqlalchemy_declarative; sqlite_synth; sympy_integrate; telco; xml_etree_iterparse
msg347889 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-07-14 08:51
Looking at PR 14702, I would much rather update what `-m test --pgo` does than make this exclusive to Makefile.

Back when we added the --pgo flag, the intent was always to move toward an equivalently good profile, but for practical reasons we settled at the time for "skip tests that break PGO".

I know that will make the change far more skips throughout the test suite, so perhaps we want to change how --pgo works completely, but it would certainly be better to make --pgo do what it advertises than have multiple lists of PGO profiles.
msg347917 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2019-07-14 14:19
As far as where to put the lists of tests, you're probably right.  putting it within test.regrtest itself under the --pgo banner makes sense.  (though we should keep logic to accept a list of explicit tests to add or exclude if the user has also provided those)
msg347994 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-07-15 22:34
Thanks for the feedback.  I agree that putting the tests in regrtest is better.  I've made the following changes:

- the --pgo option now uses the shorter list of tests by default

- I added --pgo-extended as a regrtest option that enables the old behavior of running most of the tests.  I considered being more fancy and having named sets of different tests (like the -u option) but I think just the two options (fast or slow) is good enough.  I used "pgo-extended" because I think "fast" or "slow" is confusing (i.e. build is slower but hopefully resulting binary is faster).

- You can still pass an explicit list of tests if you use the --pgo option.  That should avoid breaking builds like the Docker one.

- You can now configure the the PGO task using 'configure'.  I think that is cleaner than trying to modify the Makefile after the fact. e.g. 

   ./configure [...] --with-profile-task='-m test --pgo-extended'
msg348151 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-07-19 07:33
>    ./configure [...] --with-profile-task='-m test --pgo-extended'

I think this is abusing of `--with` options.  `--with-*` is for external software [1].  But `--with-lto` option abuses it already.

How about `./configure PROFILE_TASKS="-m test --pgo-extend"`?


[1]: https://www.gnu.org/software/autoconf/manual/autoconf-2.66/html_node/External-Software.html#External-Software
msg348187 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2019-07-19 17:14
I changed configure.in to use AC_ARG_VAR instead.  So, you can override it as you suggest:

  ./configure [..] PROFILE_TASK="-m test --pgo-extended"
History
Date User Action Args
2019-07-19 17:14:36naschemesetmessages: + msg348187
2019-07-19 07:33:49inada.naokisetmessages: + msg348151
2019-07-15 22:34:54naschemesetmessages: + msg347994
2019-07-14 14:19:03gregory.p.smithsetmessages: + msg347917
2019-07-14 08:51:37steve.dowersetnosy: + steve.dower
messages: + msg347889
2019-07-11 08:20:24naschemesetmessages: + msg347661
2019-07-11 08:16:25naschemesetmessages: + msg347660
2019-07-11 07:46:27naschemesetkeywords: + patch
stage: patch review
pull_requests: + pull_request14501
2019-07-11 04:13:18gregory.p.smithsetmessages: + msg347650
2019-07-11 03:36:07naschemesetmessages: + msg347649
2019-07-10 07:40:42gregory.p.smithsetversions: + Python 3.8, Python 3.9
nosy: + gregory.p.smith

messages: + msg347605

components: + Build
2019-07-09 23:02:02tianonsetnosy: + tianon
2019-04-05 12:57:33inada.naokisetnosy: + inada.naoki
2019-02-19 20:59:46pablogsalsetnosy: + pablogsal
messages: + msg336020
2019-02-19 20:39:50naschemecreate