Message 310820 - Python tracker

Author	vstinner
Recipients	gvanrossum, mark.dickinson, methane, ned.deily, skrah, vstinner, yselivanov
Date	2018-01-26.23:11:00
Message-id	<1517008260.68.0.467229070634.issue32630@psf.upfronthosting.co.za>

Content

Since the root of the discussion is a performance regression, let me take a look, since I also care about not regressing in terms of performance. We (CPython core developers, as a team) spent a lot of time optimizing CPython to make benchmarks like telco faster at each release. The good news is that Python 3.7 *is* faster than Python 3.6 on telco. If I recall correctly, it's not because of recent optimizations in the decimal module, but because of more general changes like the CALL_METHOD optimization!

Python master vs 3.6 (normalized on 3.6):

https://speed.python.org/comparison/?exe=12%2BL%2Bmaster%2C12%2BL%2B3.6&ben=670&env=1%2C2&hor=false&bas=12%2BL%2B3.6&chart=normal+bars

Graph of telco performance on master from April 2014 to January 2017:
https://speed.python.org/timeline/#/?exe=12&ben=telco&env=1&revs=50&equid=off&quarts=on&extr=on

20.2 ms => 14.1 ms, well done!

If you are curious about the reasons why Python became faster, see my documentation:
http://pyperformance.readthedocs.io/cpython_results_2017.html

Or even my talk at PyCon 2017:
https://www.youtube.com/watch?v=d65dCD3VH9Q&t=957s

Sorry, I moved off topic. Let's get back to measuring the performance impact of this issue...

--

I rewrote xwith.py using my perf module to use CPU pinning (on my isolated CPUs), calibrate the number of loops automatically, ignore the first "warmup" value, spawn 20 processes, compute the average and standard deviation, etc. => see attached xwidth2.py

Results on my laptop with 2 physical cores isolated for best benchmark stability (*):

vstinner@apu$ ./python -m perf compare_to master.json pr5278.json 
Mean +- std dev: [master] 1.86 us +- 0.03 us -> [pr5278] 2.27 us +- 0.04 us: 1.22x slower (+22%)

Note: master is the commit 29a7df78277447cf6b898dfa0b1b42f8da7abc0c and I rebased PR 5278 on top of this commit.

(*) http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results
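For readers who want to check the "1.22x slower (+22%)" figure by hand, here is a minimal sketch of the arithmetic applied to the two means reported above. The helper name `slowdown` is hypothetical, and this only illustrates the ratio; it is not a claim about how the perf module computes or rounds it internally.

```python
def slowdown(reference_us, changed_us):
    """Return (ratio, percent change) of the changed build relative to the reference."""
    ratio = changed_us / reference_us
    percent = (ratio - 1.0) * 100.0
    return ratio, percent


# Means taken from the compare_to output above: master vs. PR 5278.
ratio, percent = slowdown(1.86, 2.27)
print(f"{ratio:.2f}x slower ({percent:+.0f}%)")  # prints: 1.22x slower (+22%)
```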

This is obviously the *worst* case: a *micro* benchmark using local contexts and modifying this local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modifying a context.
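To give an idea of what such a worst-case microbenchmark looks like, here is a rough sketch. It is not the attached script: it uses the standard library's timeit instead of the perf module, does no CPU pinning and spawns no worker processes, so its numbers are much noisier. The function names are made up for illustration.

```python
import timeit
from decimal import Decimal, localcontext


def with_local_context():
    # Worst case: enter a local context, modify it, do one tiny operation.
    with localcontext() as ctx:
        ctx.prec = 10
        return Decimal(1) + Decimal(2)


def without_local_context():
    # Same tiny operation, using the current context unchanged.
    return Decimal(1) + Decimal(2)


def best_time_per_call(func, number=20_000, repeat=5):
    # Take the minimum over several runs to reduce the effect of system noise.
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


t_with = best_time_per_call(with_local_context)
t_without = best_time_per_call(without_local_context)
print(f"with localcontext():    {t_with * 1e6:.2f} us per call")
print(f"without localcontext(): {t_without * 1e6:.2f} us per call")
```

Running the same script on two builds (for example master and the PR branch) and comparing the "with localcontext()" line gives a crude version of the comparison above.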

The question here is whether the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.).

Except for a few small projects, I rarely use decimal, so I'm unable to judge that.

But just to add my 2 cents: I have never used "with localcontext()", and I don't see the point of this tool in my short applications. I prefer to directly modify the current context (getcontext()), and to modify it only *once*, at startup. For example, set the rounding mode and the precision, and that's all.
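To make the two styles concrete, here is a small sketch using only the documented decimal API. The precision values and the function name `configure_once` are arbitrary choices for illustration.

```python
from decimal import Decimal, getcontext, localcontext, ROUND_HALF_UP


def configure_once():
    # Style 1: modify the current context directly, once, at startup.
    ctx = getcontext()
    ctx.prec = 10
    ctx.rounding = ROUND_HALF_UP


configure_once()
global_result = Decimal(1) / Decimal(3)   # uses prec=10

# Style 2: temporarily override settings with a local context.
with localcontext() as ctx:
    ctx.prec = 4
    local_result = Decimal(1) / Decimal(3)  # uses prec=4 inside the block

after_result = Decimal(1) / Decimal(3)    # back to prec=10 after the block

print(global_result, local_result, after_result)
```

Only the second style enters and leaves a context on every use, which is the code path the microbenchmark exercises.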

--

The Python benchmark suite does have a benchmark dedicated to the decimal module:
http://pyperformance.readthedocs.io/benchmarks.html#telco

I ran this benchmark on PR 5278:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json 
Benchmark hidden because not significant (1): telco

... not significant. Honestly, I'm not surprised at all:

* The telco benchmark doesn't modify the context in the hot code, only *outside* the benchmark
* telco likely spends most of its runtime computing numbers (sum += x; d = d.quantize(...), etc.) and converting Decimal to string. I don't know the decimal module, but I guess that it needs to *read* the current context. But getting the context is likely efficient and not significant compared to the cost of other operations.
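The kind of hot loop described in the two points above can be sketched as follows. This is not the telco benchmark itself, only a toy loop in the same spirit: the rate, the call durations and the function name `bill` are invented for illustration. The context is configured once outside the loop, and the loop only does arithmetic, quantize() and string conversion.

```python
from decimal import Decimal, getcontext, ROUND_HALF_EVEN

# Context is set up once, outside the hot code.
getcontext().prec = 28


def bill(durations_seconds, rate=Decimal("0.0013")):
    cent = Decimal("0.01")
    total = Decimal(0)
    lines = []
    for seconds in durations_seconds:
        # Hot code: multiply, round to cents, accumulate, convert to string.
        # Nothing here modifies the context.
        price = (Decimal(seconds) * rate).quantize(cent, rounding=ROUND_HALF_EVEN)
        total += price
        lines.append(str(price))
    return total, lines


total, lines = bill([60, 125, 3600])
print(total, lines)
```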

FYI timings can be seen in verbose mode:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v
Mean +- std dev: [telco_master] 10.7 ms +- 0.4 ms -> [telco_pr5278] 10.7 ms +- 0.4 ms: 1.00x faster (-0%)
Not significant!
