classification
Title: perf.py: calibrate benchmarks using time, not using a fixed number of iterations
Type: performance Stage:
Components: Benchmarks Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: brett.cannon, florin.papa, pitrou, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Priority: normal Keywords: patch

Created on 2016-02-03 10:38 by vstinner, last changed 2016-08-30 16:13 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
perf_calibration.patch vstinner, 2016-02-03 11:18
telco_haypo.py vstinner, 2016-02-04 11:37
stability.py vstinner, 2016-02-04 15:05
perf_calibration-2.patch vstinner, 2016-02-04 15:05
Messages (21)
msg259469 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-03 10:38
Hi, I'm working on some optimizations projects like FAT Python (PEP 509: issue #26058, PEP 510: issue #26098, and PEP 511: issue #26145) and faster memory allocators (issue #26249).

I have the *feeling* that perf.py output is not reliable even if it takes more than 20 minutes :-/ Maybe because Yury told that I must use -r (--rigorous) :-)

Example with 5 runs of "python3 perf.py ../default/python ../default/python.orig -b regex_v8":

---------------
Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
Total CPU cores: 8

### regex_v8 ###
Min: 0.043237 -> 0.050196: 1.16x slower
Avg: 0.043714 -> 0.050574: 1.16x slower
Significant (t=-19.83)
Stddev: 0.00171 -> 0.00174: 1.0178x larger

### regex_v8 ###
Min: 0.042774 -> 0.051420: 1.20x slower
Avg: 0.043843 -> 0.051874: 1.18x slower
Significant (t=-14.46)
Stddev: 0.00351 -> 0.00176: 2.0009x smaller

### regex_v8 ###
Min: 0.042673 -> 0.048870: 1.15x slower
Avg: 0.043726 -> 0.050474: 1.15x slower
Significant (t=-8.74)
Stddev: 0.00283 -> 0.00467: 1.6513x larger

### regex_v8 ###
Min: 0.044029 -> 0.049445: 1.12x slower
Avg: 0.044564 -> 0.049971: 1.12x slower
Significant (t=-13.97)
Stddev: 0.00175 -> 0.00211: 1.2073x larger

### regex_v8 ###
Min: 0.042692 -> 0.049084: 1.15x slower
Avg: 0.044295 -> 0.050725: 1.15x slower
Significant (t=-7.00)
Stddev: 0.00421 -> 0.00494: 1.1745x larger
---------------

I'm only care of the "Min", IMHO it's the most interesting information here.

The slowdown is betwen 12% and 20%, for me it's a big difference.


It looks like some benchmarks have very short iterations compare to others. For example, bm_json_v2 takes around 3 seconds, whereas bm_regex_v8 only takes less than 0.050 second (50 ms).

$ python3 performance/bm_json_v2.py -n 3 --timer perf_counter
3.310384973010514
3.3116717970115133
3.3077902760123834

$ python3 performance/bm_regex_v8.py -n 3 --timer perf_counter
0.0670697659952566
0.04515827298746444
0.045114840992027894

Do you think that bm_regex_v8 is reliable? I see that there is an "iteration scaling" to use run the benchmarks with more iterations. Maybe we can start to increase the "iteration scaling" for bm_regex_v8?

Instead of a fixed number of iterations, we should redesign benchmarks to use time. For example, one iteration must take at least 100 ms and should not take more than 1 second (but take longer to get more reliable results). Then the benchmark is responsible to ajust internal parameters.

I used this design for my "benchmark.py" script which is written to get "reliable" microbenchmarks:
https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py?fileviewer=file-view-default

The script is based on time and calibrate a benchmark. It also uses the *effictive* resolution of the clock used by the benchmark to calibrate the benchmark.

I will maybe work on such patch, but it would be good to know first your opinion on such change.

I guess that we should use the base python to calibrate the benchmark and then pass the same parameters to the modified python.
msg259472 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-03 11:18
Well, it was simpler than I expected to implement calibration. Here is a PoC for regex_v8 and json_dump_v2.

json_dump_v2 takes 3 seconds per run, but it already uses internally 100 loop iterations. I divided the number of iterations by 100 in my patch.
msg259504 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2016-02-03 19:22
Here's a very interesting table from Zach Byrne: http://bugs.python.org/issue21955#msg259490

It shows that some benchmarks are indeed very unstable.  This also correlates with my own experience.

These ones are very unstable: pickle_dict, nbody, regex_v8, etree_parse, telco.
msg259510 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-03 19:56
If running time is close to the limit, different run will use different number of repetitions. This will add additional instability. I prefer a stable number of repetitions manually calibrated for average modern computer.
msg259515 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-03 20:33
> If running time is close to the limit, different run will use different number of repetitions. This will add additional instability.

Maybe, to be honest I don't know. How can we decide if a patch makes
perf.py more and less stable?

> I prefer a stable number of repetitions manually calibrated for average modern computer.

The problem is to define an "average modern computer".
msg259543 - (view) Author: Florin Papa (florin.papa) * Date: 2016-02-04 08:33
I have tested the patch and it does not seem to solve the stability
problem.

With patch:
python performance/bm_json_v2.py -n 10 --timer perf_counter
0.02829067991115153
0.027778947027400136
0.027756798081099987
0.027362094027921557
0.027256828034296632
0.02766443998552859
0.027390124974772334
0.027398303849622607
0.02731629996560514
0.02743363007903099

Without patch:
python performance/bm_json_v2.py -n 10 --timer perf_counter
2.732215315802023
2.7394272631499916
2.731212269049138
2.7511520159896463
2.739834441104904
2.754247901029885
2.734024364966899
2.7447318979538977
2.744731454877183
2.7513630259782076

Instead, I notice that, especially for the first runs, the measured
values tend to be higher. This is expected, since the cache has not
been populated yet, but this can also happen after a context switch.
Therefore, we could eliminate these "odd" values by calculating the
median value of all the runs and only keeping measurements within 1%
of that median value.

What do you think?
msg259544 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 08:39
> With patch:
> python performance/bm_json_v2.py -n 10 --timer perf_counter

Oh. There is a bit of confusion here. You must *not* run directly bm_xxx.py scripts. The calibration is done in perf.py.

Try for example:
python perf.py python2 python3 -b json_dump_v2

You must see something like:
Calibrating => num_runs=10, num_loops=16 (0.50 sec < 0.55 sec)

I should maybe share the calibration code to also compute the number of iterations when a bm_xxx.py script is run directly? But the risk is that someone compares two runs of bm_xxx.py using two python binaries, and seeing different results just because the number of calibrated loops is different...
msg259546 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 09:24
In my experience it is very hard to get stable benchmark results with Python.  Even long running benchmarks on an empty machine vary:

wget http://www.bytereef.org/software/mpdecimal/benchmarks/telco.py
wget http://speleotrove.com/decimal/expon180-1e6b.zip
unzip expon180-1e6b.zip

taskset -c 0 ./python telco.py full


$ taskset -c 0 ./python telco.py full

Control totals:
Actual   ['1004737.58', '57628.30', '25042.17']
Expected ['1004737.58', '57628.30', '25042.17']
Elapsed time: 7.16255
$ taskset -c 0 ./python telco.py full

Control totals:
Actual   ['1004737.58', '57628.30', '25042.17']
Expected ['1004737.58', '57628.30', '25042.17']
Elapsed time: 6.982884
$ taskset -c 0 ./python telco.py full

Control totals:
Actual   ['1004737.58', '57628.30', '25042.17']
Expected ['1004737.58', '57628.30', '25042.17']
Elapsed time: 7.095349000000001
$ taskset -c 0 ./python telco.py full

Control totals:
Actual   ['1004737.58', '57628.30', '25042.17']
Expected ['1004737.58', '57628.30', '25042.17']
msg259547 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 09:27
I've cut off the highest result in the previous message:

Control totals:
Actual   ['1004737.58', '57628.30', '25042.17']
Expected ['1004737.58', '57628.30', '25042.17']
Elapsed time: 7.304339
$ taskset -c 0 ./python telco.py full
msg259548 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 09:28
> taskset -c 0 ./python telco.py full

Did you see that I just merged Florin's patch to add --affinity parameter to perf.py? :-)

You may isolate some CPU cores using the kernel command parameter isolcpus=xxx. I don't think that the core #0 is the best choice, it may be preferred.

It would be nice to collect "tricks" to get most reliable benchmark results. Maybe in perf.py README page? Or a wiki page? Whatever? :-)
msg259550 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 09:35
For an older project (Fusil the fuzzer), I wrote a short function to sleep until the system load is lower than a threshold. I had to write such function to reduce the noise when the system is heavily loaded. I wrote that to be able to run a very long task (it takes at least 1 hour, but may run for multiple days!) on my desktop and continue to use my desktop for various other tasks.

On Linux, we can use the "cpu xxx xxx xxx ..." line of /proc/stat to get the system load.

My code to read the system load:
https://bitbucket.org/haypo/fusil/src/32ddc281219cd90c1ad12a3ee4ce212bea1c3e0f/fusil/linux/cpu_load.py?at=default&fileviewer=file-view-default#cpu_load.py-51

My code to wait until the system load is lower than a threshold:
https://bitbucket.org/haypo/fusil/src/32ddc281219cd90c1ad12a3ee4ce212bea1c3e0f/fusil/system_calm.py?at=default&fileviewer=file-view-default#system_calm.py-5

--

I also write a script to do the opposite :-) A script to stress the system to have a system load higher or equal than a minimum load:

https://bitbucket.org/haypo/misc/src/3fd3993413b128b37e945690865ea2c5ef48c446/bin/system_load.py?at=default&fileviewer=file-view-default

This script helped to me reproduce sporadic failures like timeouts which only occur when the system is highly loaded.
msg259551 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 09:37
Core 1 fluctuates even more (My machine only has 2 cores):

$ taskset -c 1 ./python telco.py full

Control totals:
Actual   ['1004737.58', '57628.30', '25042.17']
Expected ['1004737.58', '57628.30', '25042.17']
Elapsed time: 6.783009
Control totals:
Actual   ['1004737.58', '57628.30', '25042.17']
Expected ['1004737.58', '57628.30', '25042.17']
Elapsed time: 7.335563
$ taskset -c 1 ./python telco.py full


I have some of the same concerns as Serhiy. There's a lot of statistics going on in the benchmark suite -- is it really possible to separate that cleanly from the actual runtime of the benchmarks?
msg259555 - (view) Author: Florin Papa (florin.papa) * Date: 2016-02-04 11:08
I ran perf to use calibration and there is no difference in stability
compared to the unpatched version.

With patch:

python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
INFO:root:Automatically selected timer: perf_counter
[1/1] json_dump_v2...
Calibrating
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 1 --timer perf_counter`
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 2 --timer perf_counter`
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 4 --timer perf_counter`
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 8 --timer perf_counter`
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 16 --timer perf_counter`
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 32 --timer perf_counter`
Calibrating => num_runs=10, num_loops=32 (0.50 sec < 0.87 sec)
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 10 -l 32 --timer perf_counter`
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 10 -l 32 --timer perf_counter`

Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
Total CPU cores: 18

### json_dump_v2 ###
Min: 0.877497 -> 0.886482: 1.01x slower   <----------
Avg: 0.878150 -> 0.888351: 1.01x slower
Not significant
Stddev: 0.00054 -> 0.00106: 1.9481x larger


Without patch:

python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
INFO:root:Automatically selected timer: perf_counter
[1/1] json_dump_v2...
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 50 --timer perf_counter`
INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 50 --timer perf_counter`

Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
Total CPU cores: 18

### json_dump_v2 ###
Min: 2.755514 -> 2.764131: 1.00x slower     <---------- (almost) same as above
Avg: 2.766546 -> 2.775587: 1.00x slower
Not significant
Stddev: 0.00538 -> 0.00382: 1.4069x smaller
msg259556 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 11:37
Stefan: "In my experience it is very hard to get stable benchmark results with Python.  Even long running benchmarks on an empty machine vary: (...)"

tl; dr We *can* tune the Linux kernel to avoid most of the system noise when running kernels.


I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.

I also added the following lines to check the CPU affinity and the number of context switches:

    os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())

Well, see attached telco_haypo.py for the full script.

I used my system_load.py script to get a system load >= 5.0. Without tasksel, the benchmark result changes completly: at least 5 seconds. Well, it's not really surprising, it's known that benchmarks depend on the system load.


*BUT* I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedoera 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.

haypo@smithers$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
1   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
2   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
3   0    0      3    3:3:3:0       oui    5900,0000 1600,0000
4   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
5   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
6   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
7   0    0      3    3:3:3:0       oui    5900,0000 1600,0000

My CPU is on a single socket, has 4 physical cores, but Linux gets 8 cores because of hyper threading.


I modified the Linux command line during the boot in GRUB to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I forced the CPU frequency to performance to avoid hiccups:

# for id in 2 3 6 7; do echo performance > cpu$id/cpufreq/scaling_governor; done 

Check the config with:

$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
$ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
performance
performance
performance
performance


Ok now with this kernel config but still without tasksel on an idle system:
-----------------------
Elapsed time: 2.660088424000037
Elapsed time: 2.5927538629999844
Elapsed time: 2.6135682369999813
Elapsed time: 2.5819260570000324
Elapsed time: 2.5991294099999322

Cpus_allowed:	33
Cpus_allowed_list:	0-1,4-5
voluntary_ctxt_switches:	1
nonvoluntary_ctxt_switches:	21
-----------------------

With system load >= 5.0:
-----------------------
Elapsed time: 5.3484489170000415
Elapsed time: 5.336797472999933
Elapsed time: 5.187413687999992
Elapsed time: 5.24122020599998
Elapsed time: 5.10201246400004

Cpus_allowed_list:	0-1,4-5
voluntary_ctxt_switches:	1
nonvoluntary_ctxt_switches:	1597
-----------------------

And *NOW* using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:
-----------------------
$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.579487486000062
Elapsed time: 2.5827961039999536
Elapsed time: 2.5811954810001225
Elapsed time: 2.5782033600000887
Elapsed time: 2.572370636999949

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	16
-----------------------

Numbers look *more* stable than the numbers of the first test without taskset on an idle system! You can see that number of context switches is very low (total: 18).

Example of a second run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.538398498999868
Elapsed time: 2.544711968999991
Elapsed time: 2.5323677339999904
Elapsed time: 2.536252647000083
Elapsed time: 2.525748182999905

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	15
-----------------------

Third run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.5819172930000605
Elapsed time: 2.5783024259999365
Elapsed time: 2.578493587999901
Elapsed time: 2.5774198510000588
Elapsed time: 2.5772148999999445

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	15
-----------------------

Well, it's no perfect, but it looks much stable than timings without specific kernel config nor CPU pinning.

Statistics on the 15 timings of the 3 runs with tunning on a heavily loaded system:

>>> times
[2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445]
>>> statistics.mean(times)
2.564325343866661
>>> statistics.pvariance(times)
0.0004340411190965491
>>> statistics.stdev(times)
0.021564880156747315


Compare if to the timings without tunning on an idle system:

>>> times
[2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322]
>>> statistics.mean(times)
2.6094931981999934
>>> statistics.pvariance(times)
0.0007448087075422725
>>> statistics.stdev(times)
0.030512470965620608

We get (no tuning, idle system => tuning, busy system):

* Population variance: 0.00074 => 0.00043
* Standard deviation: 0.031 => 0.022

It looks *much* better, no? Even I only used *5* timings on the benchmark without tuning, whereas I used 15 timings on the benchmark with tuning. I expect larger variance and deviation with more times.

--

Just for fun, I ran the benchmark 3 times (so to get 3x5 timings) on an idle system with tuning:

>>> times
[2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
>>> import statistics
>>> statistics.mean(times)
2.563515909133321
>>> statistics.pvariance(times)
0.00016384530912002678
>>> statistics.stdev(times)
0.013249473404092065

As expected, it's even better (no tune, idle system => tuning, busy system => tuning, idle system):

* Population variance: 0.00074 => 0.00043 => 0.00016
* Standard deviation: 0.031 => 0.022 => 0.013
msg259557 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2016-02-04 12:03
> STINNER Victor added the comment:
> I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.
> 

Nice. telco.py is an ad-hoc script from the original decimal.py sandbox,
I missed that it called "infil.read(8)". :)

> And *NOW* using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:
> -----------------------
> $ taskset -c 2,3,6,7 python3 telco_haypo.py full 
> 
> Elapsed time: 2.579487486000062
> Elapsed time: 2.5827961039999536
> Elapsed time: 2.5811954810001225
> Elapsed time: 2.5782033600000887
> Elapsed time: 2.572370636999949

Great.  I'll try that out in the weekend.
msg259558 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-02-04 12:16
Victor, this is a very interesting write-up, thank you.
msg259561 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 13:33
Florin Papa added the comment:
> I ran perf to use calibration and there is no difference in stability
> compared to the unpatched version.

Sorry, what are you calling "stability"? For me, stability means that
you run the same benchmark 3, 5 or 10 times, and the results must be
as close as possible: see variance and standard deviation of my
previous message.

I'm not talking of variance/deviation of the N runs of bm_xxx.py
scripts, but variance/deviation of the mean value displayed by
perf.py.

perf_calibration.patch is a proof-of-concept. I changed the number of
runs from 50 to 10 to test my patch more easily. You should modify the
patch to keep 50 runs if you want to compare the stability.

By the way, --fast/--rigorous options should not only change the
minimum duration of a single run to calibrate the number of loops, but
they should also change the "maximum" duration of perf.py by using
different number of runs.
msg259566 - (view) Author: Florin Papa (florin.papa) * Date: 2016-02-04 14:22
I was also talking about the variance/deviation of the mean value
displayed by perf.py, sorry if I was unclear. The perf.py output in my
previous message showed little difference between the patched and
non-patched version. I will also try increasing the number of
runs to see if there is any change.

The CPU isolation feature is a great finding, thank you.
msg259570 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-04 15:05
tl; dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations make the benchmark more unstable... I expected the opposite...


Patch version 2:

* patch also performance/bm_pickle.py
* change min_time from 100 ms to 500 ms with --fast
* compute the number of runs using a maximum time, maximum time change with --fast and --rigorous

+    if options.rigorous:
+        min_time = 1.0
+        max_time = 100.0  # 100 runs
+    elif options.fast:
+        min_time = 0.5
+        max_time = 25.0   # 50 runs
+    else:
+        min_time = 0.5
+        max_time = 50.0   # 100 runs


To measure the stability of perf.py, I pinned perf.py to CPU cores which are isolated of the system using Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled "no HZ full" on these cores.

I ran perf.py 5 times on regex_v8.


Calibration (original => patched):

* --fast: 1 iteration x 5 runs => 16 iterations x 50 runs
* (no option): 1 iteration x 50 runs => 16 iterations x 100 runs


Approximated duration of the benchmark (original => patched):

* --fast: 7 sec => 7 min 34 sec
* (no option): 30 sec => 14 min 35 sec

(I made a mistake, so I was unable to get the exact duration.)

Hum, maybe timings are not well chosen because the benchmark is really slow (minutes vs seconds) :-/


Standard deviation, --fast:

* (python2) 0.00071 (1.2%, mean=0.05961) => 0.01059 (1.1%, mean=0.96723)
* (python3) 0.00068 (1.5%, mean=0.04494) => 0.05925 (8.0%, mean=0.74248)
* (faster) 0.02986 (2.2%, mean=1.32750) => 0.09083 (6.9%, mean=1.31000)

Standard deviation, (no option):

* (python2) 0.00072 (1.2%, mean=0.05957) => 0.00874 (0.9%, mean=0.97028)
* (python3) 0.00053 (1.2%, mean=0.04477) => 0.00966 (1.3%, mean=0.72680)
* (faster) 0.02739 (2.1%, mean=1.33000) => 0.02608 (2.0%, mean=1.33600)

Variance, --fast:

* (python2) 0.00000 (0.001%, mean=0.05961) => 0.00009 (0.009%, mean=0.96723)
* (python3) 0.00000 (0.001%, mean=0.04494) => 0.00281 (0.378%, mean=0.74248)
* (faster) 0.00067 (0.050%, mean=1.32750) => 0.00660 (0.504%, mean=1.31000)

Variance, (no option):

* (python2) 0.00000 (0.001%, mean=0.05957) => 0.00006 (0.006%, mean=0.97028)
* (python3) 0.00000 (0.001%, mean=0.04477) => 0.00007 (0.010%, mean=0.72680)
* (faster) 0.00060 (0.045%, mean=1.33000) => 0.00054 (0.041%, mean=1.33600)

Legend:

* (python2) are timings of python2 ran by perf.py (of the "Min" line)
* (python3) are timings of python3 ran by perf.py (of the "Min" line)
* (faster) are the "1.34x" numbers of "faster" or "slower" of the "Min" line
* percentages are: value * 100 / mean

It's not easy to compare these values since the number of iterations is very different (1 => 16) and so timings are very different (ex: 0.059 sec => 0.950 sec). I guess that it's ok to compare percentages.


I used the stability.py script, attached to this issue, to compute deviation and variance from the "Min" line of the 5 runs. The script takes the output of perf.py as input.

I'm not sure that 5 runs are enough to compute statistics.

--

Raw data.

Original perf.py.

$ grep ^Min original.fast 
Min: 0.059236 -> 0.045948: 1.29x faster
Min: 0.059005 -> 0.044654: 1.32x faster
Min: 0.059601 -> 0.044547: 1.34x faster
Min: 0.060605 -> 0.044600: 1.36x faster

$ grep ^Min original
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster

Patched perf.py.

$ grep ^Min patched.fast 
Min: 0.950717 -> 0.711018: 1.34x faster
Min: 0.968413 -> 0.730810: 1.33x faster
Min: 0.976092 -> 0.847388: 1.15x faster
Min: 0.964355 -> 0.711083: 1.36x faster
Min: 0.976573 -> 0.712081: 1.37x faster

$ grep ^Min patched
Min: 0.968810 -> 0.729109: 1.33x faster
Min: 0.973615 -> 0.731308: 1.33x faster
Min: 0.974215 -> 0.734259: 1.33x faster
Min: 0.978781 -> 0.709915: 1.38x faster
Min: 0.955977 -> 0.729387: 1.31x faster

$ grep ^Calibration patched.fast 
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec)

$ grep ^Calibration patched
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec)
Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec)
Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec)
msg259582 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2016-02-04 17:47
What would happen if we shifted to counting the number of executions within a set amount of time instead of how fast a single execution occurred? I believe some JavaScript benchmarks started to do this about a decade ago when they realized CPUs were getting so fast that older benchmarks were completing too quickly to be reliably measured. This also would allow one to have a very strong notion of how long a benchmark run would take based on the number of iterations and what time length bucket a benchmark was placed in (i.e., for microbenchmarks we could say a second while for longer running benchmarks we can increase that threshold). And it won't hurt benchmark comparisons since we have always done relative comparisons rather than absolute ones.
msg273923 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-08-30 16:13
This issue was fixed in the new flavor of the benchmark, the new performance project:
https://github.com/python/performance
History
Date User Action Args
2016-08-30 16:13:19vstinnersetstatus: open -> closed
resolution: fixed
dependencies: - Fix telco benchmark
messages: + msg273923
2016-02-23 09:58:09skrahsetdependencies: + Fix telco benchmark
2016-02-04 17:47:37brett.cannonsetmessages: + msg259582
2016-02-04 15:05:37vstinnersetfiles: + perf_calibration-2.patch
2016-02-04 15:05:24vstinnersetfiles: + stability.py

messages: + msg259570
2016-02-04 14:22:21florin.papasetmessages: + msg259566
2016-02-04 13:33:05vstinnersetmessages: + msg259561
2016-02-04 12:16:08pitrousetmessages: + msg259558
2016-02-04 12:03:59skrahsetmessages: + msg259557
2016-02-04 11:37:12vstinnersetfiles: + telco_haypo.py

messages: + msg259556
2016-02-04 11:08:00florin.papasetmessages: + msg259555
2016-02-04 09:37:13skrahsetmessages: + msg259551
2016-02-04 09:35:48vstinnersetmessages: + msg259550
2016-02-04 09:28:12vstinnersetmessages: + msg259548
2016-02-04 09:27:46skrahsetmessages: + msg259547
2016-02-04 09:24:03skrahsetnosy: + skrah
messages: + msg259546
2016-02-04 08:39:51vstinnersetmessages: + msg259544
2016-02-04 08:33:48florin.papasetmessages: + msg259543
2016-02-03 20:33:03vstinnersetmessages: + msg259515
2016-02-03 19:56:42serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg259510
2016-02-03 19:22:36yselivanovsetnosy: + zbyrne
messages: + msg259504
2016-02-03 13:50:24vstinnersetnosy: + florin.papa

title: perf.py: bm_regex_v8 doesn't seem reliable even with --rigorous -> perf.py: calibrate benchmarks using time, not using a fixed number of iterations
2016-02-03 11:18:27vstinnersetfiles: + perf_calibration.patch
keywords: + patch
messages: + msg259472
2016-02-03 10:38:49vstinnercreate