perf.py: calibrate benchmarks using time, not using a fixed number of iterations #70463
Comments
Hi, I'm working on some optimization projects like FAT Python (PEP 509: issue bpo-26058, PEP 510: issue bpo-26098, and PEP 511: issue bpo-26145) and faster memory allocators (issue bpo-26249). I have the *feeling* that perf.py output is not reliable even if it takes more than 20 minutes :-/ Maybe because Yury told me that I must use -r (--rigorous) :-)

Example with 5 runs of "python3 perf.py ../default/python ../default/python.orig -b regex_v8":

---------------
### regex_v8 ###
### regex_v8 ###
### regex_v8 ###
### regex_v8 ###
### regex_v8 ###

I only care about the "Min" line; IMHO it's the most interesting information here. The slowdown is between 12% and 20%, which for me is a big difference.

It looks like some benchmarks have very short iterations compared to others. For example, bm_json_v2 takes around 3 seconds per run, whereas bm_regex_v8 takes less than 0.050 second (50 ms).

$ python3 performance/bm_json_v2.py -n 3 --timer perf_counter
3.310384973010514
3.3116717970115133
3.3077902760123834
$ python3 performance/bm_regex_v8.py -n 3 --timer perf_counter
0.0670697659952566
0.04515827298746444
0.045114840992027894

Do you think that bm_regex_v8 is reliable?

I see that there is an "iteration scaling" to run the benchmarks with more iterations. Maybe we can start by increasing the "iteration scaling" for bm_regex_v8?

Instead of a fixed number of iterations, we should redesign the benchmarks to use time. For example, one iteration must take at least 100 ms and should not take more than 1 second (but it may take longer to get more reliable results). The benchmark is then responsible for adjusting its internal parameters.

I used this design for my "benchmark.py" script, which is written to get "reliable" microbenchmarks: the script is based on time and calibrates the benchmark. It also uses the *effective* resolution of the clock used by the benchmark for the calibration.

I will maybe work on such a patch, but it would be good to know your opinion on such a change first. I guess that we should use the base python to calibrate the benchmark and then pass the same parameters to the modified python. |
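The time-based calibration described above (each timed run must last at least a minimum time; the benchmark adjusts its own loop count) could be sketched roughly like this. This is a hypothetical illustration, not perf.py code; `calibrate` and `bench` are invented names:

```python
import time

def calibrate(bench_func, min_time=0.1):
    """Double the loop count until one timed run lasts at least min_time.

    bench_func(loops) runs the workload `loops` times and returns the
    elapsed time in seconds. Doubling keeps the final run near
    2 * min_time for fast workloads, below the 1-second upper bound
    suggested above.
    """
    loops = 1
    while True:
        if bench_func(loops) >= min_time:
            return loops
        loops *= 2

# Hypothetical micro-workload standing in for a bm_xxx.py benchmark.
def bench(loops):
    start = time.perf_counter()
    for _ in range(loops):
        sum(range(1000))
    return time.perf_counter() - start

loops = calibrate(bench)
print(loops)  # a power of two; the value is machine dependent
```

With this scheme, very short benchmarks like bm_regex_v8 automatically get more iterations, while slow ones like bm_json_v2 stay at one loop per run.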
Well, it was simpler than I expected to implement calibration. Here is a PoC for regex_v8 and json_dump_v2. json_dump_v2 takes 3 seconds per run, but it already uses internally 100 loop iterations. I divided the number of iterations by 100 in my patch. |
Here's a very interesting table from Zach Byrne: http://bugs.python.org/issue21955#msg259490 It shows that some benchmarks are indeed very unstable. This also correlates with my own experience. These ones are very unstable: pickle_dict, nbody, regex_v8, etree_parse, telco. |
If the running time is close to the limit, different runs will use different numbers of repetitions. This will add additional instability. I prefer a stable number of repetitions, manually calibrated for an average modern computer. |
Maybe, to be honest I don't know. How can we decide if a patch makes
The problem is to define an "average modern computer". |
I have tested the patch and it does not seem to solve the stability problem.

With patch:

Without patch:

Instead, I notice that, especially for the first runs, the measured … What do you think? |
Oh. There is a bit of confusion here. You must *not* run the bm_xxx.py scripts directly. The calibration is done in perf.py. Try for example:

You should see something like:

I should maybe share the calibration code to also compute the number of iterations when a bm_xxx.py script is run directly? But the risk is that someone compares two runs of bm_xxx.py using two python binaries and sees different results just because the number of calibrated loops is different... |
In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary:

wget http://www.bytereef.org/software/mpdecimal/benchmarks/telco.py
taskset -c 0 ./python telco.py full

$ taskset -c 0 ./python telco.py full
Control totals:
Control totals:
Control totals:
Control totals:
|
I've cut off the highest result in the previous message: Control totals: |
Did you see that I just merged Florin's patch to add the --affinity parameter to perf.py? :-) You may isolate some CPU cores using the kernel command line parameter isolcpus=xxx. I don't think that core #0 is the best choice: the kernel tends to prefer it (for example, for interrupt handling). It would be nice to collect "tricks" to get the most reliable benchmark results. Maybe in the perf.py README? Or a wiki page? Whatever? :-) |
For an older project (Fusil the fuzzer), I wrote a short function to sleep until the system load is lower than a threshold. I had to write such a function to reduce the noise when the system is heavily loaded. I wrote it to be able to run a very long task (it takes at least 1 hour, but may run for multiple days!) on my desktop and continue to use my desktop for various other tasks.

On Linux, we can use the "cpu xxx xxx xxx ..." line of /proc/stat to get the system load.

My code to read the system load:

My code to wait until the system load is lower than a threshold:

--

I also wrote a script to do the opposite :-) A script to stress the system to get a system load higher than or equal to a minimum load:

This script helped me to reproduce sporadic failures like timeouts which only occur when the system is highly loaded. |
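The Fusil helpers themselves are only referenced above, not quoted. A rough, hypothetical sketch of the approach described (sample the aggregate "cpu" line of /proc/stat twice and compare busy vs. idle jiffies) might look like this; it assumes Linux, and all function names here are invented:

```python
import time

def parse_cpu_times(stat_line):
    """Parse the aggregate 'cpu ...' line of /proc/stat into (busy, total).

    Fields after the 'cpu' label are jiffies: user, nice, system, idle,
    iowait, ...; idle and iowait count as idle time, the rest as busy.
    """
    fields = [int(part) for part in stat_line.split()[1:]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    total = sum(fields)
    return total - idle, total

def system_load(sample_a, sample_b):
    """Busy fraction (0.0 to 1.0) of the CPU between two /proc/stat samples."""
    busy_a, total_a = parse_cpu_times(sample_a)
    busy_b, total_b = parse_cpu_times(sample_b)
    elapsed = total_b - total_a
    return (busy_b - busy_a) / elapsed if elapsed else 0.0

def wait_for_idle_system(max_load=0.25, poll=1.0, timeout=60.0):
    """Sleep until the system load drops below max_load (Linux only)."""
    deadline = time.monotonic() + timeout
    with open("/proc/stat") as f:
        prev = f.readline()
    while time.monotonic() < deadline:
        time.sleep(poll)
        with open("/proc/stat") as f:
            cur = f.readline()
        if system_load(prev, cur) < max_load:
            return True
        prev = cur
    return False

# Two synthetic samples: 100 extra busy jiffies out of 400 total => 0.25.
a = "cpu 100 0 100 800 0 0 0 0 0 0"
b = "cpu 150 0 150 1100 0 0 0 0 0 0"
print(system_load(a, b))  # -> 0.25
```

A benchmark driver could call `wait_for_idle_system()` before each run to avoid measuring under load.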
Core 1 fluctuates even more (my machine only has 2 cores):

$ taskset -c 1 ./python telco.py full
Control totals:

I have some of the same concerns as Serhiy. There's a lot of statistics going on in the benchmark suite -- is it really possible to separate that cleanly from the actual runtime of the benchmarks? |
I ran perf.py with calibration and there is no difference in stability.

With patch:

python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
### json_dump_v2 ###

Without patch:

python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
### json_dump_v2 ###
|
Stefan: "In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary: (...)"

tl;dr We *can* tune the Linux kernel to avoid most of the system noise when running benchmarks.

I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.

I also added the following line to check the CPU affinity and the number of context switches:

os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())

Well, see the attached telco_haypo.py for the full script.

I used my system_load.py script to get a system load >= 5.0. Without taskset, the benchmark result changes completely: at least 5 seconds per run. Well, it's not really surprising; it's well known that benchmark results depend on the system load.

*BUT* I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedora 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.

haypo@smithers$ lscpu --extended

My CPU is on a single socket and has 4 physical cores, but Linux sees 8 cores because of hyper-threading.

I modified the Linux command line during boot in GRUB to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I forced the CPU frequency governor to "performance" to avoid hiccups:

# cd /sys/devices/system/cpu
# for id in 2 3 6 7; do echo performance > cpu$id/cpufreq/scaling_governor; done

Check the config with:

$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
$ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
performance
performance
performance
performance

Ok, now with this kernel config but still without taskset, on an idle system:

Cpus_allowed: 33

With system load >= 5.0:

Cpus_allowed_list: 0-1,4-5

And *NOW* using my isolated physical CPU cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:

$ taskset -c 2,3,6,7 python3 telco_haypo.py full
Elapsed time: 2.579487486000062
Cpus_allowed: cc

The numbers look *more* stable than the numbers of the first test without taskset on an idle system! You can see that the number of context switches is very low (total: 18).

Example of a second run:

Elapsed time: 2.538398498999868
Cpus_allowed: cc

Third run:

Elapsed time: 2.5819172930000605
Cpus_allowed: cc

Well, it's not perfect, but it looks much more stable than timings without the specific kernel config and CPU pinning.

Statistics on the 15 timings of the 3 runs with tuning on a heavily loaded system:

>>> times
[2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445]
>>> statistics.mean(times)
2.564325343866661
>>> statistics.pvariance(times)
0.0004340411190965491
>>> statistics.stdev(times)
0.021564880156747315

Compare it to the timings without tuning on an idle system:

>>> times
[2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322]
>>> statistics.mean(times)
2.6094931981999934
>>> statistics.pvariance(times)
0.0007448087075422725
>>> statistics.stdev(times)
0.030512470965620608

We get (no tuning, idle system => tuning, busy system):
It looks *much* better, no? Note that I only used *5* timings for the benchmark without tuning, whereas I used 15 timings for the benchmark with tuning; I would expect larger variance and deviation with more timings.

--

Just for fun, I ran the benchmark 3 times (so as to get 3x5 timings) on an idle system with tuning:

>>> times
[2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
>>> import statistics
>>> statistics.mean(times)
2.563515909133321
>>> statistics.pvariance(times)
0.00016384530912002678
>>> statistics.stdev(times)
0.013249473404092065

As expected, it's even better (no tuning, idle system => tuning, busy system => tuning, idle system):
|
Nice. telco.py is an ad-hoc script from the original decimal.py sandbox,
Great. I'll try that out in the weekend. |
Victor, this is a very interesting write-up, thank you. |
Florin Papa added the comment:
Sorry, what are you calling "stability"? For me, stability means that …

I'm not talking of the variance/deviation of the N runs of bm_xxx.py …

perf_calibration.patch is a proof-of-concept. I changed the number of …

By the way, the --fast/--rigorous options should not only change the … |
I was also talking about the variance/deviation of the mean value …

The CPU isolation feature is a great finding, thank you. |
tl;dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations makes the benchmark more unstable... I expected the opposite... Patch version 2:
+ if options.rigorous:

To measure the stability of perf.py, I pinned perf.py to CPU cores which are isolated from the rest of the system using the Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled "no HZ full" on these cores.

I ran perf.py 5 times on regex_v8.

Calibration (original => patched):
Approximated duration of the benchmark (original => patched):
(I made a mistake, so I was unable to get the exact duration.) Hum, maybe timings are not well chosen because the benchmark is really slow (minutes vs seconds) :-/ Standard deviation, --fast:
Standard deviation, (no option):
Variance, --fast:
Variance, (no option):
Legend:
It's not easy to compare these values since the number of iterations is very different (1 => 16) and so the timings are very different (ex: 0.059 sec => 0.950 sec). I guess that it's OK to compare percentages.

I used the stability.py script, attached to this issue, to compute the deviation and variance from the "Min" lines of the 5 runs. The script takes the output of perf.py as input. I'm not sure that 5 runs are enough to compute statistics.

--

Raw data.

Original perf.py:

$ grep ^Min original.fast
Min: 0.059236 -> 0.045948: 1.29x faster
Min: 0.059005 -> 0.044654: 1.32x faster
Min: 0.059601 -> 0.044547: 1.34x faster
Min: 0.060605 -> 0.044600: 1.36x faster
$ grep ^Min original
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster

Patched perf.py:

$ grep ^Min patched.fast
Min: 0.950717 -> 0.711018: 1.34x faster
Min: 0.968413 -> 0.730810: 1.33x faster
Min: 0.976092 -> 0.847388: 1.15x faster
Min: 0.964355 -> 0.711083: 1.36x faster
Min: 0.976573 -> 0.712081: 1.37x faster
$ grep ^Min patched
Min: 0.968810 -> 0.729109: 1.33x faster
Min: 0.973615 -> 0.731308: 1.33x faster
Min: 0.974215 -> 0.734259: 1.33x faster
Min: 0.978781 -> 0.709915: 1.38x faster
Min: 0.955977 -> 0.729387: 1.31x faster
$ grep ^Calibration patched.fast
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec)
$ grep ^Calibration patched
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec)
Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec)
Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec) |
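stability.py itself is attached to the issue and not reproduced here. A minimal sketch of the same idea (parse the "Min" lines from perf.py output and feed the timings to the statistics module) might look like this; the function names are invented, and the sample data is the set of unpatched "Min" lines quoted above:

```python
import re
import statistics

def parse_min_lines(text):
    """Extract (base, patched) timing pairs from perf.py 'Min:' lines."""
    pattern = re.compile(r"^Min:\s+([\d.]+)\s+->\s+([\d.]+):", re.MULTILINE)
    return [(float(a), float(b)) for a, b in pattern.findall(text)]

def stability(timings):
    """Standard deviation, and the deviation as a percentage of the mean."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)
    return stdev, 100.0 * stdev / mean

# The five "Min" lines of the original (unpatched) perf.py runs above.
output = """\
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster
"""
pairs = parse_min_lines(output)
base = [a for a, _ in pairs]
stdev, percent = stability(base)
print(stdev, percent)
```

Reporting the deviation as a percentage of the mean makes runs with different loop counts (and therefore very different absolute timings) comparable.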
What would happen if we shifted to counting the number of executions within a set amount of time instead of how fast a single execution occurred? I believe some JavaScript benchmarks started to do this about a decade ago when they realized CPUs were getting so fast that older benchmarks were completing too quickly to be reliably measured. This also would allow one to have a very strong notion of how long a benchmark run would take based on the number of iterations and what time length bucket a benchmark was placed in (i.e., for microbenchmarks we could say a second while for longer running benchmarks we can increase that threshold). And it won't hurt benchmark comparisons since we have always done relative comparisons rather than absolute ones. |
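The "count executions within a fixed time budget" idea suggested above can be sketched as follows; `runs_in_budget` is a hypothetical helper, not part of perf.py:

```python
import time

def runs_in_budget(workload, budget=1.0):
    """Count complete executions of `workload` within a fixed time budget.

    Comparing two interpreters then means comparing counts (higher is
    faster): a relative measure, matching how perf.py reports results.
    """
    count = 0
    deadline = time.perf_counter() + budget
    while time.perf_counter() < deadline:
        workload()
        count += 1
    return count

# Hypothetical micro-workload; a real benchmark body would plug in here.
count = runs_in_budget(lambda: sum(range(1000)), budget=0.2)
print(count)
```

One benefit: the total wall-clock time of a benchmark run is known in advance from the budget, regardless of how fast the machine is.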
This issue was fixed in the new flavor of the benchmark suite, the new performance project: |