Issue 26275: perf.py: calibrate benchmarks using time, not using a fixed number of iterations

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70463

classification

Title:	perf.py: calibrate benchmarks using time, not using a fixed number of iterations
Type:	performance	Stage:
Components:	Benchmarks	Versions:	Python 3.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	brett.cannon, florin.papa, pitrou, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Priority:	normal	Keywords:	patch

Created on 2016-02-03 10:38 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
perf_calibration.patch	vstinner, 2016-02-03 11:18
telco_haypo.py	vstinner, 2016-02-04 11:37
stability.py	vstinner, 2016-02-04 15:05
perf_calibration-2.patch	vstinner, 2016-02-04 15:05

Messages (21)
msg259469 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-03 10:38
Hi, I'm working on some optimizations projects like FAT Python (PEP 509: issue #26058, PEP 510: issue #26098, and PEP 511: issue #26145) and faster memory allocators (issue #26249). I have the feeling that perf.py output is not reliable even if it takes more than 20 minutes :-/ Maybe because Yury told that I must use -r (--rigorous) :-) Example with 5 runs of "python3 perf.py ../default/python ../default/python.orig -b regex_v8": --------------- Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64 Total CPU cores: 8 ### regex_v8 ### Min: 0.043237 -> 0.050196: 1.16x slower Avg: 0.043714 -> 0.050574: 1.16x slower Significant (t=-19.83) Stddev: 0.00171 -> 0.00174: 1.0178x larger ### regex_v8 ### Min: 0.042774 -> 0.051420: 1.20x slower Avg: 0.043843 -> 0.051874: 1.18x slower Significant (t=-14.46) Stddev: 0.00351 -> 0.00176: 2.0009x smaller ### regex_v8 ### Min: 0.042673 -> 0.048870: 1.15x slower Avg: 0.043726 -> 0.050474: 1.15x slower Significant (t=-8.74) Stddev: 0.00283 -> 0.00467: 1.6513x larger ### regex_v8 ### Min: 0.044029 -> 0.049445: 1.12x slower Avg: 0.044564 -> 0.049971: 1.12x slower Significant (t=-13.97) Stddev: 0.00175 -> 0.00211: 1.2073x larger ### regex_v8 ### Min: 0.042692 -> 0.049084: 1.15x slower Avg: 0.044295 -> 0.050725: 1.15x slower Significant (t=-7.00) Stddev: 0.00421 -> 0.00494: 1.1745x larger --------------- I'm only care of the "Min", IMHO it's the most interesting information here. The slowdown is betwen 12% and 20%, for me it's a big difference. It looks like some benchmarks have very short iterations compare to others. For example, bm_json_v2 takes around 3 seconds, whereas bm_regex_v8 only takes less than 0.050 second (50 ms). $ python3 performance/bm_json_v2.py -n 3 --timer perf_counter 3.310384973010514 3.3116717970115133 3.3077902760123834 $ python3 performance/bm_regex_v8.py -n 3 --timer perf_counter 0.0670697659952566 0.04515827298746444 0.045114840992027894 Do you think that bm_regex_v8 is reliable? I see that there is an "iteration scaling" to use run the benchmarks with more iterations. Maybe we can start to increase the "iteration scaling" for bm_regex_v8? Instead of a fixed number of iterations, we should redesign benchmarks to use time. For example, one iteration must take at least 100 ms and should not take more than 1 second (but take longer to get more reliable results). Then the benchmark is responsible to ajust internal parameters. I used this design for my "benchmark.py" script which is written to get "reliable" microbenchmarks: https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py?fileviewer=file-view-default The script is based on time and calibrate a benchmark. It also uses the effictive resolution of the clock used by the benchmark to calibrate the benchmark. I will maybe work on such patch, but it would be good to know first your opinion on such change. I guess that we should use the base python to calibrate the benchmark and then pass the same parameters to the modified python.
msg259472 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-03 11:18
Well, it was simpler than I expected to implement calibration. Here is a PoC for regex_v8 and json_dump_v2. json_dump_v2 takes 3 seconds per run, but it already uses internally 100 loop iterations. I divided the number of iterations by 100 in my patch.
msg259504 - (view)	Author: Yury Selivanov (yselivanov) *	Date: 2016-02-03 19:22
Here's a very interesting table from Zach Byrne: http://bugs.python.org/issue21955#msg259490 It shows that some benchmarks are indeed very unstable. This also correlates with my own experience. These ones are very unstable: pickle_dict, nbody, regex_v8, etree_parse, telco.
msg259510 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2016-02-03 19:56
If running time is close to the limit, different run will use different number of repetitions. This will add additional instability. I prefer a stable number of repetitions manually calibrated for average modern computer.
msg259515 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-03 20:33
> If running time is close to the limit, different run will use different number of repetitions. This will add additional instability. Maybe, to be honest I don't know. How can we decide if a patch makes perf.py more and less stable? > I prefer a stable number of repetitions manually calibrated for average modern computer. The problem is to define an "average modern computer".
msg259543 - (view)	Author: Florin Papa (florin.papa) *	Date: 2016-02-04 08:33
I have tested the patch and it does not seem to solve the stability problem. With patch: python performance/bm_json_v2.py -n 10 --timer perf_counter 0.02829067991115153 0.027778947027400136 0.027756798081099987 0.027362094027921557 0.027256828034296632 0.02766443998552859 0.027390124974772334 0.027398303849622607 0.02731629996560514 0.02743363007903099 Without patch: python performance/bm_json_v2.py -n 10 --timer perf_counter 2.732215315802023 2.7394272631499916 2.731212269049138 2.7511520159896463 2.739834441104904 2.754247901029885 2.734024364966899 2.7447318979538977 2.744731454877183 2.7513630259782076 Instead, I notice that, especially for the first runs, the measured values tend to be higher. This is expected, since the cache has not been populated yet, but this can also happen after a context switch. Therefore, we could eliminate these "odd" values by calculating the median value of all the runs and only keeping measurements within 1% of that median value. What do you think?
msg259544 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-04 08:39
> With patch: > python performance/bm_json_v2.py -n 10 --timer perf_counter Oh. There is a bit of confusion here. You must not run directly bm_xxx.py scripts. The calibration is done in perf.py. Try for example: python perf.py python2 python3 -b json_dump_v2 You must see something like: Calibrating => num_runs=10, num_loops=16 (0.50 sec < 0.55 sec) I should maybe share the calibration code to also compute the number of iterations when a bm_xxx.py script is run directly? But the risk is that someone compares two runs of bm_xxx.py using two python binaries, and seeing different results just because the number of calibrated loops is different...
msg259546 - (view)	Author: Stefan Krah (skrah) *	Date: 2016-02-04 09:24
In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary: wget http://www.bytereef.org/software/mpdecimal/benchmarks/telco.py wget http://speleotrove.com/decimal/expon180-1e6b.zip unzip expon180-1e6b.zip taskset -c 0 ./python telco.py full $ taskset -c 0 ./python telco.py full Control totals: Actual ['1004737.58', '57628.30', '25042.17'] Expected ['1004737.58', '57628.30', '25042.17'] Elapsed time: 7.16255 $ taskset -c 0 ./python telco.py full Control totals: Actual ['1004737.58', '57628.30', '25042.17'] Expected ['1004737.58', '57628.30', '25042.17'] Elapsed time: 6.982884 $ taskset -c 0 ./python telco.py full Control totals: Actual ['1004737.58', '57628.30', '25042.17'] Expected ['1004737.58', '57628.30', '25042.17'] Elapsed time: 7.095349000000001 $ taskset -c 0 ./python telco.py full Control totals: Actual ['1004737.58', '57628.30', '25042.17'] Expected ['1004737.58', '57628.30', '25042.17']
msg259547 - (view)	Author: Stefan Krah (skrah) *	Date: 2016-02-04 09:27
I've cut off the highest result in the previous message: Control totals: Actual ['1004737.58', '57628.30', '25042.17'] Expected ['1004737.58', '57628.30', '25042.17'] Elapsed time: 7.304339 $ taskset -c 0 ./python telco.py full
msg259548 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-04 09:28
> taskset -c 0 ./python telco.py full Did you see that I just merged Florin's patch to add --affinity parameter to perf.py? :-) You may isolate some CPU cores using the kernel command parameter isolcpus=xxx. I don't think that the core #0 is the best choice, it may be preferred. It would be nice to collect "tricks" to get most reliable benchmark results. Maybe in perf.py README page? Or a wiki page? Whatever? :-)
msg259550 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-04 09:35
For an older project (Fusil the fuzzer), I wrote a short function to sleep until the system load is lower than a threshold. I had to write such function to reduce the noise when the system is heavily loaded. I wrote that to be able to run a very long task (it takes at least 1 hour, but may run for multiple days!) on my desktop and continue to use my desktop for various other tasks. On Linux, we can use the "cpu xxx xxx xxx ..." line of /proc/stat to get the system load. My code to read the system load: https://bitbucket.org/haypo/fusil/src/32ddc281219cd90c1ad12a3ee4ce212bea1c3e0f/fusil/linux/cpu_load.py?at=default&fileviewer=file-view-default#cpu_load.py-51 My code to wait until the system load is lower than a threshold: https://bitbucket.org/haypo/fusil/src/32ddc281219cd90c1ad12a3ee4ce212bea1c3e0f/fusil/system_calm.py?at=default&fileviewer=file-view-default#system_calm.py-5 -- I also write a script to do the opposite :-) A script to stress the system to have a system load higher or equal than a minimum load: https://bitbucket.org/haypo/misc/src/3fd3993413b128b37e945690865ea2c5ef48c446/bin/system_load.py?at=default&fileviewer=file-view-default This script helped to me reproduce sporadic failures like timeouts which only occur when the system is highly loaded.
msg259551 - (view)	Author: Stefan Krah (skrah) *	Date: 2016-02-04 09:37
Core 1 fluctuates even more (My machine only has 2 cores): $ taskset -c 1 ./python telco.py full Control totals: Actual ['1004737.58', '57628.30', '25042.17'] Expected ['1004737.58', '57628.30', '25042.17'] Elapsed time: 6.783009 Control totals: Actual ['1004737.58', '57628.30', '25042.17'] Expected ['1004737.58', '57628.30', '25042.17'] Elapsed time: 7.335563 $ taskset -c 1 ./python telco.py full I have some of the same concerns as Serhiy. There's a lot of statistics going on in the benchmark suite -- is it really possible to separate that cleanly from the actual runtime of the benchmarks?
msg259555 - (view)	Author: Florin Papa (florin.papa) *	Date: 2016-02-04 11:08
I ran perf to use calibration and there is no difference in stability compared to the unpatched version. With patch: python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python INFO:root:Automatically selected timer: perf_counter [1/1] json_dump_v2... Calibrating INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 1 --timer perf_counter` INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 2 --timer perf_counter` INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 4 --timer perf_counter` INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 8 --timer perf_counter` INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 16 --timer perf_counter` INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 32 --timer perf_counter` Calibrating => num_runs=10, num_loops=32 (0.50 sec < 0.87 sec) INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 10 -l 32 --timer perf_counter` INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 10 -l 32 --timer perf_counter` Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64 Total CPU cores: 18 ### json_dump_v2 ### Min: 0.877497 -> 0.886482: 1.01x slower <---------- Avg: 0.878150 -> 0.888351: 1.01x slower Not significant Stddev: 0.00054 -> 0.00106: 1.9481x larger Without patch: python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python INFO:root:Automatically selected timer: perf_counter [1/1] json_dump_v2... INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 50 --timer perf_counter` INFO:root:Running `taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 50 --timer perf_counter` Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64 Total CPU cores: 18 ### json_dump_v2 ### Min: 2.755514 -> 2.764131: 1.00x slower <---------- (almost) same as above Avg: 2.766546 -> 2.775587: 1.00x slower Not significant Stddev: 0.00538 -> 0.00382: 1.4069x smaller
msg259556 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-04 11:37
Stefan: "In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary: (...)" tl; dr We can tune the Linux kernel to avoid most of the system noise when running kernels. I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds. I also added the following lines to check the CPU affinity and the number of context switches: os.system("grep -E -i 'cpu\|ctx' /proc/%s/status" % os.getpid()) Well, see attached telco_haypo.py for the full script. I used my system_load.py script to get a system load >= 5.0. Without tasksel, the benchmark result changes completly: at least 5 seconds. Well, it's not really surprising, it's known that benchmarks depend on the system load. BUT I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedoera 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y. haypo@smithers$ lscpu --extended CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ 0 0 0 0 0:0:0:0 oui 5900,0000 1600,0000 1 0 0 1 1:1:1:0 oui 5900,0000 1600,0000 2 0 0 2 2:2:2:0 oui 5900,0000 1600,0000 3 0 0 3 3:3:3:0 oui 5900,0000 1600,0000 4 0 0 0 0:0:0:0 oui 5900,0000 1600,0000 5 0 0 1 1:1:1:0 oui 5900,0000 1600,0000 6 0 0 2 2:2:2:0 oui 5900,0000 1600,0000 7 0 0 3 3:3:3:0 oui 5900,0000 1600,0000 My CPU is on a single socket, has 4 physical cores, but Linux gets 8 cores because of hyper threading. I modified the Linux command line during the boot in GRUB to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I forced the CPU frequency to performance to avoid hiccups: # for id in 2 3 6 7; do echo performance > cpu$id/cpufreq/scaling_governor; done Check the config with: $ cat /sys/devices/system/cpu/isolated 2-3,6-7 $ cat /sys/devices/system/cpu/nohz_full 2-3,6-7 $ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor performance performance performance performance Ok now with this kernel config but still without tasksel on an idle system: ----------------------- Elapsed time: 2.660088424000037 Elapsed time: 2.5927538629999844 Elapsed time: 2.6135682369999813 Elapsed time: 2.5819260570000324 Elapsed time: 2.5991294099999322 Cpus_allowed: 33 Cpus_allowed_list: 0-1,4-5 voluntary_ctxt_switches: 1 nonvoluntary_ctxt_switches: 21 ----------------------- With system load >= 5.0: ----------------------- Elapsed time: 5.3484489170000415 Elapsed time: 5.336797472999933 Elapsed time: 5.187413687999992 Elapsed time: 5.24122020599998 Elapsed time: 5.10201246400004 Cpus_allowed_list: 0-1,4-5 voluntary_ctxt_switches: 1 nonvoluntary_ctxt_switches: 1597 ----------------------- And NOW using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system: ----------------------- $ taskset -c 2,3,6,7 python3 telco_haypo.py full Elapsed time: 2.579487486000062 Elapsed time: 2.5827961039999536 Elapsed time: 2.5811954810001225 Elapsed time: 2.5782033600000887 Elapsed time: 2.572370636999949 Cpus_allowed: cc Cpus_allowed_list: 2-3,6-7 voluntary_ctxt_switches: 2 nonvoluntary_ctxt_switches: 16 ----------------------- Numbers look more stable than the numbers of the first test without taskset on an idle system! You can see that number of context switches is very low (total: 18). Example of a second run: ----------------------- haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full Elapsed time: 2.538398498999868 Elapsed time: 2.544711968999991 Elapsed time: 2.5323677339999904 Elapsed time: 2.536252647000083 Elapsed time: 2.525748182999905 Cpus_allowed: cc Cpus_allowed_list: 2-3,6-7 voluntary_ctxt_switches: 2 nonvoluntary_ctxt_switches: 15 ----------------------- Third run: ----------------------- haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full Elapsed time: 2.5819172930000605 Elapsed time: 2.5783024259999365 Elapsed time: 2.578493587999901 Elapsed time: 2.5774198510000588 Elapsed time: 2.5772148999999445 Cpus_allowed: cc Cpus_allowed_list: 2-3,6-7 voluntary_ctxt_switches: 2 nonvoluntary_ctxt_switches: 15 ----------------------- Well, it's no perfect, but it looks much stable than timings without specific kernel config nor CPU pinning. Statistics on the 15 timings of the 3 runs with tunning on a heavily loaded system: >>> times [2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445] >>> statistics.mean(times) 2.564325343866661 >>> statistics.pvariance(times) 0.0004340411190965491 >>> statistics.stdev(times) 0.021564880156747315 Compare if to the timings without tunning on an idle system: >>> times [2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322] >>> statistics.mean(times) 2.6094931981999934 >>> statistics.pvariance(times) 0.0007448087075422725 >>> statistics.stdev(times) 0.030512470965620608 We get (no tuning, idle system => tuning, busy system): * Population variance: 0.00074 => 0.00043 * Standard deviation: 0.031 => 0.022 It looks much better, no? Even I only used 5 timings on the benchmark without tuning, whereas I used 15 timings on the benchmark with tuning. I expect larger variance and deviation with more times. -- Just for fun, I ran the benchmark 3 times (so to get 3x5 timings) on an idle system with tuning: >>> times [2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692] >>> import statistics >>> statistics.mean(times) 2.563515909133321 >>> statistics.pvariance(times) 0.00016384530912002678 >>> statistics.stdev(times) 0.013249473404092065 As expected, it's even better (no tune, idle system => tuning, busy system => tuning, idle system): * Population variance: 0.00074 => 0.00043 => 0.00016 * Standard deviation: 0.031 => 0.022 => 0.013
msg259557 - (view)	Author: Stefan Krah (skrah) *	Date: 2016-02-04 12:03
> STINNER Victor added the comment: > I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds. > Nice. telco.py is an ad-hoc script from the original decimal.py sandbox, I missed that it called "infil.read(8)". :) > And NOW using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system: > ----------------------- > $ taskset -c 2,3,6,7 python3 telco_haypo.py full > > Elapsed time: 2.579487486000062 > Elapsed time: 2.5827961039999536 > Elapsed time: 2.5811954810001225 > Elapsed time: 2.5782033600000887 > Elapsed time: 2.572370636999949 Great. I'll try that out in the weekend.
msg259558 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2016-02-04 12:16
Victor, this is a very interesting write-up, thank you.
msg259561 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-04 13:33
Florin Papa added the comment: > I ran perf to use calibration and there is no difference in stability > compared to the unpatched version. Sorry, what are you calling "stability"? For me, stability means that you run the same benchmark 3, 5 or 10 times, and the results must be as close as possible: see variance and standard deviation of my previous message. I'm not talking of variance/deviation of the N runs of bm_xxx.py scripts, but variance/deviation of the mean value displayed by perf.py. perf_calibration.patch is a proof-of-concept. I changed the number of runs from 50 to 10 to test my patch more easily. You should modify the patch to keep 50 runs if you want to compare the stability. By the way, --fast/--rigorous options should not only change the minimum duration of a single run to calibrate the number of loops, but they should also change the "maximum" duration of perf.py by using different number of runs.
msg259566 - (view)	Author: Florin Papa (florin.papa) *	Date: 2016-02-04 14:22
I was also talking about the variance/deviation of the mean value displayed by perf.py, sorry if I was unclear. The perf.py output in my previous message showed little difference between the patched and non-patched version. I will also try increasing the number of runs to see if there is any change. The CPU isolation feature is a great finding, thank you.
msg259570 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-02-04 15:05
tl; dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations make the benchmark more unstable... I expected the opposite... Patch version 2: * patch also performance/bm_pickle.py * change min_time from 100 ms to 500 ms with --fast * compute the number of runs using a maximum time, maximum time change with --fast and --rigorous + if options.rigorous: + min_time = 1.0 + max_time = 100.0 # 100 runs + elif options.fast: + min_time = 0.5 + max_time = 25.0 # 50 runs + else: + min_time = 0.5 + max_time = 50.0 # 100 runs To measure the stability of perf.py, I pinned perf.py to CPU cores which are isolated of the system using Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled "no HZ full" on these cores. I ran perf.py 5 times on regex_v8. Calibration (original => patched): * --fast: 1 iteration x 5 runs => 16 iterations x 50 runs * (no option): 1 iteration x 50 runs => 16 iterations x 100 runs Approximated duration of the benchmark (original => patched): * --fast: 7 sec => 7 min 34 sec * (no option): 30 sec => 14 min 35 sec (I made a mistake, so I was unable to get the exact duration.) Hum, maybe timings are not well chosen because the benchmark is really slow (minutes vs seconds) :-/ Standard deviation, --fast: * (python2) 0.00071 (1.2%, mean=0.05961) => 0.01059 (1.1%, mean=0.96723) * (python3) 0.00068 (1.5%, mean=0.04494) => 0.05925 (8.0%, mean=0.74248) * (faster) 0.02986 (2.2%, mean=1.32750) => 0.09083 (6.9%, mean=1.31000) Standard deviation, (no option): * (python2) 0.00072 (1.2%, mean=0.05957) => 0.00874 (0.9%, mean=0.97028) * (python3) 0.00053 (1.2%, mean=0.04477) => 0.00966 (1.3%, mean=0.72680) * (faster) 0.02739 (2.1%, mean=1.33000) => 0.02608 (2.0%, mean=1.33600) Variance, --fast: * (python2) 0.00000 (0.001%, mean=0.05961) => 0.00009 (0.009%, mean=0.96723) * (python3) 0.00000 (0.001%, mean=0.04494) => 0.00281 (0.378%, mean=0.74248) * (faster) 0.00067 (0.050%, mean=1.32750) => 0.00660 (0.504%, mean=1.31000) Variance, (no option): * (python2) 0.00000 (0.001%, mean=0.05957) => 0.00006 (0.006%, mean=0.97028) * (python3) 0.00000 (0.001%, mean=0.04477) => 0.00007 (0.010%, mean=0.72680) * (faster) 0.00060 (0.045%, mean=1.33000) => 0.00054 (0.041%, mean=1.33600) Legend: * (python2) are timings of python2 ran by perf.py (of the "Min" line) * (python3) are timings of python3 ran by perf.py (of the "Min" line) * (faster) are the "1.34x" numbers of "faster" or "slower" of the "Min" line * percentages are: value * 100 / mean It's not easy to compare these values since the number of iterations is very different (1 => 16) and so timings are very different (ex: 0.059 sec => 0.950 sec). I guess that it's ok to compare percentages. I used the stability.py script, attached to this issue, to compute deviation and variance from the "Min" line of the 5 runs. The script takes the output of perf.py as input. I'm not sure that 5 runs are enough to compute statistics. -- Raw data. Original perf.py. $ grep ^Min original.fast Min: 0.059236 -> 0.045948: 1.29x faster Min: 0.059005 -> 0.044654: 1.32x faster Min: 0.059601 -> 0.044547: 1.34x faster Min: 0.060605 -> 0.044600: 1.36x faster $ grep ^Min original Min: 0.060479 -> 0.044762: 1.35x faster Min: 0.059002 -> 0.045689: 1.29x faster Min: 0.058991 -> 0.044587: 1.32x faster Min: 0.060231 -> 0.044364: 1.36x faster Min: 0.059165 -> 0.044464: 1.33x faster Patched perf.py. $ grep ^Min patched.fast Min: 0.950717 -> 0.711018: 1.34x faster Min: 0.968413 -> 0.730810: 1.33x faster Min: 0.976092 -> 0.847388: 1.15x faster Min: 0.964355 -> 0.711083: 1.36x faster Min: 0.976573 -> 0.712081: 1.37x faster $ grep ^Min patched Min: 0.968810 -> 0.729109: 1.33x faster Min: 0.973615 -> 0.731308: 1.33x faster Min: 0.974215 -> 0.734259: 1.33x faster Min: 0.978781 -> 0.709915: 1.38x faster Min: 0.955977 -> 0.729387: 1.31x faster $ grep ^Calibration patched.fast Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec) Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec) Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec) Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec) Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec) $ grep ^Calibration patched Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec) Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec) Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec) Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec) Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec)
msg259582 - (view)	Author: Brett Cannon (brett.cannon) *	Date: 2016-02-04 17:47
What would happen if we shifted to counting the number of executions within a set amount of time instead of how fast a single execution occurred? I believe some JavaScript benchmarks started to do this about a decade ago when they realized CPUs were getting so fast that older benchmarks were completing too quickly to be reliably measured. This also would allow one to have a very strong notion of how long a benchmark run would take based on the number of iterations and what time length bucket a benchmark was placed in (i.e., for microbenchmarks we could say a second while for longer running benchmarks we can increase that threshold). And it won't hurt benchmark comparisons since we have always done relative comparisons rather than absolute ones.
msg273923 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-08-30 16:13
This issue was fixed in the new flavor of the benchmark, the new performance project: https://github.com/python/performance

History
Date	User	Action	Args
2022-04-11 14:58:27	admin	set	github: 70463
2016-08-30 16:13:19	vstinner	set	status: open -> closed resolution: fixed dependencies: - Fix telco benchmark messages: + msg273923
2016-02-23 09:58:09	skrah	set	dependencies: + Fix telco benchmark
2016-02-04 17:47:37	brett.cannon	set	messages: + msg259582
2016-02-04 15:05:37	vstinner	set	files: + perf_calibration-2.patch
2016-02-04 15:05:24	vstinner	set	files: + stability.py messages: + msg259570
2016-02-04 14:22:21	florin.papa	set	messages: + msg259566
2016-02-04 13:33:05	vstinner	set	messages: + msg259561
2016-02-04 12:16:08	pitrou	set	messages: + msg259558
2016-02-04 12:03:59	skrah	set	messages: + msg259557
2016-02-04 11:37:12	vstinner	set	files: + telco_haypo.py messages: + msg259556
2016-02-04 11:08:00	florin.papa	set	messages: + msg259555
2016-02-04 09:37:13	skrah	set	messages: + msg259551
2016-02-04 09:35:48	vstinner	set	messages: + msg259550
2016-02-04 09:28:12	vstinner	set	messages: + msg259548
2016-02-04 09:27:46	skrah	set	messages: + msg259547
2016-02-04 09:24:03	skrah	set	nosy: + skrah messages: + msg259546
2016-02-04 08:39:51	vstinner	set	messages: + msg259544
2016-02-04 08:33:48	florin.papa	set	messages: + msg259543
2016-02-03 20:33:03	vstinner	set	messages: + msg259515
2016-02-03 19:56:42	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg259510
2016-02-03 19:22:36	yselivanov	set	nosy: + zbyrne messages: + msg259504
2016-02-03 13:50:24	vstinner	set	nosy: + florin.papa title: perf.py: bm_regex_v8 doesn't seem reliable even with --rigorous -> perf.py: calibrate benchmarks using time, not using a fixed number of iterations
2016-02-03 11:18:27	vstinner	set	files: + perf_calibration.patch keywords: + patch messages: + msg259472
2016-02-03 10:38:49	vstinner	create