perf.py: calibrate benchmarks using time, not using a fixed number of iterations #70463
Comments
Hi, I'm working on some optimization projects like FAT Python (PEP 509: issue bpo-26058, PEP 510: issue bpo-26098, and PEP 511: issue bpo-26145) and faster memory allocators (issue bpo-26249). I have the *feeling* that perf.py output is not reliable even if it takes more than 20 minutes :-/ Maybe because Yury told me that I must use -r (--rigorous) :-)

Example with 5 runs of "python3 perf.py ../default/python ../default/python.orig -b regex_v8":

---------------
### regex_v8 ###
### regex_v8 ###
### regex_v8 ###
### regex_v8 ###
### regex_v8 ###

I only care about the "Min" line; IMHO it's the most interesting information here. The slowdown is between 12% and 20%, which for me is a big difference.

It looks like some benchmarks have very short iterations compared to others. For example, bm_json_v2 takes around 3 seconds per run, whereas bm_regex_v8 takes less than 0.050 second (50 ms).

$ python3 performance/bm_json_v2.py -n 3 --timer perf_counter
3.310384973010514
3.3116717970115133
3.3077902760123834
$ python3 performance/bm_regex_v8.py -n 3 --timer perf_counter
0.0670697659952566
0.04515827298746444
0.045114840992027894

Do you think that bm_regex_v8 is reliable?

I see that there is an "iteration scaling" to run the benchmarks with more iterations. Maybe we can start by increasing the "iteration scaling" for bm_regex_v8?

Instead of a fixed number of iterations, we should redesign the benchmarks to use time. For example, one iteration must take at least 100 ms and should not take more than 1 second (but it may take longer to get more reliable results). The benchmark is then responsible for adjusting its internal parameters.

I used this design for my "benchmark.py" script, which is written to get "reliable" microbenchmarks: the script is based on time and calibrates the benchmark. It also uses the *effective* resolution of the clock used by the benchmark for the calibration.

I will maybe work on such a patch, but it would be good to know your opinion on such a change first. I guess that we should use the base python to calibrate the benchmark and then pass the same parameters to the modified python. |
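The time-based calibration described above (each timed run must last at least a minimum time; the benchmark adjusts its own loop count) could be sketched roughly like this. This is a hypothetical illustration, not perf.py code; `calibrate` and `bench` are invented names:

```python
import time

def calibrate(bench_func, min_time=0.1):
    """Double the loop count until one timed run lasts at least min_time.

    bench_func(loops) runs the workload `loops` times and returns the
    elapsed time in seconds. Doubling keeps the final run near
    2 * min_time for fast workloads, below the 1-second upper bound
    suggested above.
    """
    loops = 1
    while True:
        if bench_func(loops) >= min_time:
            return loops
        loops *= 2

# Hypothetical micro-workload standing in for a bm_xxx.py benchmark.
def bench(loops):
    start = time.perf_counter()
    for _ in range(loops):
        sum(range(1000))
    return time.perf_counter() - start

loops = calibrate(bench)
print(loops)  # a power of two; the value is machine dependent
```

With this scheme, very short benchmarks like bm_regex_v8 automatically get more iterations, while slow ones like bm_json_v2 stay at one loop per run.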
Well, it was simpler than I expected to implement calibration. Here is a PoC for regex_v8 and json_dump_v2. json_dump_v2 takes 3 seconds per run, but it already uses internally 100 loop iterations. I divided the number of iterations by 100 in my patch. |
Here's a very interesting table from Zach Byrne: http://bugs.python.org/issue21955#msg259490 It shows that some benchmarks are indeed very unstable. This also correlates with my own experience. These ones are very unstable: pickle_dict, nbody, regex_v8, etree_parse, telco. |
If the running time is close to the limit, different runs will use different numbers of repetitions. This will add additional instability. I prefer a stable number of repetitions, manually calibrated for an average modern computer. |
Maybe, to be honest I don't know. How can we decide if a patch makes
The problem is to define an "average modern computer". |
I have tested the patch and it does not seem to solve the stability problem.

With patch:

Without patch:

Instead, I notice that, especially for the first runs, the measured … What do you think? |
Oh. There is a bit of confusion here. You must *not* run the bm_xxx.py scripts directly. The calibration is done in perf.py. Try for example:

You should see something like:

I should maybe share the calibration code to also compute the number of iterations when a bm_xxx.py script is run directly? But the risk is that someone compares two runs of bm_xxx.py using two python binaries and sees different results just because the number of calibrated loops is different... |
In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary:

wget http://www.bytereef.org/software/mpdecimal/benchmarks/telco.py
taskset -c 0 ./python telco.py full

$ taskset -c 0 ./python telco.py full
Control totals:
Control totals:
Control totals:
Control totals:
|
I've cut off the highest result in the previous message: Control totals: |
Did you see that I just merged Florin's patch to add the --affinity parameter to perf.py? :-) You may isolate some CPU cores using the kernel command line parameter isolcpus=xxx. I don't think that core #0 is the best choice: the kernel tends to prefer it (for example, for interrupt handling). It would be nice to collect "tricks" to get the most reliable benchmark results. Maybe in the perf.py README? Or a wiki page? Whatever? :-) |
For an older project (Fusil the fuzzer), I wrote a short function to sleep until the system load is lower than a threshold. I had to write such a function to reduce the noise when the system is heavily loaded. I wrote it to be able to run a very long task (it takes at least 1 hour, but may run for multiple days!) on my desktop and continue to use my desktop for various other tasks.

On Linux, we can use the "cpu xxx xxx xxx ..." line of /proc/stat to get the system load.

My code to read the system load:

My code to wait until the system load is lower than a threshold:

--

I also wrote a script to do the opposite :-) A script to stress the system to get a system load higher than or equal to a minimum load:

This script helped me to reproduce sporadic failures like timeouts which only occur when the system is highly loaded. |
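The Fusil helpers themselves are only referenced above, not quoted. A rough, hypothetical sketch of the approach described (sample the aggregate "cpu" line of /proc/stat twice and compare busy vs. idle jiffies) might look like this; it assumes Linux, and all function names here are invented:

```python
import time

def parse_cpu_times(stat_line):
    """Parse the aggregate 'cpu ...' line of /proc/stat into (busy, total).

    Fields after the 'cpu' label are jiffies: user, nice, system, idle,
    iowait, ...; idle and iowait count as idle time, the rest as busy.
    """
    fields = [int(part) for part in stat_line.split()[1:]]
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    total = sum(fields)
    return total - idle, total

def system_load(sample_a, sample_b):
    """Busy fraction (0.0 to 1.0) of the CPU between two /proc/stat samples."""
    busy_a, total_a = parse_cpu_times(sample_a)
    busy_b, total_b = parse_cpu_times(sample_b)
    elapsed = total_b - total_a
    return (busy_b - busy_a) / elapsed if elapsed else 0.0

def wait_for_idle_system(max_load=0.25, poll=1.0, timeout=60.0):
    """Sleep until the system load drops below max_load (Linux only)."""
    deadline = time.monotonic() + timeout
    with open("/proc/stat") as f:
        prev = f.readline()
    while time.monotonic() < deadline:
        time.sleep(poll)
        with open("/proc/stat") as f:
            cur = f.readline()
        if system_load(prev, cur) < max_load:
            return True
        prev = cur
    return False

# Two synthetic samples: 100 extra busy jiffies out of 400 total => 0.25.
a = "cpu 100 0 100 800 0 0 0 0 0 0"
b = "cpu 150 0 150 1100 0 0 0 0 0 0"
print(system_load(a, b))  # -> 0.25
```

A benchmark driver could call `wait_for_idle_system()` before each run to avoid measuring under load.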
Core 1 fluctuates even more (my machine only has 2 cores):

$ taskset -c 1 ./python telco.py full
Control totals:

I have some of the same concerns as Serhiy. There's a lot of statistics going on in the benchmark suite -- is it really possible to separate that cleanly from the actual runtime of the benchmarks? |
I ran perf.py with calibration and there is no difference in stability.

With patch:

python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
### json_dump_v2 ###

Without patch:

python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
### json_dump_v2 ###
|
Stefan: "In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary: (...)"

tl;dr We *can* tune the Linux kernel to avoid most of the system noise when running benchmarks.

I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.

I also added the following line to check the CPU affinity and the number of context switches:

os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())

Well, see the attached telco_haypo.py for the full script.

I used my system_load.py script to get a system load >= 5.0. Without taskset, the benchmark result changes completely: at least 5 seconds per run. Well, it's not really surprising; it's well known that benchmark results depend on the system load.

*BUT* I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedora 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.

haypo@smithers$ lscpu --extended

My CPU is on a single socket and has 4 physical cores, but Linux sees 8 cores because of hyper-threading.

I modified the Linux command line during boot in GRUB to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I forced the CPU frequency governor to "performance" to avoid hiccups:

# cd /sys/devices/system/cpu
# for id in 2 3 6 7; do echo performance > cpu$id/cpufreq/scaling_governor; done

Check the config with:

$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
$ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
performance
performance
performance
performance

Ok, now with this kernel config but still without taskset, on an idle system:

Cpus_allowed: 33

With system load >= 5.0:

Cpus_allowed_list: 0-1,4-5

And *NOW* using my isolated physical CPU cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:

$ taskset -c 2,3,6,7 python3 telco_haypo.py full
Elapsed time: 2.579487486000062
Cpus_allowed: cc

The numbers look *more* stable than the numbers of the first test without taskset on an idle system! You can see that the number of context switches is very low (total: 18).

Example of a second run:

Elapsed time: 2.538398498999868
Cpus_allowed: cc

Third run:

Elapsed time: 2.5819172930000605
Cpus_allowed: cc

Well, it's not perfect, but it looks much more stable than timings without the specific kernel config and CPU pinning.

Statistics on the 15 timings of the 3 runs with tuning on a heavily loaded system:

>>> times
[2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445]
>>> statistics.mean(times)
2.564325343866661
>>> statistics.pvariance(times)
0.0004340411190965491
>>> statistics.stdev(times)
0.021564880156747315

Compare it to the timings without tuning on an idle system:

>>> times
[2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322]
>>> statistics.mean(times)
2.6094931981999934
>>> statistics.pvariance(times)
0.0007448087075422725
>>> statistics.stdev(times)
0.030512470965620608

We get (no tuning, idle system => tuning, busy system):
It looks *much* better, no? Note that I only used *5* timings for the benchmark without tuning, whereas I used 15 timings for the benchmark with tuning; I would expect larger variance and deviation with more timings.

--

Just for fun, I ran the benchmark 3 times (so as to get 3x5 timings) on an idle system with tuning:

>>> times
[2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
>>> import statistics
>>> statistics.mean(times)
2.563515909133321
>>> statistics.pvariance(times)
0.00016384530912002678
>>> statistics.stdev(times)
0.013249473404092065

As expected, it's even better (no tuning, idle system => tuning, busy system => tuning, idle system):
|
Nice. telco.py is an ad-hoc script from the original decimal.py sandbox,
Great. I'll try that out in the weekend. |
Victor, this is a very interesting write-up, thank you. |
Florin Papa added the comment:
Sorry, what are you calling "stability"? For me, stability means that …

I'm not talking of the variance/deviation of the N runs of bm_xxx.py …

perf_calibration.patch is a proof-of-concept. I changed the number of …

By the way, the --fast/--rigorous options should not only change the … |
I was also talking about the variance/deviation of the mean value …

The CPU isolation feature is a great finding, thank you. |
tl;dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations makes the benchmark more unstable... I expected the opposite... Patch version 2:
+ if options.rigorous:

To measure the stability of perf.py, I pinned perf.py to CPU cores which are isolated from the rest of the system using the Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled "no HZ full" on these cores.

I ran perf.py 5 times on regex_v8.

Calibration (original => patched):
Approximated duration of the benchmark (original => patched):
(I made a mistake, so I was unable to get the exact duration.) Hum, maybe timings are not well chosen because the benchmark is really slow (minutes vs seconds) :-/ Standard deviation, --fast:
Standard deviation, (no option):
Variance, --fast:
Variance, (no option):
Legend:
It's not easy to compare these values since the number of iterations is very different (1 => 16) and so the timings are very different (ex: 0.059 sec => 0.950 sec). I guess that it's OK to compare percentages.

I used the stability.py script, attached to this issue, to compute the deviation and variance from the "Min" lines of the 5 runs. The script takes the output of perf.py as input. I'm not sure that 5 runs are enough to compute statistics.

--

Raw data.

Original perf.py:

$ grep ^Min original.fast
Min: 0.059236 -> 0.045948: 1.29x faster
Min: 0.059005 -> 0.044654: 1.32x faster
Min: 0.059601 -> 0.044547: 1.34x faster
Min: 0.060605 -> 0.044600: 1.36x faster
$ grep ^Min original
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster

Patched perf.py:

$ grep ^Min patched.fast
Min: 0.950717 -> 0.711018: 1.34x faster
Min: 0.968413 -> 0.730810: 1.33x faster
Min: 0.976092 -> 0.847388: 1.15x faster
Min: 0.964355 -> 0.711083: 1.36x faster
Min: 0.976573 -> 0.712081: 1.37x faster
$ grep ^Min patched
Min: 0.968810 -> 0.729109: 1.33x faster
Min: 0.973615 -> 0.731308: 1.33x faster
Min: 0.974215 -> 0.734259: 1.33x faster
Min: 0.978781 -> 0.709915: 1.38x faster
Min: 0.955977 -> 0.729387: 1.31x faster
$ grep ^Calibration patched.fast
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec)
$ grep ^Calibration patched
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec)
Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec)
Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec) |
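stability.py itself is attached to the issue and not reproduced here. A minimal sketch of the same idea (parse the "Min" lines from perf.py output and feed the timings to the statistics module) might look like this; the function names are invented, and the sample data is the set of unpatched "Min" lines quoted above:

```python
import re
import statistics

def parse_min_lines(text):
    """Extract (base, patched) timing pairs from perf.py 'Min:' lines."""
    pattern = re.compile(r"^Min:\s+([\d.]+)\s+->\s+([\d.]+):", re.MULTILINE)
    return [(float(a), float(b)) for a, b in pattern.findall(text)]

def stability(timings):
    """Standard deviation, and the deviation as a percentage of the mean."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)
    return stdev, 100.0 * stdev / mean

# The five "Min" lines of the original (unpatched) perf.py runs above.
output = """\
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster
"""
pairs = parse_min_lines(output)
base = [a for a, _ in pairs]
stdev, percent = stability(base)
print(stdev, percent)
```

Reporting the deviation as a percentage of the mean makes runs with different loop counts (and therefore very different absolute timings) comparable.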
What would happen if we shifted to counting the number of executions within a set amount of time instead of how fast a single execution occurred? I believe some JavaScript benchmarks started to do this about a decade ago when they realized CPUs were getting so fast that older benchmarks were completing too quickly to be reliably measured. This also would allow one to have a very strong notion of how long a benchmark run would take based on the number of iterations and what time length bucket a benchmark was placed in (i.e., for microbenchmarks we could say a second while for longer running benchmarks we can increase that threshold). And it won't hurt benchmark comparisons since we have always done relative comparisons rather than absolute ones. |
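The "count executions within a fixed time budget" idea suggested above can be sketched as follows; `runs_in_budget` is a hypothetical helper, not part of perf.py:

```python
import time

def runs_in_budget(workload, budget=1.0):
    """Count complete executions of `workload` within a fixed time budget.

    Comparing two interpreters then means comparing counts (higher is
    faster): a relative measure, matching how perf.py reports results.
    """
    count = 0
    deadline = time.perf_counter() + budget
    while time.perf_counter() < deadline:
        workload()
        count += 1
    return count

# Hypothetical micro-workload; a real benchmark body would plug in here.
count = runs_in_budget(lambda: sum(range(1000)), budget=0.2)
print(count)
```

One benefit: the total wall-clock time of a benchmark run is known in advance from the budget, regardless of how fast the machine is.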
This issue was fixed in the new flavor of the benchmark suite, the new performance project: |