Stefan: "In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary: (...)"
tl;dr: We *can* tune the Linux kernel to avoid most of the system noise when running benchmarks.
I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.
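A minimal sketch of such a repeat-and-time loop, for readers without the attached script. The names time_runs and cpu_bound_workload are hypothetical, the workload is only a stand-in for the telco hot loop, and the use of time.perf_counter() is an assumption, not necessarily what telco_haypo.py does:

```python
import time


def time_runs(func, repeat=5):
    """Call func() `repeat` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


def cpu_bound_workload():
    # Hypothetical stand-in for the benchmark body: pure computation, no I/O.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


for elapsed in time_runs(cpu_bound_workload):
    print("Elapsed time:", elapsed)
```

Keeping all I/O outside the timed region is what makes the measurement CPU-bound.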
I also added the following lines to check the CPU affinity and the number of context switches:
os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())
Well, see attached telco_haypo.py for the full script.
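The same check can probably be done without spawning a shell. This is a sketch only: parse_status and read_own_status are hypothetical helper names, and the filter is applied to the field name rather than to the whole line as grep does:

```python
import re


def parse_status(text):
    """Return the CPU-affinity and context-switch fields of a /proc/<pid>/status dump."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Roughly the filter of grep -E -i 'cpu|ctx', applied to the field name.
        if sep and re.search(r"cpu|ctx", key, re.IGNORECASE):
            fields[key.strip()] = value.strip()
    return fields


def read_own_status():
    # Linux only: /proc/self/status describes the calling process.
    with open("/proc/self/status") as f:
        return parse_status(f.read())
```

Calling read_own_status() at the end of the benchmark would give the same Cpus_allowed and ctxt_switches fields as the grep command.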
I used my system_load.py script to get a system load >= 5.0. Without taskset, the benchmark result changes completely: at least 5 seconds per run. That's not really surprising: benchmarks are known to depend on the system load.
*BUT* I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedora 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.
haypo@smithers$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 oui 5900,0000 1600,0000
1 0 0 1 1:1:1:0 oui 5900,0000 1600,0000
2 0 0 2 2:2:2:0 oui 5900,0000 1600,0000
3 0 0 3 3:3:3:0 oui 5900,0000 1600,0000
4 0 0 0 0:0:0:0 oui 5900,0000 1600,0000
5 0 0 1 1:1:1:0 oui 5900,0000 1600,0000
6 0 0 2 2:2:2:0 oui 5900,0000 1600,0000
7 0 0 3 3:3:3:0 oui 5900,0000 1600,0000
My CPU is on a single socket and has 4 physical cores, but Linux sees 8 logical CPUs because of Hyper-Threading.
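To see which logical CPUs share a physical core, the CPU and CORE columns of the lscpu output can be grouped. A small sketch (group_by_core is a hypothetical helper; the pairs are copied by hand from the output above):

```python
from collections import defaultdict


def group_by_core(cpu_core_pairs):
    """Map each physical core id to the sorted list of logical CPUs on it."""
    groups = defaultdict(list)
    for cpu, core in cpu_core_pairs:
        groups[core].append(cpu)
    return {core: sorted(cpus) for core, cpus in groups.items()}


# (CPU, CORE) columns copied from the lscpu --extended output above.
pairs = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 0), (5, 1), (6, 2), (7, 3)]
print(group_by_core(pairs))
# → {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Isolating a whole physical core therefore means isolating both of its logical CPUs, for example 2 and 6 for core #2.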
I modified the Linux command line during the boot in GRUB to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I forced the CPU frequency scaling governor to performance to avoid hiccups (as root, from /sys/devices/system/cpu):
# for id in 2 3 6 7; do echo performance > cpu$id/cpufreq/scaling_governor; done
Check the config with:
$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
$ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
performance
performance
performance
performance
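The "2-3,6-7" strings printed above use the kernel's CPU list format. If you want to check them from a script, a minimal parser could look like this (parse_cpu_list is a hypothetical helper, and it assumes well-formed input):

```python
def parse_cpu_list(text):
    """Expand a kernel CPU list such as '2-3,6-7' into [2, 3, 6, 7]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            # An empty file means no CPU is listed.
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


print(parse_cpu_list("2-3,6-7"))
# → [2, 3, 6, 7]
```

Comparing the parsed content of the isolated and nohz_full files against the CPUs passed on the kernel command line is a cheap sanity check before running a benchmark.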
OK, now with this kernel config but still without taskset, on an idle system:
-----------------------
Elapsed time: 2.660088424000037
Elapsed time: 2.5927538629999844
Elapsed time: 2.6135682369999813
Elapsed time: 2.5819260570000324
Elapsed time: 2.5991294099999322
Cpus_allowed: 33
Cpus_allowed_list: 0-1,4-5
voluntary_ctxt_switches: 1
nonvoluntary_ctxt_switches: 21
-----------------------
With system load >= 5.0:
-----------------------
Elapsed time: 5.3484489170000415
Elapsed time: 5.336797472999933
Elapsed time: 5.187413687999992
Elapsed time: 5.24122020599998
Elapsed time: 5.10201246400004
Cpus_allowed_list: 0-1,4-5
voluntary_ctxt_switches: 1
nonvoluntary_ctxt_switches: 1597
-----------------------
And *NOW* using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:
-----------------------
$ taskset -c 2,3,6,7 python3 telco_haypo.py full
Elapsed time: 2.579487486000062
Elapsed time: 2.5827961039999536
Elapsed time: 2.5811954810001225
Elapsed time: 2.5782033600000887
Elapsed time: 2.572370636999949
Cpus_allowed: cc
Cpus_allowed_list: 2-3,6-7
voluntary_ctxt_switches: 2
nonvoluntary_ctxt_switches: 16
-----------------------
The numbers look *more* stable than those of the first test without taskset on an idle system! You can see that the number of context switches is very low (total: 18).
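The Cpus_allowed value is a hexadecimal bit mask: bit N set means logical CPU N is allowed. The sketch below decodes such a mask and shows how a script could pin itself instead of relying on taskset. Both helper names are hypothetical, os.sched_setaffinity() is only available on some platforms (Linux among them), and this is not what telco_haypo.py does:

```python
import os


def mask_to_cpus(mask_hex):
    """Decode a Cpus_allowed hex mask such as 'cc' into a list of CPU numbers."""
    # Wide masks are printed in comma-separated groups; drop the commas first.
    mask = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def pin_current_process(cpus):
    """Restrict the calling process to `cpus` and return the resulting affinity."""
    os.sched_setaffinity(0, cpus)  # 0 means "the current process"
    return sorted(os.sched_getaffinity(0))


print(mask_to_cpus("cc"))
# → [2, 3, 6, 7]
print(mask_to_cpus("33"))
# → [0, 1, 4, 5]
```

So "cc" is the isolated set and "33" is what the scheduler leaves to every other process once isolcpus is active.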
Example of a second run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full
Elapsed time: 2.538398498999868
Elapsed time: 2.544711968999991
Elapsed time: 2.5323677339999904
Elapsed time: 2.536252647000083
Elapsed time: 2.525748182999905
Cpus_allowed: cc
Cpus_allowed_list: 2-3,6-7
voluntary_ctxt_switches: 2
nonvoluntary_ctxt_switches: 15
-----------------------
Third run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full
Elapsed time: 2.5819172930000605
Elapsed time: 2.5783024259999365
Elapsed time: 2.578493587999901
Elapsed time: 2.5774198510000588
Elapsed time: 2.5772148999999445
Cpus_allowed: cc
Cpus_allowed_list: 2-3,6-7
voluntary_ctxt_switches: 2
nonvoluntary_ctxt_switches: 15
-----------------------
Well, it's not perfect, but it looks much more stable than the timings without a specific kernel config or CPU pinning.
Statistics on the 15 timings of the 3 runs with tuning on a heavily loaded system:
>>> times
[2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445]
>>> statistics.mean(times)
2.564325343866661
>>> statistics.pvariance(times)
0.0004340411190965491
>>> statistics.stdev(times)
0.021564880156747315
Compare it to the timings without tuning on an idle system:
>>> times
[2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322]
>>> statistics.mean(times)
2.6094931981999934
>>> statistics.pvariance(times)
0.0007448087075422725
>>> statistics.stdev(times)
0.030512470965620608
We get (no tuning, idle system => tuning, busy system):
* Population variance: 0.00074 => 0.00043
* Standard deviation: 0.031 => 0.022
It looks *much* better, no? And that's even though I only used *5* timings for the benchmark without tuning, whereas I used 15 timings for the benchmark with tuning. I would expect a larger variance and deviation with more timings.
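One caveat when reading these numbers: statistics.pvariance() is the population variance, while statistics.stdev() is the sample standard deviation, so the second is not exactly the square root of the first. A sketch that puts the matching pairs side by side (summarize is a hypothetical helper; the timings are the ones listed above):

```python
import math
import statistics

untuned_idle = [2.660088424000037, 2.5927538629999844, 2.6135682369999813,
                2.5819260570000324, 2.5991294099999322]
tuned_busy = [2.579487486000062, 2.5827961039999536, 2.5811954810001225,
              2.5782033600000887, 2.572370636999949, 2.538398498999868,
              2.544711968999991, 2.5323677339999904, 2.536252647000083,
              2.525748182999905, 2.5819172930000605, 2.5783024259999365,
              2.578493587999901, 2.5774198510000588, 2.5772148999999445]


def summarize(times):
    """Return mean, population and sample spread, and relative spread."""
    mean = statistics.mean(times)
    return {
        "mean": mean,
        "pvariance": statistics.pvariance(times),  # divides by n
        "pstdev": statistics.pstdev(times),        # sqrt(pvariance)
        "stdev": statistics.stdev(times),          # divides by n - 1
        "relative_stdev": statistics.stdev(times) / mean,
    }


for name, times in [("no tuning, idle", untuned_idle), ("tuning, busy", tuned_busy)]:
    print(name, summarize(times))
```

The relative standard deviation is convenient for comparing runs whose means differ slightly, as they do here.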
--
Just for fun, I ran the benchmark 3 times (to get 3x5 timings) on an idle system with tuning:
>>> times
[2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
>>> import statistics
>>> statistics.mean(times)
2.563515909133321
>>> statistics.pvariance(times)
0.00016384530912002678
>>> statistics.stdev(times)
0.013249473404092065
As expected, it's even better (no tuning, idle system => tuning, busy system => tuning, idle system):
* Population variance: 0.00074 => 0.00043 => 0.00016
* Standard deviation: 0.031 => 0.022 => 0.013