
Author vstinner
Recipients brett.cannon, florin.papa, pitrou, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Date 2016-02-04.11:37:09
Message-id <1454585832.54.0.21571365928.issue26275@psf.upfronthosting.co.za>
In-reply-to
Content
Stefan: "In my experience it is very hard to get stable benchmark results with Python.  Even long running benchmarks on an empty machine vary: (...)"

tl;dr We *can* tune the Linux kernel to avoid most of the system noise when running benchmarks.


I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.

I also added the following lines to check the CPU affinity and the number of context switches:

    os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())

Well, see attached telco_haypo.py for the full script.
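
For what it's worth, the same check can be done without spawning grep. This is only a sketch, not what telco_haypo.py does: the helper names filter_status and read_status are made up for illustration, and it assumes a Linux /proc filesystem.

```python
import os
import re

# Same filter as: grep -E -i 'cpu|ctx'
_PATTERN = re.compile(r"cpu|ctx", re.IGNORECASE)


def filter_status(text):
    """Return the 'Key: value' lines of a /proc/<pid>/status dump that
    mention cpu or ctx, as a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        if _PATTERN.search(line):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def read_status(pid=None):
    """Read and filter /proc/<pid>/status (Linux only, hypothetical helper)."""
    pid = os.getpid() if pid is None else pid
    with open("/proc/%s/status" % pid) as f:
        return filter_status(f.read())
```

Calling read_status() at the end of the benchmark would give the affinity mask and the two context switch counters as a dictionary instead of printed text.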

I used my system_load.py script to get a system load >= 5.0. Without taskset, the benchmark result changes completely: each run takes at least 5 seconds. Well, that's not really surprising: it's known that benchmarks depend on the system load.


*BUT* I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedora 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.

haypo@smithers$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    5900,0000 1600,0000
1   0    0      1    1:1:1:0       yes    5900,0000 1600,0000
2   0    0      2    2:2:2:0       yes    5900,0000 1600,0000
3   0    0      3    3:3:3:0       yes    5900,0000 1600,0000
4   0    0      0    0:0:0:0       yes    5900,0000 1600,0000
5   0    0      1    1:1:1:0       yes    5900,0000 1600,0000
6   0    0      2    2:2:2:0       yes    5900,0000 1600,0000
7   0    0      3    3:3:3:0       yes    5900,0000 1600,0000

My CPU is on a single socket and has 4 physical cores, but Linux sees 8 logical CPUs because of Hyper-Threading.


I edited the Linux kernel command line in GRUB at boot to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I set the CPU frequency governor to performance to avoid hiccups:

# for id in 2 3 6 7; do echo performance > /sys/devices/system/cpu/cpu$id/cpufreq/scaling_governor; done

Check the config with:

$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
$ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
performance
performance
performance
performance
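
If you want to check this from a script rather than with cat, something like the following could work. It is a rough sketch: parse_cpu_list and read_cpu_set are names I made up, and I assume the sysfs files use the "2-3,6-7" list format shown above.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as '2-3,6-7' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPU is listed
        first, sep, last = part.partition("-")
        # A bare number like '3' is a range of one CPU.
        cpus.update(range(int(first), int(last if sep else first) + 1))
    return cpus


def read_cpu_set(name):
    """Read /sys/devices/system/cpu/<name> (for example 'isolated' or
    'nohz_full'); return an empty set if the file does not exist."""
    try:
        with open("/sys/devices/system/cpu/" + name) as f:
            return parse_cpu_list(f.read())
    except FileNotFoundError:
        return set()
```

A benchmark script could then refuse to run, or print a warning, when read_cpu_set("isolated") is empty.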


Ok, now with this kernel config but still without taskset, on an idle system:
-----------------------
Elapsed time: 2.660088424000037
Elapsed time: 2.5927538629999844
Elapsed time: 2.6135682369999813
Elapsed time: 2.5819260570000324
Elapsed time: 2.5991294099999322

Cpus_allowed:	33
Cpus_allowed_list:	0-1,4-5
voluntary_ctxt_switches:	1
nonvoluntary_ctxt_switches:	21
-----------------------

With system load >= 5.0:
-----------------------
Elapsed time: 5.3484489170000415
Elapsed time: 5.336797472999933
Elapsed time: 5.187413687999992
Elapsed time: 5.24122020599998
Elapsed time: 5.10201246400004

Cpus_allowed_list:	0-1,4-5
voluntary_ctxt_switches:	1
nonvoluntary_ctxt_switches:	1597
-----------------------

And *NOW* using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:
-----------------------
$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.579487486000062
Elapsed time: 2.5827961039999536
Elapsed time: 2.5811954810001225
Elapsed time: 2.5782033600000887
Elapsed time: 2.572370636999949

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	16
-----------------------

Numbers look *more* stable than the numbers of the first test without taskset on an idle system! You can see that the number of context switches is very low (total: 18).
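
A note on reading the output: Cpus_allowed is a hexadecimal bitmask where bit N stands for Linux CPU N, so "cc" is CPUs 2, 3, 6, 7 and "33" is CPUs 0, 1, 4, 5. The sketch below decodes such a mask. It also shows os.sched_setaffinity() as a possible way to pin from inside the script; that is an alternative to taskset, not what I used here, and the helper names are hypothetical.

```python
import os


def mask_to_cpus(hex_mask):
    """Decode a Cpus_allowed hex mask (e.g. 'cc') into a sorted CPU list."""
    # On machines with many CPUs the kernel separates 32-bit groups
    # with commas; drop them before parsing.
    value = int(hex_mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]


def pin_current_process(cpus):
    """Pin the calling process to the given CPUs and return the resulting
    affinity. os.sched_setaffinity() is only available on some platforms
    (Linux among them)."""
    os.sched_setaffinity(0, set(cpus))
    return sorted(os.sched_getaffinity(0))
```

For example, pin_current_process([2, 3, 6, 7]) at the top of the benchmark should have a similar effect to launching it with taskset -c 2,3,6,7, assuming those CPUs exist on the machine.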

Example of a second run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.538398498999868
Elapsed time: 2.544711968999991
Elapsed time: 2.5323677339999904
Elapsed time: 2.536252647000083
Elapsed time: 2.525748182999905

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	15
-----------------------

Third run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.5819172930000605
Elapsed time: 2.5783024259999365
Elapsed time: 2.578493587999901
Elapsed time: 2.5774198510000588
Elapsed time: 2.5772148999999445

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	15
-----------------------

Well, it's not perfect, but it looks much more stable than the timings without a specific kernel config or CPU pinning.

Statistics on the 15 timings of the 3 runs with tuning on a heavily loaded system:

>>> times
[2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445]
>>> statistics.mean(times)
2.564325343866661
>>> statistics.pvariance(times)
0.0004340411190965491
>>> statistics.stdev(times)
0.021564880156747315


Compare it to the timings without tuning on an idle system:

>>> times
[2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322]
>>> statistics.mean(times)
2.6094931981999934
>>> statistics.pvariance(times)
0.0007448087075422725
>>> statistics.stdev(times)
0.030512470965620608

We get (no tuning, idle system => tuning, busy system):

* Population variance: 0.00074 => 0.00043
* Standard deviation: 0.031 => 0.022

It looks *much* better, no? And that is even though I only used *5* timings for the benchmark without tuning, whereas I used 15 timings for the benchmark with tuning. I expect larger variance and deviation with more timings.
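
To redo the comparison, here is a small sketch that recomputes the figures from the timings quoted in this message. One thing to keep in mind: statistics.pvariance() is the population variance, while statistics.stdev() is the *sample* standard deviation (n - 1 denominator), so the second is not simply the square root of the first. The summarize helper and its pstdev field are mine, added only for illustration.

```python
import math
import statistics

# No tuning, idle system (5 timings from this message).
untuned_idle = [
    2.660088424000037, 2.5927538629999844, 2.6135682369999813,
    2.5819260570000324, 2.5991294099999322,
]

# Tuning + taskset, heavily loaded system (15 timings from this message).
tuned_busy = [
    2.579487486000062, 2.5827961039999536, 2.5811954810001225,
    2.5782033600000887, 2.572370636999949, 2.538398498999868,
    2.544711968999991, 2.5323677339999904, 2.536252647000083,
    2.525748182999905, 2.5819172930000605, 2.5783024259999365,
    2.578493587999901, 2.5774198510000588, 2.5772148999999445,
]


def summarize(times):
    """Return count, mean, population variance, and both flavours of
    standard deviation for a list of timings."""
    pvar = statistics.pvariance(times)
    return {
        "n": len(times),
        "mean": statistics.mean(times),
        "pvariance": pvar,               # divides by n
        "pstdev": math.sqrt(pvar),       # population standard deviation
        "stdev": statistics.stdev(times),  # sample: divides by n - 1
    }


for label, times in (("no tuning, idle", untuned_idle),
                     ("tuning, busy", tuned_busy)):
    s = summarize(times)
    print("%-16s n=%2d mean=%.3f pvariance=%.5f stdev=%.3f"
          % (label, s["n"], s["mean"], s["pvariance"], s["stdev"]))
```

With only 5 samples the gap between the population and sample figures is larger than with 15, which is one more reason to collect the same number of timings on both sides before drawing firm conclusions.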

--

Just for fun, I ran the benchmark 3 times (so to get 3x5 timings) on an idle system with tuning:

>>> times
[2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
>>> import statistics
>>> statistics.mean(times)
2.563515909133321
>>> statistics.pvariance(times)
0.00016384530912002678
>>> statistics.stdev(times)
0.013249473404092065

As expected, it's even better (no tuning, idle system => tuning, busy system => tuning, idle system):

* Population variance: 0.00074 => 0.00043 => 0.00016
* Standard deviation: 0.031 => 0.022 => 0.013