Message 259556 - Python tracker


Author	vstinner
Recipients	brett.cannon, florin.papa, pitrou, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Date	2016-02-04.11:37:09
Message-id	<1454585832.54.0.21571365928.issue26275@psf.upfronthosting.co.za>

Content

Stefan: "In my experience it is very hard to get stable benchmark results with Python.  Even long running benchmarks on an empty machine vary: (...)"

tl;dr: We *can* tune the Linux kernel to avoid most of the system noise when running benchmarks.


I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.
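
A minimal sketch of the "run the benchmark 5 times" loop, for readers without the attachment. The names `run_benchmark` and `cpu_bound_workload` are hypothetical, the workload is only a stand-in for the telco computation, and the use of `time.perf_counter()` is an assumption about how the elapsed time is measured:

```python
import time


def run_benchmark(workload, repeat=5):
    """Call workload() `repeat` times and return the elapsed time of each call."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def cpu_bound_workload():
    # Hypothetical stand-in for the telco computation: pure arithmetic, no I/O.
    total = 0
    for i in range(100_000):
        total += i * i
    return total


if __name__ == "__main__":
    for elapsed in run_benchmark(cpu_bound_workload):
        print("Elapsed time:", elapsed)
```

Keeping all I/O outside the timed region is what makes the measurement CPU-bound.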

I also added the following lines to check the CPU affinity and the number of context switches:

    os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())

Well, see attached telco_haypo.py for the full script.
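
The same fields can also be read without spawning grep. This is only a sketch, not what telco_haypo.py does: `filter_status` and `read_own_status` are hypothetical helper names, and the code assumes a Linux /proc filesystem:

```python
import os


def filter_status(text, needles=("cpu", "ctx")):
    """Return {field: value} for status lines containing any needle.

    Mirrors `grep -E -i 'cpu|ctx'`: the match is case-insensitive and
    applies to the whole line.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and any(needle in line.lower() for needle in needles):
            fields[name] = value.strip()
    return fields


def read_own_status():
    # /proc/<pid>/status only exists on Linux.
    with open("/proc/%s/status" % os.getpid()) as f:
        return filter_status(f.read())
```

On Linux, `read_own_status()` returns the `Cpus_allowed*` fields and both context switch counters as strings.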

I used my system_load.py script to get a system load >= 5.0. Without taskset, the benchmark result changes completely: each run takes at least 5 seconds. That's not really surprising: it's well known that benchmarks depend on the system load.


*BUT* I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedora 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.

haypo@smithers$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
1   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
2   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
3   0    0      3    3:3:3:0       oui    5900,0000 1600,0000
4   0    0      0    0:0:0:0       oui    5900,0000 1600,0000
5   0    0      1    1:1:1:0       oui    5900,0000 1600,0000
6   0    0      2    2:2:2:0       oui    5900,0000 1600,0000
7   0    0      3    3:3:3:0       oui    5900,0000 1600,0000

My CPU is on a single socket and has 4 physical cores, but Linux sees 8 logical CPUs because of Hyper-Threading.
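
Grouping the logical CPUs by the CORE column shows which ones share a physical core. A small sketch using the values from the lscpu table above (`siblings_by_core` is a hypothetical helper name):

```python
from collections import defaultdict

# (CPU, CORE) columns copied from the lscpu --extended output above.
CPU_TO_CORE = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 0), (5, 1), (6, 2), (7, 3)]


def siblings_by_core(cpu_core_pairs):
    """Map each physical core to the list of logical CPUs running on it."""
    cores = defaultdict(list)
    for cpu, core in cpu_core_pairs:
        cores[core].append(cpu)
    return dict(cores)


print(siblings_by_core(CPU_TO_CORE))
# {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Physical cores 2 and 3 therefore correspond to logical CPUs 2, 6 and 3, 7.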


I modified the Linux kernel command line in GRUB at boot to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I set the CPU frequency governor to performance to avoid hiccups:

# for id in 2 3 6 7; do echo performance > /sys/devices/system/cpu/cpu$id/cpufreq/scaling_governor; done

Check the config with:

$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7
$ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
performance
performance
performance
performance
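
The check above can be scripted. This sketch parses the kernel's CPU list format (`2-3,6-7`) into a set; `parse_cpu_list` and `read_cpu_list` are hypothetical helper names, and the sysfs paths are the ones shown above:

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as '2-3,6-7' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means no CPU is listed
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_cpu_list(name):
    # name is a file under /sys/devices/system/cpu, e.g. "isolated" or "nohz_full"
    with open("/sys/devices/system/cpu/" + name) as f:
        return parse_cpu_list(f.read())
```

With the configuration above, `read_cpu_list("isolated")` and `read_cpu_list("nohz_full")` should both give `{2, 3, 6, 7}`.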


OK, now with this kernel config, but still without taskset, on an idle system:
-----------------------
Elapsed time: 2.660088424000037
Elapsed time: 2.5927538629999844
Elapsed time: 2.6135682369999813
Elapsed time: 2.5819260570000324
Elapsed time: 2.5991294099999322

Cpus_allowed:	33
Cpus_allowed_list:	0-1,4-5
voluntary_ctxt_switches:	1
nonvoluntary_ctxt_switches:	21
-----------------------
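
The `Cpus_allowed` value is a hexadecimal bitmask with one bit per logical CPU, so `33` is the same set as the `0-1,4-5` list printed next to it. A short sketch of the decoding (`mask_to_cpus` is a hypothetical helper name):

```python
def mask_to_cpus(hex_mask):
    """Decode a Cpus_allowed hexadecimal mask into a sorted list of CPU numbers."""
    # Large masks are printed in comma-separated groups; drop the commas first.
    mask = int(hex_mask.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


print(mask_to_cpus("33"))  # [0, 1, 4, 5]
print(mask_to_cpus("cc"))  # [2, 3, 6, 7]
```

The mask excludes CPUs 2, 3, 6 and 7 here because isolcpus keeps ordinary processes off the isolated CPUs unless they are pinned there explicitly.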

With system load >= 5.0:
-----------------------
Elapsed time: 5.3484489170000415
Elapsed time: 5.336797472999933
Elapsed time: 5.187413687999992
Elapsed time: 5.24122020599998
Elapsed time: 5.10201246400004

Cpus_allowed_list:	0-1,4-5
voluntary_ctxt_switches:	1
nonvoluntary_ctxt_switches:	1597
-----------------------

And *NOW* using my isolated physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:
-----------------------
$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.579487486000062
Elapsed time: 2.5827961039999536
Elapsed time: 2.5811954810001225
Elapsed time: 2.5782033600000887
Elapsed time: 2.572370636999949

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	16
-----------------------

The numbers look *more* stable than those of the first test, without taskset on an idle system! You can see that the number of context switches is very low (total: 18).
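
The pinning done with taskset can also be done from inside a script with `os.sched_setaffinity()`, which is available on Linux. This is only a sketch and not part of telco_haypo.py; `pin_to_cpus` is a hypothetical helper name:

```python
import os


def pin_to_cpus(cpus):
    """Restrict the current process to `cpus`, like `taskset -c`.

    Returns the resulting affinity set, or None where the API is missing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # the scheduler affinity API is not available here
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

On the machine described here, `pin_to_cpus({2, 3, 6, 7})` would be the in-process equivalent of `taskset -c 2,3,6,7`.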

Example of a second run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.538398498999868
Elapsed time: 2.544711968999991
Elapsed time: 2.5323677339999904
Elapsed time: 2.536252647000083
Elapsed time: 2.525748182999905

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	15
-----------------------

Third run:
-----------------------
haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full 

Elapsed time: 2.5819172930000605
Elapsed time: 2.5783024259999365
Elapsed time: 2.578493587999901
Elapsed time: 2.5774198510000588
Elapsed time: 2.5772148999999445

Cpus_allowed:	cc
Cpus_allowed_list:	2-3,6-7
voluntary_ctxt_switches:	2
nonvoluntary_ctxt_switches:	15
-----------------------

Well, it's not perfect, but it looks much more stable than the timings without a specific kernel config or CPU pinning.

Statistics on the 15 timings of the 3 runs with tuning on a heavily loaded system:

>>> times
[2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445]
>>> statistics.mean(times)
2.564325343866661
>>> statistics.pvariance(times)
0.0004340411190965491
>>> statistics.stdev(times)
0.021564880156747315


Compare it to the timings without tuning on an idle system:

>>> times
[2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322]
>>> statistics.mean(times)
2.6094931981999934
>>> statistics.pvariance(times)
0.0007448087075422725
>>> statistics.stdev(times)
0.030512470965620608

We get (no tuning, idle system => tuning, busy system):

* Population variance: 0.00074 => 0.00043
* Standard deviation: 0.031 => 0.022
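
The figures for the untuned, idle system can be reproduced with the statistics module. A sketch (`summarize` is a hypothetical helper name); note that `pvariance()` divides by n while `stdev()` divides by n - 1, so `pstdev()` is included as the population counterpart:

```python
import statistics

# Timings of the run without tuning on an idle system, copied from above.
untuned_idle = [2.660088424000037, 2.5927538629999844, 2.6135682369999813,
                2.5819260570000324, 2.5991294099999322]


def summarize(times):
    """Return the summary statistics used in this message for a list of timings."""
    return {
        "mean": statistics.mean(times),
        "pvariance": statistics.pvariance(times),  # population variance (n)
        "stdev": statistics.stdev(times),          # sample std deviation (n - 1)
        "pstdev": statistics.pstdev(times),        # population std deviation (n)
    }


for name, value in summarize(untuned_idle).items():
    print("%-10s %.5f" % (name, value))
```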

It looks *much* better, no? And I only used *5* timings for the benchmark without tuning, whereas I used 15 timings for the benchmark with tuning. I would expect a larger variance and deviation with more timings.

--

Just for fun, I ran the benchmark 3 times (to get 3x5 timings) on an idle system with tuning:

>>> times
[2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
>>> import statistics
>>> statistics.mean(times)
2.563515909133321
>>> statistics.pvariance(times)
0.00016384530912002678
>>> statistics.stdev(times)
0.013249473404092065

As expected, it's even better (no tuning, idle system => tuning, busy system => tuning, idle system):

* Population variance: 0.00074 => 0.00043 => 0.00016
* Standard deviation: 0.031 => 0.022 => 0.013
