
Author vstinner
Recipients brett.cannon, florin.papa, pitrou, serhiy.storchaka, skrah, vstinner, yselivanov, zbyrne
Date 2016-02-04.15:05:22
Message-id <1454598324.37.0.252389063605.issue26275@psf.upfronthosting.co.za>
Content
tl;dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations makes the benchmark *more* unstable... I expected the opposite...


Patch version 2:

* also patch performance/bm_pickle.py
* change min_time from 100 ms to 500 ms with --fast
* compute the number of runs from a maximum total time; the maximum time changes with --fast and --rigorous (a sketch of the implied calibration follows the snippet below)

+    if options.rigorous:
+        min_time = 1.0
+        max_time = 100.0  # 100 runs
+    elif options.fast:
+        min_time = 0.5
+        max_time = 25.0   # 50 runs
+    else:
+        min_time = 0.5
+        max_time = 50.0   # 100 runs
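For context, the calibration these settings imply could look roughly like the sketch below. This is not the actual patch: bench_once is a hypothetical callable that runs the benchmark num_loops times and returns the elapsed time, and the num_runs formula (max_time / min_time) is inferred from the "# 100 runs" comments and the calibration lines in the raw data.

    def calibrate(bench_once, min_time, max_time):
        # Double num_loops until a single run is slower than min_time,
        # so each timing is long enough to be measured reliably
        # (this reaches num_loops=16 in the raw data below).
        num_loops = 1
        while True:
            run_time = bench_once(num_loops)
            if run_time > min_time:
                break
            num_loops *= 2
        # Cap the total duration: with at least min_time per run,
        # at most max_time / min_time runs fit in the budget
        # (e.g. 25.0 / 0.5 = 50 runs with --fast).
        num_runs = int(max_time / min_time)
        print("Calibration: num_runs=%s, num_loops=%s "
              "(%.2f sec per run > min_time %.2f sec, "
              "estimated total: %.1f sec)"
              % (num_runs, num_loops, run_time, min_time,
                 run_time * num_runs))
        return num_runs, num_loops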


To measure the stability of perf.py, I pinned perf.py to CPU cores isolated from the rest of the system using the Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled full tickless mode ("nohz_full") on these cores.
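(On Python 3.3+ and Linux, a process can also pin itself to such cores; a minimal sketch, assuming cores 2 and 3 were the ones passed to isolcpus:)

    import os

    # Pin the current process (pid 0 = self) to the isolated cores.
    # The core ids 2 and 3 are hypothetical; they must match the
    # isolcpus= kernel parameter.
    os.sched_setaffinity(0, {2, 3})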

I ran perf.py 5 times on regex_v8.


Calibration (original => patched):

* --fast: 1 iteration x 5 runs => 16 iterations x 50 runs
* (no option): 1 iteration x 50 runs => 16 iterations x 100 runs


Approximate duration of the benchmark (original => patched):

* --fast: 7 sec => 7 min 34 sec
* (no option): 30 sec => 14 min 35 sec

(I made a mistake, so I was unable to get the exact duration.)

Hum, maybe the timings are not well chosen, because the benchmark becomes really slow (minutes instead of seconds) :-/


Standard deviation, --fast:

* (python2) 0.00071 (1.2%, mean=0.05961) => 0.01059 (1.1%, mean=0.96723)
* (python3) 0.00068 (1.5%, mean=0.04494) => 0.05925 (8.0%, mean=0.74248)
* (faster) 0.02986 (2.2%, mean=1.32750) => 0.09083 (6.9%, mean=1.31000)

Standard deviation, (no option):

* (python2) 0.00072 (1.2%, mean=0.05957) => 0.00874 (0.9%, mean=0.97028)
* (python3) 0.00053 (1.2%, mean=0.04477) => 0.00966 (1.3%, mean=0.72680)
* (faster) 0.02739 (2.1%, mean=1.33000) => 0.02608 (2.0%, mean=1.33600)

Variance, --fast:

* (python2) 0.00000 (0.001%, mean=0.05961) => 0.00009 (0.009%, mean=0.96723)
* (python3) 0.00000 (0.001%, mean=0.04494) => 0.00281 (0.378%, mean=0.74248)
* (faster) 0.00067 (0.050%, mean=1.32750) => 0.00660 (0.504%, mean=1.31000)

Variance, (no option):

* (python2) 0.00000 (0.001%, mean=0.05957) => 0.00006 (0.006%, mean=0.97028)
* (python3) 0.00000 (0.001%, mean=0.04477) => 0.00007 (0.010%, mean=0.72680)
* (faster) 0.00060 (0.045%, mean=1.33000) => 0.00054 (0.041%, mean=1.33600)

Legend:

* (python2) are the timings of python2 run by perf.py (from the "Min" line)
* (python3) are the timings of python3 run by perf.py (from the "Min" line)
* (faster) are the "1.34x" numbers of "faster" or "slower" from the "Min" line
* percentages are computed as: value * 100 / mean

It's not easy to compare these values directly since the number of iterations is very different (1 => 16), and so the timings themselves are very different (ex: 0.059 sec => 0.950 sec). I assume it's ok to compare the percentages.


I used the stability.py script, attached to this issue, to compute the standard deviation and variance of the "Min" lines across the 5 runs. The script takes the output of perf.py as input.
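The script boils down to something like the sketch below, using the statistics module. This is a reimplementation under assumptions, not the attached stability.py: it parses only the first timing of each "Min" line (the python2 column; the second timing and the "1.34x" ratio can be parsed the same way) and computes percentages as value * 100 / mean, as described in the legend above.

    import statistics
    import sys

    # Collect the first timing of each "Min" line, e.g.:
    # "Min: 0.950717 -> 0.711018: 1.34x faster"
    values = [float(line.split()[1])
              for line in sys.stdin
              if line.startswith('Min:')]

    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    variance = statistics.variance(values)
    print("stdev: %.5f (%.1f%%, mean=%.5f)"
          % (stdev, stdev * 100.0 / mean, mean))
    print("variance: %.5f (%.3f%%, mean=%.5f)"
          % (variance, variance * 100.0 / mean, mean))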

I'm not sure that 5 runs are enough to compute statistics.

--

Raw data.

Original perf.py.

$ grep ^Min original.fast 
Min: 0.059236 -> 0.045948: 1.29x faster
Min: 0.059005 -> 0.044654: 1.32x faster
Min: 0.059601 -> 0.044547: 1.34x faster
Min: 0.060605 -> 0.044600: 1.36x faster

$ grep ^Min original
Min: 0.060479 -> 0.044762: 1.35x faster
Min: 0.059002 -> 0.045689: 1.29x faster
Min: 0.058991 -> 0.044587: 1.32x faster
Min: 0.060231 -> 0.044364: 1.36x faster
Min: 0.059165 -> 0.044464: 1.33x faster

Patched perf.py.

$ grep ^Min patched.fast 
Min: 0.950717 -> 0.711018: 1.34x faster
Min: 0.968413 -> 0.730810: 1.33x faster
Min: 0.976092 -> 0.847388: 1.15x faster
Min: 0.964355 -> 0.711083: 1.36x faster
Min: 0.976573 -> 0.712081: 1.37x faster

$ grep ^Min patched
Min: 0.968810 -> 0.729109: 1.33x faster
Min: 0.973615 -> 0.731308: 1.33x faster
Min: 0.974215 -> 0.734259: 1.33x faster
Min: 0.978781 -> 0.709915: 1.38x faster
Min: 0.955977 -> 0.729387: 1.31x faster

$ grep ^Calibration patched.fast 
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec)
Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec)
Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec)

$ grep ^Calibration patched
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec)
Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec)
Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec)
Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec)