
perf.py: calibrate benchmarks using time, not using a fixed number of iterations #70463

Closed
vstinner opened this issue Feb 3, 2016 · 21 comments
Labels: performance (Performance or resource usage)

vstinner commented Feb 3, 2016

BPO 26275
Nosy @brettcannon, @pitrou, @vstinner, @skrah, @serhiy-storchaka, @1st1, @florinpapa
Files
  • perf_calibration.patch
  • telco_haypo.py
  • stability.py
  • perf_calibration-2.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-08-30.16:13:19.990>
    created_at = <Date 2016-02-03.10:38:49.820>
    labels = ['performance']
    title = 'perf.py: calibrate benchmarks using time, not using a fixed number of iterations'
    updated_at = <Date 2016-08-30.16:13:19.988>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2016-08-30.16:13:19.988>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-08-30.16:13:19.990>
    closer = 'vstinner'
    components = ['Benchmarks']
    creation = <Date 2016-02-03.10:38:49.820>
    creator = 'vstinner'
    dependencies = []
    files = ['41790', '41802', '41803', '41804']
    hgrepos = []
    issue_num = 26275
    keywords = ['patch']
    message_count = 21.0
    messages = ['259469', '259472', '259504', '259510', '259515', '259543', '259544', '259546', '259547', '259548', '259550', '259551', '259555', '259556', '259557', '259558', '259561', '259566', '259570', '259582', '273923']
    nosy_count = 8.0
    nosy_names = ['brett.cannon', 'pitrou', 'vstinner', 'skrah', 'serhiy.storchaka', 'yselivanov', 'zbyrne', 'florin.papa']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26275'
    versions = ['Python 3.6']

vstinner commented Feb 3, 2016

    Hi, I'm working on some optimizations projects like FAT Python (PEP-509: issue bpo-26058, PEP-510: issue bpo-26098, and PEP-511: issue bpo-26145) and faster memory allocators (issue bpo-26249).

I have the *feeling* that the perf.py output is not reliable even though it takes more than 20 minutes :-/ Maybe because Yury told me that I must use -r (--rigorous) :-)

    Example with 5 runs of "python3 perf.py ../default/python ../default/python.orig -b regex_v8":

    ---------------
    Report on Linux smithers 4.3.3-300.fc23.x86_64 #1 SMP Tue Jan 5 23:31:01 UTC 2016 x86_64 x86_64
    Total CPU cores: 8

    ### regex_v8 ###
    Min: 0.043237 -> 0.050196: 1.16x slower
    Avg: 0.043714 -> 0.050574: 1.16x slower
    Significant (t=-19.83)
    Stddev: 0.00171 -> 0.00174: 1.0178x larger

    ### regex_v8 ###
    Min: 0.042774 -> 0.051420: 1.20x slower
    Avg: 0.043843 -> 0.051874: 1.18x slower
    Significant (t=-14.46)
    Stddev: 0.00351 -> 0.00176: 2.0009x smaller

    ### regex_v8 ###
    Min: 0.042673 -> 0.048870: 1.15x slower
    Avg: 0.043726 -> 0.050474: 1.15x slower
    Significant (t=-8.74)
    Stddev: 0.00283 -> 0.00467: 1.6513x larger

    ### regex_v8 ###
    Min: 0.044029 -> 0.049445: 1.12x slower
    Avg: 0.044564 -> 0.049971: 1.12x slower
    Significant (t=-13.97)
    Stddev: 0.00175 -> 0.00211: 1.2073x larger

    ### regex_v8 ###
    Min: 0.042692 -> 0.049084: 1.15x slower
    Avg: 0.044295 -> 0.050725: 1.15x slower
    Significant (t=-7.00)
    Stddev: 0.00421 -> 0.00494: 1.1745x larger
    ---------------

I only care about the "Min"; IMHO it's the most interesting information here.

The slowdown is between 12% and 20%, which is a big difference for me.

It looks like some benchmarks have very short iterations compared to others. For example, bm_json_v2 takes around 3 seconds, whereas bm_regex_v8 takes less than 0.050 seconds (50 ms).

    $ python3 performance/bm_json_v2.py -n 3 --timer perf_counter
    3.310384973010514
    3.3116717970115133
    3.3077902760123834
    
    $ python3 performance/bm_regex_v8.py -n 3 --timer perf_counter
    0.0670697659952566
    0.04515827298746444
    0.045114840992027894

Do you think that bm_regex_v8 is reliable? I see that there is an "iteration scaling" setting to run the benchmarks with more iterations. Maybe we can start by increasing the "iteration scaling" for bm_regex_v8?

Instead of a fixed number of iterations, we should redesign the benchmarks to use time. For example, one iteration should take at least 100 ms and no more than 1 second (or longer, to get more reliable results). The benchmark is then responsible for adjusting its internal parameters.

    I used this design for my "benchmark.py" script which is written to get "reliable" microbenchmarks:
    https://bitbucket.org/haypo/misc/src/tip/python/benchmark.py?fileviewer=file-view-default

The script is based on time and calibrates each benchmark. It also uses the *effective* resolution of the clock used by the benchmark for the calibration.
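
As an illustration of the "effective resolution" idea, here is a minimal sketch (not the actual benchmark.py code; the function name is mine): measure the smallest non-zero step the clock can actually report, then scale the benchmark so that one measurement is far above it.

import time

def effective_resolution(timer=time.perf_counter, samples=1000):
    # Smallest non-zero difference observed between two consecutive
    # readings; this can be much coarser than the nominal resolution.
    smallest = float("inf")
    for _ in range(samples):
        t1 = timer()
        t2 = timer()
        while t2 == t1:          # spin until the clock actually ticks
            t2 = timer()
        smallest = min(smallest, t2 - t1)
    return smallest

# A benchmark can then scale its inner loop so that one measurement takes,
# say, at least 100x this value (and at least 100 ms) to keep timer noise small.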

I may work on such a patch, but it would be good to know your opinion on such a change first.

    I guess that we should use the base python to calibrate the benchmark and then pass the same parameters to the modified python.

@vstinner vstinner added the "performance" (Performance or resource usage) label on Feb 3, 2016
vstinner commented Feb 3, 2016

    Well, it was simpler than I expected to implement calibration. Here is a PoC for regex_v8 and json_dump_v2.

json_dump_v2 takes 3 seconds per run, but it already uses 100 loop iterations internally. I divided the number of iterations by 100 in my patch.

@vstinner vstinner changed the title from "perf.py: bm_regex_v8 doesn't seem reliable even with --rigorous" to "perf.py: calibrate benchmarks using time, not using a fixed number of iterations" on Feb 3, 2016
1st1 commented Feb 3, 2016

    Here's a very interesting table from Zach Byrne: http://bugs.python.org/issue21955#msg259490

    It shows that some benchmarks are indeed very unstable. This also correlates with my own experience.

    These ones are very unstable: pickle_dict, nbody, regex_v8, etree_parse, telco.

serhiy-storchaka commented:

If the running time is close to the limit, different runs will use different numbers of repetitions. This will add additional instability. I prefer a stable number of repetitions, manually calibrated for an average modern computer.

vstinner commented Feb 3, 2016

If the running time is close to the limit, different runs will use different numbers of repetitions. This will add additional instability.

Maybe; to be honest, I don't know. How can we decide whether a patch makes
perf.py more or less stable?

I prefer a stable number of repetitions, manually calibrated for an average modern computer.

    The problem is to define an "average modern computer".

florinpapa commented Feb 4, 2016

    I have tested the patch and it does not seem to solve the stability
    problem.

    With patch:
    python performance/bm_json_v2.py -n 10 --timer perf_counter
    0.02829067991115153
    0.027778947027400136
    0.027756798081099987
    0.027362094027921557
    0.027256828034296632
    0.02766443998552859
    0.027390124974772334
    0.027398303849622607
    0.02731629996560514
    0.02743363007903099

    Without patch:
    python performance/bm_json_v2.py -n 10 --timer perf_counter
    2.732215315802023
    2.7394272631499916
    2.731212269049138
    2.7511520159896463
    2.739834441104904
    2.754247901029885
    2.734024364966899
    2.7447318979538977
    2.744731454877183
    2.7513630259782076

    Instead, I notice that, especially for the first runs, the measured
    values tend to be higher. This is expected, since the cache has not
    been populated yet, but this can also happen after a context switch.
    Therefore, we could eliminate these "odd" values by calculating the
    median value of all the runs and only keeping measurements within 1%
    of that median value.

    What do you think?
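
For illustration, a minimal sketch of that filtering (the function name and the 1% tolerance follow the proposal above; this is not existing perf.py code):

import statistics

def filter_outliers(timings, tolerance=0.01):
    # Drop measurements more than `tolerance` (1%) away from the median,
    # e.g. runs inflated by cold caches or context switches.
    median = statistics.median(timings)
    return [t for t in timings if abs(t - median) <= tolerance * median]

# filter_outliers([2.73, 2.74, 2.73, 3.10]) keeps only the first three values.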

vstinner commented Feb 4, 2016

    With patch:
    python performance/bm_json_v2.py -n 10 --timer perf_counter

Oh. There is a bit of confusion here. You must *not* run the bm_xxx.py scripts directly. The calibration is done in perf.py.

    Try for example:
    python perf.py python2 python3 -b json_dump_v2

You should see something like:
    Calibrating => num_runs=10, num_loops=16 (0.50 sec < 0.55 sec)

Maybe I should share the calibration code to also compute the number of iterations when a bm_xxx.py script is run directly? But the risk is that someone compares two runs of bm_xxx.py using two Python binaries and sees different results just because the number of calibrated loops is different...
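
For context, the calibration step shown above ("Calibrating => num_runs=10, num_loops=16") works roughly like this (a simplified sketch, not the actual perf.py code): the number of loops passed via -l is doubled until one run of the benchmark script exceeds min_time, and the calibrated -n/-l values are then passed unchanged to both interpreters.

import subprocess
import time

def calibrate_loops(python, bench_script, min_time=0.5):
    # Simplified sketch: double the -l value (loops per run) until a single
    # run of the benchmark script takes at least min_time seconds.
    # (The real driver reads the timing reported by the child rather than
    # wall-clock time, which includes interpreter startup.)
    loops = 1
    while True:
        start = time.perf_counter()
        subprocess.check_call([python, bench_script, "-n", "1", "-l", str(loops)])
        elapsed = time.perf_counter() - start
        if elapsed >= min_time:
            return loops
        loops *= 2

# Both interpreters are then run with the *same* calibrated parameters,
# e.g. "-n 10 -l 16", so their timings remain directly comparable.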

skrah commented Feb 4, 2016

    In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary:

    wget http://www.bytereef.org/software/mpdecimal/benchmarks/telco.py
    wget http://speleotrove.com/decimal/expon180-1e6b.zip
    unzip expon180-1e6b.zip

    taskset -c 0 ./python telco.py full

    $ taskset -c 0 ./python telco.py full

    Control totals:
    Actual ['1004737.58', '57628.30', '25042.17']
    Expected ['1004737.58', '57628.30', '25042.17']
    Elapsed time: 7.16255
    $ taskset -c 0 ./python telco.py full

    Control totals:
    Actual ['1004737.58', '57628.30', '25042.17']
    Expected ['1004737.58', '57628.30', '25042.17']
    Elapsed time: 6.982884
    $ taskset -c 0 ./python telco.py full

    Control totals:
    Actual ['1004737.58', '57628.30', '25042.17']
    Expected ['1004737.58', '57628.30', '25042.17']
    Elapsed time: 7.095349000000001
    $ taskset -c 0 ./python telco.py full

    Control totals:
    Actual ['1004737.58', '57628.30', '25042.17']
    Expected ['1004737.58', '57628.30', '25042.17']

skrah commented Feb 4, 2016

    I've cut off the highest result in the previous message:

    Control totals:
    Actual ['1004737.58', '57628.30', '25042.17']
    Expected ['1004737.58', '57628.30', '25042.17']
    Elapsed time: 7.304339
    $ taskset -c 0 ./python telco.py full

vstinner commented Feb 4, 2016

    taskset -c 0 ./python telco.py full

Did you see that I just merged Florin's patch to add an --affinity parameter to perf.py? :-)

You may isolate some CPU cores using the isolcpus=xxx kernel command line parameter. I don't think that core #0 is the best choice; the system may prefer it for its own work.

It would be nice to collect "tricks" to get the most reliable benchmark results. Maybe in the perf.py README? Or a wiki page? Whatever :-)

vstinner commented Feb 4, 2016

For an older project (Fusil the fuzzer), I wrote a short function to sleep until the system load is lower than a threshold. I had to write such a function to reduce the noise when the system is heavily loaded. I wrote it to be able to run a very long task (it takes at least 1 hour, but may run for multiple days!) on my desktop and continue to use my desktop for various other tasks.

    On Linux, we can use the "cpu xxx xxx xxx ..." line of /proc/stat to get the system load.

    My code to read the system load:
    https://bitbucket.org/haypo/fusil/src/32ddc281219cd90c1ad12a3ee4ce212bea1c3e0f/fusil/linux/cpu_load.py?at=default&fileviewer=file-view-default#cpu_load.py-51

    My code to wait until the system load is lower than a threshold:
    https://bitbucket.org/haypo/fusil/src/32ddc281219cd90c1ad12a3ee4ce212bea1c3e0f/fusil/system_calm.py?at=default&fileviewer=file-view-default#system_calm.py-5
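
For reference, the idea boils down to something like this minimal sketch (not the Fusil code itself; the names are mine): compute the busy fraction from two readings of the aggregate "cpu" line in /proc/stat, and sleep until it drops below a threshold.

import time

def cpu_busy_fraction(interval=1.0):
    # Fraction of CPU time spent outside idle/iowait between two readings
    # of the first ("cpu ...") line of /proc/stat.
    def read():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[3] + fields[4], sum(fields)   # (idle + iowait, total)
    idle1, total1 = read()
    time.sleep(interval)
    idle2, total2 = read()
    busy = (total2 - total1) - (idle2 - idle1)
    return busy / (total2 - total1)

def wait_for_calm_system(max_busy=0.10):
    # Sleep until the system is mostly idle before starting a benchmark.
    while cpu_busy_fraction() > max_busy:
        time.sleep(1.0)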

    --

I also wrote a script to do the opposite :-) A script that stresses the system so that the load stays at or above a minimum value:

    https://bitbucket.org/haypo/misc/src/3fd3993413b128b37e945690865ea2c5ef48c446/bin/system_load.py?at=default&fileviewer=file-view-default

This script helped me reproduce sporadic failures like timeouts which only occur when the system is highly loaded.

skrah commented Feb 4, 2016

    Core 1 fluctuates even more (My machine only has 2 cores):

    $ taskset -c 1 ./python telco.py full

    Control totals:
    Actual ['1004737.58', '57628.30', '25042.17']
    Expected ['1004737.58', '57628.30', '25042.17']
    Elapsed time: 6.783009
    Control totals:
    Actual ['1004737.58', '57628.30', '25042.17']
    Expected ['1004737.58', '57628.30', '25042.17']
    Elapsed time: 7.335563
    $ taskset -c 1 ./python telco.py full

    I have some of the same concerns as Serhiy. There's a lot of statistics going on in the benchmark suite -- is it really possible to separate that cleanly from the actual runtime of the benchmarks?

florinpapa commented Feb 4, 2016

I ran perf.py with calibration and there is no difference in stability
compared to the unpatched version.

    With patch:

    python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
    INFO:root:Automatically selected timer: perf_counter
    [1/1] json_dump_v2...
    Calibrating
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 1 --timer perf_counter
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 2 --timer perf_counter
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 4 --timer perf_counter
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 8 --timer perf_counter
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 16 --timer perf_counter
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 1 -l 32 --timer perf_counter
    Calibrating => num_runs=10, num_loops=32 (0.50 sec < 0.87 sec)
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 10 -l 32 --timer perf_counter
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 10 -l 32 --timer perf_counter

    Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
    Total CPU cores: 18

    ### json_dump_v2 ###
    Min: 0.877497 -> 0.886482: 1.01x slower <----------
    Avg: 0.878150 -> 0.888351: 1.01x slower
    Not significant
    Stddev: 0.00054 -> 0.00106: 1.9481x larger

    Without patch:

    python perf.py -b json_dump_v2 -v --csv=out1.csv --affinity=2 ../cpython/python ../cpython/python
    INFO:root:Automatically selected timer: perf_counter
    [1/1] json_dump_v2...
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 50 --timer perf_counter
    INFO:root:Running taskset --cpu-list 2 ../cpython/python performance/bm_json_v2.py -n 50 --timer perf_counter

    Report on Linux centos 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64
    Total CPU cores: 18

    ### json_dump_v2 ###
    Min: 2.755514 -> 2.764131: 1.00x slower <---------- (almost) same as above
    Avg: 2.766546 -> 2.775587: 1.00x slower
    Not significant
    Stddev: 0.00538 -> 0.00382: 1.4069x smaller

vstinner commented Feb 4, 2016

    Stefan: "In my experience it is very hard to get stable benchmark results with Python. Even long running benchmarks on an empty machine vary: (...)"

tl;dr We *can* tune the Linux kernel to avoid most of the system noise when running benchmarks.

    I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.

    I also added the following lines to check the CPU affinity and the number of context switches:

        os.system("grep -E -i 'cpu|ctx' /proc/%s/status" % os.getpid())

    Well, see attached telco_haypo.py for the full script.

I used my system_load.py script to get a system load >= 5.0. Without taskset, the benchmark result changes completely: at least 5 seconds. Well, it's not really surprising; it's well known that benchmarks depend on the system load.

*BUT* I have a great kernel called Linux which has cool features called "CPU isolation" and "no HZ" (tickless kernel). On my Fedora 23, the kernel is compiled with CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y.

haypo@smithers$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    5900,0000 1600,0000
1   0    0      1    1:1:1:0       yes    5900,0000 1600,0000
2   0    0      2    2:2:2:0       yes    5900,0000 1600,0000
3   0    0      3    3:3:3:0       yes    5900,0000 1600,0000
4   0    0      0    0:0:0:0       yes    5900,0000 1600,0000
5   0    0      1    1:1:1:0       yes    5900,0000 1600,0000
6   0    0      2    2:2:2:0       yes    5900,0000 1600,0000
7   0    0      3    3:3:3:0       yes    5900,0000 1600,0000

My CPU has a single socket and 4 physical cores, but Linux sees 8 logical cores because of hyper-threading.

I modified the Linux kernel command line in GRUB at boot to add: isolcpus=2,3,6,7 nohz_full=2,3,6,7. Then I forced the CPU frequency governor to "performance" to avoid hiccups:

# for id in 2 3 6 7; do echo performance > /sys/devices/system/cpu/cpu$id/cpufreq/scaling_governor; done

    Check the config with:

    $ cat /sys/devices/system/cpu/isolated
    2-3,6-7
    $ cat /sys/devices/system/cpu/nohz_full
    2-3,6-7
    $ cat /sys/devices/system/cpu/cpu[2367]/cpufreq/scaling_governor
    performance
    performance
    performance
    performance

Ok, now with this kernel config but still without taskset, on an idle system:
    -----------------------
    Elapsed time: 2.660088424000037
    Elapsed time: 2.5927538629999844
    Elapsed time: 2.6135682369999813
    Elapsed time: 2.5819260570000324
    Elapsed time: 2.5991294099999322

    Cpus_allowed: 33
    Cpus_allowed_list: 0-1,4-5
    voluntary_ctxt_switches: 1
    nonvoluntary_ctxt_switches: 21
    -----------------------

    With system load >= 5.0:
    -----------------------
    Elapsed time: 5.3484489170000415
    Elapsed time: 5.336797472999933
    Elapsed time: 5.187413687999992
    Elapsed time: 5.24122020599998
    Elapsed time: 5.10201246400004

    Cpus_allowed_list: 0-1,4-5
    voluntary_ctxt_switches: 1
    nonvoluntary_ctxt_switches: 1597
    -----------------------

    And *NOW* using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:
    -----------------------

    $ taskset -c 2,3,6,7 python3 telco_haypo.py full 

    Elapsed time: 2.579487486000062
    Elapsed time: 2.5827961039999536
    Elapsed time: 2.5811954810001225
    Elapsed time: 2.5782033600000887
    Elapsed time: 2.572370636999949

    Cpus_allowed: cc
    Cpus_allowed_list: 2-3,6-7
    voluntary_ctxt_switches: 2
    nonvoluntary_ctxt_switches: 16
    -----------------------

The numbers look *more* stable than the numbers of the first test without taskset on an idle system! You can see that the number of context switches is very low (18 in total).

    Example of a second run:
    -----------------------
    haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full

    Elapsed time: 2.538398498999868
    Elapsed time: 2.544711968999991
    Elapsed time: 2.5323677339999904
    Elapsed time: 2.536252647000083
    Elapsed time: 2.525748182999905

    Cpus_allowed: cc
    Cpus_allowed_list: 2-3,6-7
    voluntary_ctxt_switches: 2
    nonvoluntary_ctxt_switches: 15
    -----------------------

    Third run:
    -----------------------
    haypo@smithers$ taskset -c 2,3,6,7 python3 telco_haypo.py full

    Elapsed time: 2.5819172930000605
    Elapsed time: 2.5783024259999365
    Elapsed time: 2.578493587999901
    Elapsed time: 2.5774198510000588
    Elapsed time: 2.5772148999999445

    Cpus_allowed: cc
    Cpus_allowed_list: 2-3,6-7
    voluntary_ctxt_switches: 2
    nonvoluntary_ctxt_switches: 15
    -----------------------

Well, it's not perfect, but it looks much more stable than the timings without a specific kernel config or CPU pinning.

Statistics on the 15 timings of the 3 runs with tuning on a heavily loaded system:

    >>> times
    [2.579487486000062, 2.5827961039999536, 2.5811954810001225, 2.5782033600000887, 2.572370636999949, 2.538398498999868, 2.544711968999991, 2.5323677339999904, 2.536252647000083, 2.525748182999905, 2.5819172930000605, 2.5783024259999365, 2.578493587999901, 2.5774198510000588, 2.5772148999999445]
    >>> statistics.mean(times)
    2.564325343866661
    >>> statistics.pvariance(times)
    0.0004340411190965491
    >>> statistics.stdev(times)
    0.021564880156747315

Compare it to the timings without tuning on an idle system:

    >>> times
    [2.660088424000037, 2.5927538629999844, 2.6135682369999813, 2.5819260570000324, 2.5991294099999322]
    >>> statistics.mean(times)
    2.6094931981999934
    >>> statistics.pvariance(times)
    0.0007448087075422725
    >>> statistics.stdev(times)
    0.030512470965620608

    We get (no tuning, idle system => tuning, busy system):

    • Population variance: 0.00074 => 0.00043
    • Standard deviation: 0.031 => 0.022

It looks *much* better, no? And I only used *5* timings for the benchmark without tuning, whereas I used 15 timings for the benchmark with tuning; I would expect a larger variance and deviation with more timings.

    --

Just for fun, I ran the benchmark 3 times (to get 3x5 timings) on an idle system with tuning:

    >>> times
    [2.542378394000025, 2.5541740109999864, 2.5456488329998592, 2.54730951800002, 2.5495472409998, 2.56374302800009, 2.5737907220000125, 2.581463170999996, 2.578222832999927, 2.574441839999963, 2.569389365999996, 2.5792129209999075, 2.5689420860001064, 2.5681367900001533, 2.5563378829999692]
    >>> import statistics
    >>> statistics.mean(times)
    2.563515909133321
    >>> statistics.pvariance(times)
    0.00016384530912002678
    >>> statistics.stdev(times)
    0.013249473404092065

As expected, it's even better (no tuning, idle system => tuning, busy system => tuning, idle system):

    • Population variance: 0.00074 => 0.00043 => 0.00016
    • Standard deviation: 0.031 => 0.022 => 0.013

skrah commented Feb 4, 2016

    STINNER Victor added the comment:
    I modified Stefan's telco.py to remove all I/O from the hot code: the benchmark is now really CPU-bound. I also modified telco.py to run the benchmark 5 times. One run takes around 2.6 seconds.

Nice. telco.py is an ad-hoc script from the original decimal.py sandbox;
I missed that it called "infil.read(8)". :)

    And *NOW* using my isolated CPU physical cores #2 and #3 (Linux CPUs 2, 3, 6 and 7), still on the heavily loaded system:
    -----------------------
    $ taskset -c 2,3,6,7 python3 telco_haypo.py full

    Elapsed time: 2.579487486000062
    Elapsed time: 2.5827961039999536
    Elapsed time: 2.5811954810001225
    Elapsed time: 2.5782033600000887
    Elapsed time: 2.572370636999949

    Great. I'll try that out in the weekend.

pitrou commented Feb 4, 2016

    Victor, this is a very interesting write-up, thank you.

vstinner commented Feb 4, 2016

    Florin Papa added the comment:

I ran perf.py with calibration and there is no difference in stability compared to the unpatched version.

Sorry, what are you calling "stability"? For me, stability means that
you run the same benchmark 3, 5 or 10 times, and the results must be
as close as possible: see the variance and standard deviation in my
previous message.

I'm not talking about the variance/deviation of the N runs of the bm_xxx.py
scripts, but about the variance/deviation of the mean value displayed by
perf.py.

    perf_calibration.patch is a proof-of-concept. I changed the number of
    runs from 50 to 10 to test my patch more easily. You should modify the
    patch to keep 50 runs if you want to compare the stability.

By the way, the --fast/--rigorous options should not only change the
minimum duration of a single run used to calibrate the number of loops;
they should also change the "maximum" duration of perf.py by using a
different number of runs.

florinpapa commented Feb 4, 2016

    I was also talking about the variance/deviation of the mean value
    displayed by perf.py, sorry if I was unclear. The perf.py output in my
    previous message showed little difference between the patched and
    non-patched version. I will also try increasing the number of
    runs to see if there is any change.

    The CPU isolation feature is a great finding, thank you.

vstinner commented Feb 4, 2016

tl;dr I'm disappointed. According to the statistics module, running the bm_regex_v8.py benchmark more times with more iterations makes the benchmark more unstable... I expected the opposite...

    Patch version 2:

• also patch performance/bm_pickle.py
• change min_time from 100 ms to 500 ms with --fast
• compute the number of runs using a maximum time; the maximum time changes with --fast and --rigorous

+    if options.rigorous:
+        min_time = 1.0
+        max_time = 100.0  # 100 runs
+    elif options.fast:
+        min_time = 0.5
+        max_time = 25.0   # 50 runs
+    else:
+        min_time = 0.5
+        max_time = 50.0   # 100 runs
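
Reading the "# 100 runs" comments above, the run count presumably follows from the two limits (a guess at the intent, not the actual patch code):

    # hypothetical reconstruction: 25.0 / 0.5 = 50 runs for --fast,
    # 50.0 / 0.5 = 100.0 / 1.0 = 100 runs otherwise
    num_runs = int(max_time / min_time)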

To measure the stability of perf.py, I pinned perf.py to CPU cores which are isolated from the rest of the system using the Linux "isolcpus" kernel parameter. I also forced the CPU frequency governor to "performance" and enabled "no HZ full" on these cores.

    I ran perf.py 5 times on regex_v8.

    Calibration (original => patched):

    • --fast: 1 iteration x 5 runs => 16 iterations x 50 runs
    • (no option): 1 iteration x 50 runs => 16 iterations x 100 runs

    Approximated duration of the benchmark (original => patched):

    • --fast: 7 sec => 7 min 34 sec
    • (no option): 30 sec => 14 min 35 sec

    (I made a mistake, so I was unable to get the exact duration.)

Hmm, maybe the timings are not well chosen, because the benchmark becomes really slow (minutes vs seconds) :-/

    Standard deviation, --fast:

    • (python2) 0.00071 (1.2%, mean=0.05961) => 0.01059 (1.1%, mean=0.96723)
    • (python3) 0.00068 (1.5%, mean=0.04494) => 0.05925 (8.0%, mean=0.74248)
    • (faster) 0.02986 (2.2%, mean=1.32750) => 0.09083 (6.9%, mean=1.31000)

    Standard deviation, (no option):

    • (python2) 0.00072 (1.2%, mean=0.05957) => 0.00874 (0.9%, mean=0.97028)
    • (python3) 0.00053 (1.2%, mean=0.04477) => 0.00966 (1.3%, mean=0.72680)
    • (faster) 0.02739 (2.1%, mean=1.33000) => 0.02608 (2.0%, mean=1.33600)

    Variance, --fast:

    • (python2) 0.00000 (0.001%, mean=0.05961) => 0.00009 (0.009%, mean=0.96723)
    • (python3) 0.00000 (0.001%, mean=0.04494) => 0.00281 (0.378%, mean=0.74248)
    • (faster) 0.00067 (0.050%, mean=1.32750) => 0.00660 (0.504%, mean=1.31000)

    Variance, (no option):

    • (python2) 0.00000 (0.001%, mean=0.05957) => 0.00006 (0.006%, mean=0.97028)
    • (python3) 0.00000 (0.001%, mean=0.04477) => 0.00007 (0.010%, mean=0.72680)
    • (faster) 0.00060 (0.045%, mean=1.33000) => 0.00054 (0.041%, mean=1.33600)

    Legend:

• (python2) are timings of python2 run by perf.py (from the "Min" line)
• (python3) are timings of python3 run by perf.py (from the "Min" line)
• (faster) are the "1.34x" numbers from "faster" or "slower" on the "Min" line
    • percentages are: value * 100 / mean

It's not easy to compare these values since the number of iterations is very different (1 => 16), and so the timings are very different (e.g. 0.059 sec => 0.950 sec). I guess that it's OK to compare percentages.

    I used the stability.py script, attached to this issue, to compute deviation and variance from the "Min" line of the 5 runs. The script takes the output of perf.py as input.

    I'm not sure that 5 runs are enough to compute statistics.

    --

    Raw data.

    Original perf.py.

    $ grep ^Min original.fast 
    Min: 0.059236 -> 0.045948: 1.29x faster
    Min: 0.059005 -> 0.044654: 1.32x faster
    Min: 0.059601 -> 0.044547: 1.34x faster
    Min: 0.060605 -> 0.044600: 1.36x faster
    
    $ grep ^Min original
    Min: 0.060479 -> 0.044762: 1.35x faster
    Min: 0.059002 -> 0.045689: 1.29x faster
    Min: 0.058991 -> 0.044587: 1.32x faster
    Min: 0.060231 -> 0.044364: 1.36x faster
    Min: 0.059165 -> 0.044464: 1.33x faster

    Patched perf.py.

    $ grep ^Min patched.fast 
    Min: 0.950717 -> 0.711018: 1.34x faster
    Min: 0.968413 -> 0.730810: 1.33x faster
    Min: 0.976092 -> 0.847388: 1.15x faster
    Min: 0.964355 -> 0.711083: 1.36x faster
    Min: 0.976573 -> 0.712081: 1.37x faster
    
    $ grep ^Min patched
    Min: 0.968810 -> 0.729109: 1.33x faster
    Min: 0.973615 -> 0.731308: 1.33x faster
    Min: 0.974215 -> 0.734259: 1.33x faster
    Min: 0.978781 -> 0.709915: 1.38x faster
    Min: 0.955977 -> 0.729387: 1.31x faster
    
    $ grep ^Calibration patched.fast 
    Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.4 sec)
    Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.3 sec)
    Calibration: num_runs=50, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 37.4 sec)
    Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.6 sec)
    Calibration: num_runs=50, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 36.7 sec)
    
    $ grep ^Calibration patched
    Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.0 sec)
    Calibration: num_runs=100, num_loops=16 (0.75 sec per run > min_time 0.50 sec, estimated total: 75.3 sec)
    Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 73.2 sec)
    Calibration: num_runs=100, num_loops=16 (0.74 sec per run > min_time 0.50 sec, estimated total: 73.7 sec)
    Calibration: num_runs=100, num_loops=16 (0.73 sec per run > min_time 0.50 sec, estimated total: 72.9 sec)

brettcannon commented:

What would happen if we shifted to counting the number of executions within a set amount of time instead of how fast a single execution occurred? I believe some JavaScript benchmarks started to do this about a decade ago when they realized CPUs were getting so fast that older benchmarks were completing too quickly to be reliably measured. This would also give a very strong notion of how long a benchmark run will take, based on the number of iterations and the time bucket a benchmark is placed in (e.g., a second for microbenchmarks, a larger threshold for longer-running benchmarks). And it won't hurt benchmark comparisons, since we have always done relative comparisons rather than absolute ones.
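
A minimal sketch of that approach (illustrative only; the names are mine): run the workload repeatedly until a fixed time budget is exhausted and report the number of completed executions, so a faster interpreter simply yields a higher count while the total run time stays bounded.

import time

def executions_per_budget(func, budget=1.0, timer=time.perf_counter):
    # Count how many times `func` completes within `budget` seconds.
    count = 0
    deadline = timer() + budget
    while timer() < deadline:
        func()
        count += 1
    return count

# Comparisons stay relative: report the counts (or counts per second) for the
# two interpreters and compare their ratio, just like the current "1.16x" output.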

vstinner commented:

This issue was fixed in the new flavor of the benchmark suite, the new performance project:
    https://github.com/python/performance

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022