
pybench and test.pystone poorly documented #59574

Closed
florentx mannequin opened this issue Jul 16, 2012 · 16 comments
Labels
3.7 (EOL) end of life docs Documentation in the Doc dir performance Performance or resource usage type-bug An unexpected behavior, bug, or error

Comments

@florentx (mannequin) commented Jul 16, 2012

BPO 15369
Nosy @malemburg, @pitrou, @vstinner, @florentx
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2016-10-18.15:00:16.343>
    created_at = <Date 2012-07-16.13:41:44.969>
    labels = ['type-bug', '3.7', 'docs', 'performance']
    title = 'pybench and test.pystone poorly documented'
    updated_at = <Date 2017-03-31.16:36:20.104>
    user = 'https://github.com/florentx'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:20.104>
    actor = 'dstufft'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2016-10-18.15:00:16.343>
    closer = 'vstinner'
    components = ['Documentation', 'Benchmarks']
    creation = <Date 2012-07-16.13:41:44.969>
    creator = 'flox'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 15369
    keywords = []
    message_count = 16.0
    messages = ['165603', '165717', '165719', '165721', '165724', '181218', '276227', '276228', '276231', '276332', '276439', '276536', '276537', '276538', '276540', '278888']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'pitrou', 'vstinner', 'flox', 'docs@python', 'python-dev']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue15369'
    versions = ['Python 3.7']

@florentx (mannequin, author) commented Jul 16, 2012

The benchmarking tools "pystone" and "pybench", which are shipped with the Python standard distribution, are not documented.

The only information is in the "What's New in Python 2.5" document:
http://docs.python.org/dev/whatsnew/2.5.html?highlight=pybench#new-improved-and-removed-modules

IMHO, they should be mentioned somewhere in the HOWTOs, the FAQ or the standard library documentation ("Development Tools" or "Debugging and Profiling").

    @florentx florentx mannequin assigned docspython Jul 16, 2012
    @florentx florentx mannequin added docs Documentation in the Doc dir performance Performance or resource usage type-bug An unexpected behavior, bug, or error labels Jul 16, 2012
@brettcannon (Member)

    I disagree. They are outdated benchmarks and probably should either be removed or left undocumented. Proper testing of performance is with the Unladen Swallow benchmarks.

@malemburg (Member)

Brett Cannon wrote:

> I disagree. They are outdated benchmarks and probably should either be removed or left undocumented. Proper testing of performance is with the Unladen Swallow benchmarks.

I disagree with your statement. Just like every benchmark, they serve their purpose in their particular field of use. For example, pybench may not be useful for the JIT approach originally taken by the Unladen Swallow project, but it's still useful to test/check changes in the non-JIT CPython interpreter, and it's extensible to take new developments into account. pystone is useful to get a quick feel for the performance of Python on a machine.

@florentx (mannequin, author) commented Jul 17, 2012

Actually, I discovered "python -m test.pystone" during Mike Müller's talk at EuroPython: http://is.gd/fasterpy
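
For reference, a minimal sketch of both ways to run it, assuming the historical test.pystone API of CPython 3.6 and earlier (the exact pystones() signature is an assumption from memory):

    # Command-line usage (CPython <= 3.6, before the module was removed):
    #   python -m test.pystone
    #
    # Programmatic usage, assuming the historical test.pystone API:
    from test import pystone

    # pystones() runs the benchmark and returns (benchtime, pystones_per_second).
    benchtime, stones = pystone.pystones(loops=50000)
    print("Pystone time: %.2fs, %.0f pystones/second" % (benchtime, stones))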

    Even if they are suboptimal for true benchmarks, they should probably be mentioned somewhere.
    In the same paragraph, there should be a link to the "Grand Unified Python Benchmark Suite" as best practice:

    http://hg.python.org/benchmarks
    http://hg.python.org/benchmarks/file/tip
    http://hg.python.org/benchmarks/file/tip/README.txt

    The last paragraph of this wiki page might be reworded and included in the Python documentation:
    http://code.google.com/p/unladen-swallow/wiki/Benchmarks
    http://code.google.com/p/unladen-swallow/wiki/Benchmarks#Benchmarks_we_don't_use

BTW, there's also this website, which doesn't seem to be updated anymore…
    http://speed.python.org/

@brettcannon (Member)

The Unladen Swallow benchmarks are in no way specific to JITs; they are a thorough set of benchmarks for measuring the overall performance of a Python VM.

    As for speed.python.org, we know that it is currently not being updated as we are waiting for people to have the time to move it forward and replace speed.pypy.org for all Python VMs.

@pitrou (Member) commented Feb 2, 2013

    I don't really think they deserve documenting.

pystones can arguably be a cheap and easy way of comparing the performance of different systems *using the exact same Python interpreter*. That's the only point of running pystones.

As for pybench, it probably had a point when there wasn't anything better, but I don't think it does anymore. We have a much better benchmark suite right now, and we also have a couple of specialized benchmarks in the Tools directory.

@python-dev (mannequin) commented Sep 13, 2016

    New changeset 08a0b75904c6 by Victor Stinner in branch 'default':
    Remove pybench microbenchmark
    https://hg.python.org/cpython/rev/08a0b75904c6

@python-dev (mannequin) commented Sep 13, 2016

    New changeset e03c1b6830fd by Victor Stinner in branch 'default':
    Remove pystone microbenchmark
    https://hg.python.org/cpython/rev/e03c1b6830fd

@vstinner (Member)

    We now have a good and stable benchmark suite: https://github.com/python/performance

I removed pystone and pybench from Python 3.7. Please use performance instead of old and unreliable microbenchmarks like pybench or pystone.

    @vstinner vstinner added the 3.7 (EOL) end of life label Sep 13, 2016
@malemburg (Member)

    Please add notes to the Tools/README pointing users to the performance suite.

    I'd also like to request that you reword this dismissive line in the performance package's readme:

    """
    pybench - run the standard Python PyBench benchmark suite. This is considered an unreliable, unrepresentative benchmark; do not base decisions off it. It is included only for completeness.
    """

I suppose this was taken from the Unladen Swallow list of benchmarks and completely misses the point of what pybench is all about: it's a benchmark suite for running performance tests on individual parts of CPython's VM implementation. It was never intended to be representative. The main purpose is to be able to tell whether an optimization in CPython has an impact on individual areas of the interpreter or not.

    Thanks.

@vstinner (Member)

> I'd also like to request that you reword this dismissive line in the performance package's readme: (...)

Please report issues with the performance module on its own bug tracker:
https://github.com/python/performance

Can you please propose a new description? You might even create a pull request ;-)

Note: I'm not sure that we should keep pybench; this benchmark really looks unreliable. But I should still try at least to use the same number of iterations for all worker child processes. Currently the calibration is done in each child process.

@malemburg (Member)

On 14.09.2016 15:20, STINNER Victor wrote:

> > I'd also like to request that you reword this dismissive line in the performance package's readme: (...)
>
> Please report issues with the performance module on its own bug tracker:
> https://github.com/python/performance
>
> Can you please propose a new description? You might even create a pull request ;-)

I'll send a PR.

> Note: I'm not sure that we should keep pybench; this benchmark really looks unreliable. But I should still try at least to use the same number of iterations for all worker child processes. Currently the calibration is done in each child process.

Well, pybench is not just one benchmark, it's a whole collection of benchmarks for various different aspects of the CPython VM, and by design it tries to calibrate itself per benchmark, since each benchmark has different overhead.

The number of iterations per benchmark will not change between runs, since this number is fixed in each benchmark. These numbers do need an update, though, since at the time pybench was written, CPUs were a lot less powerful compared to today.

    Here's the comment with the guideline for the number of rounds
    to use per benchmark:

        # Number of rounds to execute per test run. This should be
        # adjusted to a figure that results in a test run-time of between
        # 1-2 seconds.
        rounds = 100000

BTW: Why would you want to run benchmarks in child processes and in parallel? This will usually dramatically affect the results of the benchmark runs. Ideally, the pybench process should be the only CPU-intensive workload on the entire machine to get reasonable results.

@vstinner (Member)

    Hum, since the discussion restarted, I reopen the issue ...

    "Well, pybench is not just one benchmark, it's a whole collection of benchmarks for various different aspects of the CPython VM and per concept it tries to calibrate itself per benchmark, since each benchmark has different overhead."

In the performance module, you now get individual timings for each pybench benchmark, instead of an overall total, which was less useful.

    "The number of iterations per benchmark will not change between runs, since this number is fixed in each benchmark."

Please take a look at the new performance module; it has a different design. Calibration is based on minimum time per sample, no longer on hardcoded values. I modified all benchmarks, not only pybench.

    "BTW: Why would you want to run benchmarks in child processes and in parallel ?"

    Child processes are run sequentially.

Running benchmarks in multiple processes helps to get more reliable results. Read my article if you want to learn more about the design of my perf module:
http://haypo-notes.readthedocs.io/microbenchmark.html#my-articles
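
As an illustration of that design, a minimal perf-style benchmark script might look roughly like this (assuming the perf 0.8-era Runner API; the exact method names are an assumption based on the later pyperf package):

    # Hypothetical perf-based microbenchmark script.
    # Runner() re-executes this script in several worker child processes
    # (sequentially), collects samples from each, and aggregates them.
    import perf

    def busy_loop():
        # Trivial workload to time; any callable works here.
        for _ in range(1000):
            pass

    runner = perf.Runner()
    # bench_func() calibrates the number of outer loops automatically
    # and records one benchmark result built from many samples.
    runner.bench_func('busy_loop', busy_loop)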

    "Ideally, the pybench process should be the only CPU intense work load on the entire CPU to get reasonable results."

The perf module automatically uses isolated CPUs. It strongly suggests using this amazing Linux feature to run benchmarks!
    https://haypo.github.io/journey-to-stable-benchmark-system.html

I started to write advice on getting stable benchmarks:
    https://github.com/python/performance#how-to-get-stable-benchmarks

    Note: See also the https://mail.python.org/mailman/listinfo/speed mailing list ;-)

    @vstinner vstinner reopened this Sep 15, 2016
@malemburg (Member)

On 15.09.2016 11:11, STINNER Victor wrote:

> Hum, since the discussion restarted, I reopen the issue ...
>
> "Well, pybench is not just one benchmark, it's a whole collection of benchmarks for various different aspects of the CPython VM, and by design it tries to calibrate itself per benchmark, since each benchmark has different overhead."
>
> In the performance module, you now get individual timings for each pybench benchmark, instead of an overall total, which was less useful.

    pybench had the same intention. It was a design mistake to add an
    overall timing to each suite run. The original intention was to
    compare each benchmark individually.

    Perhaps it would make sense to try to port the individual benchmark
    tests in pybench to performance.

    "The number of iterations per benchmark will not change between runs, since this number is fixed in each benchmark."

    Please take a look at the new performance module, it has a different design. Calibration is based on minimum time per sample, no more on hardcoded things. I modified all benchmarks, not only pybench.

I think we are talking about different things here: calibration in pybench means that you try to determine the overhead of the outer loop and possible setup code that is needed to run the test.

    pybench runs a calibration method which has the same
    code as the main test, but without the actual operations that you
    want to test, in order to determine the timing of the overhead.

It then takes the minimum timing from the overhead runs and uses this as the baseline for the actual test runs (it subtracts the overhead timing from the test run results).

This may not be ideal in all cases, but it's the closest I could get to timing the test operations at the time.

    I'll have a look at what performance does.
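
As a rough sketch of the overhead-subtraction scheme described above (illustrative names and numbers only, not pybench's actual code):

    import time

    ROUNDS = 100000  # fixed per benchmark, as in pybench

    def run_test(rounds):
        # Outer loop plus the operations being measured.
        t0 = time.perf_counter()
        for _ in range(rounds):
            x = 1 + 1          # stand-in for the real test operations
        return time.perf_counter() - t0

    def run_calibration(rounds):
        # Same outer loop and setup, but without the test operations.
        t0 = time.perf_counter()
        for _ in range(rounds):
            pass
        return time.perf_counter() - t0

    # Take the minimum of several overhead runs as the baseline ...
    min_overhead = min(run_calibration(ROUNDS) for _ in range(20))
    # ... and subtract it from each real test run.
    test_time = run_test(ROUNDS) - min_overhead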

    "BTW: Why would you want to run benchmarks in child processes and in parallel ?"

    Child processes are run sequentially.

    Ah, ok.

> Running benchmarks in multiple processes helps to get more reliable results. Read my article if you want to learn more about the design of my perf module:
> http://haypo-notes.readthedocs.io/microbenchmark.html#my-articles

    Will do, thanks.

    "Ideally, the pybench process should be the only CPU intense work load on the entire CPU to get reasonable results."

    The perf module automatically uses isolated CPU. It strongly suggests to use this amazing Linux feature to run benchmarks!
    https://haypo.github.io/journey-to-stable-benchmark-system.html

    I started to write advices to get stable benchmarks:
    https://github.com/python/performance#how-to-get-stable-benchmarks

    Note: See also the https://mail.python.org/mailman/listinfo/speed mailing list ;-)

I've read some of your blog posts and articles on the subject and your journey. Interesting stuff, definitely. Benchmarking these days appears to have gotten harder, not simpler, compared to the days of pybench some 19 years ago.

@vstinner (Member)

    2016-09-15 11:21 GMT+02:00 Marc-Andre Lemburg <report@bugs.python.org>:

> I think we are talking about different things here: calibration in pybench means that you try to determine the overhead of the outer loop and possible setup code that is needed to run the test.
> (...)
> It then takes the minimum timing from the overhead runs and uses this as the baseline for the actual test runs (it subtracts the overhead timing from the test run results).

Calibration in perf means automatically computing the number of outer loops needed to get a sample of at least 100 ms (the default minimum time).
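
A simplified sketch of that kind of time-based calibration (illustrative only, not the actual perf implementation):

    import time

    def calibrate_loops(func, min_time=0.1):
        # Double the number of outer loops until a single sample
        # takes at least min_time seconds (100 ms by default).
        loops = 1
        while True:
            t0 = time.perf_counter()
            for _ in range(loops):
                func()
            if time.perf_counter() - t0 >= min_time:
                return loops
            loops *= 2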

    I simply removed the code to estimate the overhead of the outer loop
    in pybench. The reason is this line:

            # Get calibration
            min_overhead = min(self.overhead_times)

There is no such thing as the "minimum timing"; it doesn't exist :-) In benchmarks, you have to work with statistics: average, standard deviation, etc.

If you badly estimate the minimum overhead, you might get negative timings, which is not allowed in perf (even zero is a hard error in perf).

    It's not possible to compute *exactly* the "minimum overhead".

    Moreover, removing the code to estimate the overhead simplified the code.

> Benchmarking these days appears to have gotten harder, not simpler, compared to the days of pybench some 19 years ago.

Benchmarking was always a hard problem. Modern hardware (out-of-order CPUs, variable CPU frequency, power saving, etc.) probably didn't help :-)

@vstinner (Member)

    I'm closing the issue again.

Again, pybench moved to http://github.com/python/performance: please continue the discussion there if you think something still needs to be done about pybench.

FYI I recently reworked pybench deeply using the new perf 0.8 API. perf 0.8 now supports running multiple benchmarks per script, so pybench was rewritten as just a benchmark runner. Comparison between benchmarks can be done using performance, or directly using perf (python3 -m perf compare a.json b.json).
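
For illustration, a script running multiple benchmarks might look roughly like this (again assuming the perf 0.8-era API; the method names are an assumption based on the later pyperf package):

    import perf

    def list_append():
        lst = []
        for i in range(1000):
            lst.append(i)

    def dict_insert():
        d = {}
        for i in range(1000):
            d[i] = i

    # One runner, several benchmarks in the same script.
    runner = perf.Runner()
    runner.bench_func('list_append', list_append)
    runner.bench_func('dict_insert', dict_insert)

    # Results written to JSON files can then be compared with the
    # command mentioned above: python3 -m perf compare a.json b.json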

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022