ceval: use Wordcode, 16-bit bytecode #70834

Closed
serprex mannequin opened this issue Mar 26, 2016 · 68 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@serprex
Mannequin

serprex mannequin commented Mar 26, 2016

BPO 26647
Nosy @brettcannon, @birkenfeld, @rhettinger, @ncoghlan, @vstinner, @benjaminp, @serhiy-storchaka, @1st1, @MojoVampire, @serprex, @dopplershift
Dependencies
  • bpo-26881: modulefinder should reuse the dis module
Files
  • wpy.patch
  • wcpybm.txt: pybench, ccbench, EXTENDED_ARG counts, pycache size
  • exarg_in_funcs.txt: Better EXTENDED_ARG counts
  • 2to3re.txt: Benchmarks: 2to3, regex
  • wpy2.patch: Changes from initial code review
  • wpy3.patch
  • wpy4.patch: f_lasti = -1
  • module_finder.patch
  • wpy5.patch
  • wpy6.patch
  • wpy7.patch
  • wpy7.patch: Regenerated for review
  • wpy8.patch
  • wpy8.patch: Regenerated for review
  • wpy9.patch
  • wpyA.patch
  • wpyA.patch: Regenerated for review
  • wpyB.patch
  • wpyC.patch
  • wpyD.patch
  • default-May26-03-05-10.log
  • wordcode.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-11-24.22:20:31.147>
    created_at = <Date 2016-03-26.23:27:48.913>
    labels = ['interpreter-core', 'performance']
    title = 'ceval: use Wordcode, 16-bit bytecode'
    updated_at = <Date 2018-03-19.16:29:10.271>
    user = 'https://github.com/serprex'

    bugs.python.org fields:

    activity = <Date 2018-03-19.16:29:10.271>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-11-24.22:20:31.147>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2016-03-26.23:27:48.913>
    creator = 'Demur Rumed'
    dependencies = ['26881']
    files = ['42300', '42324', '42325', '42328', '42331', '42339', '42353', '42446', '42453', '42455', '42659', '42661', '42886', '42912', '42935', '42946', '42947', '42948', '42949', '42950', '43010', '43013']
    hgrepos = []
    issue_num = 26647
    keywords = ['patch']
    message_count = 68.0
    messages = ['262501', '262542', '262597', '262616', '262622', '262624', '262646', '262647', '262648', '262658', '262660', '262662', '262671', '262677', '262684', '262716', '262758', '262787', '263094', '263263', '263265', '263278', '263334', '263335', '263336', '263337', '263339', '263340', '264256', '264470', '264495', '264496', '264536', '264558', '264604', '264610', '264643', '265285', '265346', '265799', '265917', '265918', '266045', '266053', '266060', '266078', '266090', '266092', '266096', '266101', '266110', '266112', '266114', '266233', '266234', '266240', '266377', '266388', '266407', '266408', '266417', '266420', '266425', '266426', '266432', '281656', '281663', '281789']
    nosy_count = 13.0
    nosy_names = ['brett.cannon', 'georg.brandl', 'rhettinger', 'ncoghlan', 'vstinner', 'benjamin.peterson', 'python-dev', 'serhiy.storchaka', 'yselivanov', 'abarnert', 'josh.r', 'Demur Rumed', 'Ryan May']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue26647'
    versions = ['Python 3.6']

    @serprex
    Mannequin Author

    serprex mannequin commented Mar 26, 2016

    Originally started @ https://github.com/abarnert/cpython/tree/wpy

    This patch is based off of https://github.com/serprex/cpython/tree/wpy

    It omits importlib.h & importlib_external.h as those are generated

    It omits https://github.com/serprex/cpython/blob/wpy/Python/wordcode.md

    I got around to benchmarking against a build of master rather than my repo's packaged version; it's currently a 1% speed improvement (every bit counts). I'm testing on an Intel Atom 330 with Linux. Besides the minor perf increase, it generates smaller bytecode & is simpler (peephole now handles EXTENDED_ARG since it isn't too hard to track, & while loops become for loops in dis)

    Previous discussion: https://mail.python.org/pipermail/python-dev/2016-February/143357.html

    pdb works without changes. coverage.py doesn't seem to rely on anything this changes

    I modified byteplay to target this change mostly over the course of half an hour before work: https://github.com/serprex/byteplay/blob/master/wbyteplay.py

    I'd be interested to hear if this encoding simplifies things for FAT python & the recent work to cache attribute/global lookup

    Remaining code issues: peepholer could allocate half the space it does now for basic block tracking; compile.c & peephole.c repeat themselves on computing instruction size given an argument & on how to emit an instruction given an argument

    Breaking change in dis: I've removed HAVE_ARGUMENT. This is to help code fail fast. It could be replaced with IGNORES_ARGUMENT or, as abarnert suggested, a range(90,256) named after the other hasXXXs 'hasarg'

    @serprex serprex mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Mar 26, 2016
    @serprex
    Mannequin Author

    serprex mannequin commented Mar 28, 2016

    Also missing from this patch is modification of the bytecode magic number

    @vstinner
    Member

    Sorry, I don't have the context. Can you please explain your change? What did you do? What is the rationale? Do you expect better performance? If yes, please run the Python benchmark suite and post results here. What is the new format of bytecode? etc.

    @serprex
    Mannequin Author

    serprex mannequin commented Mar 29, 2016

    I'll dig up benchmark results when I get home, but I'd be interested to get results on a less wannabe RISC CPU

    The change is to have all instructions take an argument. This removes the branch on each instruction on whether to load oparg. It then also aligns instructions to always be 2 bytes rather than 1 or 3 by having arguments only take up 1 byte. In the case that an argument to an instruction is greater than 255, it can chain EXTENDED_ARG up to 3 times. In practice this rarely occurs, mostly only for jumps, & abarnert measured stdlib to be ~5% smaller
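The encoding described above can be sketched in Python. This is only an illustration under stated assumptions, not code from the patch: the helper names `encode_instr` and `decode` are hypothetical, and the value used for `EXTENDED_ARG` is an assumed opcode number. Each instruction is one 2-byte unit (opcode, argument byte), and larger arguments are carried by up to three `EXTENDED_ARG` prefixes.

```python
EXTENDED_ARG = 144  # assumed opcode number, for illustration only


def encode_instr(opcode, oparg):
    """Encode one instruction as 2-byte units, prefixing EXTENDED_ARG as needed."""
    if not 0 <= oparg <= 0xFFFFFFFF:
        raise ValueError("oparg must fit in 32 bits")
    out = bytearray()
    # Emit the high bytes first, each one carried by an EXTENDED_ARG unit.
    for shift in (24, 16, 8):
        if oparg >> shift:
            out += bytes([EXTENDED_ARG, (oparg >> shift) & 0xFF])
    out += bytes([opcode, oparg & 0xFF])
    return bytes(out)


def decode(code):
    """Yield (offset, opcode, oparg) for each non-prefix instruction.

    The offset reported is that of the final unit, after any prefixes.
    """
    ext = 0
    for offset in range(0, len(code), 2):
        opcode, arg = code[offset], code[offset + 1]
        oparg = ext | arg
        if opcode == EXTENDED_ARG:
            ext = oparg << 8  # accumulate high bytes for the next unit
            continue
        ext = 0
        yield offset, opcode, oparg
```

Because every unit has the same width, the decoder never needs to branch on whether an opcode takes an argument; it only has to special-case the prefix.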

    The rationale is that this offers 3 benefits: Smaller code size, simpler instruction iteration/indexing (One may now scan backwards, as peephole.c does in this patch), which between the two results in a small perf gain (the only way for perf to be negatively impacted is by an increase in EXTENDED_ARGs, when I post benchmarking I'll also post a count of how many more EXTENDED_ARGs are emitted)

    This also means that if I want to create something like a tracer that tracks some information for each instruction, I can allocate an array of codesize/2 bytes, then index off of half the instruction index. This isn't currently done in peephole.c, nor does this include halving jump opargs
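A minimal sketch of the per-instruction side array mentioned above, assuming fixed 2-byte instructions. The names `make_counters` and `record` are hypothetical and do not appear in the patch; the point is only that the slot for an instruction is its byte offset divided by two.

```python
def make_counters(code):
    """Allocate one byte-sized slot per 2-byte instruction unit."""
    return bytearray(len(code) // 2)


def record(counters, offset):
    """Bump the saturating counter for the instruction at byte `offset`."""
    slot = offset // 2
    if counters[slot] < 255:
        counters[slot] += 1
```

With the older variable-width format the same table would need one slot per byte of code, or a separate offset-to-index map.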

    I've looked up the 'recent work to cache attribute/global lookup' issue I mentioned: http://bugs.python.org/issue26219
    I believe that patch would benefit from this one, but it'd be better to get Yury's opinion on that belief

    @vstinner
    Member

    This also means that if I want to create something like a tracer that tracks some information for each instruction, I can allocate an array of codesize/2 bytes, then index off of half the instruction index. This isn't currently done in peephole.c, nor does this include halving jump opargs

    There is something called "inline caching": put the cache between instructions, in the same memory block. Example of paper on CPython:

    "Efficient Inline Caching without Dynamic Translation" by Stefan Brunthaler (2009)
    https://www.sba-research.org/wp-content/uploads/publications/sac10.pdf

    Yury's approach is a standard lookup table: offset => cache. In the issue bpo-26219, he even used two tables: co->co_opt_opcodemap is an array mapping an instruction offset to the offset in the cache, then the second offset is used to retrieve cache data from a second array. You have 3 structures (co_code, co_opt_opcodemap, co_opt), whereas inline caching proposes to use only one flat structure (a single array).

    The paper promises "improved data locality and instruction decoding efficiency".

    but "The new combined data-structure requires significantly more space—two native machine words for each instruction byte. To compensate for the additional space requirements, we use a profiling infrastructure to decide when to switch to this new instruction encoding at run time."

    Memory footprint and detection of hot code is handled in the issue bpo-26219.
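The two-table lookup described above for bpo-26219 can be modelled roughly as follows. This is a toy sketch, not the layout used in that issue: the class `CachedCode` and its fields are hypothetical, it assumes 2-byte instructions as in this patch, and it uses a `bytearray` for the first table, which limits it to 255 cache slots.

```python
class CachedCode:
    """Toy model: bytecode plus a side table of per-instruction caches."""

    def __init__(self, code, cached_offsets):
        self.code = code
        # First table: instruction index -> 1-based cache slot, 0 means "no cache".
        self.opcodemap = bytearray(len(code) // 2)
        # Second table: the cache entries themselves.
        self.caches = []
        for offset in cached_offsets:
            self.caches.append({})
            self.opcodemap[offset // 2] = len(self.caches)

    def cache_for(self, offset):
        """Return the cache dict for the instruction at `offset`, or None."""
        slot = self.opcodemap[offset // 2]
        return self.caches[slot - 1] if slot else None
```

Inline caching would instead interleave the cache words with the instructions in a single array, trading memory for locality.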

    @vstinner
    Member

    The change is to have all instructions take an argument. This removes the branch on each instruction on whether to load oparg. (...)

    Oh ok, I like that :-) I had the same idea.

    Your patch contains unrelated changes; you should revert them to make the change simpler to review.

    Removing HAVE_ARGUMENT from opcode.h/dis.py doesn't seem like a good idea. IMHO it's still useful for dis to show a more compact bytecode. For example, I expect "DUP_TOP", not "DUP_TOP 0", or worse "DUP_TOP 5".
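A small sketch of the display rule being asked for, assuming the threshold of 90 mentioned earlier in this thread. The function `format_instr` is hypothetical and is not part of dis.

```python
HAVE_ARGUMENT = 90  # opcodes below this value ignore their argument byte


def format_instr(name, opcode, oparg):
    """Render an instruction, hiding the argument byte when it is ignored."""
    if opcode < HAVE_ARGUMENT:
        return name
    return "%s %d" % (name, oparg)
```

Here the constant is used only for presentation; decoding no longer depends on it because every instruction carries an argument byte.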

    For backward compatibility, I also suggest keeping HAS_ARG() even if it must not be used to decode instructions anymore.

    The obvious next change is to align co_code on a 16-bit boundary, so that ceval.c can fetch the opcode and its argument with a single 16-bit load rather than two 8-bit loads (see the issue bpo-25823). I suggest implementing that later to keep this change as simple as possible.
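The single 16-bit fetch can be illustrated in Python with the `array` module. This is a sketch of the idea only, not how ceval.c does or would do it; the function name `iter_units` is hypothetical, and which half of the word holds the opcode depends on the byte order of the machine.

```python
import sys
from array import array


def iter_units(code):
    """Yield (opcode, oparg) by reading one 16-bit unit per instruction."""
    units = array("H")
    if units.itemsize != 2:
        raise RuntimeError("expected a 16-bit array item")
    units.frombytes(code)
    little = sys.byteorder == "little"
    for word in units:
        if little:
            # The opcode is the first byte in memory, i.e. the low byte.
            yield word & 0xFF, word >> 8
        else:
            yield word >> 8, word & 0xFF
```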

    @vstinner vstinner changed the title Wordcode ceval: use Wordcode, 16-bit bytecode Mar 29, 2016
    @serprex
    Mannequin Author

    serprex mannequin commented Mar 30, 2016

    I've attached some benchmarking results as requested

    There is 1 failing test which doesn't fail in master for test_trace; the unit test for bpo-9936

    @serprex
    Mannequin Author

    serprex mannequin commented Mar 30, 2016

    To clarify format of extended arg listings: 1st column is the number of instances of EXTENDED_ARG being emitted, 2nd column is length of bytecode, followed by filename

    The previous numbers were of modules, which generally are run-once and listing many constants. I've attached a modification where instead I iterated over the code objects inside co_consts from compiling the *.py file. Trunk currently only emits EXTENDED_ARGs for classes (Pdb, & then the rest in Lib/typing.py) so I've omitted it
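A rough way to reproduce this kind of listing with the standard library is sketched below. It is not the script used to produce the attachments: the function names are hypothetical, and it reports the code object name in the third column rather than a filename.

```python
import dis
import types


def iter_code_objects(code):
    """Yield `code` and every code object nested in its co_consts."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


def extended_arg_counts(source, filename="<string>"):
    """Return (EXTENDED_ARG count, bytecode length, name) per nested code object."""
    module = compile(source, filename, "exec")
    rows = []
    for code in iter_code_objects(module):
        if code is module:
            continue  # skip the run-once module body
        count = sum(1 for ins in dis.get_instructions(code)
                    if ins.opname == "EXTENDED_ARG")
        rows.append((count, len(code.co_code), code.co_name))
    return rows
```

Skipping the module-level code object mirrors the point above that module bodies run once and mostly list constants.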

    @brettcannon
    Member

    Thanks for the benchmark results, Demur, but I think the benchmarks Victor was talking about are the ones at hg.python.org/benchmarks

    @rhettinger
    Contributor

    Demur, I think you're on the right track here. It will be nice to be rid of the HAVE_ARGUMENT tests and to not have to decode the arguments one byte at a time. Overall, the patch looks good (although it includes several small unrelated changes). Besides the speed benefit, the code looks cleaner than before.

    I was surprised to see that the peephole optimizer grew larger, but the handling of extended arguments is likely worth it even though it adds several new wordy chunks of code.

    When it comes to benchmarks, expect a certain amount of noise (especially from those that use I/O or that exercise the C-API more than the pure python bytecode).

    @rhettinger
    Contributor

    FWIW, I'm seeing about a 7% improvement to pystone.

    @serprex
    Mannequin Author

    serprex mannequin commented Mar 30, 2016

    While it's good to know benchmarking in core Python goes beyond the microbenchmarks included in the distribution, I'm having some trouble with hg.python.org/benchmarks due to my system only having 256MB of ram

    I've attached results for 2 benchmarks: 2to3 & regex

    @rhettinger
    Contributor

    Report on Darwin Raymonds-2013-MacBook-Pro.local 15.4.0 Darwin Kernel Version 15.4.0: Fri Feb 26 22:08:05 PST 2016; root:xnu-3248.40.184~3/RELEASE_X86_64 x86_64 i386
    Total CPU cores: 8

    ### 2to3 ###
    Min: 4.680941 -> 4.437426: 1.05x faster
    Avg: 4.703692 -> 4.498773: 1.05x faster
    Significant (t=9.57)
    Stddev: 0.02670 -> 0.03972: 1.4874x larger

    ### chameleon_v2 ###
    Min: 3.391806 -> 3.300793: 1.03x faster
    Avg: 3.447192 -> 3.340437: 1.03x faster
    Significant (t=28.26)
    Stddev: 0.03141 -> 0.02098: 1.4972x smaller

    ### django_v3 ###
    Min: 0.339693 -> 0.328680: 1.03x faster
    Avg: 0.347655 -> 0.335704: 1.04x faster
    Significant (t=16.97)
    Stddev: 0.00477 -> 0.00518: 1.0871x larger

    ### nbody ###
    Min: 0.159703 -> 0.148231: 1.08x faster
    Avg: 0.164307 -> 0.152380: 1.08x faster
    Significant (t=34.06)
    Stddev: 0.00260 -> 0.00234: 1.1123x smaller

    The following not significant results are hidden, use -v to show them:
    fastpickle, fastunpickle, json_dump_v2, json_load, regex_v8, tornado_http.
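The "N.NNx faster" figures in the report above are consistent with a simple ratio of the two timings. The helper below is a hypothetical sketch of that arithmetic, assumed rather than taken from the benchmark suite's source.

```python
def speedup(before, after):
    """Describe a timing change as a ratio, in the style of the report."""
    if after <= before:
        return "%.2fx faster" % (before / after)
    return "%.2fx slower" % (after / before)
```

For the 2to3 minimum times, 4.680941 / 4.437426 is about 1.055, which rounds to the reported 1.05x.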

    @brettcannon
    Member

    Thanks to Demur and Raymond for running the benchmarks. All of these numbers look good and with Raymond saying the code looks cleaner and everyone so far -- including me -- liking the overall idea this definitely seems worth continuing to work on. Thanks for starting this, Demur, and I hope you feel up for continuing to work on making this work!

    @serprex
    Mannequin Author

    serprex mannequin commented Mar 31, 2016

    Added back HAVE_ARGUMENT & HAS_ARG. As a result, disassembly output once again omits unused arguments

    Removed some code which was labelled unrelated

    This does _not_ include having f_lasti be -1 instead of -2

    @serprex
    Mannequin Author

    serprex mannequin commented Apr 1, 2016

    Addressed feedback from josh.rosenberg besides reintroducing FOURTH/SET_FOURTH

    @vstinner
    Member

    vstinner commented Apr 1, 2016

    I reviewed wpy3.patch.

    I concur with Raymond, it's really nice to have a regular structure for the bytecode.

    --

    Serhiy proposed to *reduce* the size of bytecode by adding new specialized bytecode which include the argument. For example (LOAD_CONST, 0) => LOAD_CONST_0. I would like to hear his opinion on this change.
    https://mail.python.org/pipermail/python-ideas/2016-February/038276.html

    Data+code loaded by import is the top #1 memory consumer on basic scripts according to tracemalloc:
    https://docs.python.org/dev/library/tracemalloc.html#examples

    I don't know the ratio between data and code. But here we are only talking about the co_code fields of code objects. I guess that the file size of .pyc is a good estimation.

    I don't think that the memory footprint of bytecode (co_code fields of code objects) really matters on computers (and smartphones?) of 2016.

    *If* I have to choose between CPU performance and memory footprint, I choose the CPU!

    --

    This does _not_ include having f_lasti be -1 instead of -2

    IMHO it's ok to break the C API, but I would prefer to keep the backward compatibility for the Python API (replace any negative number with -1 for the Python API).

    @serprex
    Mannequin Author

    serprex mannequin commented Apr 2, 2016

    Got f_lasti working as -1. Applied PEP-7. Unrelated: fixed a misnamed variable in test_grammar because it ran into a peephole bug (const -> jump_if_false erase didn't work when EXTENDED_ARGs were involved). dis has argval/arg set to None instead of the unused argument value

    Things are seeming more brittle with f_lasti as -1. But maybe it's all in my head

    @serprex
    Mannequin Author

    serprex mannequin commented Apr 9, 2016

    [12:36] <serprex> Could I get a code review for wordcode's 4th patchset? http://bugs.python.org/review/26647/#ps16875
    ...
    [13:13] <SilentGhost> serprex: you'd be better off bumping the issue

    @vstinner
    Member

    module_finder.patch: cleanup (optimize?) modulefinder.ModuleFinder.scan_opcodes_25(): Use an index rather than creating a lot of substrings.
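The difference between the two scanning styles can be sketched as follows. This is not the modulefinder code; the function names are hypothetical, and the example uses the pre-wordcode layout (a 1-byte opcode, followed by a 2-byte little-endian argument when the opcode is at least HAVE_ARGUMENT).

```python
HAVE_ARGUMENT = 90


def scan_by_slicing(code):
    """Old style: consume the buffer by repeatedly slicing off its head."""
    while code:
        op = code[0]
        if op >= HAVE_ARGUMENT:
            yield op, code[1] | (code[2] << 8)
            code = code[3:]  # copies the remaining bytes on every step
        else:
            yield op, None
            code = code[1:]


def scan_by_index(code):
    """Same traversal, but only an integer index moves forward."""
    i, n = 0, len(code)
    while i < n:
        op = code[i]
        if op >= HAVE_ARGUMENT:
            yield op, code[i + 1] | (code[i + 2] << 8)
            i += 3
        else:
            yield op, None
            i += 1
```

Both generators produce the same sequence; the indexed version avoids creating a new bytes object per instruction.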

    It's unrelated to Wordcode, it's just that I noticed the inefficient code while reviewing the whole patch.

    @python-dev
    Mannequin

    python-dev mannequin commented Apr 12, 2016

    New changeset 7bf08a11d4c9 by Victor Stinner in branch 'default':
    Issue bpo-26647: Cleanup opcode
    https://hg.python.org/cpython/rev/7bf08a11d4c9

    New changeset 423e2a96189e by Victor Stinner in branch 'default':
    Issue bpo-26647: Cleanup modulefinder
    https://hg.python.org/cpython/rev/423e2a96189e

    New changeset f8398dba48fb by Victor Stinner in branch '3.5':
    Issue bpo-26647: Fix typo in test_grammar
    https://hg.python.org/cpython/rev/f8398dba48fb

    @vstinner
    Member

    Demur Rumed: can you please rebase your patch? And can you please generate a patch without the git format?

    @serprex
    Mannequin Author

    serprex mannequin commented Apr 13, 2016

    Made changes from code review, did a little extra on fixing up type consistency, not sure if this is exactly the patch format you wanted; I tried git difftool --extcmd='diff -u' python/master but it's listing the original files as being from /tmp

    I've updated modulefinder with haypo's index patch except in the context of wordcode

    @vstinner
    Member

    Updated wpy5.patch to use a more standard diff format (patch generated with Mercurial, hg diff > patch).

    @vstinner
    Member

    Updated wpy5.patch to use a more standard diff format (patch generated with Mercurial, hg diff > patch).

    Crap, I forgot Python/wordcode_helpers.h.

    I uploaded a fixed wpy6.patch.

    @vstinner
    Member

    Demur Rumed: Is the peephole optimizer able to emit *two* EXTENDED_ARG for a jump larger than 16 bits? Currently, it only has to retry once to add EXTENDED_ARG if a jump is larger than 16 bits (to use a 32-bit jump offset).
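To make the question concrete: with 1-byte arguments, an instruction can grow by one unit at each of three thresholds, and growing a jump can push other jump targets past a threshold in turn. The sketch below shows one generic way to settle sizes by repeating the layout until nothing changes. It is an illustration only, not the algorithm in peephole.c or compile.c; `instr_units` and `layout` are hypothetical names, and jump arguments are assumed to be absolute byte offsets.

```python
def instr_units(oparg):
    """Number of 2-byte units needed to encode an instruction with `oparg`."""
    if oparg <= 0xFF:
        return 1
    if oparg <= 0xFFFF:
        return 2
    if oparg <= 0xFFFFFF:
        return 3
    return 4


def layout(instrs):
    """Return the byte offset of each instruction once sizes stop changing.

    instrs is a list of (is_jump, value) pairs; for a jump, value is the
    index of the target instruction, otherwise it is the plain argument.
    """
    sizes = [1] * len(instrs)
    while True:
        offsets = []
        pos = 0
        for size in sizes:
            offsets.append(pos)
            pos += 2 * size
        wanted = [
            instr_units(offsets[value] if is_jump else value)
            for is_jump, value in instrs
        ]
        # Sizes only ever grow and are capped at 4 units, so this terminates.
        new_sizes = [max(a, b) for a, b in zip(sizes, wanted)]
        if new_sizes == sizes:
            return offsets
        sizes = new_sizes
```

A single retry is enough only when no instruction can cross more than one threshold and no growth moves another target across one; the loop above does not rely on that.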

    @vstinner
    Member

    I ran the Python benchmark suite on wpy6.patch.

    • My platform: Linux-4.4.4-301.fc23.x86_64-x86_64-with-fedora-23-Twenty_Three
    • My PC: CPU Intel i7-2600 (~2.9 GHz) with 12 GB of RAM
    • Benchmark ran on isolated CPU: http://haypo-notes.readthedocs.org/microbenchmark.html
    • Command line: ~/bin/taskset_isolated.py time python3 -u perf.py --rigorous "$ORIG_PYTHON" "$PATCHED_PYTHON" -b all 2>&1

    It looks like more benchmarks get faster than slower. The best speedup is 11%, whereas the worst slowdown is only 4%. The overall results look good to me.

    Slower:

    • fannkuch: 1.04x slower
    • pickle_dict: 1.04x slower
    • telco: 1.03x slower
    • django_v3: 1.02x slower
    • simple_logging: 1.02x slower
    • meteor_contest: 1.02x slower

    Faster:

    • unpack_sequence: 1.11x faster
    • etree_parse: 1.06x faster
    • call_method_slots: 1.06x faster
    • etree_iterparse: 1.05x faster
    • call_simple: 1.04x faster
    • nbody: 1.04x faster
    • float: 1.04x faster
    • call_method_unknown: 1.03x faster
    • call_method: 1.03x faster
    • chaos: 1.03x faster
    • mako_v2: 1.03x faster
    • richards: 1.02x faster
    • silent_logging: 1.02x faster

    Full Output:

    Original python: ../wordcode/python
    3.6.0a0 (default:ad5b079565ad, Apr 13 2016, 16:30:36)
    [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]

    Patched python: ../wordcode/python
    3.6.0a0 (default:c050d203e82b, Apr 13 2016, 16:30:24)
    [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]

    INFO:root:Automatically selected timer: perf_counter
    INFO:root:Skipping benchmark slowpickle; not compatible with Python 3.6
    INFO:root:Skipping benchmark pybench; not compatible with Python 3.6
    INFO:root:Skipping benchmark hg_startup; not compatible with Python 3.6
    INFO:root:Skipping benchmark rietveld; not compatible with Python 3.6
    INFO:root:Skipping benchmark slowspitfire; not compatible with Python 3.6
    INFO:root:Skipping benchmark bzr_startup; not compatible with Python 3.6
    INFO:root:Skipping benchmark html5lib_warmup; not compatible with Python 3.6
    INFO:root:Skipping benchmark slowunpickle; not compatible with Python 3.6
    INFO:root:Skipping benchmark html5lib; not compatible with Python 3.6
    INFO:root:Skipping benchmark spambayes; not compatible with Python 3.6
    [ 1/43] 2to3...
    INFO:root:Running ../wordcode/python lib3/2to3/2to3 -f all lib/2to3
    INFO:root:Running ../wordcode/python lib3/2to3/2to3 -f all lib/2to3 5 times
    INFO:root:Running ../default/python lib3/2to3/2to3 -f all lib/2to3
    INFO:root:Running ../default/python lib3/2to3/2to3 -f all lib/2to3 5 times
    [ 2/43] call_method...
    INFO:root:Running ../wordcode/python performance/bm_call_method.py -n 300 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_call_method.py -n 300 --timer perf_counter
    mer. avril 13 16:36:47 CEST 2016
    Original python: ../wordcode/python
    3.6.0a0 (default:ad5b079565ad, Apr 13 2016, 16:30:36)
    [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]

    Patched python: ../wordcode/python
    3.6.0a0 (default:c050d203e82b, Apr 13 2016, 16:30:24)
    [GCC 5.3.1 20151207 (Red Hat 5.3.1-2)]

    INFO:root:Automatically selected timer: perf_counter
    INFO:root:Skipping benchmark html5lib; not compatible with Python 3.6
    INFO:root:Skipping benchmark html5lib_warmup; not compatible with Python 3.6
    INFO:root:Skipping benchmark slowpickle; not compatible with Python 3.6
    INFO:root:Skipping benchmark slowunpickle; not compatible with Python 3.6
    INFO:root:Skipping benchmark slowspitfire; not compatible with Python 3.6
    INFO:root:Skipping benchmark rietveld; not compatible with Python 3.6
    INFO:root:Skipping benchmark bzr_startup; not compatible with Python 3.6
    INFO:root:Skipping benchmark spambayes; not compatible with Python 3.6
    INFO:root:Skipping benchmark pybench; not compatible with Python 3.6
    INFO:root:Skipping benchmark hg_startup; not compatible with Python 3.6
    [ 1/43] 2to3...
    INFO:root:Running ../wordcode/python lib3/2to3/2to3 -f all lib/2to3
    INFO:root:Running ../wordcode/python lib3/2to3/2to3 -f all lib/2to3 5 times
    INFO:root:Running ../default/python lib3/2to3/2to3 -f all lib/2to3
    INFO:root:Running ../default/python lib3/2to3/2to3 -f all lib/2to3 5 times
    [ 2/43] call_method...
    INFO:root:Running ../wordcode/python performance/bm_call_method.py -n 300 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_call_method.py -n 300 --timer perf_counter
    [ 3/43] call_method_slots...
    INFO:root:Running ../wordcode/python performance/bm_call_method_slots.py -n 300 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_call_method_slots.py -n 300 --timer perf_counter
    [ 4/43] call_method_unknown...
    INFO:root:Running ../wordcode/python performance/bm_call_method_unknown.py -n 300 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_call_method_unknown.py -n 300 --timer perf_counter
    [ 5/43] call_simple...
    INFO:root:Running ../wordcode/python performance/bm_call_simple.py -n 300 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_call_simple.py -n 300 --timer perf_counter
    [ 6/43] chameleon_v2...
    INFO:root:Running ../wordcode/python performance/bm_chameleon_v2.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_chameleon_v2.py -n 100 --timer perf_counter
    [ 7/43] chaos...
    INFO:root:Running ../wordcode/python performance/bm_chaos.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_chaos.py -n 100 --timer perf_counter
    [ 8/43] django_v3...
    INFO:root:Running ../wordcode/python performance/bm_django_v3.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_django_v3.py -n 100 --timer perf_counter
    [ 9/43] etree_generate...
    INFO:root:Running ../wordcode/python performance/bm_elementtree.py -n 100 --timer perf_counter generate
    INFO:root:Running ../default/python performance/bm_elementtree.py -n 100 --timer perf_counter generate
    [10/43] etree_iterparse...
    INFO:root:Running ../wordcode/python performance/bm_elementtree.py -n 100 --timer perf_counter iterparse
    INFO:root:Running ../default/python performance/bm_elementtree.py -n 100 --timer perf_counter iterparse
    [11/43] etree_parse...
    INFO:root:Running ../wordcode/python performance/bm_elementtree.py -n 100 --timer perf_counter parse
    INFO:root:Running ../default/python performance/bm_elementtree.py -n 100 --timer perf_counter parse
    [12/43] etree_process...
    INFO:root:Running ../wordcode/python performance/bm_elementtree.py -n 100 --timer perf_counter process
    INFO:root:Running ../default/python performance/bm_elementtree.py -n 100 --timer perf_counter process
    [13/43] fannkuch...
    INFO:root:Running ../wordcode/python performance/bm_fannkuch.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_fannkuch.py -n 100 --timer perf_counter
    [14/43] fastpickle...
    INFO:root:Running ../wordcode/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle pickle
    INFO:root:Running ../default/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle pickle
    [15/43] fastunpickle...
    INFO:root:Running ../wordcode/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle unpickle
    INFO:root:Running ../default/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle unpickle
    [16/43] float...
    INFO:root:Running ../wordcode/python performance/bm_float.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_float.py -n 100 --timer perf_counter
    [17/43] formatted_logging...
    INFO:root:Running ../wordcode/python performance/bm_logging.py -n 100 --timer perf_counter formatted_output
    INFO:root:Running ../default/python performance/bm_logging.py -n 100 --timer perf_counter formatted_output
    [18/43] go...
    INFO:root:Running ../wordcode/python performance/bm_go.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_go.py -n 100 --timer perf_counter
    [19/43] hexiom2...
    INFO:root:Running ../wordcode/python performance/bm_hexiom2.py -n 4 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_hexiom2.py -n 4 --timer perf_counter
    [20/43] json_dump_v2...
    INFO:root:Running ../wordcode/python performance/bm_json_v2.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_json_v2.py -n 100 --timer perf_counter
    [21/43] json_load...
    INFO:root:Running ../wordcode/python performance/bm_json.py -n 100 --timer perf_counter json_load
    INFO:root:Running ../default/python performance/bm_json.py -n 100 --timer perf_counter json_load
    [22/43] mako_v2...
    INFO:root:Running ../wordcode/python performance/bm_mako_v2.py -n 1000 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_mako_v2.py -n 1000 --timer perf_counter
    [23/43] meteor_contest...
    INFO:root:Running ../wordcode/python performance/bm_meteor_contest.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_meteor_contest.py -n 100 --timer perf_counter
    [24/43] nbody...
    INFO:root:Running ../wordcode/python performance/bm_nbody.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_nbody.py -n 100 --timer perf_counter
    [25/43] normal_startup...
    INFO:root:Running ../wordcode/python -c 2000 times
    INFO:root:Running ../default/python -c 2000 times
    [26/43] nqueens...
    INFO:root:Running ../wordcode/python performance/bm_nqueens.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_nqueens.py -n 100 --timer perf_counter
    [27/43] pathlib...
    INFO:root:Running ../wordcode/python performance/bm_pathlib.py -n 1000 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_pathlib.py -n 1000 --timer perf_counter
    [28/43] pickle_dict...
    INFO:root:Running ../wordcode/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle pickle_dict
    INFO:root:Running ../default/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle pickle_dict
    [29/43] pickle_list...
    INFO:root:Running ../wordcode/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle pickle_list
    INFO:root:Running ../default/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle pickle_list
    [30/43] pidigits...
    INFO:root:Running ../wordcode/python performance/bm_pidigits.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_pidigits.py -n 100 --timer perf_counter
    [31/43] raytrace...
    INFO:root:Running ../wordcode/python performance/bm_raytrace.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_raytrace.py -n 100 --timer perf_counter
    [32/43] regex_compile...
    INFO:root:Running ../wordcode/python performance/bm_regex_compile.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_regex_compile.py -n 100 --timer perf_counter
    [33/43] regex_effbot...
    INFO:root:Running ../wordcode/python performance/bm_regex_effbot.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_regex_effbot.py -n 100 --timer perf_counter
    [34/43] regex_v8...
    INFO:root:Running ../wordcode/python performance/bm_regex_v8.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_regex_v8.py -n 100 --timer perf_counter
    [35/43] richards...
    INFO:root:Running ../wordcode/python performance/bm_richards.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_richards.py -n 100 --timer perf_counter
    [36/43] silent_logging...
    INFO:root:Running ../wordcode/python performance/bm_logging.py -n 100 --timer perf_counter no_output
    INFO:root:Running ../default/python performance/bm_logging.py -n 100 --timer perf_counter no_output
    [37/43] simple_logging...
    INFO:root:Running ../wordcode/python performance/bm_logging.py -n 100 --timer perf_counter simple_output
    INFO:root:Running ../default/python performance/bm_logging.py -n 100 --timer perf_counter simple_output
    [38/43] spectral_norm...
    INFO:root:Running ../wordcode/python performance/bm_spectral_norm.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_spectral_norm.py -n 100 --timer perf_counter
    [39/43] startup_nosite...
    INFO:root:Running ../wordcode/python -S -c 4000 times
    INFO:root:Running ../default/python -S -c 4000 times
    [40/43] telco...
    INFO:root:Running ../wordcode/python performance/bm_telco.py -n 100 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_telco.py -n 100 --timer perf_counter
    [41/43] tornado_http...
    INFO:root:Running ../wordcode/python performance/bm_tornado_http.py -n 200 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_tornado_http.py -n 200 --timer perf_counter
    [42/43] unpack_sequence...
    INFO:root:Running ../wordcode/python performance/bm_unpack_sequence.py -n 100000 --timer perf_counter
    INFO:root:Running ../default/python performance/bm_unpack_sequence.py -n 100000 --timer perf_counter
    [43/43] unpickle_list...
    INFO:root:Running ../wordcode/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle unpickle_list
    INFO:root:Running ../default/python performance/bm_pickle.py -n 100 --timer perf_counter --use_cpickle unpickle_list

    Report on Linux smithers 4.4.4-301.fc23.x86_64 #1 SMP Fri Mar 4 17:42:42 UTC 2016 x86_64 x86_64
    Total CPU cores: 8

    ### call_method ###
    Min: 0.313558 -> 0.304460: 1.03x faster
    Avg: 0.313797 -> 0.304661: 1.03x faster
    Significant (t=773.69)
    Stddev: 0.00015 -> 0.00014: 1.1084x smaller

    ### call_method_slots ###
    Min: 0.317374 -> 0.300388: 1.06x faster
    Avg: 0.317527 -> 0.300701: 1.06x faster
    Significant (t=1971.52)
    Stddev: 0.00011 -> 0.00010: 1.0595x smaller

    ### call_method_unknown ###
    Min: 0.309548 -> 0.301112: 1.03x faster
    Avg: 0.309619 -> 0.301828: 1.03x faster
    Significant (t=636.50)
    Stddev: 0.00008 -> 0.00020: 2.3452x larger

    ### call_simple ###
    Min: 0.245480 -> 0.235982: 1.04x faster
    Avg: 0.246004 -> 0.236310: 1.04x faster
    Significant (t=492.66)
    Stddev: 0.00023 -> 0.00025: 1.1069x larger

    ### chaos ###
    Min: 0.271012 -> 0.264204: 1.03x faster
    Avg: 0.271723 -> 0.264787: 1.03x faster
    Significant (t=132.15)
    Stddev: 0.00044 -> 0.00028: 1.5564x smaller

    ### django_v3 ###
    Min: 0.544071 -> 0.555346: 1.02x slower
    Avg: 0.544697 -> 0.556142: 1.02x slower
    Significant (t=-210.46)
    Stddev: 0.00036 -> 0.00041: 1.1510x larger

    ### etree_iterparse ###
    Min: 0.215644 -> 0.205198: 1.05x faster
    Avg: 0.219440 -> 0.208423: 1.05x faster
    Significant (t=53.95)
    Stddev: 0.00145 -> 0.00144: 1.0016x smaller

    ### etree_parse ###
    Min: 0.287245 -> 0.271355: 1.06x faster
    Avg: 0.288902 -> 0.273051: 1.06x faster
    Significant (t=107.60)
    Stddev: 0.00106 -> 0.00102: 1.0348x smaller

    ### fannkuch ###
    Min: 0.957137 -> 0.993462: 1.04x slower
    Avg: 0.965306 -> 0.995223: 1.03x slower
    Significant (t=-42.85)
    Stddev: 0.00665 -> 0.00214: 3.1094x smaller

    ### float ###
    Min: 0.258390 -> 0.248217: 1.04x faster
    Avg: 0.265902 -> 0.255380: 1.04x faster
    Significant (t=17.29)
    Stddev: 0.00441 -> 0.00419: 1.0510x smaller

    ### mako_v2 ###
    Min: 0.040757 -> 0.039408: 1.03x faster
    Avg: 0.041534 -> 0.040058: 1.04x faster
    Significant (t=106.39)
    Stddev: 0.00033 -> 0.00029: 1.1548x smaller

    ### meteor_contest ###
    Min: 0.187423 -> 0.192079: 1.02x slower
    Avg: 0.188739 -> 0.193440: 1.02x slower
    Significant (t=-61.30)
    Stddev: 0.00053 -> 0.00056: 1.0503x larger

    ### nbody ###
    Min: 0.227627 -> 0.219617: 1.04x faster
    Avg: 0.229736 -> 0.221310: 1.04x faster
    Significant (t=23.23)
    Stddev: 0.00276 -> 0.00235: 1.1745x smaller

    ### pickle_dict ###
    Min: 0.491946 -> 0.513859: 1.04x slower
    Avg: 0.492796 -> 0.515723: 1.05x slower
    Significant (t=-158.63)
    Stddev: 0.00063 -> 0.00130: 2.0672x larger

    ### richards ###
    Min: 0.159527 -> 0.155970: 1.02x faster
    Avg: 0.160603 -> 0.157190: 1.02x faster
    Significant (t=36.37)
    Stddev: 0.00067 -> 0.00066: 1.0168x smaller

    ### silent_logging ###
    Min: 0.068349 -> 0.067301: 1.02x faster
    Avg: 0.069759 -> 0.067481: 1.03x faster
    Significant (t=56.73)
    Stddev: 0.00038 -> 0.00013: 2.8514x smaller

    ### simple_logging ###
    Min: 0.276149 -> 0.282515: 1.02x slower
    Avg: 0.277709 -> 0.283773: 1.02x slower
    Significant (t=-53.60)
    Stddev: 0.00080 -> 0.00080: 1.0045x smaller

    ### telco ###
    Min: 0.011922 -> 0.012221: 1.03x slower
    Avg: 0.011985 -> 0.012283: 1.02x slower
    Significant (t=-59.48)
    Stddev: 0.00003 -> 0.00004: 1.0912x larger

    ### unpack_sequence ###
    Min: 0.000047 -> 0.000042: 1.11x faster
    Avg: 0.000047 -> 0.000042: 1.10x faster
    Significant (t=2242.55)
    Stddev: 0.00000 -> 0.00000: 1.2134x larger

    The following not significant results are hidden, use -v to show them:
    2to3, chameleon_v2, etree_generate, etree_process, fastpickle, fastunpickle, formatted_logging, go, hexiom2, json_dump_v2, json_load, normal_startup, nqueens, pathlib, pickle_list, pidigits, raytrace, regex_compile, regex_effbot, regex_v8, spectral_norm, startup_nosite, tornado_http, unpickle_list.
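
    For reading the report above: the "faster" and "slower" figures look like plain ratios of the two timings on each line, always expressed as a number of at least 1. A minimal sketch of that arithmetic, assuming this is how the suite formats its output; the helper name time_delta is made up for illustration:

```python
def time_delta(old, new):
    """Describe the change from timing `old` to timing `new` (both in seconds).

    Assumption: the ratio is always written as >= 1, so a slowdown is
    new / old and a speedup is old / new.
    """
    if old == 0 or new == 0:
        return "incomparable (one result was zero)"
    if new > old:
        return "%.2fx slower" % (new / old)
    if new < old:
        return "%.2fx faster" % (old / new)
    return "no change"


# Min timings copied from the call_method and django_v3 entries above.
print(time_delta(0.313558, 0.304460))  # 1.03x faster
print(time_delta(0.544071, 0.555346))  # 1.02x slower
```

    Because the printed timings are already rounded, recomputing a ratio from very small values (unpack_sequence, for example) can differ from the reported figure in the last digit.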

    @serprex
    Mannequin Author

    serprex mannequin commented May 22, 2016

    Sorry for the nuisance of uploading another patch so soon. wpyB updates test_ctypes now that __phello__ is smaller, fixes a typo in a comment I made, and removes a blank line I had added along with the if(0) logic.

    @serhiy-storchaka
    Member

    Warnings are still emitted in a debug build.

    @serprex
    Mannequin Author

    serprex mannequin commented May 22, 2016

    I have verified that wpyC does not produce signed/unsigned warnings with make DEBUG=1

    @serprex
    Mannequin Author

    serprex mannequin commented May 22, 2016

    Removes the 0 <= unsigned assertion and fixes the j < 0 check to avoid an overflow bug.

    @serhiy-storchaka
    Member

    LGTM. If no one has more comments, I'm going to commit the patch.

    @serhiy-storchaka serhiy-storchaka self-assigned this May 22, 2016
    @vstinner
    Member

    wpyD.patch LGTM, go ahead! We can still polish it later and discuss how to implement the 16-bit fetch ;-)

    It would be nice to add a short comment in Python/wordcode_helpers.h explaining that it contains code shared by the compiler and the peephole optimizer. It can be done later.
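
    To illustrate what a 16-bit fetch means here: each wordcode unit is two bytes, opcode first and oparg second, so on a little-endian machine both can be read as a single 16-bit word and then split with a mask and a shift. A rough Python sketch of the idea follows. The real work would happen in C inside ceval.c; fetch16, fetch_bytes and the sample opcode numbers are hypothetical and only serve the illustration.

```python
import struct


def fetch_bytes(wordcode, index):
    """Read code unit `index` as two separate byte loads."""
    return wordcode[2 * index], wordcode[2 * index + 1]


def fetch16(wordcode, index):
    """Read code unit `index` as one little-endian 16-bit word, then split it.

    With little-endian layout the opcode (first byte) ends up in the low
    8 bits and the oparg (second byte) in the high 8 bits.
    """
    (word,) = struct.unpack_from("<H", wordcode, 2 * index)
    return word & 0xFF, word >> 8


# Two made-up code units: (opcode 100, oparg 3) and (opcode 83, oparg 0).
sample = bytes([100, 3, 83, 0])

for i in range(len(sample) // 2):
    # Both ways of fetching must agree on every unit.
    assert fetch16(sample, i) == fetch_bytes(sample, i)
    print(i, fetch16(sample, i))
```

    On a big-endian machine the two halves of the word would be swapped, so the mask and the shift would trade places.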

    @python-dev
    Mannequin

    python-dev mannequin commented May 24, 2016

    New changeset 3a57eafd8401 by Serhiy Storchaka in branch 'default':
    Issue bpo-26647: Python interpreter now uses 16-bit wordcode instead of bytecode.
    https://hg.python.org/cpython/rev/3a57eafd8401

    @serhiy-storchaka
    Member

    Oh, I forgot to add a note in What's New. And the documentation of the dis module should be updated (EXTENDED_ARG etc). Could anyone do this?
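
    As background for the dis documentation update: with wordcode every instruction carries exactly one argument byte, and larger arguments are built from preceding EXTENDED_ARG prefixes, each supplying eight more high bits. A small sketch of that decoding, written against hand-built data; the opcode numbers and the names decode_wordcode, EXTENDED_ARG_OP and LOAD_OP are placeholders chosen for the example, not CPython's values.

```python
EXTENDED_ARG_OP = 144  # placeholder opcode number for the prefix instruction
LOAD_OP = 100          # placeholder opcode number for an ordinary instruction


def decode_wordcode(wordcode, extended_arg_op=EXTENDED_ARG_OP):
    """Yield (unit_index, opcode, full_oparg) for each real instruction.

    EXTENDED_ARG prefixes are folded into the argument of the instruction
    that follows them and are not yielded themselves.
    """
    pending = 0  # high bits accumulated from EXTENDED_ARG prefixes
    for i in range(0, len(wordcode), 2):
        opcode = wordcode[i]
        arg = wordcode[i + 1] | pending
        if opcode == extended_arg_op:
            pending = arg << 8  # shift up to make room for the next byte
            continue
        pending = 0
        yield i // 2, opcode, arg


# Two prefixes (0x01, 0x02) followed by an instruction with argument byte 0x03.
code = bytes([EXTENDED_ARG_OP, 1, EXTENDED_ARG_OP, 2, LOAD_OP, 3])
print(list(decode_wordcode(code)))  # [(2, 100, 66051)], and 66051 == 0x010203
```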

    @vstinner
    Member

    New changeset 3a57eafd8401 by Serhiy Storchaka in branch 'default':
    Issue bpo-26647: Python interpreter now uses 16-bit wordcode instead of bytecode.
    https://hg.python.org/cpython/rev/3a57eafd8401

    Yeah, congrats Demur!

    @serhiy-storchaka
    Member

    I join in the congratulations, Demur! Thank you for your contributions.

    I left this issue open for updating the documentation and other polishing.

    @serprex
    Mannequin Author

    serprex mannequin commented May 25, 2016

    A documentation touch up for EXTENDED_ARG is included in bpo-27095

    @ncoghlan
    Contributor

    Chatting to Russell Keith-Magee, I realised the bytecode section in the devguide's description of the code generation pipeline may need some tweaks to account for the differences between 3.6 and earlier versions: https://docs.python.org/devguide/compiler.html#ast-to-cfg-to-bytecode

    @ncoghlan
    Contributor

    I switched the target component to Documentation to reflect that, as far as we know, this is feature complete from a functional perspective, but there hasn't yet been a review of the docs for bytecode references, nor a decision on whether we want to systematically switch to using "wordcode" instead.

    @ncoghlan ncoghlan added docs Documentation in the Doc dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels May 26, 2016
    @vstinner
    Member

    Hi, I ran the CPython benchmark suite (my fork modified to be more stable) on ed4eec682199 (patched) vs 7a7f54fe0698 (base). The patched version contains wordcode (issue bpo-26647) + 16-bit fetch for opcode and oparg (issue bpo-27097).

    The speedup is quite nice. Attached default-May26-03-05-10.log contains the full output.

    Faster (27):

    • unpack_sequence: 1.11x faster
    • simple_logging: 1.11x faster
    • silent_logging: 1.10x faster
    • formatted_logging: 1.09x faster
    • raytrace: 1.08x faster
    • chaos: 1.08x faster
    • etree_process: 1.08x faster
    • call_simple: 1.07x faster
    • mako_v2: 1.07x faster
    • tornado_http: 1.07x faster
    • nqueens: 1.07x faster
    • regex_compile: 1.06x faster
    • pathlib: 1.06x faster
    • 2to3: 1.06x faster
    • richards: 1.05x faster
    • spectral_norm: 1.05x faster
    • etree_generate: 1.05x faster
    • chameleon_v2: 1.04x faster
    • pickle_list: 1.03x faster
    • pickle_dict: 1.03x faster
    • regex_v8: 1.03x faster
    • go: 1.03x faster
    • call_method: 1.03x faster
    • django_v3: 1.03x faster
    • telco: 1.02x faster
    • json_load: 1.02x faster
    • call_method_unknown: 1.02x faster

    Slower (1):

    • fannkuch: 1.07x slower

    Not significant (14):

    • unpickle_list
    • startup_nosite
    • regex_effbot
    • pidigits
    • normal_startup
    • nbody
    • meteor_contest
    • json_dump_v2
    • float
    • fastunpickle
    • fastpickle
    • etree_parse
    • etree_iterparse
    • call_method_slots

    @serhiy-storchaka
    Member

    I think we should make a few more related changes:

    • Change the meaning of jump offsets: they should count code units (16-bit words), not bytes. This will extend the range addressed by short commands (from 256 bytes to 256 words) and simplify ceval.c.
    • Change f_lasti, tb_lasti, etc. to count code units instead of bytes.
    • Change the disassembler to show addresses in code units, not bytes.

    These changes break compatibility (already broken by switching to 16-bit bytecode). The first one breaks compatibility with compiled bytecode and needs incrementing the magic number. That is why I think we should do this in this issue.

    Which is better: one large patch, or separate, simpler patches for each stage?
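
    To make the bytes versus code units distinction concrete, here is a small sketch using the dis module. It assumes CPython 3.6 or later, where every instruction occupies a whole number of 2-byte units, and it only shows how a byte offset maps to a code unit index (offset // 2). It does not claim anything about which unit a given CPython version uses for jump arguments or f_lasti. The function sample is just a throwaway example.

```python
import dis


def sample(values):
    """Throwaway function whose bytecode contains a loop and a branch."""
    total = 0
    for v in values:
        if v:
            total += v
    return total


raw = sample.__code__.co_code

# Byte offsets as reported by the disassembler.
byte_offsets = [ins.offset for ins in dis.get_instructions(sample)]

# The same positions expressed in 16-bit code units.
unit_offsets = [offset // 2 for offset in byte_offsets]

print("code size: %d bytes = %d code units" % (len(raw), len(raw) // 2))
for b, u in zip(byte_offsets, unit_offsets):
    print("byte offset %3d  ->  code unit %3d" % (b, u))
```

    Counting in units means a one-byte argument can span twice as many bytes of code, which is the range extension described in the first bullet above.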

    @serhiy-storchaka serhiy-storchaka added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label May 26, 2016
    @serhiy-storchaka
    Member

    Here is the large patch (not including the generated Python/importlib.h and Python/importlib_external.h).

    @vstinner
    Member

    Serhiy: please open a new issue for your change. While it's related, it's
    different enough to deserve its own issue.

    By the way, please don't include the generated importlib .h files in your
    patches.

    @serhiy-storchaka
    Member

    bpo-27129.

    @serhiy-storchaka serhiy-storchaka removed the interpreter-core (Objects, Python, Grammar, and Parser dirs) label May 26, 2016
    @python-dev
    Mannequin

    python-dev mannequin commented Nov 24, 2016

    New changeset 303cedfb9e7a by Victor Stinner in branch '3.6':
    Fix _PyGen_yf()
    https://hg.python.org/cpython/rev/303cedfb9e7a

    @vstinner
    Member

    This issue is done: see issue bpo-27129 for the next step.

    @serhiy-storchaka
    Member

    I left this issue open for documenting the wordcode. I have now opened a separate issue, bpo-28810, for this.

    @abalkin abalkin added interpreter-core (Objects, Python, Grammar, and Parser dirs) and removed docs Documentation in the Doc dir labels Jan 22, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022