Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate decimal to use PEP 567 context variables #76811

Closed
1st1 opened this issue Jan 23, 2018 · 32 comments
Closed

Migrate decimal to use PEP 567 context variables #76811

1st1 opened this issue Jan 23, 2018 · 32 comments
Assignees
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@1st1
Copy link
Member

1st1 commented Jan 23, 2018

BPO 32630
Nosy @gvanrossum, @gpshead, @mdickinson, @vstinner, @ned-deily, @methane, @skrah, @1st1
PRs
  • bpo-32630: Use contextvars in the decimal module #5278
  • Files
  • xwith.py: small benchmark
  • xwith2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/skrah'
    closed_at = <Date 2018-01-27.18:47:18.042>
    created_at = <Date 2018-01-23.06:56:55.612>
    labels = ['3.7', 'type-feature', 'library']
    title = 'Migrate decimal to use PEP 567 context variables'
    updated_at = <Date 2020-03-01.20:18:29.215>
    user = 'https://github.com/1st1'

    bugs.python.org fields:

    activity = <Date 2020-03-01.20:18:29.215>
    actor = 'gregory.p.smith'
    assignee = 'skrah'
    closed = True
    closed_date = <Date 2018-01-27.18:47:18.042>
    closer = 'yselivanov'
    components = ['Library (Lib)']
    creation = <Date 2018-01-23.06:56:55.612>
    creator = 'yselivanov'
    dependencies = []
    files = ['47410', '47411']
    hgrepos = []
    issue_num = 32630
    keywords = ['patch']
    message_count = 32.0
    messages = ['310472', '310482', '310630', '310698', '310699', '310700', '310701', '310704', '310714', '310715', '310716', '310738', '310790', '310791', '310794', '310795', '310796', '310799', '310800', '310803', '310805', '310806', '310809', '310811', '310815', '310817', '310820', '310821', '310864', '310875', '310876', '363091']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'gregory.p.smith', 'mark.dickinson', 'vstinner', 'ned.deily', 'methane', 'skrah', 'yselivanov']
    pr_nums = ['5278']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue32630'
    versions = ['Python 3.7']

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 23, 2018

    PEP-567 allows decimal to be safely used in async/await code.

    I couldn't observe any performance impact by the proposed PR. The PR doesn't modify decimal context behaviour: instead of using a thread-local storage it now uses a context variable.

    @1st1 1st1 added 3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jan 23, 2018
    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 23, 2018

    I'll take a look.

    @skrah skrah mannequin self-assigned this Jan 23, 2018
    @1st1
    Copy link
    Member Author

    1st1 commented Jan 24, 2018

    Stefan, it would be great to have this committed before 3.7 feature freeze.

    The change is pretty straightforward -- we replaced threading.local() with a contextvar, which should be a backwards compatible change.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 25, 2018

    I realize that you had to fight massive mailing list distractions
    during the PEP discussions, but this is very close to the beta ...

    Let's start here:

    >>> from decimal import *
    ==18887== Invalid read of size 8
    ==18887==    at 0x5324E0: contextvar_new (context.c:744)
    ==18887==    by 0x53141A: PyContextVar_New (context.c:137)
    ==18887==    by 0xFED052B: PyInit__decimal (_decimal.c:5542)
    ==18887==    by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
    ==18887==    by 0x51F29F: _imp_create_dynamic_impl (import.c:2145)
    ==18887==    by 0x51A4BA: _imp_create_dynamic (import.c.h:289)
    ==18887==    by 0x43257A: _PyMethodDef_RawFastCallDict (call.c:530)
    ==18887==    by 0x432710: _PyCFunction_FastCallDict (call.c:582)
    ==18887==    by 0x432DD6: PyCFunction_Call (call.c:787)
    ==18887==    by 0x4FAA44: do_call_core (ceval.c:4659)
    ==18887==    by 0x4F58CC: _PyEval_EvalFrameDefault (ceval.c:3232)
    ==18887==    by 0x4E7F99: PyEval_EvalFrameEx (ceval.c:545)
    ==18887==  Address 0xcf589a8 is 8 bytes before a block of size 64 alloc'd
    ==18887==    at 0x4C2A9A1: malloc (vg_replace_malloc.c:299)
    ==18887==    by 0x470498: _PyMem_RawMalloc (obmalloc.c:75)
    ==18887==    by 0x470FFC: PyMem_RawMalloc (obmalloc.c:503)
    ==18887==    by 0x471DEF: _PyObject_Malloc (obmalloc.c:1560)
    ==18887==    by 0x471312: PyObject_Malloc (obmalloc.c:616)
    ==18887==    by 0x4A35D6: PyUnicode_New (unicodeobject.c:1293)
    ==18887==    by 0x4CA16B: _PyUnicodeWriter_PrepareInternal (unicodeobject.c:13423)
    ==18887==    by 0x4B1843: PyUnicode_DecodeUTF8Stateful (unicodeobject.c:4806)
    ==18887==    by 0x4A5E67: PyUnicode_FromString (unicodeobject.c:2105)
    ==18887==    by 0x5313F5: PyContextVar_New (context.c:133)
    ==18887==    by 0xFED052B: PyInit__decimal (_decimal.c:5542)
    ==18887==    by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
    ==18887==

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 25, 2018

    I realize that you had to fight massive mailing list distractions
    during the PEP discussions, but this is very close to the beta ...

    Oh thanks, but I see no reason for you to be condescending here.

    I cannot reproduce this on Mac OS / Linux. Are you sure you've built your Python correctly? Can you run 'make distclean; ./configure --with-pydebug; make -j4'?

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 25, 2018

    (Just in case I rebased my patch onto the latest master)

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 25, 2018

    I think I found what cause this, but I have no idea why it has surfaced only now :)

    https://github.com/python/cpython/pull/5326/files
    (see the added PyType_IS_GC(Py_TYPE(name)) check)

    I'll merge that PR and rebase the decimal patch again.

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 25, 2018

    I pushed a fix (already in the master branch) and rebased the patch once again. I expect it to work now :)

    @skrah skrah mannequin removed the release-blocker label Jan 25, 2018
    @1st1
    Copy link
    Member Author

    1st1 commented Jan 25, 2018

    Stefan, I do think that this is a release blocker. We want to get this change as early as possible to ensure that it's well tested.

    AFAIK Guido also wants decimal to be updated and well supported in async/await code.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 25, 2018

    Sure, and *I* am the one running the extended decimal test suite as we speak,
    not Guido.

    You are playing power games here, and you did that from the start by choosing
    the nosy list.

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 25, 2018

    Sure, and *I* am the one running the extended decimal test suite as we speak, not Guido.

    Thank you.

    You are playing power games here, and you did that from the start by choosing the nosy list.

    Please. I thought it was pretty much decided that we will update decimal if there is no significant performance degradation, so there's no need for a conspiracy. I put Guido to the nosy-list not because I want to force something, but just because we've discussed decimal and PEP 567/550 with him numerous times.

    @gvanrossum
    Copy link
    Member

    You guys both need to calm down.

    Stefan, what's your objection against this, assuming the crash is fixed?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 26, 2018

    Tests
    -----

    I ran some of my own tests (not even close to all), they seem fine.

    However, I could not find any tests for the added feature (safe
    use with async) though. We would be adding a new feature without
    tests.

    Performance
    -----------

    I'm getting a large slowdown:

    ./python Modules/_decimal/tests/bench.py

    bench.py


    patched: [0.199, 0.206, 0.198, 0.199, 0.197, 0.202, 0.198, 0.201, 0.213, 0.199]
    status-quo: [0.187, 0.184, 0.185, 0.183, 0.184, 0.188, 0.184, 0.183, 0.183, 0.185]

    slowdown: > 10%

    xwith.py
    --------

    patched: [0.535, 0.541, 0.523]
    status-quo: [0.412, 0.393, 0.375]

    slowdown: > 30%

    Given the performance issues I'm -1 for adding the feature at
    this point.

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 26, 2018

    However, I could not find any tests for the added feature (safe
    use with async) though. We would be adding a new feature without
    tests.

    This is no problem, I can add a few async/await tests.

    I'm getting a large slowdown:
    ./python Modules/_decimal/tests/bench.py
    [..]
    patched: [0.199, 0.206, 0.198, 0.199, 0.197, 0.202, 0.198, 0.201, 0.213, 0.199]
    status-quo: [0.187, 0.184, 0.185, 0.183, 0.184, 0.188, 0.184, 0.183, 0.183, 0.185]

    I'd like you to elaborate a bit more here. First, bench.py produces a completely different output from what you've quoted. How exactly did you compile these results? Are those numbers results of Pi calculation or factorial? Can you upload the actual script you used here (if there's one)?

    Second, here's my run of bench.py with contextvars and without: https://gist.github.com/1st1/1187fc58dfdef86e3cad8874e0894938

    I don't see any difference, left alone 10% slowdown.

    xwith.py
    --------

    patched: [0.535, 0.541, 0.523]
    status-quo: [0.412, 0.393, 0.375]

    This benchmark is specially constructed to profile creating decimal contexts and doing almost nothing with them.

    I've optimized PEP-567 for contextvar.get() operation, not contextvar.set (it's hard to make hamt.set() as fast as dict.set()). That way, if you have an some decimal code that performs actual calculations with decimal objects, the operation of looking up the current context is cheap.

    It's hard to imagine a situation, where a real decimal-related code just creates decimal contexts and does nothing else with them.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 26, 2018

    On Fri, Jan 26, 2018 at 09:06:38PM +0000, Yury Selivanov wrote:

    This benchmark is specially constructed to profile creating decimal contexts and doing almost nothing with the

    It is not constructed at all. It was the first thing I wrote down trying
    to play a bit with speed.

    Even the telco benchmark (where there's a lot of other stuff going
    on) slows down by around 7-8%.

    I did not hunt for these benchmarks. They are the first things I tried out. I
    cannot believe that you never saw a slowdown as claimed in your OP.

    @gvanrossum
    Copy link
    Member

    Guys. Please stop with the editorializing. "I cannot believe ..." (used
    essentially by both of you) is not constructive.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 26, 2018

    Guido, I have the feeling that the feature -- about which I was actually
    positive in the first place -- is being pushed aggressively with no
    respect for the module author.

    BTW, prec is changed quite frequently in decimal code, so if people
    want infix notation they also have to use many with-statements.

    @gvanrossum
    Copy link
    Member

    Stefan, I don't think a module author should retain veto over everything
    affecting their code forever. (We've had spectacular process failures with
    this in the past.) Please take a deep breath and patiently answer Yury's
    questions. If you two can't agree on this, the status quo wins, but it will
    be a blemish on your reputation if you just block it unilaterally. At the
    very least help Yury reproduce your timing results -- at this point the
    burden is on you since nobody else can reproduce them.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 26, 2018

    I have run about 1000 times more decimal benchmarks than both Yury and you. You attempt to hurt my reputation is laughable.

    Show me some top-performance code that you have written.

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 26, 2018

    Sorry Stefan, I never wanted this to look like "I'm pushing this without listening to Stefan". I apologize if it looked that way.

    I ran bm_telco on my machine before submitting the PR, and I indeed did not see any performance impact. I'll try again. I also have a idea of a micro-optimization that might make it a tiny bit faster.

    @gvanrossum
    Copy link
    Member

    Stefan this is unacceptable abuse. Please read the code of conduct.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 26, 2018

    Yury, would you be willing to work this out by email? -- I think it
    was you who I discussed the context-subclassing with and that was
    quite a pleasant experience.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 26, 2018

    Guido, I apologize for the outburst. I had the impression that
    msg310799 implicitly asserted my incompetence in benchmarking.

    @elprans
    Copy link
    Mannequin

    elprans mannequin commented Jan 26, 2018

    FWIW, I ran bm_telco with pyperformance on a benchmark-tuned system and did not observe the slowdown. Benchmarks were done on a release build (--enable-optimizations)

    $ sudo (which python3) -m perf system tune

    MASTER:

    $ pyperformance run --python=envs/3.7-master-pgo-lto/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-master.json
    Python benchmark suite 0.6.1

    [1/1] telco...
    INFO:root:Running /home/elvis/dev/python/performance/venv/cpython3.7-8cfe759dd297/bin/python -u /home/elvis/dev/python/performance/performance/benchmarks/bm_telco.py --rigorous --affinity=2,3 --output /tmp/tmpgszxc792
    .........................................
    telco: Mean +- std dev: 9.17 ms +- 0.32 ms

    MASTER + contextvars patch:

    $ pyperformance run --python=envs/3.7-master-pgo+lto+decimal-contextvars/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-contextvars.json
    Python benchmark suite 0.6.1

    [1/1] telco...
    INFO:root:Running /home/elvis/dev/python/performance/venv/cpython3.7-8a6fbdee5a5b/bin/python -u /home/elvis/dev/python/performance/performance/benchmarks/bm_telco.py --rigorous --affinity=2,3 --output /tmp/tmp8y4mivnp
    .........................................
    telco: Mean +- std dev: 9.29 ms +- 0.19 ms

    COMPARISON:

    ### telco ###
    Mean +- std dev: 9.17 ms +- 0.32 ms -> 9.29 ms +- 0.19 ms: 1.01x slower
    Not significant

    @elprans
    Copy link
    Mannequin

    elprans mannequin commented Jan 26, 2018

    Likewise, on the same builds, running _decimal/tests/bench.py does not show a significant difference: https://gist.github.com/elprans/fb31510ee28a3aa091aee3f42fe65e00

    @gvanrossum
    Copy link
    Member

    Apologies accepted. I did not imply that -- I was simply stating that Yury
    needed your help reproducing your result so he could do something about it.
    It seems you two are taking this offline so I trust that there will be no
    more barbs.

    @vstinner
    Copy link
    Member

    Since the root of the discussion is a performance regression, let me take a look since I also care of not regressing in term of performance. We (CPython core developers, as as team) spent a lot of time on optimizing CPython to make benchmarks like telco faster at each release. The good news is that Python 3.7 *is* faster than Python 3.6 on telco. If I recall correctly, it's not because of recent optimizations in the decimal module, but more general changes like CALL_METHOD optimization!

    Python master vs 3.6 (normalized on 3.6):

    https://speed.python.org/comparison/?exe=12%2BL%2Bmaster%2C12%2BL%2B3.6&ben=670&env=1%2C2&hor=false&bas=12%2BL%2B3.6&chart=normal+bars

    Graph of telco performance on master since April 2014 to January 2017:
    https://speed.python.org/timeline/#/?exe=12&ben=telco&env=1&revs=50&equid=off&quarts=on&extr=on

    20.2 ms => 14.1 ms, well done!

    If you are curious of reasons why Python became faster, see my documentation:
    http://pyperformance.readthedocs.io/cpython_results_2017.html

    Or even my talk at Pycon 2017:
    https://www.youtube.com/watch?v=d65dCD3VH9Q&t=957s

    Sorry, I moved off topic. Let's move back to this measuring the performance of this issue...

    --

    I rewrote xwith.py using my perf module to use CPU pinning (on my isolated CPUs), automatic calibration of the number of loops, ignore the first "warmup" value, spawn 20 processes, compute the average and standard deviation, etc. => see attached xwidth2.py

    Results on my laptop with 2 physical cores isolated for best benchmark stability (*):

    vstinner@apu$ ./python -m perf compare_to master.json pr5278.json
    Mean +- std dev: [master] 1.86 us +- 0.03 us -> [pr5278] 2.27 us +- 0.04 us: 1.22x slower (+22%)

    Note: master is the commit 29a7df7 and I rebased PR 5278 on top on this commit.

    (*) http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results

    This is obvious the *worst* case: a *micro* benchmark using local contexts and modifying this local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modying a context.

    The question here is if the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.).

    Except for a few small projects, I rarely use decimal, so I'm unable to judge that.

    But just to add my 2 cents, I never used "with localcontext()", I don't see the point of this tool in my short applications. I prefer to modify directly the current context (getcontext()), and only modify this context *once*, at startup. For example, set the rounding mode and set the precision, and that's all.

    --

    The Python benchmark suite does have a benchmark dedicated to the decimal module:
    http://pyperformance.readthedocs.io/benchmarks.html#telco

    I ran this benchmark on PR 5278:

    vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json
    Benchmark hidden because not significant (1): telco

    ... not significant. Honestly, I'm not surprised at all:

    • telco benchmark doesn't modify the context in the hot code, only *outside* the benchmark
    • telco likely spends most of its runtime in computing numbers (sum += x; d = d.quantize(...), etc.) and converting Decimal to string. I don't know the decimal module, but I guess that it requires to *read* the current context. But getting the context is likely efficient and not significant compared to the cost of other operations.

    FYI timings can be seen in verbose mode:

    vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v
    Mean +- std dev: [telco_master] 10.7 ms +- 0.4 ms -> [telco_pr5278] 10.7 ms +- 0.4 ms: 1.00x faster (-0%)
    Not significant!

    @vstinner
    Copy link
    Member

    Likewise, on the same builds, running _decimal/tests/bench.py does not show a significant difference: https://gist.github.com/elprans/fb31510ee28a3aa091aee3f42fe65e00

    Note: it may be interesting to rewrite this benchmark my perf module to be able to easily check if a benchmark result is significant.

    http://perf.readthedocs.io/en/latest/cli.html#perf-compare-to

    "perf determines whether two samples differ significantly using a Student’s two-sample, two-tailed t-test with alpha equals to 0.95."

    => https://en.wikipedia.org/wiki/Student's_t-test

    Usually, I consider that between 5% slower and 5% faster is not significant. But it depends how the benchmark was run, it depends on the type of benchmark, etc. Here I don't know bench.py so I cannot judge.

    For example, for an optimization, I'm more interested by an optimization making a benchmark 10% faster ;-)

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Jan 27, 2018

    On Fri, Jan 26, 2018 at 11:11:00PM +0000, STINNER Victor wrote:

    vstinner@apu$ ./python -m perf compare_to master.json pr5278.json
    Mean +- std dev: [master] 1.86 us +- 0.03 us -> [pr5278] 2.27 us +- 0.04 us: 1.22x slower (+22%)

    Note: master is the commit 29a7df7 and I rebased PR 5278 on top on this commit.

    Thank you and Elvis for running the benchmarks. Yes, the exact version does seem
    important -- I have been getting some differences based on the checkout.

    This is obvious the *worst* case: a *micro* benchmark using local contexts and modifying this local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modying a context.

    The question here is if the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.).

    Yes, that's the big question. In the generator discussions people were advised
    to use "with" whenever possible, so I assume short blocks *will* be used.

    I would use the context functions, which would not require PEP-567 at all.
    This means that I'm somewhat okay with excessive with-statements being a
    bit slower.

    vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v
    Mean +- std dev: [telco_master] 10.7 ms +- 0.4 ms -> [telco_pr5278] 10.7 ms +- 0.4 ms: 1.00x faster (-0%)
    Not significant!

    Okay. In my above reference to telco, I ran the "telco.py full" command
    from http://www.bytereef.org/mpdecimal/quickstart.html#telco-benchmark .

    The numbers I posted weren't cooked, but I have a hard time reproducing
    them myself now consistently with the latest revisions, so let's declare
    telco.py and bench.py a tie.

    This means that I no longer have any objections, so Yury, please go ahead
    and merge the PR!

    Stefan Krah

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 27, 2018

    Thank you, Stefan. I've updated the PR with an asyncio+decimal test and run tests in refleak mode to make sure there's no regression there.

    If during the beta/rc period we see that contextvars isn't stable enough or something I'll revert this change before 3.7.0 myself, so that decimal users will not be disturbed.

    I'll merge the PR once the CI is green.

    Yes, that's the big question. In the generator discussions people were advised to use "with" whenever possible, so I assume short blocks *will* be used.

    Yes, I used decimal examples all the time to showcase how context is supposed to work with generators. Most of those examples were specifically constructed to illustrate some point, but I don't think that real-world code uses a 'with localcontext()' statement in every function.

    Unfortunately there's no way (at least known to me) to make 'ContextVar.set()' faster than it is now. I use HAMT which guarantees that all set operations will have O(log n) performance; the other known approach is to use copy-on-write (as in .NET), but that has an O(n) ContextVar.set() performance. So I guess a slightly slower 'with localcontext()' is the price to pay to make decimal easier to use for async/await code.

    @1st1
    Copy link
    Member Author

    1st1 commented Jan 27, 2018

    New changeset f13f12d by Yury Selivanov in branch 'master':
    bpo-32630: Use contextvars in decimal (GH-5278)
    f13f12d

    @1st1 1st1 closed this as completed Jan 27, 2018
    @gpshead
    Copy link
    Member

    gpshead commented Mar 1, 2020

    FYI - this appears to have caused a regression - https://bugs.python.org/issue39776

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants