Migrate decimal to use PEP 567 context variables #76811
PEP 567 allows decimal to be used safely in async/await code. I could not observe any performance impact from the proposed PR. The PR doesn't modify decimal context behaviour: instead of using thread-local storage, it now uses a context variable. |
I'll take a look. |
Stefan, it would be great to have this committed before the 3.7 feature freeze. The change is pretty straightforward -- we replaced threading.local() with a contextvar, which should be a backwards-compatible change. |
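For illustration, the substance of that swap can be sketched in pure Python (hypothetical helper names; the real change is in C, in Modules/_decimal/_decimal.c):

    import threading
    from contextvars import ContextVar
    from decimal import Context

    # Before: the current decimal context is stored per OS thread, so all
    # asyncio tasks running on one thread share (and clobber) the same context.
    _local = threading.local()

    def getcontext_before():
        try:
            return _local.context
        except AttributeError:
            _local.context = Context()
            return _local.context

    # After: the context is stored in a ContextVar, which asyncio tracks
    # per task, so each task sees its own current context.
    _current_context = ContextVar('decimal_context')

    def getcontext_after():
        try:
            return _current_context.get()
        except LookupError:
            ctx = Context()
            _current_context.set(ctx)
            return ctx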
I realize that you had to fight massive mailing list distractions. Let's start here:

>>> from decimal import *
==18887== Invalid read of size 8
==18887== at 0x5324E0: contextvar_new (context.c:744)
==18887== by 0x53141A: PyContextVar_New (context.c:137)
==18887== by 0xFED052B: PyInit__decimal (_decimal.c:5542)
==18887== by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
==18887== by 0x51F29F: _imp_create_dynamic_impl (import.c:2145)
==18887== by 0x51A4BA: _imp_create_dynamic (import.c.h:289)
==18887== by 0x43257A: _PyMethodDef_RawFastCallDict (call.c:530)
==18887== by 0x432710: _PyCFunction_FastCallDict (call.c:582)
==18887== by 0x432DD6: PyCFunction_Call (call.c:787)
==18887== by 0x4FAA44: do_call_core (ceval.c:4659)
==18887== by 0x4F58CC: _PyEval_EvalFrameDefault (ceval.c:3232)
==18887== by 0x4E7F99: PyEval_EvalFrameEx (ceval.c:545)
==18887== Address 0xcf589a8 is 8 bytes before a block of size 64 alloc'd
==18887== at 0x4C2A9A1: malloc (vg_replace_malloc.c:299)
==18887== by 0x470498: _PyMem_RawMalloc (obmalloc.c:75)
==18887== by 0x470FFC: PyMem_RawMalloc (obmalloc.c:503)
==18887== by 0x471DEF: _PyObject_Malloc (obmalloc.c:1560)
==18887== by 0x471312: PyObject_Malloc (obmalloc.c:616)
==18887== by 0x4A35D6: PyUnicode_New (unicodeobject.c:1293)
==18887== by 0x4CA16B: _PyUnicodeWriter_PrepareInternal (unicodeobject.c:13423)
==18887== by 0x4B1843: PyUnicode_DecodeUTF8Stateful (unicodeobject.c:4806)
==18887== by 0x4A5E67: PyUnicode_FromString (unicodeobject.c:2105)
==18887== by 0x5313F5: PyContextVar_New (context.c:133)
==18887== by 0xFED052B: PyInit__decimal (_decimal.c:5542)
==18887== by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
==18887== |
Oh thanks, but I see no reason for you to be condescending here. I cannot reproduce this on Mac OS / Linux. Are you sure you've built your Python correctly? Can you run 'make distclean; ./configure --with-pydebug; make -j4'? |
(Just in case, I rebased my patch onto the latest master.) |
I think I found what caused this, but I have no idea why it has surfaced only now :) https://github.com/python/cpython/pull/5326/files I'll merge that PR and rebase the decimal patch again. |
I pushed a fix (already in the master branch) and rebased the patch once again. I expect it to work now :) |
Stefan, I do think that this is a release blocker. We want to get this change in as early as possible to ensure that it's well tested. AFAIK Guido also wants decimal to be updated and well supported in async/await code. |
Sure, and *I* am the one running the extended decimal test suite as we speak. You are playing power games here, and you did that from the start by choosing ... |
Thank you.
Please. I thought it was pretty much decided that we would update decimal if there was no significant performance degradation, so there's no need for a conspiracy. I put Guido on the nosy list not because I want to force something, but because we've discussed decimal and PEP 567/550 with him numerous times. |
You guys both need to calm down. Stefan, what's your objection against this, assuming the crash is fixed? |
Tests: I ran some of my own tests (not even close to all); they seem fine. However, I could not find any tests for the added feature (safe use in async/await code).

Performance: I'm getting a large slowdown:

./python Modules/_decimal/tests/bench.py

bench.py patched: [0.199, 0.206, 0.198, 0.199, 0.197, 0.202, 0.198, 0.201, 0.213, 0.199]
slowdown: > 10%

xwith.py patched: [0.535, 0.541, 0.523]
slowdown: > 30%

Given the performance issues I'm -1 for adding the feature at this point. |
This is no problem, I can add a few async/await tests.
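Such a test might look roughly like this (a sketch only, assuming Python 3.7 asyncio and the contextvar-backed decimal; the test actually added to the PR may differ):

    import asyncio
    import decimal

    async def task(prec):
        # localcontext() sets the current context; with PEP 567 that set is
        # confined to this task. With thread-local storage it would leak to
        # the other task across the await below, since both share one thread.
        with decimal.localcontext() as ctx:
            ctx.prec = prec
            await asyncio.sleep(0.01)  # suspension point: the other task runs
            return str(decimal.Decimal(1) / decimal.Decimal(7))

    async def main():
        r4, r12 = await asyncio.gather(task(4), task(12))
        assert r4 == '0.1429'            # 4 significant digits
        assert r12 == '0.142857142857'   # 12 significant digits

    asyncio.run(main())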
I'd like you to elaborate a bit more here. First, bench.py produces completely different output from what you've quoted. How exactly did you compile these results? Are those numbers the results of the pi calculation or of the factorial? Can you upload the actual script you used here (if there is one)? Second, here's my run of bench.py with contextvars and without: https://gist.github.com/1st1/1187fc58dfdef86e3cad8874e0894938 I don't see any difference, let alone a 10% slowdown.
This benchmark is specially constructed to profile creating decimal contexts and doing almost nothing with them. I've optimized PEP 567 for the contextvar.get() operation, not contextvar.set() (it's hard to make hamt.set() as fast as dict.set()). That way, if you have some decimal code that performs actual calculations with decimal objects, the operation of looking up the current context is cheap. It's hard to imagine a situation where real decimal-related code just creates decimal contexts and does nothing else with them. |
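To make that asymmetry concrete, a throwaway microbenchmark along these lines could be used (illustrative only; absolute numbers will vary by machine and build):

    import timeit
    from contextvars import ContextVar

    var = ContextVar('v')
    var.set(1)

    # get() is the hot path for decimal arithmetic (looking up the current
    # context); set() pays for the persistent HAMT update on every call.
    print('get:', timeit.timeit(var.get, number=10**6))
    print('set:', timeit.timeit(lambda: var.set(1), number=10**6))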
On Fri, Jan 26, 2018 at 09:06:38PM +0000, Yury Selivanov wrote:
It is not constructed at all. It was the first thing I wrote down trying ...

Even the telco benchmark (where there's a lot of other stuff going ...

I did not hunt for these benchmarks. They are the first things I tried out. I ... |
Guys. Please stop with the editorializing. "I cannot believe ..." (used ...) |
Guido, I have the feeling that the feature -- about which I was actually ...

BTW, prec is changed quite frequently in decimal code, so if people ... |
Stefan, I don't think a module author should retain veto over everything ... |
I have run about 1000 times more decimal benchmarks than both Yury and you. Your attempt to hurt my reputation is laughable. Show me some top-performance code that you have written. |
Sorry Stefan, I never wanted this to look like "I'm pushing this without listening to Stefan". I apologize if it looked that way. I ran bm_telco on my machine before submitting the PR, and I indeed did not see any performance impact. I'll try again. I also have an idea for a micro-optimization that might make it a tiny bit faster. |
Stefan this is unacceptable abuse. Please read the code of conduct. |
Yury, would you be willing to work this out by email? -- I think it ... |
Guido, I apologize for the outburst. I had the impression that ... |
FWIW, I ran bm_telco with pyperformance on a benchmark-tuned system and did not observe the slowdown. Benchmarks were done on a release build (--enable-optimizations).

$ sudo $(which python3) -m perf system tune

MASTER:

$ pyperformance run --python=envs/3.7-master-pgo-lto/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-master.json
Python benchmark suite 0.6.1
[1/1] telco...

MASTER + contextvars patch:

$ pyperformance run --python=envs/3.7-master-pgo+lto+decimal-contextvars/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-contextvars.json
Python benchmark suite 0.6.1
[1/1] telco...

COMPARISON:

### telco ### |
Likewise, on the same builds, running _decimal/tests/bench.py does not show a significant difference: https://gist.github.com/elprans/fb31510ee28a3aa091aee3f42fe65e00 |
Apologies accepted. I did not imply that -- I was simply stating that Yury ... |
Since the root of the discussion is a performance regression, let me take a look, since I also care about not regressing in terms of performance. We (CPython core developers, as a team) spent a lot of time on optimizing CPython to make benchmarks like telco faster at each release. The good news is that Python 3.7 *is* faster than Python 3.6 on telco. If I recall correctly, it's not because of recent optimizations in the decimal module, but because of more general changes like the CALL_METHOD optimization!

Python master vs 3.6 (normalized on 3.6):

Graph of telco performance on master from April 2014 to January 2017: 20.2 ms => 14.1 ms, well done!

If you are curious about the reasons why Python became faster, see my documentation:

Or even my talk at PyCon 2017:

Sorry, I moved off topic. Let's get back to measuring the performance of this issue...

--

I rewrote xwith.py using my perf module to use CPU pinning (on my isolated CPUs), automatic calibration of the number of loops, ignoring the first "warmup" value, spawning 20 processes, computing the average and standard deviation, etc. => see attached xwith2.py

Results on my laptop with 2 physical cores isolated for best benchmark stability (*):

vstinner@apu$ ./python -m perf compare_to master.json pr5278.json

Note: master is the commit 29a7df7 and I rebased PR 5278 on top of this commit.

(*) http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results

This is obviously the *worst* case: a *micro* benchmark using local contexts and modifying the local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modifying a context.

The question here is whether the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.). Except for a few small projects, I rarely use decimal, so I'm unable to judge that. But just to add my 2 cents: I have never used "with localcontext()"; I don't see the point of this tool in my short applications. I prefer to modify the current context (getcontext()) directly, and to modify it only *once*, at startup. For example, set the rounding mode and the precision, and that's all.

--

The Python benchmark suite does have a benchmark dedicated to the decimal module:

I ran this benchmark on PR 5278:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json
... not significant.

Honestly, I'm not surprised at all: |
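To make Victor's contrast concrete, here is a small sketch of the two styles (standard decimal API; the `rounded` helper is hypothetical):

    from decimal import Decimal, ROUND_HALF_EVEN, getcontext, localcontext

    # Style 1: configure the current context once, at startup; later code
    # only does arithmetic and never touches the context variable again.
    ctx = getcontext()
    ctx.prec = 28
    ctx.rounding = ROUND_HALF_EVEN

    total = Decimal('1.10') + Decimal('2.20')

    # Style 2: the pattern the xwith microbenchmark stresses; every call
    # sets (and on exit restores) the current context.
    def rounded(x):
        with localcontext() as c:
            c.prec = 4
            return +x  # unary plus re-rounds x to the local precision

    print(total, rounded(Decimal('3.14159')))  # 3.30 3.142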
FYI timings can be seen in verbose mode:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v |
Note: it may be interesting to rewrite this benchmark with my perf module, to be able to easily check whether a benchmark result is significant. http://perf.readthedocs.io/en/latest/cli.html#perf-compare-to

"perf determines whether two samples differ significantly using a Student's two-sample, two-tailed t-test with alpha equals to 0.95." => https://en.wikipedia.org/wiki/Student's_t-test

Usually, I consider that anything between 5% slower and 5% faster is not significant. But it depends on how the benchmark was run, on the type of benchmark, etc. Here I don't know bench.py, so I cannot judge. For example, for an optimization, I'm more interested in one making a benchmark 10% faster ;-) |
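For reference, the kind of check perf performs can be approximated with SciPy (perf ships its own implementation; the timing values below are hypothetical):

    from scipy import stats

    master = [14.2, 14.1, 14.3, 14.2, 14.1]    # hypothetical timings (ms)
    patched = [14.3, 14.2, 14.4, 14.2, 14.3]

    # Two-sample, two-tailed Student's t-test; with alpha = 0.05 a p-value
    # above 0.05 means the observed difference is not significant.
    t_stat, p_value = stats.ttest_ind(master, patched)
    print(f"t={t_stat:.3f} p={p_value:.3f} significant={p_value < 0.05}")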
On Fri, Jan 26, 2018 at 11:11:00PM +0000, STINNER Victor wrote:
Thank you and Elvis for running the benchmarks. Yes, the exact version does seem ...
Yes, that's the big question. In the generator discussions people were advised ... I would use the context functions, which would not require PEP 567 at all. |
Okay. In my above reference to telco, I ran the "telco.py full" command ...

The numbers I posted weren't cooked, but I have a hard time reproducing ...

This means that I no longer have any objections, so Yury, please go ahead.

Stefan Krah |
Thank you, Stefan. I've updated the PR with an asyncio+decimal test and ran the tests in refleak mode to make sure there's no regression there. If during the beta/rc period we see that contextvars isn't stable enough or something, I'll revert this change before 3.7.0 myself, so that decimal users will not be disturbed. I'll merge the PR once the CI is green.
Yes, I used decimal examples all the time to showcase how context is supposed to work with generators. Most of those examples were specifically constructed to illustrate some point, but I don't think that real-world code uses a 'with localcontext()' statement in every function. Unfortunately there's no way (at least none known to me) to make 'ContextVar.set()' faster than it is now. I use a HAMT, which guarantees that all set operations have O(log n) performance; the other known approach is to use copy-on-write (as in .NET), but that has O(n) ContextVar.set() performance. So I guess a slightly slower 'with localcontext()' is the price to pay to make decimal easier to use in async/await code. |
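A toy model of the copy-on-write alternative (not CPython code) shows where the O(n) comes from:

    class CoWContext:
        """Toy copy-on-write execution context (illustrative only)."""

        def __init__(self, data=None):
            self._data = data if data is not None else {}

        def set(self, var, value):
            # O(n): every set() duplicates all n existing entries, which is
            # the cost a HAMT avoids by sharing structure between versions.
            new_data = dict(self._data)
            new_data[var] = value
            return CoWContext(new_data)

        def get(self, var):
            # O(1) average: a plain dict lookup; this is why copy-on-write
            # buys a faster get() at the expense of set().
            return self._data[var]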
FYI - this appears to have caused a regression - https://bugs.python.org/issue39776 |