Migrate decimal to use PEP 567 context variables #76811
PEP 567 allows decimal to be used safely in async/await code. I could not observe any performance impact from the proposed PR. The PR doesn't modify decimal context behaviour: instead of using thread-local storage, it now uses a context variable. |
I'll take a look. |
Stefan, it would be great to have this committed before the 3.7 feature freeze. The change is pretty straightforward -- we replaced threading.local() with a contextvar, which should be a backwards-compatible change. |
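For illustration, the substance of that swap can be sketched in pure Python (hypothetical helper names; the real change is in C, in Modules/_decimal/_decimal.c):

    import threading
    from contextvars import ContextVar
    from decimal import Context

    # Before: the current decimal context is stored per OS thread, so all
    # asyncio tasks running on one thread share (and clobber) the same context.
    _local = threading.local()

    def getcontext_before():
        try:
            return _local.context
        except AttributeError:
            _local.context = Context()
            return _local.context

    # After: the context is stored in a ContextVar, which asyncio tracks
    # per task, so each task sees its own current context.
    _current_context = ContextVar('decimal_context')

    def getcontext_after():
        try:
            return _current_context.get()
        except LookupError:
            ctx = Context()
            _current_context.set(ctx)
            return ctx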
I realize that you had to fight massive mailing list distractions. Let's start here:

>>> from decimal import *
==18887== Invalid read of size 8
==18887== at 0x5324E0: contextvar_new (context.c:744)
==18887== by 0x53141A: PyContextVar_New (context.c:137)
==18887== by 0xFED052B: PyInit__decimal (_decimal.c:5542)
==18887== by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
==18887== by 0x51F29F: _imp_create_dynamic_impl (import.c:2145)
==18887== by 0x51A4BA: _imp_create_dynamic (import.c.h:289)
==18887== by 0x43257A: _PyMethodDef_RawFastCallDict (call.c:530)
==18887== by 0x432710: _PyCFunction_FastCallDict (call.c:582)
==18887== by 0x432DD6: PyCFunction_Call (call.c:787)
==18887== by 0x4FAA44: do_call_core (ceval.c:4659)
==18887== by 0x4F58CC: _PyEval_EvalFrameDefault (ceval.c:3232)
==18887== by 0x4E7F99: PyEval_EvalFrameEx (ceval.c:545)
==18887== Address 0xcf589a8 is 8 bytes before a block of size 64 alloc'd
==18887== at 0x4C2A9A1: malloc (vg_replace_malloc.c:299)
==18887== by 0x470498: _PyMem_RawMalloc (obmalloc.c:75)
==18887== by 0x470FFC: PyMem_RawMalloc (obmalloc.c:503)
==18887== by 0x471DEF: _PyObject_Malloc (obmalloc.c:1560)
==18887== by 0x471312: PyObject_Malloc (obmalloc.c:616)
==18887== by 0x4A35D6: PyUnicode_New (unicodeobject.c:1293)
==18887== by 0x4CA16B: _PyUnicodeWriter_PrepareInternal (unicodeobject.c:13423)
==18887== by 0x4B1843: PyUnicode_DecodeUTF8Stateful (unicodeobject.c:4806)
==18887== by 0x4A5E67: PyUnicode_FromString (unicodeobject.c:2105)
==18887== by 0x5313F5: PyContextVar_New (context.c:133)
==18887== by 0xFED052B: PyInit__decimal (_decimal.c:5542)
==18887== by 0x51FC56: _PyImport_LoadDynamicModuleWithSpec (importdl.c:159)
==18887== |
Oh thanks, but I see no reason for you to be condescending here. I cannot reproduce this on Mac OS / Linux. Are you sure you've built your Python correctly? Can you run 'make distclean; ./configure --with-pydebug; make -j4'? |
(Just in case, I rebased my patch onto the latest master.) |
I think I found what caused this, but I have no idea why it has surfaced only now :) https://github.com/python/cpython/pull/5326/files I'll merge that PR and rebase the decimal patch again. |
I pushed a fix (already in the master branch) and rebased the patch once again. I expect it to work now :) |
Stefan, I do think that this is a release blocker. We want to get this change in as early as possible to ensure that it's well tested. AFAIK Guido also wants decimal to be updated and well supported in async/await code. |
Sure, and *I* am the one running the extended decimal test suite as we speak. You are playing power games here, and you did that from the start by choosing ... |
Thank you.
Please. I thought it was pretty much decided that we would update decimal if there was no significant performance degradation, so there's no need for a conspiracy. I put Guido on the nosy list not because I want to force something, but because we've discussed decimal and PEP 567/550 with him numerous times. |
You guys both need to calm down. Stefan, what's your objection against this, assuming the crash is fixed? |
Tests: I ran some of my own tests (not even close to all); they seem fine. However, I could not find any tests for the added feature (safe use in async/await code).

Performance: I'm getting a large slowdown:

./python Modules/_decimal/tests/bench.py

bench.py patched: [0.199, 0.206, 0.198, 0.199, 0.197, 0.202, 0.198, 0.201, 0.213, 0.199]
slowdown: > 10%

xwith.py patched: [0.535, 0.541, 0.523]
slowdown: > 30%

Given the performance issues I'm -1 for adding the feature at this point. |
This is no problem, I can add a few async/await tests.
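Such a test might look roughly like this (a sketch only, assuming Python 3.7 asyncio and the contextvar-backed decimal; the test actually added to the PR may differ):

    import asyncio
    import decimal

    async def task(prec):
        # localcontext() sets the current context; with PEP 567 that set is
        # confined to this task. With thread-local storage it would leak to
        # the other task across the await below, since both share one thread.
        with decimal.localcontext() as ctx:
            ctx.prec = prec
            await asyncio.sleep(0.01)  # suspension point: the other task runs
            return str(decimal.Decimal(1) / decimal.Decimal(7))

    async def main():
        r4, r12 = await asyncio.gather(task(4), task(12))
        assert r4 == '0.1429'            # 4 significant digits
        assert r12 == '0.142857142857'   # 12 significant digits

    asyncio.run(main())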
I'd like you to elaborate a bit more here. First, bench.py produces completely different output from what you've quoted. How exactly did you compile these results? Are those numbers the results of the pi calculation or of the factorial? Can you upload the actual script you used here (if there is one)? Second, here's my run of bench.py with contextvars and without: https://gist.github.com/1st1/1187fc58dfdef86e3cad8874e0894938 I don't see any difference, let alone a 10% slowdown.
This benchmark is specially constructed to profile creating decimal contexts and doing almost nothing with them. I've optimized PEP 567 for the contextvar.get() operation, not contextvar.set() (it's hard to make hamt.set() as fast as dict.set()). That way, if you have some decimal code that performs actual calculations with decimal objects, the operation of looking up the current context is cheap. It's hard to imagine a situation where real decimal-related code just creates decimal contexts and does nothing else with them. |
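To make that asymmetry concrete, a throwaway microbenchmark along these lines could be used (illustrative only; absolute numbers will vary by machine and build):

    import timeit
    from contextvars import ContextVar

    var = ContextVar('v')
    var.set(1)

    # get() is the hot path for decimal arithmetic (looking up the current
    # context); set() pays for the persistent HAMT update on every call.
    print('get:', timeit.timeit(var.get, number=10**6))
    print('set:', timeit.timeit(lambda: var.set(1), number=10**6))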
On Fri, Jan 26, 2018 at 09:06:38PM +0000, Yury Selivanov wrote:
It is not constructed at all. It was the first thing I wrote down trying ...

Even the telco benchmark (where there's a lot of other stuff going ...

I did not hunt for these benchmarks. They are the first things I tried out. I ... |
Guys. Please stop with the editorializing. "I cannot believe ..." (used ...) |
Guido, I have the feeling that the feature -- about which I was actually ...

BTW, prec is changed quite frequently in decimal code, so if people ... |
Stefan, I don't think a module author should retain veto over everything ... |
I have run about 1000 times more decimal benchmarks than both Yury and you. Your attempt to hurt my reputation is laughable. Show me some top-performance code that you have written. |
Sorry Stefan, I never wanted this to look like "I'm pushing this without listening to Stefan". I apologize if it looked that way. I ran bm_telco on my machine before submitting the PR, and I indeed did not see any performance impact. I'll try again. I also have an idea for a micro-optimization that might make it a tiny bit faster. |
Stefan this is unacceptable abuse. Please read the code of conduct. |
Yury, would you be willing to work this out by email? -- I think it ... |
Guido, I apologize for the outburst. I had the impression that ... |
FWIW, I ran bm_telco with pyperformance on a benchmark-tuned system and did not observe the slowdown. Benchmarks were done on a release build (--enable-optimizations).

$ sudo $(which python3) -m perf system tune

MASTER:

$ pyperformance run --python=envs/3.7-master-pgo-lto/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-master.json
Python benchmark suite 0.6.1
[1/1] telco...

MASTER + contextvars patch:

$ pyperformance run --python=envs/3.7-master-pgo+lto+decimal-contextvars/prefix/bin/python3.7m --affinity=2,3 --rigorous --benchmarks=telco -o json/3.7-contextvars.json
Python benchmark suite 0.6.1
[1/1] telco...

COMPARISON:

### telco ### |
Likewise, on the same builds, running _decimal/tests/bench.py does not show a significant difference: https://gist.github.com/elprans/fb31510ee28a3aa091aee3f42fe65e00 |
Apologies accepted. I did not imply that -- I was simply stating that Yury ... |
Since the root of the discussion is a performance regression, let me take a look, since I also care about not regressing in terms of performance. We (CPython core developers, as a team) spent a lot of time on optimizing CPython to make benchmarks like telco faster at each release. The good news is that Python 3.7 *is* faster than Python 3.6 on telco. If I recall correctly, it's not because of recent optimizations in the decimal module, but because of more general changes like the CALL_METHOD optimization!

Python master vs 3.6 (normalized on 3.6):

Graph of telco performance on master from April 2014 to January 2017: 20.2 ms => 14.1 ms, well done!

If you are curious about the reasons why Python became faster, see my documentation:

Or even my talk at PyCon 2017:

Sorry, I moved off topic. Let's get back to measuring the performance of this issue...

--

I rewrote xwith.py using my perf module to use CPU pinning (on my isolated CPUs), automatic calibration of the number of loops, ignoring the first "warmup" value, spawning 20 processes, computing the average and standard deviation, etc. => see attached xwith2.py

Results on my laptop with 2 physical cores isolated for best benchmark stability (*):

vstinner@apu$ ./python -m perf compare_to master.json pr5278.json

Note: master is the commit 29a7df7 and I rebased PR 5278 on top of this commit.

(*) http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results

This is obviously the *worst* case: a *micro* benchmark using local contexts and modifying the local context. In this case, I understand that this microbenchmark basically measures the overhead of contextvars on modifying a context.

The question here is whether the bottleneck of applications using decimal is the code modifying the context or the code computing numbers (a+b, a*b, a/b, etc.). Except for a few small projects, I rarely use decimal, so I'm unable to judge that. But just to add my 2 cents: I have never used "with localcontext()"; I don't see the point of this tool in my short applications. I prefer to modify the current context (getcontext()) directly, and to modify it only *once*, at startup. For example, set the rounding mode and the precision, and that's all.

--

The Python benchmark suite does have a benchmark dedicated to the decimal module:

I ran this benchmark on PR 5278:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json
... not significant.

Honestly, I'm not surprised at all: |
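To make Victor's contrast concrete, here is a small sketch of the two styles (standard decimal API; the `rounded` helper is hypothetical):

    from decimal import Decimal, ROUND_HALF_EVEN, getcontext, localcontext

    # Style 1: configure the current context once, at startup; later code
    # only does arithmetic and never touches the context variable again.
    ctx = getcontext()
    ctx.prec = 28
    ctx.rounding = ROUND_HALF_EVEN

    total = Decimal('1.10') + Decimal('2.20')

    # Style 2: the pattern the xwith microbenchmark stresses; every call
    # sets (and on exit restores) the current context.
    def rounded(x):
        with localcontext() as c:
            c.prec = 4
            return +x  # unary plus re-rounds x to the local precision

    print(total, rounded(Decimal('3.14159')))  # 3.30 3.142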
FYI timings can be seen in verbose mode:

vstinner@apu$ ./python -m perf compare_to telco_master.json telco_pr5278.json -v |
Note: it may be interesting to rewrite this benchmark with my perf module, to be able to easily check whether a benchmark result is significant. http://perf.readthedocs.io/en/latest/cli.html#perf-compare-to

"perf determines whether two samples differ significantly using a Student's two-sample, two-tailed t-test with alpha equals to 0.95." => https://en.wikipedia.org/wiki/Student's_t-test

Usually, I consider that anything between 5% slower and 5% faster is not significant. But it depends on how the benchmark was run, on the type of benchmark, etc. Here I don't know bench.py, so I cannot judge. For example, for an optimization, I'm more interested in one making a benchmark 10% faster ;-) |
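For reference, the kind of check perf performs can be approximated with SciPy (perf ships its own implementation; the timing values below are hypothetical):

    from scipy import stats

    master = [14.2, 14.1, 14.3, 14.2, 14.1]    # hypothetical timings (ms)
    patched = [14.3, 14.2, 14.4, 14.2, 14.3]

    # Two-sample, two-tailed Student's t-test; with alpha = 0.05 a p-value
    # above 0.05 means the observed difference is not significant.
    t_stat, p_value = stats.ttest_ind(master, patched)
    print(f"t={t_stat:.3f} p={p_value:.3f} significant={p_value < 0.05}")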
On Fri, Jan 26, 2018 at 11:11:00PM +0000, STINNER Victor wrote:
Thank you and Elvis for running the benchmarks. Yes, the exact version does seem ...
Yes, that's the big question. In the generator discussions people were advised ... I would use the context functions, which would not require PEP 567 at all. |
Okay. In my above reference to telco, I ran the "telco.py full" command ...

The numbers I posted weren't cooked, but I have a hard time reproducing ...

This means that I no longer have any objections, so Yury, please go ahead.

Stefan Krah |
Thank you, Stefan. I've updated the PR with an asyncio+decimal test and ran the tests in refleak mode to make sure there's no regression there. If during the beta/rc period we see that contextvars isn't stable enough or something, I'll revert this change before 3.7.0 myself, so that decimal users will not be disturbed. I'll merge the PR once the CI is green.
Yes, I used decimal examples all the time to showcase how context is supposed to work with generators. Most of those examples were specifically constructed to illustrate some point, but I don't think that real-world code uses a 'with localcontext()' statement in every function. Unfortunately there's no way (at least none known to me) to make 'ContextVar.set()' faster than it is now. I use a HAMT, which guarantees that all set operations have O(log n) performance; the other known approach is to use copy-on-write (as in .NET), but that has O(n) ContextVar.set() performance. So I guess a slightly slower 'with localcontext()' is the price to pay to make decimal easier to use in async/await code. |
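A toy model of the copy-on-write alternative (not CPython code) shows where the O(n) comes from:

    class CoWContext:
        """Toy copy-on-write execution context (illustrative only)."""

        def __init__(self, data=None):
            self._data = data if data is not None else {}

        def set(self, var, value):
            # O(n): every set() duplicates all n existing entries, which is
            # the cost a HAMT avoids by sharing structure between versions.
            new_data = dict(self._data)
            new_data[var] = value
            return CoWContext(new_data)

        def get(self, var):
            # O(1) average: a plain dict lookup; this is why copy-on-write
            # buys a faster get() at the expense of set().
            return self._data[var]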
FYI - this appears to have caused a regression - https://bugs.python.org/issue39776 |