classification
Title: add internal _PyLong_FromUnsignedChar() function
Type: performance Stage: patch review
Components: Interpreter Core Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Greg Price, jdemeyer, sir-sigurd
Priority: normal Keywords: patch

Created on 2019-08-13 10:29 by sir-sigurd, last changed 2019-08-28 05:55 by Greg Price.

Pull Requests
URL Status Linked Edit
PR 15251 closed sir-sigurd, 2019-08-13 10:34
Messages (9)
msg349540 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-08-13 10:29
When compiled with default NSMALLPOSINTS, _PyLong_FromUnsignedChar() is significantly faster than other PyLong_From*():

$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 7.10 ms +- 0.02 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 4.29 ms +- 0.03 ms

Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 7.10 ms +- 0.02 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 4.29 ms +- 0.03 ms: 1.66x faster (-40%)

It's mostly useful for bytes/bytearray, but also can be used in several other places.
msg349551 - (view) Author: Jeroen Demeyer (jdemeyer) * (Python triager) Date: 2019-08-13 12:51
Maybe an even better idea would be to partially inline PyLong_FromLong(). If the check for small ints in PyLong_FromLong() would be inlined, then the compiler could optimize those checks. This would benefit all users of PyLong_FromLong() without code changes.
msg350406 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-24 23:03
Hmm, I'm a bit confused because:

* Your patch at GH-15251 replaces a number of calls to PyLong_FromLong with calls to the new _PyLong_FromUnsignedChar.

* That function, in turn, just calls PyLong_FromSize_t.

* And that function begins:

PyObject *
PyLong_FromSize_t(size_t ival)
{
    PyLongObject *v;
    size_t t;
    int ndigits = 0;

    if (ival < PyLong_BASE)
        return PyLong_FromLong((long)ival);
// ...


* So, it seems like after your patch we still end up calling PyLong_FromLong at each of these callsites, just after a couple more indirections than before.

Given the magic of compilers and of hardware branch prediction, it wouldn't at all surprise me for those indirections to not make anything slower... but if the measurements are coming out *faster*, then I feel like something else must be going on. ;-)

Ohhh, I see -- I bet it's that at _PyLong_FromUnsignedChar, the compiler can see that `is_small_int(ival)` is always true, so the whole function just turns into get_small_int.  Whereas when compiling a call to PyLong_FromLong from some other file (other translation unit), it can't see that and can't make the optimization.

Two questions, then:

* How do the measurements look under LTO? I wonder if with LTO the linker is able to make the same optimization that this change helps the compiler make.

* Is there a particular reason to specifically call PyLong_FromSize_t? Seems like PyLong_FromLong is the natural default (and what we default to in the rest of the code), and it's what this ends up calling anyway.
msg350407 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-24 23:04
Oh also:

* What compiler, and what compilation flags, are you using in your benchmarking?  That seems relevant :)
msg350411 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-24 23:22
> Is there a particular reason to specifically call PyLong_FromSize_t? Seems like PyLong_FromLong is the natural default (and what we default to in the rest of the code), and it's what this ends up calling anyway.

Ah I see, the patch is meant to go on top of GH-15192, which makes PyLong_FromSize_t apply the small-int optimization itself.

As I just suggested on GH-15192, I'd like to see that PR apply the small-int optimization in the more broadly-used PyLong_FromUnsignedLong... and then I think the natural thing for this new function to do is to call that.

Still quite curious how LTO does, and also curious what compiler and flags you're using in benchmarks.
msg350427 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-08-25 05:20
$ gcc -v 2>&1 | grep 'gcc version'
gcc version 8.3.0 (Debian 8.3.0-19)

using ./configure --enable-optimizations --with-lto
$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 6.71 ms +- 0.09 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 6.71 ms +- 0.08 ms
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 6.71 ms +- 0.09 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 6.71 ms +- 0.08 ms: 1.00x slower (+0%)

using ./configure --enable-optimizations
$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 6.73 ms +- 0.17 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 4.28 ms +- 0.01 ms
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 6.73 ms +- 0.17 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 4.28 ms +- 0.01 ms: 1.57x faster (-36%)
msg350435 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-25 06:32
Very interesting, thanks!

It looks like with LTO enabled, this optimization has no effect at all.

This change adds significant complexity, and it seems like the hoped-for payoff is entirely in terms of performance on rather narrowly-focused microbenchmarks.  In general I think that sets the bar rather high for making sure the performance gains are meaningful enough to justify the increase in complexity in the code.

In particular, I expect most anyone running Python and concerned with performance to be using LTO.  (It's standard in distro builds of Python, so that covers probably most users already.)  That means if the optimization doesn't do anything in the presence of LTO, it doesn't really count. ;-)

---

Now, I am surprised at the specifics of the result! I expected that LTO would probably pick up the equivalent optimization on its own, so that this change didn't have an effect. Instead, it looks like with LTO, this microbenchmark performs the same as it does without LTO *before* this change. That suggests that LTO may instead be blocking this optimization.

In that case, there may still be an opportunity: if you can work out why the change doesn't help under LTO, maybe you can find a way to make this optimization happen under LTO after all.
msg350459 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-08-25 14:01
These last results are invalid :-)

I thought that I was checking _PyLong_FromUnsignedChar() on top of GH-15192, but that wasn't true. So the correct results for LTO build are:

$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 6.93 ms +- 0.04 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 3.96 ms +- 0.01 ms
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 6.93 ms +- 0.04 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 3.96 ms +- 0.01 ms: 1.75x faster (-43%)

But the most important thing is that using PyLong_FromUnsignedLong() instead of _PyLong_FromUnsignedChar() on top of GH-15192 is producing the same results: striter_next() uses small_ints[] directly. However that's not true for bytearrayiter_next(): PyLong_FromUnsignedLong() is called there. I think that's due to how code is profiled so I'm satisfied with these results more or less.

I'm closing existing PR and probably will close this issue soon after GH-15192 will be merged.
msg350659 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-28 05:55
Ah OK, that makes sense of it then :)

> But the most important thing is that using PyLong_FromUnsignedLong() instead of _PyLong_FromUnsignedChar() on top of GH-15192 is producing the same results: striter_next() uses small_ints[] directly. However that's not true for bytearrayiter_next(): PyLong_FromUnsignedLong() is called there. I think that's due to how code is profiled so I'm satisfied with these results more or less.

Very interesting! That's a good comparison to have made, too.

I agree with your conclusions.
History
Date User Action Args
2019-08-28 05:55:19Greg Pricesetmessages: + msg350659
2019-08-25 14:01:36sir-sigurdsetmessages: + msg350459
2019-08-25 06:32:35Greg Pricesetmessages: + msg350435
2019-08-25 05:20:44sir-sigurdsetmessages: + msg350427
2019-08-24 23:22:38Greg Pricesetmessages: + msg350411
2019-08-24 23:04:46Greg Pricesetmessages: + msg350407
2019-08-24 23:03:05Greg Pricesetnosy: + Greg Price
messages: + msg350406
2019-08-13 12:51:22jdemeyersetnosy: + jdemeyer
messages: + msg349551
2019-08-13 10:34:33sir-sigurdsetkeywords: + patch
stage: patch review
pull_requests: + pull_request14971
2019-08-13 10:29:55sir-sigurdcreate