Title: inline function generates slightly inefficient machine code
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.9
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: Greg Price, aeros, benjamin.peterson, malin, mark.dickinson, rhettinger, sir-sigurd
Priority: normal Keywords: patch

Created on 2019-09-03 04:02 by malin, last changed 2019-09-18 05:11 by Greg Price. This issue is now closed.

File name Uploaded Description Edit
demo.c malin, 2019-09-03 04:02
Pull Requests
URL Status Linked Edit
PR 15710 merged malin, 2019-09-06 04:26
PR 15718 closed sir-sigurd, 2019-09-06 14:21
Messages (12)
msg351052 - (view) Author: Ma Lin (malin) * Date: 2019-09-03 04:02
Commit 5e63ab0 replaces macro with this inline function:

    static inline int
    is_small_int(long long ival)
        return -NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS;

(by default, NSMALLNEGINTS is 5, NSMALLPOSINTS is 257)

However, when invoking this function, and `sizeof(value) < sizeof(long long)`, there is an unnecessary type casting.

For example, on 32-bit platform, if `value` is `Py_ssize_t`, it needs to be converted to 8-byte `long long` type.

The following assembly code is the beginning part of `PyLong_FromSsize_t(Py_ssize_t v)` function.
(32-bit x86 build generated by GCC 9.2, with `-m32 -O2` option)

Use macro before commit 5e63ab0:
        mov     eax, DWORD PTR [esp+4]
        add     eax, 5
        cmp     eax, 261
        ja      .L2
        sal     eax, 4
        add     eax, OFFSET FLAT:small_ints
        add     DWORD PTR [eax], 1
.L2:    jmp     PyLong_FromSsize_t_rest(int)

Use inlined function:
        push    ebx
        mov     eax, DWORD PTR [esp+8]
        mov     edx, 261
        mov     ecx, eax
        mov     ebx, eax
        sar     ebx, 31
        add     ecx, 5
        adc     ebx, 0
        cmp     edx, ecx
        mov     edx, 0
        sbb     edx, ebx
        jc      .L7
        sal     eax, 4
        add     eax, OFFSET FLAT:small_ints+80
        add     DWORD PTR [eax], 1
        pop     ebx
.L7:    pop     ebx
        jmp     PyLong_FromSsize_t_rest(int)

On 32-bit x86 platform, 8-byte `long long` is implemented in using two registers, so the machine code is much longer than macro version.

At least these hot functions are suffered from this:
  PyObject* PyLong_FromSsize_t(Py_ssize_t v)
  PyObject* PyLong_FromLong(long v)

Replacing the inline function with a macro version will fix this:
#define IS_SMALL_INT(ival) (-NSMALLNEGINTS <= (ival) && (ival) < NSMALLPOSINTS)

If you want to see assembly code generated by major compilers, you can paste attached file demo.c to
- demo.c was original written by Greg Price.
- use `-m32 -O2` to generate 32-bit build.
msg351053 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2019-09-03 04:27
Do you see a way to salvage the commit?

If not, I'm fine with reverting it.  Its benefit was only aesthetic.
msg351054 - (view) Author: Ma Lin (malin) * Date: 2019-09-03 04:35
There will always be a new commit, replacing with a macro version also looks good.

I have no opinion, both are fine.
msg351228 - (view) Author: Ma Lin (malin) * Date: 2019-09-06 04:31
Revert commit 5e63ab0 or use PR 15710, both are fine.
msg351255 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-09-06 14:28
I added similar patch that replaces get_small_int() with macro version, since it also induces unnecessary casts and makes machine code less efficient.

Example assembly can be checked at

This change produces tiny, but measurable speed-up for handling small ints:

$ python -m pyperf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; r = range(256)" "consume(r)" --compare-to=../cpython-master/venv/bin/python --duplicate=1000
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 1.03 us +- 0.08 us
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 973 ns +- 18 ns

Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 1.03 us +- 0.08 us -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 973 ns +- 18 ns: 1.05x faster (-5%)
msg351278 - (view) Author: Ma Lin (malin) * Date: 2019-09-07 03:57
This range has not been changed since "preallocated small integer pool" was introduced:


The commit (Jan 2007):

Is it worth increase the range?
FYI, build with MSVC 2017, the `small_ints` size:

32-bit build:
    sizeof(PyLongObject)    16 bytes
    sizeof(small_ints)    4192 bytes

64-bit build:
    sizeof(PyLongObject)    32 bytes
    sizeof(small_ints)    8384 bytes
msg351315 - (view) Author: Ma Lin (malin) * Date: 2019-09-08 05:53
> This change produces tiny, but measurable speed-up for handling small ints

I didn't get measurable change, I run this command a dozen times and take the best result:

D:\dev\cpython\PCbuild\amd64\python.exe -m pyperf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; r = range(256)" "consume(r)"  --duplicate=1000

before: Mean +- std dev: 771 ns +- 16 ns
after:  Mean +- std dev: 770 ns +- 10 ns

    64-bit release build by MSVC 2017
    CPU: i3 4160, System: latest Windows 10 64-bit

Check the machine code from, x64 MSVC v19.14 only saves one instruction:
    movsxd  rax, ecx

x86-64 GCC 9.2 saves two instructions:
    lea     eax, [rdi+5]
msg351316 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-09-08 05:59
I use GCC 9.2.
msg351348 - (view) Author: Greg Price (Greg Price) * Date: 2019-09-09 05:41
(Just to help keep discussions together: some earlier discussion was on GH-15216 .)

Because is_small_int / IS_SMALL_INT is so small, there's not much cost in the source code to making it a macro (as GH-15710 did).

But I think it'd be a mistake to go a lot farther than that and convert significantly larger chunks of code like get_small_int to macros (as GH-15718 would), unless the payoff were really commensurate.  It'd be better to focus optimization efforts where profiling shows a lot of time is really being spent.
msg351387 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-09 09:45
I agree with msg351348. The performance wins are very context-dependent and even then very small. It's not worth further disturbing the small int code. So, we should close this issue out.
msg351564 - (view) Author: Ma Lin (malin) * Date: 2019-09-10 02:18
PR 15710 has been merged into the master, but the merge message is not shown here.

Maybe this issue can be closed.
msg352692 - (view) Author: Greg Price (Greg Price) * Date: 2019-09-18 05:11
See followup at and .

The patch in GH-15710 had a side effect of introducing a call to `Py_UNREACHABLE` inside a comma-expression.  A subsequent commit 3ab61473b changed `Py_UNREACHABLE` in a way that made that a syntax error; the issue wasn't caught then because the code is only compiled if `NSMALLNEGINTS + NSMALLPOSINTS <= 0`.  Detailed discussion on those threads.
Date User Action Args
2019-09-18 05:11:29Greg Pricesetmessages: + msg352692
2019-09-10 02:18:12malinsetstatus: open -> closed
resolution: fixed
messages: + msg351564

stage: patch review -> resolved
2019-09-09 09:45:59benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg351387
2019-09-09 05:41:42Greg Pricesetmessages: + msg351348
2019-09-08 05:59:21sir-sigurdsetmessages: + msg351316
2019-09-08 05:53:17malinsetmessages: + msg351315
2019-09-07 03:57:06malinsetmessages: + msg351278
2019-09-06 14:28:25sir-sigurdsetmessages: + msg351255
2019-09-06 14:21:20sir-sigurdsetpull_requests: + pull_request15372
2019-09-06 04:31:10malinsetmessages: + msg351228
2019-09-06 04:26:53malinsetkeywords: + patch
stage: patch review
pull_requests: + pull_request15365
2019-09-03 04:35:31malinsetmessages: + msg351054
2019-09-03 04:27:46rhettingersetassignee: rhettinger
messages: + msg351053
2019-09-03 04:02:44malincreate