Classification
Title: Faster type checking in listobject.c
Type: performance
Stage: patch review
Components: Interpreter Core
Versions: Python 3.6

Process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: jkloth, pitrou, rhettinger, serhiy.storchaka, vstinner, yselivanov
Priority: normal
Keywords: patch

Created on 2016-01-25 19:25 by rhettinger, last changed 2016-01-26 21:34 by rhettinger. This issue is now closed.

Files
File name                   Uploaded                       Description
list_check_fastpath.diff    rhettinger, 2016-01-25 19:25   In-lined version.
list_check_fastpath2.diff   rhettinger, 2016-01-25 21:09   Fix PyList_Append()
Messages (9)
msg258918 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-25 19:25
A number of fine-grained methods in Objects/listobject.c use PyList_Check().  They include PyList_Size, PyList_GetItem, PyList_SetItem, PyList_Insert, and PyList_Append.

PyList_Check() works by making two sequentially dependent memory fetches:

    movq    8(%rdi), %rax
    testb   $2, 171(%rax)
    je  L1645

This patch proposes a fast path for the common case of an exact match, using PyList_CheckExact() as an early-out before the PyList_Check() test:

    leaq    _PyList_Type(%rip), %rdx # parallelizable
    movq    8(%rdi), %rax            # only 1 memory access
    cmpq    %rdx, %rax               # macro-fusion  
    je  L1604                        # early-out
    testb   $2, 171(%rax)            # fallback to 2nd memory access
    je  L1604

This technique won't help outside of Objects/listobject.c because the initial LEA instruction becomes a load through the global offset table, nullifying the advantage.
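For reference, the check pattern can be sketched in plain C, with simplified stand-in structs for PyObject and PyTypeObject (the struct layouts, helper names, and sample types below are illustrative, not CPython's actual definitions):

```c
/* Simplified stand-ins for CPython's PyTypeObject and PyObject.
   The real structs have many more fields; only what the check
   touches is modeled here. */
typedef struct {
    unsigned long tp_flags;          /* mirrors PyTypeObject.tp_flags */
} TypeObject;

typedef struct {
    TypeObject *ob_type;             /* mirrors PyObject.ob_type */
} Object;

/* Py_TPFLAGS_LIST_SUBCLASS is bit 25; "testb $2, 171(%rax)" checks
   bit 1 of byte 171, i.e. bit 25 of the 8-byte tp_flags word. */
#define LIST_SUBCLASS_FLAG (1UL << 25)

static TypeObject List_Type = { LIST_SUBCLASS_FLAG };

/* Like PyList_Check(): two dependent loads -- ob_type, then tp_flags */
static int list_check(const Object *op) {
    return (op->ob_type->tp_flags & LIST_SUBCLASS_FLAG) != 0;
}

/* Like PyList_CheckExact(): one load plus a pointer compare */
static int list_check_exact(const Object *op) {
    return op->ob_type == &List_Type;
}

/* The patched pattern: cheap exact test first, subclass flag as fallback */
static int list_check_fast(const Object *op) {
    return list_check_exact(op) || list_check(op);
}

/* Sample objects for illustration (Sub_Type is a hypothetical list subclass) */
static TypeObject Sub_Type = { LIST_SUBCLASS_FLAG };
static TypeObject Other_Type = { 0 };
static Object exact_list = { &List_Type };
static Object sub_list = { &Sub_Type };
static Object not_a_list = { &Other_Type };
```

Exact list instances take only the pointer-compare branch; only list subclasses and non-lists ever pay for the second dependent load.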
msg258919 - (view) Author: Jeremy Kloth (jkloth) * Date: 2016-01-25 20:10
Added review
msg258920 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-25 20:26
Could you please provide any microbenchmarks that show the benefit of this optimization?
msg258924 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-25 21:09
Jeremy, thanks for the keen eyes.  Attaching a revised patch.

Serhiy, pure Python code doesn't directly access any of the C API functions touched by the patch, so timeit-style examples aren't possible.  The beneficiaries of the change are external modules.  I suppose we could write a small C extension just to time this, but that seems like overkill.

It is a general purpose optimization technique to reduce the number of memory accesses in fine-grained functions.  That technique has worked well elsewhere in the core where measurable progress was made in many small steps that were individually hard to measure.
msg258928 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-25 21:35
> Could you please provide any microbenchmarks that show the benefit of this optimization?

Yeah, analyzing the assembler seems overkill to me. I'm not sure that it really makes the code faster (but it does make the code more complex).
msg258932 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-25 23:18
Writing your own static optimizer from scratch is overkill.  Following basic techniques from the Agner Fog manuals and Intel optimization manuals is quite reasonable in comparison.

Unless you see an actual defect in the patch, I'm applying it.
msg258935 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-26 00:07
> Unless you see an actual defect in the patch, I'm applying it.

Well, Serhiy Storchaka asked for a microbenchmark. For an optimization, that's a reasonable request, no?


> Writing your own static optimizer from scratch is overkill.

My goal is to get speedup on macro benchmarks.

It's hard to get a real speedup on a macro benchmark using micro-optimizations, even if we combine a lot of them. I'm not saying that micro-optimization is a waste of time; I like micro-optimizing Python, but we should not overuse it.
msg258938 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-26 00:47
> Well, Serhiy Storchaka asked for a microbenchmark. 
> For an optimization, it's a reasonable request no?

As noted above, it is not reasonable for a C API that isn't used internally, because there is no reasonable way to measure the effect without writing a C extension.

FWIW, I look at assembly to see direct evidence for the effect of a change.  It makes it easy to see how many memory accesses happen on a given path.

If you're going to be in the optimization business, I really wish you would read some of the basics in the field so that each and every technique doesn't have to be re-proved to you each time it is used.  Here's a short list I put together for those who are interested: https://dl.dropboxusercontent.com/u/3967849/sftalk/_build/html/misc.html#optimization-resources
msg258940 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-26 01:40
> As noted above, it is not reasonable for a C API that isn't used internally because there is no reasonable way to measure the effect without writing a C extension.

Come on, there are a lot of C functions in Python using the PyList_xxx() API. It's easy to write a benchmark for it in pure Python. Example:


Without the patch (best of 10 runs):

$ for i in $(seq 1 10); do ./python -m timeit -s 's="a " * 10**5' 's.split()'; done
1000 loops, best of 3: 1.40 msec per loop

With the patch (best of 10 runs):

$ for i in $(seq 1 10); do ./python -m timeit -s 's="a " * 10**5' 's.split()'; done
1000 loops, best of 3: 1.49 msec per loop

It looks like list_check_fastpath2.diff makes PyList_Append() *slower* (6%) in practice. I expected it to be *faster*.


Please try to reproduce my microbenchmark; timeit is not really reliable, and I hate microbenchmarks: it's easy to make mistakes :-/

Note: It looks like s.split() creates a list of 'a' strings where all the 'a' strings are the same singleton, so I don't think that creating the substrings has a significant cost.


> If you're going to be in the optimization business, I really wish you would read some of the basics in the field so that each and every technique doesn't have to be re-proved to you each time it used.

It's easy to convince me: show me a benchmark where your patch makes the code faster :-) I don't think there is a strict rule for a minimum speedup in CPython. I'm open to discussion :-)
History
Date                  User              Action   Args
2016-01-26 21:34:00   rhettinger        set      messages: - msg258942
2016-01-26 02:30:31   rhettinger        set      status: open -> closed; messages: + msg258942
2016-01-26 01:41:52   vstinner          set      nosy: + pitrou, yselivanov
2016-01-26 01:40:33   vstinner          set      messages: + msg258940
2016-01-26 00:47:30   rhettinger        set      messages: + msg258938
2016-01-26 00:07:51   vstinner          set      messages: + msg258935
2016-01-25 23:18:45   rhettinger        set      messages: + msg258932
2016-01-25 21:35:37   vstinner          set      nosy: + vstinner; messages: + msg258928
2016-01-25 21:09:07   rhettinger        set      files: + list_check_fastpath2.diff; assignee: serhiy.storchaka -> rhettinger; messages: + msg258924
2016-01-25 20:26:38   serhiy.storchaka  set      messages: + msg258920
2016-01-25 20:10:54   jkloth            set      nosy: + jkloth; messages: + msg258919
2016-01-25 19:25:39   rhettinger        create