Classification
Title: Faster type checking in listobject.c
Type: performance
Stage: patch review
Components: Interpreter Core
Versions: Python 3.6

Process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To: rhettinger
Nosy List: jkloth, pitrou, rhettinger, serhiy.storchaka, vstinner, yselivanov
Priority: normal
Keywords: patch

Created on 2016-01-25 19:25 by rhettinger, last changed 2016-01-26 21:34 by rhettinger. This issue is now closed.

Files
File name                   Uploaded                       Description
list_check_fastpath.diff    rhettinger, 2016-01-25 19:25   In-lined version.
list_check_fastpath2.diff   rhettinger, 2016-01-25 21:09   Fix PyList_Append()
Messages (9)
msg258918 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-25 19:25
A number of fine-grained methods in Objects/listobject.c use PyList_Check().  They include PyList_Size, PyList_GetItem, PyList_SetItem, PyList_Insert, and PyList_Append.

PyList_Check() works by making two sequentially dependent memory fetches:

    movq    8(%rdi), %rax
    testb   $2, 171(%rax)
    je  L1645

This patch proposes a fast path for the common case of an exact match, using PyList_CheckExact() as an early-out before the PyList_Check() test:

    leaq    _PyList_Type(%rip), %rdx # parallelizable
    movq    8(%rdi), %rax            # only 1 memory access
    cmpq    %rdx, %rax               # macro-fusion  
    je  L1604                        # early-out
    testb   $2, 171(%rax)            # fallback to 2nd memory access
    je  L1604

This technique won't help outside of Objects/listobject.c because the initial LEA instruction becomes a load through the global offset table, nullifying the advantage.
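For reference, the check pattern can be sketched in plain C, with simplified stand-in structs for PyObject and PyTypeObject (the struct layouts, helper names, and sample types below are illustrative, not CPython's actual definitions):

```c
/* Simplified stand-ins for CPython's PyTypeObject and PyObject.
   The real structs have many more fields; only what the check
   touches is modeled here. */
typedef struct {
    unsigned long tp_flags;          /* mirrors PyTypeObject.tp_flags */
} TypeObject;

typedef struct {
    TypeObject *ob_type;             /* mirrors PyObject.ob_type */
} Object;

/* Py_TPFLAGS_LIST_SUBCLASS is bit 25; "testb $2, 171(%rax)" checks
   bit 1 of byte 171, i.e. bit 25 of the 8-byte tp_flags word. */
#define LIST_SUBCLASS_FLAG (1UL << 25)

static TypeObject List_Type = { LIST_SUBCLASS_FLAG };

/* Like PyList_Check(): two dependent loads -- ob_type, then tp_flags */
static int list_check(const Object *op) {
    return (op->ob_type->tp_flags & LIST_SUBCLASS_FLAG) != 0;
}

/* Like PyList_CheckExact(): one load plus a pointer compare */
static int list_check_exact(const Object *op) {
    return op->ob_type == &List_Type;
}

/* The patched pattern: cheap exact test first, subclass flag as fallback */
static int list_check_fast(const Object *op) {
    return list_check_exact(op) || list_check(op);
}

/* Sample objects for illustration (Sub_Type is a hypothetical list subclass) */
static TypeObject Sub_Type = { LIST_SUBCLASS_FLAG };
static TypeObject Other_Type = { 0 };
static Object exact_list = { &List_Type };
static Object sub_list = { &Sub_Type };
static Object not_a_list = { &Other_Type };
```

Exact list instances take only the pointer-compare branch; only list subclasses and non-lists ever pay for the second dependent load.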
msg258919 - (view) Author: Jeremy Kloth (jkloth) * Date: 2016-01-25 20:10
Added review
msg258920 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-25 20:26
Could you please provide any microbenchmarks that show the benefit of this optimization?
msg258924 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-25 21:09
Jeremy, thanks for the keen eyes.  Attaching a revised patch.

Serhiy, pure Python code doesn't directly access any of the C API functions touched by the patch, so timeit-style examples aren't possible.  The beneficiaries of the change are external modules.  I suppose we could write a small C extension just to time this, but that seems like overkill.

It is a general purpose optimization technique to reduce the number of memory accesses in fine-grained functions.  That technique has worked well elsewhere in the core where measurable progress was made in many small steps that were individually hard to measure.
msg258928 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-25 21:35
> Could you please provide any microbenchmarks that show the benefit of this optimization?

Yeah, analyzing the assembler seems overkill to me. I'm not sure that it really makes the code faster (but it does make the code more complex).
msg258932 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-25 23:18
Writing your own static optimizer from scratch is overkill.  Following basic techniques from the Agner Fog manuals and Intel optimization manuals is quite reasonable in comparison.

Unless you see an actual defect in the patch, I'm applying it.
msg258935 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-26 00:07
> Unless you see an actual defect in the patch, I'm applying it.

Well, Serhiy Storchaka asked for a microbenchmark. For an optimization, that's a reasonable request, no?


> Writing your own static optimizer from scratch is overkill.

My goal is to get speedup on macro benchmarks.

It's hard to get a real speedup on a macro benchmark using micro-optimizations, even if we combine a lot of them. I'm not saying that micro-optimization is a waste of time; I like micro-optimizing Python, but we should not overuse it.
msg258938 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-01-26 00:47
> Well, Serhiy Storchaka asked for a microbenchmark. 
> For an optimization, it's a reasonable request no?

As noted above, it is not reasonable for a C API that isn't used internally, because there is no reasonable way to measure the effect without writing a C extension.

FWIW, I look at assembly to see direct evidence for the effect of a change.  It makes it easy to see how many memory accesses happen on a given path.

If you're going to be in the optimization business, I really wish you would read some of the basics in the field so that each and every technique doesn't have to be re-proved to you each time it is used.  Here's a short list I put together for those who are interested: https://dl.dropboxusercontent.com/u/3967849/sftalk/_build/html/misc.html#optimization-resources
msg258940 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-01-26 01:40
> As noted above, it is not reasonable for a C API that isn't used internally because there is no reasonable way to measure the effect without writing a C extension.

Come on, there are a lot of C functions in Python using the PyList_xxx() API. It's easy to write a benchmark for it in pure Python. Example:


Without the patch (best of 10 runs):

$ for i in $(seq 1 10); do ./python -m timeit -s 's="a " * 10**5' 's.split()'; done
1000 loops, best of 3: 1.40 msec per loop

With the patch (best of 10 runs):

$ for i in $(seq 1 10); do ./python -m timeit -s 's="a " * 10**5' 's.split()'; done
1000 loops, best of 3: 1.49 msec per loop

It looks like list_check_fastpath2.diff makes PyList_Append() *slower* (6%) in practice. I expected it to be *faster*.


Please try to reproduce my microbenchmark; timeit is not really reliable, and I hate microbenchmarks: it's easy to make mistakes :-/

Note: It looks like s.split() creates a list of 'a' strings where all the 'a' strings are the same singleton, so I don't think that creating the substrings has a significant cost.


> If you're going to be in the optimization business, I really wish you would read some of the basics in the field so that each and every technique doesn't have to be re-proved to you each time it used.

It's easy to convince me: show me a benchmark where your patch makes the code faster :-) I don't think there is a strict rule for a minimum speedup in CPython. I'm open to discussion :-)
History
Date                  User              Action   Args
2016-01-26 21:34:00   rhettinger        set      messages: - msg258942
2016-01-26 02:30:31   rhettinger        set      status: open -> closed; messages: + msg258942
2016-01-26 01:41:52   vstinner          set      nosy: + pitrou, yselivanov
2016-01-26 01:40:33   vstinner          set      messages: + msg258940
2016-01-26 00:47:30   rhettinger        set      messages: + msg258938
2016-01-26 00:07:51   vstinner          set      messages: + msg258935
2016-01-25 23:18:45   rhettinger        set      messages: + msg258932
2016-01-25 21:35:37   vstinner          set      nosy: + vstinner; messages: + msg258928
2016-01-25 21:09:07   rhettinger        set      files: + list_check_fastpath2.diff; assignee: serhiy.storchaka -> rhettinger; messages: + msg258924
2016-01-25 20:26:38   serhiy.storchaka  set      messages: + msg258920
2016-01-25 20:10:54   jkloth            set      nosy: + jkloth; messages: + msg258919
2016-01-25 19:25:39   rhettinger        create