
Author rhettinger
Recipients gvanrossum, iritkatriel, rhettinger, serhiy.storchaka
Date 2021-09-09.19:15:09
Message-id <1631214909.77.0.433788913461.issue45152@roundup.psfhosted.org>
In-reply-to
Content
Thanks for the link.  This is a worthwhile experiment.  However, the potential gains will be hard to come by.

The workload of LOAD_CONST is very small.  After paying the usual dispatch-logic overhead, all it does is index into a struct member and incref the result.  Both the co_consts table and the popular constant objects are likely already in the L1 data cache.


	##DEBUG_LABEL: TARGET_LOAD_CONST
	movslq	%r15d, %rax             ## OpArg fetch, typically a zero-cost register rename
	movq	-368(%rbp), %rcx        ## 8-byte Reload to access co_consts
	movq	24(%rcx,%rax,8), %rax   ## The actual indexing operation  (3 cycles)
	incq	(%rax)                  ## The incref  
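
To make that workload concrete, here is a small standalone C model of the same data path (a toy sketch, not CPython source; the Obj, Code, and load_const names are invented for illustration): fetch the oparg, chase two dependent pointers through the code object's constants table, and incref the result.

	#include <stdio.h>

	typedef struct { long refcnt; long value; } Obj;   /* stand-in for PyObject */
	typedef struct { Obj **consts; } Code;             /* stand-in for the code object's co_consts */

	static Obj *load_const(Code *code, int oparg)
	{
	    Obj *v = code->consts[oparg];   /* load the consts pointer, then index it: two dependent loads */
	    v->refcnt++;                    /* the incref */
	    return v;
	}

	int main(void)
	{
	    Obj forty_two = {1, 42};
	    Obj *table[] = { &forty_two };
	    Code code = { table };

	    Obj *v = load_const(&code, 0);
	    printf("value=%ld  refcnt=%ld\n", v->value, v->refcnt);
	    return 0;
	}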


A specialized opcode for a specific constant like None can 1) eliminate the oparg fetch (likely saving nothing), and 2) eliminate the two sequentially dependent memory accesses (this is a win):

	##DEBUG_LABEL: TARGET_LOAD_NONE
	movq	__Py_NoneStruct@GOTPCREL(%rip), %rax
	incq	(%rax)                  ## The incref
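
A matching toy sketch of what a LOAD_NONE handler could look like (again not CPython source; the none_singleton and load_none names are invented): because the singleton's address is known at link time, the oparg fetch and both table loads disappear and only the incref remains.

	#include <stdio.h>

	typedef struct { long refcnt; } Obj;       /* stand-in for PyObject */

	static Obj none_singleton = {1};           /* stand-in for _Py_NoneStruct */

	static Obj *load_none(void)
	{
	    none_singleton.refcnt++;               /* the incref; no table lookup needed */
	    return &none_singleton;
	}

	int main(void)
	{
	    Obj *v = load_none();
	    printf("refcnt=%ld\n", v->refcnt);
	    return 0;
	}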


Any more general opcode for loading small ints would still need the oparg fetch and the incref.  To win, it would need to convert the oparg into an int more efficiently than the two movq steps.  If the small int table is in a fixed location (not per-subinterpreter), then you can save 2 cycles with the simpler address computation:

	##DEBUG_LABEL: TARGET_SMALLINT
	movslq	%r15d, %rax             ## OpArg fetch, typically a zero-cost register rename
	movq	__Py_SmallInt@GOTPCREL(%rip), %rcx        ## Find an array of ints
	movq	(%rcx,%rax,8), %rax     ## Cheaper address computation takes 1 cycle
	incq	(%rax)                  ## The incref 
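
A toy sketch of that variant (the small_ints table and load_small_int helper are hypothetical, not CPython's implementation): the oparg fetch and the incref remain, but only one load off a fixed base address is needed to reach the object.

	#include <stdio.h>

	typedef struct { long refcnt; long value; } Obj;   /* stand-in for PyLongObject */

	/* Table at a fixed, link-time-known address.  CPython preallocates the
	   range -5..256; this toy model just uses a direct 0..261 mapping. */
	static Obj small_ints[262];

	static Obj *load_small_int(int oparg)
	{
	    Obj *v = &small_ints[oparg];   /* one address computation off a constant base */
	    v->refcnt++;                   /* the incref */
	    return v;
	}

	int main(void)
	{
	    for (int i = 0; i < 262; i++) {
	        small_ints[i].refcnt = 1;
	        small_ints[i].value = i;
	    }
	    Obj *v = load_small_int(7);
	    printf("value=%ld  refcnt=%ld\n", v->value, v->refcnt);
	    return 0;
	}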

The 2-cycle win (Intel-only) will be partially offset by additional pressure on the L1 data cache.  Right now, co_consts is almost certainly in cache, holding only the constants that actually get used (eight pointers per 64-byte cache line).  Accesses into a small_int array will push other data out of L1.

IIRC, Serhiy already experimented with a LOAD_NONE opcode and couldn't get a measurable win.