
Author rhettinger
Recipients gvanrossum, iritkatriel, rhettinger, serhiy.storchaka
Date 2021-09-09.19:15:09
Message-id <1631214909.77.0.433788913461.issue45152@roundup.psfhosted.org>
In-reply-to
Content
Thanks for the link.  This is a worthwhile experiment.  However, the potential gains will be hard to come by.

The workload of LOAD_CONST is very small.  After paying the usual dispatch-logic overhead, all it does is index into a struct member and incref the result.  Both the co_consts table and the popular constant objects are likely already in the L1 data cache.


	##DEBUG_LABEL: TARGET_LOAD_CONST
	movslq	%r15d, %rax             ## OpArg fetch, typically a zero-cost register rename
	movq	-368(%rbp), %rcx        ## 8-byte Reload to access co_consts
	movq	24(%rcx,%rax,8), %rax   ## The actual indexing operation  (3 cycles)
	incq	(%rax)                  ## The incref  
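
To make that workload concrete, here is a small standalone C model of the same data path (a toy sketch, not CPython source; the Obj, Code, and load_const names are invented for illustration): fetch the oparg, chase two dependent pointers through the code object's constants table, and incref the result.

	#include <stdio.h>

	typedef struct { long refcnt; long value; } Obj;   /* stand-in for PyObject */
	typedef struct { Obj **consts; } Code;             /* stand-in for the code object's co_consts */

	static Obj *load_const(Code *code, int oparg)
	{
	    Obj *v = code->consts[oparg];   /* load the consts pointer, then index it: two dependent loads */
	    v->refcnt++;                    /* the incref */
	    return v;
	}

	int main(void)
	{
	    Obj forty_two = {1, 42};
	    Obj *table[] = { &forty_two };
	    Code code = { table };

	    Obj *v = load_const(&code, 0);
	    printf("value=%ld  refcnt=%ld\n", v->value, v->refcnt);
	    return 0;
	}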


A specialized opcode for a specific constant like None can 1) eliminate the oparg fetch (likely saving nothing), and 2) eliminate the two sequentially dependent memory accesses (this is a win):

	##DEBUG_LABEL: TARGET_LOAD_NONE
	movq	__Py_NoneStruct@GOTPCREL(%rip), %rax
	incq	(%rax)                  ## The incref
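
A matching toy sketch of what a LOAD_NONE handler could look like (again not CPython source; the none_singleton and load_none names are invented): because the singleton's address is known at link time, the oparg fetch and both table loads disappear and only the incref remains.

	#include <stdio.h>

	typedef struct { long refcnt; } Obj;       /* stand-in for PyObject */

	static Obj none_singleton = {1};           /* stand-in for _Py_NoneStruct */

	static Obj *load_none(void)
	{
	    none_singleton.refcnt++;               /* the incref; no table lookup needed */
	    return &none_singleton;
	}

	int main(void)
	{
	    Obj *v = load_none();
	    printf("refcnt=%ld\n", v->refcnt);
	    return 0;
	}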


Any more general opcode for loading small ints would still need the oparg fetch and the incref.  To win, it would need to convert the oparg into an int more efficiently than the two movq steps.  If the small int table is in a fixed location (not per-subinterpreter), then you can save 2 cycles with the simpler address computation:

	##DEBUG_LABEL: TARGET_SMALLINT
	movslq	%r15d, %rax             ## OpArg fetch, typically a zero-cost register rename
	movq	__Py_SmallInt@GOTPCREL(%rip), %rcx        ## Find an array of ints
	movq	(%rcx,%rax,8), %rax     ## Cheaper address computation takes 1 cycle
	incq	(%rax)                  ## The incref 
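
A toy sketch of that variant (the small_ints table and load_small_int helper are hypothetical, not CPython's implementation): the oparg fetch and the incref remain, but only one load off a fixed base address is needed to reach the object.

	#include <stdio.h>

	typedef struct { long refcnt; long value; } Obj;   /* stand-in for PyLongObject */

	/* Table at a fixed, link-time-known address.  CPython preallocates the
	   range -5..256; this toy model just uses a direct 0..261 mapping. */
	static Obj small_ints[262];

	static Obj *load_small_int(int oparg)
	{
	    Obj *v = &small_ints[oparg];   /* one address computation off a constant base */
	    v->refcnt++;                   /* the incref */
	    return v;
	}

	int main(void)
	{
	    for (int i = 0; i < 262; i++) {
	        small_ints[i].refcnt = 1;
	        small_ints[i].value = i;
	    }
	    Obj *v = load_small_int(7);
	    printf("value=%ld  refcnt=%ld\n", v->value, v->refcnt);
	    return 0;
	}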

The 2-cycle win (Intel-only) will be partially offset by additional pressure on the L1 data cache.  Right now, co_consts is almost certainly in cache, holding only the constants that actually get used (eight pointers per 64-byte cache line).  Accesses into a small_int array will push other data out of L1.

IIRC, Serhiy already experimented with a LOAD_NONE opcode and couldn't get a measurable win.