Message372744
The performance issue was noticed by Raymond Hettinger, who ran a microbenchmark on tuplegetter_descr_get() comparing Python 3.8 and Python 3.9:
https://mail.python.org/archives/list/python-dev@python.org/message/Q3YHYIKNUQH34FDEJRSLUP2MTYELFWY3/
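For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a microbenchmark on named tuple attribute access, which on CPython 3.8+ should go through tuplegetter_descr_get(). It is not Raymond's actual benchmark: the `Point` type, the `time_attribute_access()` helper and the loop counts are assumptions chosen for illustration.

```python
import timeit
from collections import namedtuple

# Hypothetical named tuple, used only for this illustration.
Point = namedtuple("Point", ["x", "y"])


def time_attribute_access(loops=1_000_000, repeat=5):
    """Return the best observed time, in seconds, of a single `p.x` lookup."""
    p = Point(1, 2)
    timer = timeit.Timer("p.x", globals={"p": p})
    # Take the minimum over several runs to reduce noise from the system.
    best = min(timer.repeat(repeat=repeat, number=loops))
    return best / loops


if __name__ == "__main__":
    print(f"p.x: {time_attribute_access() * 1e9:.1f} ns per lookup")
```

Running the same script with a Python 3.8 binary and a Python 3.9 binary built with the same compiler options gives a rough comparison. The numbers include the overhead of the timing loop, so only the difference between the two builds is meaningful.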
INADA-san confirmed that the performance regression was introduced by commit 45ec5b99aefa54552947049086e87ec01bc2fc9a (bpo-40170), which changed the PyType_HasFeature() implementation to always call PyType_GetFlags() as a function rather than directly reading the PyTypeObject.tp_flags member.
https://mail.python.org/archives/list/python-dev@python.org/message/FOKJXG2SYMXCHYPGUZWVYMHLDR42BYFB/
On Fedora 32, there is no performance difference because binaries are built with GCC using LTO and PGO: the PyType_GetFlags() function call is inlined by GCC 10.
I built Python with clang 11.0.3 on macOS 10.15.4, and I confirm that LTO+PGO allows clang to inline the PyType_GetFlags() function call in tuplegetter_descr_get().
Using "./configure && make":
---
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
(...)
python.exe[0x1001c46ad] <+29>: callq 0x10009c720 ; PyType_GetFlags at typeobject.c:2338
python.exe[0x1001c46b2] <+34>: testl $0x4000000, %eax ; imm = 0x4000000
(...)
---
Using "./configure --with-lto --enable-optimizations && make":
---
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
(...)
python.exe[0x1002a9542] <+18>: movq 0x10(%rbx), %rdx
python.exe[0x1002a9546] <+22>: movq 0x8(%rsi), %rax
python.exe[0x1002a954a] <+26>: testb $0x4, 0xab(%rax)
python.exe[0x1002a9551] <+33>: je 0x1002a956f ; <+63>
(...)
---
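The two listings test the same bit. A small sketch of the arithmetic, assuming that the mask 0x4000000 is Py_TPFLAGS_TUPLE_SUBCLASS (1 << 26) and that tp_flags sits at offset 0xa8 in PyTypeObject on this 64-bit build (both are my reading of the listings, not something the debugger output states). The helper names are hypothetical, and `type.__flags__` is a CPython-specific attribute.

```python
from collections import namedtuple

# Mask from the non-optimized build: testl $0x4000000, %eax
FULL_MASK = 0x4000000

# Assumption: Py_TPFLAGS_TUPLE_SUBCLASS is defined as 1 << 26 in CPython.
TUPLE_SUBCLASS_BIT = 1 << 26


def byte_mask_to_full_mask(byte_mask, byte_index):
    """Widen a one-byte mask applied at little-endian byte `byte_index`
    into the equivalent mask on the whole integer."""
    return byte_mask << (8 * byte_index)


def has_tuple_flag(tp):
    """Check the tuple subclass bit by reading the type flags directly."""
    return bool(tp.__flags__ & TUPLE_SUBCLASS_BIT)


if __name__ == "__main__":
    # Optimized build: testb $0x4, 0xab(%rax), i.e. byte 3 of a field at 0xa8.
    print(hex(byte_mask_to_full_mask(0x4, 0xab - 0xa8)))
    Point = namedtuple("Point", ["x", "y"])
    print(has_tuple_flag(tuple), has_tuple_flag(Point), has_tuple_flag(list))
```

In other words, the optimized build reads a single byte of tp_flags in place, whereas the non-optimized build pays for a function call before testing the same bit.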
Date | User | Action | Args
2020-07-01 10:36:38 | vstinner | set | recipients: + vstinner
2020-07-01 10:36:38 | vstinner | set | messageid: <1593599798.85.0.412490602686.issue41181@roundup.psfhosted.org>
2020-07-01 10:36:38 | vstinner | link | issue41181 messages
2020-07-01 10:36:38 | vstinner | create |