Date 2020-07-01.10:36:38
The performance issue was noticed by Raymond Hettinger, who ran a microbenchmark on tuplegetter_descr_get() comparing Python 3.8 and Python 3.9:

INADA-san confirms that the performance regression was introduced by commit 45ec5b99aefa54552947049086e87ec01bc2fc9a (bpo-40170), which changes the PyType_HasFeature() implementation to always call PyType_GetFlags() as a function rather than reading the PyTypeObject.tp_flags member directly.
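
To make the difference concrete, here is a minimal sketch of the two styles of flag check. It is not the CPython source: demo_type, demo_get_flags() and the other demo_* names are hypothetical stand-ins for PyTypeObject, PyType_GetFlags() and PyType_HasFeature(). The only value taken from this issue is the 0x4000000 mask (bit 26) that appears in the disassembly.

```c
#include <assert.h>

/* Hypothetical stand-in for PyTypeObject; only the flags member is modeled. */
typedef struct {
    unsigned long tp_flags;
} demo_type;

/* Bit 26, i.e. the 0x4000000 mask seen in the disassembly. */
#define DEMO_FLAG_TUPLE_SUBCLASS (1UL << 26)

/* Plays the role of PyType_GetFlags(). In CPython the real function lives
 * in typeobject.c, so without LTO a caller in another file cannot inline it. */
unsigned long
demo_get_flags(const demo_type *type)
{
    return type->tp_flags;
}

/* Call-based check: the shape of the check after the commit (assumed). */
static inline int
demo_has_feature_call(const demo_type *type, unsigned long feature)
{
    return (demo_get_flags(type) & feature) != 0;
}

/* Direct-read check: the shape of the check before the commit (assumed). */
static inline int
demo_has_feature_direct(const demo_type *type, unsigned long feature)
{
    return (type->tp_flags & feature) != 0;
}

/* Both variants must always agree; only their cost can differ. */
static inline void
demo_check_agreement(const demo_type *type, unsigned long feature)
{
    assert(demo_has_feature_call(type, feature)
           == demo_has_feature_direct(type, feature));
}
```

Both checks return the same result. The regression is only about whether the compiler emits a real call for the accessor or folds it into a single memory test.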

On Fedora 32, there is no performance difference because binaries are built with GCC using LTO and PGO: the PyType_GetFlags() function call is inlined by GCC 10.

I built Python with clang 11.0.3 on macOS 10.15.4, and I confirm that LTO+PGO allows clang to inline the PyType_GetFlags() function call in tuplegetter_descr_get().

Using "./configure && make":
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
python.exe[0x1001c46ad] <+29>:  callq  0x10009c720               ; PyType_GetFlags at typeobject.c:2338
python.exe[0x1001c46b2] <+34>:  testl  $0x4000000, %eax          ; imm = 0x4000000 

Using "./configure --with-lto --enable-optimizations && make":
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
python.exe[0x1002a9542] <+18>:  movq   0x10(%rbx), %rdx
python.exe[0x1002a9546] <+22>:  movq   0x8(%rsi), %rax
python.exe[0x1002a954a] <+26>:  testb  $0x4, 0xab(%rax)
python.exe[0x1002a9551] <+33>:  je     0x1002a956f               ; <+63>
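
The two listings test the same bit. As I read the disassembly, "testl $0x4000000, %eax" checks bit 26 of the returned flags, while "testb $0x4, 0xab(%rax)" checks bit 2 of the byte at offset 0xab; assuming tp_flags sits at offset 0xa8 on this build, that is byte 3 of the little-endian flags word, and 3 * 8 + 2 = 26. A small sketch to check the arithmetic (the demo_* names are hypothetical, not CPython code):

```c
#include <assert.h>
#include <stdint.h>

/* Mask from the non-inlined build: testl $0x4000000, %eax */
#define DEMO_WORD_MASK 0x4000000UL

/* Mask from the LTO+PGO build: testb $0x4, 0xab(%rax) */
#define DEMO_BYTE_MASK 0x4

/* Whole-word test, as done on the return value of the function call. */
static inline int
demo_test_word(uint64_t flags)
{
    return (flags & DEMO_WORD_MASK) != 0;
}

/* Single-byte test. On little-endian x86-64, byte 3 of the flags word is
 * the byte at (offset of the word) + 3, i.e. 0xa8 + 3 = 0xab under the
 * assumption above. Shifting by 24 selects that byte portably. */
static inline int
demo_test_byte(uint64_t flags)
{
    uint8_t byte3 = (uint8_t)(flags >> 24);
    return (byte3 & DEMO_BYTE_MASK) != 0;
}

/* The two tests agree for any flags value. */
static inline void
demo_check_same_bit(uint64_t flags)
{
    assert(demo_test_word(flags) == demo_test_byte(flags));
}
```

So the optimized build performs the same check with a single memory test and no call.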
