
classification
Title: [macOS] Build macOS installer with LTO and PGO optimizations
Type: performance Stage: resolved
Components: Build, macOS Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: lukasz.langa, methane, ned.deily, rhettinger, ronaldoussoren, vstinner
Priority: normal Keywords: patch

Created on 2020-07-01 10:32 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 21256 closed vstinner, 2020-07-01 10:39
Messages (10)
msg372743 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-07-01 10:32
Link Time Optimization (LTO) and Profile-Guided Optimization (PGO) have a major impact on Python performance: they make Python between 10% and 30% faster (coarse estimation).

Currently, macOS installers distributed on python.org are built with Clang 6.0 without LTO or PGO. I propose to enable LTO and PGO to make these binaries faster.

IMO we should build all new Python macOS installers with these optimizations.

Attached PR adds the flags.


Python 3.9.0b3 binary:

$ python3.9
Python 3.9.0b3 (v3.9.0b3:b484871ba7, Jun  9 2020, 16:05:25) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

configure options:

>>> import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))
'-C' '--enable-framework' '--enable-universalsdk=/' '--with-universal-archs=intel-64' '--with-computed-gotos' '--without-ensurepip' '--with-tcltk-includes=-I/tmp/_py/libraries/usr/local/include' '--with-tcltk-libs=-ltcl8.6 -ltk8.6' 'LDFLAGS=-g' 'CFLAGS=-g' 'CC=gcc'

Compiler flags:

>>> sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST')
'-Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch x86_64 -g-std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I/Users/sysadmin/build/v3.9.0b3/Include/internal'
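The fused `-g-std=c99` token in the output above is most likely an artifact of joining the two variables with a bare `+`, with no separator between the end of PY_CFLAGS and the start of PY_CFLAGS_NODIST. A minimal sketch that joins them with a space instead; `join_flags` is a hypothetical helper for illustration, not part of sysconfig:

```python
import sysconfig


def join_flags(*values):
    """Join flag strings with single spaces, skipping unset or empty ones."""
    # get_config_var() returns None for variables the build does not define.
    return " ".join(v.strip() for v in values if v and v.strip())


if __name__ == "__main__":
    print(join_flags(sysconfig.get_config_var("PY_CFLAGS"),
                     sysconfig.get_config_var("PY_CFLAGS_NODIST")))
```

The same helper works for the PY_LDFLAGS / PY_LDFLAGS_NODIST pair.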

Linker flags:

>>> sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST')
'-arch x86_64 -g'
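
To check other builds the same way, here is a rough sketch that looks for the two configure options in CONFIG_ARGS. `optimization_flags` is a hypothetical helper; it assumes the options were passed literally as `--with-lto` and `--enable-optimizations`, so a build that enabled them some other way (for example through hand-written CFLAGS) would not be detected:

```python
import shlex
import sysconfig


def optimization_flags(config_args):
    """Report whether the LTO/PGO configure options appear in CONFIG_ARGS."""
    # CONFIG_ARGS is a shell-quoted string; it may be None on some platforms.
    args = shlex.split(config_args or "")
    return {
        "lto": "--with-lto" in args,
        "pgo": "--enable-optimizations" in args,
    }


if __name__ == "__main__":
    print(optimization_flags(sysconfig.get_config_var("CONFIG_ARGS")))
```

Applied to the CONFIG_ARGS string shown above, both entries come out False.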
msg372744 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-07-01 10:36
The performance issue was noticed by Raymond Hettinger who ran a microbenchmark on tuplegetter_descr_get(), comparison between Python 3.8 and Python 3.9:

https://mail.python.org/archives/list/python-dev@python.org/message/Q3YHYIKNUQH34FDEJRSLUP2MTYELFWY3/
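
For readers who want to try a comparable measurement, here is a minimal sketch of a namedtuple field-access timing, which exercises tuplegetter_descr_get(). It is not the benchmark from the linked message; the `Point` type, the helper name, and the iteration counts are arbitrary choices for illustration:

```python
import timeit
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])


def time_field_access(number=200_000, repeat=5):
    """Return the best observed time per `p.x` access, in nanoseconds."""
    p = Point(1, 2)
    timer = timeit.Timer("p.x", globals={"p": p})
    # Take the minimum over several runs to reduce noise from other processes.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


if __name__ == "__main__":
    print(f"{time_field_access():.1f} ns per access")
```

Running it under two interpreters (for example 3.8 and 3.9) gives a rough comparison; absolute numbers depend on the machine and the build.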

INADA-san confirms that the performance regression was introduced by the commit 45ec5b99aefa54552947049086e87ec01bc2fc9a (bpo-40170) which changes the PyType_HasFeature() implementation to always call PyType_GetFlags() as a function rather than directly reading the PyTypeObject.tp_flags member.

https://mail.python.org/archives/list/python-dev@python.org/message/FOKJXG2SYMXCHYPGUZWVYMHLDR42BYFB/


On Fedora 32, there is no performance difference because binaries are built with GCC using LTO and PGO: the PyType_GetFlags() function call is inlined by GCC 10.


I built Python with clang 11.0.3 on macOS 10.15.4, and I confirm that LTO+PGO allows clang to inline the PyType_GetFlags() function call in tuplegetter_descr_get().

Using "./configure && make":
---
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
(...)
python.exe[0x1001c46ad] <+29>:  callq  0x10009c720               ; PyType_GetFlags at typeobject.c:2338
python.exe[0x1001c46b2] <+34>:  testl  $0x4000000, %eax          ; imm = 0x4000000 
(...)
---

Using "./configure --with-lto --enable-optimizations && make":
---
$ lldb ./python.exe
(lldb) disassemble --name tuplegetter_descr_get
(...)
python.exe[0x1002a9542] <+18>:  movq   0x10(%rbx), %rdx
python.exe[0x1002a9546] <+22>:  movq   0x8(%rsi), %rax
python.exe[0x1002a954a] <+26>:  testb  $0x4, 0xab(%rax)
python.exe[0x1002a9551] <+33>:  je     0x1002a956f               ; <+63>
(...)
---
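
The two listings test the same flag bit: the first checks 0x4000000 against the 32-bit return value of PyType_GetFlags(), the second checks a single byte in memory. A small worked calculation, assuming little-endian x86-64 and a single-bit mask; reading the resulting 0xa8 as the offset of tp_flags is an inference from the listings, not something stated in this thread:

```python
def byte_test_for(mask):
    """Map a single-bit mask to (byte index, one-byte mask) on little-endian."""
    bit = mask.bit_length() - 1
    if mask != 1 << bit:
        raise ValueError("expected a single-bit mask")
    byte_index, bit_in_byte = divmod(bit, 8)
    return byte_index, 1 << bit_in_byte


byte_index, byte_mask = byte_test_for(0x4000000)  # immediate of the testl
field_offset = 0xab - byte_index                  # displacement of the testb
print(byte_index, hex(byte_mask), hex(field_offset))  # prints: 3 0x4 0xa8
```

So `testb $0x4, 0xab(%rax)` is the inlined equivalent of the call followed by `testl $0x4000000, %eax`.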
msg372747 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2020-07-01 11:10
The installer is built on Mac OS X 10.9 so that it is forward compatible with all OS X and macOS versions. We cannot depend on PGO and LTO for it unless we start building the installer on 10.15. We cannot do this currently as those installers would not work with older macOS and OS X versions.

Since the Mac is switching to Apple Silicon, the plan is to start building a separate macOS 11+ installer. *That* could use PGO and LTO.
msg372758 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-07-01 15:24
> We cannot depend on PGO and LTO for it unless we start building the installer on 10.15.

Clang 6.0 doesn't support LTO and PGO? Would you mind elaborating?
msg372764 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2020-07-01 16:20
> Clang 6.0 doesn't support LTO and PGO?

No, it appears not.  And it's not an oversight that we don't use these options.

As Łukasz points out, the current macOS installer variants we supply are designed to run on all Mac systems from macOS 10.9 on. To accomplish that safely, we build the Python binaries on a macOS 10.9 system to ensure they will be compatible; in other words, we build on the oldest system supported and rely on upward compatibility when running on newer systems. The other approach is to build on the newest systems available after adding runtime checks throughout the C code to test for the presence of newer features (i.e. runtime calls that have been added in an operating system release newer than the oldest one supported).  While this ("weaklinking") can be a viable option, it's a lot more work to implement initially and then keep updated over each o/s release to avoid segfaults and other failures when users on older systems try to use newer features.  Eventually we would like to fully support weaklinking so that we could provide one installer variant for all supported o/s versions that has all features available at each o/s version, but it's not a high priority item at the moment (for example, supporting the upcoming 11.0 Big Sur with Apple Silicon is) and the current practices have worked well for many years.
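
As an analogy only, the "test for presence at runtime, then call" pattern can be sketched from Python with ctypes. This is not how CPython's C code implements weak linking; `symbol_available` is a hypothetical helper, and the sketch assumes a POSIX system where `ctypes.CDLL(None)` opens the current process:

```python
import ctypes


def symbol_available(name, library=None):
    """Return True if `name` resolves in `library` (default: current process)."""
    lib = ctypes.CDLL(library)
    try:
        # Attribute lookup triggers symbol resolution and raises
        # AttributeError when the symbol is missing.
        getattr(lib, name)
    except AttributeError:
        return False
    return True


if __name__ == "__main__":
    for name in ("getpid", "no_such_symbol_for_this_sketch"):
        print(name, symbol_available(name))
```

Code written this way has to carry a fallback path for every newer call it guards, which is the maintenance cost described above.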

Keep in mind that the main goal of the python.org macOS installers is to provide a single installable binary that works correctly on a wide range of macOS releases and hardware.  What we provide today works on all Macs capable of running macOS 10.9 or later.  In particular, it is *not* a goal to provide the most optimized configuration for a particular system.  In general, considering the range of hardware and operating system releases, that's not easy to do. I believe that the intended users for the python.org macOS pythons are (1) beginners (like in a teaching environment where ease of deployment and uniformity is key) and (2) third-party Mac applications developers who want an embeddable Python that will allow their applications to work on multiple levels of macOS. If you are looking for the highest performance for a particular use, like benchmarking, you should look elsewhere - like one of the third-party distributors who specialize in numeric Pythons - or build it yourself on your own system.

So, thanks for the suggestion but we won't be using it now. Sometime in the future, if and when we support weaklinking and/or use newer toolchains across the board, we will look at adding these and other optimizations.
msg372766 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-07-01 16:40
>> Clang 6.0 doesn't support LTO and PGO?
> No, it appears not.

That's really surprising. I see LTO mentioned in LLVM 3.4 changelog for example:
https://releases.llvm.org/3.4/tools/clang/docs/ReleaseNotes.html#new-compiler-flags

Did you try to build Python with my PR? Which error message do you get? How can I try? I only own a macbook which runs a recent macOS version. Maybe I could try to get clang 6.0 on Linux.

If PGO is not available, just enabling LTO should already make Python significantly faster.

I understand why Python is built on macOS 10.9, and this issue and my PR don't change anything about that. I'm not asking to require newer CPU features, newer macOS APIs, or newer syscalls. LTO only changes how Python itself is built.
msg372767 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-07-01 16:43
If clang 6.0 is a dead end for LTO, another option is to build a recent clang version on macOS 10.9. If I manage to do that, would it sound like an acceptable solution? I don't expect any API/ABI issue just by changing the clang version. Upgrading clang should not change the semantics.
msg372778 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2020-07-01 18:18
I should have made it clearer that we expect to release a new installer variant for macOS 11.6 Big Sur that supports both Intel and Apple Silicon architectures later this year (i.e. in several months) when Big Sur releases. It will be much easier to support newer optimizations in that variant.  We are in the process right now of getting builds to work on the developer previews and on developer hardware. We will look at optimizations for that variant then.

Please drop the idea of trying to change how we build on 10.9 (and, yes, we are perfectly capable of finding newer compilers to run on 10.9 but that's not the point - we *only* support building installers with standard Apple Developer Tool chains and with good reason); hacking on 10.9 is not worth it at this point.
msg372779 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2020-07-01 18:18
er, "macOS 11.0 Big Sur" :)
msg380153 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-11-01 16:37
> Link Time Optimization (LTO) and Profile-Guided Optimization (PGO) have a major impact on Python performance: they make Python between 10% and 30% faster (coarse estimation).
>
> Currently, macOS installers distributed on python.org are built with Clang 6.0 without LTO or PGO. I propose to enable LTO and PGO to make these binaries faster.

Oh, I forgot to mention that I discovered that the macOS installers are built without LTO while working on https://bugs.python.org/issue39542#msg373230.
History
Date                 User          Action  Args
2022-04-11 14:59:33  admin         set     github: 85353
2020-11-01 16:37:44  vstinner      set     messages: + msg380153
2020-07-01 18:18:58  ned.deily     set     messages: + msg372779
2020-07-01 18:18:19  ned.deily     set     messages: + msg372778
2020-07-01 16:43:28  vstinner      set     messages: + msg372767
2020-07-01 16:40:00  vstinner      set     messages: + msg372766
2020-07-01 16:20:28  ned.deily     set     status: open -> closed; resolution: not a bug; messages: + msg372764; stage: patch review -> resolved
2020-07-01 15:24:41  vstinner      set     messages: + msg372758
2020-07-01 11:10:41  lukasz.langa  set     nosy: + lukasz.langa; messages: + msg372747
2020-07-01 10:46:48  vstinner      set     nosy: + rhettinger, ronaldoussoren, ned.deily, methane; components: + macOS
2020-07-01 10:39:15  vstinner      set     keywords: + patch; stage: patch review; pull_requests: + pull_request20402
2020-07-01 10:36:38  vstinner      set     messages: + msg372744
2020-07-01 10:32:21  vstinner      create