[macOS] Build macOS installer with LTO and PGO optimizations #85353

Closed
vstinner opened this issue Jul 1, 2020 · 10 comments
Labels
3.8 only security fixes 3.9 only security fixes 3.10 only security fixes build The build process and cross-build OS-mac performance Performance or resource usage

Comments

@vstinner
Member

vstinner commented Jul 1, 2020

BPO 41181
Nosy @rhettinger, @ronaldoussoren, @vstinner, @ned-deily, @methane, @ambv
PRs
  • bpo-41181: macOS build script uses LTO and PGO #21256
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2020-07-01.16:20:28.758>
    created_at = <Date 2020-07-01.10:32:21.690>
    labels = ['OS-mac', '3.8', '3.9', '3.10', 'performance', 'build', 'invalid']
    title = '[macOS] Build macOS installer with LTO and PGO optimizations'
    updated_at = <Date 2020-11-01.16:37:44.044>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-11-01.16:37:44.044>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-07-01.16:20:28.758>
    closer = 'ned.deily'
    components = ['Build', 'macOS']
    creation = <Date 2020-07-01.10:32:21.690>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 41181
    keywords = ['patch']
    message_count = 10.0
    messages = ['372743', '372744', '372747', '372758', '372764', '372766', '372767', '372778', '372779', '380153']
    nosy_count = 6.0
    nosy_names = ['rhettinger', 'ronaldoussoren', 'vstinner', 'ned.deily', 'methane', 'lukasz.langa']
    pr_nums = ['21256']
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue41181'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    @vstinner
    Member Author

    vstinner commented Jul 1, 2020

    Link Time Optimization (LTO) and Profile-Guided Optimization (PGO) have a major impact on Python performance: they make Python between 10% and 30% faster (coarse estimation).

    Currently, macOS installers distributed on python.org are built with Clang 6.0 without LTO or PGO. I propose to enable LTO and PGO to make these binaries faster.

    IMO we should build all new Python macOS installers with these optimizations.

    Attached PR adds the flags.

    Python 3.9.0b3 binary:

    $ python3.9
    Python 3.9.0b3 (v3.9.0b3:b484871ba7, Jun  9 2020, 16:05:25) 
    [Clang 6.0 (clang-600.0.57)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.

    configure options:

    >>> import sysconfig; print(sysconfig.get_config_var('CONFIG_ARGS'))
    '-C' '--enable-framework' '--enable-universalsdk=/' '--with-universal-archs=intel-64' '--with-computed-gotos' '--without-ensurepip' '--with-tcltk-includes=-I/tmp/_py/libraries/usr/local/include' '--with-tcltk-libs=-ltcl8.6 -ltk8.6' 'LDFLAGS=-g' 'CFLAGS=-g' 'CC=gcc'

    Compiler flags:

    >>> sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST')
    '-Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch x86_64 -g-std=c99 -Wextra -Wno-unused-result -Wno-unused-parameter -Wno-missing-field-initializers -Wstrict-prototypes -Werror=implicit-function-declaration -fvisibility=hidden  -I/Users/sysadmin/build/v3.9.0b3/Include/internal'

    Linker flags:

    >>> sysconfig.get_config_var('PY_LDFLAGS') + sysconfig.get_config_var('PY_LDFLAGS_NODIST')
    '-arch x86_64 -g'
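The same `sysconfig` values can be checked programmatically. The following is a minimal sketch, not part of the attached PR: the helper name `optimization_hints` is hypothetical, and it assumes that an LTO build leaves an `-flto` flag in the recorded compiler or linker flags, which may not hold for every toolchain.

```python
import shlex
import sysconfig


def optimization_hints(config_args, cflags, ldflags):
    """Report whether the recorded build settings mention LTO or PGO.

    Hypothetical helper: it only inspects strings, it does not prove
    that the binary was actually optimized.
    """
    try:
        args = shlex.split(config_args or "")
    except ValueError:
        # Fall back to naive splitting if the quoting is unbalanced.
        args = (config_args or "").split()
    flags = (cflags or "").split() + (ldflags or "").split()
    return {
        # --with-lto is the configure switch for Link Time Optimization.
        "with_lto": any(a.startswith("--with-lto") for a in args),
        # --enable-optimizations is the configure switch that turns on PGO.
        "enable_optimizations": "--enable-optimizations" in args,
        # Assumption: LTO builds record a -flto flag in the flags.
        "flto_flag": any(f.startswith("-flto") for f in flags),
    }


def get_var(name):
    """Return a config variable as a string ('' when it is missing)."""
    return str(sysconfig.get_config_var(name) or "")


print(optimization_hints(
    get_var("CONFIG_ARGS"),
    get_var("PY_CFLAGS") + " " + get_var("PY_CFLAGS_NODIST"),
    get_var("PY_LDFLAGS") + " " + get_var("PY_LDFLAGS_NODIST"),
))
```

Run against the configure options shown above, every entry would be false, since neither `--with-lto` nor `--enable-optimizations` appears in them.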

    @vstinner vstinner added 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes build The build process and cross-build performance Performance or resource usage labels Jul 1, 2020
    @vstinner
    Member Author

    vstinner commented Jul 1, 2020

    The performance issue was noticed by Raymond Hettinger, who ran a microbenchmark on tuplegetter_descr_get() comparing Python 3.8 and Python 3.9:

    https://mail.python.org/archives/list/python-dev@python.org/message/Q3YHYIKNUQH34FDEJRSLUP2MTYELFWY3/
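For readers who want to reproduce this kind of measurement, here is a minimal sketch of a comparable microbenchmark. It is not the benchmark from the linked message: the `Point` type and the `time_attribute_access` helper are hypothetical, and the sketch relies on the fact that reading a named tuple field by name goes through the tuplegetter descriptor in CPython 3.8 and later.

```python
import timeit
from collections import namedtuple

# Hypothetical example type; any named tuple exercises the same descriptor.
Point = namedtuple("Point", ["x", "y"])


def time_attribute_access(number=200_000, repeat=5):
    """Return the best observed time per `p.x` lookup, in nanoseconds."""
    p = Point(1, 2)
    timer = timeit.Timer("p.x", globals={"p": p})
    # Take the minimum over several runs to reduce noise from the system.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


print(f"namedtuple attribute access: {time_attribute_access():.1f} ns per lookup")
```

Running the same script under two interpreters (for example a plain build and an LTO+PGO build) gives a rough comparison; the absolute numbers depend heavily on the machine.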

    INADA-san confirms that the performance regression was introduced by commit 45ec5b9 (bpo-40170), which changes the PyType_HasFeature() implementation to always call PyType_GetFlags() as a function rather than reading the PyTypeObject.tp_flags member directly.

    https://mail.python.org/archives/list/python-dev@python.org/message/FOKJXG2SYMXCHYPGUZWVYMHLDR42BYFB/

    On Fedora 32, there is no performance difference because binaries are built with GCC using LTO and PGO: the PyType_GetFlags() function call is inlined by GCC 10.

    I built Python with clang 11.0.3 on macOS 10.15.4, and I confirm that LTO+PGO allows the PyType_GetFlags() function call to be inlined in tuplegetter_descr_get().

    Using "./configure && make":
    ---

    $ lldb ./python.exe
    (lldb) disassemble --name tuplegetter_descr_get
    (...)
    python.exe[0x1001c46ad] <+29>:  callq  0x10009c720               ; PyType_GetFlags at typeobject.c:2338
    python.exe[0x1001c46b2] <+34>:  testl  $0x4000000, %eax          ; imm = 0x4000000 
    (...)

    Using "./configure --with-lto --enable-optimizations && make":
    ---

    $ lldb ./python.exe
    (lldb) disassemble --name tuplegetter_descr_get
    (...)
    python.exe[0x1002a9542] <+18>:  movq   0x10(%rbx), %rdx
    python.exe[0x1002a9546] <+22>:  movq   0x8(%rsi), %rax
    python.exe[0x1002a954a] <+26>:  testb  $0x4, 0xab(%rax)
    python.exe[0x1002a9551] <+33>:  je     0x1002a956f               ; <+63>
    (...)
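The two listings test the same flag bit in different ways, which the arithmetic below tries to make explicit. This is a sketch under stated assumptions: that the bit is `Py_TPFLAGS_TUPLE_SUBCLASS` (bit 26 in CPython's `Include/object.h`), that `tp_flags` starts at offset 0xa8 of the type object in this 64-bit build, and that the machine is little-endian, so the byte at offset 0xab holds bits 24 to 31.

```python
from collections import namedtuple

# Py_TPFLAGS_TUPLE_SUBCLASS as defined in CPython's Include/object.h.
TUPLE_SUBCLASS_BIT = 1 << 26

# Non-inlined build: `testl $0x4000000, %eax` tests the 32-bit value
# returned by PyType_GetFlags().
full_mask = 0x4000000

# Inlined build: `testb $0x4, 0xab(%rax)` tests one byte of tp_flags.
# Assumption: tp_flags starts at 0xa8, so 0xab is byte index 3.
byte_mask = 0x4
byte_index = 0xAB - 0xA8
mask_from_byte = byte_mask << (8 * byte_index)


def is_tuple_subclass(tp):
    """Python-level equivalent of the flag test, using type.__flags__."""
    return bool(tp.__flags__ & TUPLE_SUBCLASS_BIT)


Point = namedtuple("Point", ["x", "y"])  # hypothetical example type

print(hex(full_mask), hex(mask_from_byte), hex(TUPLE_SUBCLASS_BIT))
print(is_tuple_subclass(tuple), is_tuple_subclass(Point), is_tuple_subclass(list))
```

Both masks reduce to the same bit, so the optimized build performs the same check without the function call.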

    @ambv
    Contributor

    ambv commented Jul 1, 2020

    The installer is built on Mac OS X 10.9 so that it is forward compatible with all OS X and macOS versions. We cannot depend on PGO and LTO for it unless we start building the installer on 10.15. We cannot do this currently as those installers would not work with older macOS and OS X versions.

    Since the Mac is switching to Apple Silicon, the plan is to start building a separate macOS 11+ installer. *That* could use PGO and LTO.

    @vstinner
    Member Author

    vstinner commented Jul 1, 2020

    > We cannot depend on PGO and LTO for it unless we start building the installer on 10.15.

    Clang 6.0 doesn't support LTO and PGO? Would you mind elaborating?

    @ned-deily
    Member

    > Clang 6.0 doesn't support LTO and PGO?

    No, it appears not. And it's not an oversight that we don't use these options.

    As Łukasz points out, the current macOS installer variants we supply are designed to run on all Mac systems from macOS 10.9 on. To accomplish that safely, we build the Python binaries on a macOS 10.9 system to ensure they will be compatible; in other words, we build on the oldest system supported and rely on upward compatibility when running on newer systems. The other approach is to build on the newest systems available after adding runtime checks throughout the C code to test for the presence of newer features (i.e. runtime calls that have been added in an operating system release newer than the oldest one supported). While this ("weaklinking") can be a viable option, it's a lot more work to implement initially and then keep updated over each o/s release to avoid segfaults and other failures when users on older systems try to use newer features. Eventually we would like to fully support weaklinking so that we could provide one installer variant for all supported o/s versions that has all features available at each o/s version, but it's not a high priority item at the moment (for example, supporting the upcoming 11.0 Big Sur with Apple Silicon is) and the current practices have worked well for many years.
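The "build on the oldest system, run on newer ones" rule can be inspected from Python. The sketch below is an illustration only, not how the installers are validated: the helper names are hypothetical, and it assumes that `MACOSX_DEPLOYMENT_TARGET` is recorded in the build configuration (it is typically absent on non-macOS builds).

```python
import platform
import sysconfig


def parse_version(text):
    """Turn '10.9' into (10, 9); non-numeric parts are ignored."""
    return tuple(int(part) for part in str(text).split(".") if part.isdigit())


def runs_on(target, running):
    """True if a binary built for `target` is expected to run on `running`."""
    return running >= target


def deployment_report():
    """Compare the build's deployment target with the running macOS version."""
    # Assumption: this variable is only set for macOS builds.
    target = sysconfig.get_config_var("MACOSX_DEPLOYMENT_TARGET")
    running = platform.mac_ver()[0]  # empty string on non-macOS systems
    return {
        "deployment_target": parse_version(target) if target else None,
        "running_macos": parse_version(running) if running else None,
    }


print(deployment_report())
print(runs_on(parse_version("10.9"), parse_version("10.15.4")))
```

A binary built with a 10.9 target passes this check on 10.15.4, while the reverse direction is what weaklinking with runtime checks would have to handle.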

    Keep in mind that the main goal of the python.org macOS installers is to provide a single installable binary that works correctly on a wide range of macOS releases and hardware. What we provide today works on all Macs capable of running macOS 10.9 or later. In particular, it is *not* a goal to provide the most optimized configuration for a particular system. In general, considering the range of hardware and operating system releases, that's not easy to do. I believe that the intended users for the python.org macOS pythons are (1) beginners (like in a teaching environment where ease of deployment and uniformity is key) and (2) third-party Mac application developers who want an embeddable Python that will allow their applications to work on multiple levels of macOS. If you are looking for the highest performance for a particular use, like benchmarking, you should look elsewhere - like one of the third-party distributors who specialize in numeric Pythons - or build it yourself on your own system.

    So, thanks for the suggestion but we won't be using it now. Sometime in the future, if and when we support weaklinking and/or use newer toolchains across the board, we will look at adding these and other optimizations.

    @vstinner
    Member Author

    vstinner commented Jul 1, 2020

    > > Clang 6.0 doesn't support LTO and PGO?
    > No, it appears not.

    That's really surprising. I see LTO mentioned in the LLVM 3.4 changelog, for example:
    https://releases.llvm.org/3.4/tools/clang/docs/ReleaseNotes.html#new-compiler-flags

    Did you try to build Python with my PR? Which error message do you get? How can I try? I only own a MacBook which runs a recent macOS version. Maybe I could try to get clang 6.0 on Linux.

    If PGO is not available, just enabling LTO should already make Python significantly faster.

    I understand why Python is built on macOS 10.9, and this issue and my PR don't change anything about that. I'm not asking to require newer CPU features, newer macOS APIs or newer syscalls. LTO only changes how Python itself is built.

    @vstinner
    Member Author

    vstinner commented Jul 1, 2020

    If clang 6.0 is a dead end for LTO, another option is to build a recent clang version on macOS 10.9. If I manage to do that, would that be an acceptable solution? I don't expect any API/ABI issue just from changing the clang version. Upgrading clang should not change the semantics.

    @ned-deily
    Member

    I should have made it clearer that we expect to release a new installer variant for macOS 11.6 Big Sur that supports both Intel and Apple Silicon architectures later this year (i.e. in several months) when Big Sur releases. It will be much easier to support newer optimizations in that variant. We are in the process right now of getting builds to work on the developer previews and on developer hardware. We will look at optimizations for that variant then.

    Please drop the idea of trying to change how we build on 10.9 (and, yes, we are perfectly capable of finding newer compilers to run on 10.9 but that's not the point - we *only* support building installers with standard Apple Developer Tool chains and with good reason); hacking on 10.9 is not worth it at this point.

    @ned-deily
    Member

    er, "macOS 11.0 Big Sur" :)

    @vstinner
    Member Author

    vstinner commented Nov 1, 2020

    > Link Time Optimization (LTO) and Profile-Guided Optimization (PGO) have a major impact on Python performance: they make Python between 10% and 30% faster (coarse estimation).
    >
    > Currently, macOS installers distributed on python.org are built with Clang 6.0 without LTO or PGO. I propose to enable LTO and PGO to make these binaries faster.

    Oh, I forgot to mention that I discovered that the macOS installer doesn't use LTO while working on https://bugs.python.org/issue39542#msg373230.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022