Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile libpython with -fno-semantic-interposition #83161

Closed
vstinner opened this issue Dec 5, 2019 · 22 comments
Closed

Compile libpython with -fno-semantic-interposition #83161

vstinner opened this issue Dec 5, 2019 · 22 comments
Labels
3.10 only security fixes build The build process and cross-build performance Performance or resource usage

Comments

@vstinner
Copy link
Member

vstinner commented Dec 5, 2019

BPO 38980
Nosy @gpshead, @vstinner, @encukou, @methane, @serhiy-storchaka, @stratakis, @ammaraskar, @hroncok, @corona10, @pablogsal, @tianon, @shihai1991, @FireballDWF
PRs
  • bpo-38980: Add -fno-semantic-interposition when building with optimizations #22862
  • bpo-38980: Only apply -fno-semantic-interposition if available #22892
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-10-22.16:12:29.772>
    created_at = <Date 2019-12-05.16:00:31.384>
    labels = ['build', '3.10', 'performance']
    title = 'Compile libpython with -fno-semantic-interposition'
    updated_at = <Date 2020-10-27.02:07:22.639>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-10-27.02:07:22.639>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-10-22.16:12:29.772>
    closer = 'petr.viktorin'
    components = ['Build']
    creation = <Date 2019-12-05.16:00:31.384>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 38980
    keywords = ['patch']
    message_count = 22.0
    messages = ['357856', '357858', '357860', '357861', '357862', '357888', '357889', '357890', '357891', '357905', '357926', '358684', '372687', '372877', '372879', '379184', '379188', '379225', '379257', '379293', '379310', '379712']
    nosy_count = 13.0
    nosy_names = ['gregory.p.smith', 'vstinner', 'petr.viktorin', 'methane', 'serhiy.storchaka', 'cstratak', 'ammar2', 'hroncok', 'corona10', 'pablogsal', 'tianon', 'shihai1991', 'David Filiatrault']
    pr_nums = ['22862', '22892']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue38980'
    versions = ['Python 3.10']

    @vstinner
    Copy link
    Member Author

    vstinner commented Dec 5, 2019

    The Fedora packaging has been modified to compile libpython with -fno-semantic-interposition flag: it makes Python up to 1.3x faster without having to touch any line of the C code! See pyperformance results:
    https://fedoraproject.org/wiki/Changes/PythonNoSemanticInterpositionSpeedup#Benefit_to_Fedora

    The main drawback is that -fno-semantic-interposition prevents to override Python symbols using a custom library preloaded by LD_PRELOAD. For example, override PyErr_Occurred() function.

    We (authors of the Fedora change) failed to find any use case for LD_PRELOAD.

    To be honest, I found *one* user in the last 10 years who used LD_PRELOAD to track memory allocations in Python 2.7. This use case is no longer relevant in Python 3 with PEP-445 which provides a supported C API to override Python memory allocators or to install hooks on Python memory allocators. Moreover, tracemalloc is a nice way to track memory allocations.

    Is there anyone aware of any special use of LD_PRELOAD for libpython?

    To be clear: -fno-semantic-interposition only impacts libpython. All other libraries still respect LD_PRELOAD. For example, it is still possible to override glibc malloc/free.

    Why -fno-semantic-interposition makes Python faster? There are multiple reasons. For of all, libpython makes a lot of function calls to libpython. Like really a lot, especially in the hot code paths. Without -fno-semantic-interposition, function calls to libpython requires to get through "interposition": for example "Procedure Linkage Table" (PLT) indirection on Linux. It prevents function inlining which has a major impact on performance (missed optimization). In short, even with PGO and LTO, libpython function calls have two performance "penalities":

    • indirect function calls (PLT)
    • no inlining

    I'm comparing Python performance of "statically linked Python" (Debian/Ubuntu choice: don't use ./configure --enable-shared, python is not linked to libpython) to "dynamically linked Python" (Fedora choice: use "./configure --enable-shared", python is dynamically linked to libpython).

    With -fno-semantic-interposition, function calls are direct and can be inlined when appropriate. You don't have to trust me, look at pyperformance benchmark results ;-)

    When using ./configure --enable-shared (libpython), the "python" binary is exactly one function call and that's all:

    int main(int argc, char **argv)
    { return Py_BytesMain(argc, argv); }

    So 100% of the time is only spent in libpython.

    For a longer rationale, see the accepted Fedora change:
    https://fedoraproject.org/wiki/Changes/PythonNoSemanticInterpositionSpeedup

    @vstinner vstinner added 3.9 only security fixes build The build process and cross-build performance Performance or resource usage labels Dec 5, 2019
    @vstinner
    Copy link
    Member Author

    vstinner commented Dec 5, 2019

    Maybe we need to offer a way to *opt out* from -fno-semantic-interposition. For example, ./configure --with-interposition. The default would be --without-interposition.

    @pablogsal
    Copy link
    Member

    I have seen people using LD_PRELOAD to interpose some auditing functions that can modify the actual call into libpython, or to interpose faster versions of some functions or to collect metrics (although there are better ways).

    If we do this by default, once functions will be inlined these use cases will be broken.

    @vstinner
    Copy link
    Member Author

    vstinner commented Dec 5, 2019

    If we do this by default, once functions will be inlined these use cases will be broken.

    Could these user use a "./configure --with-interposition --enable-shared" build?

    @pablogsal
    Copy link
    Member

    Could these user use a "./configure --with-interposition --enable-shared" build?

    Sure, but the problem is the default value, no?

    Maybe it should only be default when using --with-optimizations

    @stratakis
    Copy link
    Mannequin

    stratakis mannequin commented Dec 5, 2019

    Maybe it should only be default when using --with-optimizations

    I think it will add to the complexity of the --with-optimizations flag which already implies PGO and LTO.

    Maybe an opt-in flag would be better IMHO.

    @pablogsal
    Copy link
    Member

    I think it will add to the complexity of the --with-optimizations flag which already implies PGO and LTO.

    That is why I was suggesting it: --with-optimizations for me means "activate everything that you can to make python faster".

    @ammaraskar
    Copy link
    Member

    Just for a quick datapoint: llvm/clang do this by default and you need an explicit -fsemantic-interposition to disable it http://lists.llvm.org/pipermail/llvm-dev/2016-November/107625.html

    It seems to me that the performance gains here really outweigh any weird usage of LD_PRELOAD.

    @pablogsal
    Copy link
    Member

    It seems to me that the performance gains here really outweigh any weird usage of LD_PRELOAD.

    I am very convinced of this assertion, but other users could not be, I think the discussion is how to provide/activate the option in the less intrusive way and without breaking too many use cases.

    To be honest, I think it would be very rare for users to use LD_PRELOAD in this way, so I am fine if we activate it by default. But I still think it would be good to discuss these cases and take them into consideration :)

    @serhiy-storchaka
    Copy link
    Member

    In case of malloc, every memory allocating code need to use malloc/calloc/realloc. This is official and the only way to allocate a memory. But we do not guarantee that Python core uses only public C API like PyErr_Occurred(). It can use more low-level and efficient but less safer C API internally. It can replace the function with a macro which access internal structures directly (for compiling the core only). And this is actually the case. Overridding the public C API functions not always has an effect on the core.

    So I think that adding -fno-semantic-interposition will likely not break many things which were not broken before.

    But this should be discussed on Python-Dev. I am sure some C API functions are purposed to be overridden.

    @vstinner
    Copy link
    Member Author

    vstinner commented Dec 6, 2019

    In case of malloc, every memory allocating code need to use malloc/calloc/realloc. This is official and the only way to allocate a memory. But we do not guarantee that Python core uses only public C API like PyErr_Occurred(). It can use more low-level and efficient but less safer C API internally. It can replace the function with a macro which access internal structures directly (for compiling the core only). And this is actually the case. Overridding the public C API functions not always has an effect on the core.

    To confirm what you said: if we take the specific example of PyErr_Occurred(), I recently added a new _PyErr_Occurred() function which is declared as a static inline function. _PyErr_Occurred() cannot be overriden.

    static inline PyObject* _PyErr_Occurred(PyThreadState *tstate)
    {
        assert(tstate != NULL);
        return tstate->curexc_type;
    }

    @vstinner
    Copy link
    Member Author

    Pablo:

    I have seen people using LD_PRELOAD (...) to interpose faster versions of some functions or to collect metrics (although there are better ways).

    IMHO if someone has to go so far into "hacking" Python, they should recompile Python with specific options. I'm not sure that using LD_PRELOAD to get "faster versions of some functions" is the best approach in term of performance, but I expect it to be convenient :-)

    Charalampos:

    I think it will add to the complexity of the --with-optimizations flag which already implies PGO and LTO.

    It doesn't enable LTO, only PGO :-) We had to disable LTO because of multiple compiler bugs last years.

    Serhiy:

    I am sure some C API functions are purposed to be overridden.

    Is it a theorical use case, or are you aware of such use case being currently used in the wild?

    Ammar Askar:

    Just for a quick datapoint: llvm/clang do this by default and you need an explicit -fsemantic-interposition to disable it http://lists.llvm.org/pipermail/llvm-dev/2016-November/107625.html

    Oh, that's really interesting, thanks!

    @vstinner
    Copy link
    Member Author

    We wrote an article about -fno-semantic-interposition flag that we use with GCC on RHEL8 and Fedora:
    https://developers.redhat.com/blog/2020/06/25/red-hat-enterprise-linux-8-2-brings-faster-python-3-8-run-speeds/
    "Enabling this flag disables semantic interposition, which can increase run speed by as much as 30%."

    In short, the flag allows the compiler to inline code and so make further optimizations, when Python is built with --enable-shared.

    @gpshead
    Copy link
    Member

    gpshead commented Jul 2, 2020

    Yes this should become part of --with-optimizations when building on a platform using a compiler that (a) supports it and (b) where it matters.

    If this is only relevant on --enable-shared builds (not the default), i'd assume also make it conditional on that.

    I never use --enable-shared builds myself.

    @gpshead
    Copy link
    Member

    gpshead commented Jul 2, 2020

    and to echo others: Do not worry about LD_PRELOAD users trying to override internals. That is not a supported use case. It is always a hack. anyone using it knows this.

    @ammaraskar
    Copy link
    Member

    Hey Victor, should we try to land this in Python 3.10?

    Given that no one has brought up any big concerns aside from LD_PRELOAD based hacks and how clang has already had this as the default I think it's relatively safe to make a default for with-optimizations.

    @methane
    Copy link
    Member

    methane commented Oct 21, 2020

    +1

    @methane methane added 3.10 only security fixes and removed 3.9 only security fixes labels Oct 21, 2020
    @pablogsal
    Copy link
    Member

    Victor is on vacation for some weeks, so I am creating a PR to push this forward.

    @pablogsal
    Copy link
    Member

    New changeset b451b0e by Pablo Galindo in branch 'master':
    bpo-38980: Add -fno-semantic-interposition when building with optimizations (GH-22862)
    b451b0e

    @encukou
    Copy link
    Member

    encukou commented Oct 22, 2020

    I was too eager in reviewing this :(
    It turns out -fno-semantic-interposition is GCC 5.3, so builds fail on older GCC.

    I'm researching how to make this conditional in autotools.

    @encukou encukou reopened this Oct 22, 2020
    @encukou
    Copy link
    Member

    encukou commented Oct 22, 2020

    New changeset c6d7e82 by Petr Viktorin in branch 'master':
    bpo-38980: Only apply -fno-semantic-interposition if available (GH-22892)
    c6d7e82

    @encukou encukou closed this as completed Oct 22, 2020
    @vstinner
    Copy link
    Member Author

    Since Fedora and RHEL build Python with -fno-semantic-interposition, we did not get any user bug report about the LD_PRELOAD use case. IMO we can safely consider that no user rely on LD_PRELOAD to override libpython symbols.

    Thanks for implementing the feature Pablo and Petr!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes build The build process and cross-build performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants