
[subinterpreters] Per-interpreter singletons (None, True, False, etc.) #83692

Closed
vstinner opened this issue Jan 31, 2020 · 33 comments
Labels
3.9 only security fixes topic-subinterpreters

Comments

@vstinner
Member

BPO 39511
Nosy @rhettinger, @vstinner, @ericsnowcurrently, @mloskot, @pablogsal, @h-vetinari, @JunyiXie, @KubaO
PRs
  • bpo-39511: PyThreadState_Clear() calls on_delete #18296
  • bpo-39511: Fix multiprocessing semlock_acquire() #18298
  • [WIP] bpo-39511: Add Py_GetNone() and Py_GetNoneRef() functions #18301
  • bpo-43503: Make limited API objects effectively immutable. #24828
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-01-31.15:33:35.261>
    labels = ['expert-subinterpreters', '3.9']
    title = '[subinterpreters] Per-interpreter singletons (None, True, False, etc.)'
    updated_at = <Date 2022-01-28.13:50:38.344>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2022-01-28.13:50:38.344>
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Subinterpreters']
    creation = <Date 2020-01-31.15:33:35.261>
    creator = 'vstinner'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39511
    keywords = ['patch']
    message_count = 32.0
    messages = ['361121', '361122', '361137', '361138', '361145', '361146', '361161', '361163', '361164', '361256', '361257', '361258', '361355', '361356', '361357', '361380', '361382', '361543', '364384', '364429', '364446', '364469', '364471', '364485', '366342', '368937', '388409', '388417', '388894', '388921', '411512', '411996']
    nosy_count = 8.0
    nosy_names = ['rhettinger', 'vstinner', 'eric.snow', 'mloskot', 'pablogsal', 'h-vetinari', 'JunyiXie', 'KubaO']
    pr_nums = ['18296', '18298', '18301', '24828']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue39511'
    versions = ['Python 3.9']

    @vstinner
    Member Author

    The long-term goal of PEP-554 is to run two Python interpreters in parallel. To achieve this goal, no object can be shared between two interpreters. See for example my article "Pass the Python thread state explicitly", which gives a longer rationale:
    https://vstinner.github.io/cpython-pass-tstate.html

    In bpo-38858, I modified Objects/longobject.c to have per-interpreter small integer singletons: commit 630c8df.

    This issue is about other singletons like None or Py_True which are currently shared between two interpreters.

    I propose to add new functions. Example for None:

    • Py_GetNone(): return a *borrowed* reference to the None singleton (similar to existing Py_None macro)
    • Py_GetNoneRef(): return a *strong* reference to the None singleton (similar to "Py_INCREF(Py_None); return Py_None;" and Py_RETURN_NONE macro)

    And add a PyInterpreterState.none field: a strong reference to the per-interpreter None object.

    We should do that for each of these singletons:

    • None (Py_None)
    • True (Py_True)
    • False (Py_False)
    • Ellipsis (Py_Ellipsis)

    GIL issue
    =========

    Py_GetNone() would look like:

    PyObject* Py_GetNone(void)
    {
        return _PyThreadState_GET()->interp->none;
    }
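
    A strong-reference variant could follow the same pattern. A minimal sketch, assuming the proposed PyInterpreterState.none field (not merged code):

        PyObject* Py_GetNoneRef(void)
        {
            PyObject *none = _PyThreadState_GET()->interp->none;
            Py_INCREF(none);
            return none;
        }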

    Problem: _PyThreadState_GET() returns NULL if the caller function doesn't hold the GIL.

    Using the Python C API when the GIL is not held is an API violation: it is not supported. But it used to work in practice.

    One solution is to fail with an assertion error (abort the process) in debug mode, and let Python crash in release mode.

    Another option is to only fail with an assertion error in debug mode in Python 3.9. In Python 3.9, Py_GetNone() would use PyGILState_GetThisThreadState() function which works even when the GIL is released. In Python 3.10, we would switch to _PyThreadState_GET() and so crash in release mode.
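
    For the first option, a rough sketch of the debug-mode check (illustrative only, same assumptions as above):

        PyObject* Py_GetNone(void)
        {
            PyThreadState *tstate = _PyThreadState_GET();
            /* Debug build: abort with a clear message if the GIL is not held.
               Release build: the NULL dereference below simply crashes. */
            assert(tstate != NULL && "Py_GetNone() called without holding the GIL");
            return tstate->interp->none;
        }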

    One concrete example of such issue can be found in the multiprocessing C code, in semlock_acquire():

                Py_BEGIN_ALLOW_THREADS
                if (timeout_obj == Py_None) {
                    res = sem_wait(self->handle);
                }
                else {
                    res = sem_timedwait(self->handle, &deadline);
                }
                Py_END_ALLOW_THREADS

    Py_None is accessed when the GIL is released.

    @vstinner vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.9 only security fixes labels Jan 31, 2020
    @jkloth
    Contributor

    jkloth commented Jan 31, 2020

    Would it not suffice to just make the singletons "immortal"?

    Without affecting the hot paths that are Py_INCREF and Py_DECREF, _Py_Dealloc could be changed to test for objects with a "special" destructor:

        destructor dealloc = Py_TYPE(op)->tp_dealloc;
        if (dealloc == _Py_SingletonSentinel) {
            /* reset refcnt so as to not return here too often */
            op->ob_refcnt = PY_SSIZE_T_MAX;
        }
        else {
            (*dealloc)(op);
        }

    Even in the presence of multiple mutating threads, the object cannot be destroyed. Worst case, they all call _Py_Dealloc.

    @vstinner
    Member Author

    vstinner commented Feb 1, 2020

    New changeset 7dc1401 by Victor Stinner in branch 'master':
    bpo-39511: Fix multiprocessing semlock_acquire() (GH-18298)

    @vstinner
    Member Author

    vstinner commented Feb 1, 2020

    Would it not suffice to just make the singletons "immortal"?

    The problem is to make Py_INCREF/Py_DECREF efficient. Last time someone tried to use an atomic variable for ob_refcnt, it was 20% slower if I recall correctly. If many threads start to update such an atomic variable, the CPU cache line of common singletons like None, True and False can quickly become a performance bottleneck.

    On the other hand, if each interpreter has its own objects, there is no need to protect ob_refcnt: the interpreter lock protects it.
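
    For reference, a minimal sketch of the difference being discussed (illustrative struct types, not a CPython patch):

        #include <stdatomic.h>

        typedef struct { long ob_refcnt; } obj_plain;
        typedef struct { atomic_long ob_refcnt; } obj_atomic;

        static inline void incref_plain(obj_plain *op) {
            op->ob_refcnt++;    /* plain, non-atomic increment */
        }

        static inline void incref_atomic(obj_atomic *op) {
            /* lock-prefixed on x86; the cost grows further when several
               threads hammer the same object (and thus the same cache line) */
            atomic_fetch_add_explicit(&op->ob_refcnt, 1, memory_order_relaxed);
        }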

    @vstinner
    Member Author

    vstinner commented Feb 1, 2020

    New changeset 4d96b46 by Victor Stinner in branch 'master':
    bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)

    @vstinner
    Member Author

    vstinner commented Feb 1, 2020

    PR 18301 is a WIP showing my intent. I'm not sure whether such a change could land in Python right now. It has several drawbacks, described in my previous messages. I don't know the impact on performance either.

    @jeremykloth
    Mannequin

    jeremykloth mannequin commented Feb 1, 2020

    The problem is to make Py_INCREF/Py_DECREF efficient.

    That is exactly why I didn't propose a change to them. The singletons
    still are refcounted as usual, just that their ob_refcnt is ignored.
    If they somehow reach 0, they just "resurrect" themselves and ignore
    the regular collection behavior. In the presence of multiple
    DECREF'ers, the ob_refcnt field is garbage, but that is OK as it is
    effectively ignored. Practicality vs Purity and all that.

    Last time someone tried to use an atomic variable for ob_refcnt, it was 20% slower if I recall correctly. If many threads start to update such an atomic variable, the CPU cache line of common singletons like None, True and False can quickly become a performance bottleneck.

    Exactly so, hence why I chose the simple solution of effectively
    ignoring ob_refcnt.

    On the other hand, if each interpreter has its own objects, there is no need to protect ob_refcnt: the interpreter lock protects it.

    My solution also does not need any protection around ob_refcnt.

    @vstinner
    Member Author

    vstinner commented Feb 1, 2020

    I vaguely recall discussions about immortal Python objects.

    (*) Instagram gc.freeze()

    (*) Python immortal strings

    • PyUnicode_InternImmortal()
    • SSTATE_INTERNED_IMMORTAL
    • They are only destroyed if Python is compiled with Valgrind or Purify support: see the unicode_release_interned() function

    (*) COUNT_ALLOCS

    • When Python is built with the COUNT_ALLOCS macro defined, types are immortal
    • Many tests have to be skipped if COUNT_ALLOCS is used
    • I plan to remove the COUNT_ALLOCS feature in bpo-39489 :-)

    (*) Static types

    • Types which are not allocated on the heap but are "static" are immortal
    • These types cause various issues and should go away
    • For example, they are incompatible with multi-phase module initialization (PEP-489) and the stable ABI (PEP-384)

    @vstinner
    Member Author

    vstinner commented Feb 1, 2020

    Recently, Petr Viktorin proposed immortal singletons in my latest "Pass the Python thread state to internal C functions" thread on python-dev list:
    https://mail.python.org/archives/list/python-dev@python.org/message/RAVSH7HYHTROXSTUR3677WGTCTEO6FYF/

    In 2004, Jim J. Jewett proposed:

    "What if a few common (constant, singleton) objects (such as None, -1, 0, 1) were declared immortal at compile-time?"

    https://mail.python.org/archives/list/python-dev@python.org/thread/XWGRUATMRAVXXZKQ7QA2YX22KI55T7P5/#OQO7DWRVH7IIOE4RUJ2ZR7S5UT6WOCPS

    @rhettinger
    Contributor

    Is the sub-interpreter PEP approved? If not, I had thought the plan was to only implement PRs that made clean-ups that would have been necessary anyway.

    @rhettinger
    Contributor

    Random idea (not carefully thought-out): Would it be simpler to have these objects just ignore their refcount by having dealloc() be a null operation (or having it set the refcount back to a positive number)? That would let sub-interpreters share the objects without worrying about race-conditions on incref/decref operations. To make this work, the objects can register themselves as permanent, shared, objects; then, during shutdown, we could explicitly call a hard dealloc on those objects.

    @jkloth
    Contributor

    jkloth commented Feb 2, 2020

    +1, obviously, as I came to the same conclusion above (see msg361122)

    @ericsnowcurrently
    Member

    On Sun, Feb 2, 2020 at 2:53 PM Raymond Hettinger <report@bugs.python.org> wrote:

    Is the sub-interpreter PEP approved?

    PEP-554 is not approved yet (and certainly is not guaranteed, though
    I'm hopeful). However, that PEP is exclusively about exposing
    subinterpreters in the stdlib.

    There is no PEP for the effort to isolate subinterpreters and stop
    sharing the GIL between them. We are not planning on having such a
    PEP since the actual proposal is so basic ("make the GIL
    per-interpreter") and there has been no controversy.

    I need to do a better job about communicating the difference, as folks
    keep conflating PEP-554 with the effort to stop sharing the GIL
    between subinterpreters. Note that both are part of my "multi-core
    Python" project. [1]

    If not, I had thought the plan was to only
    implement PRs that made clean-ups that would have been necessary anyway.

    Right. In this case the "cleanup" is how singletons are finalized
    when the runtime shuts down. This has a material impact on projects
    that embed the CPython runtime. Currently runtime finalization is a
    mess.

    That said, making the singletons per-interpreter isn't a prerequisite
    of that cleanup. For the specific case of the per-interpreter GIL world,
    we just need an approach that addresses their life cycle in a
    thread-safe way.

    [1] https://github.com/ericsnowcurrently/multi-core-python

    @ericsnowcurrently
    Member

    On Sun, Feb 2, 2020 at 3:32 PM Raymond Hettinger <report@bugs.python.org> wrote:

    Random idea (not carefully thought-out): Would it be simpler to have these objects just
    ignore their refcount by having dealloc() be a null operation (or having it set the refcount
    back to a positive number)? That would let sub-interpreters share the objects without
    worrying about race-conditions on incref/decref operations.

    This is pretty much one of the two approaches I have been considering. :)

    Just to be clear, singletons normally won't end up with a refcount of
    0, by design. However, if there's an incref race between two
    interpreters (that don't share a GIL) then you *can* end up triggering
    dealloc (via Py_DECREF) and even get negative refcounts (which cause
    Py_FatalError() on debug builds). Here are some mitigations:

    • (as noted above) make dealloc() for singletons a noop
    • (as noted above) in dealloc() set the refcount back to some positive value
    • make the initial refcount sufficiently large such that it is
      unlikely to reach 0 even with races
    • incref each singleton once during each interpreter initialization
      (and decref once during interp. finalization)

    We would have to special-case refleak checks for singletons, to avoid
    false positives.

    Also note that currently extension authors (and CPython contributors)
    can rely on the runtime to identify when they accidentally
    incref/decref a singleton too much (or forget to do so). So it may
    make sense to do something in the "singleton dealloc()" in debug
    builds to provide similar checks.
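
    For example, such a "singleton dealloc()" could look roughly like this (a sketch only, assuming Py_SET_REFCNT() is available; not a concrete proposal):

        static void
        singleton_dealloc(PyObject *op)
        {
        #ifdef Py_DEBUG
            /* keep refcount bugs on singletons visible in debug builds */
            fprintf(stderr, "warning: refcount of an immortal singleton hit zero\n");
        #endif
            /* "resurrect" the object so ordinary DECREFs do not bring us
               back here on every call */
            Py_SET_REFCNT(op, PY_SSIZE_T_MAX / 2);
        }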

    To make this work, the objects can register themselves as permanent, shared, objects;
    then, during shutdown, we could explicitly call a hard dealloc on those objects.

    great point

    @ericsnowcurrently
    Member

    This is pretty much one of the two approaches I have been considering.

    The other approach is to leave the current static singletons alone and
    only use them for the main interpreter. Each subinterpreter would get
    its own copy, created when that interpreter is initialized.

    @vstinner
    Member Author

    vstinner commented Feb 4, 2020

    The other approach is to leave the current static singletons alone and
    only use them for the main interpreter. Each subinterpreter would get
    its own copy, created when that interpreter is initialized.

    Which API should be used in C extensions to be "subinterpreter-safe"? Currently, Py_None is a singleton shared by multiple interpreters. Should all C extensions suddenly use a new Py_GetNone() function which returns the per-interpreter singleton? If yes, that's basically what my PR 18301 does:

       #define Py_None Py_GetNone()
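
    With that macro in place, existing extension code would keep compiling unchanged and simply resolve to the per-interpreter object at run time. An illustrative (hypothetical) snippet:

        static PyObject *
        get_option(PyObject *self, PyObject *arg)
        {
            /* lookup_option() is a made-up helper for this example */
            PyObject *value = lookup_option(arg);
            if (value == Py_None) {     /* expands to Py_GetNone() */
                Py_RETURN_NONE;         /* also goes through the macro */
            }
            return value;
        }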

    @vstinner
    Member Author

    vstinner commented Feb 5, 2020

    (as noted above) make dealloc() for singletons a noop

    I expect issues with negative reference count values. As you wrote, it triggers a fatal error when Python is built in debug mode.

    make the initial refcount sufficiently large such that it is
    unlikely to reach 0 even with races

    Py_None is heavily used. If the reference count is updated by multiple threads with no lock to protect it, there is a significant risk that the value zero will be reached sooner or later.

    --

    In the Linux kernel, they started to use a special type for reference counters, to reduce the risk of vulnerabilities caused by reference counter underflow or overflow.

    The kernel already used the atomic_t type. But the issue here is about bugs: no program is perfect, not even the Linux kernel.

    @vstinner
    Member Author

    vstinner commented Feb 7, 2020

    Ah, I also found the idea of an immortal None in an old discussion on tagged pointers:
    https://mail.python.org/archives/list/capi-sig@python.org/thread/JPUNPN3AILGXOA3C2TTSLMOFNSWJE3QX/

    Stefan Behnel proposed the idea: "All negative refcounts would have special meanings, such as: this is the immortal None, (...)".

    @larryhastings
    Contributor

    We should do that for each singletons:

    • None (Py_None)
    • True (Py_True)
    • False (Py_False)
    • Ellipsis (Py_Ellipsis)

    Aren't there a couple more lurking in the interpreter? E.g. empty tuple, empty frozenset.

    That is exactly why I didn't propose a change to them.
    The singletons still are refcounted as usual,
    just that their ob_refcnt is ignored.
    If they somehow reach 0, they just "resurrect" themselves
    and ignore the regular collection behavior.

    That seems like a very good idea! They don't even need to "resurrect" themselves--we just ensure tp_dealloc is a no-op for those special values. If we do that, different threads and different interpreters can change ob_refcnt willy-nilly, there can be unsafe races between threads, the value could no longer make any sense--but as long as we never free the object, it's all totally fine.

    (Actually: tp_dealloc shouldn't be a no-op--it should add a million to the reference count for these special objects, to forestall future irrelevant calls to tp_dealloc.)

    This might have minor deleterious effects, e.g. sys.getrefcount() would return misleading results for such objects. I think that's acceptable.

    @markshannon
    Member

    Consider the case where a thread that doesn't hold the GIL attempts to get a reference on None.

    The problem with having a single immortal None, is that it will cause data cache thrashing as two different CPUs modify the refcount on the shared None object.
    Each subinterpreter needs its own distinct None.

    None could be made immortal, it just can't be shared between sub-interpreters.

    @vstinner
    Member Author

    Mark:

    The problem with having a single immortal None, is that it will cause data cache thrashing as two different CPUs modify the refcount on the shared None object.

    Yeah, I concur with Mark: having one singleton per interpreter should provide better usage of the CPU caches, especially CPU data cache level 1.

    Mark:

    Consider the case where a thread that doesn't hold the GIL attempts to get a reference on None.

    The main drawback of PR 18301 is that accessing "Py_None" means accessing tstate->interp->none. Except that the commonly used _PyThreadState_GET() returns NULL if the thread doesn't hold the GIL.

    One alternative would be to use PyGILState_GetThisThreadState() but this API doesn't support subinterpreters.

    Maybe we are moving towards major backward incompatible changes required to make the subinterpreter implementation more efficient.

    Maybe CPython should have backward compatible behavior by default (Py_None can be read without holding the GIL), but running subinterpreters in parallel would change the Py_None behavior (it could not be read without holding the GIL).

    I don't know.

    @larryhastings
    Contributor

    The problem with having a single immortal None, is that it will
    cause data cache thrashing as two different CPUs modify the
    refcount on the shared None object.

    That's a very reasonable theory. Personally, I find modern CPU architecture bewildering and unpredictable. So I'd prefer it if somebody tests such performance claims, rather than simply asserting them and having that be the final design.

    @markshannon
    Member

    Having two CPUs write to the same cache line is a well known performance problem. There's nothing special about CPython here.

    The proper name for it seems to be "cache line ping-pong", but a search for "false sharing" might be more informative.
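
    The effect is easy to reproduce outside of CPython. A minimal stand-alone sketch (pthreads; the exact numbers will vary by machine):

        #include <pthread.h>
        #include <stdio.h>
        #include <time.h>

        #define ITERS 100000000L

        /* two counters on the same cache line vs. on padded, separate lines */
        static volatile long same_line[2];
        static struct { volatile long v; char pad[64]; } separate[2];

        static void *bump_same(void *arg) {
            long i = (long)arg;
            for (long n = 0; n < ITERS; n++) same_line[i]++;
            return NULL;
        }

        static void *bump_separate(void *arg) {
            long i = (long)arg;
            for (long n = 0; n < ITERS; n++) separate[i].v++;
            return NULL;
        }

        static double run(void *(*fn)(void *)) {
            pthread_t t[2];
            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            for (long i = 0; i < 2; i++) pthread_create(&t[i], NULL, fn, (void *)i);
            for (long i = 0; i < 2; i++) pthread_join(t[i], NULL);
            clock_gettime(CLOCK_MONOTONIC, &b);
            return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
        }

        int main(void) {
            printf("same cache line:      %.2f s\n", run(bump_same));
            printf("separate cache lines: %.2f s\n", run(bump_separate));
            return 0;
        }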

    @vstinner
    Member Author

    I expect that for objects which are not commonly modified by two interpreters "at the same time", it should be fine.

    But None, True, small integer singletons, Latin-1 single-character str singletons, etc. are likely to be frequently accessed and so can become a bottleneck.

    Moreover, Python has an infamous feature: write (modify ob_refcnt) on "read-only" access :-D See "Copy-on-Read" feature popularized by Instagram ;-)

    https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf

    @vstinner
    Member Author

    bpo-40255 proposes a concrete implementation of immortal objects: it modifies Py_INCREF and Py_DECREF which makes Python up to 1.17x slower.
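
    Roughly speaking, such a change adds a branch to every increment. A simplified sketch (the sentinel value and names are illustrative, not the actual bpo-40255 patch):

        #define IMMORTAL_REFCNT (PY_SSIZE_T_MAX >> 1)

        static inline void
        incref_with_immortal_check(PyObject *op)
        {
            if (Py_REFCNT(op) == IMMORTAL_REFCNT) {
                return;     /* never touch the refcount of immortal objects */
            }
            op->ob_refcnt++;
        }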

    @vstinner vstinner added topic-subinterpreters and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels May 15, 2020
    @markshannon
    Member

    Those numbers are for code without immortal objects.
    They don't apply in this case, as the branch misprediction rate would rise.

    @JunyiXie
    Mannequin

    JunyiXie mannequin commented Mar 10, 2021

    Which API should be used in C extensions to be "subinterpreter-safe"? Currently, Py_None is a singleton shared by multiple interpreters. Should all C extensions suddenly use a new Py_GetNone() function which returns the per-interpreter singleton? If yes, that's basically what my PR 18301 does:

    #define Py_None Py_GetNone()

    After reading your "[WIP] bpo-39511: Add Py_GetNone() and Py_GetNoneRef() functions" PR 18301:

    Actually, interp->none still shares the _Py_NoneStruct variable. When two interpreters modify the interp->none refcount, they modify the same _Py_NoneStruct variable, so the CPU cache line of common singletons like None, True and False can quickly become a performance bottleneck.

    Even with the Py_INCREF(none) call below, running interpreters in parallel will still have thread-safety issues:

    PyStatus
    _Py_InitSingletons(PyThreadState *tstate)
    {
        PyObject *none = &_Py_NoneStruct;
        Py_INCREF(none);
        tstate->interp->none = none;
        return _PyStatus_OK();
    }
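
    A truly per-interpreter None would instead have to be allocated for each interpreter rather than aliasing the shared static. A hypothetical sketch (the helper name is made up; the result would be stored in the interp->none field from PR 18301 instead of &_Py_NoneStruct):

        static PyObject *
        alloc_interp_none(void)
        {
            PyObject *none = PyObject_Malloc(sizeof(PyObject));
            if (none == NULL) {
                return NULL;
            }
            /* share the (immutable) None type, but not the object itself */
            return PyObject_INIT(none, Py_TYPE(Py_None));   /* refcount = 1 */
        }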

    @vstinner
    Member Author

    Actually, interp->none shared _Py_NoneStruct variable.

    My PR 18301 is a draft to check whether we can solve the issue without breaking C API compatibility. You're right that it doesn't solve the issue itself; it only validates the C API approach. IMO, PR 18301 proves that the "#define Py_None Py_GetNone()" trick works.

    --

    By the way, when I worked on a tagged pointer experiment:
    https://github.com/vstinner/cpython/pull/6

    I had to introduce a Py_IS_NONE(op) function, since it was no longer possible to directly compare "op == Py_None".

    static inline int Py_IS_NONE(PyObject *op) {
        return (op == &_Py_NoneStruct || op == _Py_TAGPTR_NONE);
    }

    But this is not needed to solve this issue.

    @rhettinger
    Contributor

    Shouldn't this wait to see if the subinterpreters PEP is approved? Because if it isn't, then no change should be made. We shouldn't change something this fundamental without good cause.

    @vstinner
    Member Author

    Raymond Hettinger: "Shouldn't this wait to see if the subinterpreters PEP is approved? Because if it isn't, then no change should be made. We shouldn't change something this fundamental without good cause."

    I agree that we have reached a point where a PEP is needed before pushing further "controversial" changes related to subinterpreters and bpo-1635741 (especially converting static types to heap types, bpo-40077).

    I plan to write multiple PEPs:

    • One general PEP about the idea of isolating subinterpreters to be able to run them in parallel, without requesting any specific technical change
    • One PEP to solve the problem of static types like &PyLong_Type: see bpo-40601 and bpo-43503
    • More PEPs for other controversial changes

    @KubaO
    Mannequin

    KubaO mannequin commented Jan 24, 2022

    I'm very much looking forward to isolated subinterpreters and thus the per-subinterpreter GIL, as I've been keeping a private exploratory project where I had to make them work.

    Here are my thoughts:

    1. Any sort of reference count on heavily used objects cannot be shared between threads, even if its value is otherwise ignored. I.e., a write-only shared refcount is already a no-no. The mere fact that it's being modified from different threads is a performance bottleneck, as the cache line that holds the refcount has to be shuttled between cores. That's a bad thing, and the penalty only gets worse as time marches on.

    2. For platforms where the C language supports thread-local storage (TLS) at the declaration level, it's trivial to have "global static" immortal objects become thread-local statics; a rough sketch follows below. This could perhaps be an escape hatch to keep old code working to an extent, as opposed to breaking it immediately. On such platforms, the PyGet_Foo API can be on equal footing with the legacy Py_Foo statics, i.e. both would do the same thing. That's how I've done it in my experiment. The obvious problem is that on platforms without compiler support for TLS, Py_Foo would be unavailable, and that's probably a no-go for an API that wouldn't be deprecated. For portability, everyone should be using PyGet_Foo, iff platforms without language-level TLS support are of interest (are they? what would they be?)
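
    A rough sketch of the TLS approach from point 2 (hypothetical names; GCC/Clang and MSVC keywords only):

        #include <Python.h>

        #if defined(_MSC_VER)
        #  define PY_TLS __declspec(thread)
        #else
        #  define PY_TLS __thread       /* GCC/Clang */
        #endif

        /* each thread gets its own "static" pointer to its interpreter's None */
        static PY_TLS PyObject *tls_none = NULL;

        PyObject *
        PyGet_None(void)    /* hypothetical spelling of the PyGet_Foo API */
        {
            if (tls_none == NULL) {
                /* a real implementation would fetch the current interpreter's
                   object; the legacy shared singleton keeps the sketch small */
                tls_none = Py_None;
            }
            return tls_none;
        }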

    @vstinner
    Member Author

    On such platforms, the PyGet_Foo API can be on equal footing with the legacy Py_Foo statics, i.e. both would do the same thing. That's how I've done it in my experiment. The obvious problem is that on platforms without compiler support for TLS, Py_Foo would be unavailable, and that's probably a no-go for an API that wouldn't be deprecated.

    My #62501 PR uses "#define Py_None Py_GetNone()" which is backward compatible in terms of API.

    Py_GetNone() can have various implementations; it doesn't matter at the API level.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @vstinner
    Member Author

    vstinner commented Nov 3, 2022

    It seems like most core devs prefer https://peps.python.org/pep-0683/ (even if it's still a draft). I'm closing this issue.
