classification
Title: [subinterpreters] Per-interpreter singletons (None, True, False, etc.)
Type: Stage: patch review
Components: Subinterpreters Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Mark.Shannon, corona10, eric.snow, jeremy.kloth, jkloth, larry, maciej.szulik, nanjekyejoannah, ncoghlan, phsilva, rhettinger, shihai1991, steve.dower, vstinner
Priority: normal Keywords: patch

Created on 2020-01-31 15:33 by vstinner, last changed 2020-05-15 12:12 by Mark.Shannon.

Pull Requests
URL Status Linked Edit
PR 18296 merged vstinner, 2020-01-31 15:37
PR 18298 merged vstinner, 2020-01-31 16:05
PR 18301 open vstinner, 2020-02-01 01:32
Messages (26)
msg361121 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-01-31 15:33
The long-term goal of the PEP 554 is to run two Python interpreters in parallel. To achieve this goal, no object must be shared between two interpreters. See for example my article "Pass the Python thread state explicitly" which gives a longer rationale:
https://vstinner.github.io/cpython-pass-tstate.html

In bpo-38858, I modified Objects/longobject.c to have per-interpreter small integer singletons: commit 630c8df5cf126594f8c1c4579c1888ca80a29d59.

This issue is about other singletons like None or Py_True which are currently shared between two interpreters.

I propose to add new functions. Example for None:

* Py_GetNone(): return a *borrowed* reference to the None singleton (similar to existing Py_None macro)
* Py_GetNoneRef(): return a *strong* reference to the None singleton (similar to "Py_INCREF(Py_None); return Py_None;" and Py_RETURN_NONE macro)

And add PyInterpreterState.none field: strong reference to the per-interpreter None object.


We should do that for each singletons:

* None (Py_None)
* True (Py_True)
* False (Py_False)
* Ellipsis (Py_Ellipsis)


GIL issue
=========

Py_GetNone() would look like:

PyObject* Py_GetNone(void)
{ return _PyThreadState_GET()->interp->none; }

Problem: _PyThreadState_GET() returns NULL if the caller function doesn't hold the GIL.

Using the Python C API when the GIL is not held is a violation of the API: it is not supported. But it worked previously.

One solution is to fail with an assertion error (abort the process) in debug mode, and let Python crash in release mode.

Another option is to only fail with an assertion error in debug mode in Python 3.9. In Python 3.9, Py_GetNone() would use PyGILState_GetThisThreadState() function which works even when the GIL is released. In Python 3.10, we would switch to _PyThreadState_GET() and so crash in release mode.

One concrete example of such issue can be found in the multiprocessing C code, in semlock_acquire():

            Py_BEGIN_ALLOW_THREADS
            if (timeout_obj == Py_None) {
                res = sem_wait(self->handle);
            }
            else {
                res = sem_timedwait(self->handle, &deadline);
            }
            Py_END_ALLOW_THREADS

Py_None is accessed when the GIL is released.
msg361122 - (view) Author: Jeremy Kloth (jkloth) * Date: 2020-01-31 17:11
Would it not suffice to just make the singletons "immortal"?

Without affecting the hotpaths that are Py_INCREF and Py_DECREF, changing _Py_Dealloc to test for objects with a "special" destructor could be used:

    destructor dealloc = Py_TYPE(op)->tp_dealloc;
    if (dealloc == _Py_SingletonSentinel) {
        /* reset refcnt so as to not return here too often */
        op->ob_refcnt = PY_SSIZE_T_MAX;
    }
    else {
        (*dealloc)(op);
    }

Even in the presence of multiple mutating threads, the object cannot be destroyed.  Worst case, they all call _Py_Dealloc.
msg361137 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-01 00:26
New changeset 7dc140126e918cc7c6e65aea321b7255f0020798 by Victor Stinner in branch 'master':
bpo-39511: Fix multiprocessing semlock_acquire() (GH-18298)
https://github.com/python/cpython/commit/7dc140126e918cc7c6e65aea321b7255f0020798
msg361138 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-01 00:30
> Would it not suffice to just make the singletons "immortal"?

The problem is to make Py_INCREF/Py_DECREF efficient. Last time someone tried to use an atomic variable for ob_refcnt, it was 20% slower if I recall correctly. If many threads start to update such atomic variable, the CPU cacheline of common singletons like None, True and False can quickly become a performance bottleneck.

On the other side, if each interpreter has its own objects, there is no need to protect ob_refcnt, the interpreter lock protects it.
msg361145 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-01 01:30
New changeset 4d96b4635aeff1b8ad41d41422ce808ce0b971c8 by Victor Stinner in branch 'master':
bpo-39511: PyThreadState_Clear() calls on_delete (GH-18296)
https://github.com/python/cpython/commit/4d96b4635aeff1b8ad41d41422ce808ce0b971c8
msg361146 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-01 01:34
PR 18301 is a WIP showing my intent. I'm not sure if it would be possible to land such change right now in Python. It has different drawbacks described in my previous messages. I don't know the impact on performance neither.
msg361161 - (view) Author: Jeremy Kloth (jeremy.kloth) Date: 2020-02-01 11:23
> The problem is to make Py_INCREF/Py_DECREF efficient.

That is exactly why I didn't propose a change to them.  The singletons
still are refcounted as usual, just that their ob_refcnt is ignored.
If they somehow reach 0, they just "resurrect" themselves and ignore
the regular collection behavior.  In the presence of multiple
DECREF'ers, the ob_refcnt field is garbage, but that is OK as it is
effectively ignored.  Practicality vs Purity and all that.

> Last time someone tried to use an atomic variable for ob_refcnt, it was 20% slower if I recall correctly. If many threads start to update such atomic variable, the CPU cacheline of common singletons like None, True and False can quickly become a performance bottleneck.

Exactly so, hence why I chose the simple solution of effectively
ignoring ob_refcnt.

> On the other side, if each interpreter has its own objects, there is no need to protect ob_refcnt, the interpreter lock protects it.

My solution also does not need any protection around ob_refcnt.
msg361163 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-01 12:10
I vaguely recall discussions about immortal Python objects.

(*) Instagram gc.freeze()

* https://docs.python.org/dev/library/gc.html#gc.freeze
* https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

(*) Python immortal strings

* PyUnicode_InternImmortal()
* SSTATE_INTERNED_IMMORTAL
* They are only destroyed if Python is compiled with Valgrind or Purify support: unicode_release_interned() function

(*) COUNT_ALLOCS

* When Python is built with COUNT_ALLOCS macro defined, types are immortal
* Many tests have to be skipped if COUNT_ALLOCS is used
* I plan to remove COUNT_ALLOCS feature in bpo-39489 :-)

(*) Static types

* Types which are not allocated on the heap but "static" are immortal
* These types cause various issues and should go away
* For example, they are incompatible with multi-phase initialization module (PEP 489) and stable ABI (PEP 384)
msg361164 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-01 12:14
Recently, Petr Viktorin proposed immortal singletons in my latest "Pass the Python thread state to internal C functions" thread on python-dev list:
https://mail.python.org/archives/list/python-dev@python.org/message/RAVSH7HYHTROXSTUR3677WGTCTEO6FYF/

In 2004,  Jewett, Jim J proposed:

"What if a few common (constant, singleton) objects (such as None, -1, 0, 1) were declared immortal at compile-time?"

https://mail.python.org/archives/list/python-dev@python.org/thread/XWGRUATMRAVXXZKQ7QA2YX22KI55T7P5/#OQO7DWRVH7IIOE4RUJ2ZR7S5UT6WOCPS
msg361256 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-02-02 21:53
Is the sub-interpreter PEP approved?   If not, I had thought the plan was to only implement PRs that made clean-ups that would have been necessary anyway.
msg361257 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-02-02 22:32
Random idea (not carefully thought-out):  Would it be simpler to have these objects just ignore their refcount by having dealloc() be a null operation or having it set the refcount back to a positive number).  That would let sub-interpreters share the objects without worrying about race-conditions on incref/decref operations.  To make this work, the objects can register themselves as permanent, shared, objects; then, during shutdown, we could explicitly call a hard dealloc on those objects.
msg361258 - (view) Author: Jeremy Kloth (jkloth) * Date: 2020-02-02 22:57
+1, obviously, as I came to the same conclusion above (see msg361122)
msg361355 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2020-02-04 15:55
On Sun, Feb 2, 2020 at 2:53 PM Raymond Hettinger <report@bugs.python.org> wrote:
> Is the sub-interpreter PEP approved?

PEP 554 is not approved yet (and certainly is not guaranteed, though
I'm hopeful).  However, that PEP is exclusively about exposing
subinterpreters in the stdlib.

There is no PEP for the effort to isolate subinterpreters and stop
sharing the GIL between them.  We are not planning on having such a
PEP since the actual proposal is so basic ("make the GIL
per-interpreter") and there has been no controversy.

I need to do a better job about communicating the difference, as folks
keep conflating PEP 554 with the effort to stop sharing the GIL
between subinterpreters.  Note that both are part of my "multi-core
Python" project. [1]

> If not, I had thought the plan was to only
> implement PRs that made clean-ups that would have been necessary anyway.

Right.  In this case the "cleanup" is how singletons are finalized
when the runtime shuts down.  This has a material impact on projects
that embed the CPython runtime.  Currently runtime finalization is a
mess.

That said, making the singletons per-interpreter isn't a prerequisite
of that cleanup.  For specific case of the per-interpreter GIL world,
we just need an approach that addresses their life cycle in a
thread-safe way.

[1] https://github.com/ericsnowcurrently/multi-core-python
msg361356 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2020-02-04 16:24
On Sun, Feb 2, 2020 at 3:32 PM Raymond Hettinger <report@bugs.python.org> wrote:
> Random idea (not carefully thought-out):  Would it be simpler to have these objects just
> ignore their refcount by having dealloc() be a null operation or having it set the refcount
> back to a positive number).  That would let sub-interpreters share the objects without
> worrying about race-conditions on incref/decref operations.

This is pretty much one of the two approaches I have been considering. :)

Just to be clear, singletons normally won't end up with a refcount of
0, by design.  However, if there's a incref race between two
interpreters (that don't share a GIL) then you *can* end up triggering
dealloc (via Py_DECREF) and even get negative refcounts (which cause
Py_FatalError() on debug builds).  Here are some mitigations:

* (as noted above) make dealloc() for singletons a noop
* (as noted above) in dealloc() set the refcount back to some positive value
* make the initial refcount sufficiently large such that it is
unlikely to reach 0 even with races
* incref each singleton once during each interpreter initiialization
(and decref once during interp. finalization)

We would have to special-case refleak checks for singletons, to avoid
false positives.

Also note that currently extension authors (and CPython contributors)
can rely on the runtime to identify when then accidentally
incref/decref a singleton too much (or forget to do so).  So it may
make sense to do something in the "singleton dealloc()" in debug
builds to provide similar checks.

>  To make this work, the objects can register themselves as permanent, shared, objects;
> then, during shutdown, we could explicitly call a hard dealloc on those objects.

great point
msg361357 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2020-02-04 16:27
> This is pretty much one of the two approaches I have been considering.

The other approach is to leave the current static singletons alone and
only use them for the main interpreter.  Each subinterpreter would get
its own copy, created when that interpreter is initialized.
msg361380 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-04 23:59
> The other approach is to leave the current static singletons alone and
> only use them for the main interpreter.  Each subinterpreter would get
> its own copy, created when that interpreter is initialized.

Which API should be used in C extensions to be "subinterpreter-safe"? Currently, Py_None is a singleton shared by multiple interpreters. Should suddenly all C extensions use a new Py_GetNone() function which returns the per-interpreter singleton? If yes, that's basically what my PR 18301 does:

   #define Py_None Py_GetNone()
msg361382 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-05 00:08
> (as noted above) make dealloc() for singletons a noop

I expect issues with negative reference count value. As you wrote, it triggers a fatal error when Python is built in release mode.


> make the initial refcount sufficiently large such that it is
unlikely to reach 0 even with races

Py_None is heavily used. If the reference count is updated by multiple threads with no lock to protect it, there is a significant risk that value zero will be reached soon or later.

--

In the Linux kernel, they started to special type for reference counters, to reduce the risk of vulnerability on reference counter underflow or overflow:

* "reference-count protection" for kernel hardening: refcount_t type
* https://lwn.net/Articles/728675/
* https://lwn.net/Articles/706498/

The kernel already used atomic_t type. But the issue here is about bugs, since no program is perfect, even the Linux kernel.
msg361543 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-02-07 08:46
Ah, I also found the idea of immortal None in an old discussion on tagged pointer:
https://mail.python.org/archives/list/capi-sig@python.org/thread/JPUNPN3AILGXOA3C2TTSLMOFNSWJE3QX/

Stefan Behnel proposed the idea: "All negative refcounts would have special meanings, such as: this is the immortal None, (...)".
msg364384 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-03-17 02:35
> We should do that for each singletons:
> 
> * None (Py_None)
> * True (Py_True)
> * False (Py_False)
> * Ellipsis (Py_Ellipsis)

Aren't there a couple more lurking in the interpreter?  E.g. empty tuple, empty frozenset.


> That is exactly why I didn't propose a change to them.
> The singletons still are refcounted as usual,
> just that their ob_refcnt is ignored.
> If they somehow reach 0, they just "resurrect" themselves
> and ignore the regular collection behavior.

That seems like a very good idea!  They don't even need to "resurrect" themselves--we just ensure tp_dealloc is a no-op for those special values.  If we do that, different threads and different interpreters can change ob_refcnt willy-nilly, there can be unsafe races between threads, the value could no longer make any sense--but as long as we never free the object, it's all totally fine.

(Actually: tp_dealloc shouldn't be a no-op--it should add a million to the reference count for these special objects, to forestall future irrelevant calls to tp_dealloc.)

This might have minor deleterious effects, e.g. sys.getrefcount() would return misleading results for such objects.  I think that's acceptable.
msg364429 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2020-03-17 14:10
Consider the case where a thread that doesn't hold the GIL attempts to get a reference on `None`.

The problem with having a single immortal `None`, is that it will cause data cache thrashing as two different CPUs modify the refcount on the shared `None` object.
Each subinterpreter needs its own distinct `None`.

`None` could be made immortal, it just can't be shared between sub-interpreters.
msg364446 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-17 16:54
Mark:
> The problem with having a single immortal `None`, is that it will cause data cache thrashing as two different CPUs modify the refcount on the shared `None` object.

Yeah, I concur with Mark: having one singleton per interpreter should provide better usage of the CPU caches, especially CPU data cache level 1.


Mark:
> Consider the case where a thread that doesn't hold the GIL attempts to get a reference on `None`.

The main drawback of PR 18301 is that accessing "Py_None" means accessing tstate->interp->none. Except that the commonly used _PyThreadState_GET() returns NULL if the thread doesn't hold the GIL.

One alternative would be to use PyGILState_GetThisThreadState() but this API doesn't support subinterpreters.

Maybe we are moving towards a major backward incompatible changes required to make the subinterpreters implementation more efficient.

Maybe CPython should have a backward compatible behavior by default (Py_None can be read without holding the GIL), but running subinterpreters in parallel would change Py_None behavior (cannot be read without holding the GIL).

I don't know.
msg364469 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2020-03-17 18:17
> The problem with having a single immortal `None`, is that it will
> cause data cache thrashing as two different CPUs modify the
> refcount on the shared `None` object.

That's a very reasonable theory. Personally, I find modern CPU architecture bewildering and unpredictable.  So I'd prefer it if somebody tests such performance claims, rather than simply asserting them and having that be the final design.
msg364471 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2020-03-17 18:35
Having two CPUs write to the same cache line is a well known performance problem. There's nothing special about CPython here.

The proper name for it seems to be "cache line ping-pong", but a search for "false sharing" might be more informative.
msg364485 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-17 22:09
I expect that for objects which are not commonly modified by two interpreters "at the same time", it should be fine.

But None, True, small integer singletons, latin-1 str single character singletons, etc. objects are likely to be frequently accessed and so can become a bottleneck.

Moreover, Python has an infamous feature: write (modify ob_refcnt) on "read-only" access :-D See "Copy-on-Read" feature popularized by Instagram ;-)

https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf
msg366342 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-04-13 22:46
bpo-40255 proposes a concrete implementation of immortal objects: it modifies Py_INCREF and Py_DECREF which makes Python up to 1.17x slower.
msg368937 - (view) Author: Mark Shannon (Mark.Shannon) * (Python committer) Date: 2020-05-15 12:12
Those numbers are for code without immortal objects.
They don't apply in this case, as the branch misprediction rate would rise.
History
Date User Action Args
2020-05-15 12:12:02Mark.Shannonsetmessages: + msg368937
2020-05-15 00:42:39vstinnersetcomponents: + Subinterpreters, - Interpreter Core
2020-04-14 13:55:23steve.dowersetnosy: + steve.dower
2020-04-13 22:46:33vstinnersetmessages: + msg366342
2020-04-05 16:35:08corona10setnosy: + corona10
2020-03-17 22:09:59vstinnersetmessages: + msg364485
2020-03-17 18:35:07Mark.Shannonsetmessages: + msg364471
2020-03-17 18:17:23larrysetmessages: + msg364469
2020-03-17 16:54:19vstinnersetmessages: + msg364446
2020-03-17 14:10:55Mark.Shannonsetnosy: + Mark.Shannon
messages: + msg364429
2020-03-17 02:35:49larrysetnosy: + larry
messages: + msg364384
2020-02-25 07:25:04phsilvasetnosy: + phsilva
2020-02-14 01:17:40shihai1991setnosy: + shihai1991
2020-02-07 15:16:01maciej.szuliksetnosy: + maciej.szulik
2020-02-07 08:46:19vstinnersetmessages: + msg361543
2020-02-05 00:08:01vstinnersetmessages: + msg361382
2020-02-04 23:59:24vstinnersetmessages: + msg361380
2020-02-04 16:27:22eric.snowsetmessages: + msg361357
2020-02-04 16:24:50eric.snowsetmessages: + msg361356
2020-02-04 15:55:41eric.snowsetmessages: + msg361355
2020-02-02 22:57:45jklothsetmessages: + msg361258
2020-02-02 22:32:13rhettingersetmessages: + msg361257
2020-02-02 21:53:01rhettingersetnosy: + rhettinger
messages: + msg361256
2020-02-01 12:14:15vstinnersetmessages: + msg361164
2020-02-01 12:10:29vstinnersetmessages: + msg361163
2020-02-01 11:23:50jeremy.klothsetnosy: + jeremy.kloth
messages: + msg361161
2020-02-01 01:34:20vstinnersetmessages: + msg361146
2020-02-01 01:32:29vstinnersetpull_requests: + pull_request17678
2020-02-01 01:30:31vstinnersetmessages: + msg361145
2020-02-01 00:30:57vstinnersetmessages: + msg361138
2020-02-01 00:26:17vstinnersetmessages: + msg361137
2020-01-31 17:11:18jklothsetnosy: + jkloth
messages: + msg361122
2020-01-31 16:19:11vstinnersetnosy: + ncoghlan, eric.snow, nanjekyejoannah
2020-01-31 16:05:04vstinnersetpull_requests: + pull_request17673
2020-01-31 15:37:51vstinnersetkeywords: + patch
stage: patch review
pull_requests: + pull_request17671
2020-01-31 15:33:35vstinnercreate