[subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER #83646

ncoghlan · 2020-01-27T14:28:31Z

BPO	39465
Nosy	@ncoghlan, @vstinner, @encukou, @ericsnowcurrently, @shihai1991
PRs	bpo-39465: Don't access directly _Py_Identifier members #20043 [WIP] bpo-39465: _PyUnicode_FromId() now uses an hash table #20048 bpo-39465: Fix _PyUnicode_FromId() for subinterpreters #20058 bpo-39465: Remove _PyUnicode_ClearStaticStrings() from C API #20078 [WIP] bpo-39465: Mark _Py_Identifier.object as atomic #20390 bpo-39465: Cleanup _PyUnicode_FromId() code #20595 bpo-39465: Add pycore_atomic_funcs.h internal header #20766 [WIP] bpo-39465: Optimize _PyInterpreterState_GET() #20767 bpo-39465: Use _PyInterpreterState_GET() #20788
Files	bench.py bench.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2020-12-25.23:49:32.652>
created_at = <Date 2020-01-27.14:28:30.691>
labels = ['expert-subinterpreters', 'type-feature', '3.10']
title = '[subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER'
updated_at = <Date 2021-12-07.17:28:23.318>
user = 'https://github.com/ncoghlan'

bugs.python.org fields:

activity = <Date 2021-12-07.17:28:23.318>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2020-12-25.23:49:32.652>
closer = 'vstinner'
components = ['Subinterpreters']
creation = <Date 2020-01-27.14:28:30.691>
creator = 'ncoghlan'
dependencies = []
files = ['49147', '49148']
hgrepos = []
issue_num = 39465
keywords = ['patch']
message_count = 17.0
messages = ['360766', '361041', '361042', '361044', '361045', '361533', '361627', '368682', '368688', '368691', '368805', '370607', '371229', '383629', '383781', '383783', '407954']
nosy_count = 5.0
nosy_names = ['ncoghlan', 'vstinner', 'petr.viktorin', 'eric.snow', 'shihai1991']
pr_nums = ['20043', '20048', '20058', '20078', '20390', '20595', '20766', '20767', '20788']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue39465'
versions = ['Python 3.10']

ncoghlan · 2020-01-27T14:28:30Z

Both #18066 (collections module) and #18032 (asyncio module) ran into the problem where porting them to multi-phase initialisation involves replacing their usage of the _Py_IDENTIFIER macro with some other mechanism.

When _posixsubprocess was ported, the replacement was a relatively ad hoc combination of string interning and the interpreter-managed module-specific state: 5a7d2e1

I'm wondering if we may able to devise a comparable struct-field based system that replaces the _Py_IDENTIFIER local static variable declaration macro and the Py_Id_<name> lookup convention with a combination like (using the posix subprocess module conversion as an example):

// Identifier usage declaration (replaces _Py_IDENTIFIER)
_Py_USE_CACHED_IDENTIFIER(_posixsubprocessstate(m), disable);

// Identifier usage remains unchanged, but uses a regular local variable
// rather than the static variable declared by _Py_IDENTIFIER
result = _PyObject_CallMethodIdNoArgs(gc_module, &PyId_disable);

And then the following additional state management macros would be needed to handle the string interning and reference counting:

    // Module state struct declaration
    typedef struct {
        // This would declare an initialised array of _Py_Identifier structs
        // under a name like __cached_identifiers__. The end of the array
        // would be indicated by a strict with "value" set to NULL.
        _Py_START_CACHED_IDENTIFIERS;
        _Py_CACHED_IDENTIFIER(disable);
        _Py_CACHED_IDENTIFIER(enable);
        _Py_CACHED_IDENTIFIER(isenabled);
        _Py_END_CACHED_IDENTIFIERS;
        );
    } _posixsubprocessstate;

// Module tp_traverse implementation
_Py_VISIT_CACHED_IDENTIFIERS(_posixsubprocessstate(m));

// Module tp_clear implementation (also called by tp_free)
_Py_CLEAR_CACHED_IDENTIFIERS(_posixsubprocessstate(m));

With the requirement to declare usage of the cached identifiers, they could be lazily initialized the same way the existing static variables are (even re-using the same struct declaration).

Note: this is just a draft of one possible design, the intent of this issue is to highlight the fact that this issue has now come up multiple times, and it would be good to have a standard answer available.

vstinner · 2020-01-30T11:35:23Z

Once I discussed with Eric Snow during a core developer sprint: _Py_IDENTIFIER() should use an "interpreter local storage" for identifiers values. _Py_IDENTIFIER() would only be a "key" and _PyUnicode_FromId() would store the value somewhere in a hash table stored in PyInterpreterState. Something similar to the TSS API:

PyThread_create_key()
PyThread_delete_key_value()
PyThread_set_key_value()
PyThread_get_key_value()

But per interpreter, rather than being per thread.

The key can be simply the variable address in memory. It only has to be unique in the process.

vstinner · 2020-01-30T11:41:22Z

Both #18066 (collections module) and #18032 (asyncio module) ran into the problem where porting them to multi-phase initialisation involves replacing their usage of the _Py_IDENTIFIER macro with some other mechanism.

What is the problem between _Py_IDENTIFIER and multi-phase initialisation modules?

If both are incompatible, we may need a different but similar API: values would be stored in a hash table per module object. The hash table can be stored in the module object directly, or it can be store in a second hash table (module => hash table).

If we want a unified API, maybe we can use module=NULL (or any other marker) for "global" identifiers (not specific to a module).

encukou · 2020-01-30T11:53:21Z

What is the problem between _Py_IDENTIFIER and multi-phase initialisation modules?

AFAIK there is no problem now, except possibly a race condition when initializing the identifiers.
It seems it's too easy to conflate porting to multi-phase initialization and getting rid of static state.

The problem will come with per-interpreter reference counting, or when the str class is no longer shared across all interpreters. For that, we'll need either per-interpreter identifiers, or solve the issue in another way.

vstinner · 2020-01-30T11:54:57Z

AFAIK there is no problem now, except possibly a race condition when initializing the identifiers.

The GIL avoids any risk of race condition, no?

shihai1991 · 2020-02-07T02:59:23Z

The GIL avoids any risk of race condition, no?

Looks like the GIL would affect performance more or less?

_Py_IDENTIFIER() would only be a "key" and _PyUnicode_FromId() would >store the value somewhere in a hash table stored in PyInterpreterState.

+1.

IMHO, for those two cases, the simplest idea is move IDENTIFIER to moduleState which would increase more memory usage than InterpreterState.

ncoghlan · 2020-02-08T12:52:39Z

As Petr notes, as long as all subinterpreters share the GIL, and share str instances, then the existing _Py_IDENTIFIER mechanism will work fine for both single phase and multi-phase initialisation.

However, that constraint also goes the other way: as long as we have modules that use the existing _Py_IDENTIFIER mechanism, then subinterpreters *must* share str instances, and hence *must* share the GIL.

Hence the "enhancement" classification: there's nothing broken right now, but if we're ever going to achieve the design goal of using subinterpreters to exploit multiple CPU cores without the overhead of running multiple full interpreter processes, we're going to need to design a different way of handling this.

Something to keep in mind with _Py_IDENTIFIER and any replacement API: the baseline for performance comparisons is https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_InternFromString

The reason multi-phase initialisation makes this more complicated is that it means we can't use the memory addresses of C process globals as unique identifiers any more, since more than one module object may be created from the same C shared library.

However, if we assume we've moved to per-module state storage (to get unique memory addresses back), then we can largely re-use the existing _Py_IDENTIFIER machinery to make the lookup as fast as possible, while still avoiding conflicts between subinterpreters.

vstinner · 2020-05-11T23:43:58Z

New changeset 4804b5b by Victor Stinner in branch 'master':
bpo-39465: Don't access directly _Py_Identifier members (GH-20043)
4804b5b

vstinner · 2020-05-12T00:27:05Z

I created bpo-40602: "Move Modules/hashtable.h to Include/internal/pycore_hashtable.h".

vstinner · 2020-05-12T01:12:29Z

Attached bench.py: Micro-benchmark on _PyUnicode_FromId(). It requires attached bench.patch being applied.

vstinner · 2020-05-13T23:11:58Z

New changeset d6fb53f by Victor Stinner in branch 'master':
bpo-39465: Remove _PyUnicode_ClearStaticStrings() from C API (GH-20078)
d6fb53f

vstinner · 2020-06-02T12:40:00Z

New changeset 297257f by Victor Stinner in branch 'master':
bpo-39465: Cleanup _PyUnicode_FromId() code (GH-20595)
297257f

vstinner · 2020-06-10T18:08:30Z

New changeset 1bcc32f by Victor Stinner in branch 'master':
bpo-39465: Use _PyInterpreterState_GET() (GH-20788)
1bcc32f

vstinner · 2020-12-23T02:41:11Z

New changeset 52a327c by Victor Stinner in branch 'master':
bpo-39465: Add pycore_atomic_funcs.h header (GH-20766)
52a327c

vstinner · 2020-12-25T23:41:57Z

New changeset ba3d67c by Victor Stinner in branch 'master':
bpo-39465: Fix _PyUnicode_FromId() for subinterpreters (GH-20058)
ba3d67c

vstinner · 2020-12-25T23:49:25Z

Ok, it should now be fixed. I close the issue.

See PR 20085 "Per-interpreter interned strings" of bpo-40521 for the follow-up.

vstinner · 2021-12-07T17:28:23Z

This change introduced a subtle regression: bpo-46006 "[subinterpreter] _PyUnicode_EqualToASCIIId() issue with subinterpreters".

ncoghlan added type-feature A feature request or enhancement labels Jan 27, 2020

vstinner added the topic-subinterpreters label May 15, 2020

vstinner changed the title ~~Design a subinterpreter friendly alternative to _Py_IDENTIFIER~~ [subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER May 15, 2020

vstinner added the topic-subinterpreters label May 15, 2020

vstinner changed the title ~~Design a subinterpreter friendly alternative to _Py_IDENTIFIER~~ [subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER May 15, 2020

vstinner added the 3.10 only security fixes label Dec 25, 2020

vstinner closed this as completed Dec 25, 2020

vstinner added the 3.10 only security fixes label Dec 25, 2020

vstinner closed this as completed Dec 25, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER #83646

[subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER #83646

ncoghlan commented Jan 27, 2020

ncoghlan commented Jan 27, 2020

vstinner commented Jan 30, 2020

vstinner commented Jan 30, 2020

encukou commented Jan 30, 2020

vstinner commented Jan 30, 2020

shihai1991 commented Feb 7, 2020

ncoghlan commented Feb 8, 2020

vstinner commented May 11, 2020

vstinner commented May 12, 2020

vstinner commented May 12, 2020

vstinner commented May 13, 2020

vstinner commented Jun 2, 2020

vstinner commented Jun 10, 2020

vstinner commented Dec 23, 2020

vstinner commented Dec 25, 2020

vstinner commented Dec 25, 2020

vstinner commented Dec 7, 2021

[subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER #83646

[subinterpreters] Design a subinterpreter friendly alternative to _Py_IDENTIFIER #83646

Comments

ncoghlan commented Jan 27, 2020

ncoghlan commented Jan 27, 2020

vstinner commented Jan 30, 2020

vstinner commented Jan 30, 2020

encukou commented Jan 30, 2020

vstinner commented Jan 30, 2020

shihai1991 commented Feb 7, 2020

ncoghlan commented Feb 8, 2020

vstinner commented May 11, 2020

vstinner commented May 12, 2020

vstinner commented May 12, 2020

vstinner commented May 13, 2020

vstinner commented Jun 2, 2020

vstinner commented Jun 10, 2020

vstinner commented Dec 23, 2020

vstinner commented Dec 25, 2020

vstinner commented Dec 25, 2020

vstinner commented Dec 7, 2021