This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: intern strings in deepfrozen modules
Type: behavior Stage: resolved
Components: Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, corona10, gvanrossum, kumaraditya, miss-islington, sobolevn, vstinner
Priority: normal Keywords: patch

Created on 2022-01-19 11:06 by kumaraditya, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 30683 merged kumaraditya, 2022-01-19 11:07
PR 31248 merged sobolevn, 2022-02-10 08:51
PR 31549 merged kumaraditya, 2022-02-24 07:29
PR 31556 closed corona10, 2022-02-24 17:02
PR 31596 merged kumaraditya, 2022-02-26 07:23
Messages (26)
msg410936 - (view) Author: Kumar Aditya (kumaraditya) * (Python triager) Date: 2022-01-19 11:16
Interns strings in deep-frozen modules.

See https://github.com/faster-cpython/ideas/issues/218
msg411003 - (view) Author: Kumar Aditya (kumaraditya) * (Python triager) Date: 2022-01-20 05:47
This speeds up comparison of strings by just comparing their pointers so it is much faster.
msg412920 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-02-09 16:52
New changeset c0a5ebeb1239020f2ecc199053bb1a70d78841a1 by Kumar Aditya in branch 'main':
bpo-46430: Intern strings in deep-frozen modules  (GH-30683)
https://github.com/python/cpython/commit/c0a5ebeb1239020f2ecc199053bb1a70d78841a1
msg412922 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2022-02-09 17:05
I noticed that the new interning code lacks proper error reporting. There are only asserts. _PyCode_New() checks the return value of intern_strings.
msg412937 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-02-09 18:17
We discussed that and found that a lot of errors are ignored during interning anyway.

But it's not too late to change if you want to (sending a PR would probably be quicker than arguing :-).
msg412983 - (view) Author: miss-islington (miss-islington) Date: 2022-02-10 09:08
New changeset dee020a6f5bf29f95bec6294da9bcd577114f592 by Nikita Sobolev in branch 'main':
Fix sphinx-lint after #31097 and b878b3a (GH-31248)
https://github.com/python/cpython/commit/dee020a6f5bf29f95bec6294da9bcd577114f592
msg413002 - (view) Author: Kumar Aditya (kumaraditya) * (Python triager) Date: 2022-02-10 14:11
I consider this done so closing it as improving the error handling of interning it out of scope of this issue.
msg413004 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2022-02-10 14:18
Please leave the ticket open until we have an agreement how to handle the missing error checks.
msg413822 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 17:07
> New changeset c0a5ebeb1239020f2ecc199053bb1a70d78841a1 by Kumar Aditya in branch 'main':
> bpo-46430: Intern strings in deep-frozen modules  (GH-30683)

This change introduced a memory leak at Python exit.

Before:

$ ./python -X showrefcount -c pass
[0 refs, 0 blocks]

After:

$ ./python -X showrefcount -c pass
[0 refs, 344 blocks]
msg413823 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 17:09
> This change introduced a memory leak at Python exit.

It's regression related to bpo-1635741 which I fixed recently:
https://mail.python.org/archives/list/python-dev@python.org/thread/E4C6TDNVDPDNNP73HTGHN5W42LGAE22F/
msg413824 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 17:10
_PyStaticCode_InternStrings() error handling is based on assert(): that's really bad. It can crash Python (exit with abort()) at the first memory allocation failure. Why not returning -1 on error?
msg413832 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-02-23 17:45
Okay, let's change the error handling. @Kumar, can you handle that?

@Victor, the refleak is unrelated to the error handling right? Presumably the leak is imaginary -- the deep-frozen interned strings should be accounted for somehow. @Kumar do you need help investigating this?
msg413837 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 18:02
> Presumably the leak is imaginary

Well, you can check with your favorite memory debugger like Valgrind if you don't trust Python internal memory debugger :-)

Not leaking memory at exit matters when Python is embedded in an application.

I don't think that it's related to the error handling which is stripped in release mode.
msg413839 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-02-23 18:07
> Not leaking memory at exit matters when Python is embedded
> in an application.

Sure, but "leaks" caused by deep-freezing cannot be solved by freeing up the deep-frozen memory -- the solution must be to update the accounting somewhere.

Where is the existence of Py_None accounted for (since it's statically allocated, or at least used to be)? That's likely where we'd have to do something about the deep-frozen objects.
msg413847 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 18:58
> Sure, but "leaks" caused by deep-freezing cannot be solved by freeing up the deep-frozen memory -- the solution must be to update the accounting somewhere.

Python allocates memory (ex: with PyObject_Malloc()) which is not released at exit. Two examples from Valgrind:

==196803== 50 bytes in 1 blocks are still reachable in loss record 1 of 87
==196803==    at 0x484486F: malloc (vg_replace_malloc.c:381)
==196803==    by 0x544E29: _PyMem_RawMalloc (obmalloc.c:101)
==196803==    by 0x545E0E: PyObject_Malloc (obmalloc.c:700)
==196803==    by 0x587159: PyUnicode_New (unicodeobject.c:1448)
==196803==    by 0x58C4CF: get_latin1_char (unicodeobject.c:2148)
==196803==    by 0x58D7F7: _PyUnicode_FromUCS1 (unicodeobject.c:2450)
==196803==    by 0x58E374: PyUnicode_FromKindAndData (unicodeobject.c:2525)
==196803==    by 0x69402B: r_object (marshal.c:1150)
==196803==    by 0x694295: r_object (marshal.c:1212)
==196803==    by 0x694AE0: r_object (marshal.c:1393)
==196803==    by 0x694295: r_object (marshal.c:1212)
==196803==    by 0x694AA4: r_object (marshal.c:1387)
==196803==    by 0x6950F3: read_object (marshal.c:1524)
==196803==    by 0x6953E7: PyMarshal_ReadObjectFromString (marshal.c:1641)
==196803==    by 0x763223: _Py_Get_Getpath_CodeObject (getpath.c:792)
==196803==    by 0x76337F: _PyConfig_InitPathConfig (getpath.c:847)
==196803==    by 0x68CD04: config_init_import (initconfig.c:1967)
==196803==    by 0x68CE4E: _PyConfig_InitImportConfig (initconfig.c:2000)
==196803==    by 0x69E594: init_interp_main (pylifecycle.c:1103)
==196803==    by 0x69EBF8: pyinit_main (pylifecycle.c:1216)
==196803==    by 0x69EDCE: Py_InitializeFromConfig (pylifecycle.c:1247)
==196803==    by 0x6D5346: pymain_init (main.c:67)
==196803==    by 0x6D64F8: pymain_main (main.c:692)
==196803==    by 0x6D65A1: Py_BytesMain (main.c:725)
==196803==    by 0x41D7B5: main (python.c:15)

and

==196803== 3,336 bytes in 60 blocks are still reachable in loss record 87 of 87
==196803==    at 0x484486F: malloc (vg_replace_malloc.c:381)
==196803==    by 0x544E29: _PyMem_RawMalloc (obmalloc.c:101)
==196803==    by 0x545E0E: PyObject_Malloc (obmalloc.c:700)
==196803==    by 0x587159: PyUnicode_New (unicodeobject.c:1448)
==196803==    by 0x59AEB9: unicode_decode_utf8 (unicodeobject.c:5162)
==196803==    by 0x59B5B0: PyUnicode_DecodeUTF8Stateful (unicodeobject.c:5292)
==196803==    by 0x58D296: PyUnicode_FromString (unicodeobject.c:2322)
==196803==    by 0x5CFFAE: PyUnicode_InternFromString (unicodeobject.c:15650)
==196803==    by 0x4E8B61: descr_new (descrobject.c:885)
==196803==    by 0x4E8C9B: PyDescr_NewMethod (descrobject.c:934)
==196803==    by 0x56694C: type_add_method (typeobject.c:5643)
==196803==    by 0x566A92: type_add_methods (typeobject.c:5689)
==196803==    by 0x569166: type_ready_fill_dict (typeobject.c:6165)
==196803==    by 0x569A02: type_ready (typeobject.c:6421)
==196803==    by 0x569B59: PyType_Ready (typeobject.c:6457)
==196803==    by 0x544363: _PyTypes_InitTypes (object.c:1952)
==196803==    by 0x69D12D: pycore_init_types (pylifecycle.c:704)
==196803==    by 0x69D8FC: pycore_interp_init (pylifecycle.c:831)
==196803==    by 0x69DC31: pyinit_config (pylifecycle.c:887)
==196803==    by 0x69E355: pyinit_core (pylifecycle.c:1050)
==196803==    by 0x69ED32: Py_InitializeFromConfig (pylifecycle.c:1240)
==196803==    by 0x6D5346: pymain_init (main.c:67)
==196803==    by 0x6D64F8: pymain_main (main.c:692)
==196803==    by 0x6D65A1: Py_BytesMain (main.c:725)
==196803==    by 0x41D7B5: main (python.c:15)

Before the commit c0a5ebeb1239020f2ecc199053bb1a70d78841a1, Python didn't leak this memory.
msg413851 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-02-23 19:28
That's pretty mysterious. The deep-freeze code isn't on the stack for either of those, but allocation of new unicode string objects is. I'm guessing these are somehow leaked by the interning, but I don't follow yet how.
msg413852 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 19:44
PyUnicode_InternInPlace() in intern_strings() can convert an immortal string to a regular Python strong reference, whereas _PyStaticCode_Dealloc() doesn't bother with clearing co_names, co_consts and co_localsplusnames.
msg413855 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-23 20:33
I tried to hack _PyStaticCode_Dealloc() to free strings, but since immortal objects are half-baken today, calling Py_DECREF() does crash.

Py_SETREF() should increase _Py_RefTotal if the old object is immortal and he new object is not.

Maybe this change should be reverted until Eric's PEP 683 "Immortal Objects, Using a Fixed Refcount" is implemented.
msg413893 - (view) Author: Kumar Aditya (kumaraditya) * (Python triager) Date: 2022-02-24 07:53
I have created a PR to fix the memory leak, See https://github.com/python/cpython/pull/31549
msg413895 - (view) Author: Kumar Aditya (kumaraditya) * (Python triager) Date: 2022-02-24 08:49
> Okay, let's change the error handling. @Kumar, can you handle that?

How it should be handled? Currently PyUnicode_InternInPlace ignores any errors and does not return it. It would be backwards-incompatible to change that, moreover as I explained in https://github.com/python/cpython/pull/30683#discussion_r800648477 intern_strings only check if all the items are strings which will be always true in case of deep-freeze so I don't think anything needs to be changed here. I would be interested to know if I am missing something though.
msg413928 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-24 16:54
New changeset 4dc746310bd37ad6b381f9176acd167d445f4385 by Kumar Aditya in branch 'main':
bpo-46430: Fix memory leak in interned strings of deep-frozen modules (GH-31549)
https://github.com/python/cpython/commit/4dc746310bd37ad6b381f9176acd167d445f4385
msg413930 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-24 17:02
I wrote https://github.com/python/cpython/pull/31555 to make sure that Python doesn't leak at Python exit.
msg413931 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-02-24 17:04
> How it should be handled? Currently PyUnicode_InternInPlace ignores any errors and does not return it. It would be backwards-incompatible to change that, moreover as I explained in https://github.com/python/cpython/pull/30683#discussion_r800648477 intern_strings only check if all the items are strings which will be always true in case of deep-freeze so I don't think anything needs to be changed here. I would be interested to know if I am missing something though.

The other functions you are calling *do* return errors. You should not ignore those. If any errors are reported the caller can decide what to do (e.g. call Py_FatalError().
msg414119 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-26 23:18
commit 0d9b565e62a5fc8c3e9b8c64cce764fe084ccb2b
Author: Kumar Aditya <59607654+kumaraditya303@users.noreply.github.com>
Date:   Sat Feb 26 22:05:03 2022 +0530

    Propagate errors (however unlikely) from _Py_Deepfreeze_Init() (GH-31596)
msg414120 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2022-02-26 23:21
> The other functions you are calling *do* return errors. You should not ignore those. If any errors are reported the caller can decide what to do (e.g. call Py_FatalError().

PEP 587 introduced PyStatus to Python startup code which let the Py_Initialize() caller to decide how to handle errors ;-) For example, you can open a graphical popup rather than killing the process with SIGABRT (Py_FatalError() behavior) which might be more user friendly :-D Or just log the error.
msg414127 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2022-02-26 23:45
> PEP 587 introduced PyStatus to Python startup code which let the Py_Initialize() caller to decide how to handle errors ;-) For example, you can open a graphical popup rather than killing the process with SIGABRT (Py_FatalError() behavior) which might be more user friendly :-D Or just log the error.

That's up to pycore_interp_init() in pylifecycle.c now. Kumar called _PyStatus_ERR() there, so I think nothing more needs to be done.
History
Date User Action Args
2022-04-11 14:59:54adminsetgithub: 90588
2022-03-04 15:06:51corona10setpull_requests: - pull_request29803
2022-03-04 14:46:49corona10setpull_requests: + pull_request29803
2022-02-26 23:45:31gvanrossumsetmessages: + msg414127
2022-02-26 23:21:02vstinnersetmessages: + msg414120
2022-02-26 23:18:50vstinnersetmessages: + msg414119
2022-02-26 16:37:30gvanrossumsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2022-02-26 07:23:43kumaradityasetpull_requests: + pull_request29719
2022-02-24 17:04:41gvanrossumsetmessages: + msg413931
2022-02-24 17:02:57vstinnersetmessages: + msg413930
2022-02-24 17:02:41corona10setpull_requests: + pull_request29678
2022-02-24 16:54:14vstinnersetmessages: + msg413928
2022-02-24 08:49:47kumaradityasetmessages: + msg413895
2022-02-24 07:53:27kumaradityasetmessages: + msg413893
2022-02-24 07:29:56kumaradityasetstage: patch review
pull_requests: + pull_request29670
2022-02-24 01:57:52corona10setnosy: + corona10
2022-02-23 20:33:02vstinnersetmessages: + msg413855
2022-02-23 19:44:47vstinnersetmessages: + msg413852
2022-02-23 19:28:19gvanrossumsetmessages: + msg413851
2022-02-23 18:58:10vstinnersetmessages: + msg413847
2022-02-23 18:07:16gvanrossumsetmessages: + msg413839
2022-02-23 18:02:44vstinnersetmessages: + msg413837
2022-02-23 17:45:34gvanrossumsetmessages: + msg413832
2022-02-23 17:10:54vstinnersetmessages: + msg413824
2022-02-23 17:09:20vstinnersetmessages: + msg413823
2022-02-23 17:07:40vstinnersetnosy: + vstinner
messages: + msg413822
2022-02-10 14:18:11christian.heimessetstatus: closed -> open
type: behavior
messages: + msg413004

resolution: fixed -> (no value)
stage: resolved -> (no value)
2022-02-10 14:11:07kumaradityasetstatus: open -> closed
resolution: fixed
messages: + msg413002

stage: patch review -> resolved
2022-02-10 09:08:59miss-islingtonsetnosy: + miss-islington
messages: + msg412983
2022-02-10 08:51:44sobolevnsetnosy: + sobolevn
pull_requests: + pull_request29417
2022-02-09 18:17:34gvanrossumsetmessages: + msg412937
2022-02-09 17:05:13christian.heimessetnosy: + christian.heimes
messages: + msg412922
2022-02-09 16:52:54gvanrossumsetmessages: + msg412920
2022-01-20 05:47:51kumaradityasetmessages: + msg411003
2022-01-19 11:16:23kumaradityasetmessages: + msg410936
2022-01-19 11:07:06kumaradityasetkeywords: + patch
stage: patch review
pull_requests: + pull_request28882
2022-01-19 11:06:14kumaradityacreate