This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Initialization of globals in unicodeobject.c
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Gregory.Andersen, amaury.forgeotdarc, franck, georg.brandl, lemburg, ncoghlan, pitrou, python-dev, serhiy.storchaka, skrah, stutzbach, vstinner
Priority: critical Keywords: patch

Created on 2010-10-20 17:27 by skrah, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_init_globals.patch skrah, 2010-10-22 14:39 review
unicode_init_globals2.patch skrah, 2010-10-24 10:21 review
unicode-leak.patch skrah, 2011-04-11 07:31 patch by Daniel Stutzbach
unicode_globals-2.7.patch serhiy.storchaka, 2013-01-07 11:28 review
unicode_globals-3.2.patch serhiy.storchaka, 2013-01-07 11:28 review
unicode_globals-3.3.patch serhiy.storchaka, 2013-01-07 11:28 review
unicode_globals-3.4.patch serhiy.storchaka, 2013-01-07 11:28 review
unicode_globals-2.7_2.patch serhiy.storchaka, 2013-01-24 20:21 review
unicode_globals-3.2_2.patch serhiy.storchaka, 2013-01-24 20:21 review
unicode_globals-3.3_2.patch serhiy.storchaka, 2013-01-24 20:21
unicode_globals-3.4_2.patch serhiy.storchaka, 2013-01-24 20:21
unicode_globals-2.7_3.patch serhiy.storchaka, 2013-01-25 19:51 review
unicode_globals-3.2_3.patch serhiy.storchaka, 2013-01-25 19:51 review
unicode_globals-3.3_3.patch serhiy.storchaka, 2013-01-25 19:51 review
unicode_globals-3.4_3.patch serhiy.storchaka, 2013-01-25 19:51 review
Messages (29)
msg119226 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-10-20 17:27
This is one of two remaining "definitely lost" leaks in py3k. It started
to appear in r70459. How to reproduce:

make distclean && ./configure OPT="-O0 -g" --without-pymalloc && make

valgrind --leak-check=full --suppressions=Misc/valgrind-python.supp ./python > VGOUT 2>&1

Then search for 'definitely'. This leak is not present in release-2.7.


==2058== 56 bytes in 1 blocks are definitely lost in loss record 918 of 2,136
==2058==    at 0x4C2412C: malloc (vg_replace_malloc.c:195)
==2058==    by 0x4167DE: _PyObject_New (object.c:243)
==2058==    by 0x42C278: _PyUnicode_New (unicodeobject.c:341)
==2058==    by 0x4306BD: PyUnicodeUCS2_DecodeUTF8Stateful (unicodeobject.c:2100)
==2058==    by 0x430671: PyUnicodeUCS2_DecodeUTF8 (unicodeobject.c:2065)
==2058==    by 0x42C8F7: PyUnicodeUCS2_FromStringAndSize (unicodeobject.c:541)
==2058==    by 0x42C973: PyUnicodeUCS2_FromString (unicodeobject.c:559)
==2058==    by 0x50B432: PyDict_SetItemString (dictobject.c:2088)
==2058==    by 0x4258DF: PyType_Ready (typeobject.c:3844)
==2058==    by 0x517B64: PyStructSequence_InitType (structseq.c:522)
==2058==    by 0x4F3B4F: _PyFloat_Init (floatobject.c:1905)
==2058==    by 0x4813CE: Py_InitializeEx (pythonrun.c:217)
msg119237 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-10-20 22:10
The stack corresponds to the allocation of type(sys.float_info).__doc__.

Why would only this object appear as a memory leak?  It is certainly not deallocated, but all other types are in the same situation.
For example, sys.int_info is very similar, and happens to be defined in r70459. Why doesn't it appear in the report?
msg119238 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-10-20 22:24
To add to the mystery, the leak disappears if the key value is not
interned in PyDict_SetItemString:


Index: Objects/dictobject.c
===================================================================
--- Objects/dictobject.c        (revision 70459)
+++ Objects/dictobject.c        (working copy)
@@ -2088,7 +2088,6 @@
        kv = PyUnicode_FromString(key);
        if (kv == NULL)
                return -1;
-       PyUnicode_InternInPlace(&kv); /* XXX Should we really? */
        err = PyDict_SetItem(v, kv, item);
        Py_DECREF(kv);
        return err;
msg119239 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-20 22:29
Stefan Krah wrote:
> 
> Stefan Krah <stefan-usenet@bytereef.org> added the comment:
> 
> To add to the mystery, the leak disappears if the key value is not
> interned in PyDict_SetItemString:

I'm not sure how you determine what is a leak and what not.
Interned Unicode objects stay alive until the interpreter
is finalized.

Are you suggesting that the finalization does not free the
interned Unicode strings or not all of them ?

> Index: Objects/dictobject.c
> ===================================================================
> --- Objects/dictobject.c        (revision 70459)
> +++ Objects/dictobject.c        (working copy)
> @@ -2088,7 +2088,6 @@
>         kv = PyUnicode_FromString(key);
>         if (kv == NULL)
>                 return -1;
> -       PyUnicode_InternInPlace(&kv); /* XXX Should we really? */
>         err = PyDict_SetItem(v, kv, item);
>         Py_DECREF(kv);
>         return err;
msg119241 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-10-20 22:47
Marc-Andre Lemburg <report@bugs.python.org> wrote:
> I'm not sure how you determine what is a leak and what not.
> Interned Unicode objects stay alive until the interpreter
> is finalized.
> 
> Are you suggesting that the finalization does not free the
> interned Unicode strings or not all of them ?

No, Valgrind's "definitely lost" category means that all pointers
to an allocated region have been lost, so it would not be possible
to free the area. [1]

There are hundreds of "possibly lost" warnings as well, but I did
not report those.

My experience is that Valgrind is usually correct with "definitely
lost", see e.g. #10153. That said, of course it _could_ be a false
alarm.

[1] Last category from:

   http://mail.python.org/pipermail/python-dev/2002-October/029758.html
msg119356 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-10-22 00:03
Re disabling interning in PyDict_SetItemString:

A comment in unicodeobject.c says that globals should not be used
before calling _PyUnicode_Init. But in Py_InitializeEx (pythonrun.c)
_PyUnicode_Init is called after _Py_ReadyTypes, _PyFrame_Init,
_PyLong_Init, PyByteArray_Init and _PyFloat_Init.

In fact, when I move _PyUnicode_Init up, the error concerning
_PyFloat_Init disappears.


Problem is, PyType_Ready also uses PyDict_SetItemString, but I
presume that _Py_ReadyTypes has to be called before anything else.
In that case it would be unavoidable that PyDict_SetItemString is
used before _PyUnicode_Init, and it might be a good idea to disable
interning.
msg119387 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-10-22 14:39
I've verified the leak manually. The cause is that global variables in
unicodeobject.c, e.g. free_list, are used before _PyUnicode_Init() is
called. Later on _PyUnicode_Init() sets these variables to NULL, losing
the allocated memory.

Here is an example of the earliest use of free_list during
_Py_ReadyTypes (),
well before _PyUnicode_Init():


Breakpoint 1, unicode_dealloc (unicode=0x1b044c0) at Objects/unicodeobject.c:392
392         switch (PyUnicode_CHECK_INTERNED(unicode)) {
(gdb) bt
#0  unicode_dealloc (unicode=0x1b044c0) at Objects/unicodeobject.c:392
#1  0x000000000044fc69 in PyUnicode_InternInPlace (p=0x7fff303852b8) at Objects/unicodeobject.c:9991
#2  0x000000000044fed3 in PyUnicode_InternFromString (cp=0x568861 "__len__") at Objects/unicodeobject.c:10025
#3  0x00000000004344d0 in init_slotdefs () at Objects/typeobject.c:5751
#4  0x0000000000434840 in add_operators (type=0x7be260) at Objects/typeobject.c:5905
#5  0x000000000042eec8 in PyType_Ready (type=0x7be260) at Objects/typeobject.c:3810
#6  0x000000000042edfc in PyType_Ready (type=0x7bde60) at Objects/typeobject.c:3774
#7  0x000000000041aa5f in _Py_ReadyTypes () at Objects/object.c:1514
#8  0x00000000004992ff in Py_InitializeEx (install_sigs=1) at Python/pythonrun.c:232
#9  0x000000000049957f in Py_Initialize () at Python/pythonrun.c:321
#10 0x00000000004b289f in Py_Main (argc=1, argv=0x1afa010) at Modules/main.c:590
#11 0x0000000000417dcc in main (argc=1, argv=0x7fff30385758) at ./Modules/python.c:59
(gdb) n
411         if (PyUnicode_CheckExact(unicode) &&
(gdb) 
414             if (unicode->length >= KEEPALIVE_SIZE_LIMIT) {
(gdb) 
419             if (unicode->defenc) {
(gdb) 
423             *(PyUnicodeObject **)unicode = free_list;
(gdb) n
424             free_list = unicode;
(gdb) n
425             numfree++;
(gdb) n
411         if (PyUnicode_CheckExact(unicode) &&


A possible fix could be to initialize the globals right at the start
in main.c. Note that there are still several Unicode API functions in
main.c before PyType_Ready has been called on the Unicode type.


With the patch, Valgrind does not show the leak any longer.
msg119396 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-10-22 18:04
About the patch: why should _PyUnicode_Init() try to call _PyUnicode_InitGlobals() again?
msg119503 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2010-10-24 10:21
> why should _PyUnicode_Init() try to call _PyUnicode_InitGlobals() again?


For the embedding scenario (when only Py_Initialize() is called) I wanted
to preserve the old behavior of _PyUnicode_Init().

But this is not really enough. I wrote a new patch that also calls 
_PyUnicode_InitGlobals() at the beginning of Py_Initialize().


I don't like the fact that even more clutter is added to Py_Main(). Perhaps
Py_Initialize() could be moved up or the Unicode functions could be moved
down.
msg133502 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-04-11 07:31
[Merging with issue 11402]

Daniel's patch is much simpler, but I think that unicode_empty and
unicode_latin1 would need to be protected before _PyUnicode_Init
is called.

Is the module initialization procedure documented somewhere? I get
the impression that unicodeobject.c depends on dict.c and dict.c
depends on unicodeobject.c.
msg133504 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-04-11 07:36
Stefan Krah <report@bugs.python.org> wrote:
> Is the module initialization procedure documented somewhere? I get
> the impression that unicodeobject.c depends on dict.c and dict.c
> depends on unicodeobject.c.

s/dict.c/dictobject.c/
msg144749 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-10-01 20:59
The PEP-393 changes apparently fix this leak; at least I can't reproduce
it in default any longer (but still in 3.2).
msg172103 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-10-05 17:24
See also #16143.
msg179221 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-06 20:33
Daniel's patch looks good for me. 2.7 looks affected too.
msg179223 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-01-06 20:43
unicode-leak.patch doesn't fix #16143 though. unicode_empty and
unicode_latin1 need to be initialized, too.

Actually we could close this in favor of #16143.
msg179234 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-06 22:29
> unicode-leak.patch doesn't fix #16143 though. unicode_empty and
> unicode_latin1 need to be initialized, too.

Indeed. I'll upload patches tomorrow.
msg179256 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-07 11:28
Here are patches for all four Python versions. They fixes possible usage of the 
followed non-initialized global variables: free_list, numfree, interned, 
unicode_empty, static_strings, unicode_latin1, bloom_linebreak, 
unicode_default_encoding.
msg179494 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-01-09 22:54
Nick, I'm adding you to the nosy list since this issue seems related to PEP 432.

Quick summary: Globals are used in unicodeobject.c before they are initialized.
Also, Unicode objects are created before PyType_Ready(&PyUnicode_Type) has been
called.


This happens during startup:


_Py_InitializeEx_Private():

  _Py_ReadyTypes():

    PyType_Ready(&PyType_Type);

    [...]

    Many Unicode objects like "" or "__add__" are created. Uninitialized
    globals have led to a crash (#16143). This is fixed by Serhiy's patch,
    which always dynamically checks all globals for NULL before using them.
    However, Unicode objects are still created at this point.

    [...]

    PyType_Ready(&PyUnicode_Type); /* Called for the first time */

    [...]

  _PyUnicode_Init:

    for (i = 0; i < 256; i++)       /* Could leak if latin1 strings
        unicode_latin1[i] = NULL;      have already been created. */

    PyType_Ready(&PyUnicode_Type);  /* Called a second time! */


So, considering PEP 432:  Are these "pre-type-ready" Unicode objects
safe to use, or should something be done about it?
msg179498 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-01-09 23:10
There should still be a check in tp_new (IIRC) that calls PyType_Ready on
unready types.

While doing something systematic about this kind of problem is part of the
rationale of PEP 432, that won't help earlier versions.
msg179504 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-01-10 00:21
Nick Coghlan <report@bugs.python.org> wrote:
> There should still be a check in tp_new (IIRC) that calls PyType_Ready on
> unready types.

Indeed there is one in type_new(), but that isn't used here AFAICS. If
you apply this patch and start up python, there are many "str: not ready"
instances:

diff --git a/Objects/unicodeobject.c b/Objects/unicodeobject.c
--- a/Objects/unicodeobject.c
+++ b/Objects/unicodeobject.c
@@ -14282,6 +14282,10 @@
 PyUnicode_InternFromString(const char *cp)
 {
     PyObject *s = PyUnicode_FromString(cp);
+
+    fprintf(stderr, "%s: %s\n", PyUnicode_Type.tp_name,
+       (PyUnicode_Type.tp_flags & Py_TPFLAGS_READY) ? "ready" : "not ready");
+
     if (s == NULL)
         return NULL;
     PyUnicode_InternInPlace(&s);
msg180546 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-24 20:21
There is a set of updated patches.
msg180547 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-01-24 20:42
Serhiy's general approach here looks good to me (although there seem to be some unrelated changes to the re module in the current 3.2 patch).

For PEP 432, I want to try to rearrange things so that _PyUnicode_Init is one of the *first* calls made in Py_BeginInitialization (even before the general call to Py_ReadyTypes), but that still won't invalidate the work done here.
msg180573 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-01-25 12:29
Since Rietveld didn't mail me this time: I left some comments on the 2.7 patch.
msg180579 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-01-25 13:40
The 2.7 comments also apply to the 3.2 patch. Otherwise the 3.2 patch
(without the _sre changes :) looks good to me.
msg180617 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-25 19:51
> The 2.7 comments also apply to the 3.2 patch. Otherwise the 3.2 patch
> (without the _sre changes :) looks good to me.

Patches updated addressing new Stefan's comments. Unicode globals no longer 
reinitialized in _PyUnicode_Init(). Note that I have added a consistency check 
into the macro in 3.3+.

I hope Rietveld will accept this set.
msg180623 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-01-25 21:12
Nice. I think the latest patches are commit-ready.
msg180652 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-26 10:21
New changeset 7c8ad0d02664 by Serhiy Storchaka in branch '2.7':
Issue #10156: In the interpreter's initialization phase, unicode globals
http://hg.python.org/cpython/rev/7c8ad0d02664

New changeset f7eda8165e6f by Serhiy Storchaka in branch '3.2':
Issue #10156: In the interpreter's initialization phase, unicode globals
http://hg.python.org/cpython/rev/f7eda8165e6f

New changeset 01d4dd412581 by Serhiy Storchaka in branch '3.3':
Issue #10156: In the interpreter's initialization phase, unicode globals
http://hg.python.org/cpython/rev/01d4dd412581

New changeset cb12d642eed2 by Serhiy Storchaka in branch 'default':
Issue #10156: In the interpreter's initialization phase, unicode globals
http://hg.python.org/cpython/rev/cb12d642eed2
msg180654 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-26 10:33
Committed. Thank you for review, Stefan. Close this issue if the work is finished.
msg180721 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2013-01-26 23:21
Buildbots etc. look all good. Thanks for fixing this.
History
Date User Action Args
2022-04-11 14:57:07adminsetgithub: 54365
2013-01-26 23:21:43skrahsetresolution: fixed
stage: patch review -> resolved
2013-01-26 23:21:26skrahsetstatus: open -> closed
2013-01-26 23:21:19skrahsetmessages: + msg180721
2013-01-26 10:33:56serhiy.storchakasetmessages: + msg180654
2013-01-26 10:21:50python-devsetnosy: + python-dev
messages: + msg180652
2013-01-25 21:12:47skrahsetmessages: + msg180623
2013-01-25 19:51:33serhiy.storchakasetfiles: + unicode_globals-2.7_3.patch, unicode_globals-3.2_3.patch, unicode_globals-3.3_3.patch, unicode_globals-3.4_3.patch

messages: + msg180617
2013-01-25 13:40:39skrahsetmessages: + msg180579
2013-01-25 12:29:15skrahsetmessages: + msg180573
versions: + Python 3.3, Python 3.4
2013-01-24 20:42:08ncoghlansetmessages: + msg180547
2013-01-24 20:21:09serhiy.storchakasetfiles: + unicode_globals-2.7_2.patch, unicode_globals-3.2_2.patch, unicode_globals-3.3_2.patch, unicode_globals-3.4_2.patch

messages: + msg180546
2013-01-10 00:21:25skrahsetmessages: + msg179504
2013-01-09 23:10:52ncoghlansetmessages: + msg179498
2013-01-09 22:54:13skrahsetnosy: + ncoghlan
messages: + msg179494
2013-01-07 23:30:53skrahsetpriority: high -> critical
2013-01-07 23:30:38skrahsetnosy: + georg.brandl, pitrou, franck, Gregory.Andersen
2013-01-07 23:29:13skrahlinkissue16143 superseder
2013-01-07 11:34:18serhiy.storchakasetstage: commit review -> patch review
2013-01-07 11:28:43serhiy.storchakasetfiles: + unicode_globals-2.7.patch, unicode_globals-3.2.patch, unicode_globals-3.3.patch, unicode_globals-3.4.patch

messages: + msg179256
2013-01-06 22:29:46serhiy.storchakasetmessages: + msg179234
2013-01-06 20:43:48skrahsetmessages: + msg179223
2013-01-06 20:33:36serhiy.storchakasetversions: + Python 2.7, - Python 3.3, Python 3.4
nosy: + serhiy.storchaka

messages: + msg179221

stage: patch review -> commit review
2013-01-04 23:41:18Arfreversetnosy: + Arfrever
2012-10-05 17:24:52skrahsetmessages: + msg172103
versions: + Python 3.3, Python 3.4
2012-04-20 17:58:50mark.dickinsonsetnosy: - mark.dickinson
2011-10-01 20:59:19skrahsetmessages: + msg144749
2011-04-11 07:36:29skrahsetmessages: + msg133504
2011-04-11 07:32:27skrahlinkissue11402 superseder
2011-04-11 07:31:07skrahsetfiles: + unicode-leak.patch
nosy: + stutzbach
messages: + msg133502

2010-10-24 10:21:44skrahsetfiles: + unicode_init_globals2.patch

messages: + msg119503
2010-10-22 18:04:51amaury.forgeotdarcsetmessages: + msg119396
2010-10-22 14:39:03skrahsetfiles: + unicode_init_globals.patch
priority: normal -> high
title: Memory leak (r70459) -> Initialization of globals in unicodeobject.c
messages: + msg119387

keywords: + patch
stage: patch review
2010-10-22 00:03:30skrahsetmessages: + msg119356
2010-10-20 22:47:11skrahsetmessages: + msg119241
2010-10-20 22:29:25lemburgsetnosy: + lemburg
messages: + msg119239
2010-10-20 22:24:43skrahsetmessages: + msg119238
2010-10-20 22:10:02amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg119237
2010-10-20 17:58:55belopolskysetnosy: + vstinner
2010-10-20 17:27:05skrahcreate