Avoid temporary Unicode strings, use identifiers to only create the string once #63711

vstinner · 2013-11-06T17:05:05Z

BPO	19512
Nosy	@loewis, @birkenfeld, @vstinner, @serhiy-storchaka
Files	pysys_getobjectid.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2013-11-19.23:58:52.398>
created_at = <Date 2013-11-06.17:05:05.480>
labels = []
title = 'Avoid temporary Unicode strings, use identifiers to only create the string once'
updated_at = <Date 2013-11-19.23:58:52.397>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2013-11-19.23:58:52.397>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2013-11-19.23:58:52.398>
closer = 'vstinner'
components = []
creation = <Date 2013-11-06.17:05:05.480>
creator = 'vstinner'
dependencies = []
files = ['32517']
hgrepos = []
issue_num = 19512
keywords = ['patch']
message_count = 23.0
messages = ['202273', '202275', '202277', '202278', '202279', '202280', '202281', '202283', '202288', '202291', '202293', '202295', '202296', '202297', '202320', '202324', '202325', '202336', '202337', '202390', '202417', '202698', '203448']
nosy_count = 6.0
nosy_names = ['loewis', 'georg.brandl', 'vstinner', 'Arfrever', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue19512'
versions = ['Python 3.4']

vstinner · 2013-11-06T17:05:05Z

In interactive mode, when I run python in gdb, I see that PyUnicode_DecodeUTF8Stateful() is called many times. Calls come from PyDict_GetItemString() or PySys_GetObject(), for example.

Allocating a temporary Unicode string and decoding a byte string from UTF-8 is inefficient: the memory allocator is stressed and the byte string is decoded again at each call.

I propose to reuse the _Py_IDENTIFIER API in the most common places to limit calls to the memory allocator and to PyUnicode_DecodeUTF8Stateful().
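The difference between the two styles can be sketched with a toy model. This is only an illustration under stated assumptions, not CPython's implementation: `ToyString`, `ToyIdentifier`, `TOY_IDENTIFIER` and all `toy_*` functions are hypothetical names invented here, and a plain heap copy stands in for "allocate and decode from UTF-8".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a Python str object; this is NOT CPython's PyObject. */
typedef struct {
    char *data;
    size_t length;
} ToyString;

/* Counts constructed ToyString objects so the cost becomes visible. */
int toy_allocations = 0;

/* Models "allocate + decode": copy a C string into a new heap object. */
ToyString *toy_string_new(const char *text)
{
    ToyString *s = malloc(sizeof(*s));
    if (s == NULL)
        return NULL;
    s->length = strlen(text);
    s->data = malloc(s->length + 1);
    if (s->data == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->data, text, s->length + 1);
    toy_allocations++;
    return s;
}

void toy_string_free(ToyString *s)
{
    if (s != NULL) {
        free(s->data);
        free(s);
    }
}

/* Hypothetical identifier: a C literal plus a lazily created, cached object. */
typedef struct {
    const char *literal;
    ToyString *object;   /* NULL until the first successful use */
} ToyIdentifier;

#define TOY_IDENTIFIER(name) ToyIdentifier ToyId_##name = { #name, NULL }

/* Return the cached object, building it on first use only. */
ToyString *toy_identifier_get(ToyIdentifier *id)
{
    assert(id != NULL);
    if (id->object == NULL)
        id->object = toy_string_new(id->literal);
    return id->object;
}

/* Temporary-string style: one allocation and one free per call. */
size_t toy_length_via_temporary(const char *text)
{
    ToyString *tmp = toy_string_new(text);
    size_t length = (tmp != NULL) ? tmp->length : 0;
    toy_string_free(tmp);
    return length;
}

/* Identifier style: at most one allocation for the process lifetime. */
size_t toy_length_via_identifier(ToyIdentifier *id)
{
    ToyString *cached = toy_identifier_get(id);
    return (cached != NULL) ? cached->length : 0;
}

TOY_IDENTIFIER(displayhook);
```

In this model, calling `toy_length_via_temporary("displayhook")` three times builds three objects, while three calls to `toy_length_via_identifier(&ToyId_displayhook)` build only one.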

vstinner · 2013-11-06T17:08:30Z

pysys_getobjectid.patch:

- add _PySys_GetObjectId() and _PyDict_GetItemId() functions
- add global identifiers for most common strings: "argv", "path", "stdin", "stdout", "stderr"
- use these new functions and identifiers

serhiy-storchaka · 2013-11-06T17:29:43Z

PySys_GetObject() is called with the following literal strings: argv, displayhook, excepthook, modules, path, path_hooks, path_importer_cache, ps1, ps2, stderr, stdin, stdout, tracebacklimit.

PyDict_GetItemString() is called with the following literal strings: __abstractmethods__, __builtins__, __file__, __loader__, __module__, __name__, __warningregistry__, _abstract_, _argtypes_, _errcheck_, _fields_, _flags_, _iterdump, _needs_com_addref_, _restype_, _type_, builtins, decimal_point, default_int_handler, displayhook, excepthook, fillvalue, grouping, imp, metaclass, options, sys, thousands_sep.

Are any of these calls performance critical?

vstinner · 2013-11-06T17:42:37Z

> Are any of these calls performance critical?

I'm trying to focus on the interactive interpreter. I didn't touch literal strings used once, for example at module initialization.

Well, if it doesn't make the code much uglier or much slower, it's maybe not a big deal to replace all string literals with identifiers.

python-dev · 2013-11-06T17:45:51Z

New changeset a2f42d57b91d by Victor Stinner in branch 'default':
Issue bpo-19512: sys_displayhook() now uses an identifier for "builtins"
http://hg.python.org/cpython/rev/a2f42d57b91d

New changeset 55517661a053 by Victor Stinner in branch 'default':
Issue bpo-19512: _print_total_refs() now uses an identifier to get "showrefcount"
http://hg.python.org/cpython/rev/55517661a053

New changeset af822a6c9faf by Victor Stinner in branch 'default':
Issue bpo-19512: Add PyRun_InteractiveOneObject() function
http://hg.python.org/cpython/rev/af822a6c9faf

vstinner · 2013-11-06T17:47:33Z

Oh, by the way, identifiers have a nice side effect: they are interned, and so dict lookup should be faster.
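The interning benefit can be sketched as follows. This is a toy model under stated assumptions, not CPython's dict or intern table: `toy_intern`, `toy_key_equal` and the fixed-size table are hypothetical names invented here. The idea is that equal interned strings share one canonical object, so a key comparison can succeed on pointer identity without comparing contents.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_INTERN_CAPACITY 16

/* Toy intern table: one canonical heap copy per distinct text. */
static char *toy_interned[TOY_INTERN_CAPACITY];
static size_t toy_interned_count = 0;

/* Counts comparisons that had to look at the string contents. */
int toy_full_compares = 0;

/* Return the canonical copy of text, creating it if needed.
   Returns NULL if the table is full or the allocation fails. */
const char *toy_intern(const char *text)
{
    assert(text != NULL);
    for (size_t i = 0; i < toy_interned_count; i++) {
        if (strcmp(toy_interned[i], text) == 0)
            return toy_interned[i];
    }
    if (toy_interned_count == TOY_INTERN_CAPACITY)
        return NULL;
    size_t size = strlen(text) + 1;
    char *copy = malloc(size);
    if (copy == NULL)
        return NULL;
    memcpy(copy, text, size);
    toy_interned[toy_interned_count++] = copy;
    return copy;
}

/* Key comparison with an identity shortcut, as a lookup might do. */
int toy_key_equal(const char *a, const char *b)
{
    if (a == b)
        return 1;            /* same object: no content comparison */
    toy_full_compares++;
    return strcmp(a, b) == 0;
}
```

Two results of `toy_intern("builtins")` are the same pointer, so `toy_key_equal()` returns without touching the characters; comparing against a non-interned copy falls back to the full comparison.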

python-dev · 2013-11-06T18:06:22Z

New changeset 8a6a920d8eae by Victor Stinner in branch 'default':
Issue bpo-19512: Py_ReprEnter() and Py_ReprLeave() now use an identifier for the
http://hg.python.org/cpython/rev/8a6a920d8eae

New changeset 69071054b42f by Victor Stinner in branch 'default':
Issue bpo-19512: Add a new _PyDict_DelItemId() function, similar to
http://hg.python.org/cpython/rev/69071054b42f

New changeset 862a62e61553 by Victor Stinner in branch 'default':
Issue bpo-19512: type_abstractmethods() and type_set_abstractmethods() now use an
http://hg.python.org/cpython/rev/862a62e61553

New changeset e5476ecb8b57 by Victor Stinner in branch 'default':
Issue bpo-19512: eval() and exec() now use an identifier for "__builtins__" string
http://hg.python.org/cpython/rev/e5476ecb8b57

serhiy-storchaka · 2013-11-06T18:51:59Z

I don't think these changes are required. The interactive interpreter is not a bottleneck.

And adding new public functions to the API definitely needs more discussion.

vstinner · 2013-11-06T21:10:24Z

> I don't think these changes are required. The interactive interpreter is not a bottleneck.

What is the problem with these changes?

Identifiers have several advantages. Errors become less likely because objects are only initialized once, near startup. So it also puts less pressure on error-handling code :) (it is usually the least tested part of the code)

> And definitely adding new public functions to API needs more discussion.

You mean for PyRun_InteractiveOneObject()? Oh, it can be made private, but what is the problem with adding yet another PyRun_Interactive*() function? There are already a lot of them :-)

I also worked hard to support unencodable filenames: using char*, you cannot support arbitrary Unicode filenames on Windows. That's why I added various functions with an "Object" suffix. Some examples: PyErr_WarnExplicitObject(), PyParser_ParseStringObject(), PyImport_AddModuleObject(), etc.

Some users complained that they were not able to run Python scripts on Windows with unencodable filenames (like Russian characters on an English setup). I can try to find the related issues.

python-dev · 2013-11-06T21:46:26Z

New changeset 5e402c16a74c by Victor Stinner in branch 'default':
Issue bpo-19512: Add _PySys_GetObjectId() and _PySys_SetObjectId() functions
http://hg.python.org/cpython/rev/5e402c16a74c

New changeset cca13dd603a9 by Victor Stinner in branch 'default':
Issue bpo-19512: PRINT_EXPR bytecode now uses an identifier to get sys.displayhook
http://hg.python.org/cpython/rev/cca13dd603a9

New changeset 6348764bacdd by Victor Stinner in branch 'default':
Issue bpo-19512: pickle now uses an identifier to only create the Unicode string
http://hg.python.org/cpython/rev/6348764bacdd

New changeset 954167ce92a3 by Victor Stinner in branch 'default':
Issue bpo-19512: add some common identifiers to only create common strings once,
http://hg.python.org/cpython/rev/954167ce92a3

vstinner · 2013-11-06T22:43:12Z

Another problem is that PyUnicode_FromString() failure is not handled correctly in some cases. PyUnicode_FromString() can fail because of a decoding error, but also because of a MemoryError. For example, PyDict_GetItemString() returns NULL, as if the entry did not exist, if PyUnicode_FromString() failed :-(
---

PyObject *
PyDict_GetItemString(PyObject *v, const char *key)
{
    PyObject *kv, *rv;
    kv = PyUnicode_FromString(key);
    if (kv == NULL) {
        PyErr_Clear();
        return NULL;
    }
    rv = PyDict_GetItem(v, kv);
    Py_DECREF(kv);
    return rv;
}
---

While working on failmalloc issues (bpo-18048, bpo-19437), I found some places where MemoryError caused tricky bugs because of this. An example of such an issue:
---
changeset: 84684:af18829a7754
user: Victor Stinner <victor.stinner@gmail.com>
date: Wed Jul 17 01:22:45 2013 +0200
files: Objects/structseq.c Python/pythonrun.c
description:
Close bpo-18469: Replace PyDict_GetItemString() with _PyDict_GetItemId() in structseq.c

_PyDict_GetItemId() is more efficient: it only builds the Unicode string once.
Identifiers (dictionary keys) are now created at Python initialization, and if
the creation failed, Python does exit with a fatal error.

Before, PyDict_GetItemString() failure was not handled: structseq_new() could
call PyObject_GC_NewVar() with a negative size, and structseq_dealloc() could
also crash.
---

So moving from PyDict_GetItemString() to _PyDict_GetItemId() is for performance, but the main motivation is to handle errors better. I hope that the identifier will be initialized early at startup, and that if its initialization fails, the failure is handled better...

There is also a _PyDict_GetItemIdWithError() function, but it is not used currently (it was in changeset 2dd046be2c88).
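The ambiguity described above, where an internal failure looks like a missing key, can be sketched with a toy lookup. This is an illustration under stated assumptions, not the CPython functions themselves: `ToyStatus`, `toy_get_swallowing_errors`, `toy_get_with_error`, the `toy_simulate_key_failure` switch and the table contents are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOY_FOUND, TOY_MISSING, TOY_ERROR } ToyStatus;

typedef struct {
    const char *key;
    int value;
} ToyEntry;

/* Hypothetical table contents, chosen only for illustration. */
static const ToyEntry toy_table[] = {
    { "n_fields", 3 },
    { "n_sequence_fields", 2 },
};

/* Set to nonzero to simulate a failure while building the temporary key. */
int toy_simulate_key_failure = 0;

static const ToyEntry *toy_find(const char *key)
{
    size_t count = sizeof(toy_table) / sizeof(toy_table[0]);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(toy_table[i].key, key) == 0)
            return &toy_table[i];
    }
    return NULL;
}

/* Error-swallowing style: a key failure and a missing key both give NULL. */
const int *toy_get_swallowing_errors(const char *key)
{
    assert(key != NULL);
    if (toy_simulate_key_failure)
        return NULL;              /* the error is discarded here */
    const ToyEntry *entry = toy_find(key);
    return (entry != NULL) ? &entry->value : NULL;
}

/* Error-reporting style: the caller can tell the three outcomes apart. */
ToyStatus toy_get_with_error(const char *key, int *value)
{
    assert(key != NULL && value != NULL);
    if (toy_simulate_key_failure)
        return TOY_ERROR;
    const ToyEntry *entry = toy_find(key);
    if (entry == NULL)
        return TOY_MISSING;
    *value = entry->value;
    return TOY_FOUND;
}
```

With the failure switch set, the first function returns NULL for a key that is present, so a caller that assumes the key exists would go on with garbage; the second function reports `TOY_ERROR` instead, which the caller can handle separately from `TOY_MISSING`.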

python-dev · 2013-11-06T23:02:59Z

New changeset 40c73ccaee95 by Victor Stinner in branch 'default':
Issue bpo-19512: __build_class() builtin now uses an identifier for the "metaclass" string
http://hg.python.org/cpython/rev/40c73ccaee95

New changeset 7177363d8c5c by Victor Stinner in branch 'default':
Issue bpo-19512: fileio_init() reuses PyId_name identifier instead of "name"
http://hg.python.org/cpython/rev/7177363d8c5c

New changeset dbee50619259 by Victor Stinner in branch 'default':
Issue bpo-19512: _count_elements() of _collections reuses PyId_get identifier
http://hg.python.org/cpython/rev/dbee50619259

New changeset 6a1ce1fd1fc0 by Victor Stinner in branch 'default':
Issue bpo-19512: builtin print() function uses an identifier instead of literal
http://hg.python.org/cpython/rev/6a1ce1fd1fc0

vstinner · 2013-11-06T23:03:49Z

I changed the issue title to make it closer to the real changesets related to the issue.

python-dev · 2013-11-07T00:12:37Z

New changeset 77bebcf5c4cf by Victor Stinner in branch 'default':
Issue bpo-19512: add _PyUnicode_CompareWithId() function
http://hg.python.org/cpython/rev/77bebcf5c4cf

New changeset 3f9f2cfae53b by Victor Stinner in branch 'default':
Issue bpo-19512: Use the new _PyId_builtins identifier
http://hg.python.org/cpython/rev/3f9f2cfae53b

serhiy-storchaka · 2013-11-07T09:19:06Z

> What is the problem with these changes?

Usually the CPython team avoids code churn without serious reasons. Performance reasons for changing PySys_GetObject("stdout") to _PySys_GetObjectId(&_PyId_stdout) are ridiculous. You changed hundreds of lines of code to speed up interactive mode by perhaps several microseconds.

> Errors become less likely because objects are only initialized once, near startup. So it also puts less pressure on error-handling code :) (it is usually the least tested part of the code)

If there are bugs in code handling errors, they should be fixed in maintenance releases too.

> You mean for PyRun_InteractiveOneObject()? Oh, it can be made private, but what is the problem with adding yet another PyRun_Interactive*() function? There are already a lot of them :-)

And this is a problem. The newly added function is not even documented.

> I also worked hard to support unencodable filenames: using char*, you cannot support arbitrary Unicode filenames on Windows. That's why I added various functions with an "Object" suffix. Some examples: PyErr_WarnExplicitObject(), PyParser_ParseStringObject(), PyImport_AddModuleObject(), etc.

"One bug per bug report" as Martin says.

> Another problem is that PyUnicode_FromString() failure is not handled correctly in some cases. PyUnicode_FromString() can fail because of a decoding error, but also because of a MemoryError.

It can't fail on "stdout" because of a decoding error.

vstinner · 2013-11-07T10:35:40Z

>> Another problem is that PyUnicode_FromString() failure is not handled correctly in some cases. PyUnicode_FromString() can fail because of a decoding error, but also because of a MemoryError.

> It can't fail on "stdout" because of a decoding error.

It can fail on "stdout" because of a memory allocation failure.

birkenfeld · 2013-11-07T11:39:47Z

>> You mean for PyRun_InteractiveOneObject()? Oh, it can be made private, but what is the problem with adding yet another PyRun_Interactive*() function? There are already a lot of them :-)

> And this is a problem. The newly added function is not even documented.

Serhiy is right. You have to be responsible with the Py* namespace, and keep new functions private unless they are useful enough to the outside and you document them.

In general, you changed lots of code without a single review. Can you slow down a bit?

vstinner · 2013-11-07T13:03:51Z

> Serhiy is right. You have to be responsible with the Py* namespace, and keep new functions private unless they are useful enough to the outside and you document them.

I created the issue bpo-19518 to discuss this part (but also to propose other enhancements related to Unicode).

vstinner · 2013-11-07T13:10:30Z

>> Errors become less likely because objects are only initialized once, near startup. So it also puts less pressure on error-handling code :) (it is usually the least tested part of the code)

> If there are bugs in code handling errors, they should be fixed in maintenance releases too.

Well, using identifiers doesn't directly solve all issues. For example, _PyDict_GetItemId() should be replaced with _PyDict_GetItemIdWithError() to be completely safe. It just reduces the probability of bugs.

Using identifiers might add regressions for a minor gain (handling MemoryError better). As I did for issues bpo-18048 and bpo-19437 (related to issues found by failmalloc), I prefer not to backport such minor bugfixes, to avoid risking a regression.

> You changed hundreds of lines of code to speed up interactive mode by perhaps several microseconds.

Again, performance is not the main motivation, please read msg202293 again. Or maybe you disagree with this message?

Sorry, I didn't explain my changes in the first messages of this issue. I created the issue to group my changesets under one issue and to explain why I made them. I didn't expect any discussion :-) But thank you for all your remarks.

python-dev · 2013-11-07T22:08:08Z

New changeset 01c4a0af73cf by Victor Stinner in branch 'default':
Issue bpo-19512, bpo-19515: remove shared identifiers, move identifiers where they
http://hg.python.org/cpython/rev/01c4a0af73cf

python-dev · 2013-11-08T13:07:49Z

New changeset bf9c77bac36d by Victor Stinner in branch 'default':
Issue bpo-19512, bpo-19526: Exclude the new _PyDict_DelItemId() function from the
http://hg.python.org/cpython/rev/bf9c77bac36d

vstinner · 2013-11-12T15:43:28Z

@georg, Serhiy, Martin: Sorry for committing directly without more review. I didn't expect negative feedback on such changes; I thought that moving from literal C byte strings to Python identifiers was a well-accepted practice, since identifiers are used in a lot of places in the Python code base.

So what should I do now? Should I revert all changesets related to this issue, or can we keep these new identifiers and close the issue?

vstinner · 2013-11-19T23:58:52Z

No reaction, so I am closing the issue. Reopen it if you still have complaints ;-)

vstinner changed the title ~~Avoid most calls to PyUnicode_DecodeUTF8Stateful() in Python interactive mode~~ Avoid temporary Unicode strings, use identifiers to only create the string once Nov 6, 2013

vstinner closed this as completed Nov 19, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022
