
Problem with string concatenation and utf-8 cache. #69895

Closed
rpdKsa mannequin opened this issue Nov 23, 2015 · 39 comments
Labels
release-blocker · stdlib (Python modules in the Lib dir) · type-bug (An unexpected behavior, bug, or error)

Comments

@rpdKsa
Mannequin

rpdKsa mannequin commented Nov 23, 2015

BPO 25709
Nosy @malemburg, @birkenfeld, @terryjreedy, @pitrou, @vstinner, @larryhastings, @benjaminp, @ezio-melotti, @stevendaprano, @serhiy-storchaka, @eryksun
Files
  • greekbug.py: small script generating greek alphabet
  • issue25709.patch
  • issue25709_2.patch
  • issue25709_3.patch
  • issue25709_4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/birkenfeld'
    closed_at = <Date 2016-02-11.17:23:56.312>
    created_at = <Date 2015-11-23.14:57:19.309>
    labels = ['type-bug', 'library', 'release-blocker']
    title = 'Problem with string concatenation and utf-8 cache.'
    updated_at = <Date 2016-02-13.03:10:46.821>
    user = 'https://bugs.python.org/rpdKsa'

    bugs.python.org fields:

    activity = <Date 2016-02-13.03:10:46.821>
    actor = 'ned.deily'
    assignee = 'georg.brandl'
    closed = True
    closed_date = <Date 2016-02-11.17:23:56.312>
    closer = 'georg.brandl'
    components = ['Library (Lib)']
    creation = <Date 2015-11-23.14:57:19.309>
    creator = '\xc3\x81rp\xc3\xa1d K\xc3\xb3sa'
    dependencies = []
    files = ['41139', '41140', '41141', '41142', '41146']
    hgrepos = []
    issue_num = 25709
    keywords = ['patch']
    message_count = 39.0
    messages = ['255172', '255175', '255177', '255194', '255203', '255206', '255207', '255212', '255214', '255216', '255218', '255226', '255227', '255230', '255233', '255234', '255236', '255240', '255241', '255244', '255255', '255256', '255257', '255259', '255271', '255705', '255708', '255710', '255793', '255794', '255996', '256050', '260115', '260116', '260119', '260120', '260121', '260172', '260174']
    nosy_count = 14.0
    nosy_names = ['lemburg', 'georg.brandl', 'terry.reedy', 'pitrou', 'vstinner', 'larry', 'benjamin.peterson', 'ezio.melotti', 'steven.daprano', 'python-dev', 'serhiy.storchaka', 'eryksun', 'random832', '\xc3\x81rp\xc3\xa1d K\xc3\xb3sa']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue25709'
    versions = ['Python 3.3']

    @rpdKsa
    Mannequin Author

    rpdKsa mannequin commented Nov 23, 2015

    One of my students found this bug. For ASCII characters it works as you expect, but for the Greek alphabet it behaves unexpectedly.
    The program works correctly on Python 3.2.x, but on 3.4.x and 3.5 it gives an erroneous result.
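    The attached greekbug.py is not reproduced here; a minimal sketch of the kind of loop described might look like the code below. Note that on an affected build the misbehaviour only shows up when something has already populated the string's cached UTF-8 representation (as IDLE's RPC layer or pickle does, see the discussion below); run directly, the loop prints the expected output.

        # Hypothetical reconstruction of the reported scenario (not the
        # attached greekbug.py): build the Greek alphabet one letter at a
        # time by in-place concatenation and print the running string.
        s = ''
        for code in range(0x3B1, 0x3CA):   # GREEK SMALL LETTER ALPHA .. OMEGA
            s += chr(code)
            print(s)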

    @rpdKsa (mannequin) added the type-bug (An unexpected behavior, bug, or error) label Nov 23, 2015
    @stevendaprano
    Member

    I'm afraid I'm unable to replicate this bug report in Python 3.4.

    If you are able to replicate it, can you tell us the exact version number of Python you are using? Also, which operating system are you using?

    @serhiy-storchaka
    Member

    Confirmed on IDLE.

    >>> s = ''
    >>> for i in range(5):
    	s += '\xe0'
    	print(s)
    
    	
    à
    àà
    àà
    àà
    àà
    >>> s = ''
    >>> for i in range(5):
    	s += chr(0xe0)
    	print(s)
    
    	
    à
    àà
    àà
    àà
    àà
    >>> s = ''
    >>> for i in range(5):
    	s += '\u0107'
    	print(s)
    
    	
    ć
    ćć
    ćć
    ćć
    ćć
    >>> s = ''
    >>> for i in range(5):
    	s += chr(0x107)
    	print(s)

    ć
    ć
    ć
    ć
    ć

    This issue can be related to details of IDLE's standard streams and RPC.

    @serhiy-storchaka
    Member

    Here is a reproducer without IDLE. It looks like pickle is the culprit.

    >>> import pickle
    >>> s = ''
    >>> for i in range(5):
    ...         s += chr(0xe0)
    ...         print(len(s), s, s.encode(), repr(s))
    ...         print('   ', pickle.dumps(s))
    ... 
    1 à b'\xc3\xa0' 'à'
        b'\x80\x03X\x02\x00\x00\x00\xc3\xa0q\x00.'
    2 àà b'\xc3\xa0\xc3\xa0' 'àà'
        b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
    3 àà b'\xc3\xa0\xc3\xa0' 'ààà'
        b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
    4 àà b'\xc3\xa0\xc3\xa0' 'àààà'
        b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
    5 àà b'\xc3\xa0\xc3\xa0' 'ààààà'
        b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'

    @random832
    Mannequin

    random832 mannequin commented Nov 23, 2015

    I've looked at the raw bytes [through a ctypes pointer to id(s)] of a string affected by the issue, and decoded enough to be able to tell that the bad string has an incorrect UTF-8 length and data, which pickle presumably relies on.

    HEAD............length..hash....flgs....wstr....u8len...u8ptr...wstrl...data....
    010000003094ac6403000000ffffffffa40000000000000004000000d814480000000000e0e0e000
    010000003094ac6403000000e5d0030ca400006500000000060000008814480000000000e0e0e000

    I've omitted the UTF-8 data, but both were null-terminated after four and six bytes respectively.
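    For reference, that kind of inspection can be done with ctypes alone; a rough sketch (CPython only; the layout is the private PyCompactUnicodeObject struct, so field offsets depend on the build):

        # Dump the raw bytes of the string object starting at id(s).
        # Debugging aid only; the object layout is not a stable interface.
        import binascii
        import ctypes
        import sys

        s = ''.join(['\xe0', '\xe0'])            # a fresh, non-interned string
        raw = ctypes.string_at(id(s), sys.getsizeof(s))
        print(binascii.hexlify(raw))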

    @eryksun
    Contributor

    eryksun commented Nov 23, 2015

    unicode_modifiable in Objects/unicodeobject.c should return 0 if there's cached PyUnicode_UTF8 data. In this case PyUnicode_Append won't operate in place but instead concatenate a new string.

    @random832
    Mannequin

    random832 mannequin commented Nov 23, 2015

    I can't reproduce without pickle. I did some further digging, though, and it *looks like*...

    1. Pickle causes the built-in UTF-8 representation of a string to be populated, whereas encode('utf-8') does not. Can anyone think of any other operations that do this?
    2. After the UTF-8 representation of the 2-character string is populated, concatenating a new character to it does not update or clear it.
    3. However, it worked just fine with the 1-character string - concatenating it caused the UTF-8 representation to be cleared.

    The actual operation that creates an inconsistent string is the concatenate operation, but it only happens with a string that has been "primed" by having its UTF-8 representation materialized.
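    A condensed sketch of that sequence, priming the cache directly through the C API instead of via pickle (this assumes CPython; PyUnicode_AsUTF8AndSize is the call that populates the cache). On an affected 3.3-3.5.0 build the two printed lengths differ; on a fixed build they match.

        # 1. build a fresh non-ASCII string (refcount 1, nothing cached yet)
        # 2. populate its cached UTF-8 representation via the C API
        # 3. append in place; on affected builds the stale cache survives
        # 4. pickle reads the cached UTF-8 data, so the round trip comes back short
        import ctypes
        import pickle

        as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
        as_utf8.restype = ctypes.c_void_p
        as_utf8.argtypes = [ctypes.py_object, ctypes.c_void_p]

        s = ''.join(['\xe0', '\xe0'])
        as_utf8(s, None)                     # prime the UTF-8 cache
        s += '\xe0'                          # resized in place when nothing else references s
        print(len(s), len(pickle.loads(pickle.dumps(s))))   # e.g. "3 2" when affected, "3 3" when fixed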

    @serhiy-storchaka
    Member

    Yes, I have come to the same conclusion as random832. String objects have a "fast path" for concatenation, and in this path the cached UTF-8 representation is not cleared. Pickle is one of the simplest ways to reproduce this issue. Maybe it can also be reproduced with compile() or type(), but those approaches look too heavyweight to me.

    Here is a patch that fixes string concatenation.

    @terryjreedy
    Member

    It would be good to get this in 3.4.4.

    @terryjreedy added the stdlib (Python modules in the Lib dir) label and removed the topic-IDLE label Nov 23, 2015
    @terryjreedy changed the title from "greek alphabet bug it is very disturbing..." to "Problem with string concatenation and utf-8 cache." Nov 23, 2015
    @serhiy-storchaka
    Member

    Added test without using pickle.

    @eryksun
    Contributor

    eryksun commented Nov 23, 2015

    Serhiy, when does sharing UTF-8 data occur in a compact object? It has to be ASCII since non-ASCII UTF-8 isn't sharable, but PyASCIIObject doesn't have the utf8 field. So it has to be a PyCompactUnicodeObject. But isn't ASCII always allocated as a PyASCIIObject? I need a bit of help getting from point A to point B. Thanks.

    @vstinner
    Member

    I reviewed issue25709_2.patch.

    It would be good to get this in 3.4.4.

    Since it's a major bug in the Unicode implementation, it may be worth fixing in Python 3.3 as well. The bug was introduced in Python 3.3 by PEP 393.

    @pitrou
    Member

    pitrou commented Nov 23, 2015

    3.3 is presumably in security mode. Anyone using it would have had to live with the bug for a long time already.

    @random832
    Mannequin

    random832 mannequin commented Nov 23, 2015

    unicode_modifiable in Objects/unicodeobject.c should return 0 if there's cached PyUnicode_UTF8 data. In this case PyUnicode_Append won't operate in place but instead concatenate a new string.

    Shouldn't it still operate in place but clear it? Operating in place is only an option if the old string has no references and will be discarded, right?

    @larryhastings
    Contributor

    I read some comments here and on the patches. Serhiy's patch adds some code and Victor says you can't call that macro on this object and wow this is badly broken. Can someone explain in simpler terms what's so broken, exactly?

    @vstinner
    Member

    " and wow this is badly broken "

    I mean the current code is badly broken.

    The bug is that sometimes, when a string is resized (which doesn't make sense, strings are immutable, right? :-D), the cached UTF-8 string can become corrupted (old pointer not updated).

    It occurs if

    • the string is resized (ex: "s += s2")
    • the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)
    • the resize moves the memory block to a new address

    OK, it's probably unlikely to hit all three conditions, but from my point of view it's a major bug because it's in a fundamental Python type (str), it's not a bug in user code, and it cannot (easily) be worked around.

    @serhiy-storchaka
    Member

    The updated patch fixes a bug found by Victor and addresses his other comments. Many thanks, Victor!

    @stevendaprano
    Member

    On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:

    • the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)

    Why do strings cache their UTF-8 encoding?

    I presume that some of Python's internals rely on the UTF-8 encoding
    rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP-393).
    E.g. I infer from the above that int(s) parses the UTF-8 representation
    of s rather than the internal representation. Is that right?

    Nevertheless, I wonder why the UTF-8 representation is cached. Is it
    that expensive to generate that it can't be done on the fly, as needed?
    As it stands now, non-ASCII strings may be up to twice as big as they
    need be, once you include the UTF-8 cache. And, as this bug painfully
    shows, the problem with caches is that you run the risk of the cache
    being out of date.
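    For what it's worth, the overhead is visible from Python: on CPython, sys.getsizeof counts a separately allocated UTF-8 cache, and (as noted above) pickling is one of the operations that populates it. A small illustration, assuming those CPython details:

        import pickle
        import sys

        s = 'αβγδε' * 10                  # non-ASCII, so UTF-8 cannot share the canonical buffer
        before = sys.getsizeof(s)
        pickle.dumps(s)                   # side effect: populates the cached UTF-8 form
        after = sys.getsizeof(s)
        print(before, after)              # "after" is larger by roughly len(s.encode('utf-8')) + 1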

    @vstinner
    Member

    Steven D'Aprano added the comment:

    the problem with caches is that you run the risk of the cache being out
    of date.

    Since strings are immutable, it's not a big deal. We control where strings
    are modified (unicodeobject.c).

    @eryksun
    Contributor

    eryksun commented Nov 24, 2015

    Why do strings cache their UTF-8 encoding?

    Strings also cache the wide-string representation. For example:

        from ctypes import *
        s = '\241\242\243'
        pythonapi.PyUnicode_AsUnicodeAndSize(py_object(s), None)
        pythonapi.PyUnicode_AsUTF8AndSize(py_object(s), None)
        >>> hex(id(s))
        '0x7ffff69f8e98'
    (gdb) p *(PyCompactUnicodeObject *)0x7ffff69f8e98
    $1 = {_base = {ob_base = {_ob_next = 0x7ffff697f890,
                              _ob_prev = 0x7ffff6a04d40,
                              ob_refcnt = 1, 
                              ob_type = 0x89d860 <PyUnicode_Type>},
                   length = 3,
                   hash = -5238559198920514942,
                   state = {interned = 0,
                            kind = 1,
                            compact = 1,
                            ascii = 0,
                            ready = 1},
                   wstr = 0x7ffff69690a0 L"¡¢£"},
          utf8_length = 6,
          utf8 = 0x7ffff696b7e8 "¡¢£",
          wstr_length = 3}
    
    (gdb) p (char *)((PyCompactUnicodeObject *)0x7ffff69f8e98 + 1)
    $2 = 0x7ffff69f8ef0 "\241\242\243"
    

    This object uses 4 bytes for the null-terminated Latin-1 string, which directly follows the PyCompactUnicodeObject struct. It uses 7 bytes for the UTF-8 string. It uses 16 bytes for the wchar_t string (4 bytes per wchar_t).

    @serhiy-storchaka
    Member

    Fixed one more bug (thanks again, Victor). The test is also improved; it no longer relies on an implementation detail of a particular builtin.

    @malemburg
    Member

    On 24.11.2015 02:30, Steven D'Aprano wrote:

    Steven D'Aprano added the comment:

    On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:

    > * the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)

    Why do strings cache their UTF-8 encoding?

    I presume that some of Python's internals rely on the UTF-8 encoding
    rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP-393).
    E.g. I infer from the above that int(s) parses the UTF-8 representation
    of s rather than the internal representation. Is that right?

    Nevertheless, I wonder why the UTF-8 representation is cached. Is it
    that expensive to generate that it can't be done on the fly, as needed?
    As it stands now, non-ASCII strings may be up to twice as big as they
    need be, once you include the UTF-8 cache. And, as this bug painfully
    shows, the problem with caches is that you run the risk of the cache
    being out of date.

    The cache is needed because it's the only way to get a direct
    C char* to the object's UTF-8 representation without having to
    worry about memory management on the caller's side. Not having
    access to this would break a lot of code using the Python
    C API, since the cache is there per design. The speedup aspect
    is secondary.

    Unicode objects are normally immutable, but there are a few
    corner cases during the initialization of the objects where
    they are in fact mutable for a short while, e.g. when
    creating an empty object which is then filled with data and
    resized to the final length before passing it back to
    Python.
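    Marc-André's point about the char* can be illustrated from Python with ctypes (a sketch only; PyUnicode_AsUTF8AndSize is the call in question, and the pointer it returns is borrowed: owned by the str object and valid for the object's lifetime):

        import ctypes

        as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
        as_utf8.restype = ctypes.c_char_p                    # borrowed buffer inside the str object
        as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]

        s = '\241\242\243'                                   # '¡¢£', as in eryksun's example above
        size = ctypes.c_ssize_t()
        data = as_utf8(s, ctypes.byref(size))
        print(size.value, data)           # 6 b'\xc2\xa1\xc2\xa2\xc2\xa3' -- nothing for the caller to free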

    @serhiy-storchaka
    Member

    Why do strings cache their UTF-8 encoding?

    Mainly for compatibility with the existing C API. The common way to parse function arguments in a function implemented in C is to use the special argument parsing API: PyArg_ParseTuple, PyArg_ParseTupleAndKeywords, or PyArg_Parse. Most format codes for Unicode strings return a C pointer to a char array. For that, the encoded Unicode string has to be kept somewhere, at least for as long as the C function executes. Since the PyArg_Parse* functions don't let the caller specify storage for the encoded string, it has to be saved in the Unicode object itself. This is not new to PEP 393 or Python 3; in Python 2 Unicode objects also kept a cached encoded version.

    @vstinner
    Member

    issue25709_4.patch now looks good to me, but I added some minor comments on the review.

    @serhiy-storchaka
    Member

    Georg, I'm asking for this fix to be applied to 3.3.

    @larryhastings
    Contributor

    Is this going in soon? I want to cherry-pick this for 3.5.1, which I tag in about 80 hours.

    @serhiy-storchaka
    Member

    I'm only waiting for Greg's approval for 3.3. If I don't get it within a day, I'll commit the patch for 3.4+.

    @vstinner
    Member

    vstinner commented Dec 2, 2015

    Please commit right now to 3.4+. Backport to 3.3 can be done later.

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 2, 2015

    New changeset 67718032badb by Serhiy Storchaka in branch '3.4':
    Issue bpo-25709: Fixed problem with in-place string concatenation and utf-8 cache.
    https://hg.python.org/cpython/rev/67718032badb

    New changeset a0e2376768dc by Serhiy Storchaka in branch '3.5':
    Issue bpo-25709: Fixed problem with in-place string concatenation and utf-8 cache.
    https://hg.python.org/cpython/rev/a0e2376768dc

    New changeset 9e800b2aeeac by Serhiy Storchaka in branch 'default':
    Issue bpo-25709: Fixed problem with in-place string concatenation and utf-8 cache.
    https://hg.python.org/cpython/rev/9e800b2aeeac

    @vstinner
    Member

    vstinner commented Dec 2, 2015

    New changeset 67718032badb by Serhiy Storchaka in branch '3.4':

    Thanks.

    @larryhastings
    Contributor

    I cherry-picked this for 3.5.1.

    @python-dev
    Mannequin

    python-dev mannequin commented Dec 7, 2015

    New changeset 376b100107ba by Serhiy Storchaka in branch '3.5':
    Issue bpo-25709: Fixed problem with in-place string concatenation and utf-8 cache.
    https://hg.python.org/cpython/rev/376b100107ba

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 11, 2016

    New changeset b9c8f1c80f47 by Serhiy Storchaka in branch '3.3':
    Issue bpo-25709: Fixed problem with in-place string concatenation and utf-8 cache.
    https://hg.python.org/cpython/rev/b9c8f1c80f47

    @birkenfeld
    Member

    Backpicked to 3.3. Sorry for the wait.

    @vstinner
    Member

    2016-02-11 18:23 GMT+01:00 Georg Brandl <report@bugs.python.org>:

    Georg Brandl added the comment:

    Backpicked to 3.3. Sorry for the wait.

    Good, this bugfix is useful :-)

    @serhiy-storchaka
    Member

    I'm only waiting for Greg's approval for 3.3. If I don't get it within a day, I'll commit the patch for 3.4+.

    Maybe it was my fault. I made a mistake in Georg's name.

    @birkenfeld
    Member

    Actually I prefer Greg to Gerg, so it's only half bad. :D

    @serhiy-storchaka
    Member

    b9c8f1c80f47 added a new head. Should we merge 3.3 -> 3.4 -> 3.5 -> default?

    @birkenfeld
    Member

    Don't bother. I can do that once 3.3.7 is released.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022