Problem with string concatenation and utf-8 cache. #69895
Comments
One of my students found this bug. For ASCII characters it works as you would expect, but for the Greek alphabet it behaves unexpectedly. |
I'm afraid I'm unable to replicate this bug report in Python 3.4. If you are able to replicate it, can you tell us the exact version number of Python you are using? Also, which operating system are you using? |
Confirmed on IDLE.

```
>>> s = ''
>>> for i in range(5):
    s += '\xe0'
    print(s)

à
àà
àà
àà
àà
>>> s = ''
>>> for i in range(5):
    s += chr(0xe0)
    print(s)

à
àà
àà
àà
àà
>>> s = ''
>>> for i in range(5):
    s += '\u0107'
    print(s)

ć
ćć
ćć
ćć
ćć
>>> s = ''
>>> for i in range(5):
    s += chr(0x107)
    print(s)

ć
```

This issue may be related to details of IDLE's standard streams and RPC. |
Here is a reproducer without IDLE. It looks like pickle is the culprit.

```
>>> import pickle
>>> s = ''
>>> for i in range(5):
...     s += chr(0xe0)
...     print(len(s), s, s.encode(), repr(s))
...     print(' ', pickle.dumps(s))
...
1 à b'\xc3\xa0' 'à'
  b'\x80\x03X\x02\x00\x00\x00\xc3\xa0q\x00.'
2 àà b'\xc3\xa0\xc3\xa0' 'àà'
  b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
3 àà b'\xc3\xa0\xc3\xa0' 'ààà'
  b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
4 àà b'\xc3\xa0\xc3\xa0' 'àààà'
  b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
5 àà b'\xc3\xa0\xc3\xa0' 'ààààà'
  b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
```
|
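The same priming can be done without pickle by asking the C API for the cached UTF-8 buffer directly. The following is a sketch assuming CPython and ctypes; on a build with the fix applied the assertions hold, while a buggy build would fail them:

```python
from ctypes import pythonapi, py_object, c_char_p, c_void_p

# PyUnicode_AsUTF8AndSize materializes and caches the UTF-8 form,
# which is what pickle.dumps() does internally.
pythonapi.PyUnicode_AsUTF8AndSize.restype = c_char_p
pythonapi.PyUnicode_AsUTF8AndSize.argtypes = [py_object, c_void_p]

s = ''
for i in range(5):
    s += chr(0xe0)
    # Prime the cache on every iteration, like pickling did.
    cached = pythonapi.PyUnicode_AsUTF8AndSize(s, None)
    # A stale cache would make the next line fail.
    assert cached == s.encode('utf-8')

assert len(s) == 5
assert s == '\xe0' * 5
```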
I've looked at the raw bytes [through a ctypes pointer to id(s)] of a string affected by the issue, and decoded enough to be able to tell that the bad string has an incorrect UTF-8 length and data, which pickle presumably relies on.

```
HEAD............length..hash....flgs....wstr....u8len...u8ptr...wstrl...data....
```

I've omitted the UTF-8 data, but both strings were null-terminated after four and six bytes respectively. |
unicode_modifiable in Objects/unicodeobject.c should return 0 if there is cached PyUnicode_UTF8 data. In that case PyUnicode_Append won't operate in place but will instead concatenate into a new string. |
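The idea can be illustrated with a toy model in pure Python. This is not CPython's actual unicode_modifiable code; the class and field names below are invented for illustration:

```python
class ToyStr:
    """Toy stand-in for a PEP 393-style string object (invented fields)."""
    def __init__(self, data):
        self.refcnt = 1
        self.hash = -1        # -1 means "hash not computed yet"
        self.utf8 = None      # cached UTF-8 bytes, or None
        self.data = data

def modifiable(s):
    """Resizing in place is only safe when the object is unshared and
    carries no derived data that a resize would leave stale."""
    if s.refcnt != 1:
        return False
    if s.hash != -1:
        return False
    if s.utf8 is not None:    # the extra guard this comment calls for
        return False
    return True

s = ToyStr('\xe0')
assert modifiable(s)          # fresh, unshared: in-place resize is fine
s.utf8 = '\xe0'.encode()      # someone cached the UTF-8 form
assert not modifiable(s)      # must concatenate into a new object instead
```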
I can't reproduce without pickle. I did some further digging, though, and it *looks like*...
The actual operation that creates an inconsistent string is the concatenate operation, but it only happens with a string that has been "primed" by having its UTF-8 representation materialized. |
Yes, I have come to the same conclusion as random832. String objects have a "fast path" for concatenation, and on this path the cached UTF-8 representation is not cleared. Pickle is one of the simplest ways to reproduce this issue. Maybe it can be reproduced with compile() or type(), but those ways look too heavyweight to me. Here is a patch that fixes string concatenation. |
It would be good to get this in 3.4.4. |
Added test without using pickle. |
Serhiy, when does sharing UTF-8 data occur in a compact object? It has to be ASCII since non-ASCII UTF-8 isn't sharable, but PyASCIIObject doesn't have the utf8 field. So it has to be a PyCompactUnicodeObject. But isn't ASCII always allocated as a PyASCIIObject? I need a bit of help getting from point A to point B. Thanks. |
I reviewed issue25709_2.patch.
Since it's a major bug in the Unicode implementation, it may be worth fixing in Python 3.3 as well. The bug was introduced in Python 3.3 by PEP 393. |
3.3 is presumably in security mode. Anyone using it would have had to live with the bug for a long time already. |
Shouldn't it still operate in place but clear it? Operating in place is only an option if the old string has no references and will be discarded, right? |
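Right. The difference between the two paths can be observed from Python. This sketch is implementation-dependent (CPython may or may not realloc in place for the unshared string, so that check is printed rather than asserted):

```python
n = 64                    # built at runtime to avoid constant folding
s = 'x' * n               # only reference: eligible for in-place resize
before = id(s)
s += 'y'                  # CPython may realloc the same object in place
print('same object after +=:', id(s) == before)

t = 'x' * n
keep = t                  # a second reference rules out in-place resize
before = id(t)
t += 'y'
# Always False here: `keep` still holds the old string alive,
# so the concatenation result must be a distinct object.
print('same object with extra ref:', id(t) == before)
assert s == t == 'x' * n + 'y'
```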
I read some comments here and on the patches. Serhiy's patch adds some code, and Victor says "you can't call that macro on this object" and "wow this is badly broken". Can someone explain in simpler terms what's so broken, exactly? |
By "and wow this is badly broken" I mean that the current code is badly broken. The bug is that sometimes, when a string is resized (which doesn't make sense, strings are immutable, right? :-D), the cached UTF-8 string can become corrupted (old pointer not updated). It occurs if
Ok, it's probably unlikely to hit these 3 conditions together, but from my point of view it's a major bug because it's in a fundamental Python type (str), it's not a bug in user code, and it cannot be worked around (easily). |
The updated patch fixes a bug found by Victor and addresses his other comments. Many thanks, Victor! |
On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:
Why do strings cache their UTF-8 encoding? I presume that some of Python's internals rely on the UTF-8 encoding. Nevertheless, I wonder why the UTF-8 representation is cached. Is it |
Steven D'Aprano added the comment:
Since strings are immutable, it's not a big deal. We control where strings |
Strings also cache the wide-string representation. For example:

```
from ctypes import *
s = '\241\242\243'
pythonapi.PyUnicode_AsUnicodeAndSize(py_object(s), None)
pythonapi.PyUnicode_AsUTF8AndSize(py_object(s), None)
```

```
>>> hex(id(s))
'0x7ffff69f8e98'
```

This object uses 4 bytes for the null-terminated Latin-1 string, which directly follows the PyCompactUnicodeObject struct. It uses 7 bytes for the UTF-8 string. It uses 16 bytes for the wchar_t string (4 bytes per wchar_t). |
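The cached encodings are also charged to the object by sys.getsizeof, which gives a pure-observation way to watch a cache appear. A sketch assuming CPython and ctypes (the exact byte counts are implementation details, so only growth is checked):

```python
import sys
from ctypes import pythonapi, py_object, c_char_p, c_void_p

pythonapi.PyUnicode_AsUTF8AndSize.restype = c_char_p
pythonapi.PyUnicode_AsUTF8AndSize.argtypes = [py_object, c_void_p]

# Build the string at runtime so no compile-time constant is shared.
# Non-ASCII, so the UTF-8 form needs its own buffer.
s = chr(0xa1) + chr(0xa2) + chr(0xa3)
before = sys.getsizeof(s)
pythonapi.PyUnicode_AsUTF8AndSize(s, None)   # materialize the UTF-8 cache
after = sys.getsizeof(s)
assert after > before   # the cached UTF-8 bytes now count toward the size
```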
Fixed one more bug (thanks again, Victor). The test is improved: it no longer relies on an implementation detail of a particular builtin. |
On 24.11.2015 02:30, Steven D'Aprano wrote:
The cache is needed because it's the only way to get a direct Unicode objects are normally immutable, but there are a few |
Mainly for compatibility with the existing C API. The common way to parse arguments in a function implemented in C is to use the special argument-parsing API: PyArg_ParseTuple, PyArg_ParseTupleAndKeywords, or PyArg_Parse. Most format codes for Unicode strings returned a C pointer to a char array, so the encoded Unicode string has to be kept somewhere at least for the duration of the C function call. Since the PyArg_Parse* functions don't let the caller specify storage for the encoded string, it has to be saved on the Unicode object itself. This is not new to PEP 393 or Python 3; in Python 2, Unicode objects also kept a cached encoded version. |
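That caching contract can be observed from Python: repeated calls hand back the same pointer, which is what lets PyArg_Parse* return a borrowed char* that stays valid as long as the string is alive. A sketch assuming CPython and ctypes:

```python
from ctypes import pythonapi, py_object, c_void_p

pythonapi.PyUnicode_AsUTF8AndSize.restype = c_void_p   # raw pointer value
pythonapi.PyUnicode_AsUTF8AndSize.argtypes = [py_object, c_void_p]

s = 'h\xe9llo'   # non-ASCII, so the UTF-8 form is a separate cached buffer
p1 = pythonapi.PyUnicode_AsUTF8AndSize(s, None)
p2 = pythonapi.PyUnicode_AsUTF8AndSize(s, None)
assert p1 == p2  # the second call returns the cached buffer, not a new one
```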
issue25709_4.patch now looks good to me, but I added some minor comments on the review. |
Georg, I'm asking for this fix to be applied to 3.3. |
Is this going in soon? I want to cherry-pick this for 3.5.1, which I tag in about 80 hours. |
I'm waiting only for Greg's approval for 3.3. If I don't get it within a day, I'll commit the patch for 3.4+. |
Please commit right now to 3.4+. Backport to 3.3 can be done later. |
New changeset 67718032badb by Serhiy Storchaka in branch '3.4':
New changeset a0e2376768dc by Serhiy Storchaka in branch '3.5':
New changeset 9e800b2aeeac by Serhiy Storchaka in branch 'default': |
Thanks. |
I cherry-picked this for 3.5.1. |
New changeset 376b100107ba by Serhiy Storchaka in branch '3.5': |
New changeset b9c8f1c80f47 by Serhiy Storchaka in branch '3.3': |
Backpicked to 3.3. Sorry for the wait. |
2016-02-11 18:23 GMT+01:00 Georg Brandl <report@bugs.python.org>:
Good, this bugfix is useful :-) |
Maybe it was my fault. I made a mistake in Georg's name. |
Actually I prefer Greg to Gerg, so it's only half bad. :D |
b9c8f1c80f47 added a new head. Should we merge 3.3 -> 3.4 -> 3.5 -> default? |
Don't bother. I can do that once 3.3.7 is released. |