classification
Title: Problem with string concatenation and utf-8 cache.
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: georg.brandl Nosy List: benjamin.peterson, eryksun, ezio.melotti, georg.brandl, larry, lemburg, pitrou, python-dev, random832, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, Árpád Kósa
Priority: release blocker Keywords: patch

Created on 2015-11-23 14:57 by Árpád Kósa, last changed 2016-02-13 03:10 by ned.deily. This issue is now closed.

Files
File name Uploaded Description Edit
greekbug.py Árpád Kósa, 2015-11-23 14:57 small script generating greek alphabet
issue25709.patch serhiy.storchaka, 2015-11-23 19:20 review
issue25709_2.patch serhiy.storchaka, 2015-11-23 20:14 review
issue25709_3.patch serhiy.storchaka, 2015-11-23 22:47 review
issue25709_4.patch serhiy.storchaka, 2015-11-24 08:51 review
Messages (39)
msg255172 - (view) Author: Árpád Kósa (Árpád Kósa) Date: 2015-11-23 14:57
One of my students found this bug. For ascii characters it works as you expect, but for greek alphabet it works unexpectedly.
The program works correctly for Python 3.2.x but for 3.4.x and 3.5 it gives erroneous result.
msg255175 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-11-23 15:14
I'm afraid I'm unable to replicate this bug report in Python 3.4.

If you are able to replicate it, can you tell us the exact version number of Python you are using? Also, which operating system are you using?
msg255177 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-23 15:30
Confirmed on IDLE.

>>> s = ''
>>> for i in range(5):
	s += '\xe0'
	print(s)

	
à
àà
àà
àà
àà
>>> s = ''
>>> for i in range(5):
	s += chr(0xe0)
	print(s)

	
à
àà
àà
àà
àà
>>> s = ''
>>> for i in range(5):
	s += '\u0107'
	print(s)

	
ć
ćć
ćć
ćć
ćć
>>> s = ''
>>> for i in range(5):
	s += chr(0x107)
	print(s)

	
ć
ć
ć
ć
ć

This issue can be related to details of IDLE's standard streams and RPC.
msg255194 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-23 16:57
Here is reproducer without IDLE. Looks as pickle is a culprit.

>>> import pickle
>>> s = ''
>>> for i in range(5):
...         s += chr(0xe0)
...         print(len(s), s, s.encode(), repr(s))
...         print('   ', pickle.dumps(s))
... 
1 à b'\xc3\xa0' 'à'
    b'\x80\x03X\x02\x00\x00\x00\xc3\xa0q\x00.'
2 àà b'\xc3\xa0\xc3\xa0' 'àà'
    b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
3 àà b'\xc3\xa0\xc3\xa0' 'ààà'
    b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
4 àà b'\xc3\xa0\xc3\xa0' 'àààà'
    b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
5 àà b'\xc3\xa0\xc3\xa0' 'ààààà'
    b'\x80\x03X\x04\x00\x00\x00\xc3\xa0\xc3\xa0q\x00.'
msg255203 - (view) Author: (random832) Date: 2015-11-23 17:40
I've looked at the raw bytes [through a ctypes pointer to id(s)] of a string affected by the issue, and decoded enough to be able to tell that the bad string has an incorrect UTF-8 length and data, which pickle presumably relies on.

HEAD............length..hash....flgs....wstr....u8len...u8ptr...wstrl...data....
010000003094ac6403000000ffffffffa40000000000000004000000d814480000000000e0e0e000
010000003094ac6403000000e5d0030ca400006500000000060000008814480000000000e0e0e000

I've omitted the UTF-8 data, but both were null-terminated after four and six bytes respectively.
msg255206 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-11-23 17:58
unicode_modifiable in Objects/unicodeobject.c should return 0 if there's cached PyUnicode_UTF8 data. In this case PyUnicode_Append won't operate in place but instead concatenate a new string.
msg255207 - (view) Author: (random832) Date: 2015-11-23 18:05
I can't reproduce without pickle. I did some further digging, though, and it *looks like*...

1. Pickle causes the built-in UTF-8 representation of a string to be populated, whereas encode('utf-8') does not. Can anyone think of any other operations that do this?
2. After the UTF-8 representation of the 2-character string is populated, concatenating a new character to it does not update or clear it.
3. However, it worked just fine with the 1-character string - concatenating it caused the UTF-8 representation to be cleared.

The actual operation that creates an inconsistent string is the concatenate operation, but it only happens with a string that has been "primed" by having its UTF-8 representation materialized.
msg255212 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-23 19:20
Yes, I have came to the same as random832. String objects have "fast path" for concatenating, and in this path cached UTF8 representation is not cleaned. Pickle is one of simplest ways to reproduce this issue. May be it can be reproduced with compile() or type(), but these ways looks too heavyweighted to me.

Here is a patch that fixes strings concatenation.
msg255214 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2015-11-23 19:41
It would be good to get this in 3.4.4.
msg255216 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-23 20:14
Added test without using pickle.
msg255218 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-11-23 20:19
Serhiy, when does sharing UTF-8 data occur in a compact object? It has to be ASCII since non-ASCII UTF-8 isn't sharable, but PyASCIIObject doesn't have the utf8 field. So it has to be a PyCompactUnicodeObject. But isn't ASCII always allocated as a PyASCIIObject? I need a bit of help getting from point A to point B. Thanks.
msg255226 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-11-23 21:06
I reviewed issue25709_2.patch.

> It would be good to get this in 3.4.4.

Since it's a major bug in the Unicode implementation, it may be worth to fix it in Python 3.3. The bug was introduced in Python 3.3 by the PEP 393.
msg255227 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-11-23 21:09
3.3 is presumably in security mode. Anyone using it would have had to live with the bug for a long time already.
msg255230 - (view) Author: (random832) Date: 2015-11-23 21:20
> unicode_modifiable in Objects/unicodeobject.c should return 0 if there's cached PyUnicode_UTF8 data. In this case PyUnicode_Append won't operate in place but instead concatenate a new string.

Shouldn't it still operate in place but clear it? Operating in place is only an option if the old string has no references and will be discarded, right?
msg255233 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-11-23 21:42
I read some comments here and on the patches.  Serhiy's patch adds some code and Victor says you can't call that macro on this object and wow this is badly broken.  Can someone explain in simpler terms what's so broken, exactly?
msg255234 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-11-23 21:48
" and wow this is badly broken "

I mean the currently code is badly broken.

The bug is that sometimes, when a string is resized (which doesn't make sense, strings are immutable, right? :-D), the cached UTF-8 string can become corrupted (old pointer not updated).

It occurs if

* the string is resized (ex: "s += s2")
* the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)
* the resize moves the memory block to a new address

Ok, it's probably unlikely to get these 3 conditions, but from my point of view, it's a major bug because it's in a Python fundamental type (str), it's not a bug in user code and it cannot be worked around (easily).
msg255236 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-23 22:47
In updated patch fixed a bug found by Victor and addressed other his comments. Many thanks Victor!
msg255240 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2015-11-24 01:30
On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:

> * the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)

Why do strings cache their UTF-8 encoding?

I presume that some of Python's internals rely on the UTF-8 encoding 
rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393). 
E.g. I infer from the above that int(s) parses the UTF-8 representation 
of s rather than the internal representation. Is that right?

Nevertheless, I wonder why the UTF-8 representation is cached. Is it 
that expensive to generate that it can't be done on the fly, as needed? 
As it stands now, non-ASCII strings may be up to twice as big as they 
need be, once you include the UTF-8 cache. And, as this bug painfully 
shows, the problem with caches is that you run the risk of the cache 
being out of date.
msg255241 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-11-24 01:47
Steven D'Aprano added the comment:
> the problem with caches is that you run the risk of the cache being out
of date.

Since strings are immutable, it's not a big deal. We control where strings
are modified (unicodeobject.c).
msg255244 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-11-24 02:29
> Why do strings cache their UTF-8 encoding?

Strings also cache the wide-string representation. For example:

    from ctypes import *
    s = '\241\242\243'
    pythonapi.PyUnicode_AsUnicodeAndSize(py_object(s), None)
    pythonapi.PyUnicode_AsUTF8AndSize(py_object(s), None)

    >>> hex(id(s))
    '0x7ffff69f8e98'

    (gdb) p *(PyCompactUnicodeObject *)0x7ffff69f8e98
    $1 = {_base = {ob_base = {_ob_next = 0x7ffff697f890,
                              _ob_prev = 0x7ffff6a04d40,
                              ob_refcnt = 1, 
                              ob_type = 0x89d860 <PyUnicode_Type>},
                   length = 3,
                   hash = -5238559198920514942,
                   state = {interned = 0,
                            kind = 1,
                            compact = 1,
                            ascii = 0,
                            ready = 1},
                   wstr = 0x7ffff69690a0 L"¡¢£"},
          utf8_length = 6,
          utf8 = 0x7ffff696b7e8 "¡¢£",
          wstr_length = 3}

    (gdb) p (char *)((PyCompactUnicodeObject *)0x7ffff69f8e98 + 1)
    $2 = 0x7ffff69f8ef0 "\241\242\243"

This object uses 4 bytes for the null-terminated Latin-1 string, which directly follows the PyCompactUnicodeObject struct. It uses 7 bytes for the UTF-8 string. It uses 16 bytes for the wchar_t string (4 bytes per wchar_t).
msg255255 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-24 08:51
Fixed yet one bug (thanks Victor again). Test is improved, now it doesn't rely on implementation detail of particular builtin.
msg255256 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2015-11-24 08:58
On 24.11.2015 02:30, Steven D'Aprano wrote:
> 
> Steven D'Aprano added the comment:
> 
> On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:
> 
>> * the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)
> 
> Why do strings cache their UTF-8 encoding?
> 
> I presume that some of Python's internals rely on the UTF-8 encoding 
> rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393). 
> E.g. I infer from the above that int(s) parses the UTF-8 representation 
> of s rather than the internal representation. Is that right?
> 
> Nevertheless, I wonder why the UTF-8 representation is cached. Is it 
> that expensive to generate that it can't be done on the fly, as needed? 
> As it stands now, non-ASCII strings may be up to twice as big as they 
> need be, once you include the UTF-8 cache. And, as this bug painfully 
> shows, the problem with caches is that you run the risk of the cache 
> being out of date.

The cache is needed because it's the only way to get a direct
C char* to the object's UTF-8 representation without having to
worry about memory management on the caller's side. Not having
access to this would break a lot of code using the Python
C API, since the cache is there per design. The speedup aspect
is secondary.

Unicode objects are normally immutable, but there are a few
corner cases during the initialization of the objects where
they are in fact mutable for a short while, e.g. when
creating an empty object which is then filled with data and
resized to the final length before passing it back to
Python.
msg255257 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-24 09:07
> Why do strings cache their UTF-8 encoding?

Mainly for compatibility with existing C API. Common way to parse function arguments in implemented in C function is to use special argument parsing API: PyArg_ParseTuple, PyArg_ParseTupleAndKeywords, or PyArg_Parse. Most format codes for Unicode strings returned a C pointer to char array. For that encoded Unicode strings should be kept somewhere at least for the time of executing C function. As well as PyArg_Parse* functions doesn't allow user to specify a storage for encoded string, it should be saved in Unicode object. That is not new to PEP 393 or Python 3, in Python 2 the Unicode objects also keep cached encoded version.
msg255259 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-11-24 10:25
issue25709_4.patch now looks good to me, but I added some minor comments on the review.
msg255271 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-11-24 15:32
Georg, I ask for applying this fix to 3.3.
msg255705 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-12-02 10:57
Is this going in soon?  I want to cherry-pick this for 3.5.1, which I tag in about 80 hours.
msg255708 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-12-02 11:24
I wait only Greg's approving for 3.3. If I'll not get it in a day, I'll commit the patch for 3.4+.
msg255710 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-02 11:58
Please commit right now to 3.4+. Backport to 3.3 can be done later.
msg255793 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-12-02 23:06
New changeset 67718032badb by Serhiy Storchaka in branch '3.4':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/67718032badb

New changeset a0e2376768dc by Serhiy Storchaka in branch '3.5':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/a0e2376768dc

New changeset 9e800b2aeeac by Serhiy Storchaka in branch 'default':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/9e800b2aeeac
msg255794 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-02 23:18
> New changeset 67718032badb by Serhiy Storchaka in branch '3.4':

Thanks.
msg255996 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2015-12-06 00:52
I cherry-picked this for 3.5.1.
msg256050 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-12-07 06:16
New changeset 376b100107ba by Serhiy Storchaka in branch '3.5':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/376b100107ba
msg260115 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-02-11 17:23
New changeset b9c8f1c80f47 by Serhiy Storchaka in branch '3.3':
Issue #25709: Fixed problem with in-place string concatenation and utf-8 cache.
https://hg.python.org/cpython/rev/b9c8f1c80f47
msg260116 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2016-02-11 17:23
Backpicked to 3.3. Sorry for the wait.
msg260119 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-11 18:00
2016-02-11 18:23 GMT+01:00 Georg Brandl <report@bugs.python.org>:
>
> Georg Brandl added the comment:
>
> Backpicked to 3.3. Sorry for the wait.

Good, this bugfix is useful :-)
msg260120 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-11 18:21
> I wait only Greg's approving for 3.3. If I'll not get it in a day, I'll commit the patch for 3.4+.

Maybe it was my fault. I made a mistake in Georg's name.
msg260121 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2016-02-11 18:24
Actually I prefer Greg to Gerg, so it's only half bad. :D
msg260172 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-02-12 10:41
b9c8f1c80f47 added a new head. Should we merge 3.3 -> 3.4 -> 3.5 -> default?
msg260174 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2016-02-12 11:51
Don't bother. I can do that once 3.3.7 is released.
History
Date User Action Args
2016-02-13 03:10:46ned.deilysetstage: patch review -> resolved
2016-02-12 11:51:47georg.brandlsetmessages: + msg260174
2016-02-12 10:41:19serhiy.storchakasetmessages: + msg260172
2016-02-11 18:24:32georg.brandlsetmessages: + msg260121
2016-02-11 18:21:32serhiy.storchakasetmessages: + msg260120
2016-02-11 18:00:32vstinnersetmessages: + msg260119
2016-02-11 17:23:56georg.brandlsetstatus: open -> closed
resolution: fixed
messages: + msg260116
2016-02-11 17:23:36python-devsetmessages: + msg260115
2016-02-11 17:04:14serhiy.storchakasetassignee: serhiy.storchaka -> georg.brandl
2015-12-07 06:16:02python-devsetmessages: + msg256050
2015-12-06 00:52:45larrysetmessages: + msg255996
2015-12-03 17:22:39serhiy.storchakasetversions: - Python 3.4, Python 3.5, Python 3.6
2015-12-02 23:18:13vstinnersetmessages: + msg255794
2015-12-02 23:06:28python-devsetnosy: + python-dev
messages: + msg255793
2015-12-02 11:58:04vstinnersetmessages: + msg255710
2015-12-02 11:24:58serhiy.storchakasetmessages: + msg255708
2015-12-02 10:57:49larrysetmessages: + msg255705
2015-11-24 15:32:04serhiy.storchakasetnosy: + georg.brandl

messages: + msg255271
versions: + Python 3.3
2015-11-24 10:25:25vstinnersetmessages: + msg255259
2015-11-24 09:07:54serhiy.storchakasetmessages: + msg255257
2015-11-24 08:58:50lemburgsetmessages: + msg255256
2015-11-24 08:51:07serhiy.storchakasetfiles: + issue25709_4.patch

messages: + msg255255
2015-11-24 02:29:05eryksunsetmessages: + msg255244
2015-11-24 01:47:36vstinnersetmessages: + msg255241
2015-11-24 01:30:48steven.dapranosetmessages: + msg255240
2015-11-23 22:47:23serhiy.storchakasetfiles: + issue25709_3.patch

messages: + msg255236
2015-11-23 21:48:46vstinnersetmessages: + msg255234
2015-11-23 21:42:29larrysetmessages: + msg255233
2015-11-23 21:20:28random832setmessages: + msg255230
2015-11-23 21:09:43pitrousetmessages: + msg255227
2015-11-23 21:06:42vstinnersetmessages: + msg255226
2015-11-23 20:19:25eryksunsetmessages: + msg255218
2015-11-23 20:14:42serhiy.storchakasetfiles: + issue25709_2.patch
priority: high -> release blocker

nosy: + larry
messages: + msg255216
2015-11-23 19:41:15terry.reedysettitle: greek alphabet bug it is very disturbing... -> Problem with string concatenation and utf-8 cache.
nosy: + lemburg, pitrou, vstinner, benjamin.peterson, ezio.melotti, - kbk, roger.serwy

messages: + msg255214

components: + Library (Lib), - IDLE
2015-11-23 19:20:56serhiy.storchakasetfiles: + issue25709.patch
keywords: + patch
messages: + msg255212

stage: patch review
2015-11-23 18:05:52random832setmessages: + msg255207
2015-11-23 17:58:50eryksunsetnosy: + eryksun
messages: + msg255206
2015-11-23 17:40:09random832setnosy: + random832
messages: + msg255203
2015-11-23 16:57:45serhiy.storchakasetpriority: normal -> high

messages: + msg255194
2015-11-23 15:30:04serhiy.storchakasetversions: + Python 3.5, Python 3.6
nosy: + terry.reedy, serhiy.storchaka, roger.serwy, kbk

messages: + msg255177

assignee: serhiy.storchaka
components: + IDLE
2015-11-23 15:14:16steven.dapranosetnosy: + steven.daprano
messages: + msg255175
2015-11-23 14:57:19Árpád Kósacreate