
Author steven.daprano
Recipients benjamin.peterson, eryksun, ezio.melotti, larry, lemburg, pitrou, random832, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, Árpád Kósa
Date 2015-11-24.01:30:47
On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:

> * the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)

Why do strings cache their UTF-8 encoding?

I presume that some of Python's internals rely on the UTF-8 encoding 
rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393). 
E.g. I infer from the above that int(s) parses the UTF-8 representation 
of s rather than the internal representation. Is that right?

Nevertheless, I wonder why the UTF-8 representation is cached. Is it 
so expensive to generate that it can't be done on the fly, as needed? 
As it stands now, non-ASCII strings may be up to twice as big as they 
need be, once you include the UTF-8 cache. And, as this bug painfully 
shows, the problem with caches is that you run the risk of the cache 
being out of date.
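
To put rough numbers on the size concern, here is a minimal sketch that compares the payload of the PEP 393 internal representation with the payload of a UTF-8 copy of the same string. It is only an estimate: the helper names (pep393_char_width, payload_sizes) are made up for illustration, object headers and terminators are ignored, and it assumes the UTF-8 cache has actually been populated for the string in question. It does not inspect CPython's real cache.

```python
def pep393_char_width(s):
    """Bytes per code point in the PEP 393 representation (1, 2 or 4).

    Hypothetical helper: it derives the width from the largest code
    point, which is how PEP 393 describes choosing the storage kind.
    """
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1  # Latin-1 range
    if top < 0x10000:
        return 2  # UCS-2 range
    return 4      # everything above the BMP


def payload_sizes(s):
    """Return (internal_bytes, utf8_bytes), character data only.

    Assumption: headers and trailing NULs are ignored. For pure-ASCII
    strings the two encodings are byte-for-byte identical, so no
    separate UTF-8 copy is needed in that case.
    """
    internal = len(s) * pep393_char_width(s)
    utf8 = len(s.encode("utf-8"))
    return internal, utf8


samples = {
    "ASCII": "abc" * 10,
    "Latin-1": "\u00e9" * 10,        # e with acute accent
    "BMP": "\u20ac" * 10,            # euro sign
    "astral": "\U0001F600" * 10,     # emoji outside the BMP
}

for label, text in samples.items():
    internal, utf8 = payload_sizes(text)
    print(f"{label:8} internal={internal:3} bytes, utf-8 copy={utf8:3} bytes")
```

How much a cached copy adds depends on the characters involved: the UTF-8 copy can be larger than, equal to, or (for mixed strings) smaller than the internal data.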

History
Date                 User            Action  Args
2015-11-24 01:30:49  steven.daprano  set     recipients: + steven.daprano, lemburg, terry.reedy, pitrou, vstinner, larry, benjamin.peterson, ezio.melotti, serhiy.storchaka, eryksun, random832, Árpád Kósa
2015-11-24 01:30:48  steven.daprano  link    issue25709 messages
2015-11-24 01:30:47  steven.daprano  create