Author lemburg
Recipients benjamin.peterson, eryksun, ezio.melotti, larry, lemburg, pitrou, random832, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, Árpád Kósa
Date 2015-11-24.08:58:50
Message-id <565426C4.1080100@egenix.com>
In-reply-to <20151124013020.GN3821@ando.pearwood.info>
Content
On 24.11.2015 02:30, Steven D'Aprano wrote:
> 
> Steven D'Aprano added the comment:
> 
> On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:
> 
>> * the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)
> 
> Why do strings cache their UTF-8 encoding?
> 
> I presume that some of Python's internals rely on the UTF-8 encoding 
> rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393). 
> E.g. I infer from the above that int(s) parses the UTF-8 representation 
> of s rather than the internal representation. Is that right?
> 
> Nevertheless, I wonder why the UTF-8 representation is cached. Is it 
> that expensive to generate that it can't be done on the fly, as needed? 
> As it stands now, non-ASCII strings may be up to twice as big as they 
> need be, once you include the UTF-8 cache. And, as this bug painfully 
> shows, the problem with caches is that you run the risk of the cache 
> being out of date.

The cache is needed because it's the only way to get a direct
C char* to the object's UTF-8 representation without having to
worry about memory management on the caller's side. Removing it
would break a lot of code that uses the Python C API, since the
cache is there by design. The speedup is secondary.
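
To make the ownership point concrete, here is a minimal sketch that calls
PyUnicode_AsUTF8() through ctypes. It assumes CPython (ctypes.pythonapi is
CPython-specific), and the helper name utf8_pointer is invented for this
illustration. The returned char* points into memory owned by the str
object, so the caller never frees it, and a second call is expected to
return the same address.

```python
import ctypes
import sys

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
_as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
_as_utf8.argtypes = [ctypes.py_object]
_as_utf8.restype = ctypes.c_void_p  # keep the raw char* address as an int


def utf8_pointer(s):
    """Return the address of the UTF-8 buffer owned by the str object.

    Hypothetical helper for illustration only.
    """
    return _as_utf8(s)


# Build the string at run time so nothing has asked for its UTF-8 form yet.
s = "".join(["Árpád", " Kósa"])

size_before = sys.getsizeof(s)
first = utf8_pointer(s)    # populates the cache for a non-ASCII string
second = utf8_pointer(s)   # served from the cache
size_after = sys.getsizeof(s)

print(first == second)             # same buffer both times
print(ctypes.string_at(first))     # the NUL-terminated UTF-8 bytes
print(size_before, size_after)     # the cached copy may add to the object size
```

The buffer stays valid only as long as the str object is alive, which is
why C callers can use the pointer without any cleanup of their own.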

Unicode objects are normally immutable, but there are a few
corner cases during the initialization of the objects where
they are in fact mutable for a short while, e.g. when
creating an empty object which is then filled with data and
resized to the final length before passing it back to
Python.
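
The create, fill and resize pattern can be sketched as follows. This is an
illustration only: it drives PyUnicode_New(), PyUnicode_WriteChar() and
PyUnicode_Resize() through ctypes and assumes CPython, whereas real
extension code would make these calls directly in C. The function name
build_string and its capacity parameter are invented for the example.

```python
import ctypes

api = ctypes.pythonapi  # assumption: CPython

# Keep the new object as a raw pointer so that no extra Python reference
# exists while the string is still being filled in.
api.PyUnicode_New.argtypes = [ctypes.c_ssize_t, ctypes.c_uint32]
api.PyUnicode_New.restype = ctypes.c_void_p
api.PyUnicode_WriteChar.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t,
                                    ctypes.c_uint32]
api.PyUnicode_WriteChar.restype = ctypes.c_int
api.PyUnicode_Resize.argtypes = [ctypes.POINTER(ctypes.c_void_p),
                                 ctypes.c_ssize_t]
api.PyUnicode_Resize.restype = ctypes.c_int
api.Py_DecRef.argtypes = [ctypes.c_void_p]
api.Py_DecRef.restype = None


def build_string(text, capacity=16):
    """Over-allocate a str, fill it, then shrink it to its final length.

    Hypothetical helper for illustration only.
    """
    if not 0 < len(text) < capacity:
        raise ValueError("text must be non-empty and shorter than capacity")
    maxchar = max(map(ord, text))

    # 1. Allocate an uninitialized object that is larger than needed.
    raw = ctypes.c_void_p(api.PyUnicode_New(capacity, maxchar))

    # 2. Fill it while it is still private to us (one reference, no hash).
    for i, ch in enumerate(text):
        api.PyUnicode_WriteChar(raw, i, ord(ch))

    # 3. Shrink to the final length; the object may move, so the pointer
    #    is passed by reference and updated in place.
    api.PyUnicode_Resize(ctypes.byref(raw), len(text))

    # 4. Hand the object to Python, then drop the reference we held.
    result = ctypes.cast(raw, ctypes.py_object).value
    api.Py_DecRef(raw)
    return result


print(build_string("Kósa"))
```

Once the object has been handed back in step 4 it should be treated as
immutable like any other str.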
History
Date                 User     Action  Args
2015-11-24 08:58:50  lemburg  set     recipients: + lemburg, terry.reedy, pitrou, vstinner, larry, benjamin.peterson, ezio.melotti, steven.daprano, serhiy.storchaka, eryksun, random832, Árpád Kósa
2015-11-24 08:58:50  lemburg  link    issue25709 messages
2015-11-24 08:58:50  lemburg  create