Message 255256 - Python tracker

Author	lemburg
Recipients	benjamin.peterson, eryksun, ezio.melotti, larry, lemburg, pitrou, random832, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, Árpád Kósa
Date	2015-11-24.08:58:50
Message-id	<565426C4.1080100@egenix.com>
In-reply-to	<20151124013020.GN3821@ando.pearwood.info>

Content

On 24.11.2015 02:30, Steven D'Aprano wrote:
> 
> Steven D'Aprano added the comment:
> 
> On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:
> 
>> * the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)
> 
> Why do strings cache their UTF-8 encoding?
> 
> I presume that some of Python's internals rely on the UTF-8 encoding 
> rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393). 
> E.g. I infer from the above that int(s) parses the UTF-8 representation 
> of s rather than the internal representation. Is that right?
> 
> Nevertheless, I wonder why the UTF-8 representation is cached. Is it 
> that expensive to generate that it can't be done on the fly, as needed? 
> As it stands now, non-ASCII strings may be up to twice as big as they 
> need be, once you include the UTF-8 cache. And, as this bug painfully 
> shows, the problem with caches is that you run the risk of the cache 
> being out of date.

The cache is needed because it's the only way to get a direct
C char* to the object's UTF-8 representation without having to
worry about memory management on the caller's side. Not having
access to this would break a lot of code using the Python
C API, since the cache is there by design. The speedup aspect
is secondary.
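
A minimal sketch of the point above, assuming CPython 3.3 or later:
PyUnicode_AsUTF8() is the real C API function that hands out the
cached char*, and calling it through ctypes.pythonapi here is only an
illustration device (C extensions call it directly). The variable
names are made up for the example. Two calls on the same non-ASCII
string should return the same address, because the buffer is owned by
the string object and not by the caller.

```python
import ctypes

# PyUnicode_AsUTF8 is part of the CPython C API. Reaching it through
# ctypes.pythonapi is CPython-specific and used here only for illustration.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_void_p  # keep the raw address instead of copying into bytes

# A non-ASCII string, so the UTF-8 form differs from the internal storage.
s = "caf\u00e9 \u20ac"

first = as_utf8(s)   # first call encodes and stores the UTF-8 buffer on the object
second = as_utf8(s)  # second call returns the stored buffer

# The caller never allocates or frees anything: the buffer lives as long as s.
same_pointer = first == second

# Read the NUL-terminated buffer back to confirm it holds the UTF-8 encoding.
decoded = ctypes.string_at(first).decode("utf-8")
```

The pointer is only valid while `s` is alive, which is exactly why the
buffer has to be stored on the object: there is no other owner that
could free it.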

Unicode objects are normally immutable, but there are a few
corner cases during the initialization of the objects where
they are in fact mutable for a short while, e.g. when
creating an empty object which is then filled with data and
resized to the final length before passing it back to
Python.

History
Date	User	Action	Args
2015-11-24 08:58:50	lemburg	set	recipients: + lemburg, terry.reedy, pitrou, vstinner, larry, benjamin.peterson, ezio.melotti, steven.daprano, serhiy.storchaka, eryksun, random832, Árpád Kósa
2015-11-24 08:58:50	lemburg	link	issue25709 messages
2015-11-24 08:58:50	lemburg	create