Message255256
On 24.11.2015 02:30, Steven D'Aprano wrote:
>
> Steven D'Aprano added the comment:
>
> On Mon, Nov 23, 2015 at 09:48:46PM +0000, STINNER Victor wrote:
>
>> * the string has a cached UTF-8 byte string (ex: int(s) was called before the resize)
>
> Why do strings cache their UTF-8 encoding?
>
> I presume that some of Python's internals rely on the UTF-8 encoding
> rather than the internal Latin-1/UCS-2/UTF-32 representation (PEP 393).
> E.g. I infer from the above that int(s) parses the UTF-8 representation
> of s rather than the internal representation. Is that right?
>
> Nevertheless, I wonder why the UTF-8 representation is cached. Is it
> that expensive to generate that it can't be done on the fly, as needed?
> As it stands now, non-ASCII strings may be up to twice as big as they
> need be, once you include the UTF-8 cache. And, as this bug painfully
> shows, the problem with caches is that you run the risk of the cache
> being out of date.
The cache is needed because it is the only way to hand out a
direct C char* to the object's UTF-8 representation without the
caller having to worry about memory management. Removing it
would break a lot of code that uses the Python C API, since the
cache is there by design. The speedup is secondary.
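
To illustrate the point about the borrowed char*, here is a minimal sketch, assuming CPython. It uses ctypes only as a convenient way to call PyUnicode_AsUTF8 from Python; the names as_utf8, s, first and second are made up for the example. Because the encoded bytes are stored on the string object itself, two calls should hand back the same address, and the caller never frees anything.

```python
import ctypes

# PyUnicode_AsUTF8 is part of the CPython C API. It returns a char* that
# points at UTF-8 data owned by the str object, so the caller must not
# free it. ctypes.pythonapi holds the GIL during the call.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_void_p  # keep the raw address, do not copy

s = "caf\u00e9"  # non-ASCII, so the UTF-8 form differs from the internal one

first = as_utf8(s)   # the first call builds the cached UTF-8 buffer
second = as_utf8(s)  # the second call reuses that buffer

same_pointer = first == second
data = ctypes.string_at(first)  # read the NUL-terminated bytes at that address

print(same_pointer)
print(data)
```

The pointer is only valid while s is alive, which is the lifetime guarantee that C extension code relies on.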
Unicode objects are normally immutable, but there are a few
corner cases during initialization where they are in fact
mutable for a short while, e.g. when an empty object is
created, filled with data, and resized to its final length
before being passed back to Python.

Date | User | Action | Args
2015-11-24 08:58:50 | lemburg | set | recipients: + lemburg, terry.reedy, pitrou, vstinner, larry, benjamin.peterson, ezio.melotti, steven.daprano, serhiy.storchaka, eryksun, random832, Árpád Kósa
2015-11-24 08:58:50 | lemburg | link | issue25709 messages
2015-11-24 08:58:50 | lemburg | create |