Message255244
> Why do strings cache their UTF-8 encoding?
Strings also cache the wide-string representation. For example:
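from ctypes import *
s = '\241\242\243'  # octal escapes for U+00A1..U+00A3, i.e. '¡¢£'
# Each call makes CPython create and cache the requested representation.
pythonapi.PyUnicode_AsUnicodeAndSize(py_object(s), None)  # fills in wstr
pythonapi.PyUnicode_AsUTF8AndSize(py_object(s), None)  # fills in utf8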
>>> hex(id(s))
'0x7ffff69f8e98'
(gdb) p *(PyCompactUnicodeObject *)0x7ffff69f8e98
$1 = {_base = {ob_base = {_ob_next = 0x7ffff697f890,
_ob_prev = 0x7ffff6a04d40,
ob_refcnt = 1,
ob_type = 0x89d860 <PyUnicode_Type>},
length = 3,
hash = -5238559198920514942,
state = {interned = 0,
kind = 1,
compact = 1,
ascii = 0,
ready = 1},
wstr = 0x7ffff69690a0 L"¡¢£"},
utf8_length = 6,
utf8 = 0x7ffff696b7e8 "¡¢£",
wstr_length = 3}
(gdb) p (char *)((PyCompactUnicodeObject *)0x7ffff69f8e98 + 1)
$2 = 0x7ffff69f8ef0 "\241\242\243"
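This object uses 4 bytes for the null-terminated Latin-1 string (3 characters plus the terminator), which directly follows the PyCompactUnicodeObject struct. It uses 7 bytes for the cached UTF-8 string (utf8_length = 6 plus the terminator). It uses 16 bytes for the cached wchar_t string (wstr_length = 3 plus the terminator, at 4 bytes per wchar_t).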
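The byte counts above can be reproduced without gdb. This is only a rough sketch: the helper name representation_sizes is made up for illustration, it assumes the string is pure Latin-1 (kind = 1) and that every character fits in a single wchar_t, and it computes the buffer sizes from the encoded lengths rather than reading the object's actual fields.

```python
import ctypes

def representation_sizes(s):
    """Return the byte counts of the three null-terminated buffers.

    Hypothetical helper: assumes s contains only Latin-1 characters and
    that each character occupies exactly one wchar_t.
    """
    wchar_size = ctypes.sizeof(ctypes.c_wchar)  # 4 on Linux, 2 on Windows
    return {
        'latin1': len(s.encode('latin-1')) + 1,  # canonical 1-byte data
        'utf8': len(s.encode('utf-8')) + 1,      # cached UTF-8 copy
        'wchar_t': (len(s) + 1) * wchar_size,    # cached wide-string copy
    }

print(representation_sizes('\241\242\243'))
```

On a platform with a 4-byte wchar_t this gives 4, 7 and 16 bytes, so the two cached copies together take several times the space of the string's own character data.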
Date | User | Action | Args
2015-11-24 02:29:05 | eryksun | set | recipients: + eryksun, lemburg, terry.reedy, pitrou, vstinner, larry, benjamin.peterson, ezio.melotti, steven.daprano, serhiy.storchaka, random832, Árpád Kósa
2015-11-24 02:29:05 | eryksun | set | messageid: <1448332145.42.0.311582355377.issue25709@psf.upfronthosting.co.za>
2015-11-24 02:29:05 | eryksun | link | issue25709 messages
2015-11-24 02:29:04 | eryksun | create |