Message 255244 - Python tracker

Author	eryksun
Recipients	benjamin.peterson, eryksun, ezio.melotti, larry, lemburg, pitrou, random832, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, Árpád Kósa
Date	2015-11-24.02:29:04
Message-id	<1448332145.42.0.311582355377.issue25709@psf.upfronthosting.co.za>
In-reply-to

Content
> Why do strings cache their UTF-8 encoding?

Strings also cache the wide-string representation. For example:

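    from ctypes import *
    s = '\241\242\243'  # three Latin-1 characters: ¡¢£
    # Each call below makes the string cache that representation
    # (wstr, then utf8); the return values are ignored.
    pythonapi.PyUnicode_AsUnicodeAndSize(py_object(s), None)
    pythonapi.PyUnicode_AsUTF8AndSize(py_object(s), None)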

    >>> hex(id(s))
    '0x7ffff69f8e98'

    (gdb) p *(PyCompactUnicodeObject *)0x7ffff69f8e98
    $1 = {_base = {ob_base = {_ob_next = 0x7ffff697f890,
                              _ob_prev = 0x7ffff6a04d40,
                              ob_refcnt = 1, 
                              ob_type = 0x89d860 <PyUnicode_Type>},
                   length = 3,
                   hash = -5238559198920514942,
                   state = {interned = 0,
                            kind = 1,
                            compact = 1,
                            ascii = 0,
                            ready = 1},
                   wstr = 0x7ffff69690a0 L"¡¢£"},
          utf8_length = 6,
          utf8 = 0x7ffff696b7e8 "¡¢£",
          wstr_length = 3}

    (gdb) p (char *)((PyCompactUnicodeObject *)0x7ffff69f8e98 + 1)
    $2 = 0x7ffff69f8ef0 "\241\242\243"

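This object uses 4 bytes for the null-terminated Latin-1 string (3 characters plus the terminator), which directly follows the PyCompactUnicodeObject struct. The cached UTF-8 string uses 7 bytes (utf8_length = 6 plus the terminator). The cached wchar_t string uses 16 bytes (wstr_length = 3 plus the terminator, at 4 bytes per wchar_t).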
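
The byte counts above can be reproduced with plain arithmetic in Python. This is only a rough sketch: the helper name `cached_sizes` and its `wchar_size` parameter are made up for illustration, and the 4-byte wchar_t is an assumption taken from the figure quoted above (it differs on other platforms). It does not inspect the object itself, it just recomputes the sizes from the encodings.

```python
def cached_sizes(s, wchar_size=4):
    """Bytes used by each null-terminated representation of s.

    wchar_size is an assumption (4 matches the gdb session above).
    """
    return {
        # canonical 1-byte-per-character data stored after the struct
        "latin1": len(s.encode("latin-1")) + 1,
        # cached UTF-8 buffer: encoded length plus the terminator
        "utf8": len(s.encode("utf-8")) + 1,
        # cached wide string: one wchar_t per character plus the terminator
        "wchar": (len(s) + 1) * wchar_size,
    }

sizes = cached_sizes('\241\242\243')
print(sizes)  # {'latin1': 4, 'utf8': 7, 'wchar': 16}
```

Each of the three characters is above U+007F, so each takes 2 bytes in UTF-8, which is where utf8_length = 6 comes from.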

History
Date	User	Action	Args
2015-11-24 02:29:05	eryksun	set	recipients: + eryksun, lemburg, terry.reedy, pitrou, vstinner, larry, benjamin.peterson, ezio.melotti, steven.daprano, serhiy.storchaka, random832, Árpád Kósa
2015-11-24 02:29:05	eryksun	set	messageid: <1448332145.42.0.311582355377.issue25709@psf.upfronthosting.co.za>
2015-11-24 02:29:05	eryksun	link	issue25709 messages
2015-11-24 02:29:04	eryksun	create