
Author vstinner
Recipients georg.brandl, indygreg, methane, petr.viktorin, serhiy.storchaka, vstinner
Date 2021-08-30.15:37:53
Message-id <1630337873.69.0.767162203537.issue45025@roundup.psfhosted.org>
In-reply-to
Content
> PyUnicode_KIND does *not* expose the implementation details to the programmer.

PyUnicode_KIND() is very specific to the exact PEP 393 implementation. Documentation of this field:
---
/* Character size:

   - PyUnicode_WCHAR_KIND (0):

     * character type = wchar_t (16 or 32 bits, depending on the
       platform)

   - PyUnicode_1BYTE_KIND (1):

     * character type = Py_UCS1 (8 bits, unsigned)
     * all characters are in the range U+0000-U+00FF (latin1)
     * if ascii is set, all characters are in the range U+0000-U+007F
       (ASCII), otherwise at least one character is in the range
       U+0080-U+00FF

   - PyUnicode_2BYTE_KIND (2):

     * character type = Py_UCS2 (16 bits, unsigned)
     * all characters are in the range U+0000-U+FFFF (BMP)
     * at least one character is in the range U+0100-U+FFFF

   - PyUnicode_4BYTE_KIND (4):

     * character type = Py_UCS4 (32 bits, unsigned)
     * all characters are in the range U+0000-U+10FFFF
     * at least one character is in the range U+10000-U+10FFFF
 */
unsigned int kind:3;
---
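
As a rough illustration of how tightly the kind is coupled to the storage layout, here is a toy model of a PyUnicode_READ-style accessor. It is not CPython's code: `toy_kind`, the `TOY_*` constants and `toy_read()` are hypothetical names, and only the three fixed-width kinds quoted above are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the 1-, 2- and 4-byte kinds quoted above. */
enum toy_kind {
    TOY_1BYTE_KIND = 1,
    TOY_2BYTE_KIND = 2,
    TOY_4BYTE_KIND = 4
};

/* Read the code point at `index` from a fixed-width buffer.
   The caller has to pass the matching kind, so the storage layout
   is part of the calling convention. */
static uint32_t
toy_read(enum toy_kind kind, const void *data, size_t index)
{
    switch (kind) {
    case TOY_1BYTE_KIND:
        return ((const uint8_t *)data)[index];
    case TOY_2BYTE_KIND:
        return ((const uint16_t *)data)[index];
    case TOY_4BYTE_KIND:
        return ((const uint32_t *)data)[index];
    }
    assert(0 && "unknown kind");
    return 0;
}
```

Every caller of such an accessor has to fetch the kind and the data pointer first, which is exactly the part that depends on the current representation.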

I don't think that PyUnicode_KIND() would still make sense if CPython switched to UTF-8 as its internal representation tomorrow.
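
To make the UTF-8 point concrete, here is a minimal sketch of indexing into a UTF-8 buffer. It assumes well-formed, NUL-terminated input and does no validation; `toy_utf8_len()` and `toy_utf8_read()` are hypothetical helpers, not an existing or proposed CPython API. Since each character takes 1 to 4 bytes, no single "kind" value can describe the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`.
   Assumes `lead` is a valid lead byte. */
static size_t
toy_utf8_len(unsigned char lead)
{
    if (lead < 0x80) return 1;          /* 0xxxxxxx */
    if ((lead >> 5) == 0x06) return 2;  /* 110xxxxx */
    if ((lead >> 4) == 0x0E) return 3;  /* 1110xxxx */
    return 4;                           /* 11110xxx */
}

/* Decode the code point at character position `index`.
   Returns UINT32_MAX if `index` is past the end of the string. */
static uint32_t
toy_utf8_read(const unsigned char *s, size_t index)
{
    /* No fixed element size: walk the buffer one character at a time. */
    while (*s != 0 && index > 0) {
        s += toy_utf8_len(*s);
        index--;
    }
    if (*s == 0) {
        return UINT32_MAX;
    }

    size_t n = toy_utf8_len(*s);
    if (n == 1) {
        return *s;
    }
    /* Keep the payload bits of the lead byte, then append 6 bits
       from each continuation byte. */
    uint32_t cp = *s & (0xFFu >> (n + 1));
    for (size_t i = 1; i < n; i++) {
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }
    return cp;
}
```

A real implementation could cache offsets to avoid the linear scan, but the caller-visible (kind, data, index) triple would have no direct equivalent either way.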


> If the internal representation of strings is switched to use masks and shifts instead of bitfields, PyUnicode_KIND (and others) can be adapted to the new details without breaking API compatibility.

PyUnicode_KIND() was exposed in the *public* C API because unicodeobject.h implements functions as macros for best performance, and these macros use PyUnicode_KIND() internally.

Macros like PyUnicode_READ(kind, data, index) are also designed for best performance with the exact PEP 393 implementation.

The public C API should only contain PyUnicode_READ_CHAR(unicode, index): this macro doesn't take "kind" or "data", which are (again) specific to PEP 393.
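
The difference can be sketched with a toy string type whose accessor takes only the object and an index, in the spirit of PyUnicode_READ_CHAR(unicode, index). This is an assumption-laden model, not CPython's implementation: `toy_str`, `toy_str_read_char()` and `toy_str_count()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string object; callers are meant to treat it as opaque. */
typedef struct {
    size_t length;       /* number of code points */
    unsigned int kind;   /* bytes per code point: 1, 2 or 4 */
    const void *data;    /* fixed-width buffer of `length` elements */
} toy_str;

/* The only accessor callers use: kind and data never leave this function. */
static uint32_t
toy_str_read_char(const toy_str *s, size_t index)
{
    assert(index < s->length);
    switch (s->kind) {
    case 1:
        return ((const uint8_t *)s->data)[index];
    case 2:
        return ((const uint16_t *)s->data)[index];
    default:
        return ((const uint32_t *)s->data)[index];
    }
}

/* Example caller: count occurrences of `ch` using only the accessor. */
static size_t
toy_str_count(const toy_str *s, uint32_t ch)
{
    size_t n = 0;
    for (size_t i = 0; i < s->length; i++) {
        if (toy_str_read_char(s, i) == ch) {
            n++;
        }
    }
    return n;
}
```

Because toy_str_count() never mentions the kind or the data pointer, the body of toy_str_read_char() could be replaced by a different representation without touching the caller, at the cost of one dispatch per character.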

In the CPython implementation itself, we should use the most efficient code; it's fine to use macros that access structures directly.

But for the public C API, I would recommend providing only abstractions, even if they are a little bit slower.
History
Date User Action Args
2021-08-30 15:37:53 vstinner set recipients: + vstinner, georg.brandl, petr.viktorin, methane, serhiy.storchaka, indygreg
2021-08-30 15:37:53 vstinner set messageid: <1630337873.69.0.767162203537.issue45025@roundup.psfhosted.org>
2021-08-30 15:37:53 vstinner link issue45025 messages
2021-08-30 15:37:53 vstinner create