Message 400624 - Python tracker

Author	vstinner
Recipients	georg.brandl, indygreg, methane, petr.viktorin, serhiy.storchaka, vstinner
Date	2021-08-30.15:37:53
Message-id	<1630337873.69.0.767162203537.issue45025@roundup.psfhosted.org>

Content

> PyUnicode_KIND does *not* expose the implementation details to the programmer.

PyUnicode_KIND() is very specific to the exact PEP 393 implementation. Documentation of this field:
---
/* Character size:

   - PyUnicode_WCHAR_KIND (0):

     * character type = wchar_t (16 or 32 bits, depending on the
       platform)

   - PyUnicode_1BYTE_KIND (1):

     * character type = Py_UCS1 (8 bits, unsigned)
     * all characters are in the range U+0000-U+00FF (latin1)
     * if ascii is set, all characters are in the range U+0000-U+007F
       (ASCII), otherwise at least one character is in the range
       U+0080-U+00FF

   - PyUnicode_2BYTE_KIND (2):

     * character type = Py_UCS2 (16 bits, unsigned)
     * all characters are in the range U+0000-U+FFFF (BMP)
     * at least one character is in the range U+0100-U+FFFF

   - PyUnicode_4BYTE_KIND (4):

     * character type = Py_UCS4 (32 bits, unsigned)
     * all characters are in the range U+0000-U+10FFFF
     * at least one character is in the range U+10000-U+10FFFF
 */
unsigned int kind:3;
---

I don't think that PyUnicode_KIND() makes sense if CPython uses UTF-8 tomorrow.
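To make the point concrete, here is a minimal sketch of how the documented ranges map a string's largest code point to a "kind", next to the per-character width a UTF-8 buffer would have. Both helpers (`toy_kind_for_max_char`, `toy_utf8_width`) are hypothetical names invented for illustration, not CPython functions; the kind values 1, 2 and 4 come from the comment quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of CPython: smallest PEP 393 "kind"
   (bytes per character) able to hold every character of a string whose
   largest code point is max_char, following the ranges documented above. */
static int toy_kind_for_max_char(uint32_t max_char)
{
    assert(max_char <= 0x10FFFF);
    if (max_char <= 0xFF)
        return 1;   /* PyUnicode_1BYTE_KIND: U+0000-U+00FF */
    if (max_char <= 0xFFFF)
        return 2;   /* PyUnicode_2BYTE_KIND: U+0100-U+FFFF present */
    return 4;       /* PyUnicode_4BYTE_KIND: U+10000-U+10FFFF present */
}

/* Hypothetical helper: number of bytes one code point occupies in UTF-8.
   The width varies per character, so a UTF-8 string has no single
   per-string "kind" to report. */
static size_t toy_utf8_width(uint32_t ch)
{
    if (ch < 0x80)
        return 1;
    if (ch < 0x800)
        return 2;
    if (ch < 0x10000)
        return 3;
    return 4;
}
```

With a fixed-width layout the kind is a property of the whole string; with UTF-8 the width is a property of each character, which is why a per-string kind would have nothing meaningful to return.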


> If the internal representation of strings is switched to use masks and shifts instead of bitfields, PyUnicode_KIND (and others) can be adapted to the new details without breaking API compatibility.

PyUnicode_KIND() was exposed in the *public* C API because unicodeobject.h provides functions as macros for best performance, and these macros use PyUnicode_KIND() internally.

Macros like PyUnicode_READ(kind, data, index) are also designed for best performance with the exact PEP 393 implementation.

The public C API should only contain PyUnicode_READ_CHAR(unicode, index): this macro doesn't take "kind" or "data" arguments, which are (again) specific to PEP 393.
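A rough sketch of the difference, using a toy string type rather than the real PyUnicodeObject: `toy_str`, `toy_read`, `toy_utf8_decode` and `toy_read_char` are all hypothetical names, and the UTF-8 back end is only an assumption about what a different implementation could look like. `toy_read` mirrors the shape of PyUnicode_READ(kind, data, index); `toy_read_char` mirrors the shape of PyUnicode_READ_CHAR(unicode, index).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy string type, NOT the CPython PyUnicodeObject layout. */
typedef enum { TOY_FIXED_WIDTH, TOY_UTF8 } toy_storage;

typedef struct {
    toy_storage storage;  /* which internal representation is in use */
    int kind;             /* bytes per character (1, 2 or 4); fixed-width only */
    size_t length;        /* number of code points */
    const void *data;     /* raw character buffer */
} toy_str;

/* Kind-based read: the caller has to know the internal character width. */
static uint32_t toy_read(int kind, const void *data, size_t index)
{
    switch (kind) {
    case 1:  return ((const uint8_t *)data)[index];
    case 2:  return ((const uint16_t *)data)[index];
    default: return ((const uint32_t *)data)[index];
    }
}

/* Decode one code point from valid UTF-8 at p; its byte length goes to *len. */
static uint32_t toy_utf8_decode(const uint8_t *p, size_t *len)
{
    if (p[0] < 0x80) { *len = 1; return p[0]; }
    if (p[0] < 0xE0) {
        *len = 2;
        return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
    }
    if (p[0] < 0xF0) {
        *len = 3;
        return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
    }
    *len = 4;
    return ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
         | ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
}

/* Abstract read: the caller passes only the string and an index, so the
   same call keeps working whichever representation is behind it. */
static uint32_t toy_read_char(const toy_str *s, size_t index)
{
    assert(index < s->length);
    if (s->storage == TOY_FIXED_WIDTH)
        return toy_read(s->kind, s->data, index);

    /* UTF-8: walk from the start, skipping `index` code points. */
    const uint8_t *p = s->data;
    size_t len = 0;
    for (size_t i = 0; i < index; i++) {
        (void)toy_utf8_decode(p, &len);
        p += len;
    }
    return toy_utf8_decode(p, &len);
}
```

Code written against `toy_read_char` does not change when the storage changes, while code that calls `toy_read` has no sensible `kind` to pass for the UTF-8 case. In this sketch the UTF-8 path is linear in the index, which is the kind of slowdown an abstraction may cost.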

In the CPython implementation, we should use the most efficient code; it's fine to use macros that access structures directly.

But for the public C API, I would recommend providing only abstractions, even if they are a little bit slower.
