--- pep-0393_orig.txt 2011-12-14 14:01:53.000000000 -0500 +++ pep-0393.txt 2011-12-15 19:30:26.000000000 -0500 @@ -17,13 +17,14 @@ representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes). This will allow a space-efficient representation in common cases, but give access to full UCS-4 on all systems. For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out. The distinction between narrow and wide Unicode builds is -dropped. An implementation of this PEP is available at [1]_. +dropped. An early implementation of this PEP is available at [1]_, and +the current implementation has been integrated into the CPython source. Rationale ========= There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain @@ -31,22 +32,24 @@ UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character; while still allowing efficient indexing of strings containing non-BMP -characters (as strings containing them will use 4 bytes per -character). +characters (as strings actually containing these characters will use +4 bytes per character). One problem with the approach is support for existing applications -(e.g. extension modules). For compatibility, redundant representations -may be computed. Applications are encouraged to phase out reliance on -a specific internal representation if possible. As interaction with -other libraries will often require some sort of internal -representation, the specification chooses UTF-8 as the recommended way -of exposing strings to C code. +(e.g. extension modules). 
These applications often assume that they +will have direct access to a specific representation, such as +wchar_t or UTF-8. For compatibility, redundant representations may be +computed, and once computed may be cached. Applications are encouraged +to phase out reliance on any specific internal representation. When +interaction with other libraries requires exposing an internal +representation in a single format for all strings, the specification +chooses UTF-8 as the recommended physical encoding. For many strings (e.g. ASCII), multiple representations may actually share memory (e.g. the shortest form may be shared with the UTF-8 form if all characters are ASCII). With such sharing, the overhead of compatibility representations is reduced. If representations do share data, it is also possible to omit structure fields, reducing the base @@ -61,13 +64,13 @@ typedef struct { PyObject_HEAD Py_ssize_t length; Py_hash_t hash; struct { unsigned int interned:2; - unsigned int kind:2; + unsigned int kind:3; unsigned int compact:1; unsigned int ascii:1; unsigned int ready:1; } state; wchar_t *wstr; } PyASCIIObject; @@ -94,51 +97,69 @@ immediately follow the base structure. If the maximum character is less than 128, they use the PyASCIIObject structure, and the UTF-8 data, the UTF-8 length and the wstr length are the same as the length of the ASCII data. For non-ASCII strings, the PyCompactObject structure is used. Resizing compact objects is not supported. -Objects for which the maximum character is not given at creation time -are called "legacy" objects, created through -PyUnicode_FromStringAndSize(NULL, length). They use the -PyUnicodeObject structure. Initially, their data is only in the wstr -pointer; when PyUnicode_READY is called, the data pointer (union) is -allocated. Resizing is possible as long PyUnicode_READY has not been -called. 
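The size selection described above (1, 2, or 4 bytes per character, chosen from the maximum character) can be sketched in plain C. This is a hypothetical illustration, not part of the CPython API; `element_size_for_maxchar` is an invented name:

```c
#include <assert.h>

/* Hypothetical helper (not part of the CPython API): picks the
 * canonical element size, in bytes, that a string would use for a
 * given maximum character, per the ranges described in this PEP. */
static int element_size_for_maxchar(unsigned long maxchar)
{
    if (maxchar < 256)
        return 1;          /* Latin-1 (ASCII if maxchar < 128) */
    if (maxchar < 65536)
        return 2;          /* UCS-2, no surrogates */
    return 4;              /* full UCS-4 */
}
```

Only strings actually containing characters above U+FFFF pay the four-byte cost; an ASCII-only string stays at one byte per character.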
+When either the number of characters or the maximum character is not +given at creation time, as with calls to +PyUnicode_FromStringAndSize(NULL, length), +a "legacy" string object is created. Legacy string objects use the +full PyUnicodeObject structure, and the character data is stored +separately from the object header. Legacy strings may delay creating +the canonical representation until PyUnicode_READY is (perhaps +implicitly) called. -The fields have the following interpretations: +The structure fields have the following interpretations: -- length: number of code points in the string (result of sq_length) -- interned: interned-state (SSTATE_*) as in 3.2 -- kind: form of string - + 00 => str is not initialized (data are in wstr) - + 01 => 1 byte (Latin-1) - + 10 => 2 byte (UCS-2) - + 11 => 4 byte (UCS-4); -- compact: the object uses one of the compact representations +- length: number of characters in the string (result of sq_length). + As the canonical string representations are fixed-width without + surrogates, this is also the number of code points and the number + of code units. + +- state.interned: interned-state (SSTATE_*) as in Python 3.2 +- state.kind: physical representation form of the string + + 0 => canonical form has not been created; wstr points to the + wchar_t representation, which may contain surrogates. + + 1 => 1 byte (Latin-1) + + 2 => 2 byte (UCS-2) + + 4 => 4 byte (UCS-4) + + Other values (3, 5, 6, 7) are reserved at this time. + +- state.compact: The object uses one of the compact representations. (implies ready) -- ascii: the object uses the PyASCIIObject representation - (implies compact and ready) -- ready: the canonical representation is ready to be accessed through - PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the - object is compact, or the data pointer and length have been - initialized. +- state.ascii: The object uses the PyASCIIObject representation. 
+ (implies compact and ready) + Note that even strings containing only ASCII characters will not + have this bit set if the actual data does not immediately follow + the PyASCIIObject header. +- state.ready: The canonical representation is ready to be accessed + through PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set if + either the object is compact, or the data pointer and length have + been initialized. + Note that this bit does not imply that the actual data is there + ready for reading; it simply indicates that a properly sized buffer + is reserved. + - wstr_length, wstr: representation in platform's wchar_t (null-terminated). If wchar_t is 16-bit, this form may use surrogate - pairs (in which cast wstr_length differs form length). - wstr_length differs from length only if there are surrogate pairs - in the representation. + pairs. wstr_length differs from length if and only if there are + surrogate pairs in the wstr representation. - utf8_length, utf8: UTF-8 representation (null-terminated). -- data: shortest-form representation of the unicode string. - The string is null-terminated (in its respective representation). +- data: Canonical (shortest-form fixed-width) representation of the + unicode string. The string is null-terminated (in its respective + representation). + Note that this pointer is only valid when state.ready is set. + Note that a compact string does not contain an explicit data pointer, + as the data will begin immediately after the string's header. -All three representations are optional, although the data form is -considered the canonical representation which can be absent only -while the string is being created. If the representation is absent, -the pointer is NULL, and the corresponding length field may contain -arbitrary data. 
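The relationship between wstr_length and length noted above can be illustrated with a small self-contained sketch, assuming a 16-bit wchar_t: every code point above U+FFFF needs a surrogate pair, so the counts differ exactly when such characters occur. The function name is hypothetical; CPython computes this internally:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration (not CPython code): count the 16-bit
 * wchar_t units needed for a sequence of code points. Code points
 * above U+FFFF take a surrogate pair (2 units); all others take 1. */
static size_t wstr_units_for(const unsigned long *codepoints, size_t length)
{
    size_t units = 0;
    for (size_t i = 0; i < length; i++)
        units += (codepoints[i] > 0xFFFF) ? 2 : 1;
    return units;
}
```

With no non-BMP characters the result equals length, matching the "if and only if" condition stated above.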
+All three representations are optional, but the data form (either at +*data, or immediately following the header) is considered the canonical +representation which can be absent only while the string is being +created. If a representation is absent, the pointer is NULL, and the +corresponding length field may contain arbitrary data. The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as Py_UNICODE representation. The data and utf8 pointers point to the same memory if the string uses @@ -155,111 +176,111 @@ PyUnicode_New:: PyObject* PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar); Both parameters must denote the eventual size/range of the strings. In particular, codecs using this API must compute both the number of -characters and the maximum character in advance. An string is +characters and the maximum character in advance. It is acceptable to +round the maximum character up to 127 (ASCII), 255 (Latin-1), +65535 (no surrogates needed) or 1114111 (surrogates would be needed); +this may be particularly helpful if the original data is known to be +in a particular fixed-width legacy character encoding. A string is allocated according to the specified size and character range and is null-terminated; the actual characters in it may be uninitialized. PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported -for processing UTF-8 input; the input is decoded, and the UTF-8 -representation is not yet set for the string. +for processing UTF-8 input; the input is decoded, the canonical +representation is created, and the original UTF-8 representation may +be discarded. PyUnicode_FromUnicode remains supported but is deprecated. If the -Py_UNICODE pointer is non-null, the data representation is set. If the -pointer is NULL, a properly-sized wstr representation is allocated, -which can be modified until PyUnicode_READY() is called (explicitly -or implicitly). 
Resizing a Unicode string remains possible until it -is finalized. +Py_UNICODE pointer is non-null, the (canonical form) data +representation is set. If the pointer is NULL, a properly-sized wstr +representation is allocated, which can be modified until +PyUnicode_READY() is called (explicitly or implicitly). PyUnicode_READY() converts a string containing only a wstr -representation into the canonical representation. Unless wstr and data -can share the memory, the wstr representation is discarded after the -conversion. The macro returns 0 on success and -1 on failure, which -happens in particular if the memory allocation fails. +representation into the canonical representation. The original wstr +representation may be discarded, but is typically kept if the memory +can be shared with the canonical format. The macro returns 0 on success +and -1 on failure. (Failure could be caused by a memory allocation +failure or by certain types of invalid input.) String Access ------------- The canonical representation can be accessed using two macros PyUnicode_KIND and PyUnicode_DATA. PyUnicode_KIND gives one of the values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1), -PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA +PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (4). PyUnicode_DATA gives the void pointer to the data. Access to individual characters should use PyUnicode_{READ|WRITE}[_CHAR]: - PyUnicode_READ(kind, data, index) - PyUnicode_WRITE(kind, data, index, value) - PyUnicode_READ_CHAR(unicode, index) All these macros assume that the string is in canonical form; -callers need to ensure this by calling PyUnicode_READY. +callers need to ensure this by first calling PyUnicode_READY. A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the -utf8 representation when first called. 
Since this representation will -consume memory until the string object is released, applications -should use the existing PyUnicode_AsUTF8String where possible -(which generates a new string object every time). APIs that implicitly -converts a string to a char* (such as the ParseTuple functions) will -use PyUnicode_AsUTF8 to compute a conversion. +utf8 representation when first called, and cache this representation. +If storing this additional representation is undesirable, applications +should use the existing PyUnicode_AsUTF8String (which generates a new +bytes object every time). APIs that implicitly convert a string to a +char* (such as the ParseTuple functions) will use PyUnicode_AsUTF8 to +compute a conversion. New API ------- This section summarizes the API additions. Macros to access the internal representation of a Unicode object (read-only): -- PyUnicode_IS_COMPACT_ASCII(o), PyUnicode_IS_COMPACT(o), - PyUnicode_IS_READY(o) +- PyUnicode_IS_ASCII(o), PyUnicode_IS_COMPACT(o), + PyUnicode_IS_COMPACT_ASCII(o) +- PyUnicode_IS_READY(o) - PyUnicode_GET_LENGTH(o) -- PyUnicode_KIND(o), PyUnicode_CHARACTER_SIZE(o), - PyUnicode_MAX_CHAR_VALUE(o) -- PyUnicode_DATA(o), PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o), - PyUnicode_4BYTE_DATA(o) +- PyUnicode_KIND(o), PyUnicode_MAX_CHAR_VALUE(o) +- PyUnicode_DATA(o), + PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o), PyUnicode_4BYTE_DATA(o) Character access macros: - PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index) - PyUnicode_WRITE(kind, data, index, value) -Other macros: +Macro to ensure a valid canonical form: - PyUnicode_READY(o) -- PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to) String creation functions: - PyUnicode_New(size, maxchar) - PyUnicode_FromKindAndData(kind, data, size) - PyUnicode_Substring(o, start, end) Character access utility functions: -- PyUnicode_GetLength(o), PyUnicode_ReadChar(o, index), +- PyUnicode_GetLength(o) +- PyUnicode_ReadChar(o, index), PyUnicode_WriteChar(o, 
index, character) - PyUnicode_CopyCharacters(to, to_start, from, from_start, how_many) - PyUnicode_FindChar(str, ch, start, end, direction) Representation conversion: - PyUnicode_AsUCS4(o, buffer, buflen) - PyUnicode_AsUCS4Copy(o) - PyUnicode_AsUnicodeAndSize(o, size_out) - PyUnicode_AsUTF8(o) - PyUnicode_AsUTF8AndSize(o, size_out) - -UCS4 utility functions: - -- Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncpy, strcmp, - strncmp, strchr, strrchr} Stable ABI ---------- The following functions are added to the stable ABI (PEP 384), as they are independent of the actual representation of Unicode objects: @@ -284,24 +305,25 @@ been ported to the new API. A reasonable motivation for using the deprecated API even in new code is for code that shall work both on Python 2 and Python 3. The following macros and functions are deprecated: +- PyUnicode_AS_DATA - PyUnicode_FromUnicode - PyUnicode_GET_SIZE, PyUnicode_GetSize, PyUnicode_GET_DATA_SIZE, -- PyUnicode_AS_UNICODE, PyUnicode_AsUnicode, PyUnicode_AsUnicodeAndSize +- PyUnicode_AS_UNICODE, PyUnicode_AsUnicode - PyUnicode_COPY, PyUnicode_FILL, PyUnicode_MATCH - PyUnicode_Encode, PyUnicode_EncodeUTF7, PyUnicode_EncodeUTF8, PyUnicode_EncodeUTF16, PyUnicode_EncodeUTF32, PyUnicode_EncodeUnicodeEscape, PyUnicode_EncodeRawUnicodeEscape, PyUnicode_EncodeLatin1, PyUnicode_EncodeASCII, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, PyUnicode_EncodeMBCS, PyUnicode_EncodeDecimal, PyUnicode_TransformDecimalToASCII -- Py_UNICODE_{strlen, strcat, strcpy, strcmp, strchr, strrchr} +- Py_UNICODE_str* - PyUnicode_AsUnicodeCopy - PyUnicode_GetMax _PyUnicode_AsDefaultEncodedString is removed. It previously returned a borrowed reference to an UTF-8-encoded bytes object. Since the unicode object cannot anymore cache such a reference, implementing it without @@ -395,18 +417,18 @@ indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. 
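In the spirit of the PyUnicode_READ and PyUnicode_WRITE macros just mentioned, the following self-contained sketch shows how kind-dispatched access over a void* buffer can work. DEMO_READ and DEMO_WRITE are hypothetical names, deliberately simplified; the real macros live in CPython's unicodeobject.h:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketch of kind-dispatched character access over a
 * void* buffer. The kind (1, 2, or 4) selects the element width,
 * as in the canonical representations described by this PEP. */
#define DEMO_READ(kind, data, index)                                   \
    ((kind) == 1 ? (uint32_t)((const uint8_t  *)(data))[(index)] :     \
     (kind) == 2 ? (uint32_t)((const uint16_t *)(data))[(index)] :     \
                   ((const uint32_t *)(data))[(index)])

#define DEMO_WRITE(kind, data, index, value)                           \
    do {                                                               \
        if ((kind) == 1)                                               \
            ((uint8_t *)(data))[(index)] = (uint8_t)(value);           \
        else if ((kind) == 2)                                          \
            ((uint16_t *)(data))[(index)] = (uint16_t)(value);         \
        else                                                           \
            ((uint32_t *)(data))[(index)] = (uint32_t)(value);         \
    } while (0)
```

Because the kind is passed in by the caller, the dispatch can be hoisted out of inner loops by compilers, which is the motivation for the (kind, data, index) signature.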
Use void* as the buffer type for characters to let the compiler detect invalid dereferencing operations. If you do want to use pointer arithmetic (e.g. when converting existing code), use (unsigned) char* as the buffer type, and keep the element size (1, 2, or 4) in a variable. -Notice that (1<<(kind-1)) will produce the element size -given a buffer kind. +Notice that for a canonical-form buffer the kind value (1, 2, or 4) +is itself the element size. -When creating new strings, it was common in Python to start of with a -heuristical buffer size, and then grow or shrink if the heuristics -failed. With this PEP, this is now less practical, as you need not -only a heuristics for the length of the string, but also for the +When creating new strings, it was common in Python to start off with +a heuristic buffer size, and then grow or shrink if the heuristic +failed. With this PEP, this is now less practical, as you need +a heuristic not only for the length of the string, but also for the maximum character. In order to avoid heuristics, you need to make two passes over the input: once to determine the output length, and the maximum character; then allocate the target string with PyUnicode_New and iterate over the input a second time to produce the final output. While this may