--- C:\pep-0393_orig.txt 2011-12-14 14:01:53.000000000 -0500
+++ C:\pep-0393.txt 2011-12-14 16:37:28.000000000 -0500
@@ -61,13 +61,13 @@
   typedef struct {
     PyObject_HEAD
     Py_ssize_t length;
     Py_hash_t hash;
     struct {
         unsigned int interned:2;
-        unsigned int kind:2;
+        unsigned int kind:3;
         unsigned int compact:1;
         unsigned int ascii:1;
         unsigned int ready:1;
     } state;
     wchar_t *wstr;
   } PyASCIIObject;
@@ -107,34 +107,37 @@
 The fields have the following interpretations:
 
 - length: number of code points in the string (result of sq_length)
 - interned: interned-state (SSTATE_*) as in 3.2
 - kind: form of string
 
-  + 00 => str is not initialized (data are in wstr)
-  + 01 => 1 byte (Latin-1)
-  + 10 => 2 byte (UCS-2)
-  + 11 => 4 byte (UCS-4);
+  + 000 => str is not initialized (data are in wstr)
+  + 001 => 1 byte (Latin-1)
+  + 010 => 2 byte (UCS-2)
+  + 100 => 4 byte (UCS-4)
+  + Other values are reserved at this time.
+
 - compact: the object uses one of the compact representations
   (implies ready)
 - ascii: the object uses the PyASCIIObject representation
   (implies compact and ready)
 - ready: the canonical representation is ready to be accessed through
   PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the
   object is compact, or the data pointer and length have been
   initialized.
 - wstr_length, wstr: representation in platform's wchar_t
   (null-terminated). If wchar_t is 16-bit, this form may use surrogate
-  pairs (in which cast wstr_length differs form length).
-  wstr_length differs from length only if there are surrogate pairs
-  in the representation.
+  pairs. wstr_length differs from length if and only if there are
+  surrogate pairs in the wstr representation.
 - utf8_length, utf8: UTF-8 representation (null-terminated).
 - data: shortest-form representation of the unicode string.
   The string is null-terminated (in its respective representation).
+  Note that a compact string does not contain an explicit data pointer,
+  as the data will begin immediately after the string's header.
 
-All three representations are optional, although the data form is
+All three representations are optional, but the data form is
 considered the canonical representation which can be absent only
 while the string is being created. If the representation is absent,
 the pointer is NULL, and the corresponding length field may contain
 arbitrary data.
 
 The Py_UNICODE type is still supported but deprecated. It is always
@@ -168,13 +171,13 @@
 
 PyUnicode_FromUnicode remains supported but is deprecated. If the
 Py_UNICODE pointer is non-null, the data representation is set. If the
 pointer is NULL, a properly-sized wstr representation is allocated,
 which can be modified until PyUnicode_READY() is called (explicitly
 or implicitly). Resizing a Unicode string remains possible until it
-is finalized.
+is finalized, generally by calling PyUnicode_READY.
 
 PyUnicode_READY() converts a string containing only a wstr
 representation into the canonical representation. Unless wstr and data
 can share the memory, the wstr representation is discarded after the
 conversion. The macro returns 0 on success and -1 on failure, which
 happens in particular if the memory allocation fails.
@@ -182,13 +185,13 @@
 String Access
 -------------
 
 The canonical representation can be accessed using two macros
 PyUnicode_Kind and PyUnicode_Data. PyUnicode_Kind gives one of the
 values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1),
-PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA
+PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (4). PyUnicode_DATA
 gives the void pointer to the data.  Access to individual characters
 should use PyUnicode_{READ|WRITE}[_CHAR]:
 
 - PyUnicode_READ(kind, data, index)
 - PyUnicode_WRITE(kind, data, index, value)
 - PyUnicode_READ_CHAR(unicode, index)
@@ -224,16 +227,15 @@
 
 Character access macros:
 
 - PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index)
 - PyUnicode_WRITE(kind, data, index, value)
 
-Other macros:
+Finalization macro:
 
 - PyUnicode_READY(o)
-- PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to)
 
 String creation functions:
 
 - PyUnicode_New(size, maxchar)
 - PyUnicode_FromKindAndData(kind, data, size)
 - PyUnicode_Substring(o, start, end)
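
Note on the kind renumbering: with the values 1, 2 and 4 the kind appears to double as the width of one character in bytes, and since a 2-bit field cannot hold the value 4, the widening to kind:3 follows from that. The sketch below is only a standalone illustration of this reading. The sketch_* names are hypothetical stand-ins and are not the PyUnicode_* definitions; only the numeric values are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins mirroring the kind values in the patch;
   these are NOT the CPython definitions. */
enum sketch_kind {
    SKETCH_WCHAR_KIND = 0,  /* 000: not initialized, data are in wstr */
    SKETCH_1BYTE_KIND = 1,  /* 001: Latin-1 */
    SKETCH_2BYTE_KIND = 2,  /* 010: UCS-2 */
    SKETCH_4BYTE_KIND = 4   /* 100: UCS-4 */
};

/* Assumption of this sketch: for the three canonical kinds, the kind
   value equals the size of one character in bytes, so the payload size
   (terminator excluded) is a plain multiplication. */
static size_t sketch_buffer_bytes(enum sketch_kind kind, size_t length)
{
    return (size_t)kind * length;
}

/* Read one code point from a buffer of the given kind, in the spirit
   of a READ(kind, data, index) accessor. */
static uint32_t sketch_read(enum sketch_kind kind, const void *data,
                            size_t index)
{
    switch (kind) {
    case SKETCH_1BYTE_KIND:
        return ((const uint8_t *)data)[index];
    case SKETCH_2BYTE_KIND:
        return ((const uint16_t *)data)[index];
    case SKETCH_4BYTE_KIND:
        return ((const uint32_t *)data)[index];
    default:
        /* Kind 0 and the reserved values have no canonical buffer. */
        assert(!"kind has no canonical buffer");
        return 0;
    }
}
```

Under the old 00/01/10/11 numbering the same size computation would have needed a lookup or a shift, which is one plausible motivation for the change; the patch itself does not state its rationale.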