Message 149558 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	vstinner
Recipients	Jim.Jewett, docs@python, loewis, vstinner
Date	2011-12-15.14:03:41
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<1323957823.18.0.0570232959902.issue13604@psf.upfronthosting.co.za>
In-reply-to

Content
Various comments of the PEP 393 and your patch. "For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out." and "For compatibility, redundant representations may be computed." I never understood this statement: in most cases, PyUnicode_READY() replaces the Py_UNICODE* (wchar_t) representation by a compact representation. PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(), ... do reallocate a Py_UNICODE string for a ready string, but I don't think that it is a usual use case. PyUnicode_AS_UNICODE() & friends are usually only used to build strings. So this issue should be documented in a section different than the Abstract, maybe in a Limitations section. So even if a third party module uses the legagy Unicode API, the PEP 393 will still optimize the memory usage thanks to implicit calls to PyUnicode_READY() (done everywhere in Python source code). In the current code, the most common case where a string has two representations is the conversion to wchar_t* on Windows. PyUnicode_AsUnicode() is used to encode arguments for the Windows Unicode API, and PyUnicode_AsUnicode() keeps the result in the wstr attribute. Note: there is also the utf8 attribute which may contain a third representation if PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize() (or the old _PyUnicode_AsString()) is called. "Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length)." They can also be created by PyUnicode_FromUnicode(). "Resizing a Unicode string remains possible until it is finalized, generally by calling PyUnicode_READY." I changed PyUnicode_Resize(): it is now always possible to resize a string. The change was required because some decoders overallocate the string, and then resize after decoding the input. The sentence can be simply removed. + + 000 => str is not initialized (data are in wstr) + + 001 => 1 byte (Latin-1) + + 010 => 2 byte (UCS-2) + + 100 => 4 byte (UCS-4) + + Other values are reserved at this time. I don't like binary numbers, I would prefer decimal numbers here. Binary was maybe useful when we used bit masks, but we are now using the C "unsigned int field:bit_size;" trick for a nicer API. With the new values, it is even easier to remember them: 1 byte <=> kind=1 2 bytes <=> kind=2 4 bytes <=> kind=4 "[PyUnicode_AsUTF8] is thus identical to the existing _PyUnicode_AsString, which is removed" _PyUnicode_AsString() does still exist and is still heavily used (66 calls). It is not documented as deprecated in What's New in Python 3.3 (but it is a private function, so nobody uses it, right?. "This section summarizes the API additions." PyUnicode_IS_ASCII() is missing. PyUnicode_CHARACTER_SIZE() has been removed (use kind directly). UCS4 utility functions: Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncpy, strcmp, strncmp, strchr, strrchr} have been removed. "The following functions are added to the stable ABI (PEP 384), as they are independent of the actual representation of Unicode objects: ... ... PyUnicode_WriteChar ...." PyUnicode_WriteChar() allows to modify an immutable object, which is something specific to CPython. Well, the function does now raise an error if the string is no more modifiable (e.g. more than 1 reference to the string, the hash has already been computed, etc.), but I don't know if it should be added to the stable ABI. "PyUnicode_AsUnicodeAndSize" This function was added to Python 3.3 and is directly deprecated. Why adding a function to deprecate it? PyUnicode_AsUnicode() and PyUnicode_GET_SIZE() were not enough? "Deprecations, Removals, and Incompatibilities" Missing: PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp -- A very important point is not well explained: it is very important that a ("final") string is in its canonical representation. It means that a UCS2 string must contain at least a character bigger than U+00FF for example. Not only some optimizations rely on the canonical representation, but also some core methods of the Unicode type. I tried to list all properties of Unicode objects in the definition of the PyASCIIbject structure. And I implemented checks in _PyUnicode_CheckConsistency(). This method is only available in debug mode.

Various comments of the PEP 393 and your patch.

"For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out."
and
"For compatibility, redundant representations may be computed."

I never understood this statement: in most cases, PyUnicode_READY() replaces the Py_UNICODE* (wchar_t*) representation by a compact representation.

PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(), ... do reallocate a Py_UNICODE* string for a ready string, but I don't think that it is a usual use case. 
PyUnicode_AS_UNICODE() & friends are usually only used to build strings. So this issue should be documented in a section different than the Abstract, maybe in a Limitations section.

So even if a third party module uses the legagy Unicode API, the PEP 393 will still optimize the memory usage thanks to implicit calls to PyUnicode_READY() (done everywhere in Python source code).

In the current code, the most common case where a string has two representations is the conversion to wchar_t* on Windows. PyUnicode_AsUnicode() is used to encode arguments for the Windows Unicode API, and PyUnicode_AsUnicode() keeps the result in the wstr attribute.

Note: there is also the utf8 attribute which may contain a third representation if PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize() (or the old _PyUnicode_AsString()) is called.

"Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length)."

They can also be created by PyUnicode_FromUnicode().

"Resizing a Unicode string remains possible until it is finalized, generally by calling PyUnicode_READY."

I changed PyUnicode_Resize(): it is now *always* possible to resize a string. The change was required because some decoders overallocate the string, and then resize after decoding the input.

The sentence can be simply removed.

+    + 000 => str is not initialized (data are in wstr)
+    + 001 => 1 byte (Latin-1)
+    + 010 => 2 byte (UCS-2)
+    + 100 => 4 byte (UCS-4)
+    + Other values are reserved at this time.

I don't like binary numbers, I would prefer decimal numbers here. Binary was maybe useful when we used bit masks, but we are now using the C "unsigned int field:bit_size;" trick for a nicer API. With the new values, it is even easier to remember them:

 1 byte <=> kind=1
 2 bytes <=> kind=2
 4 bytes <=> kind=4

"[PyUnicode_AsUTF8] is thus identical to the existing _PyUnicode_AsString, which is removed"

_PyUnicode_AsString() does still exist and is still heavily used (66 calls). It is not documented as deprecated in What's New in Python 3.3 (but it is a private function, so nobody uses it, right?.

"This section summarizes the API additions."

PyUnicode_IS_ASCII() is missing.

PyUnicode_CHARACTER_SIZE() has been removed (use kind directly).

UCS4 utility functions:

Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncpy, strcmp, strncmp, strchr, strrchr} have been removed.


"The following functions are added to the stable ABI (PEP 384), as they
are independent of the actual representation of Unicode objects: ...
... PyUnicode_WriteChar ...."

PyUnicode_WriteChar() allows to modify an immutable object, which is something specific to CPython. Well, the function does now raise an error if the string is no more modifiable (e.g. more than 1 reference to the string, the hash has already been computed, etc.), but I don't know if it should be added to the stable ABI.

"PyUnicode_AsUnicodeAndSize"

This function was added to Python 3.3 and is directly deprecated. Why adding a function to deprecate it? PyUnicode_AsUnicode() and PyUnicode_GET_SIZE() were not enough?

"Deprecations, Removals, and Incompatibilities"

Missing: PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp


--

A very important point is not well explained: it is very important that a ("final") string is in its canonical representation. It means that a UCS2 string must contain at least a character bigger than U+00FF for example. Not only some optimizations rely on the canonical representation, but also some core methods of the Unicode type.

I tried to list all properties of Unicode objects in the definition of the PyASCIIbject structure. And I implemented checks in _PyUnicode_CheckConsistency(). This method is only available in debug mode.

History
Date	User	Action	Args
2011-12-15 14:03:43	vstinner	set	recipients: + vstinner, loewis, docs@python, Jim.Jewett
2011-12-15 14:03:43	vstinner	set	messageid: <1323957823.18.0.0570232959902.issue13604@psf.upfronthosting.co.za>
2011-12-15 14:03:42	vstinner	link	issue13604 messages
2011-12-15 14:03:41	vstinner	create