Issue 13604: update PEP 393 (match implementation)

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57813

classification

Title:	update PEP 393 (match implementation)
Type:
Stage:	patch review
Components:	Documentation
Versions:	Python 3.3

process

Status:	open
Resolution:
Dependencies:
Superseder:
Assigned To:	docs@python
Nosy List:	Jim.Jewett, docs@python, ezio.melotti, jcea, loewis, vstinner
Priority:	normal
Keywords:	patch

Created on 2011-12-15 04:25 by Jim.Jewett, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description
pep-0393.txt.patch	Jim.Jewett, 2011-12-15 04:25	updated PEP 393, patch format
pep-0393.txt	Jim.Jewett, 2011-12-15 04:27	updated PEP 393, updated version only
pep-0393.txt	Jim.Jewett, 2011-12-15 21:15	updated to reflect feedback
pep-0393.txt	Jim.Jewett, 2011-12-16 00:34	replacement text
pep-0393v20111215.patch	Jim.Jewett, 2011-12-16 00:38	diff of latest against current hg
pep-0393.txt	Jim.Jewett, 2011-12-16 13:50	updated to reflect Martin's answers
pep-0393_20111216.txt.patch	Jim.Jewett, 2011-12-16 13:52	diff of latest against current hg

Messages (9)
msg149497 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2011-12-15 04:25
The implementation has a larger state.kind.

Clarified wording on wstr_length and surrogate pairs.

Clarified that the canonical "data" format doesn't always have a data pointer.

Mentioned that calling PyUnicode_READY would finalize a string, so that it couldn't be resized.

Changed section head "Other macros" to "Finalization macro" and removed the non-existent PyUnicode_CONVERT_BYTES (there is a similarly named private macro).
msg149558 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-15 14:03
Various comments on PEP 393 and your patch.

"For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out."
and
"For compatibility, redundant representations may be computed."

I never understood this statement: in most cases, PyUnicode_READY() replaces the Py_UNICODE* (wchar_t) representation by a compact representation. PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(), ... do reallocate a Py_UNICODE string for a ready string, but I don't think that it is a usual use case. PyUnicode_AS_UNICODE() & friends are usually only used to build strings. So this issue should be documented in a section other than the Abstract, maybe in a Limitations section.

So even if a third party module uses the legacy Unicode API, the PEP 393 will still optimize the memory usage thanks to implicit calls to PyUnicode_READY() (done everywhere in Python source code).

In the current code, the most common case where a string has two representations is the conversion to wchar_t* on Windows. PyUnicode_AsUnicode() is used to encode arguments for the Windows Unicode API, and PyUnicode_AsUnicode() keeps the result in the wstr attribute.

Note: there is also the utf8 attribute, which may contain a third representation if PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize() (or the old _PyUnicode_AsString()) is called.

"Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length)."

They can also be created by PyUnicode_FromUnicode().

"Resizing a Unicode string remains possible until it is finalized, generally by calling PyUnicode_READY."

I changed PyUnicode_Resize(): it is now always possible to resize a string. The change was required because some decoders overallocate the string, and then resize after decoding the input.

The sentence can be simply removed.

+ 000 => str is not initialized (data are in wstr)
+ 001 => 1 byte (Latin-1)
+ 010 => 2 byte (UCS-2)
+ 100 => 4 byte (UCS-4)
+ Other values are reserved at this time.

I don't like binary numbers, I would prefer decimal numbers here. Binary was maybe useful when we used bit masks, but we are now using the C "unsigned int field:bit_size;" trick for a nicer API. With the new values, it is even easier to remember them:

1 byte <=> kind=1
2 bytes <=> kind=2
4 bytes <=> kind=4

"[PyUnicode_AsUTF8] is thus identical to the existing _PyUnicode_AsString, which is removed"

_PyUnicode_AsString() does still exist and is still heavily used (66 calls). It is not documented as deprecated in What's New in Python 3.3 (but it is a private function, so nobody uses it, right?).

"This section summarizes the API additions."

PyUnicode_IS_ASCII() is missing.

PyUnicode_CHARACTER_SIZE() has been removed (use kind directly).

UCS4 utility functions: Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp, strchr, strrchr} have been removed.

"The following functions are added to the stable ABI (PEP 384), as they are independent of the actual representation of Unicode objects: ... ... PyUnicode_WriteChar ...."

PyUnicode_WriteChar() allows modifying an immutable object, which is something specific to CPython. Well, the function does now raise an error if the string is no longer modifiable (e.g. more than 1 reference to the string, the hash has already been computed, etc.), but I don't know if it should be added to the stable ABI.

"PyUnicode_AsUnicodeAndSize"

This function was added to Python 3.3 and is directly deprecated. Why adding a function to deprecate it? PyUnicode_AsUnicode() and PyUnicode_GET_SIZE() were not enough?

"Deprecations, Removals, and Incompatibilities"

Missing: PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp

--

A very important point is not well explained: it is very important that a ("final") string is in its canonical representation.
It means that a UCS2 string must contain at least one character bigger than U+00FF, for example. Not only do some optimizations rely on the canonical representation, but so do some core methods of the Unicode type. I tried to list all properties of Unicode objects in the definition of the PyASCIIObject structure. And I implemented checks in _PyUnicode_CheckConsistency(). This method is only available in debug mode.
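
Editorial illustration (not part of the original message): the kind values and the canonical-representation rule described above can be sketched from Python. The helper names canonical_kind and measured_bytes_per_char are made up for this example, and the size measurement assumes CPython 3.3 or later, since sys.getsizeof reports implementation-specific numbers. Treat it as a rough sketch, not as a description of how unicodeobject.c computes the kind.

```python
import sys


def canonical_kind(text: str) -> int:
    """Return the kind (bytes per character: 1, 2 or 4) that a canonical
    string with this content would need, based on its largest code point."""
    max_char = max(map(ord, text), default=0)
    if max_char <= 0xFF:
        return 1  # fits in Latin-1
    if max_char <= 0xFFFF:
        return 2  # fits in UCS-2
    return 4      # needs UCS-4


def measured_bytes_per_char(ch: str) -> int:
    """Estimate per-character storage by differencing two string sizes.

    Assumption: CPython 3.3+, where the character data is stored inline
    and grows linearly with the string length.
    """
    shorter, longer = ch * 2, ch * 102
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // 100


for ch in ("a", "\u00e9", "\u20ac", "\U0001f600"):
    print(f"U+{ord(ch):04X}: predicted kind={canonical_kind(ch)}, "
          f"measured={measured_bytes_per_char(ch)}")

# A slice that drops the only wide character should be stored narrow again
# if the result is kept in canonical form.
wide = "a" * 100 + "\U0001f600"
narrow = wide[:100]
print(canonical_kind(wide), canonical_kind(narrow))
print(sys.getsizeof(narrow) == sys.getsizeof("a" * 100))
```

The boundary cases match the rule quoted above: a string whose largest character is U+00FF still fits kind 1, while a single U+0100 forces kind 2.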
msg149577 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2011-12-15 21:20
Updated to resolve most of Victor's concerns, but this meant enough changes that I'm not sure it quite counts as editorial only.

A few questions that I couldn't answer:

(1) Upon string creation, do we want to promise to discard the UTF-8 and wstr, so that the caller can memory manage?

(2) PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp seemed to be there in the code I was looking at.

(3) I can't justify the born-deprecated function "PyUnicode_AsUnicodeAndSize". Perhaps rename it with a leading underscore? Though I'm not sure it is really needed at all.

(4) I tried to reword the "for compatibility" ... "redundant" part ... but I'm not sure I resolved it.
msg149579 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-12-15 22:45
> PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(),
> ... do reallocate a Py_UNICODE* string for a ready string, but I
> don't think that it is a usual use case.

Define "usual". There were certainly plenty of occurrences of that in the Python code base, and I believe that extension modules also use it, provided they care about the content of string objects at all.

> PyUnicode_AS_UNICODE() &
> friends are usually only used to build strings.

No. They are also used to inspect them.

> So even if a third party module uses the legacy Unicode API, the PEP
> 393 will still optimize the memory usage thanks to implicit calls to
> PyUnicode_READY() (done everywhere in Python source code).

... unless they inspect a given Unicode string, in which case it will use twice the memory (or 1.5x).

> "Resizing a Unicode string remains possible until it is finalized,
> generally by calling PyUnicode_READY."
>
> I changed PyUnicode_Resize(): it is now always possible to resize a
> string. The change was required because some decoders overallocate
> the string, and then resize after decoding the input.
>
> The sentence can be simply removed.

Well, I meant the resizing of strings that doesn't move the object in memory (i.e. unicode_resize). You (apparently) changed its signature to take PyUnicode_Object** (instead of PyUnicode_Object*). It's probably irrelevant since that's a unicodeobject.c-internal function, anyway.

> "PyUnicode_AsUnicodeAndSize"
>
> This function was added to Python 3.3 and is directly deprecated. Why
> adding a function to deprecate it? PyUnicode_AsUnicode() and
> PyUnicode_GET_SIZE() were not enough?

If it was not in 3.2, we should certainly remove it right away.
msg149580 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-12-15 22:50
> (1) Upon string creation, do we want to promise to discard the UTF-8 and wstr, so that the caller can memory manage?

I don't understand the question. Assuming "discards" means "releases" here, then there is no API which releases memory during creation of the string object - let alone that there is any promise to do so. I'm also not aware of any candidate buffer that you might want to release.

> (2) PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp seemed to be there in the code I was looking at.

That's very well possible. What's the question?

> (3) I can't justify the born-deprecated function "PyUnicode_AsUnicodeAndSize". Perhaps rename it with a leading underscore? Though I'm not sure it is really needed at all.

Nobody noticed that it is born-deprecated. If it really is, it should be removed before the release.
msg149584 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2011-12-16 00:34
>> So even if a third party module uses the legacy Unicode API, the PEP
>> 393 will still optimize the memory usage thanks to implicit calls to
>> PyUnicode_READY() (done everywhere in Python source code).

> ... unless they inspect a given Unicode string, in which case it
> will use twice the memory (or 1.5x).

Why is the utf-8 representation not cached when it is generated for ParseTuple et alia? It seems like these parameters are likely to either be re-used as parameters (in which case caching makes sense) or not re-used at all (in which case, the whole string can go away).

> Well, I meant the resizing of strings that doesn't move the object
> in memory (i.e. unicode_resize).

This may easily fail because the new size can't be found at that location; wouldn't it be better to just encourage proper sizing in the first place?

>> (1) Upon string creation, do we want to promise to discard
>> the UTF-8 and wstr, so that the caller can memory manage?

> I don't understand the question. Assuming "discards" means
> "releases" here, then there is no API which releases memory
> during creation of the string object - let alone that there is
> any promise to do so. I'm also not aware of any candidate buffer
> that you might want to release.

When a string is created from a wchar_t array, who is responsible for releasing the original wchar_t array? As I read it now, Python doesn't release the buffer, and the caller can't because maybe Python just pointed to it as memory shared with the canonical representation.

>> (2) PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp
>> seemed to be there in the code I was looking at.

> That's very well possible. What's the question?

Victor listed them as missing. I now suspect he meant "missing from the PEP list of deprecated functions and macros", and I just misunderstood.
msg149594 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2011-12-16 05:41
> Why is the utf-8 representation not cached when it is generated for
> ParseTuple et alia?

It is.

> When a string is created from a wchar_t array, who is responsible for
> releasing the original wchar_t array?

The caller.

> As I read it now, Python
> doesn't release the buffer, and the caller can't because maybe Python
> just pointed to it as memory shared with the canonical
> representation.

But Python won't; it will always make a copy for itself.
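
Editorial illustration (not part of the original message): the ownership answer above can be checked with a small experiment, assuming CPython and a working ctypes module. It calls the public C API function PyUnicode_FromWideChar through ctypes.pythonapi, then overwrites the caller's buffer to see whether the new str is affected. The variable names are arbitrary, and this is only a sketch of one creation path, not proof about every function that builds a string from wchar_t data.

```python
import ctypes

# Assumption: CPython, where ctypes.pythonapi exposes the C API.
from_wide_char = ctypes.pythonapi.PyUnicode_FromWideChar
from_wide_char.argtypes = [ctypes.c_void_p, ctypes.c_ssize_t]
from_wide_char.restype = ctypes.py_object

# The caller owns this wchar_t buffer ("hello" plus a terminating NUL).
buffer = ctypes.create_unicode_buffer("hello")

# Build a str from the first five wchar_t units of the buffer.
text = from_wide_char(ctypes.addressof(buffer), 5)

# Overwrite the caller's buffer; if Python had merely pointed at it,
# the str would change as well.
buffer[0] = "J"

print(buffer.value)  # the caller's buffer after the modification
print(text)          # the str created before the modification
```

Because the str keeps its own copy, the caller remains free to modify or release the original buffer once the call returns.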
msg149623 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2011-12-16 13:50
>> Why is the utf-8 representation not cached when it is generated for
>> ParseTuple et alia?

My error -- I read something backwards.

>> When a string is created from a wchar_t array, who is responsible for
>> releasing the original wchar_t array?

> The caller.

OK, I'll document that.

>> As I read it now, Python
>> doesn't release the buffer, and the caller can't because maybe Python
>> just pointed to it as memory shared with the canonical
>> representation.

> But Python won't; it will always make a copy for itself.

I thought I found an example each way, but it is possible that the shared version was something Python had already copied. If not, I'll raise that as a separate issue to get the code changed.

(Note that I may not be able to look at this again until after Christmas, so I'm likely to go silent for a while.)
msg184148 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-14 08:19
What's the status of this?

History
Date	User	Action	Args
2022-04-11 14:57:24	admin	set	github: 57813
2013-03-14 08:19:59	ezio.melotti	set	nosy: + ezio.melotti messages: + msg184148
2011-12-16 16:02:33	jcea	set	nosy: + jcea
2011-12-16 13:52:38	Jim.Jewett	set	files: + pep-0393_20111216.txt.patch
2011-12-16 13:50:19	Jim.Jewett	set	files: + pep-0393.txt messages: + msg149623
2011-12-16 05:41:31	loewis	set	messages: + msg149594
2011-12-16 00:38:51	Jim.Jewett	set	files: + pep-0393v20111215.patch
2011-12-16 00:34:30	Jim.Jewett	set	files: + pep-0393.txt messages: + msg149584
2011-12-15 22:50:05	loewis	set	messages: + msg149580
2011-12-15 22:45:26	loewis	set	messages: + msg149579
2011-12-15 21:20:48	Jim.Jewett	set	messages: + msg149577
2011-12-15 21:15:33	Jim.Jewett	set	files: + pep-0393.txt
2011-12-15 14:03:42	vstinner	set	messages: + msg149558
2011-12-15 09:58:41	pitrou	set	nosy: + loewis, vstinner stage: patch review
2011-12-15 04:27:24	Jim.Jewett	set	files: + pep-0393.txt versions: + Python 3.3 nosy: + docs@python assignee: docs@python components: + Documentation
2011-12-15 04:25:46	Jim.Jewett	create