
classification
Title: update PEP 393 (match implementation)
Type: Stage: patch review
Components: Documentation Versions: Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Jim.Jewett, docs@python, ezio.melotti, jcea, loewis, vstinner
Priority: normal Keywords: patch

Created on 2011-12-15 04:25 by Jim.Jewett, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description
pep-0393.txt.patch Jim.Jewett, 2011-12-15 04:25 updated PEP 393, patch format
pep-0393.txt Jim.Jewett, 2011-12-15 04:27 updated PEP 393, updated version only
pep-0393.txt Jim.Jewett, 2011-12-15 21:15 updated to reflect feedback
pep-0393.txt Jim.Jewett, 2011-12-16 00:34 replacement text
pep-0393v20111215.patch Jim.Jewett, 2011-12-16 00:38 diff of latest against current hg
pep-0393.txt Jim.Jewett, 2011-12-16 13:50 updated to reflect Martin's answers
pep-0393_20111216.txt.patch Jim.Jewett, 2011-12-16 13:52 diff of latest against current hg
Messages (9)
msg149497 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2011-12-15 04:25
The implementation has a larger state.kind
Clarified wording on wstr_length and surrogate pairs.
Clarified that the canonical "data" format doesn't always have a data pointer.
Mentioned that calling PyUnicode_READY would finalize a string, so that it couldn't be resized.
Changed section head "Other macros" to "Finalization macro" and removed the non-existent PyUnicode_CONVERT_BYTES (there is a similarly named private macro).
msg149558 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-15 14:03
Various comments on PEP 393 and your patch.

"For compatibility with existing APIs, several representations
may exist in parallel; over time, this compatibility should be phased
out."
and
"For compatibility, redundant representations may be computed."

I never understood this statement: in most cases, PyUnicode_READY() replaces the Py_UNICODE* (wchar_t*) representation by a compact representation.

PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(), ... do reallocate a Py_UNICODE* string for a ready string, but I don't think that it is a usual use case. 
PyUnicode_AS_UNICODE() & friends are usually only used to build strings. So this issue should be documented in a section different than the Abstract, maybe in a Limitations section.

So even if a third party module uses the legacy Unicode API, PEP 393 will still optimize the memory usage thanks to implicit calls to PyUnicode_READY() (done everywhere in Python source code).

In the current code, the most common case where a string has two representations is the conversion to wchar_t* on Windows. PyUnicode_AsUnicode() is used to encode arguments for the Windows Unicode API, and PyUnicode_AsUnicode() keeps the result in the wstr attribute.

Note: there is also the utf8 attribute which may contain a third representation if PyUnicode_AsUTF8() or PyUnicode_AsUTF8AndSize() (or the old _PyUnicode_AsString()) is called.
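
A rough way to observe this cached UTF-8 representation from Python, assuming CPython 3.3 or later where `ctypes.pythonapi` exposes `PyUnicode_AsUTF8`: two calls on the same string should hand back the same buffer. The helper name `utf8_pointer` is made up for this sketch.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8
as_utf8.argtypes = [ctypes.py_object]
as_utf8.restype = ctypes.c_void_p  # keep the raw address instead of copying bytes


def utf8_pointer(s):
    """Return the address of the UTF-8 buffer that CPython keeps for `s`."""
    return as_utf8(s)


text = "caf\u00e9 \u20ac"  # non-ASCII, so the UTF-8 form differs from the internal data
first = utf8_pointer(text)
second = utf8_pointer(text)

# The second call reuses the buffer stored on the string object.
same_buffer = (first == second)
decoded = ctypes.string_at(first).decode("utf-8")
```

The buffer belongs to the string object, so it is only valid while `text` is alive.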

"Objects for which the maximum character is not given at creation time are called "legacy" objects, created through PyUnicode_FromStringAndSize(NULL, length)."

They can also be created by PyUnicode_FromUnicode().

"Resizing a Unicode string remains possible until it is finalized, generally by calling PyUnicode_READY."

I changed PyUnicode_Resize(): it is now *always* possible to resize a string. The change was required because some decoders overallocate the string, and then resize after decoding the input.

The sentence can be simply removed.

+    + 000 => str is not initialized (data are in wstr)
+    + 001 => 1 byte (Latin-1)
+    + 010 => 2 byte (UCS-2)
+    + 100 => 4 byte (UCS-4)
+    + Other values are reserved at this time.

I don't like binary numbers, I would prefer decimal numbers here. Binary was maybe useful when we used bit masks, but we are now using the C "unsigned int field:bit_size;" trick for a nicer API. With the new values, it is even easier to remember them:

 1 byte <=> kind=1
 2 bytes <=> kind=2
 4 bytes <=> kind=4
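
The three kinds can be seen indirectly from Python. This is only a sketch, assuming CPython 3.3 or later: it measures how much `sys.getsizeof` grows when one more character is appended, which should match the kind. The helper `bytes_per_char` is a made-up name.

```python
import sys


def bytes_per_char(ch):
    """Estimate the per-character storage CPython uses for strings made of `ch`.

    Growing the string by one character grows the object by one storage unit,
    so the difference in size corresponds to the kind (1, 2 or 4).
    """
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


widths = {
    "ASCII 'a'": bytes_per_char("a"),
    "Latin-1 U+00E9": bytes_per_char("\u00e9"),
    "BMP U+20AC": bytes_per_char("\u20ac"),
    "non-BMP U+1F600": bytes_per_char("\U0001F600"),
}

for label, width in widths.items():
    print(f"{label}: {width} byte(s) per character")
```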

"[PyUnicode_AsUTF8] is thus identical to the existing _PyUnicode_AsString, which is removed"

_PyUnicode_AsString() does still exist and is still heavily used (66 calls). It is not documented as deprecated in What's New in Python 3.3 (but it is a private function, so nobody uses it, right?).

"This section summarizes the API additions."

PyUnicode_IS_ASCII() is missing.

PyUnicode_CHARACTER_SIZE() has been removed (use kind directly).

UCS4 utility functions:

Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp, strchr, strrchr} have been removed.


"The following functions are added to the stable ABI (PEP 384), as they
are independent of the actual representation of Unicode objects: ...
... PyUnicode_WriteChar ...."

PyUnicode_WriteChar() allows modifying an immutable object, which is something specific to CPython. Well, the function does now raise an error if the string is no longer modifiable (e.g. more than 1 reference to the string, the hash has already been computed, etc.), but I don't know if it should be added to the stable ABI.

"PyUnicode_AsUnicodeAndSize"

This function was added in Python 3.3 and immediately deprecated. Why add a function just to deprecate it? Were PyUnicode_AsUnicode() and PyUnicode_GET_SIZE() not enough?

"Deprecations, Removals, and Incompatibilities"

Missing: PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp


--

One important point is not well explained: a ("final") string must be in its canonical representation. For example, a UCS2 string must contain at least one character greater than U+00FF. Not only do some optimizations rely on the canonical representation, but so do some core methods of the Unicode type.

I tried to list all properties of Unicode objects in the definition of the PyASCIIObject structure. And I implemented checks in _PyUnicode_CheckConsistency(). This method is only available in debug mode.
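
The canonical-representation rule can be checked informally from Python. This sketch assumes CPython 3.3 or later and uses `sys.getsizeof` as a stand-in for inspecting the kind: once the only wide character is sliced off, the result should be stored as narrowly as a string that never contained it.

```python
import sys

narrow = "a" * 100        # only ASCII characters
wide = narrow + "\u20ac"  # one character above U+00FF forces 2 bytes per character
trimmed = wide[:-1]       # drop the wide character again

# If results are stored canonically, `trimmed` is as small as `narrow`.
back_to_narrow = sys.getsizeof(trimmed) == sys.getsizeof(narrow)

print("wide:", sys.getsizeof(wide), "bytes")
print("trimmed:", sys.getsizeof(trimmed), "bytes")
print("narrow:", sys.getsizeof(narrow), "bytes")
```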
msg149577 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2011-12-15 21:20
Updated to resolve most of Victor's concerns, but this meant enough changes that I'm not sure it quite counts as editorial only.

A few questions that I couldn't answer:

(1)  Upon string creation, do we want to *promise* to discard the UTF-8 and wstr, so that the caller can memory manage?

(2)  PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp seemed to be there in the code I was looking at.

(3)  I can't justify the born-deprecated function "PyUnicode_AsUnicodeAndSize".  Perhaps rename it with a leading underscore?  Though I'm not sure it is really needed at all.

(4)  I tried to reword the "for compatibility" ... "redundant" part ... but I'm not sure I resolved it.
msg149579 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-12-15 22:45
> PyUnicode_AsUnicode(), PyUnicode_AS_UNICODE(), PyUnicode_GET_SIZE(),
> ... do reallocate a Py_UNICODE* string for a ready string, but I
> don't think that it is a usual use case.

Define "usual". There were certainly plenty of occurrences of that
in the Python code base, and I believe that extension modules also
use it, provided they care about the content of string objects at all.

> PyUnicode_AS_UNICODE() &
> friends are usually only used to build strings.

No. They are also used to inspect them.

> So even if a third party module uses the legacy Unicode API, PEP
> 393 will still optimize the memory usage thanks to implicit calls to
> PyUnicode_READY() (done everywhere in Python source code).

... unless they inspect a given Unicode string, in which case it
will use twice the memory (or 1.5x).

> "Resizing a Unicode string remains possible until it is finalized,
> generally by calling PyUnicode_READY."
> 
> I changed PyUnicode_Resize(): it is now *always* possible to resize a
> string. The change was required because some decoders overallocate
> the string, and then resize after decoding the input.
> 
> The sentence can be simply removed.

Well, I meant the resizing of strings that doesn't move the object
in memory (i.e. unicode_resize). You (apparently) changed its signature
to take PyUnicode_Object** (instead of PyUnicode_Object*). It's probably
irrelevant since that's a unicodeobject.c-internal function, anyway.

> "PyUnicode_AsUnicodeAndSize"
> 
> This function was added in Python 3.3 and immediately deprecated. Why
> add a function just to deprecate it? Were PyUnicode_AsUnicode() and
> PyUnicode_GET_SIZE() not enough?

If it was not in 3.2, we should certainly remove it right away.
msg149580 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-12-15 22:50
> (1)  Upon string creation, do we want to *promise* to discard the UTF-8 and wstr, so that the caller can memory manage?

I don't understand the question. Assuming "discards" means "releases"
here, then there is no API which releases memory during creation of
the string object - let alone that there is any promise to do so. I'm
also not aware of any candidate buffer that you might want to release.

> (2)  PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp seemed to be there in the code I was looking at.

That's very well possible. What's the question?

> (3)  I can't justify the born-deprecated function "PyUnicode_AsUnicodeAndSize".  Perhaps rename it with a leading underscore?  Though I'm not sure it is really needed at all.

Nobody noticed that it is born-deprecated. If it really is, it should be
removed before the release.
msg149584 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2011-12-16 00:34
>> So even if a third party module uses the legacy Unicode API, PEP
>> 393 will still optimize the memory usage thanks to implicit calls to
>> PyUnicode_READY() (done everywhere in Python source code).

> ... unless they inspect a given Unicode string, in which case it
> will use twice the memory (or 1.5x).

Why is the utf-8 representation not cached when it is generated for ParseTuple et alia?

It seems like these parameters are likely to either be re-used as parameters (in which case caching makes sense) or not re-used at all (in which case, the whole string can go away).

> Well, I meant the resizing of strings that doesn't move the object
> in memory (i.e. unicode_resize).

This may easily fail because the new size can't be found at that location; wouldn't it be better to just encourage proper sizing in the first place?

>> (1)  Upon string creation, do we want to *promise* to discard
>> the UTF-8 and wstr, so that the caller can memory manage?

> I don't understand the question. Assuming "discards" means
> "releases" here, then there is no API which releases memory
> during creation of the string object - let alone that there is
> any promise to do so. I'm also not aware of any candidate buffer
> that you might want to release.

When a string is created from a wchar_t array, who is responsible for releasing the original wchar_t array?  As I read it now, Python doesn't release the buffer, and the caller can't because maybe Python just pointed to it as memory shared with the canonical representation.  

>> (2)  PyUnicode_AS_DATA(), Py_UNICODE_strncpy, Py_UNICODE_strncmp 
>> seemed to be there in the code I was looking at.

> That's very well possible. What's the question?

Victor listed them as missing.  I now suspect he meant "missing from the PEP list of deprecated functions and macros", and I just misunderstood.
msg149594 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-12-16 05:41
> Why is the utf-8 representation not cached when it is generated for
> ParseTuple et alia?

It is.

> When a string is created from a wchar_t array, who is responsible for
> releasing the original wchar_t array?

The caller.

> As I read it now, Python
> doesn't release the buffer, and the caller can't because maybe Python
> just pointed to it as memory shared with the canonical
> representation.

But Python won't; it will always make a copy for itself.
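
The copy semantics can be illustrated from Python via `ctypes`. This is a sketch, assuming CPython with `ctypes.pythonapi` available: a caller-owned `wchar_t` buffer is passed to `PyUnicode_FromWideChar`, then overwritten, and the resulting string is expected to stay unchanged.

```python
import ctypes

# Assumption: running on CPython, where ctypes.pythonapi exposes the C API.
from_wide = ctypes.pythonapi.PyUnicode_FromWideChar
from_wide.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
from_wide.restype = ctypes.py_object  # the call returns a new reference

buf = ctypes.create_unicode_buffer("hello")  # caller-owned wchar_t array
made = from_wide(buf, 5)                     # build a str from the first 5 characters

# Overwrite the caller's buffer; a str that had shared it would change too.
buf.value = "XXXXX"

print(made)       # the string created before the overwrite
print(buf.value)  # the caller's buffer after the overwrite
```

Because the string holds its own copy, the caller stays free to reuse or release `buf` after the call.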
msg149623 - (view) Author: Jim Jewett (Jim.Jewett) * (Python triager) Date: 2011-12-16 13:50
>> Why is the utf-8 representation not cached when it is generated for
>> ParseTuple et alia?

My error -- I read something backwards.

>> When a string is created from a wchar_t array, who is responsible for
>> releasing the original wchar_t array?

> The caller.

OK, I'll document that.

>> As I read it now, Python
>> doesn't release the buffer, and the caller can't because maybe Python
>> just pointed to it as memory shared with the canonical
>> representation.

> But Python won't; it will always make a copy for itself.

I thought I found an example each way, but it is possible that the shared version was something Python had already copied.  If not, I'll raise that as a separate issue to get the code changed.

(Note that I may not be able to look at this again until after Christmas, so I'm likely to go silent for a while.)
msg184148 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-14 08:19
What's the status of this?
History
Date User Action Args
2022-04-11 14:57:24 admin set github: 57813
2013-03-14 08:19:59 ezio.melotti set nosy: + ezio.melotti; messages: + msg184148
2011-12-16 16:02:33 jcea set nosy: + jcea
2011-12-16 13:52:38 Jim.Jewett set files: + pep-0393_20111216.txt.patch
2011-12-16 13:50:19 Jim.Jewett set files: + pep-0393.txt; messages: + msg149623
2011-12-16 05:41:31 loewis set messages: + msg149594
2011-12-16 00:38:51 Jim.Jewett set files: + pep-0393v20111215.patch
2011-12-16 00:34:30 Jim.Jewett set files: + pep-0393.txt; messages: + msg149584
2011-12-15 22:50:05 loewis set messages: + msg149580
2011-12-15 22:45:26 loewis set messages: + msg149579
2011-12-15 21:20:48 Jim.Jewett set messages: + msg149577
2011-12-15 21:15:33 Jim.Jewett set files: + pep-0393.txt
2011-12-15 14:03:42 vstinner set messages: + msg149558
2011-12-15 09:58:41 pitrou set nosy: + loewis, vstinner; stage: patch review
2011-12-15 04:27:24 Jim.Jewett set files: + pep-0393.txt; versions: + Python 3.3; nosy: + docs@python; assignee: docs@python; components: + Documentation
2011-12-15 04:25:46 Jim.Jewett create