Message 142089 - Python tracker


Author	terry.reedy
Recipients	Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date	2011-08-15.00:26:52
Message-id	<1313368013.46.0.107285515249.issue12729@psf.upfronthosting.co.za>
In-reply-to

Content
Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints. They are documented as being UCS-2 because that is what M-A Lemburg, the original designer and writer of Python's unicode type and the unicode-capable re module, wants them to be called. The link to msg142037, which is one of 50+ in the thread (and many or most others disagree), pretty well explains his viewpoint. The positive side is that we deliver more than we promise. The negative side is that not promising what perhaps we should allows us not to deliver what perhaps we should.
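To make the code unit versus codepoint distinction concrete, here is a small illustration. It assumes a current Python in which len() on a str counts codepoints, and it simulates what a narrow build would count by encoding to UTF-16; the helper name utf16_code_units is made up for this example.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16.

    This stands in for what len() is described as counting on a
    narrow build; it is a simulation, not the narrow-build code.
    """
    return len(text.encode("utf-16-le")) // 2


bmp_only = "abc"
with_non_bmp = "a\U0001D11Eb"  # U+1D11E lies outside the BMP

# All-BMP text: one code unit per codepoint.
print(utf16_code_units(bmp_only))

# The non-BMP character needs a surrogate pair, so the code unit
# count is one more than the number of codepoints.
print(utf16_code_units(with_non_bmp))
```

With one non-BMP character in a three-character string, the code unit count comes out one higher than the codepoint count, which is the discrepancy the paragraph above describes.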

*While I think this design decision may have been OK a decade ago for a first implementation of an *optional* text type, I do not think it is right for future, revised implementations of what is now *the* text type. I think narrow builds can and should be revised and upgraded to index, slice, and measure by codepoints. Here is my current idea:

If the code unit stream contains any non-BMP characters (i.e., surrogate pairs of 16-bit code units), construct a sequence of *indexes* of such characters (pairs). The fixed length of the string in codepoints is n-k, where n is the number of code units (the current length) and k is the length of the auxiliary sequence, which is also the number of pairs. For indexing, look up the codepoint index in the list of indexes by binary search and increment the codepoint index by the position found in that list (that is, by the number of pairs that precede it) to get the corresponding code unit index. (I have omitted the details needed to avoid off-by-1 errors.)
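A rough sketch of how this lookup might go, written for a current Python and simulating the 16-bit code unit stream with a list of integers. The class name NarrowText, the helper to_code_units, and the attribute names are invented for illustration; this is one reading of the idea under those assumptions, not the eventual prototype.

```python
import bisect
import struct


def to_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


class NarrowText:
    """Codepoint-indexed view over a sequence of UTF-16 code units."""

    def __init__(self, text):
        self.units = to_code_units(text)
        # Codepoint indexes of the characters stored as surrogate pairs.
        self.pair_indexes = []
        cp_index = 0
        u = 0
        while u < len(self.units):
            is_pair = (
                0xD800 <= self.units[u] <= 0xDBFF
                and u + 1 < len(self.units)
                and 0xDC00 <= self.units[u + 1] <= 0xDFFF
            )
            if is_pair:
                self.pair_indexes.append(cp_index)
                u += 2
            else:
                u += 1
            cp_index += 1

    def __len__(self):
        # n - k: code units minus the number of surrogate pairs.
        return len(self.units) - len(self.pair_indexes)

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("index out of range")
        # Number of pairs strictly before codepoint i, found by binary search.
        j = bisect.bisect_left(self.pair_indexes, i)
        u = i + j  # corresponding code unit index
        if j < len(self.pair_indexes) and self.pair_indexes[j] == i:
            high, low = self.units[u], self.units[u + 1]
            return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
        return chr(self.units[u])


t = NarrowText("a\U00010000b\U0001F600c")
print(len(t.units), len(t), t.pair_indexes)
```

Using bisect_left rather than bisect_right is what handles the off-by-1 case here: a pair located exactly at codepoint i must not be counted among the pairs that precede i.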

This would make indexing O(log(k)) when there are surrogates. If that is really a problem because k is a substantial fraction of a 'large' n, then one should use a wide build. By using a separate internal class, there would be no time or space penalty for all-BMP text. I will work on a prototype in Python.

PS: The OSCON link in msg142036 currently gives me a 404 Not Found.

History
Date	User	Action	Args
2011-08-15 00:26:53	terry.reedy	set	recipients: + terry.reedy, pitrou, jkloth, ezio.melotti, mrabarnett, Arfrever, r.david.murray, tchrist
2011-08-15 00:26:53	terry.reedy	set	messageid: <1313368013.46.0.107285515249.issue12729@psf.upfronthosting.co.za>
2011-08-15 00:26:52	terry.reedy	link	issue12729 messages
2011-08-15 00:26:52	terry.reedy	create