Message 122464 - Python tracker


Author	belopolsky
Recipients	Rhamphoryncus, amaury.forgeotdarc, belopolsky, eric.smith, ezio.melotti, lemburg, pitrou
Date	2010-11-26.16:16:04
Message-id	<1290788167.05.0.282092739881.issue10542@psf.upfronthosting.co.za>
In-reply-to

Content

As discussed in issue 10521 and the sprawling "len(chr(i)) = 2?" thread [1] on python-dev, many functions in the Python library behave differently on narrow and wide builds.  While some differences are unavoidable, such as the length of strings containing non-BMP characters, many functions can work around them.  For example, the ord() function already returns integers above 0xFFFF when given a surrogate pair as a string of length two on a narrow build.  Other functions, such as str.isalpha(), are not yet aware of surrogates.  See also issue 9200.

A consensus is developing that support for non-BMP characters on narrow builds is here to stay and that naive functions should be fixed.  Unfortunately, working with surrogates in the Python code base is tricky because the Unicode C-API does not provide much support, and existing examples of surrogate processing look like this:

-        while (u != uend && w != wend) {
-            if (0xD800 <= u[0] && u[0] <= 0xDBFF
-                && 0xDC00 <= u[1] && u[1] <= 0xDFFF)
-            {
-                *w = (((u[0] & 0x3FF) << 10) | (u[1] & 0x3FF)) + 0x10000;
-                u += 2;
-            }
-            else {
-                *w = *u;
-                u++;
-            }
-            w++;
-        }

The attached patch introduces a Py_UNICODE_NEXT() macro that allows replacing the code above with two lines:

+        while (u != uend && w != wend)
+            *w++ = Py_UNICODE_NEXT(u, uend);
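
To make the idea concrete, here is a minimal sketch of the kind of logic such a macro could wrap, written as a plain C function. It is not the definition from the attached patch: the names is_high_surrogate, is_low_surrogate, join_surrogates and next_code_point are hypothetical, and uint16_t stands in for Py_UNICODE on a narrow build. Unlike the loop quoted above, this sketch checks the end pointer before looking at the second unit, which is presumably why the macro takes uend as an argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, NOT the macros from the attached patch. */
static int is_high_surrogate(uint32_t c) { return 0xD800 <= c && c <= 0xDBFF; }
static int is_low_surrogate(uint32_t c)  { return 0xDC00 <= c && c <= 0xDFFF; }

/* Combine a valid high/low surrogate pair into a single code point. */
static uint32_t join_surrogates(uint32_t high, uint32_t low)
{
    return (((high & 0x3FF) << 10) | (low & 0x3FF)) + 0x10000;
}

/* Read one code point from [*p, end) and advance *p by 1 or 2 units.
   A lone surrogate is returned unchanged.  Requires *p < end.
   uint16_t is an assumed stand-in for Py_UNICODE on a narrow build. */
static uint32_t next_code_point(const uint16_t **p, const uint16_t *end)
{
    assert(*p < end);
    uint32_t first = *(*p)++;
    /* Only look at the next unit if it is still inside the buffer. */
    if (is_high_surrogate(first) && *p != end && is_low_surrogate(**p)) {
        return join_surrogates(first, *(*p)++);
    }
    return first;
}
```

With a helper like this, the copy loop reduces to `while (u != uend && w != wend) *w++ = next_code_point(&u, uend);`, which mirrors the two-line form shown above.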

The patch also introduces a set of macros for manipulating surrogates, but I have not started replacing more instances of verbose surrogate processing because I would first like to look for higher-level abstractions such as Py_UNICODE_NEXT().  For example, many instances could benefit from a Py_UNICODE_PUT_NEXT(ptr, ch) macro that would put a UCS4 character ch into the Py_UNICODE buffer pointed to by ptr and advance ptr by 1 or 2 units as necessary.
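
As an illustration of the proposed write-side abstraction, the following is a rough sketch of what Py_UNICODE_PUT_NEXT(ptr, ch) might do, again as a plain C function rather than a macro. The name put_code_point, its signature, and the use of uint16_t in place of Py_UNICODE are assumptions made for this example; no such macro exists in the patch yet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical function version of the proposed Py_UNICODE_PUT_NEXT(ptr, ch).
   Writes ch at *p as one UTF-16 unit (BMP) or two units (surrogate pair)
   and advances *p accordingly.  The caller must guarantee room for two
   units.  uint16_t is an assumed stand-in for Py_UNICODE on a narrow build. */
static void put_code_point(uint16_t **p, uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    if (ch < 0x10000) {
        /* BMP character: stored as a single unit. */
        *(*p)++ = (uint16_t)ch;
    } else {
        /* Non-BMP character: split into a high and a low surrogate. */
        ch -= 0x10000;
        *(*p)++ = (uint16_t)(0xD800 | (ch >> 10));
        *(*p)++ = (uint16_t)(0xDC00 | (ch & 0x3FF));
    }
}
```

On a wide build the same interface would always store a single unit, so callers written against it would not need to distinguish between the two build types.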


[1] http://mail.python.org/pipermail/python-dev/2010-November/105908.html

History
Date	User	Action	Args
2010-11-26 16:16:07	belopolsky	set	recipients: + belopolsky, lemburg, amaury.forgeotdarc, Rhamphoryncus, pitrou, eric.smith, ezio.melotti
2010-11-26 16:16:07	belopolsky	set	messageid: <1290788167.05.0.282092739881.issue10542@psf.upfronthosting.co.za>
2010-11-26 16:16:05	belopolsky	link	issue10542 messages
2010-11-26 16:16:05	belopolsky	create