Message 112460 - Python tracker



Author	mgiuca
Recipients	ezio.melotti, georg.brandl, lemburg, mgiuca, pitrou
Date	2010-08-02.13:17:04
Message-id	<1280755028.1.0.229222108387.issue8821@psf.upfronthosting.co.za>

Content

OK, I finally had time to review this issue again.

Firstly, granted that the original fix broke a test case, shouldn't we figure out why it broke and fix it, rather than just reverting the change and continuing to rely on this tenuous assumption? Surely it's best to have as little code relying on it as possible.

Secondly, please have a look at my patch again. It wasn't committed properly -- no offense to Georg, it's an honest mistake! My patch was:

--- Objects/unicodeobject.c	(revision 81539)
+++ Objects/unicodeobject.c	(working copy)
@@ -3065,7 +3065,7 @@
         }
 #else
         /* Map UTF-16 surrogate pairs to '\U00xxxxxx' */
-        else if (ch >= 0xD800 && ch < 0xDC00) {
+        else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
             Py_UNICODE ch2;
             Py_UCS4 ucs;

The commit made in r83418 by Georg Brandl (and similarly r83395 in py3k):
http://svn.python.org/view/python/branches/release27-maint/Objects/unicodeobject.c?r1=82980&r2=83418

--- Objects/unicodeobject.c	(revision 83417)
+++ Objects/unicodeobject.c	(revision 83418)
@@ -3067,7 +3067,7 @@
 
             ch2 = *s++;
             size--;
-            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
+            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
                 ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
                 *p++ = '\\';
                 *p++ = 'U';
@@ -3316,7 +3316,7 @@
 
                 ch2 = *s++;
                 size--;
-                if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
+                if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
                     ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
                     *p++ = '\\';
                     *p++ = 'U';

I put the size check on the first character of the surrogate pair. In the committed version, the size check is on the second character (after the "size" variable has been decremented), which causes it to break out of that branch too early in some cases.

Moving the size check to the outer if block fixes the test breakage.
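
To illustrate why the placement of the check matters, here is a minimal standalone sketch (not the CPython code). It assumes the surrounding loop has the form `while (size-- > 0)`, so that "size" already excludes the current character. The function name `count_pairs`, the `check_outer` flag and the sample buffers are hypothetical and exist only for this illustration. Each sample buffer keeps a trailing 0 so that the inner-check variant never reads past the end of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sample UTF-16 buffers (hypothetical test data). The trailing 0 is a
   terminator and is NOT counted in the length passed to count_pairs. */
const uint16_t pair_at_end[]    = {0xD800, 0xDC00, 0};      /* length 2 */
const uint16_t pair_then_char[] = {0xD800, 0xDC00, 'a', 0}; /* length 3 */
const uint16_t lone_high[]      = {'a', 0xD800, 0};         /* length 2 */

/* Count the surrogate pairs that would be combined into one escape.
   check_outer != 0: size check on the first character (as in my patch).
   check_outer == 0: size check on the second character, after "size"
                     has been decremented (as in the committed version). */
int count_pairs(const uint16_t *s, ptrdiff_t size, int check_outer)
{
    int pairs = 0;

    assert(s != NULL);
    while (size-- > 0) {
        uint16_t ch = *s++;
        /* From here on, size is the number of code units left after ch. */
        if (ch >= 0xD800 && ch < 0xDC00 && (!check_outer || size > 0)) {
            uint16_t ch2 = *s++;
            size--;
            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && (check_outer || size)) {
                pairs++;    /* one combined escape would be emitted here */
                continue;
            }
            /* Not treated as a pair: step back so ch2 is handled alone. */
            s--;
            size++;
        }
    }
    return pairs;
}
```

With these inputs, `count_pairs(pair_at_end, 2, 1)` returns 1, while `count_pairs(pair_at_end, 2, 0)` returns 0, because "size" is already 0 when the inner check runs on a pair that ends the string. Both variants return 1 for `pair_then_char` (length 3) and 0 for `lone_high` (length 2).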

PS. Good find on the second copy of that code in the PyUnicode_EncodeRawUnicodeEscape function. I've attached a new patch which fixes both functions instead of just the unicodeescape_string function.

The new patch passes all test cases on a UCS2 build of the 2.7 branch.

History
Date	User	Action	Args
2010-08-02 13:17:08	mgiuca	set	recipients: + mgiuca, lemburg, georg.brandl, pitrou, ezio.melotti
2010-08-02 13:17:08	mgiuca	set	messageid: <1280755028.1.0.229222108387.issue8821@psf.upfronthosting.co.za>
2010-08-02 13:17:06	mgiuca	link	issue8821 messages
2010-08-02 13:17:05	mgiuca	create