Message 106506 - Python tracker

Author	mgiuca
Recipients	mgiuca
Date	2010-05-26.04:11:50
Message-id	<1274847113.25.0.171149609237.issue8821@psf.upfronthosting.co.za>
In-reply-to

Content
In unicodeobject.c's unicodeescape_string, in UCS2 builds, if the last character of the string is the start of a UTF-16 surrogate pair (a high surrogate, between '\ud800' and '\udbff'), the function reads one character past the end of the string. For example:

>>> repr(u'abcd\ud800')

Upon reading ch = 0xd800, the test (ch >= 0xD800 && ch < 0xDC00) succeeds, and the code then reads ch2 = *s++. By that point, s already points one character past the end of the string, so the value read is garbage. I imagine that, unless the read falls on a segment boundary, the worst that could happen is that '\ud800' gets combined with a garbage low surrogate and is escaped as some other wide character. Nevertheless, this is bad.
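
To make the index arithmetic concrete, here is a minimal sketch of the walk described above. It is not the CPython source: ch2_read_past_end is a hypothetical helper, uint16_t stands in for Py_UNICODE on a UCS2 build, and the function tracks indices instead of dereferencing, so it only reports where the unguarded read would land.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of CPython. Walks a UCS2 buffer the way the
   unpatched loop does and returns 1 if the read of ch2 would happen at an
   index >= size, 0 otherwise. It never performs the out-of-bounds read. */
static int ch2_read_past_end(const uint16_t *s, size_t size)
{
    size_t i = 0;
    while (i < size) {
        uint16_t ch = s[i++];
        if (ch >= 0xD800 && ch < 0xDC00) {
            if (i >= size)
                return 1;   /* the unpatched code would read s[size] here */
            uint16_t ch2 = s[i];
            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF)
                i++;        /* valid pair: consume the low surrogate too */
        }
    }
    return 0;
}
```

For the five-character buffer behind u'abcd\ud800' this returns 1; for a complete pair such as 0xD800 0xDC00 it returns 0.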

Note that *technically* this is never bad, because _PyUnicode_New allocates an extra character and sets it to '\u0000', and thus the above example will always set ch2 to 0, and it will behave correctly. But this is a tenuous thing to rely on, especially given the comment above _PyUnicode_New:

/* We allocate one more byte to make sure the string is
   Ux0000 terminated -- XXX is this needed ?
*/

I thought about removing that XXX, but I'd rather fix the problem. Therefore, I have attached a patch which does a range check before reading ch2:

--- Objects/unicodeobject.c	(revision 81539)
+++ Objects/unicodeobject.c	(working copy)
@@ -3065,7 +3065,7 @@
         }
 #else
         /* Map UTF-16 surrogate pairs to '\U00xxxxxx' */
-        else if (ch >= 0xD800 && ch < 0xDC00) {
+        else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
             Py_UNICODE ch2;
             Py_UCS4 ucs;
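
For illustration only, the following is a simplified standalone model of the patched branch. It assumes the same loop shape as the real function (size counts the characters still unread after ch has been fetched), but escape_ucs2, its output buffer handling, and the use of uint16_t in place of Py_UNICODE are my own stand-ins; it copies printable ASCII as-is and skips quoting and backslash handling entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of the patched branch, not the CPython source.
   Writes an escaped form of the UCS2 buffer s[0..size) into out. */
static void escape_ucs2(const uint16_t *s, size_t size, char *out, size_t outsz)
{
    size_t n = 0;
    if (outsz == 0)
        return;
    out[0] = '\0';
    /* The longest escape is 10 characters plus the terminating NUL. */
    while (size-- > 0 && n + 11 <= outsz) {
        uint16_t ch = *s++;
        /* The added "size > 0" test keeps the read of ch2 in bounds. */
        if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
            uint16_t ch2 = *s;
            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
                uint32_t ucs = ((((uint32_t)ch & 0x03FF) << 10)
                                | ((uint32_t)ch2 & 0x03FF)) + 0x10000;
                n += (size_t)snprintf(out + n, outsz - n, "\\U%08lx",
                                      (unsigned long)ucs);
                s++;        /* consume the low surrogate */
                size--;
                continue;
            }
            /* Isolated high surrogate: fall through and escape it alone. */
        }
        if (ch >= 0x20 && ch < 0x7F) {
            out[n++] = (char)ch;    /* printable ASCII is copied as-is */
            out[n] = '\0';
        } else {
            n += (size_t)snprintf(out + n, outsz - n, "\\u%04x", (unsigned)ch);
        }
    }
}
```

With the guard in place, a trailing high surrogate takes the same path as any other isolated surrogate and no longer depends on the extra terminator that _PyUnicode_New happens to allocate.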

Also affects Python 3.
