classification
Title: Range check on unicode repr
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: works for me
Dependencies: Superseder:
Assigned To: Nosy List: belopolsky, ezio.melotti, georg.brandl, lemburg, mgiuca, pitrou, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2010-05-26 04:11 by mgiuca, last changed 2013-09-02 15:45 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
unicode-range-check.patch mgiuca, 2010-05-26 04:11 Patch for unicodeobject:unicodeescape_string
unicode-range-check2.patch mgiuca, 2010-08-02 13:17 Patch for unicodeobject:unicodeescape_string and PyUnicode_EncodeRawUnicodeEscape
Messages (15)
msg106506 - (view) Author: Matt Giuca (mgiuca) Date: 2010-05-26 04:11
In unicodeobject.c's unicodeescape_string, in UCS2 builds, if the last character of the string is the start of a UTF-16 surrogate pair (between '\ud800' and '\udfff'), there is a slight overrun problem. For example:

>>> repr(u'abcd\ud800')

Upon reading ch = 0xd800, the test (ch >= 0xD800 && ch < 0xDC00) succeeds, and it then reads ch2 = *s++. Note that preceding this line, s points at one character past the end of the string, so the value read will be garbage. I imagine that unless it falls on a segment boundary, the worst that could happen is the character '\ud800' is interpreted as some other wide character. Nevertheless, this is bad.

Note that *technically* this is never bad, because _PyUnicode_New allocates an extra character and sets it to '\u0000', and thus the above example will always set ch2 to 0, and it will behave correctly. But this is a tenuous thing to rely on, especially given the comment above _PyUnicode_New:

/* We allocate one more byte to make sure the string is
   Ux0000 terminated -- XXX is this needed ?
*/

I thought about removing that XXX, but I'd rather fix the problem. Therefore, I have attached a patch which does a range check before reading ch2:

--- Objects/unicodeobject.c	(revision 81539)
+++ Objects/unicodeobject.c	(working copy)
@@ -3065,7 +3065,7 @@
         }
 #else
         /* Map UTF-16 surrogate pairs to '\U00xxxxxx' */
-        else if (ch >= 0xD800 && ch < 0xDC00) {
+        else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
             Py_UNICODE ch2;
             Py_UCS4 ucs;

Also affects Python 3.
msg112291 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-08-01 08:49
Applied in r83395. Thanks!
msg112379 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-01 20:24
Well, the patch was technically useless since, as mentioned, unicode strings are terminated by a NUL character by design.

Anyway, I now get the following error on the 2.7 branch. Perhaps it's related:

======================================================================
FAIL: test_ucs4 (test.test_unicode.UnicodeTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/antoine/cpython/27/Lib/test/test_unicode.py", line 941, in test_ucs4
    self.assertEqual(x, y)
AssertionError: '\\udbc0\\udc00' != '\\U00100000'
msg112382 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-08-01 20:51
You're right. Reverted in r83444 and merging back, and I'll also remove the "XXX is this needed" from 2.7.
msg112434 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-02 09:07
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> Well, the patch was technically useless since, as mentioned, unicode strings are terminated by a NUL character by design.

There are two things to keep in mind:

 * Unicode objects are NUL-terminated, but only very external APIs
   rely on this (e.g. code using the Windows Unicode API). Please
   don't make the code in unicodeobject.c itself rely on this
   subtle detail.

 * The codecs work on Py_UNICODE* buffers which are *never* guaranteed
   to be NUL-terminated, so the problem in question is real.

> Anyway, I now get the following error on the 2.7 branch. Perhaps it's related:
> 
> ======================================================================
> FAIL: test_ucs4 (test.test_unicode.UnicodeTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/antoine/cpython/27/Lib/test/test_unicode.py", line 941, in test_ucs4
>     self.assertEqual(x, y)
> AssertionError: '\\udbc0\\udc00' != '\\U00100000'
> 
> ----------
> nosy: +pitrou
> status: closed -> open
> 
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue8821>
> _______________________________________
> _______________________________________________
> Python-bugs-list mailing list
> Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com
msg112441 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-02 09:47
>  * Unicode objects are NUL-terminated, but only very external APIs
>    rely on this (e.g. code using the Windows Unicode API). Please
>    don't make the code in unicodeobject.c itself rely on this
>    subtle detail.

That's wishful thinking, don't you think? *I* am not making code in
unicodeobject.c rely on this. It has been so for years, long before I
was here. You should check who made that design decision in the first
place instead of putting the blame on me.

Besides, the fact that external APIs rely on it make it much more
unchangeable than if it were an implementation detail.
msg112447 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-02 12:03
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>>  * Unicode objects are NUL-terminated, but only very external APIs
>>    rely on this (e.g. code using the Windows Unicode API). Please
>>    don't make the code in unicodeobject.c itself rely on this
>>    subtle detail.
> 
> That's wishful thinking, don't you think? *I* am not making code in
> unicodeobject.c rely on this. It has been so for years, long before I
> was here. You should check who made that design decision in the first
> place instead of putting the blame on me.

I'm not blaming you for this. However, I don't want more code
to rely on this behavior.

The NUL-termination has never been documented and my decision
to use NUL-termination on the PyUnicodeObject buffers was merely
a safety measure.

> Besides, the fact that external APIs rely on it make it much more
> unchangeable than if it were an implementation detail.

It's an undocumented implementation detail. We can certainly
deprecate it's use using the standard approach we have for this.

But all that is off-topic for this ticket, since codecs
operate on Py_UNICODE* buffers together with a size parameter
and relying on those buffers being NUL-terminated is bound to
cause problems.
msg112454 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-02 12:33
> But all that is off-topic for this ticket, since codecs
> operate on Py_UNICODE* buffers together with a size parameter
> and relying on those buffers being NUL-terminated is bound to
> cause problems.

That's right. Then perhaps a fixed patch can be uploaded, if someone
cares enough.
msg112460 - (view) Author: Matt Giuca (mgiuca) Date: 2010-08-02 13:17
OK, I finally had time to review this issue again.

Firstly, granted the original fix broke a test case, shouldn't we figure out why it broke and fix it, rather than just revert the change and continue relying on this tenuous assumption? Surely it's best to have as little code relying on it as possible.

Secondly, please have a look at my patch again. It wasn't committed properly -- no offense to Georg, it's an honest mistake! My patch was:

--- Objects/unicodeobject.c	(revision 81539)
+++ Objects/unicodeobject.c	(working copy)
@@ -3065,7 +3065,7 @@
         }
 #else
         /* Map UTF-16 surrogate pairs to '\U00xxxxxx' */
-        else if (ch >= 0xD800 && ch < 0xDC00) {
+        else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) {
             Py_UNICODE ch2;
             Py_UCS4 ucs;

The commit made in r83418 by Georg Brandl (and similarly r83395 in py3k):
http://svn.python.org/view/python/branches/release27-maint/Objects/unicodeobject.c?r1=82980&r2=83418

--- Objects/unicodeobject.c	(revision 83417)
+++ Objects/unicodeobject.c	(revision 83418)
@@ -3067,7 +3067,7 @@
 
             ch2 = *s++;
             size--;
-            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
+            if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
                 ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
                 *p++ = '\\';
                 *p++ = 'U';
@@ -3316,7 +3316,7 @@
 
                 ch2 = *s++;
                 size--;
-                if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) {
+                if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) {
                     ucs = (((ch & 0x03FF) << 10) | (ch2 & 0x03FF)) + 0x00010000;
                     *p++ = '\\';
                     *p++ = 'U';

I put the size check on the first character of the surrogate pair; in the committed version the size check was on the second character (after the "size" variable is decremented), causing it to break out of that branch too early in some cases.

Moving the size check to the outer if block fixes the test breakage.

PS. Good find on the second copy of that code in the PyUnicode_EncodeRawUnicodeEscape function. I've attached a new patch which fixes both functions instead of just the unicodeescape_string function.

Passes all test cases on UCS2 build of the 2.7 branch.
msg124890 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-29 23:59
[MAL]
>  * Unicode objects are NUL-terminated, but only very external APIs
>    rely on this (e.g. code using the Windows Unicode API). Please
>    don't make the code in unicodeobject.c itself rely on this
>    subtle detail.

I would like to note that several APIs have been introduced that require NUL-terminated unicode strings: Py_UNICODE_strlen(), Py_UNICODE_strcpy(), etc.  I see them used quite extensively in unicodeobject.c and elsewhere in Python codebase. Perhaps this train has left the station already.
msg124894 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-12-30 00:45
> Unicode objects are NUL-terminated, but only very external APIs
> rely on this (e.g. code using the Windows Unicode API).

All Py_UNICODE_str*() functions rely on the NUL character. They are useful when patching a function from bytes (char*) to unicode (PyUnicodeObject): the API is very close. It is possible to avoid them with new functions using the strings length.

All functions using PyUNICODE* as wchar_t* to the Windows wide character API (*W functions) also rely on the NUL character. Python core uses a lot of these functions. Don't write a NUL character require to create a temporary new string ending with a NUL character. It is not efficient, especially on long strings.

And there is the problem of all third party modules (written in C) relying on the NUL character.

I think that we have good reasons to not remove the NUL character. So I think that we can continue to accept that unicode[length] character can be read. Eg. implement text.startswith("ab") as "p=PyUnicode_AS_UNICODE(text); if (p[0] == 'a' && p[1] == 'b')" without checking the length of text.

Using the NUL character or the length as a terminator condition doesn't really matter. I just see one advantage for the NUL character: it is faster in some cases.
msg124901 - (view) Author: Matt Giuca (mgiuca) Date: 2010-12-30 02:37
> I think that we have good reasons to not remove the NUL character.

Please note: Nobody is suggesting that we remove the NUL character. I was merely suggesting that we don't rely on it where it is unnecessary.

Returning to my original patch: If the code was using the NUL character as a terminator, then it wouldn't be a bug.

What the repr code does is it uses the length, and does not explicitly search for a NUL character. However, there is a *bug* where it reads one too many characters in certain cases. As I said in the first place, it just happens to *not* be catastrophic due to the presence of the NUL character. But that does not mean this isn't a bug -- at the very least, the code is very confusing to read because it does not do what it is trying to do.

Anyway the important issue is what Marc-Andre raised about buffers. Since we are in agreement that there is a potential problem here, and I have a patch which seems correct and doesn't break any test cases (note my above post responding to test case breakages), can it be applied?
msg176810 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-02 21:26
This patch is not needed for 3.3+. And for 2.7 and 3.2 it actually doesn't fix any bug in the current code.
msg178912 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-01-03 04:56
Should this be closed then?
msg178929 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-01-03 10:03
You can accept the patch. You can reject the patch. It doesn't matter.
History
Date User Action Args
2013-09-02 15:45:31serhiy.storchakasetstatus: pending -> closed
resolution: works for me
stage: resolved
2013-01-26 18:53:56serhiy.storchakasetstatus: open -> pending
2013-01-03 10:03:31serhiy.storchakasetmessages: + msg178929
2013-01-03 04:56:01ezio.melottisetmessages: + msg178912
2012-12-02 21:26:07serhiy.storchakasetnosy: + serhiy.storchaka

messages: + msg176810
versions: - Python 3.1
2010-12-30 02:37:20mgiucasetnosy: lemburg, georg.brandl, belopolsky, pitrou, vstinner, ezio.melotti, mgiuca
messages: + msg124901
2010-12-30 00:45:27vstinnersetnosy: lemburg, georg.brandl, belopolsky, pitrou, vstinner, ezio.melotti, mgiuca
messages: + msg124894
2010-12-29 23:59:20belopolskysetnosy: + belopolsky, vstinner
messages: + msg124890
2010-08-10 20:24:08terry.reedysetversions: - Python 2.6, Python 3.3
2010-08-02 13:17:06mgiucasetfiles: + unicode-range-check2.patch

messages: + msg112460
2010-08-02 12:33:17pitrousetmessages: + msg112454
2010-08-02 12:03:57lemburgsetmessages: + msg112447
2010-08-02 09:49:13ezio.melottisetnosy: + ezio.melotti
2010-08-02 09:47:45pitrousetmessages: + msg112441
2010-08-02 09:08:18lemburgsetstatus: closed -> open
resolution: fixed -> (no value)
2010-08-02 09:07:51lemburgsetnosy: + lemburg
messages: + msg112434
2010-08-01 20:51:27georg.brandlsetstatus: open -> closed

messages: + msg112382
2010-08-01 20:24:00pitrousetstatus: closed -> open
nosy: + pitrou
messages: + msg112379

2010-08-01 08:49:32georg.brandlsetstatus: open -> closed

nosy: + georg.brandl
messages: + msg112291

resolution: fixed
2010-05-26 04:11:51mgiucacreate