Issue 8821: Range check on unicode repr

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53067

classification

Title:	Range check on unicode repr
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	works for me
Dependencies:		Superseder:
Assigned To:		Nosy List:	belopolsky, ezio.melotti, georg.brandl, lemburg, mgiuca, pitrou, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-05-26 04:11 by mgiuca, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicode-range-check.patch	mgiuca, 2010-05-26 04:11	Patch for unicodeobject:unicodeescape_string
unicode-range-check2.patch	mgiuca, 2010-08-02 13:17	Patch for unicodeobject:unicodeescape_string and PyUnicode_EncodeRawUnicodeEscape

Messages (15)
msg106506 - (view)	Author: Matt Giuca (mgiuca)	Date: 2010-05-26 04:11
In unicodeobject.c's unicodeescape_string, in UCS2 builds, if the last character of the string is the start of a UTF-16 surrogate pair (between '\ud800' and '\udfff'), there is a slight overrun problem. For example: >>> repr(u'abcd\ud800') Upon reading ch = 0xd800, the test (ch >= 0xD800 && ch < 0xDC00) succeeds, and it then reads ch2 = s++. Note that preceding this line, s points at one character past the end of the string, so the value read will be garbage. I imagine that unless it falls on a segment boundary, the worst that could happen is the character '\ud800' is interpreted as some other wide character. Nevertheless, this is bad. Note that technically* this is never bad, because _PyUnicode_New allocates an extra character and sets it to '\u0000', and thus the above example will always set ch2 to 0, and it will behave correctly. But this is a tenuous thing to rely on, especially given the comment above _PyUnicode_New: /* We allocate one more byte to make sure the string is Ux0000 terminated -- XXX is this needed ? / I thought about removing that XXX, but I'd rather fix the problem. Therefore, I have attached a patch which does a range check before reading ch2: --- Objects/unicodeobject.c (revision 81539) +++ Objects/unicodeobject.c (working copy) @@ -3065,7 +3065,7 @@ } #else / Map UTF-16 surrogate pairs to '\U00xxxxxx' */ - else if (ch >= 0xD800 && ch < 0xDC00) { + else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) { Py_UNICODE ch2; Py_UCS4 ucs; Also affects Python 3.
msg112291 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2010-08-01 08:49
Applied in r83395. Thanks!
msg112379 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-08-01 20:24
Well, the patch was technically useless since, as mentioned, unicode strings are terminated by a NUL character by design. Anyway, I now get the following error on the 2.7 branch. Perhaps it's related: ====================================================================== FAIL: test_ucs4 (test.test_unicode.UnicodeTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/antoine/cpython/27/Lib/test/test_unicode.py", line 941, in test_ucs4 self.assertEqual(x, y) AssertionError: '\\udbc0\\udc00' != '\\U00100000'
msg112382 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2010-08-01 20:51
You're right. Reverted in r83444 and merging back, and I'll also remove the "XXX is this needed" from 2.7.
msg112434 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-08-02 09:07
Antoine Pitrou wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > Well, the patch was technically useless since, as mentioned, unicode strings are terminated by a NUL character by design. There are two things to keep in mind: * Unicode objects are NUL-terminated, but only very external APIs rely on this (e.g. code using the Windows Unicode API). Please don't make the code in unicodeobject.c itself rely on this subtle detail. * The codecs work on Py_UNICODE* buffers which are never guaranteed to be NUL-terminated, so the problem in question is real. > Anyway, I now get the following error on the 2.7 branch. Perhaps it's related: > > ====================================================================== > FAIL: test_ucs4 (test.test_unicode.UnicodeTest) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "/home/antoine/cpython/27/Lib/test/test_unicode.py", line 941, in test_ucs4 > self.assertEqual(x, y) > AssertionError: '\\udbc0\\udc00' != '\\U00100000' > > ---------- > nosy: +pitrou > status: closed -> open > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue8821> > _______________________________________ > _______________________________________________ > Python-bugs-list mailing list > Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/mal%40egenix.com
msg112441 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-08-02 09:47
> * Unicode objects are NUL-terminated, but only very external APIs > rely on this (e.g. code using the Windows Unicode API). Please > don't make the code in unicodeobject.c itself rely on this > subtle detail. That's wishful thinking, don't you think? I am not making code in unicodeobject.c rely on this. It has been so for years, long before I was here. You should check who made that design decision in the first place instead of putting the blame on me. Besides, the fact that external APIs rely on it make it much more unchangeable than if it were an implementation detail.
msg112447 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-08-02 12:03
Antoine Pitrou wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > >> * Unicode objects are NUL-terminated, but only very external APIs >> rely on this (e.g. code using the Windows Unicode API). Please >> don't make the code in unicodeobject.c itself rely on this >> subtle detail. > > That's wishful thinking, don't you think? I am not making code in > unicodeobject.c rely on this. It has been so for years, long before I > was here. You should check who made that design decision in the first > place instead of putting the blame on me. I'm not blaming you for this. However, I don't want more code to rely on this behavior. The NUL-termination has never been documented and my decision to use NUL-termination on the PyUnicodeObject buffers was merely a safety measure. > Besides, the fact that external APIs rely on it make it much more > unchangeable than if it were an implementation detail. It's an undocumented implementation detail. We can certainly deprecate it's use using the standard approach we have for this. But all that is off-topic for this ticket, since codecs operate on Py_UNICODE* buffers together with a size parameter and relying on those buffers being NUL-terminated is bound to cause problems.
msg112454 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-08-02 12:33
> But all that is off-topic for this ticket, since codecs > operate on Py_UNICODE* buffers together with a size parameter > and relying on those buffers being NUL-terminated is bound to > cause problems. That's right. Then perhaps a fixed patch can be uploaded, if someone cares enough.
msg112460 - (view)	Author: Matt Giuca (mgiuca)	Date: 2010-08-02 13:17
OK, I finally had time to review this issue again. Firstly, granted the original fix broke a test case, shouldn't we figure out why it broke and fix it, rather than just revert the change and continue relying on this tenuous assumption? Surely it's best to have as little code relying on it as possible. Secondly, please have a look at my patch again. It wasn't committed properly -- no offense to Georg, it's an honest mistake! My patch was: --- Objects/unicodeobject.c (revision 81539) +++ Objects/unicodeobject.c (working copy) @@ -3065,7 +3065,7 @@ } #else /* Map UTF-16 surrogate pairs to '\U00xxxxxx' / - else if (ch >= 0xD800 && ch < 0xDC00) { + else if (ch >= 0xD800 && ch < 0xDC00 && size > 0) { Py_UNICODE ch2; Py_UCS4 ucs; The commit made in r83418 by Georg Brandl (and similarly r83395 in py3k): http://svn.python.org/view/python/branches/release27-maint/Objects/unicodeobject.c?r1=82980&r2=83418 --- Objects/unicodeobject.c (revision 83417) +++ Objects/unicodeobject.c (revision 83418) @@ -3067,7 +3067,7 @@ ch2 = s++; size--; - if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) { + if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) { ucs = (((ch & 0x03FF) << 10) \| (ch2 & 0x03FF)) + 0x00010000; p++ = '\\'; p++ = 'U'; @@ -3316,7 +3316,7 @@ ch2 = s++; size--; - if (ch2 >= 0xDC00 && ch2 <= 0xDFFF) { + if (ch2 >= 0xDC00 && ch2 <= 0xDFFF && size) { ucs = (((ch & 0x03FF) << 10) \| (ch2 & 0x03FF)) + 0x00010000; p++ = '\\'; *p++ = 'U'; I put the size check on the first character of the surrogate pair; in the committed version the size check was on the second character (after the "size" variable is decremented), causing it to break out of that branch too early in some cases. Moving the size check to the outer if block fixes the test breakage. PS. Good find on the second copy of that code in the PyUnicode_EncodeRawUnicodeEscape function. I've attached a new patch which fixes both functions instead of just the unicodeescape_string function. Passes all test cases on UCS2 build of the 2.7 branch.
msg124890 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-12-29 23:59
[MAL] > * Unicode objects are NUL-terminated, but only very external APIs > rely on this (e.g. code using the Windows Unicode API). Please > don't make the code in unicodeobject.c itself rely on this > subtle detail. I would like to note that several APIs have been introduced that require NUL-terminated unicode strings: Py_UNICODE_strlen(), Py_UNICODE_strcpy(), etc. I see them used quite extensively in unicodeobject.c and elsewhere in Python codebase. Perhaps this train has left the station already.
msg124894 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-12-30 00:45
> Unicode objects are NUL-terminated, but only very external APIs > rely on this (e.g. code using the Windows Unicode API). All Py_UNICODE_str() functions rely on the NUL character. They are useful when patching a function from bytes (char) to unicode (PyUnicodeObject): the API is very close. It is possible to avoid them with new functions using the strings length. All functions using PyUNICODE* as wchar_t* to the Windows wide character API (*W functions) also rely on the NUL character. Python core uses a lot of these functions. Don't write a NUL character require to create a temporary new string ending with a NUL character. It is not efficient, especially on long strings. And there is the problem of all third party modules (written in C) relying on the NUL character. I think that we have good reasons to not remove the NUL character. So I think that we can continue to accept that unicode[length] character can be read. Eg. implement text.startswith("ab") as "p=PyUnicode_AS_UNICODE(text); if (p[0] == 'a' && p[1] == 'b')" without checking the length of text. Using the NUL character or the length as a terminator condition doesn't really matter. I just see one advantage for the NUL character: it is faster in some cases.
msg124901 - (view)	Author: Matt Giuca (mgiuca)	Date: 2010-12-30 02:37
> I think that we have good reasons to not remove the NUL character. Please note: Nobody is suggesting that we remove the NUL character. I was merely suggesting that we don't rely on it where it is unnecessary. Returning to my original patch: If the code was using the NUL character as a terminator, then it wouldn't be a bug. What the repr code does is it uses the length, and does not explicitly search for a NUL character. However, there is a bug where it reads one too many characters in certain cases. As I said in the first place, it just happens to not be catastrophic due to the presence of the NUL character. But that does not mean this isn't a bug -- at the very least, the code is very confusing to read because it does not do what it is trying to do. Anyway the important issue is what Marc-Andre raised about buffers. Since we are in agreement that there is a potential problem here, and I have a patch which seems correct and doesn't break any test cases (note my above post responding to test case breakages), can it be applied?
msg176810 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-12-02 21:26
This patch is not needed for 3.3+. And for 2.7 and 3.2 it actually doesn't fix any bug in the current code.
msg178912 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-01-03 04:56
Should this be closed then?
msg178929 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-01-03 10:03
You can accept the patch. You can reject the patch. It doesn't matter.

History
Date	User	Action	Args
2022-04-11 14:57:01	admin	set	github: 53067
2013-09-02 15:45:31	serhiy.storchaka	set	status: pending -> closed resolution: works for me stage: resolved
2013-01-26 18:53:56	serhiy.storchaka	set	status: open -> pending
2013-01-03 10:03:31	serhiy.storchaka	set	messages: + msg178929
2013-01-03 04:56:01	ezio.melotti	set	messages: + msg178912
2012-12-02 21:26:07	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg176810 versions: - Python 3.1
2010-12-30 02:37:20	mgiuca	set	nosy: lemburg, georg.brandl, belopolsky, pitrou, vstinner, ezio.melotti, mgiuca messages: + msg124901
2010-12-30 00:45:27	vstinner	set	nosy: lemburg, georg.brandl, belopolsky, pitrou, vstinner, ezio.melotti, mgiuca messages: + msg124894
2010-12-29 23:59:20	belopolsky	set	nosy: + belopolsky, vstinner messages: + msg124890
2010-08-10 20:24:08	terry.reedy	set	versions: - Python 2.6, Python 3.3
2010-08-02 13:17:06	mgiuca	set	files: + unicode-range-check2.patch messages: + msg112460
2010-08-02 12:33:17	pitrou	set	messages: + msg112454
2010-08-02 12:03:57	lemburg	set	messages: + msg112447
2010-08-02 09:49:13	ezio.melotti	set	nosy: + ezio.melotti
2010-08-02 09:47:45	pitrou	set	messages: + msg112441
2010-08-02 09:08:18	lemburg	set	status: closed -> open resolution: fixed -> (no value)
2010-08-02 09:07:51	lemburg	set	nosy: + lemburg messages: + msg112434
2010-08-01 20:51:27	georg.brandl	set	status: open -> closed messages: + msg112382
2010-08-01 20:24:00	pitrou	set	status: closed -> open nosy: + pitrou messages: + msg112379
2010-08-01 08:49:32	georg.brandl	set	status: open -> closed nosy: + georg.brandl messages: + msg112291 resolution: fixed
2010-05-26 04:11:51	mgiuca	create