Issue 1950: Potential overflows due to incorrect usage of PyUnicode_AsString.

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46242

classification

Title:	Potential overflows due to incorrect usage of PyUnicode_AsString.
Type:	security	Stage:
Components:	Interpreter Core	Versions:	Python 3.0

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	alexandre.vassalotti	Nosy List:	alexandre.vassalotti, belopolsky, gvanrossum, lemburg
Priority:	normal	Keywords:	patch

Created on 2008-01-27 22:44 by alexandre.vassalotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicode_string_overflow.patch	alexandre.vassalotti, 2008-01-27 22:44
unicode_string_overflow-2.patch	alexandre.vassalotti, 2008-04-12 21:56

Messages (15)
msg61753 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-01-27 22:44
I have found a few instances of the following pattern in Py3k: char buf[MAX]; len = PyUnicode_GET_SIZE(str); if (len >= MAX) /* return error */ strcpy(buf, PyUnicode_AsString(str)); which could overflow if str contains non-ASCII characters. These were probably introduced during the PyString -> PyUnicode transition. Anyway, I got a patch that fixes (hopefully) most of these bugs.
msg63360 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2008-03-07 18:43
Your description of the patch is a bit misleading. As far as I can tell only the first chunk (Python/import.c changes) addresses a potential buffer overflow. For example the last chunk (Modules/posixmodule.c changes) simply eliminates an unused variable. While a worthwhile change, it should not be bundled with what is potentially a security patch. I have a few suggestions: 1. It will really help if you produce a test case that crashes the interpretor. I am sure that will get noticed. 2. If any of buffer overflows apply to the current production versions (2.4 or 2.5) or even the alpha release (2.6a1), it would make sense to backport it to the trunk. Once again, security issues in the trunk will get noticed much faster than in py3k branch.
msg63363 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2008-03-07 20:19
I tried to produce a buffer overflow in get_parent (import.c), but an attempt to import a module with non-ascii characters is aborted in getargs.c before get_parent is reached: >>> __import__("\u0080xyz") Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: __import__() argument 1 must be string without null bytes, not str This looks like a bug. At the very least the error message is misleading because there are no null bytes in "\u0080xyz" string. The offending code is if ((Py_ssize_t)strlen(*p) != PyUnicode_GetSize(arg)) return converterr("string without null bytes", arg, msgbuf, bufsize); at getargs.c:826 However, given the preceding "XXX WAAAAH!" comment, this is probably a sign of not yet implemented feature rather than a bug.
msg63369 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2008-03-07 22:00
Here are my comments on the other parts of the patch: Python/structmember.c The existing code is safe, but would silently produce wrong result if T_CHAR attribute is assigned a non-ascii character. With the patch this situation will be detected and an exception raised. I am not sure that would be a desired behavior of py3k. I could not find any examples of using T_CHAR member in the stdlib, but an alternative solution would be to change T_CHAR code to mean PY_UNICODE_TYPE instead of char member. Objects/typeobject.c "%s" -> ".400s" is an obviously good change. The existing __doc__ processing code is correct. Proposed code may be marginally faster, but will allow docstrings with embedded null characters, which may or may not be desirable (and may break other code that uses tp_doc). Finally PyUnicode_AsStringAndSize always returns null-terminated strings, so memcpy logic does not need to be altered. Objects/structseq.c Change from macros to enums is purely stylistic and python C style seem to favor macros. I don't think a repr of a python object can contain embedded null characters, but even if that were the case, the patched code would not support it because the resulting buffer is returned with PyUnicode_FromString(buf). Modules/datetimemodule.c Existing code compensates for an error in initial estimate of totalnew when it checks for overflow, but the proposed change will make code more efficient. Modules/zipimport.c Since 's' format unit in PyArg_ParseTuple does not properly support unicode yet, it is hard to tell if the current code is wrong, but unicode paths cannot have embedded null characters, so use of 's#' is not necessary. Modules/timemodule.c Supporting format strings with null characters is probably a good idea, but that would be an RFE rather than a bug fix. Modules/parsermodule.c Looks like there is a bug there.
msg63433 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-03-10 02:18
Thanks for the review! > Your description of the patch is a bit misleading. As far as I can > tell only the first chunk (Python/import.c changes) addresses a > potential buffer overflow. Yes, you are right. It seems only the bug in import.c could easily be exploited. > 1. It will really help if you produce a test case that crashes the > interpretor. I am sure that will get noticed. % cat pkg/__init__.py __package__ = "\U000c9c9c9" * 900 from . import f % ./python Python 3.0a3+ (py3k:61164, Mar 1 2008, 19:55:42) >>> import pkg * stack smashing detected *: ./python terminated [1] 9503 abort (core dumped) ./python > 2. If any of buffer overflows apply to the current production > versions (2.4 or 2.5) or even the alpha release (2.6a1), it would > make sense to backport it to the trunk. I don't think the trunk is affected in any way by the issues mentioned here. > The existing __doc__ processing code is correct. Proposed code may be > marginally faster, but will allow docstrings with embedded null > characters, which may or may not be desirable (and may break other code > that uses tp_doc) Good call! I will check out if null-characters may pose a problem for tp_doc and update the patch consequently. > I don't think a repr of a python object can contain embedded null > characters, but even if that were the case, the patched code would not > support it because the resulting buffer is returned with > PyUnicode_FromString(buf). Oh, that is true. I will remove that part from the patch, then. > Modules/datetimemodule.c > > Existing code compensates for an error in initial estimate of totalnew > when it checks for overflow, but the proposed change will make code more > efficient. Right again. > Modules/zipimport.c [...] > Modules/timemodule.c [...] > Modules/parsermodule.c [...] I need to check again the code for these three modules, before commenting. I will clean up the patch with your recommendation and post it again. Thanks for taking the time to review my patch. It's greatly appreciated.
msg63638 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-03-17 13:35
Any progress?
msg65423 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-04-12 21:56
I revised the patch with respect to Alexander's comments. In summary, here is what I changed from the previous patch: - Removed the unnecessary "fixes" to Objects/structseq.c and Modules/timemodule.c - Updated Objects/typeobject.c to forbid null-bytes in __doc__, since they cannot be handled in the `tp_doc` member. - Removed an erroneous pointer dereference in Modules/zipimport.c - Changed `len+1` to `len` in the memcpy() call in the Modules/parsermodule.c fixes
msg65861 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-04-27 01:18
So, any comment on the latest patch? If everything is all right, I would like to commit the patch to py3k.
msg65872 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2008-04-27 04:26
The patch looks good. Just a question: I thought the strings returned by PyUnicode_AsStringAndSize are 0-terminated, while your patch at several places attempts to explicitly 0-terminate a copy of such string. Are you sure this is necessary?
msg65960 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-04-29 11:07
@@ -2195,7 +2200,7 @@ } return Py_None; } - len = lastdot - start; + len = (size_t)(lastdot - start); if (len >= MAXPATHLEN) { PyErr_SetString(PyExc_ValueError, "Module name too long"); The above cast needs to be (Py_ssize_t). size_t is an unsigned length type.
msg65962 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-04-29 11:31
BTW: The API PyUnicode_AsString() is pretty useless by itself - there's no way to access the size information of the returned string without again going to the Unicode object. I'd suggest to remove the API altogether and not only deprecating it. Furthermore, the API PyUnicode_AsStringAndSize() does not follow the API signature of PyString_AsStringAndSize() in that it passes back the pointer to the string as output parameter. That should be changed as well. Note that PyString_AsStringAndSize() already does this for both 8-bit strings and Unicode, so the special Unicode API is not really needed at all or you may want to rename PyString_AsStringAndSize() to PyUnicode_AsStringAndSize(). Finally, since there are many cases where the string buffer contents are copied to a new buffer, it's probably worthwhile to add a new API which does the copying straight away and also deals with the overflow cases in a central place. I'd suggest PyUnicode_AsChar() (with an API like PyUnicode_AsWideChar()).
msg66160 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-05-03 17:49
Alexander Belopolsky wrote: > The patch looks good. Just a question: I thought the strings returned > by PyUnicode_AsStringAndSize are 0-terminated, while your patch at > several places attempts to explicitly 0-terminate a copy of such string. > Are you sure this is necessary? I wasn't sure if the strings returned by PyUnicode_AsStringAndSize were 0-terminated, so I didn't take any chance and explicitly terminated them. But I just verified myself and they are indeed 0-terminated. So, I modified the patch in consequence.
msg66162 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-05-03 17:58
Marc-Andre Lemburg wrote: [SNIP] > The above cast needs to be (Py_ssize_t). size_t is an unsigned length type. Actually, the cast is right (even though it is not strictly necessary). It just the patch that is confusing. Here is the relevant code: /* Normal module, so work out the package name if any / char start = PyUnicode_AsString(modname); char lastdot = strrchr(start, '.'); size_t len; int error; / snip */ len = (size_t)(lastdot - start); if (len >= MAXPATHLEN) { PyErr_SetString(PyExc_ValueError, "Module name too long"); return NULL; } I removed the cast from the patch (I don't know why I added it, anyway) to avoid further confusion.
msg66163 - (view)	Author: Alexandre Vassalotti (alexandre.vassalotti) *	Date: 2008-05-03 18:25
Committed to r62667. Thank you all for your comments!
msg66167 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2008-05-03 19:53
On 2008-05-03 20:25, Alexandre Vassalotti wrote: > Alexandre Vassalotti <alexandre@peadrop.com> added the comment: > > Committed to r62667. > > Thank you all for your comments! > > ---------- > resolution: -> fixed > status: open -> closed What about my comments regarding the PyUnicode_AsString() API in http://bugs.python.org/msg65962 Should I open a separate tracker item for this ? I don't know who added those APIs, but they are neither in line with the rest of the Unicode API, nor are they really all that helpful. I guess they were just added out of a misunderstanding of the already existing code. I'd suggest to remove PyUnicode_AsString() altogether (which your patch has already done in a couple of places).

History
Date	User	Action	Args
2022-04-11 14:56:30	admin	set	github: 46242
2013-11-11 11:56:09	ncoghlan	unlink	issue19347 dependencies
2013-11-11 11:55:38	ncoghlan	link	issue19347 dependencies
2008-05-03 19:53:27	lemburg	set	messages: + msg66167 title: Potential overflows due to incorrect usage of PyUnicode_AsString. -> Potential overflows due to incorrect usage of PyUnicode_AsString.
2008-05-03 18:25:30	alexandre.vassalotti	set	status: open -> closed resolution: fixed messages: + msg66163
2008-05-03 17:58:32	alexandre.vassalotti	set	messages: + msg66162
2008-05-03 17:49:58	alexandre.vassalotti	set	messages: + msg66160
2008-04-29 11:31:49	lemburg	set	messages: + msg65962
2008-04-29 11:07:49	lemburg	set	nosy: + lemburg messages: + msg65960
2008-04-27 04:27:55	belopolsky	set	messages: + msg65872
2008-04-27 01:18:19	alexandre.vassalotti	set	messages: + msg65861
2008-04-12 21:56:25	alexandre.vassalotti	set	files: + unicode_string_overflow-2.patch messages: + msg65423
2008-03-17 13:35:12	gvanrossum	set	nosy: + gvanrossum messages: + msg63638
2008-03-10 02:18:58	alexandre.vassalotti	set	messages: + msg63433
2008-03-07 22:00:29	belopolsky	set	messages: + msg63369
2008-03-07 20:19:23	belopolsky	set	messages: + msg63363
2008-03-07 18:43:14	belopolsky	set	nosy: + belopolsky messages: + msg63360
2008-01-27 22:45:12	alexandre.vassalotti	set	title: Potential Overflow due to incorrect usage of PyUnicode_AsString. -> Potential overflows due to incorrect usage of PyUnicode_AsString.
2008-01-27 22:44:54	alexandre.vassalotti	create