This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Potential overflows due to incorrect usage of PyUnicode_AsString.
Type: security Stage:
Components: Interpreter Core Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: alexandre.vassalotti Nosy List: alexandre.vassalotti, belopolsky, gvanrossum, lemburg
Priority: normal Keywords: patch

Created on 2008-01-27 22:44 by alexandre.vassalotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_string_overflow.patch alexandre.vassalotti, 2008-01-27 22:44
unicode_string_overflow-2.patch alexandre.vassalotti, 2008-04-12 21:56
Messages (15)
msg61753 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-01-27 22:44
I have found a few instances of the following pattern in Py3k: 

   char buf[MAX];
   len = PyUnicode_GET_SIZE(str);
   if (len >= MAX)
       /* return error */
   strcpy(buf, PyUnicode_AsString(str));

which could overflow if str contains non-ASCII characters. These were
probably introduced during the PyString -> PyUnicode transition. Anyway,
I got a patch that fixes (hopefully) most of these bugs.
msg63360 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2008-03-07 18:43
Your description of the patch is a bit misleading.  As far as I can tell
only the first chunk (Python/import.c changes) addresses a potential
buffer overflow.  For example the last chunk (Modules/posixmodule.c
changes) simply eliminates an unused variable.  While a worthwhile
change, it should not be bundled with what is potentially a security patch.

I have a few suggestions:

1. It will really help if you produce a test case that crashes the
interpretor.  I am sure that will get noticed.

2. If any of buffer overflows apply to the current production versions
(2.4 or 2.5) or even the alpha release (2.6a1), it would make sense to
backport it to the trunk.  Once again, security issues in the trunk will
get noticed much faster than in py3k branch.
msg63363 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2008-03-07 20:19
I tried to produce a buffer overflow in get_parent (import.c), but an
attempt to import a module with non-ascii characters is aborted in
getargs.c before get_parent is reached:

>>> __import__("\u0080xyz")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __import__() argument 1 must be string without null bytes,
not str

This looks like a bug.  At the very least the error message is
misleading because there are no null bytes in "\u0080xyz" string.


The offending code is 

                        if ((Py_ssize_t)strlen(*p) !=
PyUnicode_GetSize(arg)) 
                                return converterr("string without null
bytes", 
                                                  arg, msgbuf, bufsize);

at getargs.c:826


However, given the preceding "XXX WAAAAH!" comment, this is probably a
sign of not yet implemented feature rather than a bug.
msg63369 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2008-03-07 22:00
Here are my comments on the other parts of the patch:

Python/structmember.c

The existing code is safe, but would silently produce wrong result if
T_CHAR attribute is assigned a non-ascii character.

With the patch this situation will be detected and an exception raised.

I am not sure that would be a desired behavior of py3k.  I could not
find any examples of using T_CHAR member in the stdlib, but an
alternative solution would be to change T_CHAR code to mean
PY_UNICODE_TYPE instead of char member.

Objects/typeobject.c

"%s" -> ".400s" is an obviously good change. 

The existing __doc__ processing code is correct.  Proposed code may be
marginally faster, but will allow docstrings with embedded null
characters, which may or may not be desirable (and may break other code
that uses tp_doc). Finally PyUnicode_AsStringAndSize always returns
null-terminated strings, so memcpy logic does not need to be altered.


Objects/structseq.c

Change from macros to enums is purely stylistic and python C style seem
to favor macros.

I don't think a repr of a python object can contain embedded null
characters, but even if that were the case, the patched code would not
support it because the resulting buffer is returned with
PyUnicode_FromString(buf).


Modules/datetimemodule.c

Existing code compensates for an error in initial estimate of totalnew
when it checks for overflow, but the proposed change will make code more
efficient.

Modules/zipimport.c

Since 's' format unit in PyArg_ParseTuple does not properly support
unicode yet, it is hard to tell if the current code is wrong, but
unicode paths cannot have embedded null characters, so use of 's#' is
not necessary.

Modules/timemodule.c

Supporting format strings with null characters is probably a good idea,
but that would be an RFE rather than a bug fix.

Modules/parsermodule.c

Looks like there is a bug there.
msg63433 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-03-10 02:18
Thanks for the review!

> Your description of the patch is a bit misleading.  As far as I can
> tell only the first chunk (Python/import.c changes) addresses a
> potential buffer overflow.

Yes, you are right. It seems only the bug in import.c could easily be
exploited.

> 1. It will really help if you produce a test case that crashes the
> interpretor.  I am sure that will get noticed.

   % cat pkg/__init__.py
   __package__ = "\U000c9c9c9" * 900
   from . import f
   % ./python
   Python 3.0a3+ (py3k:61164, Mar  1 2008, 19:55:42)
   >>> import pkg
   *** stack smashing detected ***: ./python terminated
   [1]    9503 abort (core dumped)  ./python

> 2. If any of buffer overflows apply to the current production
> versions (2.4 or 2.5) or even the alpha release (2.6a1), it would
> make sense to backport it to the trunk.

I don't think the trunk is affected in any way by the issues
mentioned here.

> The existing __doc__ processing code is correct.  Proposed code may be
> marginally faster, but will allow docstrings with embedded null
> characters, which may or may not be desirable (and may break other code
> that uses tp_doc)

Good call! I will check out if null-characters may pose a problem for
tp_doc and update the patch consequently.

> I don't think a repr of a python object can contain embedded null
> characters, but even if that were the case, the patched code would not
> support it because the resulting buffer is returned with
> PyUnicode_FromString(buf).

Oh, that is true. I will remove that part from the patch, then.

> Modules/datetimemodule.c
>
> Existing code compensates for an error in initial estimate of totalnew
> when it checks for overflow, but the proposed change will make code more
> efficient.

Right again.

> Modules/zipimport.c
[...]
> Modules/timemodule.c
[...]
> Modules/parsermodule.c
[...]

I need to check again the code for these three modules, before
commenting.

I will clean up the patch with your recommendation and post it
again. Thanks for taking the time to review my patch. It's greatly
appreciated.
msg63638 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-03-17 13:35
Any progress?
msg65423 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-04-12 21:56
I revised the patch with respect to Alexander's comments. In summary,
here is what I changed from the previous patch:

  - Removed the unnecessary "fixes" to Objects/structseq.c
    and Modules/timemodule.c
  - Updated Objects/typeobject.c to forbid null-bytes in __doc__,
    since they cannot be handled in the `tp_doc` member.
  - Removed an erroneous pointer dereference in Modules/zipimport.c
  - Changed `len+1` to `len` in the memcpy() call in the
    Modules/parsermodule.c fixes
msg65861 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-04-27 01:18
So, any comment on the latest patch?

If everything is all right, I would like to commit the patch to py3k.
msg65872 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2008-04-27 04:26
The patch looks good.  Just a question: I thought the strings returned by PyUnicode_AsStringAndSize are 0-terminated, while your patch at several 
places attempts to explicitly 0-terminate a copy of such string.  Are you 
sure this is necessary?
msg65960 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-29 11:07
@@ -2195,7 +2200,7 @@
 				}
 				return Py_None;
 			}
-			len = lastdot - start;
+			len = (size_t)(lastdot - start);
 			if (len >= MAXPATHLEN) {
 				PyErr_SetString(PyExc_ValueError,
 						"Module name too long");

The above cast needs to be (Py_ssize_t). size_t is an unsigned length type.
msg65962 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-29 11:31
BTW: The API PyUnicode_AsString() is pretty useless by itself - there's
no way to access the size information of the returned string without
again going to the Unicode object.

I'd suggest to remove the API altogether and not only deprecating it.

Furthermore, the API PyUnicode_AsStringAndSize() does not follow the API
signature of PyString_AsStringAndSize() in that it passes back the
pointer to the string as output parameter. That should be changed as
well. Note that PyString_AsStringAndSize() already does this for both
8-bit strings and Unicode, so the special Unicode API is not really
needed at all or you may want to rename PyString_AsStringAndSize() to
PyUnicode_AsStringAndSize().

Finally, since there are many cases where the string buffer contents are
copied to a new buffer, it's probably worthwhile to add a new API which
does the copying straight away and also deals with the overflow cases in
a central place. I'd suggest PyUnicode_AsChar() (with an API like
PyUnicode_AsWideChar()).
msg66160 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-05-03 17:49
Alexander Belopolsky wrote:
> The patch looks good.  Just a question: I thought the strings returned
> by PyUnicode_AsStringAndSize are 0-terminated, while your patch at
> several places attempts to explicitly 0-terminate a copy of such string.  
> Are you sure this is necessary?

I wasn't sure if the strings returned by PyUnicode_AsStringAndSize were
0-terminated, so I didn't take any chance and explicitly terminated
them. But I just verified myself and they are indeed 0-terminated. So, I
modified the patch in consequence.
msg66162 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-05-03 17:58
Marc-Andre Lemburg wrote:
[SNIP]
> The above cast needs to be (Py_ssize_t). size_t is an unsigned length
type.

Actually, the cast is right (even though it is not strictly necessary).
  It just the patch that is confusing. Here is the relevant code:  

	/* Normal module, so work out the package name if any */
	char *start = PyUnicode_AsString(modname);
	char *lastdot = strrchr(start, '.');
	size_t len;
	int error;
/* snip */
	len = (size_t)(lastdot - start);
	if (len >= MAXPATHLEN) {
		PyErr_SetString(PyExc_ValueError,
				"Module name too long");
		return NULL;
	}

I removed the cast from the patch (I don't know why I added it, anyway)
to avoid further confusion.
msg66163 - (view) Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer) Date: 2008-05-03 18:25
Committed to r62667.

Thank you all for your comments!
msg66167 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-03 19:53
On 2008-05-03 20:25, Alexandre Vassalotti wrote:
> Alexandre Vassalotti <alexandre@peadrop.com> added the comment:
> 
> Committed to r62667.
> 
> Thank you all for your comments!
> 
> ----------
> resolution:  -> fixed
> status: open -> closed

What about my comments regarding the PyUnicode_AsString() API in
http://bugs.python.org/msg65962

Should I open a separate tracker item for this ?

I don't know who added those APIs, but they are neither in line with
the rest of the Unicode API, nor are they really all that helpful.
I guess they were just added out of a misunderstanding of the
already existing code.

I'd suggest to remove PyUnicode_AsString() altogether (which your
patch has already done in a couple of places).
History
Date User Action Args
2022-04-11 14:56:30adminsetgithub: 46242
2013-11-11 11:56:09ncoghlanunlinkissue19347 dependencies
2013-11-11 11:55:38ncoghlanlinkissue19347 dependencies
2008-05-03 19:53:27lemburgsetmessages: + msg66167
title: Potential overflows due to incorrect usage of PyUnicode_AsString. -> Potential overflows due to incorrect usage of PyUnicode_AsString.
2008-05-03 18:25:30alexandre.vassalottisetstatus: open -> closed
resolution: fixed
messages: + msg66163
2008-05-03 17:58:32alexandre.vassalottisetmessages: + msg66162
2008-05-03 17:49:58alexandre.vassalottisetmessages: + msg66160
2008-04-29 11:31:49lemburgsetmessages: + msg65962
2008-04-29 11:07:49lemburgsetnosy: + lemburg
messages: + msg65960
2008-04-27 04:27:55belopolskysetmessages: + msg65872
2008-04-27 01:18:19alexandre.vassalottisetmessages: + msg65861
2008-04-12 21:56:25alexandre.vassalottisetfiles: + unicode_string_overflow-2.patch
messages: + msg65423
2008-03-17 13:35:12gvanrossumsetnosy: + gvanrossum
messages: + msg63638
2008-03-10 02:18:58alexandre.vassalottisetmessages: + msg63433
2008-03-07 22:00:29belopolskysetmessages: + msg63369
2008-03-07 20:19:23belopolskysetmessages: + msg63363
2008-03-07 18:43:14belopolskysetnosy: + belopolsky
messages: + msg63360
2008-01-27 22:45:12alexandre.vassalottisettitle: Potential Overflow due to incorrect usage of PyUnicode_AsString. -> Potential overflows due to incorrect usage of PyUnicode_AsString.
2008-01-27 22:44:54alexandre.vassalotticreate