Issue 9058: PyUnicodeDecodeError_Create asserts that various arguments are less than INT_MAX

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53304

classification

Title:	PyUnicodeDecodeError_Create asserts that various arguments are less than INT_MAX
Type:	crash	Stage:	patch review
Components:	Unicode	Versions:	Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, dmalcolm
Priority:	normal	Keywords:	patch

Created on 2010-06-22 18:08 by dmalcolm, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
remove-PyUnicodeDecodeError_Create-assertions-issue9058.patch	dmalcolm, 2010-06-22 18:14	Patch against trunk to remove the use of "s#" and remove the assertions
remove-PyUnicodeDecodeError_Create-assertions-issue9058-v2.patch	dmalcolm, 2010-06-22 19:06	Updated version of patch

Messages (5)
msg108404 - (view)	Author: Dave Malcolm (dmalcolm)	Date: 2010-06-22 18:08
Running this code (seen via http://bugs.python.org/file10013/python-2.5.2-unicode_resize-utf16.py for issue 2620): >>> msg = 'A'2147483647 ; msg.decode('utf16') leads to the python process exiting with an assertion failure: python: Objects/exceptions.c:1787: PyUnicodeDecodeError_Create: Assertion `length < 2147483647' failed. The reaason is that PyUnicodeDecodeError_Create contains this code: PyObject PyUnicodeDecodeError_Create( const char encoding, const char object, Py_ssize_t length, Py_ssize_t start, Py_ssize_t end, const char reason) { assert(length < INT_MAX); assert(start < INT_MAX); assert(end < INT_MAX); return PyObject_CallFunction(PyExc_UnicodeDecodeError, "ss#nns", encoding, object, length, start, end, reason); } In the example above, we're creating a buffer containing a very large but odd-numbered of bytes, and trying to UTF-16 decode it, which requires an even number of bytes. This leads to a UnicodeDecodeError with a very large value for each of length, start, and end, and although they fit in Py_ssize_t on 64-bit, they don't fit in an int. It seems that this will affect any other decoding errors for buffers that are >= INT_MAX in size: unicode decode errors will cause the python process to bail out with an assert failure. It appears that throughout the UnicodeDecodeError representation that length, start and size are to be of size Py_ssize_t, rather than int: they are stored in fields of type Py_ssize_t, they are printed using format "%zd", which is indeed a Py_ssize_t (see http://docs.python.org/c-api/string.html#PyString_FromFormat ). In PyUnicodeDecodeError_Create, "ss#nns" is: - "s" (string) [char ]: "encoding" - "s#" (string) [char , int]:"object", "length" - "n" (int) [Py_ssize_t]: "start" - "n" (int) [Py_ssize_t]: "end" - "s" (string) [char ]: "reason" See Python/modsupport.c: do_mkvalue: "s#" uses this logic: if (flags & FLAG_SIZE_T) n = va_arg(p_va, Py_ssize_t); else n = va_arg(p_va, int); where FLAG_SIZE_T is set by _Py_BuildValue_SizeT, but not by Py_BuildValue. The latter is what's called by PyObject_CallFunction. Hence, as written, "length" must fit within an "int", but "start" and "end" don't have to. "s#" calls PyString_FromStringAndSize(str, n) upon the data, which takes a Py_ssize_t. It's going to be big, but may well fit in RAM (but might not). The invoked function leads to a call to: UnicodeError_init, which calls: if (!PyArg_ParseTuple(args, "O!O!nnO!", &PyString_Type, &self->encoding, objecttype, &self->object, &self->start, &self->end, &PyString_Type, &self->reason)) { "O!": (object) [typeobject, PyObject ]: &PyString_Type, &self->encoding, "O!": (object) [typeobject, PyObject ]: objecttype, &self->object (objecttype is passed as &PyString_Type) "n": (integer) [Py_ssize_t]: &self->start, "n": (integer) [Py_ssize_t]: &self->end, "O!": (object) [typeobject, PyObject *]: &PyString_Type, &self->reason, So it looks like the only place in construction where we actually restrict to "int" is in that "s#" in PyUnicodeDecodeError_Create, which looks fixable. After construction: various calls to PyString_FromFormat for start and end using format "%zd", which is indeed a Py_ssize_t (see http://docs.python.org/c-api/string.html#PyString_FromFormat ) So it looks like the only issue here is the restriction to "int" in PyUnicodeDecodeError_Create due to the use of "s#" in the call to PyObject_CallFunction; apart from that, it looks like the assertions can be removed. I'll attach a patch which removes this restriction. (for my reference, I'm tracking this as https://bugzilla.redhat.com/show_bug.cgi?id=540518 )
msg108406 - (view)	Author: Dave Malcolm (dmalcolm)	Date: 2010-06-22 18:14
Patch to remove the restriction that the fields be < INT_MAX Tested on x86_64 on a machine with > 12GB of RAM, leads to this exception, rather than the python process bailing out: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "Lib/encodings/utf_16.py", line 16, in decode return codecs.utf_16_decode(input, errors, True) UnicodeDecodeError: 'utf16' codec can't decode byte 0x41 in position 2147483646: truncated data I'm not sure about automated testing for this; the reproducer seems to need >=12GB of RAM - in borderline cases it drives my machine deep into swap, which can take a long time to recover from, and might be unacceptable in the test suite.
msg108407 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2010-06-22 18:25
Actually, you should be able to just remove the asserts.
msg108409 - (view)	Author: Dave Malcolm (dmalcolm)	Date: 2010-06-22 19:06
>>> Actually, you should be able to just remove the asserts. Aha, thanks! Yes: #define PY_SSIZE_T_CLEAN is defined at the top of Objects/exceptions.c, so yes, Py_VaBuildValue is redirected to _Py_VaBuildValue_SizeT, so that PyEval_CallFunction calls: Py_VaBuildValue(format, vargs) which is preprocessed to: _Py_VaBuildValue_SizeT which calls: return va_build_value(format, va, FLAG_SIZE_T); which means that "s#" is processed with flags==FLAG_SIZE_T, and thus as "Py_ssize_t" within do_mkvalue: if (flags & FLAG_SIZE_T) n = va_arg(p_va, Py_ssize_t); else n = va_arg(p_va, int); So, yes, it does look like the three assert lines can simply be removed.
msg108410 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2010-06-22 19:17
Removed them in r82157.

History
Date	User	Action	Args
2022-04-11 14:57:02	admin	set	github: 53304
2010-06-22 19:17:19	benjamin.peterson	set	status: open -> closed resolution: fixed messages: + msg108410
2010-06-22 19:06:09	dmalcolm	set	files: + remove-PyUnicodeDecodeError_Create-assertions-issue9058-v2.patch messages: + msg108409
2010-06-22 18:25:42	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg108407
2010-06-22 18:14:29	dmalcolm	set	files: + remove-PyUnicodeDecodeError_Create-assertions-issue9058.patch keywords: + patch messages: + msg108406 stage: patch review
2010-06-22 18:08:43	dmalcolm	create