Issue 5108: Invalid UTF-8 ("%s") length in PyUnicode_FromFormatV()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49358

classification

Title:	Invalid UTF-8 ("%s") length in PyUnicode_FromFormatV()
Type:	crash	Stage:	patch review
Components:	Unicode	Versions:	Python 3.0, Python 3.1

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	doerwalter, vstinner
Priority:	critical	Keywords:	patch

Created on 2009-01-30 11:26 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
unicode_format.patch	vstinner, 2009-01-30 11:27
diff.txt	doerwalter, 2009-04-09 21:26	Second version of the patch

Messages (4)
msg80814 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-01-30 11:26
PyUnicode_FromFormatV() doesn't count correctly the unicode length of an UTF-8 string. Commit r57837 "Change %s argument for PyUnicode_FromFormat to be UTF-8. Fixes #1070." introduced the bug. To compute the length, it uses a a complex code to compute the length of the UTF-8 string, whereas PyUnicode_DecodeUTF8(p, strlen(p), "replace") + Py_UNICODE_COPY() is used to copy the string. The problem may comes from the error handling ("replace"). Valgrind show that the error occurs at Py_UNICODE_COPY(): Invalid write of size 1. Since it's only one byte, Python does not always crash. To reproduce the crash, use PyUnicode_FromFormatV() or function using it: PyUnicode_FromFormat(), PyErr_Format(), ... Example 1: import grp x=["\uDBE7\u8C99", "\u9C31\uF8DC\u3EC5\u1804\u629D\uE748\u68C8\uCF74\u9E63\uF647\uBF7A\uED63"] x=str(x) grp.getgrnam(x) Example 2: import unicodedata x = "\\udbe7\u8c99', '\u9c31\\uf8dc\u3ec5\u1804\u629d\\ue748\u68c8\ucf74\u9e63\\uf647\ubf7a\\ued63" unicodedata.lookup(x) I wrote a patch reusing PyUnicode_DecodeUTF8(p, strlen(p), "replace") + PyUnicode_GET_SIZE() to get the real length of the converted UTF-8 string. A better patch should reuse code used to convert UTF-8 to Unicode with the "replace" error handling.
msg85827 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2009-04-09 21:26
The problem with your patch is that it calls PyUnicode_DecodeUTF8() twice. It would be better if step 1 in the code would include the %s format specifiers and step 3 would then call PyUnicode_DecodeUTF8() and put the result into the callresults buffer. BTW, I can't make your examples segfault. Attached is a new version of the patch that implements this. Could you test this new version to see if there's still a Valgrind report?
msg87081 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2009-05-03 22:58
Checked in: r72260 (trunk) r72262 (release26-maint) r72265 (py3k) r72266 (release30-maint)
msg87082 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-05-03 23:38
> The problem with your patch is that it calls PyUnicode_DecodeUTF8() twice I read your patch: storing the decoded string is the right way to do that, but i didn't noticed the "callresult stack". > Could you test this new version to see if there's still a Valgrind report? I will try to reproduce the crash using my Python fuzzer (Fusil). But it looks better (and valid). Thanks.

History
Date	User	Action	Args
2022-04-11 14:56:45	admin	set	github: 49358
2009-05-03 23:38:21	vstinner	set	messages: + msg87082
2009-05-03 22:58:03	doerwalter	set	status: open -> closed resolution: fixed messages: + msg87081
2009-04-15 10:13:28	pitrou	set	priority: critical type: crash stage: patch review
2009-04-09 21:26:06	doerwalter	set	files: + diff.txt nosy: + doerwalter messages: + msg85827
2009-01-30 11:27:05	vstinner	set	files: + unicode_format.patch keywords: + patch
2009-01-30 11:26:56	vstinner	create