Message 80814 - Python tracker


Author	vstinner
Recipients	vstinner
Date	2009-01-30.11:26:53
SpamBayes Score	1.5659696e-13
Marked as misclassified	No
Message-id	<1233314818.25.0.50797871273.issue5108@psf.upfronthosting.co.za>
In-reply-to

Content

PyUnicode_FromFormatV() does not correctly compute the Unicode length of 
a UTF-8 string. Commit r57837 "Change %s argument for 
PyUnicode_FromFormat to be UTF-8. Fixes #1070." introduced the bug. To 
compute the length, it uses complex custom code that counts the 
characters of the UTF-8 string, whereas PyUnicode_DecodeUTF8(p, 
strlen(p), "replace") + Py_UNICODE_COPY() is used to copy the string. 
The problem may come from the error handling ("replace").
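
The mismatch described above can be illustrated outside of C. The following is a minimal Python sketch, not the CPython code: naive_utf8_length() is a hypothetical counter that trusts each lead byte and skips the bytes it announces, and its result is compared with the length of the string produced by decoding with the "replace" error handler. The input bytes are an assumption chosen to make the two disagree.

```python
def naive_utf8_length(data):
    """Hypothetical counter: trusts each lead byte, never validates."""
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            step = 1  # ASCII
        elif b < 0xC0:
            step = 1  # stray continuation byte, counted as one character
        elif b < 0xE0:
            step = 2  # lead byte announcing a 2-byte sequence
        elif b < 0xF0:
            step = 3  # lead byte announcing a 3-byte sequence
        else:
            step = 4  # lead byte announcing a 4-byte sequence
        count += 1
        i += step
    return count


def decoded_length(data):
    """Length of the string actually produced with the 'replace' handler."""
    return len(data.decode("utf-8", "replace"))


# 0xE0 announces a 3-byte sequence, but 'a' and 'b' are not continuation bytes.
malformed = b"\xe0ab"
print(naive_utf8_length(malformed), decoded_length(malformed))
```

If a buffer were sized from the first number and filled from the decoded string, the copy would write past the end of the buffer whenever the decoded string is longer. For well-formed input such as "héllo".encode("utf-8") both functions agree, which may be why the problem only shows up with unusual strings.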

Valgrind shows that the error occurs at Py_UNICODE_COPY(): Invalid 
write of size 1. Since only one byte is overwritten, Python does not 
always crash.

To reproduce the crash, use PyUnicode_FromFormatV() or any function 
that uses it: PyUnicode_FromFormat(), PyErr_Format(), ...

Example 1:

    import grp
    x = ["\uDBE7\u8C99", "\u9C31\uF8DC\u3EC5\u1804\u629D\uE748\u68C8\uCF74\u9E63\uF647\uBF7A\uED63"]
    x = str(x)
    grp.getgrnam(x)

Example 2:

    import unicodedata
    x = "\\udbe7\u8c99', '\u9c31\\uf8dc\u3ec5\u1804\u629d\\ue748\u68c8\ucf74\u9e63\\uf647\ubf7a\\ued63"
    unicodedata.lookup(x)

I wrote a patch reusing PyUnicode_DecodeUTF8(p, strlen(p), "replace") 
+ PyUnicode_GET_SIZE() to get the real length of the converted UTF-8 
string.

A better patch should reuse the code that converts UTF-8 to Unicode 
with the "replace" error handling.
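
The idea behind the patch can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not the patch itself: format_percent_s() is a hypothetical helper that supports nothing but %s, and a Python list stands in for the preallocated Py_UNICODE buffer. The point it shows is that the buffer size is taken from the already-decoded strings, so the size and the copied data cannot disagree.

```python
def format_percent_s(template, args):
    """Hypothetical model of a two-pass formatter that handles only %s.

    Each argument is a UTF-8 byte string. It is decoded once with the
    'replace' handler, and the same decoded result is used both for
    sizing the buffer and for copying.
    """
    parts = template.split("%s")
    if len(parts) - 1 != len(args):
        raise ValueError("number of %s fields and arguments differ")

    # Pass 1: decode, then measure the decoded strings (not the raw bytes).
    decoded = [a.decode("utf-8", "replace") for a in args]
    total = sum(len(p) for p in parts) + sum(len(d) for d in decoded)

    # Pass 2: copy into a buffer of exactly the measured size.
    buf = [""] * total  # stand-in for the preallocated output buffer
    pos = 0
    for i, part in enumerate(parts):
        for ch in part:
            buf[pos] = ch
            pos += 1
        if i < len(decoded):
            for ch in decoded[i]:
                buf[pos] = ch
                pos += 1

    if pos != total:
        raise AssertionError("measured size and copied size differ")
    return "".join(buf)


print(format_percent_s("group %s not found", [b"\xe0ab"]))
```

The cost of this approach is that each argument is decoded before the buffer is allocated and the decoded strings must be kept until the copy, which is the trade-off a patch reusing the decoder code directly could avoid.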

History
Date	User	Action	Args
2009-01-30 11:26:58	vstinner	set	recipients: + vstinner
2009-01-30 11:26:58	vstinner	set	messageid: <1233314818.25.0.50797871273.issue5108@psf.upfronthosting.co.za>
2009-01-30 11:26:55	vstinner	link	issue5108 messages
2009-01-30 11:26:53	vstinner	create