Author amaury.forgeotdarc
Recipients amaury.forgeotdarc, dongying
Date 2011-12-16.11:26:44
Message-id <1324034805.33.0.644386454984.issue13612@psf.upfronthosting.co.za>
In-reply-to
Content
Actually, this fails on 2.6 and 2.7 with wide unicode builds, and passes with narrow unicode builds (on my 64-bit Linux box).

In pyexpat.c, PyUnknownEncodingHandler reads 256 characters from a unicode buffer without checking its length, and here the buffer happens to be only 192 characters long.
So the read runs past the end of the buffer, etc.  The function has a comment "supports only 8bit encodings"; indeed.
Versions 3.2 and 3.3 happen to pass the test, probably by pure luck.
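
To make the length mismatch concrete, here is a minimal Python sketch. It assumes the unicode buffer comes from decoding one byte for each value 0x00..0xFF with replacement of invalid bytes, which this message does not spell out; the helper name `decoded_template_length` is hypothetical and this is not the code in pyexpat.c.

```python
# Illustrative sketch only; not the code in pyexpat.c.
# Assumption: the handler's buffer is the result of decoding the 256 byte
# values 0x00..0xFF with the requested codec, replacing invalid bytes.

def decoded_template_length(encoding):
    """Decode bytes 0x00..0xFF with `encoding` and return the character count."""
    template = bytes(bytearray(range(256)))
    return len(template.decode(encoding, "replace"))

for name in ("latin-1", "gb18030"):
    # An 8-bit codec yields exactly one character per byte. A multibyte
    # codec merges some adjacent bytes into one character, so the result is
    # shorter than 256 and indexing it up to 255 is out of bounds.
    print(name, decoded_template_length(name))
```

With an 8-bit codec such as latin-1 the result always has 256 characters, which is the only case a fixed 256-iteration loop handles safely.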

Supporting multibyte codecs won't be easy: pyexpat requires us to fill an array that specifies the number of bytes needed by each start byte (for example, in utf-8, 0xc3 starts a 2-byte sequence and 0xef starts a 3-byte sequence).  Our codecs framework does not provide this information, and some codecs (gb18030 for example) need the second byte to determine whether the sequence will need 4 bytes.
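
As a rough illustration of that array, the sketch below builds a 256-entry table for utf-8 by hand and then shows why the same cannot be done for gb18030. The convention used (a non-negative entry for a single-byte character, -n for a lead byte of an n-byte sequence, -1 for an invalid lead byte) follows expat's documented encoding map as I understand it; the function name `utf8_lead_byte_map` is hypothetical.

```python
# Illustrative sketch only; the table layout is an assumption based on
# expat's documented convention for unknown-encoding maps.

def utf8_lead_byte_map():
    """Return a 256-entry list describing each possible utf-8 start byte."""
    table = []
    for b in range(256):
        if b < 0x80:
            table.append(b)    # single byte: maps straight to the code point
        elif 0xC2 <= b <= 0xDF:
            table.append(-2)   # starts a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            table.append(-3)   # starts a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            table.append(-4)   # starts a 4-byte sequence
        else:
            table.append(-1)   # continuation byte or never valid as a start
    return table

utf8_map = utf8_lead_byte_map()
print(utf8_map[0xC3], utf8_map[0xEF])

# In gb18030 the same lead byte 0x81 can start a 2-byte or a 4-byte
# sequence; only the second byte tells them apart, so no single table
# entry for 0x81 can be correct.
two_byte = b"\x81\x40".decode("gb18030")
four_byte = b"\x81\x30\x81\x30".decode("gb18030")
print(len(two_byte), len(four_byte))
```

For utf-8 the table can be written down because the lead byte alone fixes the sequence length; the gb18030 lines show the case where that assumption breaks.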
History
Date                 User                Action  Args
2011-12-16 11:26:45  amaury.forgeotdarc  set     recipients: + amaury.forgeotdarc, dongying
2011-12-16 11:26:45  amaury.forgeotdarc  set     messageid: <1324034805.33.0.644386454984.issue13612@psf.upfronthosting.co.za>
2011-12-16 11:26:44  amaury.forgeotdarc  link    issue13612 messages
2011-12-16 11:26:44  amaury.forgeotdarc  create