Author vstinner
Recipients cdqzzy, ezio.melotti, lemburg, terry.reedy, vstinner
Date 2011-05-07.09:21:43
SpamBayes Score 4.0302137e-05
Marked as misclassified No
Message-id <1304760104.18.0.60308675149.issue12016@psf.upfronthosting.co.za>
In-reply-to
Content
_codecs_cn implements different multibyte encodings: gb2312, gbkext, gbcommon, gb18030ext, gbk, gb18030.

And there are other Asian multibyte encodings: the Big5 family (Big5, CP950, ...), the ISO 2022 family, the JIS family, and the Korean encodings (KS X 1001, EUC-KR, CP949, ...).

All of them drop all the bytes of a multibyte sequence if any byte in it is invalid (e.g. 0xFF 0x0A is decoded to ? instead of ?\n with the replace error handler).
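The behavior can be reproduced with a short snippet (a sketch; note that the replace error handler actually inserts U+FFFD when decoding, which the message writes loosely as "?"):

```python
# Decode an invalid two-byte input with the replace error handler.
# As reported here, the GBK decoder consumed *both* bytes for the
# invalid sequence, so the trailing 0x0A (newline) was lost.
data = b'\xff\x0a'

result = data.decode('gbk', errors='replace')
print(repr(result))  # reported bug: '\ufffd' -- expected: '\ufffd\n'
```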

I don't think that you can/should patch only one encoding: we should use the same rule for all encodings.

By the way, do you have any document explaining which result is the correct one (? or ?\n)? For UTF-8, we have well defined standards explaining exactly what to do with invalid byte sequences => see issue #8271. It is easy to fix the decoders, but I would like to be sure that your proposed change is the right way to decode these encodings.
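For comparison, a sketch of the well-defined UTF-8 behavior referred to above (as standardized by Unicode and implemented in CPython after issue #8271): an invalid byte or maximal invalid subpart is replaced by a single U+FFFD, and the valid bytes that follow are still decoded.

```python
# UTF-8 with errors='replace': the invalid byte 0xFF becomes one
# U+FFFD and the following 0x0A is still decoded as '\n'.
assert b'\xff\x0a'.decode('utf-8', errors='replace') == '\ufffd\n'

# A truncated multibyte sequence is replaced as a whole (its
# "maximal subpart"), and the next valid byte ('A') survives.
print(repr(b'\xf0\x90\x41'.decode('utf-8', errors='replace')))
```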

Changing the multibyte codecs can also have security implications. Read for example the section "Check byte strings before decoding them to character strings" of my book:
http://www.haypocalc.com/tmp/unicode-2011-03-25/html/issues.html#check-byte-strings-before-decoding-them-to-character-strings
(https://github.com/haypo/unicode_book/wiki)
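The point of that section can be sketched as follows (`safe_decode` is a hypothetical helper, not an API from the book): validate untrusted bytes with the strict error handler before any lossy decoding, so that invalid sequences are rejected instead of being silently dropped or replaced.

```python
def safe_decode(data: bytes, encoding: str) -> str:
    """Decode untrusted bytes, rejecting any invalid sequence.

    Hypothetical helper: a security check on the decoded text (e.g.
    looking for '\\n' or '/') must see every byte that was sent, so
    we refuse input that the codec would otherwise silently alter.
    """
    try:
        return data.decode(encoding, errors='strict')
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid {encoding} input: {exc}") from None

print(safe_decode(b'hello\n', 'gbk'))  # valid input passes through
# safe_decode(b'\xff\x0a', 'gbk')     # would raise ValueError
```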