Author vstinner
Recipients cdqzzy, ezio.melotti, lemburg, terry.reedy, vstinner
Date 2011-05-11.09:51:59
I asked on the iconv mailing list whether the change is correct. Here is a copy of the answer.

From: 	Bruno Haible
To: 	[iconv mailing list]
Cc: 	Victor Stinner
Subject: 	Re: [bug-gnu-libiconv] Invalid byte sequences and multibyte encodings
Date: 	Tue, 10 May 2011 14:52:09 +0200


> Someone opened an issue in Python bug tracker asking to change how
> invalid multibyte sequences are handled.

For UTF-8 the recommended way of handling malformed input is written down
in <>. But the
principle applies to any encoding with a variable number of bytes per
character:
  When an invalid or malformed byte sequence is found, the smallest
  such byte sequence is transformed to U+FFFD (replacement character).

In particular, normally, if the first byte that is considered "wrong"
or "invalid" is a valid starter byte, the malformed byte sequence should
be considered to end before that byte. If it is not a valid starter
byte, then use your judgement.
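The rule above can be sketched in Python. This is only an illustration under stated assumptions, not the libiconv or CPython implementation: the function name decode_replace is hypothetical, the lead byte range 0xA1-0xF7 and trail byte range 0xA1-0xFE are assumed from the EUC-style layout of GB2312, and valid pairs are handed to Python's own strict gb2312 codec.

```python
REPLACEMENT = "\ufffd"

def decode_replace(data: bytes) -> str:
    """Decode EUC-style two-byte data, replacing the smallest malformed sequence."""
    out = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # ASCII byte: a complete character on its own.
            out.append(chr(b))
            i += 1
        elif 0xA1 <= b <= 0xF7 and i + 1 < n and 0xA1 <= data[i + 1] <= 0xFE:
            # Well-formed lead/trail pair: let the strict codec map it.
            try:
                out.append(data[i:i + 2].decode("gb2312"))
            except UnicodeDecodeError:
                # Well-formed but unassigned pair: a judgement call,
                # here the whole pair becomes one U+FFFD.
                out.append(REPLACEMENT)
            i += 2
        else:
            # Invalid lead byte, or a lead byte not followed by a trail byte:
            # consume only this byte so the next byte can start a character.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)

print(ascii(decode_replace(b"\xffabc")))
```

With this policy the 'a' after the invalid 0xFF byte is kept, because 0x61 is itself a valid starter byte.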

For an example implementation, see
Here the return value is the number of bytes consumed. Look carefully
when it is 1, 2, 3, or 4.

> b'\xffabc'.decode('gb2312', 'replace') gives "�bc". The 'a' character is
> seen as part of a multibyte character of 2 bytes. Because {0xFF, 0x61}
> is invalid in GB2312, the two bytes are replaced by U+FFFD.
> Is it the "right" way to do it?

It is better to replace only the 0xFF byte with U+FFFD, because 0x61 is a
valid first byte (even a complete character).

> UTF-8 decoder changed recently to ignore a single byte and restart the
> decoder, so '\xF1\x80\x41\x42\x43' is now decoded "�ABC" instead of "�C".
> Should we do the same for all encodings?

Generally, yes.
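The UTF-8 behaviour mentioned in the question can be checked with the built-in codec. A minimal sketch, assuming a current CPython 3 interpreter; the variable names are only for illustration.

```python
# 0xF1 starts a four-byte sequence and 0x80 is a valid continuation byte,
# but 0x41 ('A') is not, so the truncated sequence ends before 'A'.
data = b"\xF1\x80\x41\x42\x43"
decoded = data.decode("utf-8", "replace")
print(ascii(decoded))
```

The truncated sequence is replaced by a single U+FFFD and decoding restarts at 'A', which is itself a valid starter byte.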

> Or at least for asian encodings 
> (gb2312, gbk, gb18030, big5 family, ISO 2022 family, JIS family, EUC_KR,
> CP949, Big5, CP950, ...)?

For stateful encodings of the ISO 2022 family, you may want to ignore/replace
a complete escape sequence, where the syntax of escape sequences is defined
through general rules.
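As an illustration of skipping a complete escape sequence, here is a minimal sketch. It assumes the general ISO 2022 syntax: ESC (0x1B), then zero or more intermediate bytes in 0x20-0x2F, then one final byte in 0x30-0x7E. The helper name escape_sequence_length is hypothetical and is not part of any codec API.

```python
ESC = 0x1B

def escape_sequence_length(data: bytes, start: int) -> int:
    """Return the length of the escape sequence at data[start], or 0 if
    there is no complete escape sequence at that position."""
    if start >= len(data) or data[start] != ESC:
        return 0
    i = start + 1
    # Skip intermediate bytes (0x20-0x2F).
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    # A final byte (0x30-0x7E) terminates the sequence.
    if i < len(data) and 0x30 <= data[i] <= 0x7E:
        return i + 1 - start
    return 0

# ESC $ B is the designation used by ISO-2022-JP for JIS X 0208.
print(escape_sequence_length(b"\x1b$Babc", 0))
```

A decoder that meets an unknown but syntactically complete escape sequence could ignore or replace that many bytes as one unit, instead of replacing only the ESC byte.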

In memoriam Siegfried Rädel
Date                 User      Action  Args
2011-05-11 09:52:01  vstinner  set     recipients: + vstinner, lemburg, terry.reedy, ezio.melotti, cdqzzy
2011-05-11 09:52:01  vstinner  set     messageid: <>
2011-05-11 09:52:00  vstinner  link    issue12016 messages
2011-05-11 09:51:59  vstinner  create