Message 135767 - Python tracker


Author	vstinner
Recipients	cdqzzy, ezio.melotti, lemburg, terry.reedy, vstinner
Date	2011-05-11.09:51:59
Message-id	<1305107521.32.0.0777227431537.issue12016@psf.upfronthosting.co.za>

Content
I asked on the iconv mailing list whether the change is correct. Here is a copy of the answer.

From: 	Bruno Haible
To: 	[iconv mailing list]
Cc: 	Victor Stinner
Subject: 	Re: [bug-gnu-libiconv] Invalid byte sequences and multiybyte encodings
Date: 	Tue, 10 May 2011 14:52:09 +0200

Hi,

> Someone opened an issue in Python bug tracker asking to change how
> invalid multibyte sequences are handled.
> http://bugs.python.org/issue12016

For UTF-8 the recommended way of handling malformed input is written down
in <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>. But the
principle applies to any encoding with a variable number of bytes per
character:
  When an invalid or malformed byte sequence is found, the smallest
  such byte sequence is transformed to U+FFFD (replacement character).

In particular, normally, if the first byte that is considered "wrong"
or "invalid" is a valid starter byte, the malformed byte sequence should
be considered to end before that byte. If it is not a valid starter
byte, then use your judgement.
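
The rule above can be sketched as follows. This is only a minimal illustration, not the code of any real codec: it assumes a hypothetical EUC-style encoding in which bytes below 0x80 are complete characters and both lead and trail bytes lie in 0xA1..0xFE, and `decode_pair` is a hypothetical callback that maps a valid two-byte sequence to a character.

```python
REPLACEMENT = "\ufffd"


def is_lead(byte):
    # Assumed lead-byte range of the hypothetical two-byte encoding.
    return 0xA1 <= byte <= 0xFE


def is_trail(byte):
    # Assumed trail-byte range of the hypothetical two-byte encoding.
    return 0xA1 <= byte <= 0xFE


def decode_replace(data, decode_pair):
    """Decode bytes, replacing the smallest malformed sequence with U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # A single-byte (ASCII) character.
            out.append(chr(byte))
            i += 1
        elif is_lead(byte) and i + 1 < len(data) and is_trail(data[i + 1]):
            # A well-formed two-byte character.
            out.append(decode_pair(data[i:i + 2]))
            i += 2
        else:
            # Invalid byte, or a lead byte without a valid trail byte:
            # consume only this one byte, so that the following byte is
            # examined again as a possible starter.
            out.append(REPLACEMENT)
            i += 1
    return "".join(out)


print(ascii(decode_replace(b"\xffabc", lambda pair: "?")))
```

When the byte after a lead byte is neither a valid trail byte nor a valid starter, this sketch emits one U+FFFD per byte; that is one possible choice for the "use your judgement" case, not the only one.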

For an example implementation, see
<http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/unistr/u8-mbtouc.c;hb=HEAD>
Here the return value is the number of bytes consumed. Look carefully
when it is 1, 2, 3, or 4.

> b'\xffabc'.decode('gb2312', 'replace') gives "�bc". The 'a' character is
> seen as part of a multibyte character of 2 bytes. Because {0xFF, 0x61}
> is invalid in GB2312, the two bytes are replaced by U+FFFD.
> 
> Is it the "right" way to do it?

It is better to replace only the 0xFF byte with U+FFFD, because 0x61 is a
valid first byte (even a complete character).
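
A quick way to check which behaviour a given interpreter has is sketched below. It assumes a Python version in which the multibyte codecs skip only the first byte of an invalid sequence (Python 3.3 or later, as far as I know); older versions return the result quoted above.

```python
raw = b"\xffabc"

# With the recommended behaviour, only 0xFF is replaced and 'a' survives.
decoded = raw.decode("gb2312", "replace")
print(ascii(decoded))

# The 'ignore' handler should likewise drop only the 0xFF byte.
ignored = raw.decode("gb2312", "ignore")
print(ascii(ignored))
```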

> The UTF-8 decoder was changed recently to ignore a single byte and restart
> the decoder, so '\xF1\x80\x41\x42\x43' is now decoded as "�ABC" instead of "�C".
> Should we do the same for all encodings?

Generally, yes.
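
The UTF-8 case from the question can be reproduced with the snippet below. It is only a check against the built-in decoder of a recent Python 3, which is an assumption about the interpreter in use; there the truncated sequence F1 80 becomes a single U+FFFD and decoding resumes at 0x41.

```python
raw = b"\xF1\x80\x41\x42\x43"

# 0xF1 announces a four-byte sequence and 0x80 is a valid continuation
# byte, but 0x41 is not, so the malformed sequence ends before 0x41.
decoded = raw.decode("utf-8", "replace")
print(ascii(decoded))
```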

> Or at least for Asian encodings
> (gb2312, gbk, gb18030, big5 family, ISO 2022 family, JIS family, EUC_KR,
> CP949, Big5, CP950, ...)?

For stateful encodings of the ISO 2022 family, you may want to ignore/replace
a complete escape sequence, where the syntax of escape sequences is defined
through general rules.
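
As an illustration of such general rules, the sketch below finds the extent of one escape sequence so that a decoder could skip or replace it as a unit. It assumes the ISO/IEC 2022 (ECMA-35) syntax: ESC (0x1B), then zero or more intermediate bytes in 0x20..0x2F, then one final byte in 0x30..0x7E. The function name `escape_sequence_length` is hypothetical and is not part of any codec API.

```python
ESC = 0x1B


def escape_sequence_length(data, pos):
    """Return the length of the escape sequence at data[pos].

    Returns 0 if data[pos] is not ESC or the sequence is truncated
    or malformed.
    """
    if pos >= len(data) or data[pos] != ESC:
        return 0
    i = pos + 1
    # Skip intermediate bytes (0x20..0x2F).
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    # A final byte (0x30..0x7E) terminates the sequence.
    if i < len(data) and 0x30 <= data[i] <= 0x7E:
        return i + 1 - pos
    return 0


# ESC $ B is the ISO-2022-JP designation of JIS X 0208-1983.
print(escape_sequence_length(b"\x1b$B", 0))
```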

Bruno
-- 
In memoriam Siegfried Rädel <http://en.wikipedia.org/wiki/Siegfried_Rädel>
