Issue 9133: Invalid UTF8 Byte sequence not raising exception/being substituted

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53379

classification

Title:	Invalid UTF8 Byte sequence not raising exception/being substituted
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 2.7, Python 2.6

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:		Nosy List:	Mike.Lewis, ezio.melotti, jmehnle, lemburg, vstinner
Priority:	normal	Keywords:

Created on 2010-06-30 20:02 by Mike.Lewis, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg109010	Author: Mike Lewis (Mike.Lewis)	Date: 2010-06-30 20:02
When I do codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8'), it's not throwing an exception. '\xed\xbc\xad' is an invalid UTF-8 byte sequence. It maps to the value U+DF2D, which falls in the UTF-16 surrogate range, it seems.
http://tools.ietf.org/html/rfc3629#section-4 explains:
> However, pairs of UCS-2 values between D800 and DFFF (surrogate pairs in Unicode parlance), being actually UCS-4 characters transformed through UTF-16, need special treatment: the UTF-16 transformation must be undone, yielding a UCS-4 character that is then transformed as above.
which would suggest that it is invalid. However, I think Wikipedia's explanation is a bit clearer:
> UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.
Thanks, Mike
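For readers following the arithmetic, the mapping from '\xed\xbc\xad' to U+DF2D can be checked by hand. The sketch below is not part of the original report; it assumes a Python 3 interpreter (where iterating over bytes yields integers), and the helper names decode_three_byte and is_surrogate are hypothetical, introduced only for illustration.

```python
def decode_three_byte(seq: bytes) -> int:
    """Unpack a 3-byte UTF-8 pattern (1110xxxx 10xxxxxx 10xxxxxx) by hand.

    Hypothetical helper: it performs no validity checks beyond the bit
    pattern, which is the lenient behaviour the report describes.
    """
    b0, b1, b2 = seq
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 3-byte UTF-8 bit pattern")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def is_surrogate(code_point: int) -> bool:
    """True for code points reserved for UTF-16 surrogates."""
    return 0xD800 <= code_point <= 0xDFFF


cp = decode_three_byte(b"\xed\xbc\xad")
print(f"U+{cp:04X}", "surrogate" if is_surrogate(cp) else "scalar value")
```

The result lies in U+DC00 through U+DFFF, so it is a single low surrogate half rather than a complete pair.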
msg109011	Author: Mike Lewis (Mike.Lewis)	Date: 2010-06-30 20:07
Sorry, meant to add this part to the quote from the RFC:
> This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8
msg109012	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-06-30 20:11
This is already fixed in Python 3. However, I think that for backward compatibility reasons it can't be fixed in Python 2, where it is possible to encode and decode every code point to/from UTF-8. See also http://bugs.python.org/issue8271#msg102209.
I think this can be closed as wontfix.
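As an illustration of the Python 3 behaviour mentioned above, here is a minimal sketch, assuming a current Python 3 interpreter. It uses only the standard bytes.decode and str.encode methods with the documented "strict", "replace" and "surrogatepass" error handlers; the variable names are arbitrary.

```python
data = b"\xed\xbc\xad"  # the byte sequence from the report

# Default (strict) decoding rejects the encoded surrogate.
try:
    data.decode("utf-8")
    strict_raised = False
except UnicodeDecodeError as exc:
    strict_raised = True
    print("strict decode failed:", exc.reason)

# errors="replace" substitutes U+FFFD instead of raising.
replaced = data.decode("utf-8", errors="replace")
print("replaced:", ascii(replaced))

# errors="surrogatepass" opts back in to the lenient Python 2 style
# behaviour and allows a byte-for-byte round trip.
passed = data.decode("utf-8", errors="surrogatepass")
roundtrip = passed.encode("utf-8", errors="surrogatepass")
print(f"surrogatepass: U+{ord(passed):04X}", roundtrip == data)

# Strict encoding of a lone surrogate is rejected as well.
try:
    passed.encode("utf-8")
    encode_raised = False
except UnicodeEncodeError:
    encode_raised = True
```

Code that still needs to carry such sequences through unchanged can use "surrogatepass" explicitly, which keeps the lenient behaviour out of the default path.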
msg109017	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-06-30 20:38
Ezio Melotti wrote:
> I think this can be closed as wontfix.
Agreed. I've already closed the ticket.

History
Date	User	Action	Args
2022-04-11 14:57:03	admin	set	github: 53379
2014-03-12 21:14:09	jmehnle	set	nosy: + jmehnle
2010-06-30 22:05:21	ezio.melotti	set	stage: resolved
2010-06-30 20:38:28	lemburg	set	messages: + msg109017
2010-06-30 20:25:37	lemburg	set	status: pending -> closed resolution: wont fix
2010-06-30 20:11:50	ezio.melotti	set	status: open -> pending versions: + Python 2.7 nosy: + lemburg, vstinner, ezio.melotti messages: + msg109012 type: behavior
2010-06-30 20:07:17	Mike.Lewis	set	messages: + msg109011
2010-06-30 20:02:53	Mike.Lewis	create