classification
Title: Invalid UTF8 Byte sequence not raising exception/being substituted
Type: behavior Stage: resolved
Components: Unicode Versions: Python 2.7, Python 2.6
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: Mike.Lewis, ezio.melotti, jmehnle, lemburg, vstinner
Priority: normal Keywords:

Created on 2010-06-30 20:02 by Mike.Lewis, last changed 2014-03-12 21:14 by jmehnle. This issue is now closed.

Messages (4)
msg109010 - Author: Mike Lewis (Mike.Lewis) Date: 2010-06-30 20:02
When I do
codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')

it's not throwing an exception.  '\xed\xbc\xad' is an invalid UTF-8 byte sequence.

It maps to the value U+DF2D, which is a surrogate code point (one half of a "surrogate pair"), it seems.
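
The mapping to U+DF2D can be checked by undoing the 3-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx) by hand. This is only a minimal sketch in Python 3 syntax. The helper name `decode_three_byte` is made up for illustration, and it checks nothing beyond the lead and continuation bit patterns:

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the code point carried by a 3-byte UTF-8-style sequence."""
    b0, b1, b2 = seq  # iterating over bytes yields ints in Python 3
    # The lead byte must look like 1110xxxx, continuation bytes like 10xxxxxx.
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 3-byte sequence")
    # Concatenate the 4 + 6 + 6 payload bits.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


cp = decode_three_byte(b"\xed\xbc\xad")
print(hex(cp))                 # 0xdf2d
print(0xDC00 <= cp <= 0xDFFF)  # True: the value lies in the low-surrogate range
```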

http://tools.ietf.org/html/rfc3629#section-4

explains:

      However, pairs of
      UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
      parlance), being actually UCS-4 characters transformed through
      UTF-16, need special treatment: the UTF-16 transformation must be
      undone, yielding a UCS-4 character that is then transformed as
      above.

which would suggest that it is invalid.

However, I think Wikipedia's explanation is a bit clearer:

UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.


Thanks,
Mike
msg109011 - Author: Mike Lewis (Mike.Lewis) Date: 2010-06-30 20:07
Sorry, I meant to add this part to the quote from the RFC:

This leads to different results for character
   numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
   valid UTF-8
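
To make the quoted difference concrete, here is a rough sketch in Python 3 that compares the UTF-8 encoding of a character above 0xFFFF with a CESU-8-style encoding of the same character. The example character U+10400 is an arbitrary choice. Python ships no CESU-8 codec that this sketch relies on, so the CESU-8 bytes are built by hand from the UTF-16 surrogate pair using the `surrogatepass` error handler:

```python
import struct

ch = "\U00010400"  # arbitrary example character above U+FFFF

# Proper UTF-8: one 4-byte sequence.
utf8 = ch.encode("utf-8")

# CESU-8 style: split into the UTF-16 surrogate pair, then encode each
# surrogate separately as a 3-byte sequence.
high, low = struct.unpack(">HH", ch.encode("utf-16-be"))
cesu8 = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (high, low))

print(utf8.hex())   # f0909080
print(cesu8.hex())  # eda081edb080

# A strict UTF-8 decoder (Python 3) rejects the CESU-8 bytes.
try:
    cesu8.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)  # True
```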
msg109012 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-06-30 20:11
This is already fixed in Python 3.
However, I think that for backward-compatibility reasons it can't be fixed in Python 2, where it is possible to encode and decode every code point to/from UTF-8.
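
As a rough illustration of the Python 3 behaviour, the sketch below assumes a Python 3.1+ interpreter, where the `surrogatepass` error handler is available. The strict codec rejects the sequence in both directions, while `surrogatepass` reproduces the permissive round trip described for Python 2:

```python
raw = b"\xed\xbc\xad"

# Strict decoding refuses the encoded surrogate.
try:
    raw.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# Strict encoding refuses a lone surrogate code point as well.
try:
    "\udf2d".encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Opting in to 'surrogatepass' lets the surrogate through in both directions.
lenient = raw.decode("utf-8", "surrogatepass")
round_trip = lenient.encode("utf-8", "surrogatepass")

print(strict_decode_failed, strict_encode_failed, round_trip == raw)  # True True True
```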

See also http://bugs.python.org/issue8271#msg102209

I think this can be closed as wontfix.
msg109017 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-30 20:38
Ezio Melotti wrote:
> 
> I think this can be closed as wontfix.

Agreed. I've already closed the ticket.
History
Date                 User          Action  Args
2014-03-12 21:14:09  jmehnle       set     nosy: + jmehnle
2010-06-30 22:05:21  ezio.melotti  set     stage: resolved
2010-06-30 20:38:28  lemburg       set     messages: + msg109017
2010-06-30 20:25:37  lemburg       set     status: pending -> closed; resolution: wont fix
2010-06-30 20:11:50  ezio.melotti  set     status: open -> pending; versions: + Python 2.7; nosy: + lemburg, vstinner, ezio.melotti; messages: + msg109012; type: behavior
2010-06-30 20:07:17  Mike.Lewis    set     messages: + msg109011
2010-06-30 20:02:53  Mike.Lewis    create