
Author Mike.Lewis
Recipients Mike.Lewis
Date 2010-06-30.20:02:51
Message-id <1277928174.53.0.503418777966.issue9133@psf.upfronthosting.co.za>
Content
When I run
codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')

it does not raise an exception, even though '\xed\xbc\xad' is an invalid UTF-8 byte sequence.

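It decodes to U+DF2D, which is a surrogate code point: a lone low surrogate, i.e. one half of a "surrogate pair", rather than a pair or a character in its own right.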
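
For anyone who wants to check the arithmetic, here is a minimal sketch that extracts the payload bits by hand. It assumes the standard three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx); the helper name decode_three_byte is made up for illustration and does no validation.

```python
def decode_three_byte(b0, b1, b2):
    """Combine the payload bits of a three-byte UTF-8 sequence.

    Hypothetical helper for illustration only: it performs no validation.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_three_byte(0xED, 0xBC, 0xAD)
print("U+%04X" % code_point)  # U+DF2D

# Surrogates occupy U+D800..U+DFFF; the low (trailing) half starts at U+DC00.
is_surrogate = 0xD800 <= code_point <= 0xDFFF
is_low_surrogate = 0xDC00 <= code_point <= 0xDFFF
print(is_surrogate, is_low_surrogate)
```

Both range checks come out true for this value, so the bytes encode a low surrogate with no leading high surrogate before it.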

http://tools.ietf.org/html/rfc3629#section-4

explains:

      However, pairs of
      UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
      parlance), being actually UCS-4 characters transformed through
      UTF-16, need special treatment: the UTF-16 transformation must be
      undone, yielding a UCS-4 character that is then transformed as
      above.

which would suggest that it is invalid.
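
As I read it, the "undoing" step in that passage could be sketched roughly as below. This assumes the usual UTF-16 formula and a Python 3 interpreter (for chr and str.encode); combine_surrogates is a hypothetical name and the sample pair is arbitrary, not taken from any codec source.

```python
def combine_surrogates(high, low):
    """Undo the UTF-16 transformation for one high/low surrogate pair.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: U+%04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: U+%04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# An arbitrary well-formed pair, chosen only as sample data.
cp = combine_surrogates(0xD83D, 0xDE00)
print("U+%04X" % cp)  # U+1F600

# The combined code point is what gets encoded, as four bytes.
encoded = chr(cp).encode("utf-8")
print(encoded)  # b'\xf0\x9f\x98\x80'
```

A lone value such as U+DF2D has no partner to combine with, so there is nothing valid to encode for it.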

However, I think Wikipedia's explanation is a bit clearer:

UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.
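
To make the expected behaviour concrete, here is a small sketch of the rule quoted above, plus what I would expect a strict decoder to do. It assumes a Python 3 interpreter, where as far as I know the utf-8 codec rejects encoded surrogates unless the 'surrogatepass' error handler is requested; is_scalar_value is a hypothetical helper, not a standard library function.

```python
def is_scalar_value(cp):
    """True if cp is a Unicode scalar value (hypothetical helper)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


raw = b"\xed\xbc\xad"

# A strict UTF-8 decoder should refuse the sequence outright.
try:
    raw.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# Assumption: 'surrogatepass' is the explicit opt-in for lone surrogates.
lone = raw.decode("utf-8", "surrogatepass")

print(rejected)
print(is_scalar_value(ord(lone)))
```

The point of the sketch is only that rejecting the bytes by default, with an explicit opt-in for callers who really need surrogates, matches the quoted rule.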


Thanks,
Mike
History
Date                 User        Action  Args
2010-06-30 20:02:54  Mike.Lewis  set     recipients: + Mike.Lewis
2010-06-30 20:02:54  Mike.Lewis  set     messageid: <1277928174.53.0.503418777966.issue9133@psf.upfronthosting.co.za>
2010-06-30 20:02:53  Mike.Lewis  link    issue9133 messages
2010-06-30 20:02:51  Mike.Lewis  create