Message 169283 - Python tracker

Author	serhiy.storchaka
Recipients	Brian.Merrell, belopolsky, ezio.melotti, merrellb, petri.lehtinen, pitrou, rhettinger, serhiy.storchaka, tchrist, vstinner
Date	2012-08-28.13:58:28
Message-id	<1346162309.53.0.965792304651.issue11489@psf.upfronthosting.co.za>
In-reply-to

Content
> It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself.

It's UTF-8 too. See RFC 3629:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.
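The rule quoted from RFC 3629 can be checked with a short sketch, assuming Python 3 and its strict default error handler. The helper name `utf8_accepts` and the sample values are illustrative and not part of the original message.

```python
def utf8_accepts(text: str) -> bool:
    """Return True if the strict UTF-8 codec can encode `text` (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A lone surrogate code point is rejected when encoding to UTF-8 ...
lone_surrogate_ok = utf8_accepts("\ud800")

# ... and the byte sequence that would represent it is rejected when decoding.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# UTF-16 data is first decoded to a character number, then encoded as UTF-8.
utf16_pair = b"\xd8\x3d\xde\x00"       # surrogate pair D83D DE00, big-endian
char = utf16_pair.decode("utf-16-be")  # one code point, U+1F600
utf8_bytes = char.encode("utf-8")      # four-byte UTF-8 sequence

print(lone_surrogate_ok, decoded_ok, hex(ord(char)), utf8_bytes)
```

The surrogate pair is never encoded as two separate three-byte sequences; it is combined into one code point first, which is the procedure the RFC describes.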
