Author ezio.melotti
Recipients Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date 2011-09-03.00:28:02
Message-id <1315009683.69.0.880749172262.issue12729@psf.upfronthosting.co.za>
In-reply-to
Content
Or they are still called UTF-8 but used in combination with different error handlers, like surrogateescape and surrogatepass.  The "plain" UTF-* codecs should produce data that can be used for "open interchange", rejecting all the invalid data, both during encoding and decoding.
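
As a minimal sketch of the distinction (assuming current Python 3 behavior; the helper name try_encode is made up for illustration), the strict UTF-8 codec refuses a lone surrogate, while the same codec with the surrogatepass or surrogateescape error handler lets it through:

```python
# A lone low surrogate, e.g. what surrogateescape produces for the byte 0x80.
lone = "\udc80"

def try_encode(text, errors="strict"):
    """Hypothetical helper: return the UTF-8 bytes, or None if encoding fails."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None

# "Plain" UTF-8: the surrogate is rejected.
strict_result = try_encode(lone)

# Same codec, different error handlers: the surrogate is let through.
passed = try_encode(lone, "surrogatepass")     # b'\xed\xb2\x80'
escaped = try_encode(lone, "surrogateescape")  # b'\x80'

# The surrogatepass output is not valid UTF-8 for open interchange:
# the strict decoder refuses it.
try:
    passed.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False
```

Only the strict form is suitable for interchange; the other two are round-trip mechanisms for data that stays inside Python.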

Chapter 3, D79 of the Unicode Standard also says:
"""
To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.
"""

and this seems to imply that the only unencodable code points are the non-scalar values, i.e. surrogates and values above U+10FFFF.  Noncharacters thus shouldn't receive any special treatment (at least during encoding).
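
A rough sketch of that reading, assuming current Python 3 behavior (is_scalar_value is a name made up here, not an existing API): noncharacters are scalar values and round-trip through UTF-8, while surrogates are not scalar values and are rejected.

```python
def is_scalar_value(cp):
    """Hypothetical helper: True for any code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

# A few noncharacters: they are still Unicode scalar values.
noncharacters = [0xFDD0, 0xFFFE, 0xFFFF, 0x10FFFF]
encoded = {cp: chr(cp).encode("utf-8") for cp in noncharacters}
round_trips = all(encoded[cp].decode("utf-8") == chr(cp) for cp in noncharacters)

# A surrogate is not a scalar value, so the strict codec refuses it.
try:
    chr(0xD800).encode("utf-8")
    surrogate_encodable = True
except UnicodeEncodeError:
    surrogate_encodable = False

# Values above U+10FFFF cannot even be represented as a str.
try:
    chr(0x110000)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
```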

Tom, do you agree with this?  What does Perl do with them?