Message 143446 - Python tracker


Author	ezio.melotti
Recipients	Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date	2011-09-03.00:28:02
Message-id	<1315009683.69.0.880749172262.issue12729@psf.upfronthosting.co.za>
In-reply-to

Content
Or they are still called UTF-8 but used in combination with different error handlers, like surrogateescape and surrogatepass.  The "plain" UTF-* codecs should produce data that can be used for "open interchange", rejecting all the invalid data, both during encoding and decoding.
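
For illustration, here is a minimal sketch of how the strict codec and the two error handlers differ on a lone surrogate. It assumes the behaviour of a current CPython 3 interpreter, and the helper name `strict_utf8_rejects` is made up for this example.

```python
# Lone high surrogate: a code point, but not a Unicode scalar value.
LONE_SURROGATE = "\ud800"


def strict_utf8_rejects(text: str) -> bool:
    """Return True if the plain (strict) UTF-8 codec refuses to encode text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# surrogatepass lets the surrogate through as a three-byte sequence,
# which is not valid UTF-8 for open interchange.
passed = LONE_SURROGATE.encode("utf-8", "surrogatepass")

# surrogateescape smuggles undecodable bytes through str as lone low
# surrogates (U+DC80..U+DCFF) and restores them on encoding.
escaped = b"\xff".decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")

print(strict_utf8_rejects(LONE_SURROGATE), passed, repr(escaped), restored)
```

Only the explicitly requested handlers accept the surrogate; the plain codec stays strict in both directions.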

Chapter 03, D79 also says:
"""
To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.
"""

and this seems to imply that the only unencodable code points are the non-scalar values, i.e. surrogates and code points > U+10FFFF.  Noncharacters thus shouldn't receive any special treatment (at least during encoding).
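
As a rough sketch of that reading of D79, the check below treats a code point as encodable exactly when it is a Unicode scalar value. The function names `is_scalar_value` and `encodes_with_strict_utf8` are hypothetical, and the comparison against the codec assumes a current CPython 3 interpreter.

```python
def is_scalar_value(cp: int) -> bool:
    """True for U+0000..U+10FFFF, excluding the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encodes_with_strict_utf8(cp: int) -> bool:
    """True if the plain UTF-8 codec encodes the code point without error."""
    try:
        chr(cp).encode("utf-8")
    except (ValueError, UnicodeEncodeError):
        # chr() raises ValueError above U+10FFFF; the codec raises
        # UnicodeEncodeError for surrogates.
        return False
    return True


# Noncharacters (U+FDD0, U+FFFE, U+FFFF), a surrogate, and one value
# past the end of the code space.
samples = [0xFDD0, 0xFFFE, 0xFFFF, 0xD800, 0xDFFF, 0x10FFFF, 0x110000]
for cp in samples:
    print(f"U+{cp:04X}", is_scalar_value(cp), encodes_with_strict_utf8(cp))
```

Under these assumptions the two functions agree on every sample: the noncharacters encode like any other scalar value, and only the surrogates and the out-of-range value are rejected.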

Tom, do you agree with this?  What does Perl do with them?

History
Date	User	Action	Args
2011-09-03 00:28:03	ezio.melotti	set	recipients: + ezio.melotti, lemburg, gvanrossum, terry.reedy, pitrou, vstinner, jkloth, mrabarnett, Arfrever, v+python, r.david.murray, tchrist
2011-09-03 00:28:03	ezio.melotti	set	messageid: <1315009683.69.0.880749172262.issue12729@psf.upfronthosting.co.za>
2011-09-03 00:28:03	ezio.melotti	link	issue12729 messages
2011-09-03 00:28:02	ezio.melotti	create