Message 144289 - Python tracker

Author	lemburg
Recipients	Arfrever, abacabadabacaba, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date	2011-09-19.15:28:34
SpamBayes Score	9.436896e-16
Marked as misclassified	No
Message-id	<4E775F94.7010901@egenix.com>
In-reply-to	<12256.1316385915@chthon>

Content
Tom Christiansen wrote:
> 
> I'm pretty sure that anything that claims to be UTF-{8,16,32} needs  
> to reject both surrogates *and* noncharacters. Here's something from the
> published Unicode Standard's p.24 about noncharacter code points:
> 
>     • Noncharacter code points are reserved for internal use, such as for 
>       sentinel values. They should never be interchanged. They do, however,
>       have well-formed representations in Unicode encoding forms and survive
>       conversions between encoding forms. This allows sentinel values to be
>       preserved internally across Unicode encoding forms, even though they are
>       not designed to be used in open interchange.
> 
> And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:
> 
>     C2 A process shall not interpret a noncharacter code point as an 
>        abstract character.
> 
>         • The noncharacter code points may be used internally, such as for 
>           sentinel values or delimiters, but should not be exchanged publicly.

You have to remember that Python is used to build applications. It's
up to the applications to conform to Unicode or not and the
application also defines what "exchange" means in the above context.

Python itself needs to be able to deal with assigned non-character
code points as well as unassigned code points or code points that
are part of special ranges such as the surrogate ranges.

I'm +1 on not allowing e.g. lone surrogates in UTF-8 data, because
we have a way to optionally allow these via an error handler,
but -1 on making changes that cause full range round-trip safety
of the UTF encodings to be lost without a way to turn the functionality
back on.
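
To illustrate the distinction, here is a minimal sketch, assuming a current
CPython 3 interpreter (behaviour at the time of this message may have
differed). The variable and function names are made up for illustration. It
shows a lone surrogate being rejected by the strict UTF-8 codec, allowed again
through the "surrogatepass" error handler, and a noncharacter code point
surviving a round trip through the UTF encodings:

```python
# Assumption: CPython 3; all names below are illustrative only.
lone_surrogate = "\ud800"   # a code point from the surrogate range
noncharacter = "\uffff"     # one of the Unicode noncharacter code points


def encodes_strictly(text, encoding="utf-8"):
    """Return True if text encodes with the default 'strict' error handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The strict codec refuses a lone surrogate ...
strict_ok = encodes_strictly(lone_surrogate)

# ... but the "surrogatepass" error handler turns the functionality back on.
raw = lone_surrogate.encode("utf-8", "surrogatepass")
restored = raw.decode("utf-8", "surrogatepass")

# A noncharacter has a well-formed representation and round-trips as is.
roundtrips = {
    enc: noncharacter.encode(enc).decode(enc) == noncharacter
    for enc in ("utf-8", "utf-16", "utf-32")
}

print(strict_ok, raw, restored == lone_surrogate, roundtrips)
```

Whether an application should then pass such data on to other systems is a
separate question, which is left to the application.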

History
Date	User	Action	Args
2011-09-19 15:28:35	lemburg	set	recipients: + lemburg, gvanrossum, terry.reedy, pitrou, vstinner, jkloth, ezio.melotti, mrabarnett, Arfrever, v+python, r.david.murray, abacabadabacaba, tchrist
2011-09-19 15:28:35	lemburg	link	issue12729 messages
2011-09-19 15:28:34	lemburg	create