Author lemburg
Recipients Arfrever, abacabadabacaba, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date 2011-09-19.15:28:34
Message-id <4E775F94.7010901@egenix.com>
In-reply-to <12256.1316385915@chthon>
Content
Tom Christiansen wrote:
> 
> I'm pretty sure that anything that claims to be UTF-{8,16,32} needs  
> to reject both surrogates *and* noncharacters. Here's something from the
> published Unicode Standard's p.24 about noncharacter code points:
> 
>     • Noncharacter code points are reserved for internal use, such as for 
>       sentinel values. They should never be interchanged. They do, however,
>       have well-formed representations in Unicode encoding forms and survive
>       conversions between encoding forms. This allows sentinel values to be
>       preserved internally across Unicode encoding forms, even though they are
>       not designed to be used in open interchange.
> 
> And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:
> 
>     C2 A process shall not interpret a noncharacter code point as an 
>        abstract character.
> 
>         • The noncharacter code points may be used internally, such as for 
>           sentinel values or delimiters, but should not be exchanged publicly.

You have to remember that Python is used to build applications. It's
up to the applications to conform to Unicode or not and the
application also defines what "exchange" means in the above context.

Python itself needs to be able to deal with assigned non-character
code points as well as unassigned code points or code points that
are part of special ranges such as the surrogate ranges.
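
As a rough illustration of the point above (not part of the original message), the sketch below shows that a Python 3 str can hold noncharacters and lone surrogates, and that the standard unicodedata module classifies them. The sample code points and the helper name describe are chosen here for illustration only.

```python
import unicodedata

# Sample code points picked for illustration (an assumption of this sketch):
# two noncharacters and one lone high surrogate.
samples = {
    "noncharacter U+FFFF": "\uffff",
    "noncharacter U+FDD0": "\ufdd0",
    "lone high surrogate U+D800": "\ud800",
}


def describe(ch):
    """Return the code point label and its Unicode general category."""
    return f"U+{ord(ch):04X}", unicodedata.category(ch)


# A str stores any of these code points without complaint.
text = "".join(samples.values())

categories = {label: describe(ch) for label, ch in samples.items()}
for label, (code_point, category) in categories.items():
    print(label, code_point, category)
```

Noncharacters report the general category Cn and surrogates report Cs; whether such values may be stored or exchanged is a decision left to the application.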

I'm +1 on not allowing e.g. lone surrogates in UTF-8 data, because
we have a way to optionally allow these via an error handler,
but -1 on making changes that cause full range round-trip safety
of the UTF encodings to be lost without a way to turn the functionality
back on.
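
A minimal sketch of the behaviour being argued for, assuming a current CPython 3 (the message predates the final resolution of this issue): noncharacters round-trip through the UTF codecs in strict mode, while lone surrogates are rejected by UTF-8 unless the surrogatepass error handler is requested. The helper name round_trips is hypothetical.

```python
def round_trips(text, encoding, errors="strict"):
    """Return True if text survives encode followed by decode unchanged."""
    try:
        return text.encode(encoding, errors).decode(encoding, errors) == text
    except UnicodeError:
        # Raised when the codec refuses the code point under this handler.
        return False


noncharacters = "\ufdd0\uffff"
lone_surrogate = "\ud800"

# Noncharacters have well-formed representations in every encoding form.
nonchar_results = {
    enc: round_trips(noncharacters, enc)
    for enc in ("utf-8", "utf-16", "utf-32")
}

# Lone surrogates are rejected by default in UTF-8 ...
strict_result = round_trips(lone_surrogate, "utf-8")
# ... but can be allowed explicitly through an error handler.
opt_in_result = round_trips(lone_surrogate, "utf-8", errors="surrogatepass")

print(nonchar_results, strict_result, opt_in_result)
```

The error-handler argument is the opt-in switch the message refers to: strictness is the default, and the caller can turn the permissive behaviour back on.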
History
Date User Action Args
2011-09-19 15:28:35  lemburg  set  recipients: + lemburg, gvanrossum, terry.reedy, pitrou, vstinner, jkloth, ezio.melotti, mrabarnett, Arfrever, v+python, r.david.murray, abacabadabacaba, tchrist
2011-09-19 15:28:35  lemburg  link  issue12729 messages
2011-09-19 15:28:34  lemburg  create