Author ezio.melotti
Recipients Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date 2011-09-02.10:05:13
Message-id <1314957915.25.0.476621110173.issue12729@psf.upfronthosting.co.za>
Content
> To start with, no code point which when bitwise added with 0xFFFE 
> returns 0xFFFE can never appear in a valid UTF-* stream, but Python
> allow this without any error.

> That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all 
> planes, where NN is 00 through 10 in hex.  So that's 2 noncharacters
> times 17 planes = 34 code points illegal for interchange that Python 
> is passing through illegally.  

> The remaining 32 nonsurrogate code points illegal for open interchange
> are 0xFDD0 through 0xFDEF.  Those are not allowed either, but Python
> doesn't seem to care.

It's not entirely clear to me what the UTF-8 codec is supposed to do with this.

For example U+FFFE is <EF BF BE> in UTF-8, and this is valid according to table 3-7, Chapter 03[0]:
"""
Code points     1st byte  2nd byte  3rd byte
U+E000..U+FFFF  EE..EF    80..BF    80..BF
"""
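
As a quick check of the byte sequence above, here is a minimal sketch, assuming a current CPython 3 interpreter (other versions or implementations may behave differently). The variable names are only illustrative:

```python
# U+FFFE is a noncharacter, but it falls in the U+E000..U+FFFF row of table 3-7.
noncharacter = "\ufffe"

# Both calls use the default 'strict' error handler.
encoded = noncharacter.encode("utf-8")
decoded = encoded.decode("utf-8")

print(encoded)                  # b'\xef\xbf\xbe'
print(decoded == noncharacter)  # True
```

With this interpreter the codec round-trips the code point without raising, which matches the table but says nothing about what section 16.7 expects.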

Chapter 16, section 16.7 "Noncharacters" says[1]:
"""
Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data.
"""

and
"""
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them.
"""
Both passages seem to suggest that encoding them is forbidden.

The same chapter also says:
"""
If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD replacement character, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.
"""
Here decoding seems allowed, possibly with a replacement (though that would depend on the error handler in use, so the default 'strict' would turn this into an error).
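
To make the "replace with U+FFFD" option concrete, here is a rough sketch of what it could look like if done after decoding. It assumes a current CPython 3, where the UTF-8 codec does not report noncharacters as errors, so the error handler is never consulted for them. `is_noncharacter` and `replace_noncharacters` are hypothetical helpers, not existing codec APIs:

```python
def is_noncharacter(cp):
    """Return True if cp is one of the 66 noncharacter code points."""
    # U+FDD0..U+FDEF (32 code points), plus U+nnFFFE and U+nnFFFF
    # in each of the 17 planes (34 code points).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def replace_noncharacters(text):
    """Return text with every noncharacter replaced by U+FFFD."""
    return "".join("\ufffd" if is_noncharacter(ord(ch)) else ch for ch in text)


data = b"a\xef\xbf\xbeb"             # 'a', U+FFFE, 'b'
text = data.decode("utf-8")          # decodes without error
cleaned = replace_noncharacters(text)

# Count the noncharacters over the whole code space, U+0000..U+10FFFF.
total = sum(is_noncharacter(cp) for cp in range(0x110000))
print(ascii(cleaned), total)
```

Whether this belongs in the codec itself (via the error handler) or in a separate step like this one is exactly the open question.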


Chapter 03, after D14, says:
"""
In general, a conforming process may indicate the presence of a code point whose use has not been designated (for example, by showing a missing glyph in rendering or by signaling an appropriate error in a streaming protocol), even though it is forbidden by the standard from interpreting that code point as an abstract character.
"""

and in C7:
"""
If a noncharacter that does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or replace the noncharacter with U+FFFD replacement character. If the implementation chooses to replace, delete or ignore a noncharacter, such an action constitutes a modification in the interpretation of the text. In general, a noncharacter should be treated as an unassigned code point.
"""

None of this states clearly what the codec is supposed to do.
On the one hand, it suggests that an error can be raised, i.e. that noncharacters can be considered invalid, like out-of-range code points (>U+10FFFF) or lone surrogates.
On the other hand, it says that they should be treated as unassigned code points, i.e. encoded/decoded normally, and table 3-7 doesn't list them as invalid.
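
For comparison, the following sketch shows how the three cases are handled in practice. It assumes a current CPython 3 (the results may differ on other versions), and `outcome` is just a hypothetical helper for reporting what happened:

```python
def outcome(action):
    """Run action() and return 'ok' or the name of the UnicodeError raised."""
    try:
        action()
    except UnicodeError as exc:
        return type(exc).__name__
    return "ok"


results = {
    # A lone surrogate cannot be encoded with the 'strict' handler.
    "lone surrogate U+D800": outcome(lambda: "\ud800".encode("utf-8")),
    # F4 90 80 80 would be U+110000, which is out of range.
    "out of range F4 90 80 80": outcome(lambda: b"\xf4\x90\x80\x80".decode("utf-8")),
    # A noncharacter is encoded and decoded like any other code point.
    "noncharacter U+FFFF": outcome(lambda: "\uffff".encode("utf-8").decode("utf-8")),
}

for label, result in results.items():
    print(label, "->", result)
```

So on this interpreter noncharacters are treated like unassigned code points (the second reading), while lone surrogates and out-of-range sequences are rejected.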


[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
[1]: http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf