
Author tchrist
Recipients Rhamphoryncus, amaury.forgeotdarc, belopolsky, doerwalter, eric.smith, ezio.melotti, georg.brandl, lemburg, loewis, pitrou, rhettinger, stutzbach, tchrist, vstinner
Date 2011-08-16.11:42:58
Message-id <8895.1313494957@chthon>
In-reply-to <1313485930.8.0.601749695449.issue10542@psf.upfronthosting.co.za>
Content
I now see there are lots of good things in the BOM FAQ that have come up
lately regarding surrogates and other illegal characters, and about what
can go in data streams.  

I quote a few of these from http://unicode.org/faq/utf_bom.html below:

    Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? 

    A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. 
       By representing such an *unpaired* surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream
       would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires
       that encoding form conversion always results in valid data stream. Therefore a converter *must* treat this
       as an error.

    Q: How do I convert an unpaired UTF-16 surrogate to UTF-32? 

    A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must
       treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream
       would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that
       encoding form conversion always results in valid data stream.

    Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining
       UTF-8 bytes are in big-endian order?

    A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8
       always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise
       unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8
       is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format
       that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix
       shell scripts.

    Q: What should I do with U+FEFF in the middle of a file?

    A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF
       should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE
       (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly
       preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When
       designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In
       that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.

    Q: How do I tag data that does not interpret U+FEFF as a BOM?

    A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. 
       If you do use a BOM, tag the text as simply UTF-16. 

    Q: Why wouldn’t I always use a protocol that requires a BOM?

    A: Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, 
       if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary *nor
       permitted*. Any U+FEFF would be interpreted as a ZWNBSP.  Do not tag every string in a database or set of fields
       with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields
       may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM).
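
For what it's worth, here is a quick Python 3 sketch of how those points play out in
practice today; the sample strings are mine, not the FAQ's:

    # 1. An unpaired surrogate is an error for a conformant UTF-8 encoder:
    lone = "\ud800"                              # a lone high surrogate
    try:
        lone.encode("utf-8")                     # strict conversion refuses it
    except UnicodeEncodeError as err:
        print("rejected:", err.reason)           # "surrogates not allowed"
    # Only an explicit, non-conformant escape hatch lets it through:
    print(lone.encode("utf-8", "surrogatepass"))  # b'\xed\xa0\x80', ill-formed UTF-8

    # 2. In UTF-8 a BOM is only a signature, not a byte-order indicator, and the
    #    byte-order-specific labels carry no BOM at all:
    print("hi".encode("utf-8-sig"))              # b'\xef\xbb\xbfhi' -- BOM prepended
    print("hi".encode("utf-16"))                 # BOM followed by native-endian code units
    print("hi".encode("utf-16-be"))              # b'\x00h\x00i' -- no BOM, big-endian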

Somewhat frustratingly, I am now almost more confused than ever by the last two sentences here:

    Q: What is a UTF?

    A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate
       code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for
       UTF; the two terms are merely synonyms for the same concept.

       Each UTF is reversible, thus every UTF supports *lossless round tripping*: mapping from any Unicode coded
       character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF
       mapping *must also* map all code points that are not valid Unicode characters to unique byte sequences. These
       invalid code points are the 66 *noncharacters* (including FFFE and FFFF), as well as unpaired surrogates.

My confusion is about the invalid code points. The first two FAQs I cite at the top are quite clear that it is illegal
to have unpaired surrogates in a UTF stream.  I don’t understand therefore what it is saying about “must also” mapping
all code points that aren’t valid Unicode characters to “unique byte sequences” to ensure roundtripping.  At first
reading, I’d almost say those appear to contradict each other.  I must just be being boneheaded though.  It’s very
early morning yet, and maybe it will become clearer upon a fifth or sixth reading.  Maybe it has to do with replacement
characters?  No, that can’t be right.  Muddle muddle.  Sigh.
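
Poking at it, here is what Python 3 actually does with the two kinds of code points the
FAQ lumps together as “not valid Unicode characters” (the labels in the comments are mine):

    noncharacter = "\uffff"              # U+FFFF, one of the 66 noncharacters
    # Round-trips through UTF-8 without complaint:
    assert noncharacter.encode("utf-8").decode("utf-8") == noncharacter

    lone_surrogate = "\ud800"            # an unpaired surrogate
    try:
        lone_surrogate.encode("utf-8")   # ... but this one is rejected outright,
    except UnicodeEncodeError:           #     exactly as the first two FAQs demand
        print("unpaired surrogate refused")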

Important material is also found in http://www.unicode.org/faq/basic_q.html:

    Q: Are surrogate characters the same as supplementary characters?

    A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range
       U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate
       code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

       There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but
       there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate
       code point).

    Q: What is the difference between UCS-2 and UTF-16?

    A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code
       points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

       UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange.
       Both are 16-bit, and have exactly the same code unit representation.

       Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary
       characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not
       handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.
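
A tiny Python 3 demonstration of the distinction (the character chosen is arbitrary):

    g_clef = "\U0001D11E"                        # a *supplementary* code point
    print(g_clef.encode("utf-16-be").hex())      # 'd834dd1e' -- a surrogate *pair*
    print(len(g_clef.encode("utf-16-be")) // 2)  # 2 code units for 1 character
    # D834 and DD1E are surrogate code points; neither is a character on its own,
    # which is why a UCS-2-only implementation cannot see U+1D11E as one character.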

And in reference to UTF-16 being slower by code point than by code unit:

    Q: How about using UTF-32 interfaces in my APIs?

    A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16
       APIs  the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or
       words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the
       required functionality at the high levels.

        If its [sic] ever necessary to locate the nᵗʰ character, indexing by character can be implemented as a high
        level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa
        is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run,
        for example, accessing UTF-16 storage as characters, instead of code units resulted in a 10× degradation. While
        there are some interesting optimizations that can be performed, it will always be slower on average. Therefore
        locating other boundaries, such as grapheme, word, line or sentence boundaries proceeds directly from the code
        unit index, not indirectly via an intermediate character code index.
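
The scan they are describing looks roughly like this (a sketch of mine, operating on a
plain list of 16-bit code units rather than any particular string type):

    def code_point_index(units, cu_index):
        """Convert a UTF-16 code-unit index into a code-point index by scanning."""
        cp = i = 0
        while i < cu_index:
            # A lead surrogate followed by a trail surrogate is one code point
            # occupying two code units; everything else is one-for-one.
            if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                i += 2
            else:
                i += 1
            cp += 1
        return cp

    # U+1D11E is the surrogate pair D834 DD1E, so code unit 3 is code point 2:
    print(code_point_index([0x0041, 0xD834, 0xDD1E, 0x0042], 3))   # -> 2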

I am somewhat amused by this summary:

    Q: What does Unicode conformance require?

    A: Chapter 3, Conformance discusses this in detail. Here's a very informal version: 

        * Unicode characters don't fit in 8 bits; deal with it.
        * 2 [sic] Byte order is only an issue in I/O.
        * If you don't know, assume big-endian.
        * Loose surrogates have no meaning.
        * Neither do U+FFFE and U+FFFF.
        * Leave the unassigned codepoints alone.
        * It's OK to be ignorant about a character, but not plain wrong.
        * Subsets are strictly up to you.
        * Canonical equivalence matters.
        * Don't garble what you don't understand.
        * Process UTF-* by the book.
        * Ignore illegal encodings.
        * Right-to-left scripts have to go by bidi rules. 

And I don’t know what I think about this, except that there sure are a lot of
screw‐ups out there if it is truly as easy as they would have you believe:

    Given that any industrial-strength text and internationalization support API has to be able to handle sequences of
    characters, it makes little difference whether the string is internally represented by a sequence of [...] code
    units, or by a sequence of code-points [...]. Both UTF-16 and UTF-8 are designed to make working with substrings
    easy, by the fact that the sequence of code units for a given code point is unique.
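
A small illustration of the “unique sequence” property that claim rests on: in UTF-8,
no character’s encoding ever occurs inside another character’s encoding, so a raw
byte-level search cannot produce a false match (the sample word is mine):

    hay, needle = "naïveté", "é"
    hay_b, needle_b = hay.encode("utf-8"), needle.encode("utf-8")
    print(hay.find(needle))        # 6 -- character index
    print(hay_b.find(needle_b))    # 7 -- byte index differs, but the hit is genuine:
    # the earlier 'ï' shares the lead byte 0xC3 with 'é', yet its full sequence
    # C3 AF never matches C3 A9.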

Take this all with a grain of salt, since I found various typos in these FAQs
and occasionally also language that seems to reflect an older nomenclature than
is now seen in the current published Unicode Standard, meaning 6.0.0.  Probably
best then to take only general directives from their FAQs and leave language‐
lawyering to the formal printed Standard, insofar as that is possible — which
sometimes it is not, because they do make mistakes from time to time, and even
less frequently, correct these.  :)

--tom