
Author tchrist
Recipients Rhamphoryncus, amaury.forgeotdarc, belopolsky, doerwalter, eric.smith, ezio.melotti, georg.brandl, lemburg, loewis, pitrou, rhettinger, stutzbach, tchrist, vstinner
Date 2011-08-16.11:42:58
Message-id <8895.1313494957@chthon>
In-reply-to <1313485930.8.0.601749695449.issue10542@psf.upfronthosting.co.za>
Content
I now see there are lots of good things in the BOM FAQ that have come up
lately regarding surrogates and other illegal characters, and about what
can go in data streams.  

I quote a few of these from http://unicode.org/faq/utf_bom.html below:

    Q: How do I convert an unpaired UTF-16 surrogate to UTF-8? 

    A: A different issue arises if an unpaired surrogate is encountered when converting ill-formed UTF-16 data. 
       By representing such an *unpaired* surrogate on its own as a 3-byte sequence, the resulting UTF-8 data stream
       would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires
       that encoding form conversion always results in valid data stream. Therefore a converter *must* treat this
       as an error.

    Q: How do I convert an unpaired UTF-16 surrogate to UTF-32? 

    A: If an unpaired surrogate is encountered when converting ill-formed UTF-16 data, any conformant converter must
       treat this as an error. By representing such an unpaired surrogate on its own, the resulting UTF-32 data stream
       would become ill-formed. While it faithfully reflects the nature of the input, Unicode conformance requires that
       encoding form conversion always results in valid data stream.

    Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining
       UTF-8 bytes are in big-endian order?

    A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8
       always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise
       unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8
       is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format
       that expects specific ASCII characters at the beginning, such as the use of "#!" at the beginning of Unix
       shell scripts.

    Q: What should I do with U+FEFF in the middle of a file?

    A: In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF
       should normally not occur. For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE
       (ZWNBSP), and is then part of the content of the file or string. The use of U+2060 WORD JOINER is strongly
       preferred over ZWNBSP for expressing word joining semantics since it cannot be confused with a BOM. When
       designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In
       that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.

    Q: How do I tag data that does not interpret U+FEFF as a BOM?

    A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. 
       If you do use a BOM, tag the text as simply UTF-16. 

    Q: Why wouldn’t I always use a protocol that requires a BOM?

    A: Where the data has an associated type, such as a field in a database, a BOM is unnecessary. In particular, 
       if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary *nor
       permitted*. Any U+FEFF would be interpreted as a ZWNBSP.  Do not tag every string in a database or set of fields
       with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields
       may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM).
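
For what it's worth, here is a quick Python 3 sketch of how those points play out in
practice today; the sample strings are mine, not the FAQ's:

    # 1. An unpaired surrogate is an error for a conformant UTF-8 encoder:
    lone = "\ud800"                              # a lone high surrogate
    try:
        lone.encode("utf-8")                     # strict conversion refuses it
    except UnicodeEncodeError as err:
        print("rejected:", err.reason)           # "surrogates not allowed"
    # Only an explicit, non-conformant escape hatch lets it through:
    print(lone.encode("utf-8", "surrogatepass"))  # b'\xed\xa0\x80', ill-formed UTF-8

    # 2. In UTF-8 a BOM is only a signature, not a byte-order indicator, and the
    #    byte-order-specific labels carry no BOM at all:
    print("hi".encode("utf-8-sig"))              # b'\xef\xbb\xbfhi' -- BOM prepended
    print("hi".encode("utf-16"))                 # BOM followed by native-endian code units
    print("hi".encode("utf-16-be"))              # b'\x00h\x00i' -- no BOM, big-endian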

Somewhat frustratingly, I am now almost more confused than ever by the last two sentences here:

    Q: What is a UTF?

    A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate
       code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for
       UTF; the two terms are merely synonyms for the same concept.

       Each UTF is reversible, thus every UTF supports *lossless round tripping*: mapping from any Unicode coded
       character sequence S to a sequence of bytes and back will produce S again. To ensure round tripping, a UTF
       mapping *must also* map all code points that are not valid Unicode characters to unique byte sequences. These
       invalid code points are the 66 *noncharacters* (including FFFE and FFFF), as well as unpaired surrogates.

My confusion is about the invalid code points. The first two FAQs I cite at the top are quite clear that it is illegal
to have unpaired surrogates in a UTF stream.  I don’t understand therefore what it is saying about “must also” mapping
all code points that aren’t valid Unicode characters to “unique byte sequences” to ensure roundtripping.  At first
reading, I’d almost say those appear to contradict each other.  I must just be being boneheaded though.  It’s very
early morning yet, and maybe it will become clearer upon a fifth or sixth reading.  Maybe it has to do with replacement
characters?  No, that can’t be right.  Muddle muddle.  Sigh.
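
Poking at it, here is what Python 3 actually does with the two kinds of code points the
FAQ lumps together as “not valid Unicode characters” (the labels in the comments are mine):

    noncharacter = "\uffff"              # U+FFFF, one of the 66 noncharacters
    # Round-trips through UTF-8 without complaint:
    assert noncharacter.encode("utf-8").decode("utf-8") == noncharacter

    lone_surrogate = "\ud800"            # an unpaired surrogate
    try:
        lone_surrogate.encode("utf-8")   # ... but this one is rejected outright,
    except UnicodeEncodeError:           #     exactly as the first two FAQs demand
        print("unpaired surrogate refused")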

Important material is also found in http://www.unicode.org/faq/basic_q.html:

    Q: Are surrogate characters the same as supplementary characters?

    A: This question shows a common confusion. It is very important to distinguish surrogate code points (in the range
       U+D800..U+DFFF) from supplementary code points (in the completely different range, U+10000..U+10FFFF). Surrogate
       code points are reserved for use, in pairs, in representing supplementary code points in UTF-16.

       There are supplementary characters (i.e. encoded characters represented with a single supplementary code point), but
       there are not and will never be surrogate characters (i.e. encoded characters represented with a single surrogate
       code point).

    Q: What is the difference between UCS-2 and UTF-16?

    A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code
       points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

       UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange.
       Both are 16-bit, and have exactly the same code unit representation.

       Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary
       characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not
       handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.
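
A tiny Python 3 demonstration of the distinction (the character chosen is arbitrary):

    g_clef = "\U0001D11E"                        # a *supplementary* code point
    print(g_clef.encode("utf-16-be").hex())      # 'd834dd1e' -- a surrogate *pair*
    print(len(g_clef.encode("utf-16-be")) // 2)  # 2 code units for 1 character
    # D834 and DD1E are surrogate code points; neither is a character on its own,
    # which is why a UCS-2-only implementation cannot see U+1D11E as one character.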

And in reference to UTF-16 being slower by code point than by code unit:

    Q: How about using UTF-32 interfaces in my APIs?

    A: Except in some environments that store text as UTF-32 in memory, most Unicode APIs are using UTF-16. With UTF-16
       APIs  the low level indexing is at the storage or code unit level, with higher-level mechanisms for graphemes or
       words specifying their boundaries in terms of the code units. This provides efficiency at the low levels, and the
       required functionality at the high levels.

        If its [sic] ever necessary to locate the nᵗʰ character, indexing by character can be implemented as a high
        level operation. However, while converting from such a UTF-16 code unit index to a character index or vice versa
        is fairly straightforward, it does involve a scan through the 16-bit units up to the index point. In a test run,
        for example, accessing UTF-16 storage as characters, instead of code units resulted in a 10× degradation. While
        there are some interesting optimizations that can be performed, it will always be slower on average. Therefore
        locating other boundaries, such as grapheme, word, line or sentence boundaries proceeds directly from the code
        unit index, not indirectly via an intermediate character code index.
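
The scan they are describing looks roughly like this (a sketch of mine, operating on a
plain list of 16-bit code units rather than any particular string type):

    def code_point_index(units, cu_index):
        """Convert a UTF-16 code-unit index into a code-point index by scanning."""
        cp = i = 0
        while i < cu_index:
            # A lead surrogate followed by a trail surrogate is one code point
            # occupying two code units; everything else is one-for-one.
            if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                i += 2
            else:
                i += 1
            cp += 1
        return cp

    # U+1D11E is the surrogate pair D834 DD1E, so code unit 3 is code point 2:
    print(code_point_index([0x0041, 0xD834, 0xDD1E, 0x0042], 3))   # -> 2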

I am somewhat amused by this summary:

    Q: What does Unicode conformance require?

    A: Chapter 3, Conformance discusses this in detail. Here's a very informal version: 

        * Unicode characters don't fit in 8 bits; deal with it.
        * 2 [sic] Byte order is only an issue in I/O.
        * If you don't know, assume big-endian.
        * Loose surrogates have no meaning.
        * Neither do U+FFFE and U+FFFF.
        * Leave the unassigned codepoints alone.
        * It's OK to be ignorant about a character, but not plain wrong.
        * Subsets are strictly up to you.
        * Canonical equivalence matters.
        * Don't garble what you don't understand.
        * Process UTF-* by the book.
        * Ignore illegal encodings.
        * Right-to-left scripts have to go by bidi rules. 

And I don’t know what I think about this, except that there sure are a lot of
screw‐ups out there if it is truly as easy as they would have you believe:

    Given that any industrial-strength text and internationalization support API has to be able to handle sequences of
    characters, it makes little difference whether the string is internally represented by a sequence of [...] code
    units, or by a sequence of code-points [...]. Both UTF-16 and UTF-8 are designed to make working with substrings
    easy, by the fact that the sequence of code units for a given code point is unique.
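
A small illustration of the “unique sequence” property that claim rests on: in UTF-8,
no character’s encoding ever occurs inside another character’s encoding, so a raw
byte-level search cannot produce a false match (the sample word is mine):

    hay, needle = "naïveté", "é"
    hay_b, needle_b = hay.encode("utf-8"), needle.encode("utf-8")
    print(hay.find(needle))        # 6 -- character index
    print(hay_b.find(needle_b))    # 7 -- byte index differs, but the hit is genuine:
    # the earlier 'ï' shares the lead byte 0xC3 with 'é', yet its full sequence
    # C3 AF never matches C3 A9.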

Take this all with a grain of salt, since I found various typos in these FAQs
and occasionally also language that seems to reflect an older nomenclature than
is now seen in the current published Unicode Standard, meaning 6.0.0.  Probably
best then to take only general directives from their FAQs and leave language‐
lawyering to the formal printed Standard, insofar as that is possible — which
sometimes it is not, because they do make mistakes from time to time, and even
less frequently, correct these.  :)

--tom