Message 143702 - Python tracker


Author	tchrist
Recipients	Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date	2011-09-07.19:26:43
Message-id	<24315.1315423582@chthon>
In-reply-to	<1315009683.69.0.880749172262.issue12729@psf.upfronthosting.co.za>

Content

Ezio Melotti <report@bugs.python.org> wrote
   on Sat, 03 Sep 2011 00:28:03 -0000: 

> Ezio Melotti <ezio.melotti@gmail.com> added the comment:

> Or they are still called UTF-8 but used in combination with different error
> handlers, like surrogateescape and surrogatepass.  The "plain" UTF-* codecs
> should produce data that can be used for "open interchange", rejecting all the
> invalid data, both during encoding and decoding.

> Chapter 03, D79 also says:

>        To ensure that the mapping for a Unicode encoding form is one-to-one,
>        all Unicode scalar values, including those corresponding to
>        noncharacter code points and unassigned code points, must be mapped to
>        unique code unit sequences. Note that this requirement does not extend
>        to high-surrogate and low-surrogate code points, which are excluded by
>        definition from the set of Unicode scalar values.

> and this seems to imply that the only unencodable codepoint are the non-scalar
> values, i.e. surrogates and codepoints >U+10FFFF.  Noncharacters shouldn't
> thus receive any special treatment (at least during encoding).

> Tom, do you agree with this?  What does Perl do with them?

I agree that one needs to be able to encode any scalar value and
store it in memory in a designated character encoding form.

This is different from streams, though.

The 3 different Unicode "character encoding *forms*" -- UTF-8,
UTF-16, and UTF-32 -- certainly need to support all possible
scalar values.  These are the forms used to store code points in
memory.  They do not have BOMs, because one knows one's memory
layout.   These are specifically allowed to contain the
noncharacters:

    http://www.unicode.org/reports/tr17/#CharacterEncodingForm

    The third type is peculiar to the Unicode Standard: the noncharacter.
    This is a kind of internal-use user-defined character, not intended for
    public interchange.
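
A minimal Python 3 sketch of the same point, assuming CPython's built-in
"utf-8" codec: used on an in-memory string, it encodes a noncharacter such
as U+FDD0 like any other scalar value, and refuses only surrogates unless
the "surrogatepass" error handler is requested.  The variable names are
illustrative only.

```python
# Sketch: how Python 3's built-in UTF-8 codec treats a noncharacter
# versus a lone surrogate.

nonchar = "\ufdd0"                      # one of the 66 noncharacters
encoded = nonchar.encode("utf-8")       # a scalar value, so it encodes
roundtrip = encoded.decode("utf-8")     # and decodes back unchanged

lone_surrogate = "\ud800"               # not a Unicode scalar value
try:
    lone_surrogate.encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

# The surrogatepass error handler explicitly opts out of that check.
passed = lone_surrogate.encode("utf-8", "surrogatepass")

print(encoded, surrogate_rejected, passed)
```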

The problem is that one must make a clean distinction between character
encoding *forms* and character encoding *schemes*.

    http://www.unicode.org/reports/tr17/#CharacterEncodingScheme

    It is important not to confuse a Character Encoding Form (CEF) and a CES.

    1. The CEF maps code points to code units, while the CES transforms
       sequences of code units to byte sequences.
    2. The CES must take into account the byte-order serialization of
       all code units wider than a byte that are used in the CEF.
    3. Otherwise identical CESs may differ in other aspects, such as the
       number of user-defined characters allowed.

    Some of the Unicode encoding schemes have the same labels as the three
    Unicode encoding forms. [...]

    As encoding schemes, UTF-16 and UTF-32 refer to serialized bytes, for
    example the serialized bytes for streaming data or in files; they may have
    either byte orientation, and a single BOM may be present at the start of the
    data. When the usage of the abbreviated designators UTF-16 or UTF-32 might
    be misinterpreted, and where a distinction between their use as referring to
    Unicode encoding forms or to Unicode encoding schemes is important, the full
    terms should be used. For example, use UTF-16 encoding form or UTF-16
    encoding scheme. They may also be abbreviated to UTF-16 CEF or UTF-16 CES,
    respectively.

    The Unicode Standard has seven character encoding schemes: UTF-8, UTF-16,
    UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

        * UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are simple CESs.

        * UTF-16 and UTF-32 are compound CESs, consisting of a single, optional
          byte order mark at the start of the data followed by a simple CES.
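
The simple versus compound distinction is visible in Python's codec names.
The following is a small sketch, assuming the standard "utf-16",
"utf-16-be" and "utf-16-le" codecs: the compound scheme writes a BOM and
consumes it again on decoding, while a simple scheme writes none and treats
a leading U+FEFF as ordinary data.

```python
import codecs

text = "A"

# Simple CESs: fixed byte order, no BOM.
simple_be = text.encode("utf-16-be")
simple_le = text.encode("utf-16-le")

# Compound CES: a BOM followed by one of the simple schemes
# (which byte order is chosen depends on the platform).
compound = text.encode("utf-16")
has_bom = compound[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# Decoding with the compound scheme consumes the BOM ...
decoded = compound.decode("utf-16")

# ... whereas a simple scheme keeps U+FEFF as part of the text.
kept = (codecs.BOM_UTF16_BE + simple_be).decode("utf-16-be")

print(simple_be, simple_le, has_bom, repr(decoded), repr(kept))
```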

I believe that what this comes down to is that you can have noncharacters in memory
as a CEF, but that you cannot have them in a CES meant for open interchange.
And what you do privately is a different, third matter.

What Perl does differs somewhat depending on whether you are just playing
around with encodings in memory versus using streams that have particular
encodings associated with them.  I believe that you can think of the
first as being for CEF stuff and the second as being for CES stuff.

Streams are strict.  Memory isn't.

Perl will never ever produce nor accept one of the 66 noncharacters on any
stream marked as one of the 7 character encoding schemes.  However, we
aren't always consistent about whether we generate an exception or whether
we return replacement characters.

Here the first process created a (for the nonce, nonfatal) warning, 
whereas the second process raised an exception:

    % perl -wle 'binmode(STDOUT, "encoding(UTF-16)") || die; print chr(0xFDD0)' |
        perl -wle 'binmode(STDIN, "encoding(UTF-16)") || die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    UTF-16:Unicode character fdd0 is illegal at -e line 1.
    Exit 255

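Here the first again makes a warning, and the second substitutes a
replacement string instead of raising an exception (the 92 it prints is
the code point of the backslash that begins that replacement string):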

    % perl -wle 'binmode(STDOUT, "encoding(UTF-8)") || die; print chr(0xFDD0)' |
        perl -wle 'binmode(STDIN, "encoding(UTF-8)") || die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    "\x{fdd0}" does not map to utf8.
    92

If you call encode() manually, you have much clearer control over this,
because you can specify what to do with invalid characters (exceptions,
replacements, etc).
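
Python's str.encode() offers a comparable choice through its errors
argument.  A short sketch, using a lone surrogate as the unencodable input
because Python's "utf-8" codec, unlike Perl's strict "UTF-8", does not
reject noncharacters:

```python
bad = "a\ud800b"   # contains a lone surrogate, which UTF-8 cannot encode

# Default ("strict"): raise an exception.
try:
    bad.encode("utf-8")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# Substitute a replacement character.
replaced = bad.encode("utf-8", errors="replace")

# Substitute an escape sequence.
escaped = bad.encode("utf-8", errors="backslashreplace")

# Silently drop the offending code point.
dropped = bad.encode("utf-8", errors="ignore")

print(strict_raised, replaced, escaped, dropped)
```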

We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8", that
can produce and accept illegal characters, although by default it is still
going to generate a warning:

    % perl -wle 'binmode(STDOUT, "encoding(utf8)") || die; print chr(0xFDD0)' |
        perl -wle 'binmode(STDIN, "encoding(utf8)") || die; print ord <STDIN>'
    Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
    64976

I could talk about ways to control whether it's a warning or an exception
or a replacement string or nothing at all, but suffice to say such
mechanisms do exist.  I just don't know that I agree with the defaults.

I think a big problem here is that the Python culture doesn't use stream
encodings enough.  People are always making their own repeated and tedious
calls to encode and then sending stuff out a byte stream, by which time it
is too late to check.  This is a real problem, because now you cannot be
permissive for the CEF but conservative for the CES.

In Perl this rarely happens in practice, because people seldom send
the result of encode() out a byte stream; they send things out character
streams that have proper encodings affiliated with them.  Yes, you can do
it, but then you lose the checks.  That's not a good idea.
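
The nearest Python equivalent of such a character stream is a text layer
with an encoding attached.  A minimal sketch, assuming io.TextIOWrapper
over an in-memory io.BytesIO standing in for a real file or socket: the
encoding check then happens at the stream, not in scattered encode() calls.
Note that Python's "utf-8" codec checks for surrogates only, so this does
not by itself reject noncharacters the way Perl's strict layer does.

```python
import io

raw = io.BytesIO()   # stand-in for a real byte stream
out = io.TextIOWrapper(raw, encoding="utf-8", errors="strict", newline="\n")

# Text goes in; the stream layer does the encoding.
out.write("caf\u00e9\n")
out.flush()
good_bytes = raw.getvalue()

# Unencodable text is caught by the stream itself.
try:
    out.write("\ud800")
    out.flush()
    rejected = False
except UnicodeEncodeError:
    rejected = True

print(good_bytes, rejected)
```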

Anything that deals with streams should have an encoding argument.  But
many things in Python don't.  For example, subprocess.Popen
doesn't even seem to take an encoding argument.  This makes people do
things by hand too often.  In fact, subprocess.Popen won't even accept
normal (Python 3 Unicode) strings, which is a real pain.  I do think the
culture of calling .encode("utf8") all over the place needs to be
replaced with a more stream-based approach in Python.  I had another
place where this happens too much in Python besides subprocess.Popen but
I can't remember where it is right now.
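
For readers on later Python versions: from Python 3.6 onward,
subprocess.Popen and subprocess.run accept encoding and errors arguments,
which postdates this message.  A small sketch of that stream-style usage,
where the child command is only a hypothetical stand-in that upper-cases
its input:

```python
import subprocess
import sys

# Hypothetical child process: read text on stdin, write it upper-cased.
child = "import sys; sys.stdout.write(sys.stdin.read().upper())"

result = subprocess.run(
    [sys.executable, "-c", child],
    input="hello",            # a str, not bytes
    stdout=subprocess.PIPE,
    encoding="utf-8",         # the pipes become text streams
    errors="strict",
    check=True,
)
echoed = result.stdout        # decoded for us, also a str

print(echoed)
```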

Perl's internal name for the strict UTF stuff is, for example, "utf-8-strict".
I think you probably want to distinguish these, and make the default strict
the way we do with "UTF-8".  We do not ever allow nonstrict UTF-16 or UTF-32,
only sometimes nonstrict UTF-8 if you call it "utf8".

I quote below the part of the perlunicode manpage that discusses this.

Sorry it's taken me so long to get back to you on this.  I'd be happy to answer
any further questions you might have.

--tom

	PERLUNICODE(1)   Perl Programmers Reference Guide  PERLUNICODE(1)

	   Non-character code points
	       66 code points are set aside in Unicode as "non-character code
	       points".  These all have the Unassigned (Cn) General Category, and
	       they never will be assigned.  These are never supposed to be in
	       legal Unicode input streams, so that code can use them as sentinels
	       that can be mixed in with character data, and they always will be
	       distinguishable from that data.  To keep them out of Perl input
	       streams, strict UTF-8 should be specified, such as by using the
	       layer ":encoding('UTF-8')".  The non-character code points are the
	       32 between U+FDD0 and U+FDEF, and the 34 code points U+FFFE,
	       U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.  Some people are
	       under the mistaken impression that these are "illegal", but that is
	       not true.  An application or cooperating set of applications can
	       legally use them at will internally; but these code points are
	       "illegal for open interchange". Therefore, Perl will not accept
	       these from input streams unless lax rules are being used, and will
	       warn (using the warning category "nonchar", which is a sub-category
	       of "utf8") if an attempt is made to output them.

	   Beyond Unicode code points
	       The maximum Unicode code point is U+10FFFF.  But Perl accepts code
	       points up to the maximum permissible unsigned number available on
	       the platform.  However, Perl will not accept these from input
	       streams unless lax rules are being used, and will warn (using the
	       warning category "non_unicode", which is a sub-category of "utf8")
	       if an attempt is made to operate on or output them.  For example,
	       "uc(0x11_0000)" will generate this warning, returning the input
	       parameter as its result, as the upper case of every non-Unicode
	       code point is the code point itself.

	perl v5.14.0                2011-05-07                         26
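
The set of 66 noncharacters described in the manpage excerpt is easy to
compute.  The sketch below derives it from the two ranges given there and
wraps it in a strict encoder; is_noncharacter and encode_for_interchange
are hypothetical helper names, not part of Python's codecs module, and the
choice of ValueError is likewise an assumption.

```python
def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Enumerate them: 32 in the BMP block plus 2 for each of the 17 planes.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]


def encode_for_interchange(text, encoding="utf-8"):
    """Encode text, refusing noncharacters (hypothetical strict policy)."""
    for ch in text:
        if is_noncharacter(ord(ch)):
            raise ValueError(
                "U+%04X is a noncharacter, illegal for open interchange"
                % ord(ch)
            )
    return text.encode(encoding)


print(len(noncharacters))
print(encode_for_interchange("abc"))
```

Calling encode_for_interchange("\ufdd0") raises ValueError, whereas
"\ufdd0".encode("utf-8") succeeds; that mirrors the split between a strict
stream layer and a permissive in-memory form.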

History
Date	User	Action	Args
2011-09-07 19:26:47	tchrist	set	recipients: + tchrist, lemburg, gvanrossum, terry.reedy, pitrou, vstinner, jkloth, ezio.melotti, mrabarnett, Arfrever, v+python, r.david.murray
2011-09-07 19:26:45	tchrist	link	issue12729 messages
2011-09-07 19:26:43	tchrist	create