Message144256
"Terry J. Reedy" <report@bugs.python.org> wrote
on Thu, 08 Sep 2011 18:56:11 -0000:
>On 9/8/2011 4:32 AM, Ezio Melotti wrote:
>> So to summarize a bit, there are different possible levels of strictness:
>> 1) all the possible encodable values, including the ones > 10FFFF;
>> 2) values in range 0..10FFFF;
>> 3) values in range 0..10FFFF except surrogates (aka scalar values);
>> 4) values in range 0..10FFFF except surrogates and noncharacters;
>> and this is what is currently available in Python:
>> 1) not available, probably it will never be;
>> 2) available through the 'surrogatepass' error handler;
>> 3) default behavior (i.e. with the 'strict' error handler);
>> 4) currently not available.
>> Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about). The possible options are:
>> * add a new codec (actually one for each UTF encoding);
>> * add a new error handler that explicitly disallows noncharacters;
>> * change the meaning of 'strict' to match option 4;
> If 'strict' meant option 4, then 'scalarpass' could mean option 3.
> 'surrogatepass' would then mean 'pass surrogates also, in addition to
> non-char scalars'.
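For reference, here is how levels 1 through 4 look from Python today. This is only a minimal sketch, assuming Python 3 and the built-in UTF-8 codec; the variable names are made up for illustration and nothing here is new API:

```python
lone_surrogate = "\ud800"
nonchar = "\ufdd0"

# Level 3 (default 'strict'): a surrogate is rejected.
try:
    lone_surrogate.encode("utf-8")
    strict_rejects_surrogate = False
except UnicodeEncodeError:
    strict_rejects_surrogate = True

# Level 2 ('surrogatepass'): the surrogate is encoded like any other code point.
surrogate_bytes = lone_surrogate.encode("utf-8", "surrogatepass")

# Level 4 is not available: 'strict' encodes a noncharacter without complaint.
nonchar_bytes = nonchar.encode("utf-8")

# Level 1 is not available either: a code point above 10FFFF cannot be built.
try:
    chr(0x110000)
    beyond_range_allowed = True
except ValueError:
    beyond_range_allowed = False

print(strict_rejects_surrogate, surrogate_bytes, nonchar_bytes, beyond_range_allowed)
```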
I'm pretty sure that anything that claims to be UTF-{8,16,32} needs
to reject both surrogates *and* noncharacters. Here's something from the
published Unicode Standard's p. 24 about noncharacter code points:
• Noncharacter code points are reserved for internal use, such as for
sentinel values. They should never be interchanged. They do, however,
have well-formed representations in Unicode encoding forms and survive
conversions between encoding forms. This allows sentinel values to be
preserved internally across Unicode encoding forms, even though they are
not designed to be used in open interchange.
And here from the Unicode Standard's chapter on Conformance, section 3.2, p. 59:
C2 A process shall not interpret a noncharacter code point as an
abstract character.
• The noncharacter code points may be used internally, such as for
sentinel values or delimiters, but should not be exchanged publicly.
I'd have to check the fine print, but I am pretty sure that "shall not"
is an imperative form. We have understood that to mean that a conforming
process *must not* do that. It's because of that wording that Perl's
{en,de}code(), with any of the "UTF-{8,16,32}" encodings, including the
LE/BE versions as appropriate, will neither produce nor accept a
noncharacter code point like FDD0 or FFFE.
Do you think we may perhaps have misread that conformance clause?
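To make "noncharacter code point" concrete: the Unicode Standard designates 66 of them, namely U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. A small Python sketch of that test follows; the function name is hypothetical, not something Python or Perl provides:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacter code points."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of every plane: xxFFFE and xxFFFF, up to 10FFFF.
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Count them over the whole code space as a sanity check.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```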
Using Perl's special, loose-fitting "utf8" encoding, you can get it to do
noncharacter code points and even surrogates, but you have to suppress
certain things to make that happen quietly. You can only do this with
"utf8", not any of the UTF-16 or UTF-32 flavors. There we give you no
choice, so you must be strict. I agree this is not fully orthogonal.
Note that this is the normal thing that people do:
binmode(STDOUT, ":utf8");
which is the *loose* version. The strict one is "utf8-strict" or "UTF-8":
open(my $fh, "< :encoding(UTF-8)", $pathname)
So it is a bit too easy to get the loose one. We felt we had to do this
because we were already using the loose definition (and allowing up to
chr(2**32) etc) when the Unicode Consortium made clear what sorts of
things must not be accepted, or perhaps, before we made ourselves clear
on this. This will have been back in 2003, when I wasn't paying very
close attention.
I think that just like Perl, Python has a legacy of the original loose
definition. So some way to accommodate that legacy while still allowing
for a conformant application should be devised. My concern with Python
is that people tend to make their own manual calls to encode/decode a lot
more often than they do in Perl. That means that if you only catch it
on a stream encoding, you'll miss it, because people will use binary I/O
and bypass the check.
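One way a check could cover manual encode/decode calls as well as streams is a pair of wrapper functions. The sketch below is only an illustration in Python 3 of what option 4 might look like from user code; the function names, the exception class, and the regex approach are all assumptions made for this example, not an existing or proposed Python API:

```python
import re

# Character class matching every noncharacter: U+FDD0..U+FDEF plus
# U+xxFFFE and U+xxFFFF in each of the 17 planes.
_NONCHAR_RE = re.compile(
    "[\ufdd0-\ufdef"
    + "".join(chr((plane << 16) | low)
              for plane in range(17)
              for low in (0xFFFE, 0xFFFF))
    + "]"
)


class NoncharacterError(ValueError):
    """Hypothetical error for noncharacters found in interchange data."""


def _reject_noncharacters(text: str) -> None:
    match = _NONCHAR_RE.search(text)
    if match:
        raise NoncharacterError(
            "noncharacter U+%04X at index %d" % (ord(match.group()), match.start())
        )


def encode_for_interchange(text: str, encoding: str = "utf-8") -> bytes:
    """Encode with 'strict', additionally refusing noncharacters."""
    _reject_noncharacters(text)
    return text.encode(encoding, "strict")


def decode_from_interchange(data: bytes, encoding: str = "utf-8") -> str:
    """Decode with 'strict', additionally refusing noncharacters."""
    text = data.decode(encoding, "strict")
    _reject_noncharacters(text)
    return text
```

Because the check lives in the functions rather than in a stream layer, it also applies to data read or written through binary I/O, provided callers actually use the wrappers.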
--tom
Below I show a bit of how this works in Perl. Currently the builtin
utf8 encoding is controlled somewhat differently from how the Encode
module's encode/decode functions are. Yes, this is not my idea of good.
This shows that noncharacters and surrogates do not survive the
encoding/decoding process for UTF-16:
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFDD0)))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xFFFE)))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -CS -MEncode -wle 'print decode("UTF-16", encode("UTF-16", chr(0xD800)))' | uniquote -v
UTF-16 surrogate U+D800 in subroutine entry at /usr/local/lib/perl5/5.14.0/darwin-2level/Encode.pm line 158.
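For comparison, the same round trip can be tried with Python's built-in codecs. This is a rough sketch, assuming Python 3.4 or later (where the UTF-16 and UTF-32 encoders reject lone surrogates); the dictionary names are just for illustration:

```python
encodings = ("utf-8", "utf-16", "utf-32")

# Do noncharacters survive encode followed by decode?
survivors = {}
for encoding in encodings:
    for cp in (0xFDD0, 0xFFFE):
        ch = chr(cp)
        survivors[(encoding, cp)] = ch.encode(encoding).decode(encoding) == ch

# Is a lone surrogate rejected by the default 'strict' handler?
surrogate_rejected = {}
for encoding in encodings:
    try:
        "\ud800".encode(encoding)
        surrogate_rejected[encoding] = False
    except UnicodeEncodeError:
        surrogate_rejected[encoding] = True

print(survivors)
print(surrogate_rejected)
```

Under that assumption the noncharacters come back unchanged in all three encodings, while the surrogate raises UnicodeEncodeError.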
If you pass a third argument to encode/decode, you can tell it what to
do on error; an argument of 1 raises an exception. Not supplying a
third argument gets the "default" behavior, which varies by encoding.
(The careful programmer is apt to want to pass in an appropriate
bit mask of things like DIE_ON_ERR, WARN_ON_ERR, RETURN_ON_ERR,
LEAVE_SRC, PERLQQ, HTMLCREF, or XMLCREF.)
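Python's nearest counterpart to that third argument is the `errors` parameter of `str.encode` and `bytes.decode`. The sketch below shows the standard handlers applied to a lone surrogate; the pairing with Perl's flags (PERLQQ with 'backslashreplace', XMLCREF with 'xmlcharrefreplace') is only a loose analogy assumed here, and the output formats differ:

```python
bad = "a\ud800b"  # a lone surrogate, which 'strict' refuses to encode

results = {
    handler: bad.encode("utf-8", handler)
    for handler in ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")
}

for handler, encoded in results.items():
    print(handler, encoded)

# On the decoding side, 'replace' substitutes U+FFFD for malformed input.
decoded = b"a\xed\xa0\x80b".decode("utf-8", "replace")
```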
With encode(), the default behavior of the strict "UTF-8" is to swap in
the Unicode replacement character for things that don't map to the given
encoding, as you saw above with UTF-16, whereas the loose "utf8" lets
them through:
% perl -C0 -MEncode -wle 'print encode("utf8", chr(0xFDD0))' | uniquote -v
\N{U+FDD0}
% perl -C0 -MEncode -wle 'print encode("UTF-8", chr(0xFDD0))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -C0 -MEncode= -wle 'print encode("utf8", chr(0xD800))' | uniquote -v
\N{U+D800}
% perl -C0 -MEncode= -wle 'print encode("UTF-8", chr(0xFDD0))' | uniquote -v
\N{REPLACEMENT CHARACTER}
% perl -C0 -MEncode=:all -wle 'print encode("utf8", chr(0x100_0000))' | uniquote -v
\N{U+1000000}
% perl -C0 -MEncode=:all -wle 'print encode("UTF-8", chr(0x100_0000))' | uniquote -v
\N{REPLACEMENT CHARACTER}
With the builtin "utf8" encoding, which does *not* go through the
Encode module, you instead control all this through lexical
warnings/exceptions categories. By default, you get a warning if
you try to use noncharacter, surrogate, or nonunicode code points
even on a loose utf8 stream (which is what -CS gets you):
% perl -CS -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
Code point 0x1000000 is not Unicode, may not be portable at -e line 1.
\N{U+FDD0}
\N{U+D800}
\N{U+1000000}
Notice I didn't ask for warnings there, but I still got them. The next
command promotes all utf8 warnings into exceptions, thus dying on the
first one it finds:
% perl -CS -Mwarnings=FATAL,utf8 -le 'print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
You can control these separately. For example, these all die of an
exception:
% perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xFDD0)'
Unicode non-character U+FDD0 is illegal for open interchange at -e line 1.
% perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0xD800)'
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 1.
% perl -CS -Mwarnings=FATAL,utf8 -wle 'print chr(0x100_0000)'
Code point 0x1000000 is not Unicode, may not be portable at -e line 1.
While these do not:
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "nonchar"; print chr(0xFDD0)' | uniquote
\N{U+FDD0}
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "surrogate"; print chr(0xD800)' | uniquote
\N{U+D800}
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings "non_unicode"; print chr(0x100_0000)' | uniquote
\N{U+1000000}
% perl -CS -Mwarnings=FATAL,utf8 -wle 'no warnings qw(nonchar surrogate non_unicode);
print chr for 0xFDD0, 0xD800, 0x100_0000' | uniquote
\N{U+FDD0}
\N{U+D800}
\N{U+1000000}
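The same kind of per-category control could be imitated in Python with the `warnings` module. What follows is only a sketch of the idea: the warning classes and `check_code_points` are hypothetical names invented for this example, and there is no "non_unicode" category because a Python str cannot hold a code point above 10FFFF:

```python
import warnings


class InterchangeWarning(UnicodeWarning):
    """Hypothetical base category, loosely like Perl's 'utf8' warnings class."""


class NoncharWarning(InterchangeWarning):
    """Hypothetical counterpart of Perl's 'nonchar' category."""


class SurrogateWarning(InterchangeWarning):
    """Hypothetical counterpart of Perl's 'surrogate' category."""


def check_code_points(text: str) -> str:
    """Warn about surrogates and noncharacters, then return text unchanged."""
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            warnings.warn("surrogate U+%04X is illegal in UTF-8" % cp,
                          SurrogateWarning, stacklevel=2)
        elif 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            warnings.warn("noncharacter U+%04X is illegal for open interchange" % cp,
                          NoncharWarning, stacklevel=2)
    return text


with warnings.catch_warnings():
    # Roughly: -Mwarnings=FATAL,utf8 followed by: no warnings "nonchar"
    warnings.simplefilter("error", InterchangeWarning)
    warnings.simplefilter("ignore", NoncharWarning)
    nonchar_passed = check_code_points("\ufdd0") == "\ufdd0"
    try:
        check_code_points("\ud800")
        surrogate_died = False
    except SurrogateWarning:
        surrogate_died = True
```

Filters added later are consulted first, so the "ignore" rule for the one category wins over the blanket "error" rule, much as the lexical `no warnings` overrides the FATAL setting in the Perl commands above.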

Date                | User    | Action | Args
2011-09-18 22:45:31 | tchrist | set    | recipients: + tchrist, lemburg, gvanrossum, terry.reedy, pitrou, vstinner, jkloth, ezio.melotti, mrabarnett, Arfrever, v+python, r.david.murray, abacabadabacaba
2011-09-18 22:45:29 | tchrist | link   | issue12729 messages
2011-09-18 22:45:28 | tchrist | create |