Author ezio.melotti
Recipients Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date 2011-09-08.08:32:42
So to summarize a bit, there are different possible levels of strictness:
  1) all the possible encodable values, including the ones >10FFFF;
  2) values in range 0..10FFFF;
  3) values in range 0..10FFFF except surrogates (aka scalar values);
  4) values in range 0..10FFFF except surrogates and noncharacters;

and this is what is currently available in Python:
  1) not available, probably it will never be;
  2) available through the 'surrogatepass' error handler;
  3) default behavior (i.e. with the 'strict' error handler);
  4) currently not available.

(note: this refers to the utf-8 codec in Python 3, but it should be true for the utf-16/32 codecs too once #12892 is fixed.  This whole message refers to codecs only and what they should (dis)allow.  What we use internally seems to work fine and doesn't need to be changed.)
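
As a rough illustration of levels 1-4 with the utf-8 codec (a minimal sketch, assuming CPython 3; the variable names are only there for the example):

```python
# Level 3 (default, 'strict'): lone surrogates are rejected.
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError:
    strict_rejects_surrogate = True
else:
    strict_rejects_surrogate = False

# Level 2: 'surrogatepass' lets the surrogate through.
surrogate_bytes = "\ud800".encode("utf-8", "surrogatepass")

# Level 4 is not available: a noncharacter such as U+FFFE is
# encoded without complaint, even by 'strict'.
nonchar_bytes = "\ufffe".encode("utf-8")

# Level 1 is not reachable: str cannot hold values above 10FFFF.
try:
    chr(0x110000)
except ValueError:
    above_range_rejected = True
else:
    above_range_rejected = False
```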

Now, assume that we don't care about option 1 and want to implement the missing option 4 (which I'm still not 100% sure about).  The possible options are:
  * add a new codec (actually one for each UTF encoding);
  * add a new error handler that explicitly disallows noncharacters;
  * change the meaning of 'strict' to match option 4;

This depends on what the default behavior should be when dealing with noncharacters.  If they are rejected by default, then 'strict' should reject them.  However, this would leave us without option 3 (something to encode all and only scalar values), and 'surrogatepass' would be misnamed if it also allowed noncharacters (and if it didn't, we would end up without option 2 too).  Rejecting noncharacters by default is apparently what Perl does:
> Perl will never ever produce nor accept one of the 66 noncharacters
> on any stream marked as one of the 7 character encoding schemes.

Implementation-wise, I think the 'strict' error handler should be the strictest one, because the codec must detect all the "problematic" chars and send them to the error handler, which then decides what to do with them.  I.e. if the codec detects noncharacters and sends them to the error handler, and the error handler is 'strict', an error will be raised; if the codec doesn't detect them, the error handler won't be able to do anything with them.
Another option is to provide another codec that specifically detects them, but this means re-implementing a slightly different version of each codec (or possibly adding an extra argument to the PyUnicode_{Encode,Decode}UTF* functions).
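
To make option 4 concrete, here is a minimal sketch of a check done outside the codec (since the current codecs never report noncharacters to the error handler).  The helpers is_noncharacter() and encode_rejecting_noncharacters() are hypothetical names, not an existing or proposed API; the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes:

```python
def is_noncharacter(cp):
    """Return True if the code point is one of the 66 noncharacters."""
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def encode_rejecting_noncharacters(text, encoding="utf-8"):
    """Hypothetical wrapper: strict encoding that also rejects noncharacters."""
    for i, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            raise UnicodeEncodeError(
                encoding, text, i, i + 1, "noncharacter not allowed")
    # Surrogates are still rejected by the default 'strict' handler.
    return text.encode(encoding)


noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))

try:
    encode_rejecting_noncharacters("a\ufffeb")
except UnicodeEncodeError as exc:
    rejected_at = exc.start
else:
    rejected_at = None
```

A real implementation would presumably do this check inside the codec and route the offending range through the error handler, so that handlers other than 'strict' could still let it through.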

We could also decide to leave the handling of noncharacters as it is -- after all the Unicode standard doesn't seem to explicitly forbid them as it does with e.g. surrogates.

> We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8",
> that can produce and accept illegal characters, although by default it
> is still going to generate a warning

How did Perl implement this?  With two (or more) slightly different versions of the same codec?
And how does Perl handle errors?  With a global option that turns (possibly specific) warnings into errors (like python -We)?

Python has different codecs that encode/decode str/bytes, and whenever they find a "problematic" char they send it to the error handler, which might decide to raise an error, remove the char, replace it with something else, send it back unchanged, generate a warning, and so on.  In this way you can combine codecs and error handlers to get the desired behavior.  (And FWIW in Python 'utf8' is an alias for 'UTF-8'.)
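
A short sketch of how codecs and error handlers combine.  The handler name "warn_and_replace" is made up for this example; codecs.register_error() and the built-in 'replace' and 'ignore' handlers are the standard ones:

```python
import codecs
import warnings


def warn_and_replace(exc):
    # Hypothetical handler: emit a warning, substitute '?', and
    # resume encoding right after the problematic range.
    if isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        warnings.warn("unencodable text replaced: %r" % bad)
        return ("?", exc.end)
    raise exc


codecs.register_error("warn_and_replace", warn_and_replace)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned = "caf\u00e9".encode("ascii", "warn_and_replace")

# Built-in handlers applied to an invalid UTF-8 byte.
replaced = b"\xff".decode("utf-8", "replace")
ignored = b"a\xffb".decode("utf-8", "ignore")

# 'utf8' and 'UTF-8' resolve to the same codec.
same_codec = codecs.lookup("utf8").name == codecs.lookup("UTF-8").name
```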

> I think a big problem here is that the Python culture doesn't use stream
> encodings enough.  People are always making their own repeated and tedious
> calls to encode and then sending stuff out a byte stream, by which time it
> is too late to check.
> [...]
> Anything that deals with streams should have an encoding argument.  But
> often/many? things in Python don't.
Several objects have .encoding and .errors attributes (e.g. sys.stdin/out), and these are used to encode/decode the text/bytes sent/read to/from them.  In other places we prefer the "explicit is better than implicit" approach and require the user (or some other higher-level layer) to encode/decode manually.
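
For example, a text stream wrapping a byte stream carries its own encoding and error handler, and the check happens when text is written to it.  This is a minimal sketch using io.TextIOWrapper over an in-memory io.BytesIO (the variable names are only for illustration):

```python
import io

buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8", errors="strict")

# The stream remembers how it encodes and how it handles errors.
stream_settings = (stream.encoding, stream.errors)

stream.write("caf\u00e9")
stream.flush()
written = buffer.getvalue()

# A lone surrogate is rejected by the stream itself, before any
# bytes for it reach the underlying buffer.
try:
    stream.write("\ud800")
    stream.flush()
except UnicodeEncodeError:
    surrogate_rejected = True
else:
    surrogate_rejected = False
```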

I'm not sure why you are saying that it's too late to check, and since the encoding/decoding happens only in a few places I don't think it's tedious at all (and often it's automatic too).
Date                 User          Action  Args
2011-09-08 08:32:45  ezio.melotti  set     recipients: + ezio.melotti, lemburg, gvanrossum, terry.reedy, pitrou, vstinner, jkloth, mrabarnett, Arfrever, v+python, r.david.murray, tchrist
2011-09-08 08:32:45  ezio.melotti  set     messageid: <>
2011-09-08 08:32:44  ezio.melotti  link    issue12729 messages
2011-09-08 08:32:42  ezio.melotti  create