Message 118337 - Python tracker


Author	loewis
Recipients	ixokai, lemburg, loewis, pitrou, pjenvey, ronaldoussoren, vstinner
Date	2010-10-10.16:22:25
SpamBayes Score	1.0331047e-10
Marked as misclassified	No
Message-id	<4CB1E83E.9050003@v.loewis.de>
In-reply-to	<1286725888.95.0.846762557387.issue9992@psf.upfronthosting.co.za>

Content

On 10.10.2010 17:51, STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> We run into problems because we have two inconsistent encodings,
>> ...
> 
> What? No. We have problems because we don't use the same encoding to
> decode and to encode the same data type. It's not a problem to use a
> different encoding for each data type (stdout, filenames, environment
> variables, ...).

This is exactly the problem that we face. In particular, the
question is what encoding to use if something is *both* a filename
and an environment variable value, or both a filename and a command
line argument.
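
To illustrate the point, here is a minimal sketch of what goes wrong when
the same bytes are decoded as one data type and encoded as another. The
two encodings (Latin-1 for the environment, UTF-8 for the filesystem) are
assumptions picked for the example, not what any particular system uses.

```python
# Bytes of a file name as the OS stores them (UTF-8 for "é").
raw = "\u00e9".encode("utf-8")

# Assumption: the value arrives through an environment variable and is
# decoded with a Latin-1 locale encoding.
as_text = raw.decode("latin-1")

# Assumption: the same string is then used as a file name and encoded
# with a UTF-8 filesystem encoding.
reencoded = as_text.encode("utf-8")

# The bytes handed back to the OS no longer match the original name.
print(raw, reencoded, raw == reencoded)
```

With a single encoding on both sides, the round trip would return the
original bytes.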

> Mac OS X is a special case. Filesystem encoding is utf-8 on this OS,
> whereas the locale encoding depends on LANG variable. If I understood
> MvL proposition correctly, we should not rely on the locale on Mac OS
> X.

"Not rely on" is perhaps a bit harsh. It's not clear (to me) under what
conditions the locale's encoding will be more correct than just assuming
UTF-8 - there may actually be use cases for it.

However, with the surrogate escapes, we could just always decode using
UTF-8, and leave any mojibake problems that may arise from this to the
application. I do think that these problems will be rare, since
a) many OSX installations use UTF-8 anyway, and b) those that don't
will likely still get proper round-tripping from the escape mechanism.
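
The round-tripping mentioned above can be sketched with the
"surrogateescape" error handler from PEP 383. The file name below is a
made-up example of bytes that are not valid UTF-8.

```python
# Hypothetical file name encoded in Latin-1; b"\xe9" is not valid UTF-8.
raw = b"caf\xe9.txt"

# Decoding as UTF-8 with surrogateescape never fails: each undecodable
# byte is mapped to a lone surrogate in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same codec and error handler restores the bytes.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text), restored == raw)
```

The decoded string may display as mojibake, but the original bytes are
still recoverable, which is what matters when passing the name back to
the OS.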

> So the "3rd encoding" and the filesystem encodings should be
> hardcoded to utf-8?

That's an option to consider, yes - I'd like an OSX expert to
comment.

> The "third encoding" is no longer controllable by a special environment
> variable, only by classic locale environment variables (LC_ALL,
> LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL
> saying that it may be a problem for CGI for the environment variables
> because some (all?) variables are not encoded with the locale
> encoding (but the HTML encoding?). I don't know if Python should
> workaround CGI specific issues. In Python 3.2, we have now
> os.environb: it's now possible to use a different encoding for each
> variable.

I think these problems are sufficiently resolved now: either by
PEP 3333, PEP 444, PEP 383, or os.environb.

I think you misunderstood MAL's comment, though: the environment
variables are not encoded in *any* specific encoding. Instead,
they are copied literally from the HTTP request, using whatever
bytes the browser originally put in there - which may or may
not have followed a particular encoding. HTTP is silent on
this most of the time, and HTML is out of scope.
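
As a rough sketch of how an application could deal with this using
os.environb: it reads the raw bytes and picks an encoding per variable.
The helper name decode_env, the sample variables and the chosen encodings
are all assumptions made for the example; a real CGI script would have to
decide for itself how to interpret each value.

```python
import os


def decode_env(name, encoding, environ=None):
    """Hypothetical helper: decode one variable from a bytes environment."""
    if environ is None:
        # os.environb exists only if os.supports_bytes_environ is true.
        environ = os.environb
    raw = environ.get(name)
    if raw is None:
        return None
    # surrogateescape keeps undecodable bytes recoverable.
    return raw.decode(encoding, "surrogateescape")


# Stand-in for a CGI environment; the byte values are invented.
sample_env = {b"PATH_INFO": b"/caf\xe9", b"SCRIPT_NAME": b"/app"}

as_latin1 = decode_env(b"PATH_INFO", "latin-1", sample_env)
as_utf8 = decode_env(b"PATH_INFO", "utf-8", sample_env)

print(ascii(as_latin1), ascii(as_utf8))
```

Passing the mapping explicitly keeps the sketch usable on platforms
where os.environb is not available.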

History
Date	User	Action	Args
2010-10-10 16:22:28	loewis	set	recipients: + loewis, lemburg, ixokai, ronaldoussoren, pitrou, vstinner, pjenvey
2010-10-10 16:22:26	loewis	link	issue9992 messages
2010-10-10 16:22:25	loewis	create