
Author loewis
Recipients ixokai, lemburg, loewis, pitrou, pjenvey, ronaldoussoren, vstinner
Date 2010-10-10.16:22:25
Message-id <4CB1E83E.9050003@v.loewis.de>
In-reply-to <1286725888.95.0.846762557387.issue9992@psf.upfronthosting.co.za>
Content
On 10.10.2010 17:51, STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> We run into problems because we have two inconsistent encodings,
>> ...
> 
> What? No. We have problems because we don't use the same encoding to
> decode and to encode the same data type. It's not a problem to use a
> different encoding for each data type (stdout, filenames, environment
> variables, ...).

This is exactly the very problem that we face. In particular, the
question is what encoding to use if something is *both* a filename
and an environment variable value, or both a filename and a command
line argument.
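
To make the mismatch concrete, here is a minimal sketch of what can go wrong when the same bytes are decoded under one encoding and re-encoded under another. The sample bytes, the helper name `roundtrip`, and the choice of Latin-1 versus UTF-8 are assumptions made for illustration; they are not taken from the issue.

```python
# Hypothetical raw value: "café" encoded as UTF-8, as it might appear
# in an environment variable that is later used as a filename.
raw = b"caf\xc3\xa9"


def roundtrip(data, decode_with, encode_with):
    """Decode bytes with one encoding, then re-encode with another."""
    return data.decode(decode_with).encode(encode_with)


# Same encoding in both directions: the original bytes come back.
same = roundtrip(raw, "utf-8", "utf-8")

# Decoded as an environment variable (assumed Latin-1 locale), then
# encoded as a filename (assumed UTF-8): the bytes no longer match.
mixed = roundtrip(raw, "latin-1", "utf-8")

print(same == raw)
print(mixed == raw)
```

With inconsistent encodings the re-encoded value names a different file than the one the bytes originally referred to.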

> Mac OS X is a special case. Filesystem encoding is utf-8 on this OS,
> whereas the locale encoding depends on LANG variable. If I understood
> MvL proposition correctly, we should not rely on the locale on Mac OS
> X.

"Not rely on" is perhaps a bit harsh. It's not clear (to me) under what
conditions the locale's encoding will be more correct than just assuming
UTF-8 - there may actually be use cases for it.

However, with the surrogate escapes, we could just always decode using
UTF-8, and leave any mojibake problems that may arise from this
to the application. I do think that these problems will be rare,
since a) many OSX installations use UTF-8 anyway, and b) those that
don't will likely still get proper round-tripping from the escape mechanism.
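
As a rough illustration of the round-tripping mentioned above, the following sketch decodes a byte string that is not valid UTF-8 using the `surrogateescape` error handler from PEP 383 and then encodes it back. The sample filename is a made-up Latin-1 value, chosen only to trigger the escape.

```python
# Hypothetical filename bytes in Latin-1; 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Undecodable bytes are mapped to lone surrogates (0xE9 -> U+DCE9)
# instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(restored == raw)
```

The decoded string may display as mojibake, but the original bytes are not lost as long as both directions use the same encoding and error handler.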

> So the "3rd encoding" and the filesystem encodings should be
> hardcoded to utf-8?

That's an option to consider, yes - I'd like an OSX expert to
comment.

> The "third encoding" is no longer controllable by a special environment
> variable, only by classic locale environment variables (LC_ALL,
> LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL
> saying that it may be a problem for CGI for the environment variables
> because some (all?) variables are not encoded with the locale
> encoding (but the HTML encoding?). I don't know if Python should
> workaround CGI specific issues. In Python 3.2, we have now
> os.environb: it's now possible to use a different encoding for each
> variable.

I think these problems are sufficiently resolved now: either by
PEP 3333, PEP 444, PEP 383, or os.environb.

I think you misunderstood MAL's comment, though: the environment
variables are not encoded in *any* specific encoding. Instead,
they are copied literally from the HTTP request, using whatever
bytes the browser originally put in there - which may or may
not have followed a particular encoding. HTTP is silent on
this most of the time, and HTML is out of scope.
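
Since the variables carry raw bytes, an application can pick an encoding per variable. The sketch below shows one possible way to do that with a bytes mapping such as `os.environb`. The helper `decode_variable`, the sample mapping, and the encodings chosen for each variable are all hypothetical; a real CGI application would have to decide them itself.

```python
import os


def decode_variable(environ_bytes, name, encoding, errors="surrogateescape"):
    """Decode one variable from a bytes mapping, or return None if unset."""
    value = environ_bytes.get(name)
    if value is None:
        return None
    return value.decode(encoding, errors)


# Hypothetical CGI-style environment: each value is whatever bytes the
# client sent, so the two variables need not share an encoding.
sample = {
    b"QUERY_STRING": b"q=caf\xe9",     # assumed Latin-1
    b"PATH_INFO": b"/caf\xc3\xa9",     # assumed UTF-8
}

query = decode_variable(sample, b"QUERY_STRING", "latin-1")
path = decode_variable(sample, b"PATH_INFO", "utf-8")

# os.environb only exists where the OS environment is bytes-based.
if os.supports_bytes_environ:
    real_path = decode_variable(os.environb, b"PATH", "utf-8")
```

Keeping the decoding in one helper makes the per-variable encoding choice explicit instead of relying on a single process-wide default.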
History
Date                 User    Action  Args
2010-10-10 16:22:28  loewis  set     recipients: + loewis, lemburg, ixokai, ronaldoussoren, pitrou, vstinner, pjenvey
2010-10-10 16:22:26  loewis  link    issue9992 messages
2010-10-10 16:22:25  loewis  create