Message 104635 - Python tracker


Author	lemburg
Recipients	Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, vstinner
Date	2010-04-30.13:58:23
Message-id	<4BDAE1FD.6020307@egenix.com>
In-reply-to	<201004261400.08237.victor.stinner@haypocalc.com>

Content
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> On Monday 26 April 2010 13:06:48, you wrote:
>> I don't see what environment variables have to do with the file
>> system.
> 
> A POSIX system only offers *one* function about the encoding: 
> nl_langinfo(CODESET) and Python3 uses it for the filenames, environment 
> variables and the command line arguments.
> 
> Are you suggesting that Python3 should support a different encoding for
> environment variables and the file system? How would the user configure it?

It's better to let the application decide how to solve this problem
and in order to allow for this, the encodings must be adjustable.

By using fsencode() and fsdecode() in stdlib functions, you basically
prevent this kind of adjustment, since they hardcode the use of
a single encoding which is guessed by looking at nl_langinfo(CODESET).

Note that applications may well use completely different encodings
in the environment and for things like pipes than what the user
set up for her GUI environment.

In the end, this will only lead to the same kind of mess we've
had with sys.setdefaultencoding() in Python 2.x, only this
time with sys.setfilesystemencoding() and I'd like to avoid that.

> Since Python3 chose to store environment variables as Unicode strings on
> Windows and POSIX, in this specific case you should reconvert the value to 
> byte strings using fsencode() and then manipulate byte strings. Because 
> Python3 uses surrogateescape, you will get the original byte string values.

Well, yes, but that's a kludge, isn't it?

If you know that e.g. your environment variables are going to have
Latin-1 data (say some content-type variable has this information),
but the user's default LANG setting is UTF-8, Python will fetch the
data as broken Unicode data. You then have to convert it back to bytes
and then back to Unicode using the correct Latin-1 encoding.
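
To make the round trip concrete, here is a minimal sketch of that
workaround. The helper name redecode() is hypothetical, and the
encodings are passed explicitly rather than taken from the locale, so
the result does not depend on the machine it runs on:

```python
def redecode(value: str, guessed: str, actual: str) -> str:
    """Undo a decode done with the wrong codec and redo it correctly.

    Hypothetical helper, not part of the stdlib.
    """
    # surrogateescape turns the lone surrogates back into the original bytes.
    raw = value.encode(guessed, errors="surrogateescape")
    return raw.decode(actual)


# Latin-1 bytes as they would sit in the process environment.
original = b"caf\xe9"
# What decoding as UTF-8 with surrogateescape makes of them:
# the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
broken = original.decode("utf-8", errors="surrogateescape")
# Step back to bytes, then decode with the encoding the application knows.
fixed = redecode(broken, "utf-8", "latin-1")
```

The application has to know both the encoding Python guessed and the
one the data really uses, which is the extra work being objected to.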

It would be a lot better to have the application provide the
encoding to the os.getenv() function and have Python do the
correct decoding right from the start.
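
As a rough illustration of the idea, an application-level helper could
look like the sketch below. The name getenv_decoded() and its signature
are assumptions made for this example; os.getenv() itself does not take
an encoding argument. The sketch relies on os.environb, which is only
available where os.supports_bytes_environ is true:

```python
import os


def getenv_decoded(name, encoding, default=None, errors="strict"):
    """Return environment variable `name` decoded with `encoding`.

    Hypothetical helper: the caller, not the locale, picks the codec.
    """
    if not os.supports_bytes_environ:
        # Platforms without a bytes environment (e.g. Windows) store the
        # environment natively as Unicode, so there is nothing to redecode.
        return os.environ.get(name, default)
    # Read the raw bytes directly, bypassing the locale-based decoding.
    raw = os.environb.get(os.fsencode(name))
    if raw is None:
        return default
    return raw.decode(encoding, errors)
```

With this, the Latin-1 case above becomes a single call such as
getenv_decoded("SOME_VAR", "latin-1"), with no intermediate broken
string.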

History
Date	User	Action	Args
2010-04-30 13:58:28	lemburg	set	recipients: + lemburg, loewis, gregory.p.smith, vstinner, ezio.melotti, Arfrever
2010-04-30 13:58:24	lemburg	link	issue8514 messages
2010-04-30 13:58:23	lemburg	create