Message 104650 - Python tracker


Author	lemburg
Recipients	Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, vstinner
Date	2010-04-30.16:25:38
Message-id	<4BDB047F.209@egenix.com>
In-reply-to	<201004301805.18230.victor.stinner@haypocalc.com>

Content

STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> On Friday 30 April 2010 15:58:28, you wrote:
>> It's better to let the application decide how to solve this problem
>> and in order to allow for this, the encodings must be adjustable.
> 
> On POSIX, use byte strings to avoid encoding issues. Examples:
> 
>    subprocess.call(['env'], env={b'TEST': b'a\xff-'}) # env
>    subprocess.call(['echo', b'a\xff-']) # command line
>    open('a\xff-') # filename
>    os.getenv(b'a\xff-') # get env (result as unicode)
> 
> Are you talking about issues on Windows?

The issues normally occur on the way in, not the way out of Python,
so I don't see how using bytes would help.

>> By using fsencode() and fsdecode() in stdlib functions, you basically
>> prevent this kind of adjustment, ...
> 
> Not if you use byte strings. On POSIX, a Unicode string is always converted
> at the end for the system call (using sys.getfilesystemencoding()).

Right, and that's a problem, since the file system encoding
doesn't need to have anything to do with what you have in
the environment.

>> If you know that e.g. your environment variables are going to have
>> Latin-1 data (say some content-type variable has this information),
>> but the user's default LANG setting is UTF-8, Python will fetch the
>> data as broken Unicode data, you then have to convert it back to bytes
>> and then back to Unicode using the correct Latin-1 encoding.
>>
>> It would be a lot better to have the application provide the
>> encoding to the os.getenv() function and have Python do the
>> correct decoding right from the start.
> 
> You mean that os.getenv() should have an optional argument? Something like:

Yes.

>   def getenv(key, default=None, encoding=None):
>      value = environ.get(key, default)
>      if encoding:
>         value = value.encode(sys.getfilesystemencoding(), 'surrogateescape')
>         value = value.decode(encoding, 'surrogateescape')
>      return value

No, you store the environment data as bytes and only
decode in getenv(), using the given encoding if there is
one and otherwise falling back to the file system encoding
or the default encoding (UTF-8).

It would probably also be worthwhile adding the encoding
parameter to os.environ.get().
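
As a rough illustration of the bytes-first approach described above, here is a minimal sketch. The helper name getenv_decoded() and its environ_bytes parameter are hypothetical and exist only for this example; they are not an os module API. It assumes a bytes view of the environment such as os.environb, which is only available on platforms with a bytes-based environment.

```python
import os
import sys


def getenv_decoded(key, default=None, encoding=None, environ_bytes=None,
                   errors="surrogateescape"):
    """Hypothetical helper: look up `key` in a bytes mapping and decode late.

    `environ_bytes` is an assumption of this sketch; it lets callers pass any
    bytes -> bytes mapping instead of the real process environment.
    """
    if environ_bytes is None:
        # os.environb only exists where the OS environment is bytes-based.
        environ_bytes = os.environb
    if encoding is None:
        # Fall back to the file system encoding when the caller gives none.
        encoding = sys.getfilesystemencoding()
    raw = environ_bytes.get(key.encode(encoding, "surrogateescape"))
    if raw is None:
        return default
    # The bytes are decoded exactly once, with the encoding the caller chose.
    return raw.decode(encoding, errors)


# Usage with an explicit bytes mapping standing in for the environment:
fake_env = {b"GREETING": b"caf\xe9"}
print(getenv_decoded("GREETING", encoding="latin-1", environ_bytes=fake_env))
```

Keeping the stored data as bytes means no lossy or escaped intermediate string is ever created; the decode step happens at the point where the encoding is actually known.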

> There are many indirect calls to os.getenv() (e.g. by using os.environ.get()):
>  - curses uses TERM
>  - webbrowser uses PROGRAMFILES (path)
>  - distutils.msvc9compiler uses "VS%0.f0COMNTOOLS" % version (path)
>  - wsgiref.util uses HTTP_HOST, SERVER_NAME,  SCRIPT_NAME, ... (url)
>  - platform uses PROCESSOR_ARCHITEW6432
>  - sysconfig uses PYTHONUSERBASE, APPDATA, ... (path)
>  - idlelib.PyShell uses IDLESTARTUP and PYTHONSTARTUP (path)
>  - ...
> 
> How would you specify the correct encoding in indirect calls?

In all of the above cases, the application (in this case the
various modules) knows which encoding to expect and can
add the right encoding parameter to the os.getenv() call.

E.g. the cgi module can use the content-type passed in as
environment parameter to determine the encoding; most other
modules will just use ASCII, or the file system encoding
if they are dealing with paths or file names.
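
To make the content-type case concrete, here is a small sketch of how a module could derive the encoding from a content-type value and then decode raw bytes with it. Both function names are hypothetical, and the ASCII fallback is an assumption of this example rather than documented behaviour of the cgi module.

```python
from email.message import Message


def charset_from_content_type(content_type, fallback="ascii"):
    """Hypothetical helper: extract the charset parameter of a content-type."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or the fallback
    # when the header carries no charset parameter.
    return msg.get_content_charset(fallback)


def decode_with_declared_charset(raw, content_type, fallback="ascii"):
    """Decode raw environment bytes using the charset named in content_type."""
    return raw.decode(charset_from_content_type(content_type, fallback))


print(decode_with_declared_charset(b"caf\xe9", "text/html; charset=ISO-8859-1"))
```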

> If your application gets variables in *mixed* encoding, I think that your 
> program should start by reencoding variables:
> 
>   for name, encoding in (('PATH', 'latin1'), ...):
>      value = os.getenv(name)
>      value = value.encode(sys.getfilesystemencoding(), 'surrogateescape')
>      value = value.decode(encoding, 'surrogateescape')
>      os.setenv(name, value)

Which is a kludge, as I mentioned in my previous comment:

    value = os.getenv(name, encoding=encoding)
    my_environ[name] = value

reads much better.
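
For comparison, the two routes can be sketched side by side on plain byte strings. This assumes a UTF-8 file system encoding and Latin-1 data; the function names are hypothetical and only serve the illustration.

```python
def reencode_kludge(value, fs_encoding, real_encoding):
    """Undo a wrong decode: back to the original bytes, then decode properly."""
    raw = value.encode(fs_encoding, "surrogateescape")
    return raw.decode(real_encoding)


def decode_directly(raw, real_encoding):
    """Decode the original bytes once, with the encoding the caller knows."""
    return raw.decode(real_encoding)


raw = b"caf\xe9"  # Latin-1 bytes as they would sit in the environment

# What a UTF-8 based decode hands to the application: the invalid byte 0xE9
# is smuggled through as the lone surrogate U+DCE9.
broken = raw.decode("utf-8", "surrogateescape")

# Both routes end at the same text, but the direct one needs no round trip.
print(reencode_kludge(broken, "utf-8", "latin-1") == decode_directly(raw, "latin-1"))
```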

Also note that os.setenv() won't work since that'll use the
file system encoding for encoding the value back into the C
process environment array. You'd end up with mojibake in
your C environment array.
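
The write-back problem can be reproduced on plain byte strings. The sketch below assumes the original data is Latin-1 and the file system encoding is UTF-8; write_back() is a hypothetical stand-in for whatever encodes a value into the process environment.

```python
def write_back(text, fs_encoding="utf-8"):
    """Hypothetical stand-in: encode a str the way a setter using the
    file system encoding would before storing it in the C environment."""
    return text.encode(fs_encoding, "surrogateescape")


original = b"caf\xe9"                 # Latin-1 bytes in the environment
text = original.decode("latin-1")     # correctly decoded text
stored = write_back(text)             # re-encoded with the wrong encoding

# The stored bytes no longer match the original, so a consumer that still
# expects Latin-1 reads two junk characters instead of one accented letter.
print(stored != original, stored.decode("latin-1"))
```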

The point I want to make is that adding fsencode() and
fsdecode() will help refactor the code a bit, but it
shouldn't be used as an excuse for not making the encoding
explicit.

History
Date	User	Action	Args
2010-04-30 16:25:41	lemburg	set	recipients: + lemburg, loewis, gregory.p.smith, vstinner, ezio.melotti, Arfrever
2010-04-30 16:25:39	lemburg	link	issue8514 messages
2010-04-30 16:25:38	lemburg	create