Author vstinner
Recipients Arfrever, ezio.melotti, gregory.p.smith, lemburg, loewis, vstinner
Date 2010-04-30.16:05:26
Message-id <201004301805.18230.victor.stinner@haypocalc.com>
In-reply-to <4BDAE1FD.6020307@egenix.com>
Content
On Friday 30 April 2010 15:58:28, you wrote:
> It's better to let the application decide how to solve this problem
> and in order to allow for this, the encodings must be adjustable.

On POSIX, use byte strings to avoid encoding issues. Examples:

   subprocess.call(['env'], env={b'TEST': b'a\xff-'}) # env
   subprocess.call(['echo', b'a\xff-']) # command line
   open(b'a\xff-') # filename
   os.getenv(b'a\xff-') # get env (result as unicode)
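A minimal sketch of the environment case above, assuming a POSIX system where os.environb is available. The helper name child_sees_bytes is hypothetical; it only illustrates that a bytes value reaches the child process unchanged, without any encoding step:

```python
import os
import subprocess
import sys


def child_sees_bytes(name: bytes, value: bytes) -> bytes:
    """Run a child Python with one extra bytes variable; return the raw value it sees."""
    # POSIX only: os.environb is the bytes view of the current environment.
    env = dict(os.environb)
    env[name] = value
    # The child writes the raw bytes of the variable to stdout, with no decoding.
    child_code = "import os, sys; sys.stdout.buffer.write(os.environb[%r])" % (name,)
    return subprocess.check_output([sys.executable, "-c", child_code], env=env)


if __name__ == "__main__":
    print(child_sees_bytes(b"TEST", b"a\xff-"))
```

The byte 0xff is not valid UTF-8, so this value could not be passed as a strictly encoded str under a UTF-8 locale; passing bytes sidesteps the question.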

Are you talking about issues on Windows?

> By using fsencode() and fsdecode() in stdlib functions, you basically
> prevent this kind of adjustment, ...

Not if you use byte strings. On POSIX, a Unicode string is always encoded to
bytes just before the system call (using sys.getfilesystemencoding()).
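As a rough illustration of that conversion, here is a sketch of the round trip with the surrogateescape error handler. The helper names to_syscall_bytes and from_syscall_bytes are hypothetical, and the explicit encoding parameter is only there to make the example deterministic; on POSIX, os.fsencode() and os.fsdecode() do the equivalent with the filesystem encoding:

```python
import sys


def to_syscall_bytes(text, encoding=None):
    """Encode a str the way it would be encoded just before a system call."""
    encoding = encoding or sys.getfilesystemencoding()
    # surrogateescape turns lone surrogates U+DC80..U+DCFF back into raw bytes.
    return text.encode(encoding, "surrogateescape")


def from_syscall_bytes(raw, encoding=None):
    """Decode bytes coming from the OS; undecodable bytes become lone surrogates."""
    encoding = encoding or sys.getfilesystemencoding()
    return raw.decode(encoding, "surrogateescape")


if __name__ == "__main__":
    raw = b"a\xff-"
    text = from_syscall_bytes(raw, "utf-8")
    print(ascii(text))
    print(to_syscall_bytes(text, "utf-8") == raw)
```

Because the round trip is lossless, a str obtained from the OS can always be turned back into the original bytes, even when those bytes were not valid in the filesystem encoding.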

> If you know that e.g. your environment variables are going to have
> Latin-1 data (say some content-type variable has this information),
> but the user's default LANG setting is UTF-8, Python will fetch the
> data as broken Unicode data, you then have to convert it back to bytes
> and then back to Unicode using the correct Latin-1 encoding.
> 
> It would be a lot better to have the application provide the
> encoding to the os.getenv() function and have Python do the
> correct decoding right from the start.

You mean that os.getenv() should have an optional argument? Something like:

  def getenv(key, default=None, encoding=None):
     value = environ.get(key, default)
     if encoding and value is not None:
        # recover the original bytes, then decode them with the requested encoding
        value = value.encode(sys.getfilesystemencoding(), 'surrogateescape')
        value = value.decode(encoding, 'surrogateescape')
     return value

There are many indirect calls to os.getenv() (e.g. via os.environ.get()):
 - curses uses TERM
 - webbrowser uses PROGRAMFILES (path)
 - distutils.msvc9compiler uses "VS%0.f0COMNTOOLS" % version (path)
 - wsgiref.util uses HTTP_HOST, SERVER_NAME, SCRIPT_NAME, ... (url)
 - platform uses PROCESSOR_ARCHITEW6432
 - sysconfig uses PYTHONUSERBASE, APPDATA, ... (path)
 - idlelib.PyShell uses IDLESTARTUP and PYTHONSTARTUP (path)
 - ...

How would you specify the correct encoding in indirect calls?

If your application gets variables in *mixed* encodings, I think that your
program should start by re-decoding the affected variables:

  for name, encoding in (('PATH', 'latin1'), ...):
     value = os.getenv(name)
     value = value.encode(sys.getfilesystemencoding(), 'surrogateescape')
     value = value.decode(encoding, 'surrogateescape')
     os.environ[name] = value
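
The same idea as a self-contained sketch that works on a plain dict instead of the real process environment, so it can be tried without side effects. The names redecode and fix_environment are hypothetical, and the filesystem encoding is passed explicitly (here assumed to be UTF-8) rather than read from sys.getfilesystemencoding():

```python
def redecode(value, wrong_encoding, right_encoding):
    """Undo a decode done with wrong_encoding and redo it with right_encoding."""
    # surrogateescape makes the first step lossless: the original bytes come back.
    raw = value.encode(wrong_encoding, "surrogateescape")
    return raw.decode(right_encoding, "surrogateescape")


def fix_environment(environ, encodings, fs_encoding):
    """Return a copy of environ where the listed variables are re-decoded."""
    fixed = dict(environ)
    for name, encoding in encodings.items():
        if name in fixed:  # skip variables that are not set
            fixed[name] = redecode(fixed[name], fs_encoding, encoding)
    return fixed


if __name__ == "__main__":
    # Simulate Latin-1 bytes that were decoded as UTF-8 with surrogateescape.
    broken = b"/caf\xe9".decode("utf-8", "surrogateescape")
    sample = {"PATH": broken, "HOME": "/home/user"}
    print(fix_environment(sample, {"PATH": "latin-1"}, "utf-8"))
```

Variables that are not listed keep their value, which matches the case where only some variables carry a different encoding.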
History
Date                 User      Action  Args
2010-04-30 16:05:29  vstinner  set     recipients: + vstinner, lemburg, loewis, gregory.p.smith, ezio.melotti, Arfrever
2010-04-30 16:05:27  vstinner  link    issue8514 messages
2010-04-30 16:05:26  vstinner  create