classification
Title: os.getenv silently discards env variables with non-UTF-8 values
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.0

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: Rhamphoryncus, a.badger, haypo, loewis
Priority: normal
Keywords:

Created on 2008-10-01 07:17 by a.badger, last changed 2008-12-04 20:55 by Rhamphoryncus. This issue is now closed.

Messages (16)
msg74118 - Author: Toshio Kuratomi (a.badger) * Date: 2008-10-01 07:17
On a Linux system with a locale setting whose encoding is utf-8, if you
set an environment variable to contain a non-utf-8 character, that
environment variable silently does not appear in os.environ::

mkdir ñ
convmv -f utf-8 -t latin-1 --notest ñ
for i in * ; do export PATH=$PATH:$i ; done
echo $PATH
/usr/lib/qt-3.3/bin:/usr/kerberos/bin:/usr/lib/ccache:/usr/local/bin:/usr/bin:/bin:/home/badger/bin:�
python3.0
Python 3.0rc1 (r30rc1:66499, Sep 28 2008, 08:21:09) 
[GCC 4.3.0 20080428 (Red Hat 4.3.0-8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ['PATH']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.0/os.py", line 389, in __getitem__
    return self.data[self.keymap(key)]
KeyError: 'PATH'

I'm uncertain of the impact of this.  It was brought up in a discussion
of sending non-ASCII data to a CGI-WSGI script where the data would be
transferred via os.environ.
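
The failure in the transcript above comes down to a single byte. As a minimal
sketch (assuming a UTF-8 locale, as in the report; try_decode is a
hypothetical helper, not part of Python), the Latin-1 encoding of ñ is not a
valid UTF-8 sequence, so a strict decode fails:

```python
# The directory name from the report, as the bytes a Latin-1 tool would write.
raw_name = "ñ".encode("latin-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xF1 announces a multi-byte UTF-8 sequence that never completes.
as_utf8 = try_decode(raw_name, "utf-8")
# Every single byte is a valid Latin-1 character.
as_latin1 = try_decode(raw_name, "latin-1")

print(repr(raw_name), as_utf8, as_latin1)
```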
msg74138 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-01 18:28
For the moment, this case is just not supported.
msg74151 - Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-02 01:24
It's not a bug, it's a feature! Python3 rejects invalid byte sequences
(according to the "default system encoding") from the command line or
environment variables. listdir(str) will also drop invalid filenames.
Yes, we need a PEP (a FAQ) about invalid byte sequences.
msg74162 - Author: Toshio Kuratomi (a.badger) * Date: 2008-10-02 14:32
It's not a feature it's a bug! :-)  (I hope you meant to have a smiley
too ;-)

As stated in the os.listdir() related bug, on Unix filesystems filenames
are a sequence of bytes.  The system encoding allows the user-level
tools to display the filenames as characters instead of byte sequences
and allows you to manipulate the filenames using characters instead of
byte sequences.  But if you change your locale the user level tools will
interpret the byte sequences as different characters and allow you free
access to create files in a different encoding.

So in order to work correctly on Unix you must be able to accept byte
sequences in place of filenames.

The sad fact of the matter is that while we can be all unicode with data
and strings inside of python we will always have to be prepared to
handle supposed strings as byte sequences when talking to some things
outside of ourselves.  Sometimes the border has a specification that
tells us what encoding to expect and we can do conversion automatically.
But when it doesn't we have to be prepared to 1) tell the user that the
data exists even though it isn't the string type they expected and 2) make
the byte sequence available to the user.

Silently pretending that the data doesn't exist at all is a bug (maybe a
minor bug depending on how often we expect the situation to arise but
still a bug.)
msg74198 - Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-02 22:00
@a.badger: Again, dropping invalid filenames in listdir() is a (very 
recent) choice of the Python3 design. Please read this document which
explains the current situation of bytes vs unicode:
   http://wiki.python.org/moin/Python3UnicodeDecodeError

See also issue3187 and read the long python-dev mailing list thread
about filenames (started a few days ago).

Guido just committed my huge patch to support bytes filenames in the Python3
trunk. So using Python3 final, you will be able to list all files using
os.listdir(b'.') or os.listdir(os.getcwdb()).
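
A minimal sketch of the two listdir() flavours mentioned above: a str argument
yields str names, a bytes argument yields bytes names. The temporary directory
and file name are only illustrative, and os.fsencode() is an assumption here
(it was added to Python after this discussion):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that both listings have something to show.
    open(os.path.join(tmp, "example.txt"), "w").close()

    text_names = os.listdir(tmp)               # str in, str out
    byte_names = os.listdir(os.fsencode(tmp))  # bytes in, bytes out

print(text_names, byte_names)
```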
msg74787 - Author: STINNER Victor (haypo) * (Python committer) Date: 2008-10-15 01:04
See also issue #4126 which is the opposite :-)
msg76297 - Author: Toshio Kuratomi (a.badger) * Date: 2008-11-24 04:34
Pardon, but when you close something as wontfix it's polite to say why.
msg76302 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-11-24 05:57
> Pardon, but when you close something as wontfix it's polite to say why.

Can you propose a reasonable way to fix this? People have thought hard,
and many days, and nobody could propose a reasonable fix. As 3.0 is
going to be released soon, there will be no way to fix it now.
msg76304 - Author: Toshio Kuratomi (a.badger) * Date: 2008-11-24 06:40
Is it a bug?  If so, then it should be retargetted to 3.1 instead of
closed wontfix.  If it's not a bug then there should be an explanation
of why it's not a bug.

As for fixing it there are several inelegant methods that are better
than silently ignoring the problem:

1) return mixed unicode and byte types in os.environ
2) return only byte types in os.environ
3) raise an exception if someone attempts to access an environment
variable that cannot be decoded to unicode via the system encoding and
allow the value to be accessed as a byte string via another method.
4) silently ignore the non-decodable variables when accessing os.environ
the normal way but have another method of accessing it that returns all
values as byte strings.

#4 is closest to what was done with os.listdir().  However, I think that
approach is wrong for os.listdir() and os.environ because it leads to
code that works in simple testing but can start failing mysteriously
when it becomes used in more environments.  The os.listdir() method will
lead to lots of people having to write code that uses the byte methods
on Unix and does its own conversion because it's the only thing
guaranteed to work on Unix and the unicode methods on Windows because
it's the only thing guaranteed to work there.  It degenerates to case #2
except harder to debug and requiring more platform specific knowledge of
the programmer.

#3 seems like the best choice to me as it provides a way for the
programmer to discover what's wrong and provide a fix but people seem to
have learned the wrong lessons from the python2 UnicodeEncode/Decode
problems so that might not have a large following other than me....

#2 is conceptually correct since environment variables are a point where
you're receiving bytes from a non-python environment.  However, it's
very annoying for the common case where everything in the environment
has a single encoding.

#1 is the easiest for simplistic code to deal with but seems to violate
the python3 philosophy the most.  I don't like it as it takes us to one
of the real failings of python2's unicode handling: Not knowing what
type of data you're going to get back from a method and therefore not
knowing if you have to convert it before passing it on.  Please don't do
this one as it's two steps forward and one step backwards from where we
are now.
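
Option 3 above could be sketched roughly as follows. This is a hypothetical
illustration only: StrictEnviron and getbytes() are invented names, not an
existing or proposed Python API, and the sample environment is made up.

```python
class StrictEnviron:
    """Hypothetical str view over a bytes environment (option 3)."""

    def __init__(self, raw, encoding="utf-8"):
        self._raw = dict(raw)  # bytes names mapped to bytes values
        self._encoding = encoding

    def __getitem__(self, key):
        # KeyError is raised only when the variable is really unset.
        raw_value = self._raw[key.encode(self._encoding)]
        # Strict decoding raises UnicodeDecodeError instead of hiding the value.
        return raw_value.decode(self._encoding)

    def getbytes(self, key):
        """The 'other method': return the value without decoding it."""
        return self._raw[key.encode(self._encoding)]


env = StrictEnviron({b"HOME": b"/home/badger", b"PATH": b"/usr/bin:\xf1"})

home = env["HOME"]
try:
    env["PATH"]
    path_error = None
except UnicodeDecodeError as exc:
    path_error = exc  # the programmer is told what went wrong
raw_path = env.getbytes("PATH")
```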
msg76305 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-11-24 06:52
> Is it a bug?

It's not a bug; see my original reply. This case is just not supported.

It may be supported in future versions, but (if it was for me) not
without a PEP first.
msg76308 - Author: Toshio Kuratomi (a.badger) * Date: 2008-11-24 07:07
I'm sorry but "For the moment, this case is just not supported." is not
an explanation of why this is not a bug.  It is a statement that the
interpreter cannot handle a situation that has arisen.

If you said, "We don't believe that any computer has mixed encodings
that can show up in environment variables" that would be an explanation
of why this is not a bug and I could then give counter-examples of
computers that have mixed encodings in their environment variables.  So
what's the reason this is not a bug?
msg76309 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-11-24 07:39
Toshio Kuratomi wrote:
> So what's the reason this is not a bug?

It's a bug only if the implementation deviates from the specification.
In this case, it does not. The behavior is intentional: python
deliberately drops environment variables it cannot represent as a
string. We know that such environment variables can happen in real
life - that's why they get dropped (rather than raising an exception
at startup).
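
The behavior described above can be sketched as follows. This is only an
illustration, not the interpreter's actual code; decode_environ and the sample
values are hypothetical.

```python
def decode_environ(raw, encoding="utf-8"):
    """Build a str mapping from a bytes environment, skipping undecodable entries."""
    decoded = {}
    for key, value in raw.items():
        try:
            decoded[key.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            # Dropped silently: no exception at startup, no entry in the result.
            continue
    return decoded


environ = decode_environ({b"HOME": b"/home/badger", b"PATH": b"/usr/bin:\xf1"})
print(environ)
```

Looking up 'PATH' in the resulting dict raises KeyError, which matches the
traceback in the original report.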
msg76315 - Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-24 10:05
@a.badger: The behaviour (dropping non-decodable strings) is not really a
problem if you configure your program and computer correctly. E.g. you
spoke about CGI-WSGI: if your website also speaks UTF-8, you will be
able to read all environment variables. So this issue is not
important; it only appears when your website/OS is not well
configured. I mean the problem is not in Python but outside Python.
The PATH variable contains directory names; if you have only names
decodable in your filesystem encoding (UTF-8 most of the time), you
will be able to use the PATH variable. If a directory has an
undecodable name, rename the directory but don't try to fix Python!
msg76316 - Author: STINNER Victor (haypo) * (Python committer) Date: 2008-11-24 10:19
The bug tracker is maybe not the right place to discuss a new Python3 feature.

> 1) return mixed unicode and byte types in os.environ

One goal of Python3 was to avoid mixing bytes and characters (bytes/str).

> 2) return only byte types in os.environ

os.environ contains text (characters) and so should be decoded as unicode.

> 3) raise an exception if someone attempts to access an environment
> variable that cannot be decoded to unicode via the system encoding and
> allow the value to be accessed as a byte string via another method.
> 4) silently ignore the non-decodable variables when accessing os.environ
> the normal way but have another method of accessing it that returns all
> values as byte strings.

Why not for (3). But what would be the "another method" (4) to access byte 
string? The problem of having two methods is that you need consistent 
objects.

Imagine that you have os.environ (unicode) and os.environb (bytes).

Example 1:
  os.environb['PATH'] = b'\xff\xff\xff\xff'
What is the value in os.environ['PATH']?

Example 2:
  os.environb['PATH'] = b'têst'
What is the value in os.environ['PATH']?

Example 3:
  os.environ['PATH'] = 'têst'
What is the value in os.environb['PATH']?

Example 4:
 should I use os.environ['PATH'] or os.environb['PATH'] to get the current
 PATH?

It introduces many new cases (bugs?) that have to be prepared and tested. If 
you are motivated, you can contribute a patch to test your ideas ;-) I'm
interested in os.environb, but as I wrote, I expect new complex problems :-/
msg76330 - Author: Toshio Kuratomi (a.badger) * Date: 2008-11-24 15:51
'''
@a.badger: The behaviour (dropping non-decodable strings) is not really a
problem if you configure your program and computer correctly. E.g. you
spoke about CGI-WSGI: if your website also speaks UTF-8, you will be
able to read all environment variables. So this issue is not
important; it only appears when your website/OS is not well
configured. I mean the problem is not in Python but outside Python.
The PATH variable contains directory names; if you have only names
decodable in your filesystem encoding (UTF-8 most of the time), you
will be able to use the PATH variable. If a directory has an
undecodable name, rename the directory but don't try to fix Python!
'''

The idea that having mixed encodings on a system is a misconfiguration 
is a fallacy.

1) In a multiuser setup, each user has a choice of what encoding to use.
So mixed encodings are both possible and valid.

2) In a legacy system, your operating system may have all utf-8 naming
for the core OS but all of the old data files are mounted with
another encoding that the legacy programs on the host expect.

3) On an nfs mount, data may come from users on different machines from
widely separated areas using different system encodings.

4) The same thing as 1-3 but applied to any of the data a site may be
passing via an environment variable rather than just file and directory
names.

5) An application may have to deal with different encodings from the
system default due to limitations of another program.  Since one of
python's many uses is as a glue language, it needs to be able to deal
with these quirks.

6) The application you're interfacing may just be using bytes rather
than text in the environment variables.

Let me put it this way:

If I write a file in a latin-1 encoding and put it on my system that has
a utf-8 system encoding what does python-3 do?

1) If I try to open it as a text file: "open('filename', 'r')" it throws
a UnicodeDecodeError when I attempt to read some non-utf-8 characters
from it.

2) As a programmer I then know to open it as binary "open('filename',
'rb')" and do my own decoding of the data now that I've been made aware
that I must take this corner case into account.
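
The two steps above can be reproduced with a short sketch. The file name and
contents are made up for illustration, and encoding="utf-8" is passed
explicitly so that the result does not depend on the locale of the machine
running it:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "legacy.txt")

    # A file written by a Latin-1 program.
    with open(filename, "wb") as f:
        f.write("señor\n".encode("latin-1"))

    # Step 1: reading it as UTF-8 text fails loudly.
    try:
        with open(filename, "r", encoding="utf-8") as f:
            f.read()
        text_error = None
    except UnicodeDecodeError as exc:
        text_error = exc

    # Step 2: read the bytes and decode them explicitly.
    with open(filename, "rb") as f:
        text = f.read().decode("latin-1")

print(type(text_error).__name__, repr(text))
```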

Some notes:
1) This seems to be the right general procedure to take when handling
things that are usually text but can contain arbitrary bytes.

2) This makes use of python's exception infrastructure to tell the
programmer plainly what's going wrong instead of silently ignoring
values that the programmer may not have encountered in their test data
but could exist in the real world.  Would you rather get a bug report
from a user that says: "FooApp gives me a UnicodeDecodeError traceback
pointing at line 345" (how open() works) or "FooApp never authenticates
me" (which you then have to track down to the fact that the credentials
on the user's system are being passed in an env var and are not in the
system encoding.)

3) This learns the correct lesson from python-2's unicode problems: Stop
the mixture of bytes and unicode at the border so the programmer can be
explicit about how to deal with the odd-ball data there.  It does not
become squeamish about throwing a Unicode Exception which is the wrong
lesson to learn from python-2.

4) It also doesn't refuse to acknowledge that the world outside python
is not as simple and elegant as the world inside python and allows the
programmer to write an interface to that world instead of forcing them
to go outside of python to deal with it.
msg76337 - Author: Toshio Kuratomi (a.badger) * Date: 2008-11-24 16:49
> The bug tracker is maybe not the right place to discuss a new Python3
> feature.

It's a bug!  But if you guys want it to be a feature, then what mailing
list do I need to join?  Is there one devoted to Unicode or is
python-dev where I need to go?

>> 1) return mixed unicode and byte types in os.environ
> One goal of Python3 was to avoid mixing bytes and characters (bytes/str).

As stated in my evaluation of the four options, +1 to this: option #1 takes
us back to the problems encountered in python-2.

>> 2) return only byte types in os.environ
> os.environ contains text (characters) and so should be decoded as unicode.

This is correct but is not accurate :-)  os.environ, the python variable,
contains only unicode because that's the way it's coded.  However, the Unix
environment which os.environ attempts to give access to contains bytes which
are almost always representable as characters.  The two caveats are:

1) There's nothing that constrains it to characters -- putting byte sequences
   that do not include null in the environment is valid.

2) The characters in the environment may be mixed encodings, sometimes due to
   things outside of the user's control.

>> 3) raise an exception if someone attempts to access an environment
>> variable that cannot be decoded to unicode via the system encoding and
>> allow the value to be accessed as a byte string via another method.
>> 4) silently ignore the non-decodable variables when accessing os.environ
>> the normal way but have another method of accessing it that returns all
>> values as byte strings.
>
> Why not for (3).

Do you mean, "I support 3"?  Or did you not finish a thought here?

> But what would be the "another method" (4) to access byte 
> string? The problem of having two methods is that you need consistent 
> objects.

This is exactly the problem I was talking about in my analysis of #4 in the
previous comment.  This problem plagues the new os.listdir() method as well by
introducing a construct that programmers can use that doesn't give all the
information (os.listdir('.')) but also doesn't warn the programmer when the
information is not being shown.

> Imagine that you have os.environ (unicode) and os.environb (bytes).
> 
> Example 1:
>   os.environb['PATH'] = b'\xff\xff\xff\xff'
> What is the value in os.environ['PATH']?

Since option 4 mimics the os.listdir() method, accessing os.environ['PATH']
would give you a KeyError.  i.e., the value was silently dropped just as
os.listdir('.') does.

> Example 2:
>   os.environb['PATH'] = b'têst'
> What is the value in os.environ['PATH']?

This doesn't work in python3 since byte string literals can only contain
ASCII characters.

> Example 3:
>   os.environ['PATH'] = 'têst'
> What is the value in os.environb['PATH']?

Dependent on the default system encoding.  Assuming utf-8 encoding,
os.environb['PATH'] == b't\xc3\xaast'
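
Both answers can be checked without touching the environment at all. A small
sketch, assuming UTF-8 as the encoding as in the answer above (compile() is
used only so that the invalid literal can be tested at run time):

```python
# Example 3: encoding the str value as UTF-8 gives the bytes shown above.
encoded = "têst".encode("utf-8")

# Example 2: a bytes literal containing a non-ASCII character does not compile.
try:
    compile("b'têst'", "<example>", "eval")
    literal_error = None
except SyntaxError as exc:
    literal_error = exc

print(encoded, type(literal_error).__name__)
```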

> Example 4:
>  should I use os.environ['PATH'] or os.environb['PATH'] to get the current
>  PATH?

Should you use os.listdir('.') or os.listdir(b'.') to get the list of files in
the current directory?

This is where treating pathnames, environment variables and so on as strings
instead of bytes becomes non-simple.  Now you have to decide what you really
want to know (and possibly keep two slightly different values if you want to
know two things.)

If you want to keep the path in order to look up commands that the user can
run you want os.environb['PATH'] since this is exactly what the shell will use
when the user types a command at the commandline.

If you want to display the elements of the PATH for the user, you probably
want this::
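  import os
  import sys

  try:
      path = os.environ['PATH'].split(':')
  except KeyError:
      # PATH is either unset or was dropped as undecodable; fall back
      # to the proposed bytes interface, os.environb (bytes keys).
      try:
          temp_path = os.environb[b'PATH'].split(b':')
      except KeyError:
          path = DEFAULT_PATH
      else:
          # Decode for display only, replacing undecodable bytes.
          path = []
          for directory in temp_path:
              path.append(str(directory,
                      sys.getdefaultencoding(), 'replace'))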

> It introduces many new cases (bugs?) that have to be prepared and tested.

Those bugs are *already present*.  Without taking one of the four options,
there's simply no way to code a solution.  Take the above code and imagine
that there's no way to access the user's PATH variable when a
non-default-encoding character is present in the PATH.  That means that you're
always stuck with the value of DEFAULT_PATH instead of being able to display
something reasonable to the user.

(Note, these examples are pretty much the same for option #3 or option #4.  The
value of option #3 becomes apparent when you use os.getenv('PATH') instead of
os.environ['PATH'])
History
Date                 User           Action  Args
2008-12-04 20:55:04  Rhamphoryncus  set     nosy: + Rhamphoryncus
2008-11-24 16:49:16  a.badger       set     messages: + msg76337
2008-11-24 15:51:59  a.badger       set     messages: + msg76330
2008-11-24 10:19:38  haypo          set     messages: + msg76316
2008-11-24 10:05:43  haypo          set     messages: + msg76315
2008-11-24 07:39:46  loewis         set     messages: + msg76309
2008-11-24 07:07:11  a.badger       set     messages: + msg76308
2008-11-24 06:52:45  loewis         set     messages: + msg76305
2008-11-24 06:40:38  a.badger       set     messages: + msg76304
2008-11-24 05:57:09  loewis         set     messages: + msg76302
2008-11-24 04:34:49  a.badger       set     messages: + msg76297
2008-11-23 23:43:06  haypo          set     status: open -> closed; resolution: wont fix
2008-10-15 01:04:40  haypo          set     messages: + msg74787
2008-10-02 22:00:01  haypo          set     messages: + msg74198
2008-10-02 14:32:22  a.badger       set     messages: + msg74162
2008-10-02 01:24:11  haypo          set     nosy: + haypo; messages: + msg74151
2008-10-01 18:28:28  loewis         set     nosy: + loewis; messages: + msg74138; versions: + Python 3.0
2008-10-01 07:17:20  a.badger       create