Message 76330 - Python tracker


Author	a.badger
Recipients	a.badger, loewis, vstinner
Date	2008-11-24.15:51:57
Message-id	<1227541920.31.0.085782862438.issue4006@psf.upfronthosting.co.za>

Content

'''
@a.badger: The behaviour (drop non encodable strings) is not really a 
problem if you configure correctly your program and computer. Eg. you 
spoke about CGI-WSGI: if your website also speak UTF-8, you will be 
able to read all environment variables. So this issue is not 
important, it only appears when your website/OS is not well 
configured. I mean the problem is not in Python but outside Python. 
The PATH variable contains directory names, if you have only names 
encodable in your filesystem encoding (UTF-8 most of the time), you 
will be able to use the PATH variable. If a directory has an non 
decodable name, rename the directory but don't try to fix Python!
'''

The idea that having mixed encodings on a system is a misconfiguration 
is a fallacy.

1) In a multiuser setup, each user has a choice of what encoding to use.
So mixed encodings are both possible and valid.

2) In a legacy system, your operating system may use utf-8 naming for
the core OS while all of the old data files are mounted with another
encoding that the legacy programs on the host expect.

3) On an NFS mount, data may come from users on different machines in
widely separated areas using different system encodings.

4) The same thing as 1-3 but applied to any of the data a site may be
passing via an environment variable rather than just file and directory
names.

5) An application may have to deal with different encodings from the
system default due to limitations of another program.  Since one of
python's many uses is as a glue language, it needs to be able to deal
with these quirks.

6) The application you're interfacing with may just be using bytes
rather than text in the environment variables.
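
As a rough illustration of points 1-3, the sketch below sorts raw byte names into those that decode in a given encoding and those that do not. The helper name classify_names and the sample names are made up for this example; they are not part of Python or of the issue under discussion.

```python
def classify_names(raw_names, encoding="utf-8"):
    """Split raw byte names into (decoded, undecodable) for one encoding.

    Hypothetical helper for illustration only.
    """
    decoded, undecodable = [], []
    for raw in raw_names:
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError:
            # Valid data for some other user or program, just not in our encoding.
            undecodable.append(raw)
    return decoded, undecodable


# The same human-readable name written by two users with different encodings.
sample = [
    b"report.txt",
    "résumé.txt".encode("utf-8"),
    "résumé.txt".encode("latin-1"),
]
names, leftovers = classify_names(sample)
```

On a real system the input could come from os.listdir(b"."), which returns byte names when given a byte path. The point of the sketch is only that the leftover names are legitimate data that a program may still need to reach.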

Let me put it this way:

If I write a file in a latin-1 encoding and put it on my system that has
a utf-8 system encoding, what does python-3 do?

1) If I try to open it as a text file: "open('filename', 'r')" it throws
a UnicodeDecodeError when I attempt to read bytes from it that are not
valid utf-8.

2) As a programmer I then know to open it as binary "open('filename',
'rb')" and do my own decoding of the data now that I've been made aware
that I must take this corner case into account.
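
The two steps above can be sketched as follows. This is a minimal example under stated assumptions: the utf-8 system encoding is simulated by passing encoding="utf-8" explicitly, the fallback to latin-1 is just one possible choice of "my own decoding", and the function names are invented for the illustration.

```python
import os
import tempfile


def read_text_with_fallback(path, primary="utf-8", fallback="latin-1"):
    """Try a strict text read first; on failure, reopen as binary and decode.

    Hypothetical helper; returns (text, encoding_used).
    """
    try:
        with open(path, "r", encoding=primary) as f:
            return f.read(), primary
    except UnicodeDecodeError:
        # Step 2: the exception told us about the corner case, so handle it.
        with open(path, "rb") as f:
            raw = f.read()
        return raw.decode(fallback), fallback


def demo():
    """Write a latin-1 file, then read it back on a 'utf-8 system'."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write("café au lait".encode("latin-1"))
        return read_text_with_fallback(path)
    finally:
        os.remove(path)
```

Calling demo() hits the UnicodeDecodeError on the strict read and then recovers the text through the binary path.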

Some notes:
1) This seems to be the right general procedure to take when handling
things that are usually text but can contain arbitrary bytes.

2) This makes use of python's exception infrastructure to tell the
programmer plainly what's going wrong instead of silently ignoring
values that the programmer may not have encountered in their test data
but could exist in the real world.  Would you rather get a bug report
from a user that says: "FooApp gives me a UnicodeDecodeError traceback
pointing at line 345" (how open() works) or "FooApp never authenticates
me" (which you then have to track down to the fact that the credentials
on the user's system are being passed in an env var and are not in the
system encoding.)

3) This learns the correct lesson from python-2's unicode problems: Stop
the mixture of bytes and unicode at the border so the programmer can be
explicit about how to deal with the odd-ball data there.  It does not
become squeamish about throwing a Unicode exception; becoming squeamish
would be the wrong lesson to learn from python-2.

4) It also acknowledges that the world outside python is not as simple
and elegant as the world inside python, and allows the programmer to
write an interface to that world instead of forcing them to go outside
of python to deal with it.
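
To make note 2 concrete, here is a small sketch contrasting the two behaviours on a simulated environment. The dictionary raw_env, the variable name FOOAPP_PASSWORD, and both function names are hypothetical; this is not how Python's os module is implemented, only a model of "drop silently" versus "raise".

```python
def decode_env_dropping(raw_env, encoding="utf-8"):
    """Model of the silent behaviour: skip entries that do not decode."""
    decoded = {}
    for key, value in raw_env.items():
        try:
            decoded[key.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            continue  # the entry simply disappears
    return decoded


def decode_env_strict(raw_env, encoding="utf-8"):
    """Model of the loud behaviour: let UnicodeDecodeError propagate."""
    return {k.decode(encoding): v.decode(encoding) for k, v in raw_env.items()}


# Simulated environment: the credential is latin-1, the "system" is utf-8.
raw_env = {
    b"HOME": b"/home/user",
    b"FOOAPP_PASSWORD": "contraseña".encode("latin-1"),
}

dropped = decode_env_dropping(raw_env)
try:
    decode_env_strict(raw_env)
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc
```

With the dropping variant the credential is just missing, which is the "FooApp never authenticates me" report. With the strict variant the traceback points at the decode step.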

History
Date	User	Action	Args
2008-11-24 15:52:00	a.badger	set	recipients: + a.badger, loewis, vstinner
2008-11-24 15:52:00	a.badger	set	messageid: <1227541920.31.0.085782862438.issue4006@psf.upfronthosting.co.za>
2008-11-24 15:51:59	a.badger	link	issue4006 messages
2008-11-24 15:51:58	a.badger	create