Author vstinner
Recipients Sworddragon, larry, lemburg, loewis, ncoghlan, pitrou, r.david.murray, terry.reedy, vstinner
Date 2013-12-08.11:37:23
Message-id <CAMpsgwb4Ga1_inPP1-H__O=JG4bwCiHLu-d8VSz7QG31dYX-cA@mail.gmail.com>
In-reply-to <1386501362.54.0.660344740078.issue19846@psf.upfronthosting.co.za>
Content
2013/12/8 Nick Coghlan <report@bugs.python.org>:
> Yes, that's the point. *Every* case I've seen where the locale encoding has been reported as ASCII on a modern Linux system has been because the environment has been configured to use the C locale, and that locale has a silly, antiquated, encoding setting.
>
> This is particularly problematic when people remotely access a system with ssh and get given the C locale instead of something sensible, and then can't properly read the filesystem on that server.

The solution is to fix the locale, not to fix Python. For example,
don't set LANG to C.
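
To check whether a process ends up in the C locale, something like the sketch below can help. The helper names are made up for illustration; the lookup order (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset) is the usual POSIX precedence for the character-type category.

```python
import os


def effective_ctype_locale(env):
    """Return the locale name that would apply to LC_CTYPE.

    Hypothetical helper: mimics the POSIX lookup order
    LC_ALL > LC_CTYPE > LANG, ignoring empty values.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    # Nothing set at all: the implementation default is the C locale.
    return "C"


def is_c_locale(env):
    """True if the environment selects the C (a.k.a. POSIX) locale."""
    return effective_ctype_locale(env) in ("C", "POSIX")


if __name__ == "__main__":
    print(effective_ctype_locale(os.environ), is_c_locale(os.environ))
```

If this reports the C locale over ssh, the thing to fix is the environment on the server (or what the client forwards), not the application.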

From the C locale, you cannot guess the "correct" encoding. In
Unicode, the general rule is to never try to guess the encoding.

> The idea of using UTF-8 instead in that case is to *change* (and hopefully reduce) the number of cases where things go wrong.

If the OS uses ISO-8859-1, forcing Python's (filesystem) encoding to
UTF-8 would produce invalid filenames, display mojibake and, more
generally, produce data incompatible with other applications (which rely
on the C locale, and so on the ASCII encoding).
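
A small illustration of that mismatch, assuming a system whose other applications read and write names as ISO-8859-1 (the filename is just an example):

```python
name = "café.txt"

# Bytes as an ISO-8859-1 system stores and expects the name.
latin1_bytes = name.encode("iso-8859-1")

# Bytes Python would write if its encoding were forced to UTF-8.
utf8_bytes = name.encode("utf-8")

# What an ISO-8859-1 application displays for the UTF-8 name: mojibake.
mojibake = utf8_bytes.decode("iso-8859-1")

# The other direction: existing ISO-8859-1 names are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(repr(mojibake), decodes_as_utf8)
```

The two byte strings differ, so the same "name" refers to two different files depending on which application created it.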

> - there may be other cases where ASCII actually *is* the filesystem encoding (in which case they're going to have trouble anyway), or the real filesystem encoding is something other than UTF-8

As I wrote before, sys.getfilesystemencoding() is *not* the filesystem
encoding. It's the "OS" encoding used to decode any kind of data
coming from the OS and to encode Python data back to the OS. Just
some examples:

- DNS hostnames
- Environment variables
- Command line arguments
- Filenames
- user/group entries in the grp/pwd modules
- almost all functions of the os module, which return various types of
information (ttyname, ctermid, current working directory, login, ...)
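
A quick way to see that one encoding sits behind all of these: os.fsdecode() and os.fsencode() use the same encoding and error handler that Python applies to environment variables, command line arguments and filenames. A minimal sketch (the helper name is invented for the example; sys.getfilesystemencodeerrors() needs Python 3.6 or newer):

```python
import os
import sys

# The single "OS" encoding and its error handler.
encoding = sys.getfilesystemencoding()
errors = sys.getfilesystemencodeerrors()


def os_roundtrip(raw):
    """Decode bytes from the OS to str, then encode them back.

    Hypothetical helper showing the path that OS data takes
    through Python.
    """
    text = os.fsdecode(raw)
    return os.fsencode(text)


# All of these are str, decoded from OS bytes with the same encoding.
cwd = os.getcwd()
args = sys.argv
environ_keys = list(os.environ)

print(encoding, errors, type(cwd).__name__, os_roundtrip(b"plain-ascii"))
```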

> We're already approximating things on Linux by assuming every filesystem is using the *same* encoding, when that's not necessarily the case. Glib applications also assume UTF-8, regardless of the locale (http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux).

If you use a different encoding just for filenames, you will
get mojibake when you pass a filename on the command line or in an
environment variable.
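
To make this concrete, here is a simulation of a filename travelling from the command line to the filesystem. The function and the chosen encodings are assumptions for the example, not what Python does internally: argv is decoded with one encoding and the filename is encoded with another.

```python
def argv_to_disk_name(argv_bytes, os_encoding, filename_encoding):
    """Simulate decoding a command line argument and re-encoding it
    as a filename (hypothetical helper, for illustration only)."""
    text = argv_bytes.decode(os_encoding, "surrogateescape")
    return text.encode(filename_encoding, "surrogateescape")


# Name of an existing file on a system using ISO-8859-1; the shell
# passes exactly these bytes on the command line.
on_disk = "café".encode("iso-8859-1")

# Same encoding on both sides: the bytes survive unchanged.
matched = argv_to_disk_name(on_disk, "iso-8859-1", "iso-8859-1")

# UTF-8 used only for filenames: Python now looks for another name.
mismatched = argv_to_disk_name(on_disk, "iso-8859-1", "utf-8")

print(matched == on_disk, mismatched == on_disk)
```

With the mismatch, opening the file named on the command line fails or creates a second file with a different byte name.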

> At the moment, setting "LANG=C" on a Linux system *fundamentally breaks Python 3*, and that's not OK.

Getting an ASCII filesystem encoding is annoying, but I would not say
that it fundamentally breaks Python 3. If you want to do something,
you should write documentation explaining how to configure Linux
properly.
History
Date User Action Args
2013-12-08 11:37:24 vstinner set recipients: + vstinner, lemburg, loewis, terry.reedy, ncoghlan, pitrou, larry, r.david.murray, Sworddragon
2013-12-08 11:37:24 vstinner link issue19846 messages
2013-12-08 11:37:23 vstinner create