
classification
Title: 'ascii' is a bad filesystem default encoding
Type: enhancement Stage: test needed
Components: Interpreter Core Versions: Python 3.3
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: akira, benjamin.peterson, gz, ncoghlan, pitrou, poolie, r.david.murray, terry.reedy, vila, vstinner
Priority: normal Keywords: patch

Created on 2011-12-20 19:02 by gz, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name: /tmp/filesystem_encoding_utf8.patch
Uploaded: gz, 2011-12-20 19:02
Description: Patch for using utf-8 instead of ascii if given as codeset
Messages (36)
msg149924 - (view) Author: Martin (gz) * Date: 2011-12-20 19:02
Currently when running Python on a non-OSX posix environment under either the C locale, or with an invalid or missing locale, it's not possible to operate using unicode filenames outside the ascii range. Using bytes works, as does reading expecting unicode, using the surrogates hack.

This makes robustly working with non-ascii filenames on different platforms needlessly annoying, given no modern nix should have problems just using UTF-8 in these cases.

See the downstream bzr bug for more:
<https://bugs.launchpad.net/bzr/+bug/794353>

One option is to just use UTF-8 for encoding and decoding filenames when otherwise ascii would be used. As a strict superset, this shouldn't break too many existing assumptions, and it's unlikely that non-UTF-8 filenames will accidentally be mangled due to a locale setting blip. See the attached patch for this behaviour change. It does not include a test currently, but it's possible to write one using subprocess and overridden LANG and LC_ALL vars.
msg149925 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-12-20 19:17
I'm not sure why having a locale set to C or something invalid should be considered a Python bug.  You have to handle un-decodable filenames no matter what you do, since things aren't always encoded in utf-8 on non-OSX unix even when that is the system locale.  It's just something you have to live with.
msg149926 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-20 19:37
> Currently when running Python on a non-OSX posix environment
> under either the C locale, or with an invalid or missing locale,
> it's not possible to operate using unicode filenames outside
> the ascii range.

It was already discussed: using a different encoding for filenames and for other things is really not a good idea. The main problem is the interaction with other programs.

Read discussion of issues #8622, #8775 and #9992.
msg149927 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-20 19:38
> under either the C locale, or with an invalid or missing locale

The right fix is to fix your locale, not Python.
msg149928 - (view) Author: Martin (gz) * Date: 2011-12-20 20:24
> I'm not sure why having a locale set to C or something invalid should be
> considered a Python bug.  You have to handle un-decodable filenames no
> matter what you do, since things aren't always encoded in utf-8 on non-OSX
> unix even when that is the system locale.  It's just something you have to
> live with.

This is more about un-encodable filenames. At the moment, working with non-ascii filenames in Python robustly requires two branches, one using unicode and one that encodes to bytestrings and deals with the case where the name can't be represented in the declared filesystem encoding. That may be something that just has to be lived with, but it's a little annoying when even without a UTF-8 locale for a particular process, that's what most systems will want on disk.
msg149929 - (view) Author: Martin (gz) * Date: 2011-12-20 20:45
> It was already discussed: using a different encoding for filenames and for
> other things is really not a good idea. The main problem is the interaction
> with other programs.

Yes, for many programs, a change like this will mean they create the file, but then throw a traceback anyway when trying to print its name to stdout or something.

> Read discussion of issues #8622, #8775 and #9992.

Thanks. I agree that spreading different values to things like subprocess arguments and the environment is asking for trouble. Just changing how unicode filenames are encoded by default seems safer, though it certainly won't help all code.

> The right fix is to fix your locale, not Python.

I've found that hard to stick to in the face of bug reports where "your locale" turns out to be "the locale used by some cronjob". Fixing my library to work under LANG=C is easier than bugging every downstream project.
msg149938 - (view) Author: Martin Pool (poolie) Date: 2011-12-20 23:53
> I'm not sure why having a locale set to C or something invalid should be considered a Python bug. 

Programs like bzr that hit these problems can tell their users, either in the docs or an error message, "change your locale to a UTF-8 one".

There are two problems with this: one is just the practical one that it scales poorly to have to tell every user to do this and to take them through working out how to set this in a way that covers cron jobs, daemons, things run over ssh, etc.

The other problem is that the locale variables primarily describe the locale for input/output, and that can very reasonably be different from the filesystem encoding.  As a specific common example people may have UTF-8 filenames but want a C locale terminal.  If there was a separate LC_FILENAMES then Python could respect that and insist people set it, but there isn't.
msg149939 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-21 00:01
> If there was a separate LC_FILENAMES then Python could respect
> that and insist people set it, but there isn't.

For one month, we had a PYTHONFSENCODING environment variable. It was not a good idea. Again: please read the discussion (in closed issues) explaining why we removed it (and which problems it introduced).
msg149941 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-21 00:26
> There are two problems with this: one is just the practical
> one that it scales poorly to have to tell every user to do this
> and to take them through working out how to set this in a way
> that covers cron jobs, daemons, things run over ssh, etc.

I never checked which locale is used by default for programs called by cron. So I checked: on Fedora 16, programs start with very few environment variables, and LANG and LC_ALL are not set. You can add "LANG=fr_FR.UTF-8" (for example) to /etc/environment to set the default language for the whole system (for all programs). I checked, and it works with cron. Or, if you don't want to affect all programs, it may be safer to set the locale only for one specific program in your crontab by adding "LANG=fr_FR.UTF-8 " before your command. Example:

* *  *  *  * LANG=fr_FR.UTF-8 /home/haypo/test.sh

--


If you want to handle any filename without having to care of the locale, the simplest solution is to use the bytes type to store filenames.
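
A minimal sketch of the bytes approach, assuming a POSIX system and a scratch directory created just for the illustration (the helper name list_raw_names is hypothetical). Passing a bytes path to os.listdir() returns bytes names, so no locale-dependent decoding happens at all:

```python
import os
import tempfile


def list_raw_names(directory):
    """Return directory entries as raw bytes (directory must be bytes)."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    # A UTF-8 encoded name, written without ever building a str filename.
    raw_name = b"h\xc3\xa9h\xc3\xa9"
    with open(os.path.join(tmp_bytes, raw_name), "wb") as handle:
        handle.write(b"data")
    names = list_raw_names(tmp_bytes)
    exists = os.path.exists(os.path.join(tmp_bytes, raw_name))

print(names, exists)
```

The trade-off is that the program then has to decide for itself how to display such names.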
msg149942 - (view) Author: Martin Pool (poolie) Date: 2011-12-21 00:28
On 21 December 2011 11:01, STINNER Victor <report@bugs.python.org> wrote:
>
> Again: please read the discussion (in closed issues) explaining why we removed it (and which problems it introduced).

There's a lot of history, so I'm not sure exactly which problems
you're referring to.  The main problem I see being discussed is that
changing the encoding after Python starts would be dangerous, which I
agree with, but we're not proposing to do that.
msg149943 - (view) Author: Martin Pool (poolie) Date: 2011-12-21 00:38
On 21 December 2011 11:26, STINNER Victor <report@bugs.python.org> wrote:
> I never checked which locale is used by default for programs called by cron. So I checked: on Fedora 16, programs start with very few environment variables, and LANG and LC_ALL are not set. You can add "LANG=fr_FR.UTF-8" (for example) to /etc/environment to set the default language for the whole system (for all programs). I checked, and it works with cron. Or, if you don't want to affect all programs, it may be safer to set the locale only for one specific program in your crontab by adding "LANG=fr_FR.UTF-8 " before your command. Example:
>
> * *  *  *  * LANG=fr_FR.UTF-8 /home/haypo/test.sh

That is the correct kind of configuration.  When I say it scales
poorly I mean that every user running a Python program on a unicode
system needs to insert this configuration in every relevant place, and
they need to work this out from what is typically a fairly cryptic
message.  (bzr just added a workaround for this, but for other
programs it still exists.)

Also, my other point, is that people may very well want their cron
scripts to send ascii output but cope with unicode filenames.
msg149944 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-21 00:41
> The main problem I see being discussed is that
> changing the encoding after Python starts would
> be dangerous, which I agree with, but we're not
> proposing to do that.

Not after Python start. Using two encodings at the same would just adds new problems. On UNIX (at least on Linux?), it is mandatory to use the same encoding for:

 - command line arguments
 - environment variables
 - filenames
 - and more generally, all data exchanged with the system and other programs

Let's take an example: you use UTF-8 for filenames and ISO-8859-1 for all other data. You want to check if a specific filename is present in your home directory: encode the filename to UTF-8 and read the home directory from the HOME environment variable. But environment variables are decoded from ISO-8859-1, so you have to encode them back to ISO-8859-1 to avoid mojibake (and real bugs, like file not found).

Ok, let's say that filenames and environment variables are UTF-8 and that other data is ISO-8859-1. You would like to play an MP3 using mplayer: you pass the filename encoded to UTF-8 as an argument on the mplayer command line. But mplayer uses ISO-8859-1 to decode its command line (it's not exactly like that, but imagine that it's the case): mplayer will be unable to find your MP3.

etc.

That's why on UNIX there is a single encoding, the locale encoding, and why Python uses the same encoding (called "the filesystem encoding", a name I don't like; see sys.getfilesystemencoding()).

--

It is no longer possible to change the Python filesystem encoding at runtime (I removed sys.setfilesystemencoding()) because I would like to inconsistency. If you decoded a filename before changing the encoding, and then you decode the same filename after changing the encoding, you will get two different names, and encoding the filenames back will give you two different byte sequences (or, more likely, a Unicode encode error).

It was possible to override the filesystem encoding using a PYTHONFSENCODING environment variable, but it introduced all the inconsistencies listed before (especially with external programs).

Now the only right way to change the Python (filesystem) encoding is the UNIX way of doing that: set LC_ALL, LC_CTYPE or LANG environment variable (configure your locale).
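
The inconsistency described in this message can be sketched with explicit codecs. This is only an illustration under assumed values: the byte string stands for a UTF-8 name on disk, and ISO-8859-1 and UTF-8 play the roles of the encoding "before" and "after" a hypothetical runtime change.

```python
# Bytes as stored on disk: a UTF-8 encoded name.
raw_name = b"h\xc3\xa9h\xc3\xa9"

# The same bytes decoded under two different encodings give two different names.
decoded_before = raw_name.decode("iso-8859-1")
decoded_after = raw_name.decode("utf-8")

# Encoding the stale name with the new encoding yields different bytes (mojibake).
reencoded = decoded_before.encode("utf-8")

# Encoding the correctly decoded name with a narrower codec fails outright.
try:
    decoded_after.encode("ascii")
    ascii_result = "ok"
except UnicodeEncodeError:
    ascii_result = "UnicodeEncodeError"

print(ascii(decoded_before), ascii(decoded_after))
print(reencoded, ascii_result)
```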
msg149947 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-21 00:54
I should not write comments so late :-p

> Not after Python start. Using two encodings at the same would just ...

at the same time

 > ... because I would like to inconsistency.

because it would lead to inconsistencies
msg149948 - (view) Author: Martin (gz) * Date: 2011-12-21 01:12
> For one month, we had a PYTHONFSENCODING environment variable. It was not a
> good idea.

I strongly agree. There is no sense in having a separate configurable value, anyone who would think about using a PYTHONFSENCODING should just change their locale instead. However, avoiding the need for manual intervention completely in a relatively narrow set of cases is still useful.

> Not after Python start. Using two encodings at the same would just adds new
> problems. On UNIX (at least on Linux?), it is mandatory to use the same
> encoding for:
>
>  - command line arguments
>  - environment variables
>  - filenames
>  - and more generally, all data exchanged with the system and other programs

Having more than one encoding on unix is already a reality, there's nothing to stop someone setting LANG=de_DE.UTF-8 and LC_MESSAGES=C say.

The real lesson is not that having more than one encoding is dangerous, but that having incompatible encodings is dangerous. As 'ascii' is a strict subset of 'utf-8' the cross process communication issues are greatly lessened, at worst stuff just breaks still.

Expanding the filesystem default encoding to utf-8 should be a very narrow change, mostly just affecting io and os operations. Other actions involving paths will still break if a non-ascii string is used, but without the possibility of mangling data.
msg149949 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 01:16
So, you're complaining about something which works, kind of:

$ touch héhé
$ LANG=C python3 -c "import os; print(os.listdir())"
['h\udcc3\udca9h\udcc3\udca9']
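
As a rough illustration of where that output comes from, the sketch below decodes the UTF-8 bytes of the name with the ascii codec and the surrogateescape error handler, which is assumed here to match what Python does for filenames under LANG=C. Each undecodable byte 0xNN becomes the lone surrogate U+DCNN, and the original bytes remain recoverable:

```python
# UTF-8 bytes of the filename created by "touch".
raw_name = "h\u00e9h\u00e9".encode("utf-8")

# Decode as ASCII, escaping each non-ASCII byte as a lone surrogate.
listed = raw_name.decode("ascii", "surrogateescape")

# The escaping is reversible, so the exact bytes can be recovered...
recovered = listed.encode("ascii", "surrogateescape")

# ...and decoded properly once the real encoding is known.
readable = recovered.decode("utf-8")

print(ascii(listed))
print(recovered, ascii(readable))
```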

> This makes robustly working with non-ascii filenames on different
> platforms needlessly annoying, given no modern nix should have problems
> just using UTF-8 in these cases.

So why don't these supposedly "modern" systems at least set the appropriate environment variables for Python to infer the proper character encoding?
(since these "modern" systems don't have a well-defined encoding...)

Answer: because they are not modern at all, they are antiquated, inadapted and obsolete pieces of software designed and written by clueless Anglo-American people. Please report bugs against these systems. The culprit is not Python, it's the Unix crap and the utterly clueless attitude of its maintainers ("filesystems are just bytes", yeah, whatever...).
msg149950 - (view) Author: Martin Pool (poolie) Date: 2011-12-21 01:18
Thanks for the example.

Like you say, realistically, all data exchanged with other programs
and with the system needs to be in the same encoding.  (User document
content may be in something else.)

On modern systems, this problem is solved by making the standard
encoding UTF-8.  So it is unfortunate that, when no locale is set,
Python3 defaults to ascii for the filesystem.

With no locale set, python3 makes getdefaultencoding() utf-8, so it
seems oddly pessimistic to make the fsencoding only ascii.

If someone really wants to run everything in iso-8859-1 this patch
would not stop them doing so.
msg149951 - (view) Author: Martin Pool (poolie) Date: 2011-12-21 01:36
On 21 December 2011 12:16, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> So, you're complaining about something which works, kind of:
>
> $ touch héhé
> $ LANG=C python3 -c "import os; print(os.listdir())"
> ['h\udcc3\udca9h\udcc3\udca9']

It's possible to work around this in some cases, such as listdir, by
coping with the result including some byte strings, and then manually
decoding them.  But there are, iirc, other cases where the call just
fails and there is no easy workaround.

It wasn't impossible to get unicode right in python2, but python3
still thinks it's worth changing things to make it work better.

>> This makes robustly working with non-ascii filenames on different
>> platforms needlessly annoying, given no modern nix should have problems
>> just using UTF-8 in these cases.
>
> So why don't these supposedly "modern" systems at least set the appropriate environment variables for Python to infer the proper character encoding?
> (since these "modern" systems don't have a well-defined encoding...)

The standard encoding is UTF-8.  Python shouldn't need to have a
variable set to tell it this.  Python is making an assumption about
the default but it is a bad assumption.

> The culprit is not Python, it's the Unix crap....

Programs need to work with the environments that are available to
them, even though those environments often have flaws.  Windows and
Mac have annoying bugs too, even bugs specifically about Unicode.
msg149952 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 01:41
> The standard encoding is UTF-8.

How so? I don't know of any Linux or Unix spec which says so. If you get
the Linux heads to standardize this then I'll certainly be very happy
(and countless others will, too). But AFAIK this is not the case and I
don't see why you are asking Python to make a choice that OS vendors
refuse to make. You are certainly asking the wrong project to solve this
problem. So I'd rather not solve your problem at the Python level so
that you instead try to get it solved at the right (OS) level.
msg150031 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-21 18:27
> Having more than one encoding on unix is already a reality, there's nothing to stop someone setting LANG=de_DE.UTF-8 and LC_MESSAGES=C say.

Nope. The locale encoding is chosen using LC_ALL, LC_CTYPE or LANG 
variable: use the first non-empty variable. LC_MESSAGES doesn't affect 
the encoding. Example:

$ LANG=de_DE.iso88591 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('ISO-8859-1', "'Trop de fichiers ouverts dans le syst\\xe8me'")

$ LANG=de_DE.UTF-8 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('UTF-8', "'Trop de fichiers ouverts dans le syst\\xc3\\xa8me'")
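
The precedence rule stated above (first non-empty of LC_ALL, LC_CTYPE, LANG) can be sketched as a small pure function. The function name is hypothetical, and the fallback to "C" when nothing is set reflects the POSIX default rather than anything specific to Python:

```python
def effective_ctype_variable(env):
    """Return (variable, value) for the setting that decides the locale encoding.

    LC_MESSAGES is deliberately not consulted: it selects the message
    language, not the character encoding.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            return var, value
    return None, "C"


print(effective_ctype_variable({"LANG": "de_DE.UTF-8", "LC_MESSAGES": "C"}))
print(effective_ctype_variable({"LC_ALL": "", "LC_CTYPE": "fr_FR.UTF-8", "LANG": "C"}))
print(effective_ctype_variable({}))
```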

 > The real lesson is not that having more than one encoding
 > is dangerous, but that having incompatible encodings is dangerous.

Yes, and ASCII and UTF-8 are incompatible. ASCII is unable to decode a 
UTF-8 encoded string.

 > Expanding the filesystem default encoding to utf-8
 > should be a very narrow change, mostly just affecting io
 > and os operations.

It affects everything because filenames are used everywhere.

 > On modern systems, this problem is solved by making the
 > standard encoding UTF-8.  So it is unfortunate that, when
 > no locale is set, Python3 defaults to ascii for the filesystem.

Python doesn't invent an encoding: ASCII is the result of 
nl_langinfo(CODESET). Example:

$ python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
UTF-8
$ LANG=C python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
ANSI_X3.4-1968
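
For readers unfamiliar with the codeset name above: ANSI_X3.4-1968 is simply another name for ASCII. A small sketch using the standard codecs registry (the helper name is hypothetical) shows how Python maps such a codeset string to one of its own codecs:

```python
import codecs


def python_codec_name(codeset):
    """Map a codeset name, as reported by nl_langinfo(CODESET), to Python's codec name."""
    return codecs.lookup(codeset).name


c_locale_codec = python_codec_name("ANSI_X3.4-1968")
utf8_codec = python_codec_name("UTF-8")
print(c_locale_codec, utf8_codec)
```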

 >> $ LANG=C python3 -c "import os; print(os.listdir())"
 >> ['h\udcc3\udca9h\udcc3\udca9']

 > It's possible to work around this in some cases, such as listdir,
 > by coping with the result including some byte strings, and then
 > manually decoding them.  But there are, iirc, other cases where
 > the call just fails and there is no easy workaround.

In Python 3, os.listdir(str) *CANNOT* fail because of a Unicode decode 
error thanks to PEP 383. In Python 2, it works differently (it returns 
the raw bytes filename if decoding fails).

 > Windows and Mac have annoying bugs too, even bugs specifically
 > about Unicode.

Windows has supported Unicode since Windows 95 and has fully supported all 
Unicode characters since Windows 2000.

Mac enforces UTF-8. For example, it is not possible to *create* a 
file whose name is invalid UTF-8. It looks like it always uses UTF-8 on 
the command line.

On Linux, we cannot rely on anything except the locale encoding. We 
try to use Unicode APIs when possible (e.g. use wcsftime() instead of 
strftime()), but almost all functions use byte strings and so rely on the 
locale encoding.
msg150039 - (view) Author: Martin (gz) * Date: 2011-12-21 19:47
> Nope. The locale encoding is chosen using LC_ALL, LC_CTYPE or LANG 
> variable: use the first non-empty variable. LC_MESSAGES doesn't affect 
> the encoding. Example:

That's good to know, thanks. Only leaves the case where setlocale is called again with a different value.

> Yes, and ASCII and UTF-8 are incompatible. ASCII is unable to decode an 
> UTF-8 encoded string.

I think we're envisioning different things here.

  os.stat("\u2601") # with LANG=C
    current -> UnicodeEncodeError
    changed -> works if utf-8 encoded file exists

  os.listdir() # with LANG=C
    current -> returns non-ascii as unicode with funky surrogates
    changed -> returns non-utf-8 as unicode with funky surrogates
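
The comparison above can be approximated without touching the filesystem. This is a sketch only: the two helpers are hypothetical stand-ins for what Python does with filenames, parameterised by the filesystem encoding, with "ascii" modelling the current LANG=C behaviour and "utf-8" the proposed one.

```python
def encode_filename(name, fs_encoding):
    """Encode a str filename the way it would be passed to the OS."""
    return name.encode(fs_encoding, "surrogateescape")


def decode_filename(raw, fs_encoding):
    """Decode a bytes filename the way os.listdir() would return it."""
    return raw.decode(fs_encoding, "surrogateescape")


# os.stat("\u2601"): fails under ascii, encodes cleanly under utf-8.
try:
    encode_filename("\u2601", "ascii")
    current_stat = "encoded"
except UnicodeEncodeError:
    current_stat = "UnicodeEncodeError"
changed_stat = encode_filename("\u2601", "utf-8")

# os.listdir(): a UTF-8 name is escaped under ascii but readable under utf-8,
# while a non-UTF-8 (here Latin-1) name is escaped in both cases.
current_utf8_name = decode_filename(b"\xe2\x98\x81", "ascii")
changed_utf8_name = decode_filename(b"\xe2\x98\x81", "utf-8")
current_latin1_name = decode_filename(b"caf\xe9", "ascii")
changed_latin1_name = decode_filename(b"caf\xe9", "utf-8")

print(current_stat, changed_stat)
print(ascii(current_utf8_name), ascii(changed_utf8_name))
print(ascii(current_latin1_name), ascii(changed_latin1_name))
```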

> It affects everything because filenames are used everywhere.

But currently everything handling filenames as unicode on nix needs to worry about surrogates (that can't be encoded as ascii) already, or it will still be passing values that can't be interpreted by other processes as you highlighted earlier. Making utf-8 names come out correctly rather than as surrogates doesn't seem like it increases the burden.
msg150040 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-21 20:04
> it will still be passing values that can't be
> interpreted by other processes as you highlighted earlier.

On UNIX, data going outside Python has to be encoded: you pass byte strings, not directly Unicode. Surrogates are encoded back to the original bytes.

Example:

>>> b'a\xff'.decode('ascii', 'surrogateescape')
'a\udcff'
>>> b'a\xff'.decode('ascii', 'surrogateescape').encode('ascii', 'surrogateescape')
b'a\xff'
msg150050 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-12-21 22:54
> But currently everything handling filenames as unicode on
> nix needs to worry about surrogates (that can't be encoded
> as ascii) already, or it will still be passing values that
> can't be interpreted by other processes as you highlighted
> earlier. Making utf-8 names come out correctly rather than
> as surrogates doesn't seem like it increases the burden.

And that is *exactly* the problem.  You can't assume that those other programs are expecting utf-8 on unix.  The only thing you have to go by is the locale.  So that's what we use.  And as Haypo pointed out, unless you manipulate it, file system stuff gets turned back into the same bytes when it exits Python, so pre-existing stuff should work fine.

Now, if posix (or a given unix platform, like OS X did) would say "utf-8 is the standard filesystem and program interchange encoding", we could change Python.  Short of that, it is our experience that using anything other than locale leads to more problems than using locale does.
msg150052 - (view) Author: Martin Pool (poolie) Date: 2011-12-21 23:02
On 21 December 2011 12:41, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> The standard encoding is UTF-8.
>
> How so? I don't know of any Linux or Unix spec which says so. If you get
> the Linux heads to standardize this then I'll certainly be very happy
> (and countless others will, too). But AFAIK this is not the case and I
> don't see why you are asking Python to make a choice that OS vendors
> refuse to make. You are certainly asking the wrong project to solve this
> problem.

It is a de facto, not de jure standard: UTF-8 is how things are
typically stored.  Other software (eg gnome file handling utilities)
makes this assumption.  See eg
<http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

I would be happy to see an authoritative document saying this is how
things _should_ be stored, but I can't find one yet.  But in Unix
there are no ultimate authorities: even if someone announced filenames
are utf-8 there will obviously continue to be many machines where in
practice they are not.

I started asking about it over here, to see if at least Ubuntu can
have an opinion that this is how things should normally be:
https://lists.ubuntu.com/archives/ubuntu-devel/2011-December/034588.html

I'm not sure what you expect a technical solution at the OS level
would look like.  The api is 8-bit strings and that's not likely to
change.  It's possible to have a situation where no locale is
specified.  Applications unavoidably need to have some opinion about
what to do there.  Other applications assume the filenames are utf-8.
Python assumes that text in general will be UTF-8
(getdefaultencoding).

It is almost like your caricature of OS developers as being
anglocentric, but in fact here it's Python that assumes everything is
probably ascii - or more charitably, it is just assuming that failing
when things aren't ascii is the best tradeoff.  Maybe it is.

One OS-level fix is to try to reduce the number of situations where
people see no locale, or the C locale, and give them C.UTF-8 instead.
That is probably worth doing.  But having no locale can still happen,
and I think Python could handle that better, so the changes are
complementary.
msg150053 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-12-21 23:26
> It is a de facto, not de jure standard: UTF-8 is how things are
> typically stored.  Other software (eg gnome file handling utilities)
> makes this assumption.  See eg
> <http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

So should we specifically detect Linux? And under which conditions? When
the encoding is detected to be "ASCII"?

> But in Unix
> there are no ultimate authorities: even if someone announced filenames
> are utf-8 there will obviously continue to be many machines where in
> practice they are not.

POSIX is kind of an authority. Freedesktop.org could be another. LSB yet
another.
(all with different scopes obviously)

> I'm not sure what you expect a technical solution at the OS level
> would look like.

It doesn't need to be technical. It could just be a convention (all
filesystem paths, and other user-visible text such as environment
variables etc., are utf-8 encoded).
Although enforcing it technically would of course be safer.

> That is probably worth doing.  But having no locale can still happen,
> and I think Python could handle that better, so the changes are
> complementary.

How do you detect "no locale"?
msg150056 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-22 00:21
This discussion is becoming very long, I don't remember the original 
purpose. You want to use UTF-8 instead of ASCII, so what? What do you 
want to do with your nicely well decoded filenames? You cannot print it 
to your terminal nor pass it to a subprocess, because your terminal uses 
ASCII, as subprocess. I don't see how it would help you.

Thanks to PEP 383, Python 3 "just works" with an ASCII locale 
encoding. You can list the content of a directory and display a filename 
to your terminal: it will be displayed correctly (even if the terminal 
uses the correct encoding, UTF-8, whereas Python has an empty 
environment and uses ASCII); you can also pass the filename to a 
subprocess: the other program will be able to open the file.

I don't understand what problem you are trying to solve.

On 22/12/2011 00:02, Martin Pool wrote:
> It is a de facto, not de jure standard: UTF-8 is how things are
> typically stored.

For your information, on FreeBSD, Solaris and Mac OS X, the "C" locale 
uses the ISO-8859-1 encoding, whereas on Linux it uses the "ASCII" 
encoding. There is no such "de facto standard". Each platform uses a 
different encoding and handles codecs differently.

> Other software (eg gnome file handling utilities)
> makes this assumption.  See eg
> <http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

The Qt library (and so KDE) and the glib library (and so Gtk and Gnome) 
also use the locale encoding to encode and decode filenames.

The glib has a useful g_get_filename_charsets() function that tries other 
encodings to format a filename correctly.

> I'm not sure what you expect a technical solution at the OS level
> would look like.  The api is 8-bit strings and that's not likely to
> change.

Mac OS X kept the old legacy bytes API, but the kernel enforces valid 
UTF-8 names for filenames. This is a good start to move forward to 
Unicode. On such a system, we can make some assumptions. On Linux, we 
cannot make such assumptions today.
msg150058 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-22 00:50
>> Nope. The locale encoding is chosen using LC_ALL, LC_CTYPE or LANG 
>> variable: use the first non-empty variable. LC_MESSAGES doesn't affect 
>> the encoding. Example:
>
> That's good to know, thanks. Only leaves the case where setlocale
> is called again with a different value.

You mean changing the current locale encoding using setlocale(LC_CTYPE)? It doesn't affect the encoding used by Python for filenames (and other OS data). It is a design choice, but also mandatory to avoid mojibake. It was possible in Python 3.1 to set the filesystem encoding, but it doesn't solve any problem, whereas it leads to mojibake in most (or all?) cases. A very important property is: os.fsencode(os.fsdecode(name)) == name. It fails if the result of os.fsdecode(name) was stored before the encoding was changed.

Few C functions are affected by the locale encoding: strerror() and strftime() (tell me if there are others!). Python 3.2 used the filesystem encoding (that is, the locale encoding read at startup) for them, but it was wrong. I fixed this issue recently: #13560 (see also #13619).
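
The round-trip property mentioned above, os.fsencode(os.fsdecode(name)) == name, can be checked directly. This is a minimal sketch that assumes a POSIX platform, where the filesystem error handler is surrogateescape; the sample names are made up for the illustration and include one byte sequence that is not valid UTF-8.

```python
import os

# Sample byte names: plain ASCII, valid UTF-8, and a byte that is invalid in UTF-8.
sample_names = [b"plain.txt", b"h\xc3\xa9h\xc3\xa9", b"a\xff"]

# Each name should survive decoding to str and encoding back to bytes unchanged.
round_trips = [os.fsencode(os.fsdecode(raw)) == raw for raw in sample_names]

print(round_trips)
```

The property only holds while the encoding stays fixed for the lifetime of the process, which is the reason given above for not allowing it to change.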
msg150061 - (view) Author: Martin Pool (poolie) Date: 2011-12-22 01:16
On 22 December 2011 11:21, STINNER Victor <report@bugs.python.org> wrote:
> This discussion is becoming very long, I don't remember the original
> purpose.

The proposal is that in some cases where Python currently assumes
filenames are ascii on Linux, it ought to instead assume they are
utf-8.

> You want to use UTF-8 instead of ASCII, so what? What do you
> want to do with your nicely well decoded filenames? You cannot print it
> to your terminal nor pass it to a subprocess, because your terminal uses
> ASCII, as subprocess. I don't see how it would help you.

When the application has a unicode string, it can always encode itself
in whatever way it thinks most appropriate.  For instance if it is a
network service, the locale in which it was started may be entirely
irrelevant to the encoding it wants to talk to a particular peer.

However, there are or were some Python filesystem APIs where it is
very hard for the application to avoid being limited to the encoding
Python assumes at startup.  Also, for good reasons, the application
cannot change the filesystem encoding once it starts.  So the reason
for proposing a patch to Python is that there is no way for the
application to escape, once Python's assumed all names will be ascii.

It may be that all of those limitations have since been fixed
separately, either through pep383 or separate patches, so the
application at least has a chance to work around it.  It would be nice
to not burden the application or user with working around this when
the filenames really are valid in what should be the user's locale,
but perhaps this is the OS's fault for not having the right locale
configured.
msg150062 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-22 01:32
On 22/12/2011 02:16, Martin Pool wrote:
> The proposal is that in some cases where Python currently assumes
> filenames are ascii on Linux, it ought to instead assume they are
> utf-8.

Oh, I expected a use case describing the problem, not the proposed 
solution :-)

>> You want to use UTF-8 instead of ASCII, so what? What do you
>> want to do with your nicely well decoded filenames? You cannot print it
>> to your terminal nor pass it to a subprocess, because your terminal uses
>> ASCII, as subprocess. I don't see how it would help you.
>
> When the application has a unicode string,

Where does this string come from? (It is an important question).

If your locale encoding is ASCII, you cannot write such non-ASCII 
filenames using the keyboard for example.

> with working around this when the filenames really are
> valid in what should be the user's locale,

On your computer, UTF-8 is maybe a good candidate for "what should be 
the user's locale", but you cannot generalize for all computers.

I also wanted to force UTF-8 everywhere, but you cannot do that or your 
program will just not work in some configurations.
msg150066 - (view) Author: Martin Pool (poolie) Date: 2011-12-22 01:50
On 22 December 2011 12:32, STINNER Victor <report@bugs.python.org> wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> On 22/12/2011 02:16, Martin Pool wrote:
>> The proposal is that in some cases where Python currently assumes
>> filenames are ascii on Linux, it ought to instead assume they are
>> utf-8.
>
> Oh, I expected a use case describing the problem, not the proposed
> solution :-)

The problem as I see it is this:

On Linux, filenames are generally (but not always) in UTF-8; people
fairly commonly end up with no locale configured, which causes Python
to decode filenames as ascii.  It is easy for this to end up with them
hitting UnicodeErrors.

>>> You want to use UTF-8 instead of ASCII, so what? What do you
>>> want to do with your nicely well decoded filenames? You cannot print it
>>> to your terminal nor pass it to a subprocess, because your terminal uses
>>> ASCII, as does subprocess. I don't see how it would help you.
>>
>> When the application has a unicode string,
>
> Where does this string come from? (It is an important question).

It comes, for example, from the name of a file, or a directory, or the
contents of a symlink.  Or the problem applies equally when the
program has got a unicode string (for example off the network in a
defined encoding) and it is trying to use it to access the filesystem.

> If your locale encoding is ASCII, you cannot write such non-ASCII
> filenames using the keyboard for example.

Sure you can.  The user could enter a backslash-escaped name, which
the program knows to decode to unicode.  The point is the program has
a choice of how it deals with user input, whereas it does not have as
much control in Python of how filenames are encoded.

>  > with working around this when the filenames really are
>  > valid in what should be the user's locale,
>
> On your computer, UTF-8 is maybe a good candidate for "what should be
> the user's locale", but you cannot generalize for all computers.
>
> I also wanted to force UTF-8 everywhere, but you cannot do that or your
> program will just not work in some configurations.

Just to be clear, I'm not proposing to force UTF-8 everywhere.  I am
only proposing to 'break' the case where the user has non-ascii
filenames but, intentionally or not, a locale that specifies only
ascii is used.  With this change, Python will try to decode them as
utf-8, and fail if they're not utf-8.

I am coming to think the best step here is just for the OS to do more
to make sure the application does get the appropriate locale.  (For
example, Ubuntu in recent releases uses a pam hook to set LANG for
cron jobs, to avoid the example described above.)
msg150067 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-22 02:15
> The problem as I see it is this:
>
> On Linux, filenames are generally (but not always) in UTF-8; people
> fairly commonly end up with no locale configured, which causes Python
> to decode filenames as ascii.  It is easy for this to end up with them
> hitting UnicodeErrors.

I don't think your problem is decoding filenames, but encoding them.

>> Where does this string come from? (It is an important question).
>
> It comes, for example, from the name of a file, or a directory, or the
> contents of a symlink.

For all these cases, Python is able to decode them (but stores
undecodable bytes as surrogates, per PEP 383).
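
As an illustration of the PEP 383 behaviour mentioned here, the following is a minimal sketch using the "surrogateescape" error handler directly. The byte string is made up for the example (it is "hé.txt" encoded as UTF-8), and ASCII is chosen as the codec only to mimic the C locale case discussed in this issue.

```python
# Bytes as they might come back from the OS: "hé.txt" encoded as UTF-8.
raw = "h\xe9.txt".encode("utf-8")

# Decoding with the ASCII codec and surrogateescape does not fail:
# each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF.
name = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec and handler restores the original bytes,
# so the name can still be passed back to the filesystem unchanged.
roundtrip = name.encode("ascii", "surrogateescape")

print(ascii(name))
print(roundtrip == raw)
```

The decoded string is not nicely printable, but it round-trips, which is why reading such names works even when the locale encoding is ASCII.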

> Or the problem applies equally when the
> program has got a unicode string (for example off the network in a
> defined encoding) and it is trying to use it to access the filesystem.

Hum, you can have the problem if you try to decompress a ZIP containing
a Unicode filename. ZIP stores filenames as cp437 or UTF-8 depending on
a flag (well, it's not exact: some buggy tools store filenames in a
different encoding, the Windows ANSI code page...). If you try to
decompress a ZIP containing non-ASCII filenames stored as UTF-8, whereas
your locale encoding is ASCII, you will get a UnicodeEncodeError.
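
To make the flag concrete, here is a small sketch (not part of the original report) that writes an in-memory archive with the standard zipfile module and reads the entry back. The member name and content are invented for the example; 0x800 is bit 11 of the general purpose flags, which marks a UTF-8 filename. Nothing is extracted to disk, so the locale encoding does not come into play here.

```python
import io
import zipfile

# Build a small archive in memory with one non-ASCII member name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("h\xe9.txt", "unicode!")

# Read it back and inspect how the name was stored.
with zipfile.ZipFile(buffer) as archive:
    info = archive.infolist()[0]
    stored_name = info.filename
    # Bit 11 (0x800) of the flags says the filename is UTF-8, not cp437.
    utf8_flag_set = bool(info.flag_bits & 0x800)

print(ascii(stored_name), utf8_flag_set)
```

The UnicodeEncodeError described above would only appear at extraction time, when the str name has to be encoded with the filesystem encoding.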

I would suggest fixing your environment: if you want to play with
non-ASCII filenames, you should first fix your locale. Otherwise, other
programs will also fail because of your locale.

(Maybe something could be done in the zipfile module to allow creating
files using the original raw bytes filename. See also issues #10614
and #10972.)

>> If your locale encoding is ASCII, you cannot write such non-ASCII
>> filenames using the keyboard for example.
>
> Sure you can.  The user could enter a backslash-escaped name, which
> the program knows to decode to unicode.

How exactly? Users do not usually write backslash-escaped names. Users
prefer to click on icons :-)

> with user input, whereas it does not have as
> much control in Python of how filenames are encoded.

Ah? The application *can* control how filenames are encoded. Example:

Create a UTF-8 filename with a UTF-8 locale encoding.

$ python3
Python 3.2.1 (default, Jul 11 2011, 18:54:42)
 >>> import locale; print(locale.getpreferredencoding())
UTF-8
 >>> f=open("hé.txt", "w"); f.write("unicode!"); f.close()

Read the file content, even if the locale encoding is ASCII.

$ LANG=C python3
Python 3.2.1 (default, Jul 11 2011, 18:54:42)
 >>> import locale; print(locale.getpreferredencoding())
ANSI_X3.4-1968
 >>> f=open("h\xe9.txt", "r"); print(f.read()); f.close()
Traceback (most recent call last):
   ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)
 >>> f=open("h\xe9.txt".encode("utf-8"), "r"); print(f.read()); f.close()
unicode!

You cannot pass "h\xe9.txt" directly, but if you know the "correct" file
system encoding, you can encode it explicitly using str.encode("utf-8").
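
The same idea can be written as a script that does not depend on the locale it runs under. This is only a sketch: utf8_path is a hypothetical helper name, the temporary directory stands in for wherever the files really live, and it assumes the names on disk are UTF-8, which is the very assumption under discussion in this issue.

```python
import os
import tempfile


def utf8_path(directory, name):
    """Hypothetical helper: build a bytes path with the name encoded as UTF-8."""
    # os.fsencode() turns the directory into bytes using the filesystem
    # encoding; the file name itself is encoded explicitly as UTF-8.
    return os.path.join(os.fsencode(directory), name.encode("utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = utf8_path(tmp, "h\xe9.txt")
    # open() accepts bytes paths, so no implicit encoding of the name happens.
    with open(path, "w") as f:
        f.write("unicode!")
    with open(path, "r") as f:
        content = f.read()

print(content)
```

Passing bytes keeps the application in control of the encoding, at the cost of having to carry bytes paths around wherever the name is used.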

You are trying to do something complex (add hacks for filenames, for a
specific configuration) for a simple problem: configuring locales
correctly. If you know and you are sure that you are using UTF-8, why not
simply set your locale to a UTF-8 locale?
msg150068 - (view) Author: Martin Pool (poolie) Date: 2011-12-22 02:32
On 22 December 2011 13:15, STINNER Victor <report@bugs.python.org> wrote:
> You cannot pass "h\xe9.txt" directly, but if you know the "correct" file system encoding, you can encode it explicitly using str.encode("utf-8").

My recollection was that there were some cases where you couldn't do
this, but perhaps I was wrong or perhaps they're all fixed in
python3.x, or at least perhaps they are better fixed as individual
bugs.  gz may know more.

> You are trying to do something complex (add hacks for filenames, for a specific configuration) for a simple problem: configuring locales correctly.

I think you may be right.

> If you know and you are sure that you are using UTF-8, why not
> simply set your locale to a UTF-8 locale?

_My_ locale is set properly.  The problem is all the other people in
the world who do not have their locale set to match their files on
disk; telling them each to fix it is tedious.  But perhaps the OS is
the best place to address that, when the incorrect locale is just
accidental rather than unavoidable.
msg150069 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-12-22 04:05
> _My_ locale is set properly.  The problem is all the other
> people in the world who do not have their locale set to match
> their files on disk; telling them each to fix it is tedious.
> But perhaps the OS is the best place to address that, when the
> incorrect locale is just accidental rather than unavoidable.

I fixed my locale back before my OS fully supported doing so.
It was painful, but it was *so* worth it.  There were many
tools that just worked better after I did that, and several
tools that I had to convince to use utf-8 through non-standard
means.

So I think Python is doing the right thing by using the locale
(the Standard Way), and that getting the OS vendors and/or
the users to fix their locale settings is indeed the right place
to fix this.
msg150204 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-12-24 03:01
Martin, after reading almost all of the unusually large sequence of messages, I am closing this because three of the core developers with the most experience in this area are dead-set against your proposal. That does not make it 'wrong', but does mean that it will not be approved and implemented without new data and more persuasive arguments than those presented so far. I do not see that continued repetition of what has been said so far will change anything.
msg150215 - (view) Author: Martin Pool (poolie) Date: 2011-12-24 06:24
Terry, that's fine.  Thanks to everyone who contributed to the discussion.
msg283718 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-21 04:44
Also see http://bugs.python.org/issue28180 for a more recent proposal to tackle this by coercing the C locale to the C.UTF-8 locale.
msg308601 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-12-19 01:01
Follow-up: PEP 538 (bpo-28180) and PEP 540 (bpo-29240) have been accepted and implemented in Python 3.7!
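
For readers who want to check what their own interpreter ends up using, a small sketch follows. encoding_report is a hypothetical helper name; the sys and locale calls are standard, and sys.flags.utf8_mode only exists on Python 3.7 and later, hence the getattr fallback.

```python
import locale
import sys


def encoding_report():
    """Hypothetical helper: collect the encodings this interpreter uses."""
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        "locale_encoding": locale.getpreferredencoding(False),
        # utf8_mode exists from Python 3.7 on; fall back to 0 on older versions.
        "utf8_mode": getattr(sys.flags, "utf8_mode", 0),
    }


report = encoding_report()
for key, value in sorted(report.items()):
    print(key, "=", value)
```

Running it once normally and once with LANG=C should show whether the C locale case from this issue still yields an ASCII filesystem encoding on a given installation.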
History
Date | User | Action | Args
2022-04-11 14:57:24 | admin | set | github: 57852
2017-12-19 01:01:26 | vstinner | set | messages: + msg308601
2016-12-21 04:44:05 | ncoghlan | set | nosy: + ncoghlan; messages: + msg283718
2011-12-24 06:24:44 | poolie | set | messages: + msg150215
2011-12-24 03:01:38 | terry.reedy | set | status: open -> closed; type: behavior -> enhancement; nosy: + terry.reedy; messages: + msg150204; resolution: rejected; stage: test needed
2011-12-22 10:45:03 | akira | set | nosy: + akira
2011-12-22 04:05:35 | r.david.murray | set | messages: + msg150069
2011-12-22 02:32:20 | poolie | set | messages: + msg150068
2011-12-22 02:15:55 | vstinner | set | messages: + msg150067
2011-12-22 01:50:35 | poolie | set | messages: + msg150066
2011-12-22 01:32:52 | vstinner | set | messages: + msg150062
2011-12-22 01:16:54 | poolie | set | messages: + msg150061
2011-12-22 00:50:18 | vstinner | set | messages: + msg150058
2011-12-22 00:21:26 | vstinner | set | messages: + msg150056
2011-12-21 23:26:11 | pitrou | set | messages: + msg150053
2011-12-21 23:02:31 | poolie | set | messages: + msg150052
2011-12-21 22:54:42 | r.david.murray | set | messages: + msg150050
2011-12-21 20:04:43 | vstinner | set | messages: + msg150040
2011-12-21 19:47:04 | gz | set | messages: + msg150039
2011-12-21 18:27:38 | vstinner | set | messages: + msg150031
2011-12-21 08:17:00 | vila | set | nosy: + vila
2011-12-21 01:41:26 | pitrou | set | messages: + msg149952
2011-12-21 01:36:01 | poolie | set | messages: + msg149951
2011-12-21 01:18:08 | poolie | set | messages: + msg149950
2011-12-21 01:16:21 | pitrou | set | nosy: + pitrou; messages: + msg149949
2011-12-21 01:12:29 | gz | set | messages: + msg149948
2011-12-21 00:54:39 | vstinner | set | messages: + msg149947
2011-12-21 00:41:45 | vstinner | set | messages: + msg149944
2011-12-21 00:38:19 | poolie | set | messages: + msg149943
2011-12-21 00:28:57 | poolie | set | messages: + msg149942
2011-12-21 00:26:03 | vstinner | set | messages: + msg149941
2011-12-21 00:01:54 | vstinner | set | messages: + msg149939
2011-12-20 23:53:28 | poolie | set | nosy: + poolie; messages: + msg149938
2011-12-20 20:45:11 | gz | set | messages: + msg149929
2011-12-20 20:24:43 | gz | set | type: behavior; messages: + msg149928
2011-12-20 19:38:54 | vstinner | set | messages: + msg149927
2011-12-20 19:37:40 | vstinner | set | nosy: + vstinner; messages: + msg149926
2011-12-20 19:17:07 | r.david.murray | set | nosy: + r.david.murray; messages: + msg149925
2011-12-20 19:02:21 | gz | create |