Issue 13643: 'ascii' is a bad filesystem default encoding

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57852

classification

Title:	'ascii' is a bad filesystem default encoding
Type:	enhancement	Stage:	test needed
Components:	Interpreter Core	Versions:	Python 3.3

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	akira, benjamin.peterson, gz, ncoghlan, pitrou, poolie, r.david.murray, terry.reedy, vila, vstinner
Priority:	normal	Keywords:	patch

Created on 2011-12-20 19:02 by gz, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
/tmp/filesystem_encoding_utf8.patch	gz, 2011-12-20 19:02	Patch for using utf-8 instead of ascii if given as codeset	review

Messages (36)
msg149924 - (view)	Author: Martin (gz) *	Date: 2011-12-20 19:02
Currently when running Python on a non-OSX posix environment under either the C locale, or with an invalid or missing locale, it's not possible to operate using unicode filenames outside the ascii range. Using bytes works, as does reading expecting unicode, using the surrogates hack. This makes robustly working with non-ascii filenames on different platforms needlessly annoying, given no modern *nix should have problems just using UTF-8 in these cases. See the downstream bzr bug for more: <https://bugs.launchpad.net/bzr/+bug/794353>

One option is to just use UTF-8 for encoding and decoding filenames when otherwise ascii would be used. As a strict superset, this shouldn't break too many existing assumptions, and it's unlikely that non-UTF-8 filenames will accidentally be mangled due to a locale setting blip. See the attached patch for this behaviour change. It does not include a test currently, but it's possible to write one using subprocess and overridden LANG and LC_ALL vars.
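The test mentioned at the end of the message above could take roughly the following shape. This is a minimal sketch, not the test from the attached patch: `fs_encoding_under` is a hypothetical helper, and the value it reports depends on the Python version and platform, so no expected output is shown.

```python
import os
import subprocess
import sys


def fs_encoding_under(locale_name):
    """Return sys.getfilesystemencoding() as reported by a child
    interpreter started with LANG and LC_ALL set to `locale_name`."""
    env = dict(os.environ)
    env.pop("LC_CTYPE", None)      # LC_ALL takes precedence anyway; dropped for clarity
    env["LANG"] = locale_name
    env["LC_ALL"] = locale_name
    out = subprocess.check_output(
        [sys.executable, "-c",
         "import sys; print(sys.getfilesystemencoding())"],
        env=env,
    )
    # Codec names are plain ASCII, so decoding the child's output is safe.
    return out.decode("ascii").strip()
```

Calling `fs_encoding_under("C")` shows which encoding a given interpreter picks for the C locale; a real test would compare that value against whatever behaviour is being proposed.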
msg149925 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2011-12-20 19:17
I'm not sure why having a locale set to C or something invalid should be considered a Python bug. You have to handle un-decodable filenames no matter what you do, since things aren't always encoded in utf-8 on non-OSX unix even when that is the system locale. It's just something you have to live with.
msg149926 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-20 19:37
> Currently when running Python on a non-OSX posix environment
> under either the C locale, or with an invalid or missing locale,
> it's not possible to operate using unicode filenames outside
> the ascii range.

It was already discussed: using a different encoding for filenames and for other things is really not a good idea. The main problem is the interaction with other programs. Read the discussion of issues #8622, #8775 and #9992.
msg149927 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-20 19:38
> under either the C locale, or with an invalid or missing locale The right fix is to fix your locale, not Python.
msg149928 - (view)	Author: Martin (gz) *	Date: 2011-12-20 20:24
> I'm not sure why having a locale set to C or something invalid should be
> considered a Python bug. You have to handle un-decodable filenames no
> matter what you do, since things aren't always encoded in utf-8 on non-OSX
> unix even when that is the system locale. It's just something you have to
> live with.

This is more about un-encodable filenames. At the moment, working with non-ascii filenames in Python robustly requires two branches, one using unicode and one that encodes to bytestrings and deals with the case where the name can't be represented in the declared filesystem encoding. That may be something that just has to be lived with, but it's a little annoying when even without a UTF-8 locale for a particular process, that's what most systems will want on disk.
msg149929 - (view)	Author: Martin (gz) *	Date: 2011-12-20 20:45
> It was already discussed: using a different encoding for filenames and for
> other things is really not a good idea. The main problem is the interaction
> with other programs.

Yes, for many programs, a change like this will mean they create the file, but then throw a traceback anyway when trying to print its name to stdout or something.

> Read discussion of issues #8622, #8775 and #9992.

Thanks. I agree that spreading different values to things like subprocess arguments and the environment is asking for trouble. Just changing how unicode filenames are encoded by default seems safer, though it certainly won't help all code.

> The right fix is to fix your locale, not Python.

I've found that hard to stick to in the face of bug reports where "your locale" turns out to be "the locale used by some cronjob". Fixing my library to work under LANG=C is easier than bugging every downstream project.
msg149938 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-20 23:53
> I'm not sure why having a locale set to C or something invalid should be considered a Python bug. Programs like bzr that hit these problems can tell their users, either in the docs or an error message, "change your locale to a UTF-8 one". There are two problems with this: one is just the practical one that it scales poorly to have to tell every user to do this and to take them through working out how to set this in a way that covers cron jobs, daemons, things run over ssh, etc. The other problem is that the locale variables primarily describe the locale for input/output, and that can very reasonably be different from the filesystem encoding. As a specific common example people may have UTF-8 filenames but want a C locale terminal. If there was a separate LC_FILENAMES then Python could respect that and insist people set it, but there isn't.
msg149939 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-21 00:01
> If there was a separate LC_FILENAMES then Python could respect
> that and insist people set it, but there isn't.

For one month, we had a PYTHONFSENCODING environment variable. It was not a good idea. Again: please read the discussion (in closed issues) explaining why we removed it (and which problems it introduced).
msg149941 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-21 00:26
> There are two problems with this: one is just the practical
> one that it scales poorly to have to tell every user to do this
> and to take them through working out how to set this in a way
> that covers cron jobs, daemons, things run over ssh, etc.

I never checked which locale is used by default for programs called by cron. So I checked: on Fedora 16, programs start with very few environment variables, and LANG and LC_ALL are not set. You can add "LANG=fr_FR.UTF-8" (for example) to /etc/environment to set the default language for the whole system (for all programs). I checked, it works with cron.

Or if you don't want to affect all programs, it is maybe safer to only set the locale for one specific program in your crontab by adding "LANG=fr_FR.UTF-8 " before your command. Example:

* * * * * LANG=fr_FR.UTF-8 /home/haypo/test.sh

--

If you want to handle any filename without having to care about the locale, the simplest solution is to use the bytes type to store filenames.
msg149942 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-21 00:28
On 21 December 2011 11:01, STINNER Victor <report@bugs.python.org> wrote:
> Again: please read the discussion (in closed issues) explaining why we removed it (and which problems it introduced).

There's a lot of history, so I'm not sure exactly which problems you're referring to. The main problem I see being discussed is that changing the encoding after Python starts would be dangerous, which I agree with, but we're not proposing to do that.
msg149943 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-21 00:38
On 21 December 2011 11:26, STINNER Victor <report@bugs.python.org> wrote:
> I never checked which locale is used by default for programs called by cron. So I checked: on Fedora 16, programs start with very few environment variables, and LANG and LC_ALL are not set. You can add "LANG=fr_FR.UTF-8" (for example) to /etc/environment to set the default language for the whole system (for all programs). I checked, it works with cron. Or if you don't want to affect all programs, it is maybe safer to only set the locale for one specific program in your crontab by adding "LANG=fr_FR.UTF-8 " before your command. Example:
>
> * * * * * LANG=fr_FR.UTF-8 /home/haypo/test.sh

That is the correct kind of configuration. When I say it scales poorly I mean that every user running a Python program on a unicode system needs to insert this configuration in every relevant place, and they need to work this out from what is typically a fairly cryptic message. (bzr just added a workaround for this, but for other programs it still exists.)

Also, my other point is that people may very well want their cron scripts to send ascii output but cope with unicode filenames.
msg149944 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-21 00:41
> The main problem I see being discussed is that
> changing the encoding after Python starts would
> be dangerous, which I agree with, but we're not
> proposing to do that.

Not after Python start. Using two encodings at the same would just adds new problems. On UNIX (at least on Linux?), it is mandatory to use the same encoding for:

- command line arguments
- environment variables
- filenames
- and more generally, all data exchanged with the system and other programs

Let's take an example: you use UTF-8 for filenames and ISO-8859-1 for all other data. You want to check if a specific filename is present in your home directory: encode the filename to UTF-8 and read the home directory from the HOME environment variable. But environment variables are decoded from ISO-8859-1, so you have to encode them back to ISO-8859-1 to avoid mojibake (and real bugs, like file not found).

Ok, let's say that filenames and environment variables are UTF-8 and that other data are ISO-8859-1. You would like to play an MP3 using mplayer: you pass the filename encoded to UTF-8 as an argument on the mplayer command line. But mplayer uses ISO-8859-1 to decode its command line (it's not exactly like that, but imagine that it's the case): mplayer will be unable to find your MP3. etc.

That's why on UNIX there is one unique encoding, the locale encoding, and why Python uses the same encoding (called "the filesystem encoding"; I don't like this name; see sys.getfilesystemencoding()).

--

It is no longer possible to change the Python filesystem encoding at runtime (I removed sys.setfilesystemencoding()) because I would like to inconsistency. If you decoded a filename before changing the encoding, and then you decode the same filename after changing the encoding, you will get two different names, and encoding the filenames back will give you two different byte sequences (or, more likely, a Unicode encode error).

It was possible to override the filesystem encoding using a PYTHONFSENCODING environment variable, but it introduced all the inconsistencies listed before (especially with external programs). Now the only right way to change the Python (filesystem) encoding is the UNIX way of doing that: set the LC_ALL, LC_CTYPE or LANG environment variable (configure your locale).
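The mplayer scenario above can be illustrated without any external program. A minimal sketch, assuming one side writes UTF-8 and the other reads ISO-8859-1; the filename is made up for the illustration:

```python
name = "h\u00e9h\u00e9.mp3"                  # 'héhé.mp3'

# Bytes produced by a program that encodes filenames as UTF-8.
utf8_bytes = name.encode("utf-8")

# What a program decoding those same bytes as ISO-8859-1 would display.
misread = utf8_bytes.decode("iso-8859-1")

# Bytes a Latin-1 program would use when looking the name up on disk.
latin1_bytes = name.encode("iso-8859-1")

# The two byte sequences differ, so the lookup targets a different
# (probably nonexistent) file.
names_match = utf8_bytes == latin1_bytes
```

Here `misread` is mojibake, and `names_match` is false, which is the "file not found" case described in the message.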
msg149947 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-21 00:54
I should not write comments so late :-p

> Not after Python start. Using two encodings at the same would just

... at the same time

> ... because I would like to inconsistency.

because it would lead to inconsistencies
msg149948 - (view)	Author: Martin (gz) *	Date: 2011-12-21 01:12
> During 1 month, we had PYTHONFSENCODING environment variable. It was not a
> good idea.

I strongly agree. There is no sense in having a separate configurable value; anyone who would think about using a PYTHONFSENCODING should just change their locale instead. However, avoiding the need for manual intervention completely in a relatively narrow set of cases is still useful.

> Not after Python start. Using two encodings at the same would just adds new
> problems. On UNIX (at least on Linux?), it is mandatory to use the same
> encoding for:
>
> - command line arguments
> - environment variables
> - filenames
> - and more generally, all data exchanged with the system and other programs

Having more than one encoding on unix is already a reality, there's nothing to stop someone setting LANG=de_DE.UTF-8 and LC_MESSAGES=C say. The real lesson is not that having more than one encoding is dangerous, but that having incompatible encodings is dangerous. As 'ascii' is a strict subset of 'utf-8' the cross process communication issues are greatly lessened; at worst stuff just breaks still.

Expanding the filesystem default encoding to utf-8 should be a very narrow change, mostly just affecting io and os operations. Other actions involving paths will still break if a non-ascii string is used, but without the possibility of mangling data.
msg149949 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 01:16
So, you're complaining about something which works, kind of:

$ touch héhé
$ LANG=C python3 -c "import os; print(os.listdir())"
['h\udcc3\udca9h\udcc3\udca9']

> This makes robustly working with non-ascii filenames on different
> platforms needlessly annoying, given no modern nix should have problems
> just using UTF-8 in these cases.

So why don't these supposedly "modern" systems at least set the appropriate environment variables for Python to infer the proper character encoding? (since these "modern" systems don't have a well-defined encoding...)

Answer: because they are not modern at all, they are antiquated, inadapted and obsolete pieces of software designed and written by clueless Anglo-American people. Please report bugs against these systems. The culprit is not Python, it's the Unix crap and the utterly clueless attitude of its maintainers ("filesystems are just bytes", yeah, whatever...).
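The `os.listdir()` output in the message above can be reproduced without touching the filesystem. A minimal sketch, assuming the name is stored on disk as UTF-8 and decoded as ASCII with the `surrogateescape` error handler; `recover_utf8` is a hypothetical helper, not part of the standard library:

```python
# Bytes stored on disk for the file created by `touch héhé` on a UTF-8 system.
raw = "h\u00e9h\u00e9".encode("utf-8")

# Decoding as ASCII with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
escaped = raw.decode("ascii", "surrogateescape")


def recover_utf8(name):
    """Turn a surrogate-escaped name back into bytes, then decode as UTF-8.

    Raises UnicodeDecodeError if the original bytes are not valid UTF-8.
    """
    return name.encode("ascii", "surrogateescape").decode("utf-8")
```

The escaped string is lossless: encoding it back with the same codec and error handler yields the original bytes, so the application can still reinterpret them if it knows the real encoding.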
msg149950 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-21 01:18
Thanks for the example. Like you say, realistically, all data exchanged with other programs and with the system needs to be in the same encoding. (User document content may be in something else.) On modern systems, this problem is solved by making the standard encoding UTF-8. So it is unfortunate that, when no locale is set, Python3 defaults to ascii for the filesystem. With no locale set, python3 makes getdefaultencoding() utf-8, so it seems oddly pessimistic to make the fsencoding only ascii. If someone really wants to run everything in iso-8859-1 this patch would not stop them doing so.
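The two settings contrasted in the message above can be inspected side by side. A minimal sketch; `encodings` is a hypothetical helper, and the filesystem value depends on the platform, the locale and the Python version:

```python
import sys


def encodings():
    """Report the default string encoding and the filesystem encoding."""
    return {
        # In Python 3 this does not depend on the locale.
        "default": sys.getdefaultencoding(),
        # Derived from the locale at interpreter startup on most Unix systems.
        "filesystem": sys.getfilesystemencoding(),
    }
```

Running it once normally and once with `LANG=C` in the environment shows whether the two values diverge on a given setup.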
msg149951 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-21 01:36
On 21 December 2011 12:16, Antoine Pitrou <report@bugs.python.org> wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > So, you're complaining about something which works, kind of: > > $ touch héhé > $ LANG=C python3 -c "import os; print(os.listdir())" > ['h\udcc3\udca9h\udcc3\udca9'] It's possible to work around this in some cases, such as listdir, by coping with the result including some byte strings, and then manually decoding them. But there are, iirc, other cases where the call just fails and there is no easy workaround. It wasn't impossible to get unicode right in python2, but python3 still thinks it's worth changing things to make it work better. >> This makes robustly working with non-ascii filenames on different >> platforms needlessly annoying, given no modern nix should have problems >> just using UTF-8 in these cases. > > So why don't these supposedly "modern" systems at least set the appropriate environment variables for Python to infer the proper character encoding? > (since these "modern" systems don't have a well-defined encoding...) The standard encoding is UTF-8. Python shouldn't need to have a variable set to tell it this. Python is making an assumption about the default but it is a bad assumption. > The culprit is not Python, it's the Unix crap.... Programs need to work with the environments that are available to them, even though those environments often have flaws. Windows and Mac have annoying bugs too, even bugs specifically about Unicode.
msg149952 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 01:41
> The standard encoding is UTF-8.

How so? I don't know of any Linux or Unix spec which says so. If you get the Linux heads to standardize this then I'll certainly be very happy (and countless others will, too). But AFAIK this is not the case and I don't see why you are asking Python to make a choice that OS vendors refuse to make. You are certainly asking the wrong project to solve this problem.

So I'd rather not solve your problem at the Python level so that you instead try to get it solved at the right (OS) level.
msg150031 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-21 18:27
> Having more than one encoding on unix is already a reality, there's nothing to stop someone setting LANG=de_DE.UTF-8 and LC_MESSAGES=C say.

Nope. The locale encoding is chosen using LC_ALL, LC_CTYPE or LANG variable: use the first non-empty variable. LC_MESSAGES doesn't affect the encoding. Example:

$ LANG=de_DE.iso88591 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('ISO-8859-1', "'Trop de fichiers ouverts dans le syst\\xe8me'")
$ LANG=de_DE.UTF-8 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('UTF-8', "'Trop de fichiers ouverts dans le syst\\xc3\\xa8me'")

> The real lesson is not that having more than one encoding
> is dangerous, but that having incompatible encodings is dangerous.

Yes, and ASCII and UTF-8 are incompatible. ASCII is unable to decode a UTF-8 encoded string.

> Expanding the filesystem default encoding to utf-8
> should be a very narrow change, mostly just affecting io
> and os operations.

It affects everything because filenames are used everywhere.

> On modern systems, this problem is solved by making the
> standard encoding UTF-8. So it is unfortunate that, when
> no locale is set, Python3 defaults to ascii for the filesystem.

Python doesn't invent an encoding: ASCII is the result of nl_langinfo(CODESET). Example:

$ python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
UTF-8
$ LANG=C python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
ANSI_X3.4-1968

>> $ LANG=C python3 -c "import os; print(os.listdir())"
>> ['h\udcc3\udca9h\udcc3\udca9']
> It's possible to work around this in some cases, such as listdir,
> by coping with the result including some byte strings, and then
> manually decoding them. But there are, iirc, other cases where
> the call just fails and there is no easy workaround.

In Python 3, os.listdir(str) CANNOT fail because of a Unicode decode error thanks to PEP 383. In Python 2, it works differently (it returns the raw bytes filename if decoding fails).

> Windows and Mac have annoying bugs too, even bugs specifically
> about Unicode.

Windows has supported Unicode since Windows 95 and has fully supported all Unicode characters since Windows 2000. Mac enforces UTF-8. For example, it is not possible to create a file whose name is invalid UTF-8. It looks like it always uses UTF-8 on the command line.

On Linux, we cannot rely on anything except the locale encoding. We try to use Unicode APIs when possible (e.g. use wcsftime() instead of strftime()), but almost all functions use byte strings and so rely on the locale encoding.
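The lookup order described at the start of the message above (LC_ALL, then LC_CTYPE, then LANG, first non-empty wins) can be written down as a small function. This is a sketch of the rule only: `effective_ctype_setting` is a hypothetical helper that inspects a mapping and never calls `setlocale()`.

```python
def effective_ctype_setting(env):
    """Return (variable, value) for the variable that decides the
    locale encoding, or (None, "C") if none of them is set.

    `env` is any mapping of environment variable names to values,
    for example os.environ or a plain dict.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:                  # unset and empty are both skipped
            return var, value
    return None, "C"


# LC_MESSAGES is never consulted, matching the first example above.
example = effective_ctype_setting(
    {"LANG": "de_DE.iso88591", "LC_MESSAGES": "fr_FR.UTF-8"}
)
```

The fallback to `"C"` reflects the cron case discussed earlier in the thread, where none of the three variables is present.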
msg150039 - (view)	Author: Martin (gz) *	Date: 2011-12-21 19:47
> Nope. The locale encoding is chosen using LC_ALL, LC_CTYPE or LANG
> variable: use the first non-empty variable. LC_MESSAGES doesn't affect
> the encoding. Example:

That's good to know, thanks. Only leaves the case where setlocale is called again with a different value.

> Yes, and ASCII and UTF-8 are incompatible. ASCII is unable to decode an
> UTF-8 encoded string.

I think we're envisioning different things here.

os.stat("\u2601") # with LANG=C
    current -> UnicodeEncodeError
    changed -> works if utf-8 encoded file exists

os.listdir() # with LANG=C
    current -> returns non-ascii as unicode with funky surrogates
    changed -> returns non-utf-8 as unicode with funky surrogates

> It affects everything because filenames are used everywhere.

But currently everything handling filenames as unicode on nix needs to worry about surrogates (that can't be encoded as ascii) already, or it will still be passing values that can't be interpreted by other processes as you highlighted earlier. Making utf-8 names come out correctly rather than as surrogates doesn't seem like it increases the burden.
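The "current" and "changed" rows above come down to which codec is paired with the `surrogateescape` error handler. A minimal sketch at the codec level, with no filesystem access; `encode_filename` and `decode_filename` are hypothetical helpers standing in for what the interpreter does internally:

```python
def encode_filename(name, encoding):
    """Encode a str filename the way the interpreter would for `encoding`."""
    return name.encode(encoding, "surrogateescape")


def decode_filename(raw, encoding):
    """Decode a bytes filename the way the interpreter would for `encoding`."""
    return raw.decode(encoding, "surrogateescape")


# "current": ASCII cannot represent U+2601, so encoding fails.
try:
    encode_filename("\u2601", "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# "changed": UTF-8 encodes it to three bytes.
utf8_name = encode_filename("\u2601", "utf-8")

# Decoding direction: only bytes invalid in the chosen codec become surrogates.
as_ascii = decode_filename(utf8_name, "ascii")
as_utf8 = decode_filename(utf8_name, "utf-8")
not_utf8 = decode_filename(b"a\xff", "utf-8")
```

With ASCII every non-ASCII byte turns into a surrogate; with UTF-8 only bytes that are not valid UTF-8 do, which is the distinction drawn in the `os.listdir()` rows.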
msg150040 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-21 20:04
> it will still be passing values that can't be
> interpreted by other processes as you highlighted earlier.

On UNIX, data going outside Python has to be encoded: you pass byte strings, not Unicode directly. Surrogates are encoded back to the original bytes. Example:

>>> b'a\xff'.decode('ascii', 'surrogateescape')
'a\udcff'
>>> b'a\xff'.decode('ascii', 'surrogateescape').encode('ascii', 'surrogateescape')
b'a\xff'
msg150050 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2011-12-21 22:54
> But currently everything handling filenames as unicode on
> nix needs to worry about surrogates (that can't be encoded
> as ascii) already, or it will still be passing values that
> can't be interpreted by other processes as you highlighted
> earlier. Making utf-8 names come out correctly rather than
> as surrogates doesn't seem like it increases the burden.

And that is exactly the problem. You can't assume that those other programs are expecting utf-8 on unix. The only thing you have to go by is the locale. So that's what we use. And as Haypo pointed out, unless you manipulate it, file system stuff gets turned back into the same bytes when it exits Python, so pre-existing stuff should work fine.

Now, if posix (or a given unix platform, like OS X did) would say "utf-8 is the standard filesystem and program interchange encoding", we could change Python. Short of that, it is our experience that using anything other than the locale leads to more problems than using the locale does.
msg150052 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-21 23:02
On 21 December 2011 12:41, Antoine Pitrou <report@bugs.python.org> wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > >> The standard encoding is UTF-8. > > How so? I don't know of any Linux or Unix spec which says so. If you get > the Linux heads to standardize this then I'll certainly be very happy > (and countless others will, too). But AFAIK this it not the case and I > don't see why you are asking Python to make a choice that OS vendors > refuse to make. You are certainly asking the wrong project to solve this > problem. It is a de facto, not de jure standard: UTF-8 is how things are typically stored. Other software (eg gnome file handling utilities) makes this assumption. See eg <http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>. I would be happy to see an authoritative document saying this is how things _should_ be stored, but I can't find one yet. But in Unix there are no ultimate authorities: even if someone announced filenames are utf-8 there will obviously continue to be many machines where in practice they are not. I started asking about it over here, to see if at least Ubuntu can have an opinion that this is how things should normally be: https://lists.ubuntu.com/archives/ubuntu-devel/2011-December/034588.html I'm not sure what you expect a technical solution at the OS level would look like. The api is 8-bit strings and that's not likely to change. It's possible to have a situation where no locale is specified. Applications unavoidably need to have some opinion about what to do there. Other applications assume the filenames are utf-8. Python assumes that text in general will be UTF-8 (getdefaultencoding). It is almost like your caricature of OS developers as being anglocentric, but in fact here it's Python that assumes everything is probably ascii - or more charitably, it is just assuming that failing when things aren't ascii is the best tradeoff. Maybe it is. 
One OS-level fix is to try to reduce the number of situations where people see no locale, or the C locale, and give them C.UTF-8 instead. That is probably worth doing. But having no locale can still happen, and I think Python could handle that better, so the changes are complementary.
msg150053 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-12-21 23:26
> It is a de facto, not de jure standard: UTF-8 is how things are > typically stored. Other software (eg gnome file handling utilities) > makes this assumption. See eg > <http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>. So should we specifically detect Linux? And under which conditions? When the encoding is detected to be "ASCII"? > But in Unix > there are no ultimate authorities: even if someone announced filenames > are utf-8 there will obviously continue to be many machines where in > practice they are not. POSIX is kind of an authority. Freedesktop.org could be another. LSB yet another. (all with different scopes obviously) > I'm not sure what you expect a technical solution at the OS level > would look like. It doesn't need to be technical. It could just be a convention (all filesystem paths, and other user-visible text such as environment variables etc., are utf-8 encoded). Although enforcing it technically would of course be safer. > That is probably worth doing. But having no locale can still happen, > and I think Python could handle that better, so the changes are > complimentary. How do you detect "no locale"?
msg150056 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-22 00:21
This discussion is becoming very long, and I don't remember the original purpose.

You want to use UTF-8 instead of ASCII, so what? What do you want to do with your nicely decoded filenames? You cannot print them to your terminal nor pass them to a subprocess, because your terminal uses ASCII, as does the subprocess. I don't see how it would help you.

Thanks to PEP 383, Python 3 "just works" with an ASCII locale encoding. You can list the content of a directory and display a filename on your terminal: it will be displayed correctly (even if the terminal uses the correct encoding, UTF-8, whereas Python has an empty environment and uses ASCII); you can also pass the filename to a subprocess: the other program will be able to open the file. I don't understand what problem you are trying to solve.

On 22/12/2011 00:02, Martin Pool wrote:
> It is a de facto, not de jure standard: UTF-8 is how things are
> typically stored.

For your information, on FreeBSD, Solaris and Mac OS X, the "C" locale uses the ISO-8859-1 encoding, whereas on Linux it uses the "ASCII" encoding. There is no such "de facto standard". Each platform uses a different encoding and handles codecs differently.

> Other software (eg gnome file handling utilities)
> makes this assumption. See eg
> <http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

The Qt library (and so KDE) and the glib library (and so Gtk and Gnome) also use the locale encoding to encode and decode filenames. glib has a useful g_get_filename_charsets() function, which tries other encodings to format a filename correctly.

> I'm not sure what you expect a technical solution at the OS level
> would look like. The api is 8-bit strings and that's not likely to
> change.

Mac OS X kept the old legacy bytes API, but the kernel enforces valid UTF-8 names for filenames. This is a good start to move forward to Unicode. On such a system, we can make some assumptions. On Linux, we cannot make such assumptions today.
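The idea of trying several candidate encodings to format a filename for display can be sketched in a few lines. This is not glib's algorithm and does not use its API: `display_name` and its default candidate list are assumptions made for the illustration, and the result is meant for showing to a user, never for reopening the file.

```python
def display_name(raw, candidates=("utf-8", "iso-8859-1")):
    """Return a printable str for the bytes filename `raw`.

    Each candidate encoding is tried in order with strict decoding;
    if all fail, undecodable bytes are replaced with U+FFFD.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("ascii", "replace")
```

Because ISO-8859-1 accepts every byte sequence, the default list never reaches the replacement fallback; a shorter candidate list does.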
msg150058 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-22 00:50
>> Nope. The locale encoding is chosen using LC_ALL, LC_CTYPE or LANG
>> variable: use the first non-empty variable. LC_MESSAGES doesn't affect
>> the encoding. Example:
>
> That's good to know, thanks. Only leaves the case where setlocale
> is called again with a different value.

You mean changing the current locale encoding using setlocale(LC_CTYPE)? It doesn't affect the encoding used by Python for filenames (and other OS data). It is a design choice, but also mandatory to avoid mojibake.

It was possible in Python 3.1 to set the filesystem encoding, but it doesn't solve any problem, whereas it leads to mojibake in most (or all?) cases. A very important property is: os.fsencode(os.fsdecode(name)) == name. It fails if the result of os.fsdecode(name) was stored before the encoding was changed.

Few C functions are affected by the locale encoding: strerror() and strftime() (tell me if there are others!). Python 3.2 used the filesystem encoding (so the locale encoding read at startup) for them, but it was wrong. I fixed this issue recently: #13560 (see also #13619).
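The round-trip property stated above can be checked directly with the standard library. A minimal sketch; `roundtrips` is a hypothetical helper, and the claim about arbitrary bytes applies to POSIX systems, where `surrogateescape` is the error handler used for filenames:

```python
import os


def roundtrips(name):
    """Check that decoding and re-encoding a bytes filename is lossless
    under the interpreter's current filesystem encoding."""
    return os.fsencode(os.fsdecode(name)) == name


# Valid UTF-8 bytes: survive whether the filesystem encoding is ASCII
# (via surrogates) or UTF-8 (via normal decoding).
utf8_ok = roundtrips(b"h\xc3\xa9h\xc3\xa9")
```

The property only holds while the encoding stays fixed for the life of the process, which is the argument made above for removing the ability to change it at runtime.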
msg150061 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-22 01:16
On 22 December 2011 11:21, STINNER Victor <report@bugs.python.org> wrote: > This discussion is becoming very long, I didn't remember the original > purpose. The proposal is that in some cases where Python currently assumes filenames are ascii on Linux, it ought to instead assume they are utf-8. > You want to use UTF-8 instead of ASCII, so what? What do you > want to do with your nicely well decoded filenames? You cannot print it > to your terminal nor pass it to a subprocess, because your terminal uses > ASCII, as subprocess. I don't see how it would help you. When the application has a unicode string, it can always encode itself in whatever way it thinks most appropriate. For instance if it is a network service, the locale in which it was started may be entirely irrelevant to the encoding it wants to talk to a particular peer. However, there are or were some Python filesystem APIs where it is very hard for the application to avoid being limited to the encoding Python assumes at startup. Also, for good reasons, the application cannot change the filesystem encoding once it starts. So the reason for proposing a patch to Python is that there is no way for the application to escape, once Python's assumed all names will be ascii. It may be that all of those limitations have since been fixed separately, either through pep383 or separate patches, so the application at least has a chance to work around it. It would be nice to not burden the application or user with working around this when the filenames really are valid in what should be the user's locale, but perhaps this is the OS's fault for not having the right locale configured.
msg150062 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-22 01:32
On 22/12/2011 02:16, Martin Pool wrote: > The proposal is that in some cases where Python currently assumes > filenames are ascii on Linux, it ought to instead assume they are > utf-8. Oh, I expected a use case describing the problem, not the proposed solution :-) >> You want to use UTF-8 instead of ASCII, so what? What do you >> want to do with your nicely well decoded filenames? You cannot print it >> to your terminal nor pass it to a subprocess, because your terminal uses >> ASCII, as subprocess. I don't see how it would help you. > > When the application has a unicode string, Where does this string come from? (It is an important question). If your locale encoding is ASCII, you cannot write such non-ASCII filenames using the keyboard for example. > with working around this when the filenames really are > valid in what should be the user's locale, On your computer, UTF-8 is maybe a good candidate for "what should be the user's locale", but you cannot generalize for all computers. I also wanted to force UTF-8 everywhere, but you cannot do that or your program will just not work in some configurations.
msg150066 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-22 01:50
On 22 December 2011 12:32, STINNER Victor <report@bugs.python.org> wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> On 22/12/2011 02:16, Martin Pool wrote:
>> The proposal is that in some cases where Python currently assumes
>> filenames are ascii on Linux, it ought to instead assume they are
>> utf-8.
>
> Oh, I expected a use case describing the problem, not the proposed
> solution :-)

The problem as I see it is this:

On Linux, filenames are generally (but not always) in UTF-8; people fairly commonly end up with no locale configured, which causes Python to decode filenames as ascii. It is easy for this to end up with them hitting UnicodeErrors.

>>> You want to use UTF-8 instead of ASCII, so what? What do you
>>> want to do with your nicely well decoded filenames? You cannot print it
>>> to your terminal nor pass it to a subprocess, because your terminal uses
>>> ASCII, as subprocess. I don't see how it would help you.
>>
>> When the application has a unicode string,
>
> Where does this string come from? (It is an important question).

It comes, for example, from the name of a file, or a directory, or the contents of a symlink. Or the problem applies equally when the program has got a unicode string (for example off the network in a defined encoding) and it is trying to use it to access the filesystem.

> If your locale encoding is ASCII, you cannot write such non-ASCII
> filenames using the keyboard for example.

Sure you can. The user could enter a backslash-escaped name, which the program knows to decode to unicode. The point is the program has a choice of how it deals with user input, whereas it does not have as much control in Python of how filenames are encoded.

> > with working around this when the filenames really are
> > valid in what should be the user's locale,
>
> On your computer, UTF-8 is maybe a good candidate for "what should be
> the user's locale", but you cannot generalize for all computers.
>
> I also wanted to force UTF-8 everywhere, but you cannot do that or your
> program will just not work in some configurations.

Just to be clear, I'm not proposing to force UTF-8 everywhere. I am only proposing to 'break' the case where the user has non-ascii filenames but, intentionally or not, a locale that specifies only ascii is used. With this change, Python will try to decode them as utf-8, and fail if they're not utf-8.

I am coming to think the best step here is just for the OS to do more to make sure the application does get the appropriate locale. (For example, Ubuntu in recent releases uses a pam hook to set LANG for cron jobs, to avoid the example described above.)
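For readers following along, the rule described in the last message can be sketched roughly as below. This is only an illustration of the stated behaviour ("try UTF-8 when the locale says ASCII, and fail if the bytes are not UTF-8"), not the attached patch and not how CPython is implemented; `decode_filename` and its parameters are hypothetical names.

```python
import codecs


def decode_filename(raw: bytes, locale_encoding: str) -> str:
    """Hypothetical helper: decode a bytes filename under the proposed rule."""
    # codecs.lookup() normalises aliases, so "ANSI_X3.4-1968" (what the C
    # locale reports on Linux) and "ascii" both resolve to the name "ascii".
    if codecs.lookup(locale_encoding).name == "ascii":
        # Proposed behaviour: treat an ASCII locale as UTF-8.
        # Strict decoding raises UnicodeDecodeError for non-UTF-8 bytes.
        return raw.decode("utf-8")
    # Any other locale encoding is honoured unchanged.
    return raw.decode(locale_encoding)


utf8_name = decode_filename(b"h\xc3\xa9.txt", "ANSI_X3.4-1968")
latin1_name = decode_filename(b"h\xe9.txt", "latin-1")
```

Because ASCII is a strict subset of UTF-8, pure-ASCII names decode identically under either rule; only names containing bytes above 0x7F behave differently.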
msg150067 - (view)	Author: STINNER Victor (vstinner) *	Date: 2011-12-22 02:15
> The problem as I see it is this:
>
> On Linux, filenames are generally (but not always) in UTF-8; people
> fairly commonly end up with no locale configured, which causes Python
> to decode filenames as ascii. It is easy for this to end up with them
> hitting UnicodeErrors.

I don't think that your problem is decoding, but encoding filenames.

>> Where does this string come from? (It is an important question).
>
> It comes, for example, from the name of a file, or a directory, or the
> contents of a symlink.

For all these cases, Python is able to decode them (but stores undecodable bytes as surrogates, PEP 383).

> Or the problem applies equally when the
> program has got a unicode string (for example off the network in a
> defined encoding) and it is trying to use it to access the filesystem.

Hum, you can have the problem if you try to decompress a ZIP containing a Unicode filename. ZIP stores filenames as cp437 or UTF-8 depending on a flag (well, it's not exact: some buggy tools store filenames in a different encoding, the Windows ANSI code page...).

If you try to decompress a ZIP containing non-ASCII filenames stored as UTF-8, whereas your locale encoding is ASCII, you will get a UnicodeEncodeError. I would suggest fixing your environment: if you want to play with non-ASCII filenames, you should first fix your locale. Otherwise other programs will also fail because of your locale.

(There is maybe something to do in the ZIP module to allow creating file names using the original raw bytes filename. See also issues #10614 and #10972.)

>> If your locale encoding is ASCII, you cannot write such non-ASCII
>> filenames using the keyboard for example.
>
> Sure you can. The user could enter a backslash-escaped name, which
> the program knows to decode to unicode.

How exactly? Users do not usually write backslash-escaped names. Users prefer to click on icons :-)

> with user input, whereas it does not have as
> much control in Python of how filenames are encoded.

Ah? The application can control how filenames are encoded. Example:

Create a UTF-8 filename with a UTF-8 locale encoding.

$ python3
Python 3.2.1 (default, Jul 11 2011, 18:54:42)
>>> import locale; print(locale.getpreferredencoding())
UTF-8
>>> f=open("hé.txt", "w"); f.write("unicode!"); f.close()

Read the file content, even if the locale encoding is ASCII.

$ LANG=C python3
Python 3.2.1 (default, Jul 11 2011, 18:54:42)
>>> import locale; print(locale.getpreferredencoding())
ANSI_X3.4-1968
>>> f=open("h\xe9.txt", "r"); print(f.read()); f.close()
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 1: ordinal not in range(128)
>>> f=open("h\xe9.txt".encode("utf-8"), "r"); print(f.read()); f.close()
unicode!

You cannot pass "h\xe9.txt" directly, but if you know the "correct" file system encoding, you can encode it explicitly using str.encode("utf-8").

You are trying to do something complex (add hacks for filenames, for a specific configuration) for a simple problem: configuring locales correctly.

If you know and you are sure that you are using UTF-8, why not simply set your locale to a UTF-8 locale?
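The two mechanisms mentioned in this message (passing an explicitly encoded bytes filename, and the PEP 383 surrogate round trip) can be tried in a script form along these lines. It is a minimal sketch using only the standard library; the helper name `write_and_read` is made up for illustration, and the temporary directory stands in for wherever the file would really live.

```python
import os
import tempfile

name = "h\xe9.txt"            # the str filename used in the session above
raw = name.encode("utf-8")    # explicit bytes filename: b'h\xc3\xa9.txt'

# PEP 383: decoding with the "surrogateescape" error handler maps each
# undecodable byte to a lone surrogate (0xC3 -> U+DCC3), and encoding with
# the same handler restores the original bytes exactly.
escaped = raw.decode("ascii", "surrogateescape")
restored = escaped.encode("ascii", "surrogateescape")


def write_and_read(directory: str, filename: bytes, text: str) -> str:
    """Hypothetical helper: create a file by its bytes name, then read it back."""
    # Joining bytes with bytes keeps the whole path as bytes, so Python
    # never has to encode the name with the locale/filesystem encoding.
    path = os.path.join(os.fsencode(directory), filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    content = write_and_read(tmp, raw, "unicode!")
```

Since the path is bytes from start to finish, this should behave the same whether the script is started under a UTF-8 locale or under `LANG=C`.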
msg150068 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-22 02:32
On 22 December 2011 13:15, STINNER Victor <report@bugs.python.org> wrote:
> You cannot pass directly "h\xe9.txt", but if you know the "correct" file system encoding, you can encode it explicitly using str.encode("utf-8").

My recollection was that there were some cases where you couldn't do this, but perhaps I was wrong, or perhaps they're all fixed in python3.x, or at least perhaps they are better fixed as individual bugs. gz may know more.

> You are trying to do something complex (add hacks for filenames, for a specific configuration) for a simple problem: configure correctly locales.

I think you may be right.

> If you know and you are sure that you are using UTF-8, why not
> simply setting your locale to a UTF-8 locale?

_My_ locale is set properly. The problem is all the other people in the world who do not have their locale set to match their files on disk; telling them each to fix it is tedious. But perhaps the OS is the best place to address that, when the incorrect locale is just accidental, not unavoidable.
msg150069 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2011-12-22 04:05
> _My_ locale is set properly. The problem is all the other
> people in the world who do not have their locale set to match
> their files on disk; telling them each to fix it is tedious.
> But perhaps the OS is the best place to address that, when the
> incorrect locale is just accidental not unavoidable.

I fixed my locale back before my OS fully supported doing so. It was painful, but it was so worth it. There were many tools that just worked better after I did that, and several tools that I had to convince to use utf-8 through non-standard means.

So I think Python is doing the right thing by using the locale (the Standard Way), and that getting the OS vendors and/or the users to fix their locale settings is indeed the right place to fix this.
msg150204 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-12-24 03:01
Martin, after reading most all of the unusually large sequence of messages, I am closing this because three of the core developers with the most experience in this area are dead-set against your proposal. That does not make it 'wrong', but does mean that it will not be approved and implemented without new data and more persuasive arguments than those presented so far. I do not see that continued repetition of what has been said so far will change anything.
msg150215 - (view)	Author: Martin Pool (poolie)	Date: 2011-12-24 06:24
Terry, that's fine. Thanks to everyone who contributed to the discussion.
msg283718 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-21 04:44
Also see http://bugs.python.org/issue28180 for a more recent proposal to tackle this by coercing the C locale to the C.UTF-8 locale
msg308601 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-12-19 01:01
Follow-up: PEP 538 (bpo-28180) and PEP 540 (bpo-29240) have been accepted and implemented in Python 3.7!
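To see what a given interpreter ends up using after those changes, something like the following can be run under different `LANG`/`LC_ALL` settings. It is a small sketch, not part of either PEP; `describe_encodings` is a hypothetical helper, and `sys.flags.utf8_mode` is read with `getattr` because the attribute only exists on Python 3.7 and later.

```python
import locale
import sys


def describe_encodings() -> dict:
    """Hypothetical helper: report the encodings this interpreter is using."""
    return {
        # Encoding used to convert str filenames to bytes and back.
        "filesystem": sys.getfilesystemencoding(),
        # Locale's preferred encoding, without calling setlocale().
        "preferred": locale.getpreferredencoding(False),
        # True when the UTF-8 Mode from PEP 540 is active (Python 3.7+).
        "utf8_mode": bool(getattr(sys.flags, "utf8_mode", 0)),
    }


info = describe_encodings()
for key, value in info.items():
    print(f"{key}: {value}")
```

The values printed depend on the platform, the Python version, and the environment, so no particular output is assumed here.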

History
Date	User	Action	Args
2022-04-11 14:57:24	admin	set	github: 57852
2017-12-19 01:01:26	vstinner	set	messages: + msg308601
2016-12-21 04:44:05	ncoghlan	set	nosy: + ncoghlan messages: + msg283718
2011-12-24 06:24:44	poolie	set	messages: + msg150215
2011-12-24 03:01:38	terry.reedy	set	status: open -> closed type: behavior -> enhancement nosy: + terry.reedy messages: + msg150204 resolution: rejected stage: test needed
2011-12-22 10:45:03	akira	set	nosy: + akira
2011-12-22 04:05:35	r.david.murray	set	messages: + msg150069
2011-12-22 02:32:20	poolie	set	messages: + msg150068
2011-12-22 02:15:55	vstinner	set	messages: + msg150067
2011-12-22 01:50:35	poolie	set	messages: + msg150066
2011-12-22 01:32:52	vstinner	set	messages: + msg150062
2011-12-22 01:16:54	poolie	set	messages: + msg150061
2011-12-22 00:50:18	vstinner	set	messages: + msg150058
2011-12-22 00:21:26	vstinner	set	messages: + msg150056
2011-12-21 23:26:11	pitrou	set	messages: + msg150053
2011-12-21 23:02:31	poolie	set	messages: + msg150052
2011-12-21 22:54:42	r.david.murray	set	messages: + msg150050
2011-12-21 20:04:43	vstinner	set	messages: + msg150040
2011-12-21 19:47:04	gz	set	messages: + msg150039
2011-12-21 18:27:38	vstinner	set	messages: + msg150031
2011-12-21 08:17:00	vila	set	nosy: + vila
2011-12-21 01:41:26	pitrou	set	messages: + msg149952
2011-12-21 01:36:01	poolie	set	messages: + msg149951
2011-12-21 01:18:08	poolie	set	messages: + msg149950
2011-12-21 01:16:21	pitrou	set	nosy: + pitrou messages: + msg149949
2011-12-21 01:12:29	gz	set	messages: + msg149948
2011-12-21 00:54:39	vstinner	set	messages: + msg149947
2011-12-21 00:41:45	vstinner	set	messages: + msg149944
2011-12-21 00:38:19	poolie	set	messages: + msg149943
2011-12-21 00:28:57	poolie	set	messages: + msg149942
2011-12-21 00:26:03	vstinner	set	messages: + msg149941
2011-12-21 00:01:54	vstinner	set	messages: + msg149939
2011-12-20 23:53:28	poolie	set	nosy: + poolie messages: + msg149938
2011-12-20 20:45:11	gz	set	messages: + msg149929
2011-12-20 20:24:43	gz	set	type: behavior messages: + msg149928
2011-12-20 19:38:54	vstinner	set	messages: + msg149927
2011-12-20 19:37:40	vstinner	set	nosy: + vstinner messages: + msg149926
2011-12-20 19:17:07	r.david.murray	set	nosy: + r.david.murray messages: + msg149925
2011-12-20 19:02:21	gz	create