Issue 19846: Python 3 raises Unicode errors with the C locale



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64045

classification

Title:	Python 3 raises Unicode errors with the C locale
Type:	behavior	Stage:	resolved
Components:	IO	Versions:	Python 3.3, Python 3.4

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	Sworddragon, a.badger, bkabrda, editor-buzzfeed, jwilk, larry, lemburg, loewis, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, terry.reedy, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-11-30 21:40 by Sworddragon, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
test.py	Sworddragon, 2013-11-30 21:40	Example script
asciilocale.patch	pitrou, 2013-12-07 17:17		review

Messages (68)
msg204849 - (view)	Author: (Sworddragon)	Date: 2013-11-30 21:40
It seems that print() and write() (and maybe other such I/O functions) rely on sys.getfilesystemencoding(). But these functions do not operate on filenames; they operate on file content. The attached example script demonstrates the problem. Here is what I get:

sworddragon@ubuntu:~/tmp$ echo $LANG
de_DE.UTF-8
sworddragon@ubuntu:~/tmp$ python3 test.py
sys.getdefaultencoding(): utf-8
sys.getfilesystemencoding(): utf-8
ä
sworddragon@ubuntu:~/tmp$ LANG=C
sworddragon@ubuntu:~/tmp$ python3 test.py
sys.getdefaultencoding(): utf-8
sys.getfilesystemencoding(): ascii
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print('\xe4')
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 0: ordinal not in range(128)
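The failure above can be reproduced without touching the locale. The following is a minimal sketch, assuming an in-memory stream is an acceptable stand-in for sys.stdout; the helper name try_print is hypothetical and not part of the attached test.py.

```python
import io

def try_print(text, encoding):
    """Print text to an in-memory text stream that uses the given encoding."""
    raw = io.BytesIO()
    # errors defaults to "strict", as it does for sys.stdout;
    # newline="\n" disables platform-specific newline translation.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "error: " + exc.reason
    return raw.getvalue()

utf8_result = try_print("\xe4", "utf-8")   # what happens with LANG=de_DE.UTF-8
ascii_result = try_print("\xe4", "ascii")  # what happens with LANG=C
print(utf8_result)
print(ascii_result)
```

The only difference between the two calls is the encoding attached to the stream, which is the setting the locale controls for the real standard streams.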
msg204850 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-11-30 21:53
Victor can correct me if I'm wrong, but I believe that stdin/stdout/stderr all use the filesystem encoding because filenames are the most likely source of non-ascii characters on those streams. (Not a perfect solution, but the best we can do.)
msg204852 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-11-30 22:25
"Filesystem encoding" is not a good name. You should read "OS encoding" or maybe "locale encoding". This encoding is the best choice for interopability with other (python2 or non python) programs. If you don't care of interoperabilty, force the encoding using PYTHONIOENCODING environment variable.
msg205418 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2013-12-07 00:12
Unless there is an actual possibility of changing this, which I doubt since it is a choice and not a bug, and changing it might break things, this issue should be closed.
msg205419 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-07 00:13
I think the ship has sailed on this. We can't change our heuristic every time someone finds a flaw in the current one. In the long term, all sensible UNIX systems should be configured for utf-8 filenames and contents, so it won't make a difference anymore.
msg205454 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-07 14:01
If you want to avoid the encoding errors, you can also use PYTHONIOENCODING=:replace or PYTHONIOENCODING=:backslashreplace in Python 3.4 to keep the locale encoding but use an error handler other than strict.
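As an illustration of the error-handler setting, here is a sketch that runs a child interpreter with PYTHONIOENCODING set. It assumes the encoding is spelled out as "ascii" (rather than left empty as in the message above) so that the result does not depend on the locale of the machine running it; the helper name run_child is hypothetical.

```python
import os
import subprocess
import sys

def run_child(io_encoding):
    """Run a one-line print in a child interpreter with PYTHONIOENCODING set."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", r"print('\xe4')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Strip the trailing newline so the comparison is platform independent.
    return result.stdout.strip()

replaced = run_child("ascii:replace")
escaped = run_child("ascii:backslashreplace")
print(replaced, escaped)
```

Neither child raises UnicodeEncodeError: the unencodable character is either replaced by a question mark or written as an escape sequence.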
msg205459 - (view)	Author: (Sworddragon)	Date: 2013-12-07 15:14
Using an environment variable is not the holy grail for this. When writing a non-single-user application, you can't expect the user to set extra environment variables. If compatibility is the only reason, in my opinion it would be much better to include something like sys.use_strict_encoding(), which decides whether print()/write() will use sys.getfilesystemencoding() or sys.getdefaultencoding().
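For readers looking for an in-process alternative to an environment variable: later Python versions allow a text stream to be reconfigured by the application itself. The sketch below assumes Python 3.7 or newer, where io.TextIOWrapper.reconfigure() exists (it was not available when this issue was filed), and uses an in-memory stream as a stand-in for sys.stdout.

```python
import io

raw = io.BytesIO()
# Stand-in for sys.stdout as it would be configured under LANG=C.
stream = io.TextIOWrapper(raw, encoding="ascii", newline="\n")

# The application picks its own output encoding instead of trusting the environment.
stream.reconfigure(encoding="utf-8")
print("\xe4", file=stream)
stream.flush()

output = raw.getvalue()
print(output)
```

On a real program the same call would be made on sys.stdout; whether overriding the environment is appropriate is exactly the interoperability question debated in this thread.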
msg205462 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-07 15:34
> Using an environment variable is not the holy grail for this. On
> writing a non-single-user application you can't expect the user to set
> extra environment variables.

I don't understand why the user would have to set anything at all. What is the use case for per-user encoding settings?

I understand that passing LANG=C (e.g. to disable a program's translations) forces ASCII instead of UTF-8, which is a flaw. Perhaps the filesystem encoding should be set to UTF-8 when the system locale says ASCII.

(OTOH, it's IMHO a system bug that LANG=C forces the ASCII charset; we're not in the 80s anymore)
msg205465 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-07 16:10
Antoine's suggestion of being a little more aggressive in choosing utf-8 over ascii as the OS API encoding sounds reasonable to me. I think we're getting to a point where a system claiming ASCII as the encoding to use is almost certainly a misconfiguration rather than a desired setting. If someone really means ASCII, they can force it for at least the std streams with PYTHONIOENCODING.
msg205472 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-07 17:17
Here is a patch.

$ LANG=C ./python -c "import os, sys, locale; print(sys.getfilesystemencoding(), sys.stdin.encoding, os.device_encoding(0), locale.getpreferredencoding())"

-> Without the patch:
ascii ANSI_X3.4-1968 ANSI_X3.4-1968 ANSI_X3.4-1968

-> With the patch:
utf-8 utf-8 utf-8 ANSI_X3.4-1968
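The one-liner above can be written as a small script for easier reading. This is a sketch that labels each of the four values; the function name report_encodings and the dictionary keys are illustrative choices, not part of the patch.

```python
import locale
import os
import sys

def report_encodings():
    """Collect the four encodings printed by the one-liner, with labels."""
    try:
        device = os.device_encoding(0)  # None when fd 0 is not a terminal
    except OSError:
        device = None
    return {
        "filesystem": sys.getfilesystemencoding(),
        "stdin": getattr(sys.stdin, "encoding", None),
        "device(0)": device,
        "locale": locale.getpreferredencoding(),
    }

encodings = report_encodings()
for label, value in encodings.items():
    print(label, "=", value)
```

Running it once normally and once with LANG=C in the environment shows which of the values follow the locale on a given Python version.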
msg205497 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-07 23:18
There was a previous attempt to use a filesystem encoding different from the locale encoding, and it introduced too many issues:
https://mail.python.org/pipermail/python-dev/2010-October/104509.html
"Inconsistencies if locale and filesystem encodings are different"

Python uses the fact that the filesystem encoding is the locale encoding in various places. For example, Python uses the C codec (mbstowcs) to decode byte strings from the filesystem encoding before Python codecs can be used. For example, the ISO 8859-15 codec is implemented in Python, and so you need something during Python startup until the import machinery is ready and the codec is loaded (using the ascii encoding is not correct).

The C locale may use a different encoding. For example on AIX, the ISO 8859-1 encoding is used. On FreeBSD and Solaris, the ISO 8859-1 encoding is announced but the ASCII encoding is used in practice. Python forces the ascii encoding on FreeBSD to avoid other issues.

I worked hard to have Python 3 working out of the box on all platforms. In my opinion, going against the locale encoding in some cases (the C locale) would introduce more issues than it solves.
msg205498 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-07 23:22
> Python uses the fact that the filesystem encoding is the locale > encoding in various places. The patch doesn't change that.
msg205505 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-08 02:16
Note that the only change Antoine's patch makes is that:

- if the locale encoding is ASCII (or an alias for ASCII)
- then Python sets the filesystem encoding to UTF-8 instead

If the locale encoding is anything other than ASCII, then that will still be used as the filesystem encoding, so environments that use something other than ASCII for the C locale will retain their current behaviour.

The rationale for this approach is based on the assumption that the most likely way to get a locale encoding of ASCII at this point in time is to use "LANG=C" on a system where the locale encoding is normally something more suited to a Unicode world (likely UTF-8).

Will assuming utf-8 sometimes cause problems? Quite possibly. But assuming that the platform's claim to only support ASCII is correct causes serious usability problems, too.
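The rule described in this message can be sketched in a few lines of Python. This is only an illustration of the decision, not the code in asciilocale.patch (which works at the C level); the function name choose_fs_encoding is hypothetical, and codecs.lookup() is used here as one way to recognise ASCII aliases.

```python
import codecs

def choose_fs_encoding(locale_encoding):
    """Return UTF-8 when the locale claims ASCII (under any alias), else the locale encoding."""
    try:
        is_ascii = codecs.lookup(locale_encoding).name == "ascii"
    except LookupError:
        # Unknown encoding name: leave it alone rather than guess.
        is_ascii = False
    return "utf-8" if is_ascii else locale_encoding

for name in ("ANSI_X3.4-1968", "US-ASCII", "ISO-8859-1", "UTF-8"):
    print(name, "->", choose_fs_encoding(name))
```

Only the first two names are remapped; a C locale that announces ISO-8859-1, as on some platforms mentioned earlier in the thread, is left untouched.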
msg205538 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-08 10:49
Antoine Pitrou added the comment:
> > Python uses the fact that the filesystem encoding is the locale
> > encoding in various places.
> The patch doesn't change that.

Nick Coghlan added the comment:
> Note that the only change Antoine's patch makes is that:
> - if the locale encoding is ASCII (or an alias for ASCII)
> - then Python sets the filesystem encoding to UTF-8 instead

If the locale encoding is ASCII, the filesystem encoding (UTF-8) is different from the locale encoding.
msg205545 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-08 11:16
Yes, that's the point. Every case I've seen where the locale encoding has been reported as ASCII on a modern Linux system has been because the environment has been configured to use the C locale, and that locale has a silly, antiquated, encoding setting.

This is particularly problematic when people remotely access a system with ssh and get given the C locale instead of something sensible, and then can't properly read the filesystem on that server.

The idea of using UTF-8 instead in that case is to change (and hopefully reduce) the number of cases where things go wrong.

- if no non-ASCII data is encountered, the choice of ASCII vs UTF-8 doesn't matter
- if it's a modern Linux distro, then the real filesystem encoding is UTF-8, and the setting it provides for LANG=C is just plain wrong
- there may be other cases where ASCII actually is the filesystem encoding (in which case they're going to have trouble anyway), or the real filesystem encoding is something other than UTF-8

We're already approximating things on Linux by assuming every filesystem is using the same encoding, when that's not necessarily the case. Glib applications also assume UTF-8, regardless of the locale (http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux).

At the moment, setting "LANG=C" on a Linux system fundamentally breaks Python 3, and that's not OK.
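The first point in the list above (pure ASCII data is unaffected by the choice) follows from UTF-8 being a superset of ASCII. A short sketch, with arbitrary sample strings chosen for illustration:

```python
samples = ["hello.txt", "/usr/lib/python3", "LANG=C"]

# Pure ASCII text produces identical bytes under both encodings.
ascii_only_same = all(s.encode("ascii") == s.encode("utf-8") for s in samples)

# The two encodings only diverge once non-ASCII data shows up.
non_ascii = "\xe4"
utf8_bytes = non_ascii.encode("utf-8")
try:
    non_ascii.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(ascii_only_same, utf8_bytes, ascii_failed)
```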
msg205547 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-08 11:37
2013/12/8 Nick Coghlan <report@bugs.python.org>:
> Yes, that's the point. Every case I've seen where the locale encoding has been reported as ASCII on a modern Linux system has been because the environment has been configured to use the C locale, and that locale has a silly, antiquated, encoding setting.
>
> This is particularly problematic when people remotely access a system with ssh and get given the C locale instead of something sensible, and then can't properly read the filesystem on that server.

The solution is to fix the locale, not to fix Python. For example, don't set LANG to C.

From the C locale, you cannot guess the "correct" encoding. In Unicode, the general rule is to never try to guess the encoding.

> The idea of using UTF-8 instead in that case is to change (and hopefully reduce) the number of cases where things go wrong.

If the OS uses ISO-8859-1, forcing the Python (filesystem) encoding to UTF-8 would produce invalid filenames, display mojibake and more generally produce data incompatible with other applications (which rely on the C locale, and so on the ASCII encoding).

> - there may be other cases where ASCII actually is the filesystem encoding (in which case they're going to have trouble anyway), or the real filesystem encoding is something other than UTF-8

As I wrote before, sys.getfilesystemencoding() is not the filesystem encoding. It's the "OS" encoding used to decode any kind of data coming from the OS and used to encode Python data back to the OS. Just some examples:

- DNS hostnames
- Environment variables
- Command line arguments
- Filenames
- user/group entries in the grp/pwd modules
- almost all functions of the os module; they return various types of information (ttyname, ctermid, current working directory, login, ...)

> We're already approximating things on Linux by assuming every filesystem is using the same encoding, when that's not necessarily the case.
> Glib applications also assume UTF-8, regardless of the locale (http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux).

If you use a different encoding but only just for filenames, you will get mojibake when you pass a filename on the command line or in an environment variable.

> At the moment, setting "LANG=C" on a Linux system fundamentally breaks Python 3, and that's not OK.

Getting an ASCII filesystem encoding is annoying, but I would not say that it fundamentally breaks Python 3. If you want to do something, you should write documentation explaining how to properly configure Linux.
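The mojibake scenario mentioned in this message can be shown with plain codecs. This sketch assumes one program writes a filename as UTF-8 while another reads the same bytes as ISO-8859-1; the filename is taken from an example later in the thread.

```python
name = "ab\xe9.txt"                        # 'abé.txt'

written = name.encode("utf-8")             # bytes produced by a program assuming UTF-8
read_back = written.decode("iso-8859-1")   # same bytes seen by a program assuming ISO-8859-1

print(written)
print(read_back)
```

The reader sees two unrelated characters where the writer stored one, so the two programs no longer agree on the name even though no error was raised.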
msg205548 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-08 11:41
> If you use a different encoding but only just for filenames, you will
> get mojibake when you pass a filename on the command line or in an
> environment variable.

That's not what the patch does.
msg205549 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-08 11:45
2013/12/8 Antoine Pitrou <report@bugs.python.org>:
>> Python uses the fact that the filesystem encoding is the locale
>> encoding in various places.
>
> The patch doesn't change that.

You wrote: "-> With the patch: utf-8 utf-8 utf-8 ANSI_X3.4-1968", so sys.getfilesystemencoding() != locale.getpreferredencoding(). Or said differently, the filesystem encoding is different from the locale encoding.

So please read again my following message, which lists real bugs:
https://mail.python.org/pipermail/python-dev/2010-October/104509.html

If you want to use a filesystem encoding different from the locale encoding, you have to patch Python wherever it assumes that the filesystem encoding is the locale encoding, to fix all these bugs. Start with:

- PyUnicode_DecodeFSDefaultAndSize()
- PyUnicode_EncodeFSDefault()
- _Py_wchar2char()
- _Py_char2wchar()

It should be easier to change these functions if FS != locale only occurs when FS is "UTF-8". On Mac OS X, Python always uses UTF-8 for the filesystem encoding; it doesn't care about the locale encoding. See _Py_DecodeUTF8_surrogateescape() in unicodeobject.c, you may reuse it.

With a better patch, I can do more experiments to check if there are other tricky bugs. Does your patch at least pass the whole test suite with LANG=C?
msg205550 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-08 11:52
Setting the sys.stderr encoding to UTF-8 on an ASCII locale is wrong. sys.stderr has the backslashreplace error handler by default, so it never fails and should never produce non-ASCII data on an ASCII locale.
msg205554 - (view)	Author: Larry Hastings (larry) *	Date: 2013-12-08 12:35
Antoine: are you characterizing this as a "bug" rather than a "new feature"? I'd like to see more of a consensus before something like this gets checked in. Right now I see a variety of opinions. When I think "conservative approach" and "knows about system encoding stuff", I think of Martin. Martin, can I ask you to form an opinion about this?
msg205555 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-08 12:38
> Or said differently, the filesystem encoding is different than the
> locale encoding.

Indeed, but the FS encoding and the IO encoding are the same. "locale encoding" doesn't really matter here, as we are assuming that it's wrong.
msg205564 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-08 14:09
Victor, people set "LANG=C" for all sorts of reasons, and we have no control over how operating systems define that locale. The user perception is "Python 3 doesn't work properly when you ssh into systems", not "Gee, I wish operating systems defined the C locale more sensibly".

If you can come up with a more sensible guess than UTF-8, great, but believing the nonsense claim of "ASCII" from the OS is a not-insignificant usability issue on Linux, because it hoses all the OS API interactions.

Yes, theoretically, using UTF-8 can cause problems, if the following all occur:

- the OS claims the OS encoding is ASCII (so Python uses UTF-8 instead)
- the OS encoding is actually something other than UTF-8
- the program encounters non-ASCII data and writes it out to disk

For fear of doing the wrong thing in that incredibly rare scenario, you're leaving Python broken under the C locale on every modern Linux distro as soon as it encounters non-ASCII data in an OS interface.
msg205611 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-08 22:02
"haypo: title: Setting LANG=C breaks Python 3 -> print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding()" Oh, I didn't want to change the title of the issue, it's a bug in Roundup when I reply by email :-/
msg205615 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-08 22:22
>> Or said differently, the filesystem encoding is different than the
>> locale encoding.
> Indeed, but the FS encoding and the IO encoding are the same.
> "locale encoding" doesn't really matter here, as we are assuming that
> it's wrong.

Oh, I realized that the term "FS encoding" is not clear. When I wrote "FS encoding", I meant sys.getfilesystemencoding(), which is mbcs on Windows, UTF-8 on Mac OS X and (currently) the locale encoding on other platforms (UNIX, e.g. Linux/FreeBSD/Solaris/AIX).

--

IMO there are two different points in this issue:

(a) which encoding should be used when the C locale is used: the encoding announced by the OS using nl_langinfo(CODESET) (current choice) or an arbitrary optimistic "utf-8" encoding?

(b) for technical reasons, Python reuses the C codec during Python initialization to decode and encode OS data, and so currently Python must use the locale encoding for its "filesystem encoding"

Before being able to give an opinion on point (a), I would like to see a patch fixing point (b).

I'm not against fixing point (b). I'm just saying that it's not trivial and obviously it must be fixed to change the status of point (a). I even gave clues to fix point (b).

--

asciilocale.patch has many issues. Try to run the Python test suite using this patch to see what I mean. Example of failures:

======================================================================
FAIL: test_non_ascii (test.test_cmd_line.CmdLineTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_cmd_line.py", line 140, in test_non_ascii
    assert_python_ok('-c', command)
  File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 69, in assert_python_ok
    return _assert_python(True, args, *env_vars)
  File "/home/haypo/prog/python/default/Lib/test/script_helper.py", line 55, in _assert_python
    "stderr follows:\n%s" % (rc, err.decode('ascii', 'ignore')))
AssertionError: Process return code is 1, stderr follows:
Unable to decode the command from the command line:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 12: surrogates not allowed

======================================================================
FAIL: test_ioencoding_nonascii (test.test_sys.SysModuleTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_sys.py", line 603, in test_ioencoding_nonascii
    self.assertEqual(out, os.fsencode(test.support.FS_NONASCII))
AssertionError: b'' != b'\xc3\xa6'

======================================================================
FAIL: test_nonascii (test.test_warnings.CEnvironmentVariableTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
    "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"

======================================================================
FAIL: test_nonascii (test.test_warnings.PyEnvironmentVariableTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/haypo/prog/python/default/Lib/test/test_warnings.py", line 774, in test_nonascii
    "['ignore:Deprecaci\xf3nWarning']".encode('utf-8'))
AssertionError: b"['ignore:Deprecaci\\udcc3\\udcb3nWarning']" != b"['ignore:Deprecaci\xc3\xb3nWarning']"

test_warnings is probably #9988; the test_cmd_line failure is maybe #9992.

There may be other issues; the Python test suite only has a few tests for non-ASCII characters.

--

If anything is changed, I would prefer to have more than a few months of testing to make sure that it doesn't break anything. So I set the version field to Python 3.5.
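The '\udcc3' characters in these failures are lone surrogates produced by the surrogateescape error handler. The sketch below reproduces the mechanism with plain codecs, under the assumption that the bytes were first decoded as ASCII (the locale encoding) and later encoded as UTF-8 (the patched filesystem encoding); the byte string is borrowed from the test_warnings output.

```python
raw = b"Deprecaci\xc3\xb3nWarning"   # UTF-8 bytes for 'DeprecaciónWarning'

# Decoding as ASCII with surrogateescape smuggles each non-ASCII byte as a lone surrogate.
decoded = raw.decode("ascii", "surrogateescape")

# A strict UTF-8 encoder refuses lone surrogates.
strict_error = None
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = exc.reason

# Encoding with surrogateescape restores the original bytes.
round_tripped = decoded.encode("utf-8", "surrogateescape")

print(ascii(decoded))
print(strict_error)
print(round_tripped)
```

The round trip only works when both directions use the same error handler, which is why a mismatch between the decoding and encoding sides surfaces as the failures listed above.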
msg205623 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-09 00:24
On Sun, 2013-12-08 at 22:22 +0000, STINNER Victor wrote:
> (b) for technical reasons, Python reuses the C codec during Python
> initialization to decode and encode OS data, and so currently Python
> must use the locale encoding for its "filesystem encoding"

Ahhh! Well indeed that's a bummer :-)

> asciilocale.patch has many issues. Try to run the Python test suite
> using this patch to see what I mean.

I'm assuming much of this is due to (b) (all those tests seem to spawn external processes). It seems there is more work to do to get this right, but I'm not terribly interested either. Feel free to take over.
msg205625 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 00:33
> It seems there is more work to do to get this right, but I'm not
> terribly interested either. Feel free to take over.

If you are talking to me: I'm currently opposed to changing anything, so I'm not interested in working on a patch. IMO Python works fine and you should try to work around the current limitations :-)

If someone is interested in writing a huge patch fixing all these issues, I would be able to reconsider my opinion on point (a).
msg205637 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-09 01:54
End users tripping over this by setting LANG=C is one of the pain points of Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora folks to the nosy list.

My current understanding of the situation:

- we should leave Windows and Mac OS X alone, since they ignore the locale when choosing the OS API encoding anyway
- the main problem is on Linux (but potentially other *nix systems as well), where people set "LANG=C" for a variety of reasons, but this has the side effect of Python 3 choosing an inappropriate encoding (ASCII rather than UTF-8) when talking to the OS APIs.

Given the initialisation problems, this may be something that PEP 432 (the initialisation process rewrite) can help with (since it changes the initialisation order to create a more complete Python runtime before it starts to configure the OS interfaces).

Tangentially related, we may want to consider aliasing sys.getfilesystemencoding, os.fsencode and os.fsdecode as something like sys.getosapiencoding, os.apiencode and os.apidecode, since the current naming is misleading (the value is based on the platform and environment, not any particular filesystem, and is used for almost all bytes-based OS APIs, not just filesystem metadata)
msg205640 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 02:08
> End users tripping over this by setting LANG=C is one of the pain points of Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora folks to the nosy list.

Sorry, I'm not aware of such an issue. Do you have examples?

> - the main problem is on Linux (but potentially other *nix systems as well), where people set "LANG=C" for a variety of reasons, but this has the side effect of Python 3 choosing an inappropriate encoding (ASCII rather than UTF-8) when talking to the OS APIs.

Why do you think that the issue is specific to Python 3? Try to open a terminal with LC_ALL=C and try to type non-ASCII characters with your keyboard. You can't because your terminal uses ASCII. Did you try applications written in another language handling Unicode, like Perl? (Perl with Unicode support correctly enabled, it's "use utf8;" if I remember correctly).

Can you explain the "various reasons" why users explicitly force the encoding to ASCII?

I use LANG=C to get manual pages and error messages in English. But "LANG=en_US man ls" would be more correct, or "LC_MESSAGES=en_US man ls" to be pedantic. (Env var priority: LC_ALL > LC_xxx > LANG).

IMO if you use LANG=C, you must not complain that Unicode stopped working, but you should learn how to configure locales. Trivial examples like the one which can be found in the initial message (msg204849) are wrong: why would you force all locales to C and use non-ASCII characters?

> Given the initialisation problems, this may be something that PEP 432 (the initialisation process rewrite) can help with (since it changes the initialisation order to create a more complete Python runtime before it starts to configure the OS interfaces).

I don't see how it would help to solve my point (b).

Technically, this issue cannot be fixed. Or to be more specific, I don't want to fix it, it's a waste of time. So I don't understand what you expect from this open issue?

I would prefer to close it as invalid or wontfix to be clear.
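Since the interaction between LANG, LC_ALL and the per-category variables matters to this argument, here is a sketch of the lookup order POSIX specifies: LC_ALL first, then the category variable (for example LC_MESSAGES or LC_CTYPE), then LANG. The function name effective_locale is hypothetical, and falling back to "C" when nothing is set is an assumption (POSIX leaves that default implementation-defined).

```python
def effective_locale(env, category):
    """Resolve the locale name for one category from a mapping of environment variables."""
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:  # unset and empty variables are both skipped
            return value
    return "C"  # assumed default when nothing is set

# LANG=C with nothing else set: every category, including LC_CTYPE, becomes C.
print(effective_locale({"LANG": "C"}, "LC_CTYPE"))

# Messages in English while keeping a UTF-8 character type.
env = {"LANG": "de_DE.UTF-8", "LC_MESSAGES": "en_US"}
print(effective_locale(env, "LC_MESSAGES"), effective_locale(env, "LC_CTYPE"))
```

The second example is the configuration suggested in the message above: only the message language changes, so the encoding derived from LC_CTYPE stays UTF-8.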
msg205642 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-09 02:56
On 9 December 2013 12:08, STINNER Victor <report@bugs.python.org> wrote:
>
> STINNER Victor added the comment:
>
>> End users tripping over this by setting LANG=C is one of the pain points of Python 3 relative to Python 2 for Fedora, so I've added a couple of Fedora folks to the nosy list.
>
> Sorry, I'm not aware of such issue. Do you have examples?

Armin's travails with remote shell access and Python 3 are just as likely today as they were a couple of years ago: http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/ (although technically that was a terminal ending up with the POSIX locale, rather than specifically LANG=C)

>> - the main problem is on Linux (but potentially other *nix systems as well), where people set "LANG=C" for a variety of reasons, but this has the side effect of Python 3 choosing an inappropriate encoding (ASCII rather than UTF-8) when talking to the OS APIs.
>
> Why do you think that the issue is specific to Python 3? Try to open a
> terminal with LC_ALL=C and try to type non-ASCII characters with your
> keyboard. You can't because your terminal uses ASCII. Did you try
> applications written in another language handling Unicode, like Perl?
> (Perl with Unicode support correctly enabled, it's "use utf8;" if I
> remember correctly).

It's the fact this used to work transparently in Python 2 (since all these interfaces were just bytes based on the Python side as well) that's a problem. That makes the new sensitivity to the locale encoding a usability regression, and that's a concern for distros that are considering switching their default Python version.

> Can you explain the "various reasons" why users explicitly force the
> encoding to ASCII?

- testing applications for POSIX compliance
- default settings on servers where you don't control the environment
- because they never previously had to care, and it's only Python 3 deciding to pay attention to it which makes it relevant for them

> I use LANG=C to get manual pages and error messages in English. But
> "LANG=en_US man ls" would be more correct, or "LC_MESSAGES=en_US man
> ls" to be pedantic. (Env var priority: LC_ALL > LC_xxx > LANG).
>
> IMO if you use LANG=C, you must not complain that Unicode stopped
> working, but you should learn how to configure locales. Trivial
> examples like the one which can be found in the initial message
> (msg204849) are wrong: why would you force all locales to C and use
> non-ASCII characters?

And yet, in Python 2, people could do that, and Python didn't care. That's the regression I'm worried about. If it hadn't round-tripped cleanly in Python 2, I wouldn't care here either.

>> Given the initialisation problems, this may be something that PEP 432 (the initialisation process rewrite) can help with (since it changes the initialisation order to create a more complete Python runtime before it starts to configure the OS interfaces).
>
> I don't see how it would help to solve my point (b).

Having a Python runtime available makes things that are currently tediously painful to deal with during startup easier to tweak. I'm not sure it will help in this particular case, but it's now one I'm going to keep an eye on.

> Technically, this issue cannot be fixed. Or to be more specific, I
> don't want to fix it, it's a waste of time. So I don't understand what
> do you expect from this open issue?

A way to get Python 3 to cope as well with a misconfigured OS environment as Python 2 did.

> I would prefer to close it as invalid or wontfix to be clear.

It's a usability regression from Python 2, so I don't want to give up on it. It may be that we just implement an "ignore what the OS claims, it's misconfigured, just use UTF-8 for everything" flag. But OS configuration errors shouldn't cripple the Python runtime.
msg205646 - (view)	Author: (Sworddragon)	Date: 2013-12-09 04:03
You should keep things simpler:

- Python and the operating system/filesystem are in a client-server relationship and Python should validate everything.
- It doesn't matter what you finally decide to be the default encoding in various places - every choice will produce race conditions, without exception.
- The easiest way to fix this is to give the developer the ability to make a decision (like sys.use_strict_encoding(), sys.setfilesystemencoding(), sys.setdefaultencoding() etc.).
  * For example, giving the developer control is especially needed if they want to handle multiple different filesystems.

> Why do you think that the issue is specific to Python 3? Try to open a
> terminal with LC_ALL=C and try to type non-ASCII characters with your
> keyboard. You can't because your terminal uses ASCII.

sworddragon@ubuntu:~$ LANG=C
sworddragon@ubuntu:~$ ä
bash: $'\303\244': command not found

- The terminal doesn't pseudo-crash with an exception because it doesn't care about encodings.
- It allows changing the encoding at runtime.

> Did you
> applications written in another language handling Unicode, like Perl?

Compare C: it wouldn't matter, just as for the terminal. For example, fopen will simply return NULL if it can't open the file 'ä' because the filesystem is encoded with ISO-8859-1 and we wanted to open the UTF-8 counterpart.

> Can you explain the "various reasons" why users explictly force the
> encoding to ASCII?

For example, I'm using this in test cases as an uncomplicated way to set the language to English.
msg205654 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-09 09:36
> And yet, in Python 2, people could do that, and Python didn't care.
> That's the regression I'm worried about. If it hadn't round-tripped
> cleanly in Python 2, I wouldn't care here either.

$ python2.7 -c "print u'\u20ac'"
€
$ LANG=C python2.7 -c "print u'\u20ac'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

And even worse:

$ python2.7 -c "print u'\u20ac'" >/dev/null
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

What a wart!

Other programs can produce wrong (or even absolutely senseless) output with the C locale.

$ LANG=C ls
???????????? ???????????????? ???????????????????? ?????????? ?????????? ?????????????? ?????????????? ???????????????????????? ?????????? ???????? ???????????? ?????????????? ?????????????? ?????????????????? ???????? ????????????????????

What is better: silently producing corrupted output or raising an exception? If the former, then let's just set the "replace" or "backslashreplace" error handler for sys.stdout by default.
msg205655 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-09 09:40
> sworddragon@ubuntu:~$ LANG=C
> sworddragon@ubuntu:~$ ä
> bash: $'\303\244': command not found
>
> - The terminal doesn't pseudo-crash with an exception because it doesn't
> matter about encodings.
> - It allows to change the encoding at runtime.

This is not the locale of your terminal. Try `LANG=C xterm`.
msg205669 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-12-09 10:11
The "C" locale is part of the ANSI C standard. The "POSIX" locale is an alias for the "C" locale and a POSIX standard, so we cannot just replace the ASCII encoding with UTF-8 as we wish, so Antoine's patch won't work. See e.g. http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html The C and POSIX locale settings are the only locale settings that are guaranteed to always exist in C libraries. Python 3 should work with such locale settings. It doesn't have to be able to output non-ASCII code points, but it should run with ASCII data. AFAIK, Python 3 does work with ASCII data in the C locale, so I'm not sure whether this is a bug at all.
msg205670 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 10:13
I didn't understand Serhiy's "ls" example. I tried:

$ mkdir unicode
$ cd unicode
$ python3 -c 'open("ab\xe9.txt", "w").close()'
$ python3 -c 'open("euro\u20ac.txt", "w").close()'
$ ls
abé.txt euro€.txt
$ LANG=C ls
ab??.txt euro???.txt

Ah yes, I didn't remember that "ls" is aware of the locale encoding.

printf() and wprintf() behave differently on unencodable/undecodable characters:
http://unicodebook.readthedocs.org/en/latest/programming_languages.html#printf-functions-family

Again, the issue is not specific to Python. So it's time to learn how to configure your locales correctly.

About the "interoperability" point I mentioned in my first message ("This encoding is the best choice for interoperability with other (python2 or non python) programs."): if you work around the annoying ASCII encoding by forcing the UTF-8 encoding, Python may produce data which would be incompatible with other applications following POSIX and so using the ASCII encoding.
msg205671 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 10:17
Nick> testing applications for POSIX compliance

Sorry but what do you mean by "POSIX compliance"? The POSIX standard only specifies the ASCII encoding.

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

"The tables in Locale Definition describe the characteristics and behavior of the POSIX locale for data consisting entirely of characters from the portable character set and the control character set. For other characters, the behavior is unspecified. For C-language programs, the POSIX locale shall be the default locale when the setlocale() function is not called."

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tagtcjh_3

"Portable character set" = ASCII
msg205672 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 10:19
Marc-Andre> AFAIK, Python 3 does work with ASCII data in the C locale, so I'm not sure whether this is a bug at all.

What do you mean? Python has used the surrogateescape error handler since Python 3.1: undecodable bytes are stored as surrogate characters.

Many bugs related to locales were fixed in Python 3.2, 3.3 and 3.4.

Are there remaining bugs?
msg205673 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-12-09 10:30
On 09.12.2013 11:19, STINNER Victor wrote:
>
> STINNER Victor added the comment:
>
> Marc-Andre> AFAIK, Python 3 does work with ASCII data in the C locale, so I'm not sure whether this is a bug at all.
>
> What do you mean? Python uses the surrogateescape encoding since Python 3.1, undecodable bytes are stored as surrogate characters.
>
> Many bugs related to locales were fixed in Python 3.2, 3.3 and 3.4.
>
> There are remaining bugs?

I was referring to the original bug report on this ticket.

FWIW: I don't think you can expect Python to work without exceptions if you use a C locale and write non-ASCII data to stdout.

I also don't think that the new ticket title is correct - or at least, I fail to see which aspect of Python breaks with LANG=C :-)
msg205675 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 10:42
I'm closing the issue as invalid, because Python 3 behaviour is correct and must not be changed.

Standard streams (sys.stdin, sys.stdout, sys.stderr) use the locale encoding. sys.stdin and sys.stdout use the strict error handler, sys.stderr uses the backslashreplace error handler. These encodings and error handlers can be overridden by the PYTHONIOENCODING environment variable. Since Python 3.3, it's possible to only set the error handler using the ":errors" syntax (ex: PYTHONIOENCODING=":replace").

Python uses sys.getfilesystemencoding() to decode data from / encode data to the operating system. Examples of operating system data: command line arguments, environment variables, host names, filenames, user names, etc.

On Windows, Python tries to use the wide character (Unicode) API of Windows everywhere to avoid any conversion, so as not to lose data. The MBCS codec (ANSI code page) of Windows uses a replace error handler by default, so it loses data. Try for example os.listdir() in a directory containing filenames not encodable to the ANSI code page in Python 2 (or os.listdir(b'.') in Python 3).

On Mac OS X, Python always uses UTF-8 for sys.getfilesystemencoding() (with the surrogateescape error handler, see PEP 383). The locale encoding is ignored for sys.getfilesystemencoding() (the locale encoding is still used in some functions).

On other operating systems... it's more complex. Python uses the locale encoding for sys.getfilesystemencoding() (with the surrogateescape error handler, see PEP 383). For the POSIX locale (aka the "C" locale), you may get the ASCII encoding on Linux, ASCII on FreeBSD and Solaris (whereas these operating systems announce an alias of the ISO 8859-1 encoding, but use ASCII in practice), ISO 8859-1 on AIX etc.

Using the locale encoding is the best choice for interoperability with other applications (which also use the locale encoding). Even if an application uses "raw bytes" (like Python 2), these bytes are still "locale aware". For example, when "raw bytes" are written to the standard output, bytes are decoded to find the appropriate character in the font of the terminal. When "raw bytes" are written into a socket to generate an HTML document (ex: listing of a directory, so a list of filenames), the web browser will decode them from the encoding announced in the HTML page. Even if the encoding is not explicit, it does still exist. Read other comments of this issue for other examples.

Forcing the POSIX locale to get a user interface in English is wrong if you also expect your application to still generate valid "raw bytes" in your "system" encoding (ISO 8859-1, ShiftJIS, UTF-8, whatever). To change the language, the correct environment variable is LC_MESSAGES: use LC_MESSAGES=C. Or better, use the real English locale, which will probably handle currency, numbers, etc. better. Example: LANG=en_US.utf8 (on Fedora, the "en_US" locale uses the ISO 8859-1 encoding).
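The settings described in this message can be inspected from a script. The following is a minimal sketch using only the public sys attributes; the helper name stream_settings is made up for this illustration, and the values printed depend entirely on the platform, the locale, and PYTHONIOENCODING.

```python
import sys


def stream_settings(stream):
    """Return (encoding, errors) for a text stream, or None where unavailable."""
    # getattr() with a default keeps this working if a stream was replaced
    # by an object without these attributes (or is None).
    return (getattr(stream, "encoding", None), getattr(stream, "errors", None))


settings = {
    name: stream_settings(getattr(sys, name, None))
    for name in ("stdin", "stdout", "stderr")
}

# Encoding and error handler used for operating system data (filenames etc.).
fs_settings = (sys.getfilesystemencoding(), sys.getfilesystemencodeerrors())

for name, (encoding, errors) in settings.items():
    print(f"sys.{name}: encoding={encoding!r} errors={errors!r}")
print(f"filesystem: encoding={fs_settings[0]!r} errors={fs_settings[1]!r}")
```

Running the same script once normally, once with LANG=C, and once with PYTHONIOENCODING=":replace" shows which of the values each setting affects.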
msg205688 - (view)	Author: (Sworddragon)	Date: 2013-12-09 13:20
> I'm closing the issue as invalid, because Python 3 behaviour is correct
> and must not be changed.

The fact that write() uses sys.getfilesystemencoding() is either a defect or a bad design (I leave the decision to you). But I'm still missing a reply to my suggestion. As I see it, there are no disadvantages to optionally giving the developer control.
msg205690 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 13:27
> The fact that write() uses sys.getfilesystemencoding() is either a defect or a bad design (I leave the decision to you).

"Standard streams (sys.stdin, sys.stdout, sys.stderr) use the locale encoding. sys.stdin and sys.stdout use the strict error handler, sys.stderr uses the backslashreplace error handler. These encodings and error handlers can be overridden by the PYTHONIOENCODING environment variable. Since Python 3.3, it's possible to only set the error handler using the ":errors" syntax (ex: PYTHONIOENCODING=":replace")."

stdout uses the locale encoding (and if you read my whole message, you may understand why sys.getfilesystemencoding() is also the locale encoding on UNIX).

(FYI on Windows, the OEM code page is used for standard streams.)

sys.getdefaultencoding() is always utf-8. This is unrelated to standard streams and OS data: it's the default value of the encoding parameter of str.encode() and bytes.decode(). I'm surprised that it's not documented to be utf-8; it is hardcoded and so always utf-8 in Python 3.

> But I'm still missing a reply to my suggestion. As I'm seeing it has no disadvantages to give the developer optionally the control.

"Standard streams (sys.stdin, sys.stdout, sys.stderr) use the locale encoding. sys.stdin and sys.stdout use the strict error handler, sys.stderr uses the backslashreplace error handler. These encodings and error handlers can be overridden by the PYTHONIOENCODING environment variable. Since Python 3.3, it's possible to only set the error handler using the ":errors" syntax (ex: PYTHONIOENCODING=":replace")."

If the environment variable is not enough, see also #15216 which proposes to add a TextIOWrapper.set_encoding() method. (I'm not really a fan of this proposition, but it looks like some users ask for it.)
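The point about sys.getdefaultencoding() can be checked in a few lines. This sketch only shows that the default encoding applies to str.encode() and bytes.decode() called without arguments, independently of the locale or the standard streams.

```python
import sys

text = "\xe4"  # the 'ä' character from the original report

# No encoding argument: the hardcoded default (utf-8 in Python 3) is used.
default_bytes = text.encode()
explicit_bytes = text.encode("utf-8")

# bytes.decode() without arguments uses the same default.
round_trip_ok = default_bytes.decode() == text

# Only ASCII is printed, so this line works even on an ASCII-only stdout.
print(sys.getdefaultencoding(), default_bytes, round_trip_ok)
```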
msg205691 - (view)	Author: Larry Hastings (larry) *	Date: 2013-12-09 13:28
> The fact that write() uses sys.getfilesystemencoding() is either
> a defect or a bad design (I leave the decision to you).

I have good news for you. write() does not call sys.getfilesystemencoding(), because the encoding is set at the time the file is opened.

> But I'm still missing a reply to my suggestion. As I'm seeing it
> has no disadvantages to give the developer optionally the control.

The programmer has all the control they need. They can open their own pipes using any encoding they like, and they can even reopen stdin/stdout with a different encoding if they wish.
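As a rough illustration of "open their own pipes using any encoding they like": the sketch below creates an OS pipe and wraps its write end in a text file object with an explicitly chosen encoding. The choice of latin-1 is arbitrary and only serves to show that the locale plays no role once the encoding is passed to open().

```python
import os

read_fd, write_fd = os.pipe()

# The encoding is fixed here, when the file object is created;
# later write() calls never consult the locale.
with open(write_fd, "w", encoding="latin-1") as writer:
    writer.write("\xe4")

# Leaving the "with" block flushed and closed the write end.
data = os.read(read_fd, 16)
os.close(read_fd)

print(data)
```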
msg205693 - (view)	Author: (Sworddragon)	Date: 2013-12-09 13:48
> If the environment variable is not enough

There is a big difference between environment variables and internal calls: environment variables are user-space while builtin/library functions are developer-space.

> I have good news for you. write() does not cal
> sys.getfilesystemencoding(), because the encoding is set at the time
> the file is opened.

Thanks for the clarification. I wish somebody had told me that after this sentence in my opening post: "It seems that print() and write() (and maybe other of such I/O functions) are relying on sys.getfilesystemencoding()." In theory this already makes my ticket invalid. Well, but now I wish print() would allow choosing the encoding like open() does ^^
msg205694 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-09 13:55
> There is a big difference between environment variables and internal calls: Environment variables are user-space while builtin/library functions are developer-space.

You can reopen sys.stdout with a different encoding and replace sys.stdout. I don't remember the exact recipe; it's tricky if you want portable code (you have to take care of newlines). For example, I wrote:
http://hg.python.org/cpython/file/ebe28dba4a78/Lib/test/regrtest.py#l895

But you can avoid reopening the file using stdout.detach().

> In theory this makes already my ticket invalid. Well, but now I would wish print() would allow to choose the encoding like open() too^^

Many options were already proposed. Another way, less convenient, is to use sys.stdout.buffer.write("text".encode(encoding)) (you have to flush sys.stdout before, and flush the buffer after, to avoid inconsistencies between the TextIOWrapper and the BufferedWriter).
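A minimal sketch of the detach() approach, assuming the stream is an io.TextIOWrapper (as sys.stdout normally is). The helper name rewrap_text_stream is hypothetical, and an in-memory buffer stands in for sys.stdout so the example has no side effects; the newline handling that makes a portable recipe tricky is deliberately left out.

```python
import io


def rewrap_text_stream(stream, encoding, errors="strict"):
    """Return a new text wrapper over the same binary buffer as `stream`."""
    stream.flush()            # push pending text into the binary buffer first
    buffer = stream.detach()  # `stream` itself is unusable after this call
    return io.TextIOWrapper(buffer, encoding=encoding, errors=errors,
                            line_buffering=True)


# An ASCII text stream over an in-memory buffer, standing in for sys.stdout.
raw = io.BytesIO()
ascii_out = io.TextIOWrapper(raw, encoding="ascii")

utf8_out = rewrap_text_stream(ascii_out, "utf-8")
utf8_out.write("\xe4")
utf8_out.flush()
result = raw.getvalue()

print(result)
```

Applied to the real stream this would read sys.stdout = rewrap_text_stream(sys.stdout, "utf-8"), with the caveat that any other reference to the old sys.stdout object stops working.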
msg205727 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2013-12-09 18:50
Ahh... added to the nosy list and bug closed all before I got up for the day ;-)

A few words: I do think that python is broken here. I do not think that translating everything to utf-8 if ascii is the locale's encoding is the solution. As I would state it, the problem is that python's boundary with the OS is not yet uniform. If you set LC_ALL=C (note, LC_ALL=C is just one of multiple ways to break things. For instance, LC_ALL=en_US.utf8 when dealing with latin-1 data will also break) then python will still read non-ascii data from the OS through some interfaces but it won't output it back to the OS. ie:

$ mkdir unicode && cd unicode
$ python3 -c 'open("ñ.txt".encode("latin-1"), "w").close()'
$ LC_ALL=en_US.utf8 python3
>>> import os
>>> dir_listing = os.listdir('.')
>>> for entry in dir_listing: print(entry)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf1' in position 0: surrogates not allowed

Note that currently, input() and sys.stdin.read() won't read undecodable data so this is somewhat symmetrical, but it seems to me that saying "everything that interfaces with the OS except the standard streams will use surrogateescape on undecodable bytes" is drawing a line in an unintuitive location.

(A further note to serhiy.storchaka.... Your examples are not showing anything broken in other programs. xterm is refusing both input and output that is non-ascii. This is symmetric behaviour. ls is doing its best to display a human-readable representation of bytes that it cannot convert in the current encoding. It also provides the -b switch to see the octal values if you actually care. Think of this like opening a binary file in less or another pager.)

(Further note for haypo -- On Fedora, the default of en_US is utf8, not ISO8859-1.)
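The asymmetry in the session above can be reproduced without touching the filesystem. This sketch decodes the same latin-1 filename bytes the way os.listdir() would under a UTF-8 locale (utf-8 with surrogateescape), then encodes the result back with and without that error handler; it is a model of the behaviour, not the code path CPython actually runs.

```python
# The bytes of "ñ.txt" in latin-1; 0xf1 is not valid UTF-8 here.
raw_name = "\xf1.txt".encode("latin-1")

# Input side: undecodable bytes become lone surrogates (PEP 383).
name = raw_name.decode("utf-8", "surrogateescape")

# Output side with the strict handler, as used by sys.stdout: fails.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Output side with surrogateescape: the original bytes come back.
restored = name.encode("utf-8", "surrogateescape")

print(ascii(name), strict_ok, restored)
```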
msg205747 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-09 22:50
There's a wrong assumption here: glib applications on Linux use UTF-8 regardless of locale. That's the part I have a problem with: the assumption that the locale will correctly specify the encoding to use for OS APIs on modern Linux systems. It's simply not always true: some Linux distros would be better handled like OS X, where we always use UTF-8, regardless of what the locale says.
msg205748 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2013-12-09 22:57
Nick: which glib functions are you specifically referring to? Many of them don't deal with strings at all, and of those that do, many are encoding-agnostic (i.e. it is correct to claim that they operate on UTF-8, but likewise also correct that they operate on Latin-1, simultaneously).
msg205749 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-12-09 23:00
> It's simply not always true: some Linux distros would be better handled > like OS X, where we always use UTF-8, regardless of what the locale says. Perhaps by the 3.5 timeframe we can default to utf-8 on all Unix systems?
msg205751 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-09 23:06
I confess I didn't independently verify the glib claim in the Stack Overflow post. However, Toshio's post covers the specific error case we were discussing at Flock (and I had misremembered), where the standard streams are classed as "OS APIs" for the purpose of deciding which encoding to use, but as user data APIs for the purpose of deciding which error handler to use. So the standard streams are only "sort of" an OS API, since they don't participate in the surrogateescape based round tripping guarantee by default.
msg205772 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2013-12-10 07:08
From what I read, it appears that the SO posting is plain wrong. Consider, for example,

https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-filename-to-utf8

# Converts a string which is in the encoding used by GLib for filenames
# into a UTF-8 string. Note that on Windows GLib uses UTF-8 for filenames;
# on other platforms, this function indirectly depends on the current locale.

The SO author might have misread the part where it says that glib uses UTF-8 on Windows (instead of the braindead "ANSI" encoding indirection).
msg205783 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-10 10:36
2013/12/10 Martin v. Löwis <report@bugs.python.org>:
> From what I read, it appears that the SO posting is plain wrong. Consider, for example,
>
> https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#g-filename-to-utf8
>
> # Converts a string which is in the encoding used by GLib for filenames
> # into a UTF-8 string. Note that on Windows GLib uses UTF-8 for filenames;
> # on other platforms, this function indirectly depends on the current locale.
>
> The SO author might have misread the part where it says that glib uses UTF-8 on Windows (instead of the braindead "ANSI" encoding indirection).

I wrote some notes about glib here:
http://unicodebook.readthedocs.org/en/latest/libraries.html#the-glib-library

g_filename_from_utf8() uses the g_get_filename_charsets() encoding. g_get_filename_charsets() is the ANSI code page on Windows and the locale encoding on Linux, except if the G_FILENAME_ENCODING or G_BROKEN_FILENAMES environment variables are set.

glib has a nice g_filename_display_name() function.
msg205848 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2013-12-10 19:28
Looking at the glib code, this looks like the SO post is closer to the truth. The API documentation for g_filename_to_utf8() is over-simplified to the point of confusion. This section of the glib API document is closer to what the code is doing:
https://developer.gnome.org/glib/stable/glib-Character-Set-Conversion.html#file-name-encodings

* When encoding matters, glib and gtk functions will assume that char*s that you pass to them point to strings which are encoded in utf-8. When char*s are not utf8 you are responsible for converting them to utf8 to be used by the glib functions (if encoding matters).
* glib provides g_filename_to_utf8() for the special case of transforming filenames into the encoding that glib expects. (Presumably because glib and gtk deal with non-utf8 unicode filenames more often than the equivalent environment variables, command line switches, etc).
* Contrary to the API docs for g_filename_to_utf8(), g_filename_to_utf8() will simply return a copy of the byte string it was passed unless G_FILENAME_ENCODING or G_BROKEN_FILENAMES is set. If those are set, then the value of G_FILENAME_ENCODING might be used to attempt to decode the filename or the encoding specified in the user's locale might be used.

@haypo, I'm pretty sure from reading the code for g_get_filename_charsets() that you have the conditionals reversed. What I'm seeing is:

if G_FILENAME_ENCODING:
    charset = the first charset listed in G_FILENAME_ENCODING
    if charset == '@locale':
        charset = charset of user's locale
elif G_BROKEN_FILENAMES:
    charset = charset of user's locale
else:
    charset = 'UTF-8'
msg205855 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-10 20:27
2013/12/10 Toshio Kuratomi <report@bugs.python.org>:
> if G_FILENAME_ENCODING:
>     charset = the first charset listed in G_FILENAME_ENCODING
>     if charset == '@locale':
>         charset = charset of user's locale
> elif G_BROKEN_FILENAMES:
>     charset = charset of user's locale
> else:
>     charset = 'UTF-8'

g_get_filename_charsets() returns a list of encodings. For the last case (else:), it uses ['utf-8', local_encoding] on UNIX. It's reliable because the utf-8 encoding has a nice feature: the utf-8 decoder fails if the byte string is not a valid utf-8 string.

It would be interesting to test this approach (try utf-8 or use the locale encoding) in PyUnicode_DecodeFSDefault/PyUnicode_EncodeFSDefault and _Py_char2wchar/_Py_wchar2char.
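The "try utf-8 or use the locale encoding" idea can be sketched in Python for the decoding direction only. This is a hypothetical helper, not what PyUnicode_DecodeFSDefault does; the function name decode_filename and the use of surrogateescape in the fallback are assumptions made for the illustration.

```python
def decode_filename(raw, fallback_encoding):
    """Decode filename bytes as UTF-8 if valid, else with the fallback encoding."""
    try:
        # The strict UTF-8 decoder rejects byte strings that are not valid
        # UTF-8, which is what makes "try UTF-8 first" workable.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback: the (locale) encoding supplied by the caller; bytes it
        # cannot decode are kept as lone surrogates.
        return raw.decode(fallback_encoding, "surrogateescape")


from_utf8 = decode_filename(b"\xc3\xa4", "latin-1")   # valid UTF-8
from_latin1 = decode_filename(b"\xe4", "latin-1")     # not valid UTF-8
from_ascii = decode_filename(b"\xe4", "ascii")        # fallback cannot decode

print(ascii(from_utf8), ascii(from_latin1), ascii(from_ascii))
```

The sketch says nothing about the encoding direction: once both inputs decode to the same string, the original bytes can no longer be told apart.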
msg205859 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-10 21:37
> It would interesting to test this approach (try utf-8 or use the locale encoding) ... Oh, it may be easy to implement it for decoders, but what about encoders? Should os.fsencode() always use UTF-8??
msg205871 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2013-12-10 23:30
Yes, it returns a list but unless I'm missing something in the general case it's the caller's responsibility to loop through the charsets to test for failure and try again. This is not done automatically.

In the specific case we're talking about, first get_filename_charset() decides to only return the first entry in the list of charsets:
https://git.gnome.org/browse/glib/tree/glib/gconvert.c#n1118

and then g_filename_to_utf8() disregards the charsets altogether because it sees that the filename is supposed to be utf-8:
https://git.gnome.org/browse/glib/tree/glib/gconvert.c#n1160
msg206055 - (view)	Author: (Sworddragon)	Date: 2013-12-13 11:01
>> The fact that write() uses sys.getfilesystemencoding() is either
>> a defect or a bad design (I leave the decision to you).

> I have good news for you. write() does not cal sys.getfilesystemencoding(), because the encoding is set at the time the
> file is opened.

Now, after some research, I see I wasn't wrong at all. I should have said:

"The fact that write() -> open() relies on sys.getfilesystemencoding() (respectively locale.getpreferredencoding()) by default as encoding is either a defect or a bad design (I leave the decision to you)."

Or am I overlooking something?
msg206065 - (view)	Author: Larry Hastings (larry) *	Date: 2013-12-13 11:58
> "The fact that write() -> open() relies on sys.getfilesystemencoding()
> (respectively locale.getpreferredencoding()) at default as encoding is
> either a defect or a bad design (I leave the decision to you)."
>
> Or am I overlooking something?

First, you should probably just drop mentioning write() or print() or any of the functions that actually perform I/O. The crucial decisions about decoding are made inside open().

Second, open() is implemented in C. It cannot "rely on sys.getfilesystemencoding()" as it never calls it. Internally, sys.getfilesystemencoding() simply returns a C global called Py_FileSystemDefaultEncoding. But open() doesn't examine that, either. Instead, open() determines the default encoding by calling the same function that's used to initialize Py_FileSystemDefaultEncoding: get_locale_encoding() in Python/pythonrun.c. Which on POSIX systems calls the POSIX function nl_langinfo().

If you want to see the actual mechanisms involved, you should read the C source code in Modules/_io in the Python trunk. open() is implemented as the C function io_open() in _iomodule.c. When it opens a file in text mode without an explicit encoding, it wraps it in a TextIOWrapper object; the __init__ function for this class is the C function textiowrapper_init() in textio.c.

As for your assertion that this is "either a defect or a bad design": I leave the critique of that to others.
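The default that a TextIOWrapper picks up when no encoding is given can be observed from Python without reading the C source. A small sketch, assuming only public APIs: it compares the encoding chosen by a wrapper created without an encoding argument against locale.getpreferredencoding(False), after normalizing both names through codecs.lookup(). On interpreters newer than the ones discussed here, UTF-8 mode can override the locale for both values.

```python
import codecs
import io
import locale

# What the locale machinery reports, without calling setlocale().
locale_encoding = locale.getpreferredencoding(False)

# The same wrapper type that open() uses for text mode, created
# here without an explicit encoding argument.
wrapper = io.TextIOWrapper(io.BytesIO())

# Names such as "UTF-8" and "utf-8" refer to the same codec,
# so compare the normalized codec names.
same_codec = (codecs.lookup(wrapper.encoding).name
              == codecs.lookup(locale_encoding).name)

print(wrapper.encoding, locale_encoding, same_codec)
```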
msg206068 - (view)	Author: (Sworddragon)	Date: 2013-12-13 12:19
> Instead, open() determines the default encoding by calling the same function that's used to initialize Py_FileSystemDefaultEncoding: get_locale_encoding() in Python/pythonrun.c. Which on POSIX systems calls the POSIX function nl_langinfo().

open() will by default use the encoding from nl_langinfo(), as sys.getfilesystemencoding() does on *nix. This is the part that looks dirty to me. As soon as LANG is set to C, open() will rely on 'ascii' due to nl_langinfo(), just as sys.getfilesystemencoding() does.
msg206071 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-13 12:51
There's an alternative to trying to force a different encoding for the standard streams when the OS claims ASCII as the OS encoding: we can default to surrogateescape as the error handler, on the assumption that whatever the real OS encoding is, it definitely isn't ASCII. That means we'll still complain about displaying improperly encoded data when the OS suggests a plausible encoding, but we won't fail entirely just because someone enabled (deliberately or accidentally) the POSIX locale.
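A sketch of what this proposal would mean for an ASCII stream, using in-memory buffers in place of the real standard streams. The filename bytes are an arbitrary UTF-8 example; the point is only that an ascii stream with the surrogateescape error handler passes such data through unchanged, while the strict handler raises.

```python
import io

# UTF-8 bytes for "café.txt", as they might arrive from os.listdir(b".").
raw_name = b"caf\xc3\xa9.txt"

# Under an ASCII locale, OS data is decoded with surrogateescape, so the
# two non-ASCII bytes become lone surrogates.
name = raw_name.decode("ascii", "surrogateescape")

# Current behaviour: a strict ASCII stream refuses the surrogates.
strict_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    strict_out.write(name)
    strict_out.flush()
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Proposed behaviour: surrogateescape restores the original bytes.
sink = io.BytesIO()
escape_out = io.TextIOWrapper(sink, encoding="ascii", errors="surrogateescape")
escape_out.write(name)
escape_out.flush()
written = sink.getvalue()

print(strict_failed, written)
```

A genuinely non-ASCII str such as "\xe9" would still fail on the second stream, since surrogateescape only handles the escaped bytes, not arbitrary characters.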
msg206098 - (view)	Author: (Sworddragon)	Date: 2013-12-13 15:46
By the way I have found a valid use case for LANG=C. udev and Upstart are not setting LANG which will result in the ascii encoding for invoked Python scripts. This could be a problem since these applications are commonly dealing with non-ascii filesystems.
msg206101 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-13 15:57
By the way, Java behaves as Python: with LANG=C, Java uses ASCII:
http://stackoverflow.com/questions/13415975/cant-read-utf-8-filenames-when-launched-as-an-upstart-service

> udev and Upstart are not setting LANG

So it's an issue in udev and Upstart. See for example:
https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/1235483
https://bugs.launchpad.net/ubuntu-translations/+bug/1208272

I found examples using "LANG=$LANG ..." when running a command in Upstart for example. I found another example using:

if [ -r /etc/default/locale ]; then
    . /etc/default/locale
    export LANG LANGUAGE
elif [ -r /etc/environment ]; then
    . /etc/environment
    export LANG LANGUAGE
fi
msg206107 - (view)	Author: (Sworddragon)	Date: 2013-12-13 16:17
> https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/1235483

After opening many hundreds of tickets I would say: with luck this ticket will get a response within the next year. But in the worst case it will simply be refused.

> I found examples using "LANG=$LANG

This Upstart script:

"exec echo LANG=$LANG > /tmp/test.txt"

will result in the following:

root@ubuntu:~# start test
test stop/waiting
root@ubuntu:~# cat /tmp/test.txt
LANG=

At least in this example I'm getting an empty LANG on my system.
msg206109 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2013-12-13 16:31
It's not a bug for upstart, systemd, sysvinit, cron, etc to use LANG=C. The POSIX locale is the only locale guaranteed to exist on a system. Therefore these low level services should be using LANG=C. Embedded systems, thin clients, and other low memory or low disk devices may benefit from shipping without any locales.
msg206112 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-13 16:40
I created the issue #19977 as a follow up of this one: "Use surrogateescape error handler for sys.stdout on UNIX for the C locale".
msg206116 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-12-13 16:44
I propose to modify the error handler, the encoding cannot be modified. See my following message explaining why it's not possible to change the encoding: http://bugs.python.org/issue19846#msg205675
msg206169 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2013-12-14 06:10
Thanks Victor - I now agree that trying to guess another encoding is a bad idea, and that enabling surrogateescape for the standard streams under the C locale is a better way to go.
msg232290 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2014-12-07 23:42
Since Victor's alternative in #19977 has been applied, should this issue be closed?
msg283717 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-21 04:42
Also see http://bugs.python.org/issue28180 for a more recent proposal to tackle this by coercing the C locale to the C.UTF-8 locale
msg308567 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-12-18 14:38
Follow-up: the PEP 538 (bpo-28180) and PEP 540 (bpo-29240) have been accepted and implemented in Python 3.7!

History
Date	User	Action	Args
2022-04-11 14:57:54	admin	set	github: 64045
2017-12-18 14:38:09	vstinner	set	messages: + msg308567
2016-12-21 04:42:44	ncoghlan	set	messages: + msg283717
2016-04-22 09:24:02	serhiy.storchaka	set	messages: - msg263975
2016-04-22 08:48:15	SilentGhost	set	nosy: + lemburg, loewis, terry.reedy, ncoghlan, pitrou, vstinner, larry, jwilk, a.badger, r.david.murray, Sworddragon, serhiy.storchaka, bkabrda
2016-04-22 07:44:31	editor-buzzfeed	set	nosy: + editor-buzzfeed, - lemburg, loewis, terry.reedy, ncoghlan, pitrou, vstinner, larry, jwilk, a.badger, r.david.murray, Sworddragon, serhiy.storchaka, bkabrda messages: + msg263975
2015-05-17 22:44:01	terry.reedy	set	stage: patch review -> resolved
2014-12-07 23:42:22	terry.reedy	set	messages: + msg232290
2013-12-21 17:09:52	jwilk	set	nosy: + jwilk
2013-12-14 06:10:02	ncoghlan	set	messages: + msg206169
2013-12-13 16:44:04	vstinner	set	messages: + msg206116
2013-12-13 16:40:59	vstinner	set	messages: + msg206112
2013-12-13 16:31:40	a.badger	set	messages: + msg206109
2013-12-13 16:17:08	Sworddragon	set	messages: + msg206107
2013-12-13 15:57:02	vstinner	set	messages: + msg206101
2013-12-13 15:46:40	Sworddragon	set	messages: + msg206098
2013-12-13 12:51:39	ncoghlan	set	messages: + msg206071
2013-12-13 12:19:49	Sworddragon	set	messages: + msg206068
2013-12-13 11:58:51	larry	set	messages: + msg206065
2013-12-13 11:01:31	Sworddragon	set	messages: + msg206055
2013-12-10 23:30:43	a.badger	set	messages: + msg205871
2013-12-10 21:37:27	vstinner	set	messages: + msg205859
2013-12-10 20:27:49	vstinner	set	messages: + msg205855
2013-12-10 19:28:11	a.badger	set	messages: + msg205848
2013-12-10 10:36:37	vstinner	set	messages: + msg205783
2013-12-10 07:08:45	loewis	set	messages: + msg205772
2013-12-09 23:06:51	ncoghlan	set	messages: + msg205751
2013-12-09 23:00:17	pitrou	set	messages: + msg205749
2013-12-09 22:57:01	loewis	set	messages: + msg205748
2013-12-09 22:50:02	ncoghlan	set	messages: + msg205747
2013-12-09 18:50:38	a.badger	set	messages: + msg205727
2013-12-09 13:55:31	vstinner	set	messages: + msg205694
2013-12-09 13:48:16	Sworddragon	set	messages: + msg205693
2013-12-09 13:28:57	larry	set	messages: + msg205691
2013-12-09 13:27:37	vstinner	set	messages: + msg205690
2013-12-09 13:20:13	Sworddragon	set	messages: + msg205688
2013-12-09 10:42:16	vstinner	set	status: open -> closed versions: + Python 3.3, Python 3.4, - Python 3.5 title: Setting LANG=C breaks Python 3 on Linux -> Python 3 raises Unicode errors with the C locale messages: + msg205675 resolution: not a bug
2013-12-09 10:30:12	lemburg	set	messages: + msg205673
2013-12-09 10:19:10	vstinner	set	messages: + msg205672
2013-12-09 10:17:14	vstinner	set	messages: + msg205671
2013-12-09 10:13:15	vstinner	set	messages: + msg205670
2013-12-09 10:11:55	lemburg	set	messages: + msg205669
2013-12-09 09:40:34	serhiy.storchaka	set	messages: + msg205655
2013-12-09 09:36:51	serhiy.storchaka	set	messages: + msg205654
2013-12-09 04:03:50	Sworddragon	set	messages: + msg205646
2013-12-09 02:56:42	ncoghlan	set	messages: + msg205642
2013-12-09 02:08:26	vstinner	set	messages: + msg205640
2013-12-09 01:54:26	ncoghlan	set	nosy: + a.badger, bkabrda messages: + msg205637 title: Setting LANG=C breaks Python 3 -> Setting LANG=C breaks Python 3 on Linux
2013-12-09 00:33:03	vstinner	set	messages: + msg205625
2013-12-09 00:24:19	pitrou	set	messages: + msg205623
2013-12-08 22:22:16	vstinner	set	messages: + msg205615 versions: + Python 3.5, - Python 3.4
2013-12-08 22:03:09	vstinner	set	title: print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding() -> Setting LANG=C breaks Python 3
2013-12-08 22:02:50	vstinner	set	messages: + msg205611
2013-12-08 14:09:48	ncoghlan	set	messages: + msg205564
2013-12-08 12:38:15	pitrou	set	messages: + msg205555
2013-12-08 12:35:02	larry	set	messages: + msg205554
2013-12-08 11:52:08	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg205550
2013-12-08 11:45:22	vstinner	set	messages: + msg205549
2013-12-08 11:41:37	pitrou	set	messages: + msg205548
2013-12-08 11:37:24	vstinner	set	messages: + msg205547 title: Setting LANG=C breaks Python 3 -> print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding()
2013-12-08 11:16:16	ncoghlan	set	title: print() and write() are relying on sys.getfilesystemencoding() instead of sys.getdefaultencoding() -> Setting LANG=C breaks Python 3
2013-12-08 11:16:02	ncoghlan	set	messages: + msg205545
2013-12-08 10:49:59	vstinner	set	messages: + msg205538
2013-12-08 02:16:03	ncoghlan	set	messages: + msg205505
2013-12-07 23:22:23	pitrou	set	messages: + msg205498
2013-12-07 23:18:40	vstinner	set	messages: + msg205497
2013-12-07 17:25:42	serhiy.storchaka	set	nosy: + lemburg, loewis
2013-12-07 17:17:20	pitrou	set	files: + asciilocale.patch versions: + Python 3.4, - Python 3.3 keywords: + patch nosy: + larry messages: + msg205472 stage: patch review
2013-12-07 16:10:51	ncoghlan	set	messages: + msg205465
2013-12-07 15:54:47	pitrou	set	nosy: + ncoghlan
2013-12-07 15:34:40	pitrou	set	messages: + msg205462
2013-12-07 15:14:30	Sworddragon	set	messages: + msg205459
2013-12-07 14:01:31	vstinner	set	messages: + msg205454
2013-12-07 00:13:52	pitrou	set	nosy: + pitrou messages: + msg205419
2013-12-07 00:12:03	terry.reedy	set	nosy: + terry.reedy messages: + msg205418
2013-11-30 22:25:20	vstinner	set	messages: + msg204852
2013-11-30 21:53:45	r.david.murray	set	nosy: + vstinner, r.david.murray messages: + msg204850
2013-11-30 21:40:45	Sworddragon	create