Issue 28180: Implementation of the PEP 538: coerce C locale to C.utf-8



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/72367

classification

Title:	Implementation of the PEP 538: coerce C locale to C.utf-8
Type:	behavior	Stage:	resolved
Components:	Unicode	Versions:	Python 3.7

process

Status:	closed	Resolution:	fixed
Dependencies:	30565 30635 30647	Superseder:
Assigned To:	ncoghlan	Nosy List:	Jan Niklas Hasse, Sworddragon, abarry, akira, barry, ezio.melotti, lemburg, mcepl, methane, ncoghlan, ned.deily, r.david.murray, ronaldoussoren, vstinner, xdegaye, yan12125
Priority:	normal	Keywords:	patch

Created on 2016-09-16 11:17 by Jan Niklas Hasse, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
fedora-cpython-force-c-utf-8.diff	ncoghlan, 2016-12-15 06:15	Downstream patch currently proposed for Fedora 26	review
fedora-cpython-PYTHONALLOWCLOCALE.diff	ncoghlan, 2016-12-18 07:10	Draft Fedora 26 patch as at 2016-12-18	review
pep538_coerce_legacy_c_locale.diff	ncoghlan, 2016-12-28 02:45	Initial patch for PEP 538 (targeting 3.7)	review
pep538_coerce_legacy_c_locale_v2.diff	ncoghlan, 2017-01-03 04:15	Add test cases for handling of unknown locales	review
pep538-check-click.sh	ncoghlan, 2017-01-07 11:43	Utility script to check click's behaviour in a PEP 538 patched CPython
pep538_coerce_legacy_c_locale_v3.diff	ncoghlan, 2017-01-08 02:22	Refactor PEP 538 test cases to cover no locale setting, C locale, POSIX locale and unknown locale	review
android_setlocale.patch	xdegaye, 2017-01-18 15:15
pep538_coerce_legacy_c_locale.patch	mcepl, 2020-03-21 12:47	Unfinished attempt to port this patch to Python 3.4

Pull Requests
URL	Status	Linked	Edit
PR 659	merged	ncoghlan, 2017-03-13 06:08
PR 2130	merged	ncoghlan, 2017-06-12 13:28
PR 2155	merged	vstinner, 2017-06-13 09:22
PR 2208	merged	ncoghlan, 2017-06-15 04:42
PR 4334	merged	xdegaye, 2017-11-08 11:09

Messages (89)
msg276693 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2016-09-16 11:17
Working with Docker I often end up with an environment where the locale isn't correctly set. In these cases it would be great if sys.getfilesystemencoding() could default to 'utf-8' instead of 'ascii', as it's the encoding of the future and ascii is a subset of it anyway. Related: http://bugs.python.org/issue19846
msg276694 - (view)	Author: Anilyka Barry (abarry) *	Date: 2016-09-16 11:22
This is a duplicate of issue27781.
msg276707 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-09-16 13:02
> This is a duplicate of issue27781.
issue27781 is specific to Windows. I'm not sure that it's the case in this issue, so I'm reopening it.
@Jan Niklas Hasse: What is your OS? I proposed adding a "-X utf8" command line option for UNIX to force the utf8 encoding. Would it work for you?
msg276709 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2016-09-16 13:09
Unfortunately no, as this would mean I'd have to change all the python invocations in my scripts, and it wouldn't work for executable files with #!/usr/bin/env python3, would it?
msg276722 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2016-09-16 14:46
I thought we "fixed" this by using surrogate escape when the locale was ASCII? We certainly have discussed changing the default on posix and so far have decided not to (someday that will change...is this someday already?)
msg276729 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-09-16 17:18
> is this someday already?)
Not yet :-)
msg277273 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2016-09-23 12:47
Why not?
msg277274 - (view)	Author: Inada Naoki (methane) *	Date: 2016-09-23 12:59
I want a locale-free Python which behaves as if it were running in the C.UTF-8 locale (stdio encoding, preferred encoding, weekday in _strptime._strptime, and maybe more). But Python 3.6 is already in feature freeze >_<;;
msg282964 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-12 05:26
I think we're genuinely getting to the point now where the majority of "LANG=C" cases are misconfigurations rather than intended behaviour. We're also at the point where:
- on Mac OS X, binary system interfaces have been handled as UTF-8 by default since 3.0
- on Windows, as of 3.6, the OS native binary system interfaces are now bypassed entirely in favour of transcoding from UTF-8 to UTF-16-LE
So I think for Python 3.7 it makes sense to do the following on other *nix systems:
- very early in CPython startup (even before argument processing), if the detected locale is "C", force it to "C.UTF-8" if possible, and print a warning either way
- add a PYTHONKEEPASCIILOCALE environment variable to turn that behaviour off
I do think we actually want to *change* the C level locale in the process though, as otherwise we can expect to see weird interactions where CPython and extension modules disagree about the default text encoding.
msg282965 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-12 05:29
Note also that if we say we're going to do this for 3.7, and go ahead and implement it, then distros may be more inclined to incorporate the same behavioural changes into distro-provided releases of 3.6, providing real world testing of the concept before we make it the default behaviour.
msg282970 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2016-12-12 08:03
Actually in a new Docker container, the LANG variable isn't set at all. Defaulting to UTF-8 in that case should be easier to reason about, shouldn't it?
msg282971 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-12 08:10
From CPython's point of view, glibc behaves the same way (i.e. reporting `ascii` as the preferred encoding for operating system interfaces) regardless of whether the cause is the locale not being set at all, or due to it being explicitly set to the legacy POSIX locale via `LANG=C`.
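A minimal sketch for checking this claim on a given system: it launches a child interpreter once with no locale variables set and once with `LANG=C`, then compares what each reports. The helper name `reported_fs_encoding` is hypothetical, and the encoding actually printed depends on the platform and Python version (interpreters that include the locale coercion discussed in this issue will not report `ascii`).

```python
import os
import subprocess
import sys

LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def reported_fs_encoding(overrides):
    """Run a child interpreter and return its sys.getfilesystemencoding()."""
    # Start from the current environment with the locale variables removed,
    # then apply only the overrides under test.
    env = {k: v for k, v in os.environ.items() if k not in LOCALE_VARS}
    env.update(overrides)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.getfilesystemencoding())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


unset = reported_fs_encoding({})               # nothing configured at all
explicit_c = reported_fs_encoding({"LANG": "C"})  # legacy locale requested explicitly
print(unset, explicit_c)
```

Both calls should print the same encoding, whatever it is, since the interpreter cannot tell the two situations apart.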
msg282972 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2016-12-12 08:45
https://sourceware.org/glibc/wiki/Proposals/C.UTF-8#Defaults mentions that C.UTF-8 should be glibc's default. This bug report also mentions Python: https://sourceware.org/bugzilla/show_bug.cgi?id=17318 It hasn't been fixed yet, though :/
msg282977 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2016-12-12 10:17
If we just restrict this to the file system encoding (and not the whole LANG setting), how about:
* default the file system encoding to 'utf-8' and use the surrogate escape handler as default error handler
* add a PYTHONFSENCODING env var to set the file system encoding to something else (*)
(*) I believe we discussed this at some point already, but don't remember the outcome.
Regarding the question of defaulting to LANG=C.UTF-8: I think this needs some more thought, since it would also affect many C locale aware functions. To make this work, Python would have to call setlocale() early on in the startup phase to adjust the C lib accordingly.
msg282978 - (view)	Author: Inada Naoki (methane) *	Date: 2016-12-12 10:26
Sorry for the confusion. I didn't mean defaulting to LANG=C.UTF-8. I meant using UTF-8 as the default fsencoding and stdio encoding regardless of locale, and having locale.getpreferredencoding() return 'utf-8' when LC_CTYPE is ascii.
msg282984 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-12 11:49
The challenge that arises in being selective about this is that "sys.getfilesystemencoding()" is actually a misnomer, and some of the things we use it for (like decoding command line arguments and environment variables) necessarily happen really early in the interpreter bootstrapping process. The bugs that arise from being internally inconsistent are then even harder to debug than those that arise from believing the OS when it says the right encoding to use is ASCII - the latter at least don't tend to be subtle, and are amenable to being resolved via "LC_ALL=C.UTF-8" and "LANG=C.UTF-8". I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up. For Fedora 26, I'm going to explore the feasibility of patching our system 3.6 installation such that the python3 command itself (rather than the shared library) checks for "LC_CTYPE=C" as almost the first thing it does, and forcibly sets LANG and LC_ALL to C.UTF-8 if it gets an answer it doesn't like. If we're able to do that successfully in the more constrained environment of a specific recent Fedora release, then I think it will bode well for doing something similar by default in CPython 3.7
msg283244 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-15 06:15
Downstream Fedora issue proposing the above idea for F26: https://bugzilla.redhat.com/show_bug.cgi?id=1404918 I've also attached the patch from that issue here.
msg283408 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-12-16 15:12
Victor>> I proposed to add "-X utf8" command line option for UNIX to force utf8 encoding. Would it work for you?
Jan Niklas Hasse> Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with "#!/usr/bin/env python3" would it?
Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1. Use your favorite method to define the env var "system wide" in your docker containers.
Note: Technically, I'm not sure that it's possible to support the -E option with PYTHONUTF8, since -E comes from the command line, and we first need to decode command line arguments with an encoding to parse these options... Chicken-and-egg issue ;-)
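For reference, both switches proposed here later shipped in Python 3.7 as part of PEP 540. The sketch below assumes an interpreter of that vintage or newer and checks that the command line option and the environment variable have the same effect in a child process; `probe` is a hypothetical helper name.

```python
import os
import subprocess
import sys

# Child program: report whether UTF-8 mode is on and which fs encoding is used.
PROBE = "import sys; print(sys.flags.utf8_mode, sys.getfilesystemencoding())"


def probe(args=(), env_overrides=None):
    """Run a child interpreter with extra args/env and return its two reports."""
    env = dict(os.environ)
    env.update(env_overrides or {})
    result = subprocess.run(
        [sys.executable, *args, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.split()


via_flag = probe(args=["-X", "utf8"])             # per-invocation opt-in
via_env = probe(env_overrides={"PYTHONUTF8": "1"})  # "system wide" opt-in
print(via_flag, via_env)
```

The environment variable form is the one that also covers scripts started through a `#!/usr/bin/env python3` line, since no interpreter arguments need to change.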
msg283409 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-12-16 15:15
> I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up.
Yeah, it just doesn't work to use more than one encoding per process. You should use the same encoding for the whole lifetime of a process. If you decode early data from an encoding A and later encode it back to encoding B, you get mojibake. The problem is simple: using more than one encoding per process means starting to make assumptions about how data is used. For example, consider that environment variables use the encoding A, but filenames should use the encoding B. But what if an environment variable contains a filename? There are similar issues for command line arguments, subprocess pipes, standard streams (sys.std*), etc.
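A small illustration of the mojibake described above, assuming "encoding A" is Latin-1 and "encoding B" is UTF-8 (the sample string is made up for the example):

```python
original = "café"

# The bytes as they arrive from the OS, really encoded as UTF-8.
wire_bytes = original.encode("utf-8")

# Early in startup the process wrongly assumes encoding A (Latin-1) ...
misread = wire_bytes.decode("latin-1")

# ... and later writes the text back out using encoding B (UTF-8).
written_back = misread.encode("utf-8")

print(repr(misread))
print(wire_bytes, written_back)
```

The bytes written back no longer match the bytes read, so a filename passed through an environment variable this way would point at a different (probably nonexistent) file.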
msg283469 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-17 07:46
We've been discussing this further downstream in the Fedora Python SIG, and we have a draft approach that we're pretty sure will work for us (based in turn on the approach Armin Ronacher came up with for click), and we think it should work for other distros as well (as long as they already ship the C.UTF-8 locale, and if they don't, they should fix that limitation anyway). So I'm assigning this to myself as I think the next step will be to write a PEP that both proposes the specific idea as the default behaviour in 3.7, and also encourages distros to opt-in to trialling it as a downstream patch for 3.6.
msg283471 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-17 07:56
Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is especially a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation). So the approach I'm proposing is to implement a C->C.UTF-8 locale override in the actual python CLI executable, and then in the dynamically linked library we only emit a warning if we detect the C locale, we don't actually do anything to change it.
msg283482 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2016-12-17 10:15
On 17.12.2016 08:56, Nick Coghlan wrote:
> Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is especially a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation).
Another use case to consider is embedding the Python interpreter in another application. In such situations, the C locale will usually already be set by the main application and it may conflict with the LANG or other locale env var settings, since the user may have chosen to use a different locale in the context of the application.
msg283495 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-17 15:33
On 17 December 2016 at 20:15, Marc-Andre Lemburg <report@bugs.python.org> wrote:
> Another use case to consider is embedding the Python
> interpreter in another application. In such situations,
> the C locale will usually already be set by the main
> application and it may conflict with the LANG or other
> locale env var settings, since the user may have chosen
> to use a different locale in the context of the application.
>
Aye, that's the origin of the split proposal to only emit a warning in the shared library (since CPython might only be a piece of a larger application), but implement actual locale coercion (by overriding LANG and LC_ALL in the process environment) in the command line app's main() function (as in that case we know CPython is the application).
The hard part of writing the PEP isn't really going to be explaining the proposal itself (I expect it to be around a 20 line patch to the C code) - it's going to be explaining why all the other possibilities we've considered over the years don't work, and why we (as in the Fedora Python SIG) think this one actually stands a chance of working properly :)
msg283515 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2016-12-17 20:19
> Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.
>
> Use your favorite method to define the env var "system wide" in your docker containers.
This doesn't help me, as I already set LANG to C.utf-8. I'm rather thinking about new people trying out Python in Docker who don't know about this. Furthermore, I think that UTF-8 is the future and the use of ASCII should be discouraged.
msg283543 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-18 07:20
For folks not following the Fedora BZ issue directly, I've also attached the latest draft downstream patch here, which gives the following behaviour:
==========================
$ ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8
$ LANG=C.UTF-8 ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8
$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this behaviour).
utf-8
$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, but PYTHONALLOWCLOCALE is set. Some libraries, applications, and operating system interfaces may not work correctly.
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Use `PYTHONALLOWCLOCALE=1 LC_CTYPE=C python3` to configure a similar environment when running Python directly.
ascii
==========================
(The double warning in the last example is likely to go away by skipping the CLI level warning in that case)
The Python tests checking for the expected behaviour are significantly longer than the C level changes needed to implement it :)
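The decision logic behind that transcript can be approximated in Python. This is only a sketch under stated assumptions: the real patch is C code that inspects the result of setlocale() rather than the raw environment variables, the function name `coerce_c_locale` is hypothetical, and `PYTHONALLOWCLOCALE` is the opt-out name used by this draft patch.

```python
def coerce_c_locale(environ):
    """Return a copy of environ with the legacy C locale replaced by C.UTF-8."""
    env = dict(environ)

    # Opt-out switch from the draft patch: leave the environment untouched.
    if env.get("PYTHONALLOWCLOCALE"):
        return env

    # POSIX precedence for the character type category: LC_ALL, then LC_CTYPE,
    # then LANG. Unset or empty everywhere means the default "C" locale.
    effective = env.get("LC_ALL") or env.get("LC_CTYPE") or env.get("LANG") or "C"

    if effective in ("C", "POSIX"):
        # Override both variables, as the draft patch's warning message describes.
        env["LC_ALL"] = "C.UTF-8"
        env["LANG"] = "C.UTF-8"
    return env


print(coerce_c_locale({"LANG": "C"}))
print(coerce_c_locale({"LANG": "en_AU.UTF-8"}))
print(coerce_c_locale({"LANG": "C", "PYTHONALLOWCLOCALE": "1"}))
```

Only the legacy locale is touched: an explicitly configured locale such as en_AU.UTF-8 passes through unchanged.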
msg283732 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-12-21 09:54
Previous related work:
changeset: 89836:bc06f67234d0
user: Victor Stinner <victor.stinner@gmail.com>
date: Tue Mar 18 01:18:21 2014 +0100
files: Doc/whatsnew/3.5.rst Lib/test/test_sys.py Misc/NEWS Python/pythonru
description: Issue #19977: When the ``LC_TYPE`` locale is the POSIX locale (``C`` locale), :py:data:`sys.stdin` and :py:data:`sys.stdout` are now using the ``surrogateescape`` error handler, instead of the ``strict`` error handler.
msg284150 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-28 02:45
I've now written this up as a PEP: https://github.com/python/peps/blob/master/pep-0538.txt
The latest attached patch implements the specific design proposed in the PEP. Relative to the last Fedora specific patch, this tweaks the warning message wording slightly, and only emits the library level warning when PYTHONALLOWCLOCALE is set:
======================
$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
utf-8
======================
$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Set `PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment when running Python directly.
ascii
msg284170 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2016-12-28 12:23
The only important case for me: what happens when LANG is unset?
msg284176 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2016-12-28 15:29
If nothing is configured (i.e. none of LC_ALL, LC_CTYPE or LANG are set in the environment), then C reports the locale as "C". It's probably worthwhile for me to add a Background section to the PEP that explains the behaviour of ``setlocale`` at the C level, as that's the source of the majority of the problems, as well as the key mechanism used to implement the locale coercion.
msg284537 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-03 04:15
Updated patch adds some tests showing that this change should also help with cases where SSH environment forwarding results in an unknown locale being requested in the server environment.
msg284605 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-04 01:02
I read PEP 538, but I can't understand why simply using UTF-8 when the locale is C, as on macOS, is a bad idea.
msg284620 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-04 08:01
On Mac OS X, the XCode libc already ignores the locale settings and just uses UTF-8 as the default text encoding, so the hardcoding in CPython aligns with that behaviour. That isn't the case on other *nix systems - there, we need CPython to be consistent with the configured C/C++ locale, *and* we need it to be using something other than ASCII as the default encoding. Answer: coerce the default locale from C to C.UTF-8 (if available), or to en_US.UTF-8 (for older distros that don't provide C.UTF-8). (The latter aspect isn't in the PEP yet, it's an improvement that came up in the linux-sig discussions: https://github.com/python/peps/issues/171 )
msg284621 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-04 08:29
> That isn't the case on other *nix systems - there, we need CPython to be consistent with the configured C/C++ locale, *and* we need it to be using something other than ASCII as the default encoding.
Isn't using UTF-8 as the filesystem encoding and stdin/stdout encoding consistent with the C or POSIX locale? Don't "modern" programming environments (Rust, Go, node.js) use UTF-8 even if the locale is C or POSIX?
msg284631 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-04 11:41
I'm sorry. I should have searched the old discussions about why we can't simply use utf-8 for fsencoding in the C locale, instead of asking here.
msg284641 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-04 14:46
The default encoding in the C/POSIX locale is ASCII (which is the entire source of the problem). The initial version of the PEP I uploaded didn't explain that background, but I added a section about it in the update earlier this week: https://www.python.org/dev/peps/pep-0538/#background
msg284647 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-01-04 16:06
> The default encoding in the C/POSIX locale is ASCII (which is the entire source of the problem).
The reality is more complex than that :-) It depends on the OS. Some OSes use Latin1 for the POSIX locale. Some OSes announce ASCII for the POSIX locale, but use Latin1 in practice :-) On these lying OSes, Python decodes bytes 0x80..0xff using mbstowcs() to check if we get ASCII or Latin1: see the check_force_ascii() function.
/* Workaround FreeBSD and OpenIndiana locale encoding issue with the C locale.
   On these operating systems, nl_langinfo(CODESET) announces an alias of the
   ASCII encoding, whereas mbstowcs() and wcstombs() functions use the
   ISO-8859-1 encoding. The problem is that os.fsencode() and os.fsdecode() use
   locale.getpreferredencoding() codec. For example, if command line arguments
   are decoded by mbstowcs() and encoded back by os.fsencode(), we get a
   UnicodeEncodeError instead of retrieving the original byte string.

   The workaround is enabled if setlocale(LC_CTYPE, NULL) returns "C",
   nl_langinfo(CODESET) announces "ascii" (or an alias to ASCII), and at least
   one byte in range 0x80-0xff can be decoded from the locale encoding. The
   workaround is also enabled on error, for example if getting the locale
   failed.

   (...) */
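The detection idea in that comment can be sketched in Python. This is not the CPython implementation: the real check is C code that also verifies setlocale(LC_CTYPE, NULL) returns "C", and here the platform's mbstowcs() is replaced by a caller-supplied `decode_byte` function so the logic can be exercised anywhere. Both function names are hypothetical.

```python
import codecs


def is_ascii_alias(codeset_name):
    """True if the announced codeset name resolves to Python's ASCII codec."""
    try:
        return codecs.lookup(codeset_name).name == "ascii"
    except LookupError:
        return False


def needs_force_ascii(announced_codeset, decode_byte):
    """Detect a locale that announces ASCII but decodes non-ASCII bytes.

    decode_byte stands in for the platform's mbstowcs(): it receives a
    one-byte bytes object and raises UnicodeDecodeError if it cannot decode it.
    """
    if not is_ascii_alias(announced_codeset):
        return False
    for value in range(0x80, 0x100):
        try:
            decode_byte(bytes([value]))
        except UnicodeDecodeError:
            continue
        # At least one high byte decoded: the announcement does not match
        # the actual behaviour, so the workaround would be enabled.
        return True
    return False


# A "lying" platform: announces an ASCII alias but decodes like ISO-8859-1.
print(needs_force_ascii("ANSI_X3.4-1968", lambda b: b.decode("latin-1")))
# An honest platform: announces ASCII and really rejects high bytes.
print(needs_force_ascii("ANSI_X3.4-1968", lambda b: b.decode("ascii")))
```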
msg284697 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-05 03:32
On Linux, I think most people want UTF-8:surrogateescape by default, without fighting against locale and environment variables. There is already a `#if defined(__APPLE__) || defined(__ANDROID__)` path for it. How about adding a configure option to use the same logic? (say `--with-encoding=(locale|utf-8)`, with the preferred encoding changed in the same way). It may help the many people building Python themselves without root privileges for generating the C.UTF-8 locale.
msg284716 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-05 09:26
Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly. We can do things differently on Mac OS X and iOS because Apple ensure that C behaves differently on Mac OS X and iOS (and apparently Google do something similar for Android, so I'll update the PEP to mention that as well).
msg284718 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-05 09:42
> Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly.
What I propose is to not use mbstowcs, as is already done for __ANDROID__:
    wchar_t*
    Py_DecodeLocale(const char* arg, size_t *size)
    {
    #if defined(__APPLE__) || defined(__ANDROID__)
        wchar_t *wstr;
        wstr = _Py_DecodeUTF8_surrogateescape(arg, strlen(arg));
On Linux, command line arguments and file paths are just byte sequences. So using UTF-8:surrogateescape from startup onwards should work fine. Am I wrong?
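The property being relied on here is that decoding with the `surrogateescape` error handler is lossless for arbitrary bytes. A minimal sketch, using a made-up file name containing a byte that is not valid UTF-8:

```python
# A file name as the kernel sees it: just bytes, here with an invalid 0xff.
raw_name = b"report-\xff.txt"

# Strict decoding refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone surrogate ...
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# ... which encodes back to exactly the original byte.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(as_text), round_tripped == raw_name)
```

This covers data that stays inside Python; it does not by itself address the concern raised elsewhere in this thread about C/C++ libraries in the same process that consult the C locale.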
msg284719 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2017-01-05 09:51
On 05.01.2017 10:26, Nick Coghlan wrote:
> Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly. We can do things differently on Mac OS X and iOS because Apple ensure that C behaves differently on Mac OS X and iOS (and apparently Google do something similar for Android, so I'll update the PEP to mention that as well).
I believe INADA-san (hope that's the right way to address him) raised a good point though: what if a system doesn't come with the C.UTF-8 locale set up? The C lib would then error out when trying to use setlocale() on such an environment. Now, Python's main() function doesn't look at any such errors (and neither do the other places which use it such as frozenmain.c and readline.c), so it wouldn't even notice. The setlocale() man-page doesn't mention how such a failure would affect the current locale settings. My guess is that the locale remains set to what it was before, which in case of a fresh C application start is the "C" locale.
So in the implementation of the PEP, there should be a test to see whether "C.UTF-8" does result in a successful call to setlocale(). If it doesn't, there would have to be some work-around to still make Python's FS encoding happy while leaving the C lib locale set at "C".
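A sketch of the kind of availability test suggested here, written with Python's `locale` module rather than in C. The candidate names are the ones mentioned in this thread (C.UTF-8, the HP-UX style C.utf8, and en_US.UTF-8); the helper name `first_available_locale` is hypothetical, and which candidate succeeds, if any, depends entirely on the host system.

```python
import locale

# Locale names mentioned in this discussion, in order of preference.
CANDIDATES = ("C.UTF-8", "C.utf8", "en_US.UTF-8")


def first_available_locale(candidates=CANDIDATES):
    """Return the first candidate that setlocale() accepts, or None."""
    # Calling setlocale() without a locale argument only queries the setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # not installed on this system, try the next one
            return name
        return None
    finally:
        # Restore whatever was configured before probing.
        locale.setlocale(locale.LC_CTYPE, saved)


print(first_available_locale())
```

Python raises `locale.Error` when the underlying setlocale() call fails, which is the signal the C code would have to check for explicitly via the return value.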
msg284720 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-05 10:10
Why I want to add a configure option to ignore the locale:
1. C.UTF-8 is not supported by RHEL7 (https://bugzilla.redhat.com/show_bug.cgi?id=1361965). RHEL7 will be used for a long time, and many people use a newer Python instead of the distro's Python, via pyenv or pythonz. I feel deprecating the C locale from Python 3.7 is a bit aggressive.
2. Many admins like the C locale. Locale settings can cause unintended side effects, so many admins dislike xx_XX.UTF-8 locales. For example (from https://fumiyas.github.io/2016/12/25/dislike.sh-advent-calendar.html ):
$ mkdir tmp
$ cd tmp
$ touch a b c x y z A B C X Y Z
$ LC_ALL=C /bin/bash --noprofile --norc -c 'echo [A-Z]'
A B C X Y Z
$ LC_ALL=en_US.UTF-8 /bin/bash --noprofile --norc -c 'echo [A-Z]'
A b B c C x X y Y z Z
3. Many other languages can use UTF-8 even in the C locale: node.js, Ruby, Rust and Go can use UTF-8 on Linux. People don't want to learn how to configure the locale properly only for Python.
msg284722 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-05 10:50
No, requesting a locale that doesn't exist doesn't error out, because we don't check the return code - it just keeps working the same way it does now (i.e. falling back to the legacy C locale). However, it would be entirely reasonable to put together a competing PEP proposing to eliminate the reliance on the problematic libc APIs, and instead use locale independent replacements. I'm simply not offering to implement or champion such a PEP myself, as I think ignoring the locale settings rather than coercing them to something more sensible will break integration with C/C++ GUI toolkits like Tcl/Tk, Gtk, and Qt, and it's reasonable for us to expect OS providers to offer at least one of C.UTF-8 or en_US.UTF-8 (see https://github.com/python/peps/issues/171 for more on that).
msg284725 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-05 10:54
The PEP already explains how other runtimes achieve UTF-8 and UTF-16-LE everywhere: by ignoring the C/C++ locale entirely. While this breaks integration with other C/C++ components, the developers of those languages and runtimes simply don't care, as they never supported integrating with those components in the first place. CPython doesn't have that luxury, since it is used extensively in locale aware desktop applications.
msg284729 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-01-05 11:11
Sorry, I still haven't had enough time to read PEP 538 carefully. But since the discussion already started on this issue, I will add my comments:
* I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8" locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".
* Setting the locale has an impact on all libraries running in the Python process. At this point, I'm not sure that it is what we want.
* I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the user locale uses a different encoding. I had the same concern with PEP 528 (Change Windows console encoding to UTF-8) and PEP 529 (Change Windows filesystem encoding to UTF-8) on Windows, but these PEPs were approved and merged into Python 3.6. My fear is obviously mojibake with the other applications using the other encoding, the locale encoding. Other applications are not impacted by setlocale() in the Python process.
* I proposed an opt-in option to force UTF-8: -X utf8 command line option and PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward compatibility issues. With an opt-in option, users are better prepared for mojibake issues.
* I dislike "Backporting to earlier Python 3 releases". In my experience, changes to how Python handles text (encodings, codecs, etc.) always have subtle issues, and users dislike getting backward incompatible changes in minor releases. Maybe if the option is an opt-in, the risk is lower and acceptable?
* I dislike that Fedora has such a downstream change. I would prefer to decide upstream how to slowly make UTF-8 a first-class citizen in Python. Otherwise, Fedora would behave differently from other Linux distributions, and it can be painful to write applications having the same behaviour on all Linux distributions. But I also understand that Fedora sometimes has to move faster than the slow CPython project :-) Fedora can also be seen as a toy to experiment with changes quickly, which helps to provide wide feedback upstream to make better decisions.
* Using the strict or surrogateescape error handler is a very important choice which has a wide impact. If we use utf8 by default (PEP 538), people will probably complain less if Python magically passes undecoded bytes through thanks to surrogateescape. If the option is an opt-in, strict may make sense. But surrogateescape is maybe still more "convenient". I don't know at this point.
Nick: it seems like you have a well defined plan, but I disagree on multiple points. I don't know if it's better to try to convince you to change your PEP, or to write a different PEP. I have planned to write such a "UTF-8" PEP since 2015, but I never started because the scope is so large that I fear all the tiny but annoying corner cases...
msg284736 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2017-01-05 11:36
While going for the full locale setting may be a good option, perhaps just focusing on the FS encoding for now is a better way forward (and also more in line with the ticket title). So essentially go for the PEP 529 approach on Unix as well (except that we use 'ascii' as fallback in legacy mode): https://www.python.org/dev/peps/pep-0529/ The PEP also includes a section on affected modules, which we could double check (even though the term "FS encoding" implies that only file system relevant APIs are touched by such a change, the encoding is used in several other places as well): https://www.python.org/dev/peps/pep-0529/#id14 For Windows, a couple of modules such as pwd and nis are not used, so those may need some extra attention.
msg284742 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-05 12:41
The trade-offs here are incredibly complex (and are mainly a matter of deciding whose code and configurations we want to break in 3.7+), so I think competing PEPs are going to be better than attempting to create a combined PEP that tries to cover all the options. That way each PEP can argue as strongly as it can for the respective author's preferred approach to tackling the default C locale problem, even if they point to a common background section in one of the PEPs (similar to the way PEPs 522 and 524 shared a common problem definition, even though they proposed different ways of handling it).
msg284747 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2017-01-05 14:44
On Jan 05, 2017, at 11:11 AM, STINNER Victor wrote:
>I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8"
>locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".
>
>I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the
>user locale uses a different encoding.
It's not just any different encoding, it's specifically C (implicitly, C.ASCII).
>I proposed an opt-in option to force UTF-8: -X utf8 command line option and
>PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward
>compatibility issues. With an opt-in option, users are better prepared for
>mojibake issues.
If this is true, then I would like a configuration option to default this on. As mentioned, Debian and Ubuntu already have C.UTF-8 and most environments (although not all, see my sbuild/schroot comment earlier) will at least be C.UTF-8. Perhaps it doesn't matter then, but what I really want is that for those few odd outliers (e.g. schroot), Python would act the same inside and outside those environments. I really don't want people to have to add that envar or switch (or even export LC_ALL) to get proper build behavior.
msg284764 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-01-05 17:18
> That way each PEP can argue as strongly as it can for the respective authors preferred approach to tackling the default C locale problem, even if they point to a common background section in one of the PEPs (similar to the way PEPs 522 and 524 shared a common problem definition, even though they proposed different ways of handling it). Ok, same players play again: as PEP 522/524 with Nick and me, I just wrote the PEP 540 "Add a new UTF-8 mode" and Nick wrote the PEP 538 :-D I started a thread to discuss the PEP on python-ideas: https://mail.python.org/pipermail/python-ideas/2017-January/044089.html IMHO the PEP 538 should discuss the usage of the surrogateescape error handler: see my second mail in the thread for the details. I proposed a change in my 3rd mail which would move my PEP closer to Nick's PEP 538: enable "automatically" the UTF-8 mode when the locale is POSIX.
msg284782 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-01-05 22:44
> Working with Docker I often end up with an environment where the locale isn't correctly set.

The locale encoding is controlled by 3 environment variables: LC_ALL, LC_CTYPE and LANG. https://www.python.org/dev/peps/pep-0540/#the-posix-locale-and-its-encoding

Can you please tell me if these variables are set and if yes, give me their value? I would like to know if it would be possible to change the behaviour of Python when the (LC_CTYPE) locale is POSIX (aka the famous "C" locale).
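The precedence among those three variables can be sketched in a few lines of Python. This is only an illustration: `effective_ctype_locale` is a hypothetical helper, it assumes the usual POSIX rule that LC_ALL overrides LC_CTYPE, which in turn overrides LANG (with empty values treated as unset), and it never asks the C library whether the named locale actually exists.

```python
import os


def effective_ctype_locale(environ=None):
    """Return (variable, value) for the setting that decides LC_CTYPE.

    Hypothetical helper: applies the precedence LC_ALL > LC_CTYPE > LANG.
    Empty values count as unset. Returns (None, "C") when nothing is set,
    which is the situation reported for a bare Docker container.
    """
    env = os.environ if environ is None else environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return name, value
    return None, "C"


# Report the setting that applies to the current process.
print(effective_ctype_locale())
```

Passing a plain dict instead of `os.environ` makes it easy to try out combinations without touching the real environment.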
msg284794 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-06 02:50
Docker containers don't have a locale set by default - the approach proposed in PEP 528 actually comes from the way I configure Docker images (which in turn comes from Armin Ronacher's recommendations in click for Python 3 locale handling).

In the Dockerfile for Fedora based containers I add:

    ENV LC_ALL=C.UTF-8
    ENV LANG=C.UTF-8

while in CentOS 7 based containers I add:

    ENV LC_ALL=en_US.UTF-8
    ENV LANG=en_US.UTF-8

And with those settings, Python 3 based containers just work (my laptop is running en_AU.UTF-8 locally)
msg284795 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-06 03:07
And by PEP 528, I actually mean PEP 538 :)
msg284799 - (view)	Author: Jan Niklas Hasse (Jan Niklas Hasse)	Date: 2017-01-06 07:51
> Can you please tell me if these variables are set and if yes, give me their value?

None of these variables are set (with `docker run -it fedora:25 /bin/bash`).
msg284882 - (view)	Author: (Sworddragon)	Date: 2017-01-07 02:33
On looking into PEP 538 and PEP 540 I think PEP 540 is the way to go. It provides an option for a stronger encapsulation for the de-/encoding logic between the interpreter and the developer. Instead of caring about error handling the developer has now to care about mojibake handling (for me and maybe others that is explicitly preferred but maybe this depends on each individual).

If I'm not wrong PEP 538 improves this for the output too but input handling will still suffer from the overall issue while PEP 540 does also solve this case. Also PEP 540 would not make the C locale and thus eventually some systems potentially unsupported (but it might be an acceptable trade-off if we should really go PEP 538).

Specific for PEP 540:

> The POSIX locale enables the UTF-8 mode

Non-strict I assume?

> UTF-8 /backslashreplace

Was/is the reason to use backslashreplace for sys.stderr to guarantee that the developer/user sees the error messages? Might it make sense to also use surrogateescape instead of backslashescape for sys.stderr in UTF-8 non-strict mode to be consistent here?
msg284884 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-01-07 02:53
Sworddragon added the comment:
> (for me and maybe others that is explicitly preferred but maybe this depends on each individual)

That's why the PEP 540 has options to enable or disable its UTF-8 mode(s).

> If I'm not wrong PEP 538 improves this for the output too but input handling will still suffer from the overall issue while PEP 540 does also solve this case.

The PEP 538 works fine if all inputs and outputs are encoded to UTF-8. I understand that it's a deliberate choice to fail on decoding/encoding error (to not use surrogateescape), but I can be wrong.

> Also PEP 540 would not make the C locale and thus eventually some systems potentially unsupported (but it might be an acceptable trade-off if we should really go PEP 538).

What do you mean by "make the C locale"?

> Specific for PEP 540:
>
>> The POSIX locale enables the UTF-8 mode
>
> Non-strict I assume?

Yes, non strict. I'm not sure of the name of each mode yet.

After having written the "Use Cases" section and especially the Mojibake column of results, I consider the option of renaming the "UTF-8 mode" to "YOLO mode".

>> UTF-8 /backslashreplace
>
> Was/is the reason to use backslashreplace for sys.stderr to guarantee that the developer/user sees the error messages?

Yes.

> Might it make sense to also use surrogateescape instead of backslashescape for sys.stderr in UTF-8 non-strict mode to be consistent here?

Using surrogateescape means that you pass through undecodable bytes from inputs to stderr which can cause various kinds of bad surprises.

stderr is used to log errors. Getting a new error when trying to log an error is kind of annoying.

Victor
msg284886 - (view)	Author: (Sworddragon)	Date: 2017-01-07 05:14
> What do you mean by "make the C locale"?

I was pointing to the Platform Support Changes of PEP 538.

> I'm not sure of the name of each mode yet.
>
> After having written the "Use Cases" section and especially the
> Mojibake column of results, I consider the option of renaming the
> "UTF-8 mode" to "YOLO mode".

Assumingly YOLO is meant to be negative: Things are whirling in my mind. Eventually you want to save your joker :>

> Using surrogateescape means that you pass through undecodable bytes
> from inputs to stderr which can cause various kinds of bad surprises.
>
> stderr is used to log errors. Getting a new error when trying to log
> an error is kind of annoying.

Hm, what bad surprise/error could appear that would not appear with backslashescape?
msg284887 - (view)	Author: Inada Naoki (methane) *	Date: 2017-01-07 05:24
>> stderr is used to log errors. Getting a new error when trying to log
>> an error is kind of annoying.
>
> Hm, what bad surprise/error could appear that would not appear with backslashescape?

$ cat badfilename.py
badfn = "こんにちは".encode('euc-jp').decode('utf-8', 'surrogateescape')
print("bad filename:", badfn)

$ PYTHONIOENCODING=utf-8:backslashreplace python3 badfilename.py
bad filename: \udca4\udcb3\udca4\udcf3\udca4ˤ\udcc1\udca4\udccf

$ PYTHONIOENCODING=utf-8:surrogateescape python3 badfilename.py
bad filename: ��ˤ��
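The difference between the two error handlers in that session can also be seen without a terminal, by encoding the same string by hand. The sketch below uses only the standard codecs; `is_valid_utf8` is a helper name introduced here for illustration.

```python
# The same string as in badfilename.py: EUC-JP bytes mis-decoded as UTF-8.
original = "こんにちは".encode("euc-jp")
badfn = original.decode("utf-8", "surrogateescape")

# backslashreplace turns each lone surrogate into an ASCII escape sequence,
# so the resulting bytes are valid UTF-8 that a terminal can display.
escaped = badfn.encode("utf-8", "backslashreplace")

# surrogateescape turns each lone surrogate back into the raw byte it came
# from, so the original (non-UTF-8) bytes are written out unchanged.
raw = badfn.encode("utf-8", "surrogateescape")


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(escaped)
print(raw == original)
print(is_valid_utf8(escaped), is_valid_utf8(raw))
```

So surrogateescape loses no data, but what reaches the terminal is not valid UTF-8, which is why a UTF-8 terminal shows replacement characters.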
msg284900 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-07 08:41
I just pushed an update to PEP 538 based on PEP 540 and the feedback in the linux-sig discussion: https://github.com/python/peps/commit/221099d8765125bbd798e869846b005bcca84b47

I'll be starting a thread for that on python-ideas shortly, but in the context of the discussion here:

* There are good reasons to go back to strict error handling by default on the standard streams when we're using UTF-8 as the default encoding rather than ASCII: https://www.python.org/dev/peps/pep-0538/#using-strict-error-handling-by-default
* The right overall answer might actually be to create a hybrid merger of the two PEPs, rather than seeing them as strictly competitors: https://www.python.org/dev/peps/pep-0538/#relationship-with-other-peps
msg284908 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-07 11:43
While the attached PEP 538 patches include their own tests, the uploaded pep538-check-click.sh script is the one I've been using to check that the changes have the desired effect of letting click "just work", even when the nominal locale is cleared, explicitly set to C, or explicitly set to POSIX.
msg284943 - (view)	Author: (Sworddragon)	Date: 2017-01-07 22:20
> $ cat badfilename.py
> badfn = "こんにちは".encode('euc-jp').decode('utf-8', 'surrogateescape')
> print("bad filename:", badfn)
>
> $ PYTHONIOENCODING=utf-8:backslashreplace python3 badfilename.py
> bad filename: \udca4\udcb3\udca4\udcf3\udca4ˤ\udcc1\udca4\udccf
>
> $ PYTHONIOENCODING=utf-8:surrogateescape python3 badfilename.py
> bad filename: ��ˤ��

The first example is still readable (but effectively for a user not so much) while the second example appears to be not readable anymore at all. But the second example is actually technically still readable and there is no data loss, isn't it? As in this case it would probably not speak against surrogateescape for sys.stderr in UTF-8 non-strict mode. Otherwise backslashescape might be indeed the better choice.

I have thought about this a bit more and in case we go PEP 538 with keeping strict errors more or less the old way there might be another solution that could improve the overall issue: print() could get an option to allow changing the error handler on demand (with 'strict' still being the default). Most things that I do output with print() are deterministic or optional and not important application data. Being able to print this information without caring for de-/encoding errors would mitigate this issue. In case application data is being printed where data loss is not desired exceptions can still be thrown.
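The idea of choosing the error handler per call can be approximated today with a small wrapper. The sketch below is not a proposed API: `print_with_errors` is a hypothetical helper that assumes the target is a text stream exposing its underlying binary `buffer`, and it bypasses the stream's own newline translation.

```python
import io
import sys


def print_with_errors(*args, errors="backslashreplace", sep=" ", end="\n", file=None):
    """Hypothetical helper: like print(), but with a per-call error handler.

    Requires a text stream that exposes its underlying binary `buffer`.
    """
    stream = sys.stdout if file is None else file
    text = sep.join(str(arg) for arg in args) + end
    data = text.encode(stream.encoding, errors)
    stream.flush()  # keep ordering with anything already written as text
    stream.buffer.write(data)
    stream.buffer.flush()


# Usage: an ASCII-only stream that would reject "é" under strict handling.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii", errors="strict")
print_with_errors("caf\u00e9", file=ascii_stream)
print(ascii_stream.buffer.getvalue())
```

Passing `errors="strict"` keeps the exception for output where data loss matters, which matches the split between optional and important output described above.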
msg284952 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-01-08 02:22
Uploaded one last version of the patch implementing the previous PEP 538 design. This refactors the test cases so they systematically cover 4 cases that we expect to be reported as "the C locale":

- LC_ALL, LC_CTYPE, and LANG all empty
- one of them set to "C", others empty
- one of them set to "POSIX", others empty
- one of them set to an unknown locale, others empty

The next version of the patch will update it to match the latest draft of the PEP (PYTHONCOERCECLOCALE, different message wording, etc)
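The four cases listed in this message expand into a small set of concrete environments. The following is a sketch of that enumeration, not the code from the patch: `c_locale_environments` is a hypothetical helper, and "invalid.ascii" is used as the unknown locale name because it appears as a nominal locale in a test log later in this issue.

```python
import itertools

LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")
# "invalid.ascii" stands in for an unknown locale name.
C_LIKE_VALUES = ("C", "POSIX", "invalid.ascii")


def c_locale_environments():
    """Yield one environment dict per combination (hypothetical helper)."""
    # Case 1: all three variables empty.
    yield {name: "" for name in LOCALE_VARS}
    # Cases 2-4: exactly one variable set, the others empty.
    for value, target in itertools.product(C_LIKE_VALUES, LOCALE_VARS):
        env = {name: "" for name in LOCALE_VARS}
        env[target] = value
        yield env


for env in c_locale_environments():
    print(env)
```

That gives one all-empty environment plus three values times three variables, so ten environments in total.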
msg285735 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2017-01-18 15:15
pep538_coerce_legacy_c_locale_v3.diff fixes issue 28997 on Android (api 21 and 24). This issue is raised because there is an inconsistency between Python on Android that considers the locale encoding to be always UTF-8 and GNU Readline that does not accept eight-bit characters when LANG is not set (on Android).

On Android, setlocale(CATEGORY, "") does not look for the locale environment variables (LANG, ...) but sets the 'C' locale instead, so the patch does not fully behave as expected and the 'Py_Initialize detected' warning is emitted.

Here is the output of an interactive session on Android:

root@generic_x86:/data/data/org.bitbucket.pyona # python
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Set `PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment when running Python directly.
Python 3.7.0a0 (default:0503024831ad+, Jan 18 2017, 11:34:53)
[GCC 4.2.1 Compatible Android Clang 3.8.256229 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, os
>>> os.environ['LANG']
'C.UTF-8'
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.setlocale(locale.LC_CTYPE)
'C'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'C.UTF-8'
>>> locale.setlocale(locale.LC_CTYPE)
'C.UTF-8'

The attached android_setlocale.patch fixes the following problems when applied after pep538_coerce_legacy_c_locale_v3.diff:

* No 'Py_Initialize detected' warning is emitted.
* locale.setlocale(locale.LC_CTYPE) returns now 'C.UTF-8'.
msg286001 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2017-01-22 10:05
> On Android, setlocale(CATEGORY, "") does not look for the locale environment variables (LANG, ...) but sets the 'C' locale instead FWIW the source code of setlocale() on bionic (Android libc) is at https://android.googlesource.com/platform/bionic/+/master/libc/bionic/locale.cpp#144
msg289002 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-03-05 07:12
An updated reference implementation has been pushed to the pep538-coerce-c-locale branch in my GitHub fork: https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale (That doesn't include Xavier's Android fixes yet, though)
msg289534 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-03-13 07:22
OK, the PEP 538 reference implementation has reached the point where I was willing to create a PR for it: https://github.com/python/cpython/pull/659 That PR/branch also includes the necessary changes to always force the C.UTF-8 locale on Android rather than defaulting to the C locale. I believe the only thing missing at this point is the configure.ac dance to ensure that PY_WARN_ON_C_LOCALE and PY_COERCE_C_LOCALE never get set on Mac OS X.
msg295121 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-04 10:00
The PEP 538 PR is mostly complete now, but I created https://bugs.python.org/issue30565 to track making a follow-up decision on whether or not we really want to emit a warning on successful implicit locale coercion. The pre-release What's New entry for PEP 538 will include a link to that issue to allow folks to provide feedback on their preferences.
msg295683 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-11 03:16
New changeset 6ea4186de32d65b1f1dc1533b6312b798d300466 by Nick Coghlan in branch 'master': bpo-28180: Implementation for PEP 538 (#659) https://github.com/python/cpython/commit/6ea4186de32d65b1f1dc1533b6312b798d300466
msg295688 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-11 04:55
And merged! Thanks to all involved in the process of getting this change through to implementation :)
msg295698 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-11 10:13
Tests fail on many buildbots.
msg295710 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-11 14:07
Ah, it would have been too easy for all the other *nix variants to be close enough to Fedora & Ubuntu for everything to work first time :)
msg295713 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-11 14:26
Initial look at the failures on the stable buildbots:

FreeBSD 10.x: if locale coercion succeeds, we then fail on get_codeset() (perhaps because that doesn't recognise LC_CTYPE=UTF-8?)

FreeBSD CURRENT: if locale coercion fails (due to no suitable locale), lots of error handling tests fail due to the unexpected warning message on stderr

Mac OS X Tiger: looks like the test expectations aren't right on Mac OS X (at least for Tiger). I've added the Mac OS X folks to the nosy list.

Ubuntu shared library build: loading the shared library fails in _testembed for the `test_forced_io_encoding` test case, which suggests a problem with the way that particular test is running the binary

Windows 8.1 refleak hunting: failure doesn't appear to be due to this change (multiprocessing test failures)

s390x RHEL 7: failure doesn't appear to be due to this change (multiprocessing test failures)
msg295722 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2017-06-11 16:13
The macOS failures are at least partially caused by test assumptions that aren't true on macOS: in particular the filesystem encoding defaults to UTF-8 on macOS (because HFS+ and the recent APFS filesystem store unicode data and not pure byte strings).
msg295871 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-13 09:02
> FreeBSD 10.x: if locale coercion succeeds, we then fail on get_codeset() (perhaps because that doesn't recognise LC_CTYPE=UTF-8?)

I created bpo-30647 to track this one.
msg295872 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-13 09:02
Ronald Oussoren:
> The macOS failures are at least partially caused by test assumptions that aren't true on macOS (...)

Nick is working on a fix for macOS: https://github.com/python/cpython/pull/2130
msg295875 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-13 09:17
It seems like this change:

     def test_forced_io_encoding(self):
         # Checks forced configuration of embedded interpreter IO streams
-        out, err = self.run_embedded_interpreter("forced_io_encoding")
-        if support.verbose:
+        env = {"PYTHONIOENCODING": "utf-8:surrogateescape"}
+        out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
(...)

Caused a failure on the "shared" buildbot (./configure --enable-shared):

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/877/steps/test/logs/stdio

======================================================================
FAIL: test_forced_io_encoding (test.test_capi.EmbeddingTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 484, in test_forced_io_encoding
    out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 392, in run_embedded_interpreter
    (p.returncode, err))
AssertionError: 127 != 0 : bad returncode 127, stderr is '/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Programs/_testembed: error while loading shared libraries: libpython3.7dm.so.1.0: cannot open shared object file: No such file or directory\n'
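One general point that may be relevant here: in the `subprocess` module, an `env=` mapping replaces the child's environment instead of extending it, so a dict holding only PYTHONIOENCODING drops variables such as LD_LIBRARY_PATH. Whether `run_embedded_interpreter` passes `env` through in that way, and whether that explains the missing shared library, is an assumption not confirmed in this message. A minimal sketch of layering one variable over the inherited environment, with `run_child` as a hypothetical helper:

```python
import os
import subprocess
import sys


def run_child(code, extra_env):
    """Run `code` in a child interpreter with extra_env layered over os.environ."""
    env = os.environ.copy()  # keep PATH, LD_LIBRARY_PATH and friends
    env.update(extra_env)
    proc = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return proc.stdout.decode("ascii").strip()


# Ask the child which encoding and error handler its stdout ended up with.
stdout_info = run_child(
    "import sys; print(sys.stdout.encoding + ':' + sys.stdout.errors)",
    {"PYTHONIOENCODING": "utf-8:surrogateescape"},
)
print(stdout_info)
```

The child is the running interpreter itself rather than `_testembed`, so this only illustrates the environment handling, not the embedding test.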
msg295885 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-13 09:49
New changeset eb52ac89929bb09b15c014ab8ff60eee685e86c7 by Victor Stinner in branch 'master': bpo-28180: Fix test_capi.test_forced_io_encoding() (#2155) https://github.com/python/cpython/commit/eb52ac89929bb09b15c014ab8ff60eee685e86c7
msg295913 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-13 12:49
New changeset 4563099d28e832aed22b85ce7e2a92236df03847 by Nick Coghlan in branch 'master': bpo-28180: assume UTF-8 for Mac OS X PEP 538 tests (GH-2130) https://github.com/python/cpython/commit/4563099d28e832aed22b85ce7e2a92236df03847
msg295914 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-13 12:54
I've added dependencies for PEP 538 induced testing problems that have been broken out into their own issues. I've also merged my attempt at fixing the tests on Mac OS X. Something that's included in that patch is an implicit skip of the "LANG=UTF-8" case when checking external locale configuration. I expected that to behave the same way as "LC_CTYPE=UTF-8", but instead it's behaving more like "LC_CTYPE=C".
msg296064 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-15 04:18
Ah, I finally understand Victor's comment on my initial attempt at fixing the tests on Mac OS X - the standard streams don't use the filesystem encoding, so they default to ASCII in the C locale, even on Mac OS X.
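The distinction between the filesystem encoding and the standard stream settings is easy to inspect from inside a running interpreter. The sketch below is illustrative only: `encoding_details` is a hypothetical helper, and its key names are borrowed from the test output quoted later in this issue.

```python
import sys


def encoding_details():
    """Report the filesystem encoding and each standard stream's settings."""
    details = {"fsencoding": sys.getfilesystemencoding()}
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        # Replaced or missing streams may lack these attributes.
        encoding = getattr(stream, "encoding", None)
        errors = getattr(stream, "errors", None)
        details[name + "_info"] = "{}:{}".format(encoding, errors)
    return details


print(encoding_details())
```

Running it under different LC_ALL, LC_CTYPE and LANG settings shows which of the four values follow the locale on a given platform.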
msg296075 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-15 09:11
New changeset 7926516ff95ed9c8345ed4c4c4910f44ffbd5949 by Nick Coghlan in branch 'master': bpo-28180: Standard stream & FS encoding differ on Mac OS X (GH-2208) https://github.com/python/cpython/commit/7926516ff95ed9c8345ed4c4c4910f44ffbd5949
msg296077 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2017-06-15 09:32
The latest commit should get the Mac OS X buildbot back to green, but I had to disable some test cases to do it - see issue 30672 for details. Issue 30565 is the one that covers silencing the locale coercion and locale compatibility warnings by default.
msg305850 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2017-11-08 14:26
PR 4334 added: fix the implementation of PEP 538 on Android. The current implementation of PEP 538 fixes issue 28997 without the locale coercion for Android added by PR 4334, see msg305848.
msg306108 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2017-11-12 11:46
New changeset 1588be66d7b0eeebc4614309cd0fc837ff52776a by xdegaye in branch 'master': bpo-28180: Fix the implementation of PEP 538 on Android (GH-4334) https://github.com/python/cpython/commit/1588be66d7b0eeebc4614309cd0fc837ff52776a
msg314627 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2018-03-29 01:23
Given that issue 32002 and issue 30672 track the known challenges in testing the expected locale coercion behaviour reliably, I'm going to go ahead and close this overall implementation issue (the feature is there, and works in a way we're happy with, we're just encountering some challenges clearly expressing those expectations as a regression test).
msg364740 - (view)	Author: Matej Cepl (mcepl) *	Date: 2020-03-21 12:47
I have tried to port this patch to Python 3.4 (still maintained by SUSE on SLE-12), but I have the hardest time to debug this. All affected tests end with errors like this:

[ 493s] ======================================================================
[ 493s] FAIL: test_test_PYTHONCOERCECLOCALE_not_set (test.test_c_locale_coercion.LocaleCoercionTests) (PYTHONCOERCECLOCALE=None, env_var='LC_CTYPE', nominal_locale='invalid.ascii')
[ 493s] ----------------------------------------------------------------------
[ 493s] Traceback (most recent call last):
[ 493s]   File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 326, in _check_c_locale_coercion
[ 493s]     coercion_expected)
[ 493s]   File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 219, in _check_child_encoding_details
[ 493s]     self.assertEqual(encoding_details, expected_details)
[ 493s] AssertionError: {'fse[79 chars]cii:strict', 'stderr_info': 'ascii:backslashre[45 chars]ict'} != {'fse[79 chars]cii:surrogateescape', 'stderr_info': 'ascii:ba[63 chars]ape'}
[ 493s] {'fsencoding': 'ascii',
[ 493s] 'lang': '',
[ 493s] 'lc_all': '',
[ 493s] 'lc_ctype': 'invalid.ascii',
[ 493s] 'stderr_info': 'ascii:backslashreplace',
[ 493s] - 'stdin_info': 'ascii:strict',
[ 493s] ? ^^ ^
[ 493s]
[ 493s] + 'stdin_info': 'ascii:surrogateescape',
[ 493s] ? ++++++ ^^^ ^^^
[ 493s]
[ 493s] - 'stdout_info': 'ascii:strict'}
[ 493s] ? ^^ ^
[ 493s]
[ 493s] + 'stdout_info': 'ascii:surrogateescape'}
[ 493s] ? ++++++ ^^^ ^^^

yes, it is always a conflict between strict and surrogateescape. I probably don’t have time to finish debugging this, so I am just leaving this for posterity.
msg364760 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-03-21 16:51
Python 3.4 is no longer supported upstream. Python 3 got tons of Unicode fixes between Python 3.4 and Python 3.8.
msg364767 - (view)	Author: Matej Cepl (mcepl) *	Date: 2020-03-21 18:06
> Python 3.4 is no longer supported upstream. Python 3 got tons of Unicode fixes between Python 3.4 and Python 3.8. Of course, I know that, but I just didn’t want to throw all my effort away, when I spent some hours on making it. And I guess, there may be somebody else who cares for 3.4 (ehm, RHEL-7 has 3.3, doesn’t it?).
msg364770 - (view)	Author: STINNER Victor (vstinner) *	Date: 2020-03-21 18:33
RHEL 7.7 and RHEL 8 provide Python 3.6. PEP 538 was implemented in Python 3.7. The PEP 538 feature was backported to the Python 3.6 shipped in RHEL 7.7 and RHEL 8.
msg364804 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2020-03-22 12:55
The test cases for locale coercion not triggering still assume that bpo-19977, using surrogateescape on the standard streams in the POSIX locale, has been implemented (since that was implemented in Python 3.5). Hence the various test cases complaining that they found "ascii:strict" (Py 3.4 behaviour without bpo-19977) where they expected "ascii:surrogateescape" (the Py 3.5+ behaviour with bpo-19977). To get a PEP 538 backport to work as intended on 3.4, you'd need to backport that earlier IO stream error handling change as well.
msg364810 - (view)	Author: Matej Cepl (mcepl) *	Date: 2020-03-22 16:25
Thank you very much for the hint. Do I have to include the patch for bpo-19977 only (that would be easy), or also all twelve PRs for bpo-29240 (that would probably break my will to do it)?

History
Date	User	Action	Args
2022-04-11 14:58:36	admin	set	github: 72367
2020-03-22 16:25:00	mcepl	set	messages: + msg364810
2020-03-22 12:55:15	ncoghlan	set	messages: + msg364804
2020-03-21 18:33:09	vstinner	set	messages: + msg364770
2020-03-21 18:06:02	mcepl	set	messages: + msg364767
2020-03-21 16:51:34	vstinner	set	messages: + msg364760
2020-03-21 12:47:16	mcepl	set	files: + pep538_coerce_legacy_c_locale.patch nosy: + mcepl messages: + msg364740
2018-03-29 01:23:38	ncoghlan	set	status: open -> closed messages: + msg314627 dependencies: - PEP 538: Unexpected locale behaviour on *BSD (including Mac OS X) resolution: fixed stage: patch review -> resolved
2017-11-12 11:46:04	xdegaye	set	messages: + msg306108
2017-11-08 14:26:02	xdegaye	set	messages: + msg305850
2017-11-08 11:09:55	xdegaye	set	stage: patch review pull_requests: + pull_request4289
2017-06-15 09:32:09	ncoghlan	set	dependencies: + PEP 538: silence locale coercion and compatibility warnings by default?, PEP 538: Unexpected locale behaviour on *BSD (including Mac OS X) messages: + msg296077 stage: resolved -> (no value)
2017-06-15 09:11:42	ncoghlan	set	messages: + msg296075
2017-06-15 04:42:06	ncoghlan	set	pull_requests: + pull_request2252
2017-06-15 04:18:39	ncoghlan	set	messages: + msg296064
2017-06-13 12:55:00	ncoghlan	set	dependencies: + Leak in test_c_locale_coercion, CODESET error on AMD64 FreeBSD 10.x Shared 3.x caused by the PEP 538 messages: + msg295914
2017-06-13 12:49:47	ncoghlan	set	messages: + msg295913
2017-06-13 09:49:47	vstinner	set	messages: + msg295885
2017-06-13 09:22:42	vstinner	set	pull_requests: + pull_request2205
2017-06-13 09:21:13	vstinner	set	title: sys.getfilesystemencoding() should default to utf-8 -> Implementation of the PEP 538: coerce C locale to C.utf-8
2017-06-13 09:17:54	vstinner	set	messages: + msg295875
2017-06-13 09:02:55	vstinner	set	messages: + msg295872
2017-06-13 09:02:04	vstinner	set	messages: + msg295871
2017-06-12 13:28:51	ncoghlan	set	pull_requests: + pull_request2184
2017-06-11 16:13:32	ronaldoussoren	set	messages: + msg295722
2017-06-11 14:26:03	ncoghlan	set	nosy: + ronaldoussoren, ned.deily messages: + msg295713
2017-06-11 14:07:20	ncoghlan	set	messages: + msg295710
2017-06-11 10:13:32	vstinner	set	status: closed -> open resolution: fixed -> (no value) messages: + msg295698
2017-06-11 04:55:59	ncoghlan	set	status: open -> closed resolution: fixed messages: + msg295688 stage: resolved
2017-06-11 03:16:17	ncoghlan	set	messages: + msg295683
2017-06-04 10:00:11	ncoghlan	set	messages: + msg295121
2017-03-13 07:22:30	ncoghlan	set	messages: + msg289534
2017-03-13 06:08:40	ncoghlan	set	pull_requests: + pull_request540
2017-03-05 07:12:56	ncoghlan	set	messages: + msg289002
2017-01-22 10:05:22	xdegaye	set	messages: + msg286001
2017-01-19 11:21:27	xdegaye	link	issue26865 dependencies
2017-01-18 15:16:00	xdegaye	set	files: + android_setlocale.patch nosy: + xdegaye messages: + msg285735
2017-01-08 02:22:40	ncoghlan	set	files: + pep538_coerce_legacy_c_locale_v3.diff messages: + msg284952
2017-01-07 22:20:07	Sworddragon	set	messages: + msg284943
2017-01-07 11:43:47	ncoghlan	set	files: + pep538-check-click.sh messages: + msg284908
2017-01-07 08:41:17	ncoghlan	set	messages: + msg284900
2017-01-07 05:24:57	methane	set	messages: + msg284887
2017-01-07 05:14:19	Sworddragon	set	messages: + msg284886
2017-01-07 02:53:55	vstinner	set	messages: + msg284884
2017-01-07 02:33:14	Sworddragon	set	messages: + msg284882
2017-01-06 07:51:04	Jan Niklas Hasse	set	messages: + msg284799
2017-01-06 03:07:56	ncoghlan	set	messages: + msg284795
2017-01-06 02:50:53	ncoghlan	set	messages: + msg284794
2017-01-05 22:44:13	vstinner	set	messages: + msg284782
2017-01-05 17:18:25	vstinner	set	messages: + msg284764
2017-01-05 14:44:31	barry	set	messages: + msg284747
2017-01-05 12:41:50	ncoghlan	set	messages: + msg284742
2017-01-05 11:36:20	lemburg	set	messages: + msg284736
2017-01-05 11:11:53	vstinner	set	messages: + msg284729
2017-01-05 10:54:48	ncoghlan	set	messages: + msg284725
2017-01-05 10:50:28	ncoghlan	set	messages: + msg284722
2017-01-05 10:10:46	methane	set	messages: + msg284720
2017-01-05 09:51:21	lemburg	set	messages: + msg284719
2017-01-05 09:42:56	methane	set	messages: + msg284718
2017-01-05 09:26:22	ncoghlan	set	messages: + msg284716
2017-01-05 03:32:16	methane	set	messages: + msg284697
2017-01-04 16:06:09	vstinner	set	messages: + msg284647
2017-01-04 14:46:13	ncoghlan	set	messages: + msg284641
2017-01-04 11:41:12	methane	set	messages: + msg284631
2017-01-04 08:29:24	methane	set	messages: + msg284621
2017-01-04 08:01:45	ncoghlan	set	messages: + msg284620
2017-01-04 01:02:24	methane	set	messages: + msg284605
2017-01-03 15:02:01	barry	set	nosy: + barry
2017-01-03 04:15:32	ncoghlan	set	files: + pep538_coerce_legacy_c_locale_v2.diff messages: + msg284537
2016-12-28 15:29:21	ncoghlan	set	messages: + msg284176
2016-12-28 12:23:31	Jan Niklas Hasse	set	messages: + msg284170
2016-12-28 02:45:48	ncoghlan	set	files: + pep538_coerce_legacy_c_locale.diff messages: + msg284150
2016-12-21 20:11:01	Sworddragon	set	nosy: + Sworddragon
2016-12-21 16:00:47	akira	set	nosy: + akira
2016-12-21 09:54:58	vstinner	set	messages: + msg283732
2016-12-18 07:20:38	ncoghlan	set	messages: + msg283543
2016-12-18 07:10:59	ncoghlan	set	files: + fedora-cpython-PYTHONALLOWCLOCALE.diff
2016-12-17 20:19:37	Jan Niklas Hasse	set	messages: + msg283515
2016-12-17 15:33:17	ncoghlan	set	messages: + msg283495
2016-12-17 10:15:25	lemburg	set	messages: + msg283482
2016-12-17 07:56:20	ncoghlan	set	messages: + msg283471
2016-12-17 07:46:52	ncoghlan	set	assignee: ncoghlan messages: + msg283469
2016-12-16 17:13:58	yan12125	set	nosy: + yan12125
2016-12-16 15:15:34	vstinner	set	messages: + msg283409
2016-12-16 15:12:22	vstinner	set	messages: + msg283408
2016-12-15 06:15:51	ncoghlan	set	files: + fedora-cpython-force-c-utf-8.diff keywords: + patch messages: + msg283244
2016-12-12 11:49:23	ncoghlan	set	messages: + msg282984
2016-12-12 10:26:47	methane	set	messages: + msg282978
2016-12-12 10:17:21	lemburg	set	nosy: + lemburg messages: + msg282977
2016-12-12 08:45:01	Jan Niklas Hasse	set	messages: + msg282972
2016-12-12 08:10:30	ncoghlan	set	messages: + msg282971
2016-12-12 08:03:12	Jan Niklas Hasse	set	messages: + msg282970
2016-12-12 05:29:16	ncoghlan	set	messages: + msg282965
2016-12-12 05:26:55	ncoghlan	set	nosy: + ncoghlan messages: + msg282964
2016-09-23 12:59:13	methane	set	nosy: + methane messages: + msg277274
2016-09-23 12:47:00	Jan Niklas Hasse	set	messages: + msg277273
2016-09-16 17:18:03	vstinner	set	messages: + msg276729
2016-09-16 14:46:05	r.david.murray	set	versions: + Python 3.7, - Python 3.5 nosy: + r.david.murray messages: + msg276722 stage: resolved -> (no value)
2016-09-16 13:09:34	Jan Niklas Hasse	set	messages: + msg276709
2016-09-16 13:02:20	vstinner	set	status: closed -> open superseder: Change sys.getfilesystemencoding() on Windows to UTF-8 -> resolution: duplicate -> (no value) messages: + msg276707
2016-09-16 11:22:16	abarry	set	status: open -> closed superseder: Change sys.getfilesystemencoding() on Windows to UTF-8 nosy: + abarry messages: + msg276694 resolution: duplicate stage: resolved
2016-09-16 11:17:02	Jan Niklas Hasse	create