
classification
Title: Implementation of the PEP 538: coerce C locale to C.utf-8
Type: behavior Stage: resolved
Components: Unicode Versions: Python 3.7
process
Status: closed Resolution: fixed
Dependencies: 30565 30635 30647 Superseder:
Assigned To: ncoghlan Nosy List: Jan Niklas Hasse, Sworddragon, abarry, akira, barry, ezio.melotti, lemburg, mcepl, methane, ncoghlan, ned.deily, r.david.murray, ronaldoussoren, vstinner, xdegaye, yan12125
Priority: normal Keywords: patch

Created on 2016-09-16 11:17 by Jan Niklas Hasse, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
fedora-cpython-force-c-utf-8.diff ncoghlan, 2016-12-15 06:15 Downstream patch currently proposed for Fedora 26 review
fedora-cpython-PYTHONALLOWCLOCALE.diff ncoghlan, 2016-12-18 07:10 Draft Fedora 26 patch as at 2016-12-18 review
pep538_coerce_legacy_c_locale.diff ncoghlan, 2016-12-28 02:45 Initial patch for PEP 538 (targeting 3.7) review
pep538_coerce_legacy_c_locale_v2.diff ncoghlan, 2017-01-03 04:15 Add test cases for handling of unknown locales review
pep538-check-click.sh ncoghlan, 2017-01-07 11:43 Utility script to check click's behaviour in a PEP 538 patched CPython
pep538_coerce_legacy_c_locale_v3.diff ncoghlan, 2017-01-08 02:22 Refactor PEP 538 test cases to cover no locale setting, C locale, POSIX locale and unknown locale review
android_setlocale.patch xdegaye, 2017-01-18 15:15
pep538_coerce_legacy_c_locale.patch mcepl, 2020-03-21 12:47 Unfinished attempt to port this patch to Python 3.4
Pull Requests
URL Status Linked Edit
PR 659 merged ncoghlan, 2017-03-13 06:08
PR 2130 merged ncoghlan, 2017-06-12 13:28
PR 2155 merged vstinner, 2017-06-13 09:22
PR 2208 merged ncoghlan, 2017-06-15 04:42
PR 4334 merged xdegaye, 2017-11-08 11:09
Messages (89)
msg276693 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2016-09-16 11:17
Working with Docker I often end up with an environment where the locale isn't correctly set. In these cases it would be great if sys.getfilesystemencoding() could default to 'utf-8' instead of 'ascii', as it's the encoding of the future and ascii is a subset of it anyway.

Related: http://bugs.python.org/issue19846
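
To check what a given environment (for example a fresh Docker container) actually reports, a minimal sketch along these lines may help. The helper name `report_encodings` is made up for illustration; the calls it wraps are standard library functions.

```python
import locale
import sys


def report_encodings():
    """Collect the text encodings this interpreter is currently using."""
    return {
        # Encoding used for file names, argv and environment variables.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding reported by the locale machinery, without calling setlocale().
        "preferred": locale.getpreferredencoding(False),
        # Encoding of standard output (None if stdout has been replaced).
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

The values printed depend on the Python version and on the locale environment variables, which is exactly the variability this report is about.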
msg276694 - (view) Author: Anilyka Barry (abarry) * (Python triager) Date: 2016-09-16 11:22
This is a duplicate of issue27781.
msg276707 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-16 13:02
> This is a duplicate of issue27781.

issue27781 is specific to Windows. I'm not sure that it's the case in this issue. So I reopen the issue.

@Jan Niklas Hasse: What is your OS?

I proposed to add "-X utf8" command line option for UNIX to force utf8 encoding. Would it work for you?
msg276709 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2016-09-16 13:09
Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with

#!/usr/bin/env python3

would it?
msg276722 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-09-16 14:46
I thought we "fixed" this by using surrogate escape when the locale was ASCII? We have certainly discussed changing the default on posix and so far have decided not to (someday that will change...is this someday already?)
msg276729 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-09-16 17:18
> is this someday already?)

Not yet :-)
msg277273 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2016-09-23 12:47
Why not?
msg277274 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-09-23 12:59
I want a locale-free Python which behaves as if it were in the C.UTF-8 locale.
(stdio encoding, preferred encoding, weekday in _strptime._strptime,
and maybe more)

But Python 3.6 is already in feature freeze >_<;;
msg282964 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-12 05:26
I think we're genuinely getting to the point now where the majority of "LANG=C" cases are misconfigurations rather than intended behaviour. We're also to the point where:

- on Mac OS X, binary system interfaces have been handled as UTF-8 by default since 3.0
- on Windows, as of 3.6, the OS native binary system interfaces are now bypassed entirely in favour of transcoding from UTF-8 to UTF-16-LE 

So I think for Python 3.7 it makes sense to do the following on other *nix systems:

- very early in CPython startup (even before argument processing), if the detected locale is "C", force it to "C.UTF-8" if possible, and print a warning either way
- add a PYTHONKEEPASCIILOCALE environment variable to turn that behaviour off

I do think we actually want to *change* the C level locale in the process though, as otherwise we can expect to see weird interactions where CPython and extension modules disagree about the default text encoding.
msg282965 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-12 05:29
Note also that if we say we're going to do this for 3.7, *and* go ahead and implement it, then distros may be more inclined to incorporate the same behavioural changes into distro-provided releases of 3.6, providing real world testing of the concept before we make it the default behaviour.
msg282970 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2016-12-12 08:03
Actually in a new Docker container, the LANG variable isn't set at all. Defaulting to UTF-8 in that case should be easier to reason about, shouldn't it?
msg282971 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-12 08:10
From CPython's point of view, glibc behaves the same way (i.e. reporting `ascii` as the preferred encoding for operating system interfaces) regardless of whether the cause is the locale not being set at all, or due to it being explicitly set to the legacy POSIX locale via `LANG=C`.
msg282972 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2016-12-12 08:45
https://sourceware.org/glibc/wiki/Proposals/C.UTF-8#Defaults mentions that C.UTF-8 should be glibc's default.

This bug report also mentions Python: https://sourceware.org/bugzilla/show_bug.cgi?id=17318
It hasn't been fixed yet, though :/
msg282977 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-12-12 10:17
If we just restrict this to the file system encoding (and not the whole LANG setting), how about:

 * default the file system encoding to 'utf-8' and use the surrogate escape handler as default error handler
 * add a PYTHONFSENCODING env var to set the file system encoding to something else (*)

(*) I believe we discussed this at some point already, but don't remember the outcome.

Regarding the questions of defaulting to LANG=C.UTF-8: I think this needs some more thought, since it would also affect many C locale aware functions. To make this work, Python would have to call setlocale() early on in the startup phase to adjust the C lib accordingly.
msg282978 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2016-12-12 10:26
Sorry for the confusion.
I didn't mean defaulting to LANG=C.UTF-8.

I meant using UTF-8 as the default fsencoding and stdio encoding regardless of locale,
with locale.getpreferredencoding() returning 'utf-8' when LC_CTYPE is ascii.
msg282984 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-12 11:49
The challenge that arises in being selective about this is that "sys.getfilesystemencoding()" is actually a misnomer, and some of the things we use it for (like decoding command line arguments and environment variables) necessarily happen *really* early in the interpreter bootstrapping process. The bugs that arise from being internally inconsistent are then even harder to debug than those that arise from believing the OS when it says the right encoding to use is ASCII - the latter at least don't tend to be subtle, and are amenable to being resolved via "LC_ALL=C.UTF-8" and "LANG=C.UTF-8".

I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up.

For Fedora 26, I'm going to explore the feasibility of patching our system 3.6 installation such that the python3 command itself (rather than the shared library) checks for "LC_CTYPE=C" as almost the first thing it does, and forcibly sets LANG and LC_ALL to C.UTF-8 if it gets an answer it doesn't like. If we're able to do that successfully in the more constrained environment of a specific recent Fedora release, then I think it will bode well for doing something similar by default in CPython 3.7
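
The point about `sys.getfilesystemencoding()` covering more than file names can be seen through `os.fsdecode()` and `os.fsencode()`, which apply the same codec and error handler. The following is a small sketch that assumes a POSIX system; the byte string is a hypothetical file name chosen for illustration.

```python
import os

# Hypothetical file name: valid UTF-8 for "café" followed by a stray 0xFF byte.
raw_name = b"caf\xc3\xa9-\xff"

# os.fsdecode() uses the filesystem encoding and its error handler, the same
# pair CPython applies to command line arguments and environment variables.
text_name = os.fsdecode(raw_name)

# On POSIX the error handler is surrogateescape, so encoding the text again
# gives back the original bytes.
round_tripped = os.fsencode(text_name)
print(round_tripped == raw_name)
```

The round trip only stays lossless as long as the same encoding is used in both directions, which is why changing it part-way through startup is so hard to get right.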
msg283244 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-15 06:15
Downstream Fedora issue proposing the above idea for F26: https://bugzilla.redhat.com/show_bug.cgi?id=1404918

I've also attached the patch from that issue here.
msg283408 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-16 15:12
Victor>> I proposed to add "-X utf8" command line option for UNIX to force utf8 encoding. Would it work for you?

Jan Niklas Hasse> Unfortunately no, as this would mean I'll have to change all my python invocations in my scripts and it wouldn't work for executable files with "#!/usr/bin/env python3" would it?

Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.

Use your favorite method to define the env var "system wide" in your docker containers.

Note: Technically, I'm not sure that it's possible to support -E option with PYTHONUTF8, since -E comes from the command line, and we first need to decode command line arguments with an encoding to parse these options.... Chicken-and-egg issue ;-)
msg283409 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-16 15:15
> I believe Victor put quite a bit of time into trying to get more selective approaches to work reliably and eventually gave up.

Yeah, it just doesn't work to use more than one encoding per process. You should use the same encoding for the whole lifetime of a process.

If you decode early data from an encoding A and later encode it back to encoding B, you get mojibake. The problem is simple.

Using more than one encoding per process means starting to make assumptions on how data is used. For example, consider that environment variables use the encoding A, but filenames should use the encoding B. But what if an environment variable contains a filename? Similar issues for command line arguments, subprocess pipes, standard streams (sys.std*), etc.
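
The mojibake described above can be reproduced in a few lines. This is only an illustration with UTF-8 standing in for "encoding A" and Latin-1 for "encoding B"; the sample string is arbitrary.

```python
original = "café"

# Bytes as written by a program that uses encoding A (UTF-8 here).
data = original.encode("utf-8")

# A reader that assumes encoding B (Latin-1) decodes without any error...
misread = data.decode("latin-1")

# ...and writing the result back as UTF-8 bakes the damage into the bytes.
rewritten = misread.encode("utf-8")

# ascii() keeps the output printable whatever the terminal encoding is.
print(ascii(misread))       # the text is now "caf" + U+00C3 + U+00A9
print(rewritten != data)    # prints: True
```

No exception is raised at any step, which is what makes this class of bug easy to miss.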
msg283469 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-17 07:46
We've been discussing this further downstream in the Fedora Python SIG, and we have a draft approach that we're pretty sure will work for us (based in turn on the approach Armin Ronacher came up with for click), and we think it should work for other distros as well (as long as they already ship the C.UTF-8 locale, and if they don't, they should fix that limitation anyway).

So I'm assigning this to myself as I think the next step will be to write a PEP that both proposes the specific idea as the default behaviour in 3.7, and also encourages distros to opt-in to trialling it as a downstream patch for 3.6.
msg283471 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-17 07:56
Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is *especially* a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation).

So the approach I'm proposing is to implement a C->C.UTF-8 locale override in the *actual python CLI executable*, and then in the dynamically linked library we only emit a warning if we detect the C locale, we don't actually do anything to change it.
msg283482 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2016-12-17 10:15
On 17.12.2016 08:56, Nick Coghlan wrote:
> 
> Making an explicit note of this so I remember to mention it in the draft PEP: one of the biggest problems that arises in any attempt at a Python-only solution to overriding the locale is that we can end up disagreeing with C/C++ extensions, and this is *especially* a problem when sharing a process with GUI frameworks like Tcl/Tk, Qt, and GTK (since they tend to read the process-wide settings, rather than querying anything that CPython configures during normal operation).

Another use case to consider is embedding the Python
interpreter in another application. In such situations,
the C locale will usually already be set by the main
application and it may conflict with the LANG or other
locale env var settings, since the user may have chosen
to use a different locale in the context of the application.
msg283495 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-17 15:33
On 17 December 2016 at 20:15, Marc-Andre Lemburg <report@bugs.python.org>
wrote:

> Another use case to consider is embedding the Python
> interpreter in another application. In such situations,
> the C locale will usually already be set by the main
> application and it may conflict with the LANG or other
> locale env var settings, since the user may have chosen
> to use a different locale in the context of the application.
>

Aye, that's the origin of the split proposal to only emit a warning in the
shared library (since CPython might only be a piece of a larger
application), but implement actual locale coercion (by overriding LANG and
LC_ALL in the process environment) in the command line app's main()
function (as in that case we know CPython *is* the application).

The hard part of writing the PEP isn't really going to be explaining the
proposal itself (I expect it to be around a 20 line patch to the C code) -
it's going to be explaining why all the other possibilities we've
considered over the years don't work, and why we (as in the Fedora Python
SIG) think this one actually stands a chance of working properly :)
msg283515 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2016-12-17 20:19
> Usually, when a new option is added to Python, we add a command line option (-X utf8) but also an environment variable: I propose PYTHONUTF8=1.
>
> Use your favorite method to define the env var "system wide" in your docker containers.

This doesn't help me, as I already set LANG to C.utf-8.

I'm rather thinking about new people trying out Python in Docker who don't know about this.

Furthermore I think that UTF-8 is the future and the use of ASCII should be discouraged.
msg283543 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-18 07:20
For folks not following the Fedora BZ issue directly, I've also attached the latest draft downstream patch here, which gives the following behaviour:

==========================

$ ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8

$ LANG=C.UTF-8 ./python -c "import sys; print(sys.getfilesystemencoding())"
utf-8

$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this behaviour).
utf-8

$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, but PYTHONALLOWCLOCALE is set. Some libraries, applications, and operating system interfaces may not work correctly.
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Use `PYTHONALLOWCLOCALE=1 LC_CTYPE=C python3` to configure a similar environment when running Python directly.
ascii
==========================

(The double warning in the last example is likely to go away by skipping the CLI level warning in that case)

The Python tests checking for the expected behaviour are significantly longer than the C level changes needed to implement it :)
msg283732 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-12-21 09:54
Previous related work:

changeset:   89836:bc06f67234d0
user:        Victor Stinner <victor.stinner@gmail.com>
date:        Tue Mar 18 01:18:21 2014 +0100
files:       Doc/whatsnew/3.5.rst Lib/test/test_sys.py Misc/NEWS Python/pythonru
description:
Issue #19977: When the ``LC_TYPE`` locale is the POSIX locale (``C`` locale),
:py:data:`sys.stdin` and :py:data:`sys.stdout` are now using the
``surrogateescape`` error handler, instead of the ``strict`` error handler.
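
The difference between the two error handlers named in that changeset can be shown directly on a byte string. This is a minimal sketch using the ASCII codec; the sample bytes are arbitrary.

```python
raw = b"report-\xe9.txt"   # 0xE9 is not valid ASCII

# The strict handler rejects the byte outright.
try:
    raw.decode("ascii", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps the undecodable byte to a lone surrogate (U+DCE9)...
text = raw.decode("ascii", errors="surrogateescape")

# ...which encodes back to the original byte, so no data is lost.
restored = text.encode("ascii", errors="surrogateescape")

print(strict_failed, ascii(text), restored == raw)
```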
msg284150 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-28 02:45
I've now written this up as a PEP: https://github.com/python/peps/blob/master/pep-0538.txt

The latest attached patch implements the specific design proposed in the PEP. Relative to the last Fedora specific patch, this tweaks the warning message wording slightly, and only emits the library level warning when PYTHONALLOWCLOCALE is set:

======================
$ LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
utf-8


======================
$ PYTHONALLOWCLOCALE=1 LANG=C ./python -c "import sys; print(sys.getfilesystemencoding())"
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Set `PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment when running Python directly.
ascii
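
The decision logic behind the two transcripts above can be modelled in Python as follows. This is a sketch of the draft behaviour as described in this message, not the C implementation: the function and parameter names are hypothetical, the messages are paraphrased, and treating any non-empty PYTHONALLOWCLOCALE value as "set" is an assumption.

```python
def coerce_c_locale(env, detected_ctype):
    """Return (new_env, message) modelling the draft behaviour quoted above.

    `detected_ctype` stands for the locale name the C library reported for
    LC_CTYPE; both names here are illustrative, not CPython API.
    """
    new_env = dict(env)
    if detected_ctype != "C":
        return new_env, None  # a non-C locale is left alone
    if env.get("PYTHONALLOWCLOCALE"):
        # Opt-out requested: keep the C locale and only warn.
        return new_env, "C locale kept; Unicode compatibility is limited"
    # Override the process environment so that C extensions and child
    # processes see the same locale as the interpreter.
    new_env["LC_ALL"] = "C.UTF-8"
    new_env["LANG"] = "C.UTF-8"
    return new_env, "LC_CTYPE=C detected; LC_ALL and LANG forced to C.UTF-8"


coerced_env, message = coerce_c_locale({"LANG": "C"}, "C")
print(coerced_env, message)
```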
msg284170 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2016-12-28 12:23
The only important case for me: what happens when LANG is unset?
msg284176 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2016-12-28 15:29
If nothing is configured (i.e. none of LC_ALL, LC_CTYPE or LANG are set in the environment), then C reports the locale as "C". It's probably worthwhile for me to add a Background section to the PEP that explains the behaviour of ``setlocale`` at the C level, as that's the source of the majority of the problems, as well as the key mechanism used to implement the locale coercion.
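
The lookup described above can be written out as a small pure function. This is a sketch of the POSIX precedence rules (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset); the helper name is hypothetical, and the "C" fallback reflects the behaviour described in this message rather than a guarantee for every C library.

```python
def effective_ctype(env):
    """Name of the locale that would be requested for LC_CTYPE.

    Lookup order: LC_ALL, then LC_CTYPE, then LANG. Empty values count as
    unset, and "C" is used when nothing is configured.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


print(effective_ctype({}))                                    # prints: C
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_ALL": "C"}))  # prints: C
```

Both an empty environment and an explicit `LC_ALL=C` end up in the same place, which matches the observation that CPython cannot tell the two cases apart.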
msg284537 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-03 04:15
Updated patch adds some tests showing that this change should also help with cases where SSH environment forwarding results in an unknown locale being requested in the server environment.
msg284605 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-04 01:02
I read PEP 538, but I can't understand why just using UTF-8 when the locale is C, as on macOS, is a bad idea.
msg284620 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-04 08:01
On Mac OS X, the XCode libc already ignores the locale settings and just uses UTF-8 as the default text encoding, so the hardcoding in CPython aligns with that behaviour.

That isn't the case on other *nix systems - there, we need CPython to be consistent with the configured C/C++ locale, *and* we need it to be using something other than ASCII as the default encoding.

Answer: coerce the default locale from C to C.UTF-8 (if available), or to en_US.UTF-8 (for older distros that don't provide C.UTF-8). (The latter aspect isn't in the PEP yet, it's an improvement that came up in the linux-sig discussions: https://github.com/python/peps/issues/171 )
msg284621 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-04 08:29
> That isn't the case on other *nix systems - there, we need CPython to be consistent with the configured C/C++ locale, *and* we need it to be using something other than ASCII as the default encoding.

Isn't using UTF-8 as filesystem encoding and stdin/stdout encoding consistent with C or POSIX locale?

Don't "modern" programming environments (Rust, Go, node.js) use UTF-8 even if locale is C or POSIX?
msg284631 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-04 11:41
I'm sorry.
I should search the old discussions about why we can't simply use utf-8
for fsencoding in the C locale, instead of asking here.
msg284641 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-04 14:46
The default encoding in the C/POSIX locale is ASCII (which is the entire source of the problem).

The initial version of the PEP I uploaded didn't explain that background, but I added a section about it in the update earlier this week: https://www.python.org/dev/peps/pep-0538/#background
msg284647 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-04 16:06
> The default encoding in the C/POSIX locale is ASCII (which is the entire source of the problem).

The reality is more complex than that :-) It depends on the OS.

Some OSes use Latin1 for the POSIX locale. Some OSes announce
Latin1 for the POSIX locale, but use ASCII in practice :-) On these
lying OSes, Python decodes bytes 0x80..0xff using mbstowcs() to check if
we get ASCII or Latin1: see the check_force_ascii() function.

/* Workaround FreeBSD and OpenIndiana locale encoding issue with the C locale.
   On these operating systems, nl_langinfo(CODESET) announces an alias of the
   ASCII encoding, whereas mbstowcs() and wcstombs() functions use the
   ISO-8859-1 encoding. The problem is that os.fsencode() and os.fsdecode() use
   locale.getpreferredencoding() codec. For example, if command line arguments
   are decoded by mbstowcs() and encoded back by os.fsencode(), we get a
   UnicodeEncodeError instead of retrieving the original byte string.

   The workaround is enabled if setlocale(LC_CTYPE, NULL) returns "C",
   nl_langinfo(CODESET) announces "ascii" (or an alias to ASCII), and at least
   one byte in range 0x80-0xff can be decoded from the locale encoding. The
   workaround is also enabled on error, for example if getting the locale
   failed.

    (...) */
msg284697 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-05 03:32
On Linux, I think most people want UTF-8:surrogateescape by default, without fighting against locale and environment variables.

There is already an `#if defined(__APPLE__) || defined(__ANDROID__)` path for it.
How about adding a configure option to use the same logic? (Say `--with-encoding=(locale|utf-8)`, with the preferred encoding changed in the same way.)

It may help the many people who build Python themselves without the root privileges needed to generate the C.UTF-8 locale.
msg284716 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-05 09:26
Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly. We can do things differently on Mac OS X and iOS because Apple ensure that *C* behaves differently on Mac OS X and iOS (and apparently Google do something similar for Android, so I'll update the PEP to mention that as well).
msg284718 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-05 09:42
> Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly. 

What I propose is not to use mbstowcs, as is done for __ANDROID__:

wchar_t*
Py_DecodeLocale(const char* arg, size_t *size)
{
#if defined(__APPLE__) || defined(__ANDROID__)
    wchar_t *wstr;
    wstr = _Py_DecodeUTF8_surrogateescape(arg, strlen(arg));


On Linux, command line arguments and file paths are just byte sequences.
So using UTF-8:surrogateescape from startup onwards should work fine.

Am I wrong?
msg284719 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2017-01-05 09:51
On 05.01.2017 10:26, Nick Coghlan wrote:
> 
> Anything purely on the Python side of things doesn't work in a traditional C environment - CPython relies on the C lib to do conversions during startup, so we need the C locale to be set correctly. We can do things differently on Mac OS X and iOS because Apple ensure that *C* behaves differently on Mac OS X and iOS (and apparently Google do something similar for Android, so I'll update the PEP to mention that as well).

I believe INADA-san (hope that's the right way to address him)
raised a good point though: what if a system doesn't come with
the C.UTF-8 locale set up?

The C lib would then error out when trying to use setlocale()
on such an environment.

Now, Python's main() function doesn't look at any such errors
(and neither do the other places which use it such as frozenmain.c
and readline.c), so it wouldn't even notice.

The setlocale() man page doesn't mention how such a failure would
affect the current locale settings. My guess is that the locale
remains set to what it was before, which in case of a fresh C
application start is the "C" locale.

So in the implementation of the PEP, there should be a test
to see whether "C.UTF-8" does result in a successful
call to setlocale(). If it doesn't, there would have to be
some work-around to still make Python's FS encoding happy
while leaving the C lib locale set at "C".
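
The kind of availability test suggested above can be tried from Python, where a failed `setlocale()` call surfaces as `locale.Error` (at the C level the function returns NULL and leaves the locale unchanged). The following is a minimal sketch; the helper name is hypothetical, and the candidate list simply reuses the two locale names mentioned in this thread.

```python
import locale


def first_available_locale(candidates):
    """Return the first locale name setlocale() accepts, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this locale is not installed here
            return name
        return None
    finally:
        # Restore the original setting so the probe has no lasting effect.
        locale.setlocale(locale.LC_CTYPE, saved)


print(first_available_locale(["C.UTF-8", "en_US.UTF-8"]))
```

The result depends on which locales the system ships, so the printed value will differ between distributions.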
msg284720 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-05 10:10
Here is why I want to add a configure option to ignore the locale:


1. C.UTF-8 is not supported by RHEL7 (https://bugzilla.redhat.com/show_bug.cgi?id=1361965)

RHEL7 will be used for a long time.
And many people use a newer Python instead of the distro's Python, via pyenv or pythonz.
I feel deprecating the C locale from Python 3.7 is a bit aggressive.


2. Many admins like the C locale.

Locale settings can cause unintended side effects, so many admins dislike xx_XX.UTF-8 locales.
For example (from https://fumiyas.github.io/2016/12/25/dislike.sh-advent-calendar.html ):

$ mkdir tmp
$ cd tmp
$ touch a b c x y z A B C X Y Z
$ LC_ALL=C /bin/bash --noprofile --norc -c 'echo [A-Z]'
A B C X Y Z
$ LC_ALL=en_US.UTF-8 /bin/bash --noprofile --norc -c 'echo [A-Z]'
A b B c C x X y Y z Z


3. Many other languages can use UTF-8 even in the C locale

node.js, Ruby, Rust, and Go can use UTF-8 on Linux.
People don't want to learn how to configure the locale properly only for Python.
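
The `[A-Z]` glob result in point 2 comes from the range being evaluated in collation order rather than code point order. The sketch below emulates both orderings in pure Python. Note that `dictionary_key` is a simplified stand-in for en_US collation written for this illustration, not the real locale tables.

```python
names = list("abcxyzABCXYZ")


def in_range(name, low, high, key):
    """True if `name` falls between `low` and `high` under the given ordering."""
    return key(low) <= key(name) <= key(high)


def codepoint_key(name):
    # C locale: plain code point order, so all of A-Z sorts before a-z.
    return name


def dictionary_key(name):
    # Simplified stand-in for en_US collation: letters interleave as aAbB...
    return (name.lower(), name.isupper())


c_matches = sorted(
    (n for n in names if in_range(n, "A", "Z", codepoint_key)),
    key=codepoint_key,
)
en_matches = sorted(
    (n for n in names if in_range(n, "A", "Z", dictionary_key)),
    key=dictionary_key,
)

print(" ".join(c_matches))    # prints: A B C X Y Z
print(" ".join(en_matches))   # prints: A b B c C x X y Y z Z
```

Under the interleaved ordering, "a" sorts just before "A" and so falls outside the range, while every other lowercase letter falls inside it.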
msg284722 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-05 10:50
No, requesting a locale that doesn't exist doesn't error out, because we don't check the return code - it just keeps working the same way it does now (i.e. falling back to the legacy C locale).

However, it would be entirely reasonable to put together a competing PEP proposing to eliminate the reliance on the problematic libc APIs, and instead use locale independent replacements. I'm simply not offering to implement or champion such a PEP myself, as I think ignoring the locale settings rather than coercing them to something more sensible will break integration with C/C++ GUI toolkits like Tcl/Tk, Gtk, and Qt, and it's reasonable for us to expect OS providers to offer at least one of C.UTF-8 or en_US.UTF-8 (see https://github.com/python/peps/issues/171 for more on that).
msg284725 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-05 10:54
The PEP already explains how other runtimes achieve UTF-8 and UTF-16-LE everywhere: by ignoring the C/C++ locale entirely. While this breaks integration with other C/C++ components, the developers of those languages and runtimes simply don't care, as they never supported integrating with those components in the first place.

CPython doesn't have that luxury, since it is used extensively in locale aware desktop applications.
msg284729 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-05 11:11
Sorry, I still haven't had enough time to read PEP 538 carefully. But since the discussion has already started on this issue, I will add my comments:

* I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8" locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".

* Setting the locale has an impact on all libraries running in the Python process. At this point, I'm not sure that it is what we want.

* I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the user locale uses a different encoding. I had the same concern with the PEP 528 (Change Windows console encoding to UTF-8) and PEP 529 (Change Windows filesystem encoding to UTF-8) on Windows, but these PEPs were approved and merged into Python 3.6. My fear is obviously mojibake with the other applications using the other encoding, the locale encoding. Other applications are not impacted by setlocale() in the Python process.

* I proposed an opt-in option to force UTF-8: -X utf8 command line option and PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward compatibility issues. With an opt-in option, users are better prepared for mojibake issues.

* I dislike "Backporting to earlier Python 3 releases". In my experience, changes on how Python handles text (encodings, codecs, etc.) always have subtle issues, and users dislike getting backward incompatible changes in minor releases. *Maybe* if the option is an opt-in, the risk is lower and acceptable?

* I dislike that Fedora has such a downstream change. I would prefer to decide upstream how to slowly make UTF-8 a first-class citizen in Python. Otherwise, Fedora would behave differently from other Linux distributions, and it can be painful to write applications that behave the same on all Linux distributions. But I also understand that Fedora sometimes has to move faster than the slow CPython project :-) Fedora can also be seen as a toy for experimenting with changes quickly, which helps to provide wide feedback upstream so that better decisions can be taken.

* Using the strict or surrogateescape error handler is a very important choice which has a wide impact. If we use utf8 by default (PEP 538), people will probably complain less if Python magically passes undecoded bytes through thanks to surrogateescape. If the option is an opt-in, strict may make sense. But surrogateescape is maybe still more "convenient". I don't know at this point.

Nick: it seems like you have a well-defined plan, but I dislike it on multiple points. I don't know if it's better to try to convince you to change your PEP, or to write a different PEP.

I have planned to write such a "UTF-8" PEP since 2015, but I never started because the scope is so large that I fear all the tiny but annoying corner cases...
msg284736 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2017-01-05 11:36
While going for the full locale setting may be a good option,
perhaps just focusing on the FS encoding for now is a better
way forward (and also more in line with the ticket title).

So essentially go for the PEP 529 approach on Unix as well
(except that we use 'ascii' as fallback in legacy mode):

https://www.python.org/dev/peps/pep-0529/

The PEP also includes a section on affected modules, which we
could double check (even though the term "FS encoding" implies
that only file system relevant APIs are touched by such a change,
the encoding is used in several other places as well):

https://www.python.org/dev/peps/pep-0529/#id14

For Windows, a couple of modules such as pwd and nis are not
used, so those may need some extra attention.
msg284742 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-05 12:41
The trade-offs here are incredibly complex (and are mainly a matter of deciding whose code and configurations we want to break in 3.7+), so I think competing PEPs are going to be better than attempting to create a combined PEP that tries to cover all the options.

That way each PEP can argue as strongly as it can for the respective author's preferred approach to tackling the default C locale problem, even if they point to a common background section in one of the PEPs (similar to the way PEPs 522 and 524 shared a common problem definition, even though they proposed different ways of handling it).
msg284747 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2017-01-05 14:44
On Jan 05, 2017, at 11:11 AM, STINNER Victor wrote:

>I'm sure that many Linux, UNIX and BSD systems don't have the "C.UTF-8"
>locale. For example, HP-UX has "C.utf8" which is not exactly "C.UTF-8".
>
>I'm not sure that it's ok in 2017 to always force the UTF-8 encoding if the
>user locale uses a different encoding.

It's not just any different encoding, it's specifically C (implicitly,
C.ASCII).

>I proposed an opt-in option to force UTF-8: -X utf8 command line option and
>PYTHONUTF8=1 env var. Opt-in will obviously reduce the risk of backward
>compatibility issues. With an opt-in option, users are better prepared for
>mojibake issues.

If this is true, then I would like a configuration option to default this on.
As mentioned, Debian and Ubuntu already have C.UTF-8 and most environments
(although not all, see my sbuild/schroot comment earlier) will at least be
C.UTF-8.  Perhaps it doesn't matter then, but what I really want is that for
those few odd outliers (e.g. schroot), Python would act the same inside and
outside those environments.  I really don't want people to have to add that envar
or switch (or even export LC_ALL) to get proper build behavior.
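
The point quoted above about locale names differing between systems ("C.UTF-8" versus "C.utf8") can be probed from Python. This is only a sketch: the helper name is hypothetical, and the candidate list is just the set of names mentioned in this thread, not a statement of what any implementation tries:

```python
import locale

# Candidate names mentioned in this discussion; availability varies by system.
CANDIDATES = ("C.UTF-8", "C.utf8", "UTF-8")


def first_available_locale(candidates=CANDIDATES):
    """Return the first LC_CTYPE locale name that setlocale() accepts, or None."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    try:
        for name in candidates:
            try:
                locale.setlocale(locale.LC_CTYPE, name)
            except locale.Error:
                continue  # this system does not provide that locale
            return name
        return None
    finally:
        # Always restore whatever LC_CTYPE was active before probing.
        locale.setlocale(locale.LC_CTYPE, saved)


print(first_available_locale())
```

The result is deliberately not hard-coded here, since it depends entirely on which locales the host provides.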
msg284764 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-05 17:18
> That way each PEP can argue as strongly as it can for the respective author's preferred approach to tackling the default C locale problem, even if they point to a common background section in one of the PEPs (similar to the way PEPs 522 and 524 shared a common problem definition, even though they proposed different ways of handling it).

Ok, same players play again: as with PEPs 522 and 524 between Nick and me, I just wrote PEP 540 "Add a new UTF-8 mode" and Nick wrote PEP 538 :-D

I started a thread to discuss the PEP on python-ideas:
https://mail.python.org/pipermail/python-ideas/2017-January/044089.html

IMHO the PEP 538 should discuss the usage of the surrogateescape error handler: see my second mail in the thread for the details.

I proposed a change in my third mail which would move my PEP closer to Nick's PEP 538: "automatically" enable the UTF-8 mode when the locale is POSIX.
msg284782 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-05 22:44
> Working with Docker I often end up with an environment where the locale isn't correctly set.

The locale encoding is controlled by 3 environment variables: LC_ALL, LC_CTYPE and LANG.
https://www.python.org/dev/peps/pep-0540/#the-posix-locale-and-its-encoding

Can you please tell me if these variables are set and if yes, give me their value?

I would like to know if it would be possible to change the behaviour of Python when the (LC_CTYPE) locale is POSIX (aka the famous "C" locale).
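
As a rough illustration of how those three variables interact, the following sketch applies the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset). The function names are hypothetical and it only inspects a mapping; it does not call setlocale():

```python
def effective_ctype_setting(environ):
    """Return (variable, value) for the setting that decides LC_CTYPE."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:  # unset and empty are treated the same way
            return name, value
    return None, "C"  # nothing set: the default "C" (POSIX) locale applies


def looks_like_posix_locale(environ):
    """True when the deciding setting names the C/POSIX locale."""
    _, value = effective_ctype_setting(environ)
    return value in ("C", "POSIX")


print(effective_ctype_setting({"LANG": "en_US.UTF-8", "LC_ALL": ""}))
print(looks_like_posix_locale({}))
```

Passing os.environ to these helpers gives a quick answer to the question asked above without having to read the three variables by hand.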
msg284794 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-06 02:50
Docker containers don't have a locale set by default - the approach proposed in PEP 528 actually comes from the way I configure Docker images (which in turn comes from Armin Ronacher's recommendations in click for Python 3 locale handling).

In the Dockerfile for Fedora based containers I add:

    ENV LC_ALL=C.UTF-8
    ENV LANG=C.UTF-8

while in CentOS 7 based containers I add:

    ENV LC_ALL=en_US.UTF-8
    ENV LANG=en_US.UTF-8

And with those settings, Python 3 based containers just work (my laptop is running en_AU.UTF-8 locally)
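
For anyone wanting to check what a given set of locale variables does to the interpreter, here is a small sketch that starts a child Python with those variables and reports its encodings. The helper name is hypothetical, and the values printed depend on the Python version and on which locales the system provides:

```python
import os
import subprocess
import sys

# The child reports its filesystem encoding and its stdout encoding.
CHILD_CODE = "import sys; print(sys.getfilesystemencoding(), sys.stdout.encoding)"


def encodings_under(**locale_vars):
    """Run a child interpreter with only the given locale variables set."""
    env = dict(os.environ)
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        env.pop(name, None)  # start from "no locale configured"
    env.update(locale_vars)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.split()


print(encodings_under())
print(encodings_under(LC_ALL="C.UTF-8", LANG="C.UTF-8"))
```

Comparing the two lines shows the difference between an unconfigured container and one with the ENV settings above.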
msg284795 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-06 03:07
And by PEP 528, I actually mean PEP 538 :)
msg284799 - (view) Author: Jan Niklas Hasse (Jan Niklas Hasse) Date: 2017-01-06 07:51
> Can you please tell me if these variables are set and if yes, give me their value?

None of these variables are set (with `docker run -it fedora:25 /bin/bash`).
msg284882 - (view) Author: (Sworddragon) Date: 2017-01-07 02:33
After looking into PEP 538 and PEP 540, I think PEP 540 is the way to go. It provides an option for stronger encapsulation of the de-/encoding logic between the interpreter and the developer. Instead of caring about error handling, the developer now has to care about mojibake handling (for me and maybe others that is explicitly preferred, but this may depend on the individual). If I'm not wrong, PEP 538 improves this for output too, but input handling will still suffer from the overall issue, while PEP 540 also solves this case. Also, PEP 540 would not make the C locale, and thus eventually some systems, potentially unsupported (but it might be an acceptable trade-off if we really go with PEP 538).


Specific for PEP 540:

> The POSIX locale enables the UTF-8 mode

Non-strict I assume?


> UTF-8 /backslashreplace

Was/is the reason to use backslashreplace for sys.stderr to guarantee that the developer/user sees the error messages? Might it make sense to also use surrogateescape instead of backslashreplace for sys.stderr in UTF-8 non-strict mode to be consistent here?
msg284884 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-01-07 02:53
Sworddragon added the comment:
> (for me and maybe others that is explicitly preferred but maybe this depends on each individual)

That's why PEP 540 has options to enable or disable its UTF-8 mode(s).

> If I'm not wrong PEP 538 improves this for the output too but input handling will still suffer from the overall issue while PEP 540 does also solve this case.

The PEP 538 works fine if all inputs and outputs are encoded to UTF-8.
I understand that it's a deliberate choice to fail on
decoding/encoding error (to not use surrogateescape), but I can be
wrong.

> Also PEP 540 would not make the C locale and thus eventually some systems potentially unsupported (but it might be an acceptable trade-off if we should really go PEP 538).

What do you mean by "make the C locale"?

> Specific for PEP 540:
>
>> The POSIX locale enables the UTF-8 mode
>
> Non-strict I assume?

Yes, non strict.

I'm not sure of the name of each mode yet.

After having written the "Use Cases" section and especially the
Mojibake column of results, I consider the option of renaming the
"UTF-8 mode" to "YOLO mode".

>> UTF-8 /backslashreplace
>
> Was/is the reason to use backslashreplace for sys.stderr to guarantee that the developer/user sees the error messages?

Yes.

> Might it make sense to also use surrogateescape instead of backslashreplace for sys.stderr in UTF-8 non-strict mode to be consistent here?

Using surrogateescape means that you pass through undecodable bytes
from inputs to stderr which can cause various kinds of bad surprises.

stderr is used to log errors. Getting a new error when trying to log
an error is kind of annoying.

Victor
msg284886 - (view) Author: (Sworddragon) Date: 2017-01-07 05:14
> What do you mean by "make the C locale"?

I was pointing to the Platform Support Changes of PEP 538.


> I'm not sure of the name of each mode yet.
>
> After having written the "Use Cases" section and especially the
> Mojibake column of results, I consider the option of renaming the
> "UTF-8 mode" to "YOLO mode".

Assumingly YOLO is meant to be negative: Things are whirling in my mind. Eventually you want to save your joker :>


> Using surrogateescape means that you pass through undecodable bytes
> from inputs to stderr which can cause various kinds of bad surprises.
>
> stderr is used to log errors. Getting a new error when trying to log
> an error is kind of annoying.

Hm, what bad surprise/error could appear that would not appear with backslashreplace?
msg284887 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2017-01-07 05:24
>> stderr is used to log errors. Getting a new error when trying to log
>> an error is kind of annoying.
>
> Hm, what bad surprise/error could appear that would not appear with backslashreplace?

$ cat badfilename.py 
badfn = "こんにちは".encode('euc-jp').decode('utf-8', 'surrogateescape')
print("bad filename:", badfn)

$ PYTHONIOENCODING=utf-8:backslashreplace python3 badfilename.py 
bad filename: \udca4\udcb3\udca4\udcf3\udca4ˤ\udcc1\udca4\udccf

$ PYTHONIOENCODING=utf-8:surrogateescape python3 badfilename.py 
bad filename: �����ˤ���
msg284900 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-07 08:41
I just pushed an update to PEP 538 based on PEP 540 and the feedback in the linux-sig discussion: https://github.com/python/peps/commit/221099d8765125bbd798e869846b005bcca84b47

I'll be starting a thread for that on python-ideas shortly, but in the context of the discussion here:

* There are good reasons to go back to strict error handling by default on the standard streams when we're using UTF-8 as the default encoding rather than ASCII: https://www.python.org/dev/peps/pep-0538/#using-strict-error-handling-by-default
* The right overall answer might actually be to create a hybrid merger of the two PEPs, rather than seeing them as strictly competitors: https://www.python.org/dev/peps/pep-0538/#relationship-with-other-peps
msg284908 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-07 11:43
While the attached PEP 538 patches include their own tests, the uploaded pep538-check-click.sh script is the one I've been using to check that the changes have the desired effect of letting click "just work", even when the nominal locale is cleared, explicitly set to C, or explicitly set to POSIX.
msg284943 - (view) Author: (Sworddragon) Date: 2017-01-07 22:20
> $ cat badfilename.py 
> badfn = "こんにちは".encode('euc-jp').decode('utf-8', 'surrogateescape')
> print("bad filename:", badfn)
>
> $ PYTHONIOENCODING=utf-8:backslashreplace python3 badfilename.py 
> bad filename: \udca4\udcb3\udca4\udcf3\udca4ˤ\udcc1\udca4\udccf
>
> $ PYTHONIOENCODING=utf-8:surrogateescape python3 badfilename.py 
> bad filename: �����ˤ���

The first example is still readable (though not so much for a user), while the second example appears not to be readable at all anymore. But the second example is technically still readable and there is no data loss, is there? In that case it would probably not speak against surrogateescape for sys.stderr in UTF-8 non-strict mode. Otherwise backslashreplace might indeed be the better choice.
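
The "no data loss" question can be checked at the bytes level with the same string as in the quoted example. This is a minimal sketch that only uses the codec machinery and does not touch the real standard streams:

```python
# Same setup as the quoted example: EUC-JP bytes misread as UTF-8.
original = "こんにちは".encode("euc-jp")
badfn = original.decode("utf-8", "surrogateescape")

# What a UTF-8 stream would emit with each error handler.
escaped = badfn.encode("utf-8", "backslashreplace")
passthrough = badfn.encode("utf-8", "surrogateescape")

# backslashreplace yields valid UTF-8 with visible \udcXX escapes.
print(escaped.decode("utf-8"))

# surrogateescape emits the original bytes again, which are not valid UTF-8.
print(passthrough == original)
```

So surrogateescape is lossless, but what reaches the terminal is a byte sequence that is not valid UTF-8, whereas backslashreplace always produces text that the terminal can decode.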


I have thought about this a bit more: if we go with PEP 538 and keep strict errors more or less as before, there might be another solution that could ease the overall issue: print() could get an option to change the error handler on demand (with 'strict' still being the default).

Most things that I output with print() are deterministic or optional and not important application data. Being able to print this information without caring about de-/encoding errors would mitigate this issue. Where application data is being printed and data loss is not acceptable, exceptions can still be raised.
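
Such an option does not exist in print() itself; the idea can be approximated today with a wrapper. The following is only a sketch with a hypothetical helper name, and it only makes sense for replacement-style handlers such as 'replace' or 'backslashreplace':

```python
import io
import sys


def print_lenient(*args, errors="backslashreplace", file=None, flush=False, **kwargs):
    """Hypothetical helper: print() that replaces unencodable characters."""
    stream = sys.stdout if file is None else file
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Let print() do its normal formatting (sep, end) into memory first.
    rendered = io.StringIO()
    print(*args, file=rendered, **kwargs)
    # Round-trip through the target encoding so anything the stream could
    # not encode is already replaced before the real write happens.
    safe_text = rendered.getvalue().encode(encoding, errors).decode(encoding)
    stream.write(safe_text)
    if flush:
        stream.flush()


print_lenient("bad filename:", "caf\udce9")
```

The stream's own error handler stays untouched, so ordinary print() calls on the same stream remain strict.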
msg284952 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-01-08 02:22
Uploaded one last version of the patch implementing the previous PEP 538 design. This refactors the test cases so they systematically cover 4 cases that we expect to be reported as "the C locale":

- LC_ALL, LC_CTYPE, and LANG all empty
- one of them set to "C", others empty
- one of them set to "POSIX", others empty
- one of them set to an unknown locale, others empty
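
The four cases above can be enumerated mechanically. This sketch is not the code from the patch; the function name is hypothetical and "invalid.ascii" is just a stand-in for an unknown locale name:

```python
LOCALE_VARS = ("LC_ALL", "LC_CTYPE", "LANG")


def c_locale_equivalent_envs(unknown_locale="invalid.ascii"):
    """Yield environment settings expected to be reported as the C locale."""
    # Case 1: all three variables empty.
    yield dict.fromkeys(LOCALE_VARS, "")
    # Cases 2-4: exactly one variable set, the others empty.
    for value in ("C", "POSIX", unknown_locale):
        for var in LOCALE_VARS:
            env = dict.fromkeys(LOCALE_VARS, "")
            env[var] = value
            yield env


for env in c_locale_equivalent_envs():
    print(env)
```

That gives ten combinations in total: one with everything empty, plus three values times three variables.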

The next version of the patch will update it to match the latest draft of the PEP (PYTHONCOERCECLOCALE, different message wording, etc)
msg285735 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2017-01-18 15:15
pep538_coerce_legacy_c_locale_v3.diff fixes issue 28997 on Android (api 21 and 24). This issue is raised because there is an inconsistency between Python on Android that considers the locale encoding to be always UTF-8 and GNU Readline that does not accept eight-bit characters when LANG is not set (on Android).

On Android, setlocale(CATEGORY, "") does not look for the locale environment variables (LANG, ...) but sets the 'C' locale instead, so the patch does not fully behave as expected and the 'Py_Initialize detected' warning is emitted. Here is the output of an interactive session on Android:

root@generic_x86:/data/data/org.bitbucket.pyona # python
Python detected LC_CTYPE=C, forcing LC_ALL & LANG to C.UTF-8 (set PYTHONALLOWCLOCALE to disable this locale coercion behaviour).
Py_Initialize detected LC_CTYPE=C, which limits Unicode compatibility. Some libraries and operating system interfaces may not work correctly. Set `PYTHONALLOWCLOCALE=1 LC_CTYPE=C` to configure a similar environment when running Python directly.
Python 3.7.0a0 (default:0503024831ad+, Jan 18 2017, 11:34:53)
[GCC 4.2.1 Compatible Android Clang 3.8.256229 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale, os
>>> os.environ['LANG']
'C.UTF-8'
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.setlocale(locale.LC_CTYPE)
'C'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'C.UTF-8'
>>> locale.setlocale(locale.LC_CTYPE)
'C.UTF-8'

The attached android_setlocale.patch fixes the following problems when applied after pep538_coerce_legacy_c_locale_v3.diff:
* No 'Py_Initialize detected' warning is emitted.
* locale.setlocale(locale.LC_CTYPE) returns now 'C.UTF-8'.
msg286001 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2017-01-22 10:05
> On Android, setlocale(CATEGORY, "") does not look for the locale environment variables (LANG, ...) but sets the 'C' locale instead

FWIW the source code of setlocale() on bionic (Android libc) is at https://android.googlesource.com/platform/bionic/+/master/libc/bionic/locale.cpp#144
msg289002 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-03-05 07:12
An updated reference implementation has been pushed to the pep538-coerce-c-locale branch in my GitHub fork: https://github.com/python/cpython/compare/master...ncoghlan:pep538-coerce-c-locale

(That doesn't include Xavier's Android fixes yet, though)
msg289534 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-03-13 07:22
OK, the PEP 538 reference implementation has reached the point where I was willing to create a PR for it: https://github.com/python/cpython/pull/659

That PR/branch also includes the necessary changes to always force the C.UTF-8 locale on Android rather than defaulting to the C locale.

I believe the only thing missing at this point is the configure.ac dance to ensure that PY_WARN_ON_C_LOCALE and PY_COERCE_C_LOCALE never get set on Mac OS X.
msg295121 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-04 10:00
The PEP 538 PR is mostly complete now, but I created https://bugs.python.org/issue30565 to track making a follow-up decision on whether or not we really want to emit a warning on *successful* implicit locale coercion.

The pre-release What's New entry for PEP 538 will include a link to that issue to allow folks to provide feedback on their preferences.
msg295683 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-11 03:16
New changeset 6ea4186de32d65b1f1dc1533b6312b798d300466 by Nick Coghlan in branch 'master':
bpo-28180: Implementation for PEP 538 (#659)
https://github.com/python/cpython/commit/6ea4186de32d65b1f1dc1533b6312b798d300466
msg295688 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-11 04:55
And merged!

Thanks to all involved in the process of getting this change through to implementation :)
msg295698 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-11 10:13
Tests fail on many buildbots.
msg295710 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-11 14:07
Ah, it would have been too easy for all the other *nix variants to be close enough to Fedora & Ubuntu for everything to work first time :)
msg295713 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-11 14:26
Initial look at the failures on the stable buildbots:

FreeBSD 10.x: if locale coercion succeeds, we then fail on get_codeset() (perhaps because that doesn't recognise LC_CTYPE=UTF-8?)
FreeBSD CURRENT: if locale coercion fails (due to no suitable locale), lots of error handling tests fail due to the unexpected warning message on stderr

Mac OS X Tiger: looks like the test expectations aren't right on Mac OS X (at least for Tiger). I've added the Mac OS X folks to the nosy list.

Ubuntu shared library build: loading the shared library fails in _testembed for the `test_forced_io_encoding` test case, which suggests a problem with the way that particular test is running the binary

Windows 8.1 refleak hunting: failure doesn't appear to be due to this change (multiprocessing test failures)
s390x RHEL 7: failure doesn't appear to be due to this change (multiprocessing test failures)
msg295722 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2017-06-11 16:13
The macOS failures are at least partially caused by test assumptions that aren't true on macOS: in particular the filesystem encoding defaults to UTF-8 on macOS (because HFS+ and the recent APFS filesystem store unicode data and not pure byte strings).
msg295871 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-13 09:02
> FreeBSD 10.x: if locale coercion succeeds, we then fail on get_codeset() (perhaps because that doesn't recognise LC_CTYPE=UTF-8?)

I created bpo-30647 to track this one.
msg295872 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-13 09:02
Ronald Oussoren:
> The macOS failures are at least partially caused by test assumptions that aren't true on macOS (...)

Nick is working on a fix for macOS:
https://github.com/python/cpython/pull/2130
msg295875 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-13 09:17
It seems like this change:

     def test_forced_io_encoding(self):
         # Checks forced configuration of embedded interpreter IO streams
-        out, err = self.run_embedded_interpreter("forced_io_encoding")
-        if support.verbose:
+        env = {"PYTHONIOENCODING": "utf-8:surrogateescape"}
+        out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
(...)

Caused a failure on the "shared" buildbot (./configure --enable-shared):

http://buildbot.python.org/all/builders/x86%20Ubuntu%20Shared%203.x/builds/877/steps/test/logs/stdio

======================================================================
FAIL: test_forced_io_encoding (test.test_capi.EmbeddingTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 484, in test_forced_io_encoding
    out, err = self.run_embedded_interpreter("forced_io_encoding", env=env)
  File "/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Lib/test/test_capi.py", line 392, in run_embedded_interpreter
    (p.returncode, err))
AssertionError: 127 != 0 : bad returncode 127, stderr is '/srv/buildbot/buildarea/3.x.bolen-ubuntu/build/Programs/_testembed: error while loading shared libraries: libpython3.7dm.so.1.0: cannot open shared object file: No such file or directory\n'
msg295885 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-13 09:49
New changeset eb52ac89929bb09b15c014ab8ff60eee685e86c7 by Victor Stinner in branch 'master':
bpo-28180: Fix test_capi.test_forced_io_encoding() (#2155)
https://github.com/python/cpython/commit/eb52ac89929bb09b15c014ab8ff60eee685e86c7
msg295913 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-13 12:49
New changeset 4563099d28e832aed22b85ce7e2a92236df03847 by Nick Coghlan in branch 'master':
bpo-28180: assume UTF-8 for Mac OS X PEP 538 tests (GH-2130)
https://github.com/python/cpython/commit/4563099d28e832aed22b85ce7e2a92236df03847
msg295914 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-13 12:54
I've added dependencies for PEP 538 induced testing problems that have been broken out into their own issues.

I've also merged my attempt at fixing the tests on Mac OS X.

Something that's included in that patch is an implicit skip of the "LANG=UTF-8" case when checking external locale configuration. I expected that to behave the same way as "LC_CTYPE=UTF-8", but instead it's behaving more like "LC_CTYPE=C".
msg296064 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-15 04:18
Ah, I finally understand Victor's comment on my initial attempt at fixing the tests on Mac OS X - the standard streams *don't* use the filesystem encoding, so they default to ASCII in the C locale, even on Mac OS X.
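
The distinction is easy to see by reporting the filesystem encoding and the standard stream settings side by side. A minimal sketch, with hypothetical helper names and "encoding:errors" strings chosen purely for readability:

```python
import sys


def stream_info(stream):
    """Describe a text stream as 'encoding:errors'."""
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return "{}:{}".format(encoding, errors)


def encoding_details():
    """Collect the filesystem encoding and the standard stream settings."""
    return {
        "fsencoding": sys.getfilesystemencoding(),
        "stdin_info": stream_info(sys.stdin),
        "stdout_info": stream_info(sys.stdout),
        "stderr_info": stream_info(sys.stderr),
    }


for key, value in sorted(encoding_details().items()):
    print(key, value)
```

Running this with LC_ALL=C on different platforms should show whether "fsencoding" and the stream entries agree there; no particular output is claimed here.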
msg296075 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-15 09:11
New changeset 7926516ff95ed9c8345ed4c4c4910f44ffbd5949 by Nick Coghlan in branch 'master':
bpo-28180: Standard stream & FS encoding differ on Mac OS X (GH-2208)
https://github.com/python/cpython/commit/7926516ff95ed9c8345ed4c4c4910f44ffbd5949
msg296077 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2017-06-15 09:32
The latest commit should get the Mac OS X buildbot back to green, but I had to disable some test cases to do it - see issue 30672 for details.

Issue 30565 is the one that covers silencing the locale coercion and locale compatibility warnings by default.
msg305850 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2017-11-08 14:26
PR 4334 added: fix the implementation of PEP 538 on Android.

The current implementation of PEP 538 fixes issue 28997 without the locale coercion for Android added by PR 4334, see msg305848.
msg306108 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2017-11-12 11:46
New changeset 1588be66d7b0eeebc4614309cd0fc837ff52776a by xdegaye in branch 'master':
bpo-28180: Fix the implementation of PEP 538 on Android (GH-4334)
https://github.com/python/cpython/commit/1588be66d7b0eeebc4614309cd0fc837ff52776a
msg314627 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2018-03-29 01:23
Given that issue 32002 and issue 30672 track the known challenges in testing the expected locale coercion behaviour reliably, I'm going to go ahead and close this overall implementation issue (the feature is there, and works in a way we're happy with, we're just encountering some challenges clearly expressing those expectations as a regression test).
msg364740 - (view) Author: Matej Cepl (mcepl) * Date: 2020-03-21 12:47
I have tried to port this patch to Python 3.4 (still maintained by SUSE on SLE-12), but I am having the hardest time debugging it. All affected tests end with errors like this:

[  493s] ======================================================================
[  493s] FAIL: test_test_PYTHONCOERCECLOCALE_not_set (test.test_c_locale_coercion.LocaleCoercionTests) (PYTHONCOERCECLOCALE=None, env_var='LC_CTYPE', nominal_locale='invalid.ascii')
[  493s] ----------------------------------------------------------------------
[  493s] Traceback (most recent call last):
[  493s]   File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 326, in _check_c_locale_coercion
[  493s]     coercion_expected)
[  493s]   File "/home/abuild/rpmbuild/BUILD/Python-3.4.10/Lib/test/test_c_locale_coercion.py", line 219, in _check_child_encoding_details
[  493s]     self.assertEqual(encoding_details, expected_details)
[  493s] AssertionError: {'fse[79 chars]cii:strict', 'stderr_info': 'ascii:backslashre[45 chars]ict'} != {'fse[79 chars]cii:surrogateescape', 'stderr_info': 'ascii:ba[63 chars]ape'}
[  493s]   {'fsencoding': 'ascii',
[  493s]    'lang': '',
[  493s]    'lc_all': '',
[  493s]    'lc_ctype': 'invalid.ascii',
[  493s]    'stderr_info': 'ascii:backslashreplace',
[  493s] -  'stdin_info': 'ascii:strict',
[  493s] ?                         ^^ ^
[  493s] 
[  493s] +  'stdin_info': 'ascii:surrogateescape',
[  493s] ?                        ++++++ ^^^ ^^^
[  493s] 
[  493s] -  'stdout_info': 'ascii:strict'}
[  493s] ?                          ^^ ^
[  493s] 
[  493s] +  'stdout_info': 'ascii:surrogateescape'}
[  493s] ?                         ++++++ ^^^ ^^^

yes, it is always a conflict between strict and surrogateescape. I probably don’t have time to finish debugging this, so I am just leaving this for posterity.
msg364760 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-21 16:51
Python 3.4 is no longer supported upstream. Python 3 got tons of Unicode fixes between Python 3.4 and Python 3.8.
msg364767 - (view) Author: Matej Cepl (mcepl) * Date: 2020-03-21 18:06
> Python 3.4 is no longer supported upstream. Python 3 got tons of Unicode fixes between Python 3.4 and Python 3.8.

Of course, I know that, but I just didn’t want to throw all my effort away, when I spent some hours on making it. And I guess, there may be somebody else who cares for 3.4 (ehm, RHEL-7 has 3.3, doesn’t it?).
msg364770 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-03-21 18:33
RHEL 7.7 and RHEL 8 provide Python 3.6. PEP 538 was implemented in Python 3.7. The PEP 538 feature was backported to Python 3.6 in RHEL 7.7 and RHEL 8.
msg364804 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2020-03-22 12:55
The test cases for locale coercion *not* triggering still assume that bpo-19977, using surrogateescape on the standard streams in the POSIX locale, has been implemented (since that was implemented in Python 3.5).

Hence the various test cases complaining that they found "ascii:strict" (Py 3.4 behaviour without bpo-19977) where they expected "ascii:surrogateescape" (the Py 3.5+ behaviour *with* bpo-19977).

To get a PEP 538 backport to work as intended on 3.4, you'd need to backport that earlier IO stream error handling change as well.
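
Put as code, the expectation mismatch looks roughly like this. The function is a hypothetical illustration built only from the values in the failing test output above (ASCII streams, no coercion), not the logic used by the test suite:

```python
def expected_stream_details(version_info):
    """Expected C-locale stream settings when locale coercion is NOT triggered."""
    # bpo-19977 (Python 3.5+) switched stdin/stdout to surrogateescape
    # in the C/POSIX locale; earlier versions use strict.
    has_bpo_19977 = tuple(version_info[:2]) >= (3, 5)
    std_errors = "surrogateescape" if has_bpo_19977 else "strict"
    return {
        "stdin_info": "ascii:" + std_errors,
        "stdout_info": "ascii:" + std_errors,
        # stderr uses backslashreplace on both sides of the failing diff.
        "stderr_info": "ascii:backslashreplace",
    }


print(expected_stream_details((3, 4)))
print(expected_stream_details((3, 7)))
```

The first line corresponds to what an unmodified 3.4 produces, the second to what the PEP 538 tests assume.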
msg364810 - (view) Author: Matej Cepl (mcepl) * Date: 2020-03-22 16:25
Thank you very much for the hint. Do I have to include the patch for bpo-19977 only (that would be easy), or also all twelve PRs for bpo-29240 (that would probably break my will to do it)?
History
Date User Action Args
2022-04-11 14:58:36 admin set github: 72367
2020-03-22 16:25:00 mcepl set messages: + msg364810
2020-03-22 12:55:15 ncoghlan set messages: + msg364804
2020-03-21 18:33:09 vstinner set messages: + msg364770
2020-03-21 18:06:02 mcepl set messages: + msg364767
2020-03-21 16:51:34 vstinner set messages: + msg364760
2020-03-21 12:47:16 mcepl set files: + pep538_coerce_legacy_c_locale.patch
    nosy: + mcepl
    messages: + msg364740
2018-03-29 01:23:38 ncoghlan set status: open -> closed
    messages: + msg314627
    dependencies: - PEP 538: Unexpected locale behaviour on *BSD (including Mac OS X)
    resolution: fixed
    stage: patch review -> resolved
2017-11-12 11:46:04 xdegaye set messages: + msg306108
2017-11-08 14:26:02 xdegaye set messages: + msg305850
2017-11-08 11:09:55 xdegaye set stage: patch review
    pull_requests: + pull_request4289
2017-06-15 09:32:09 ncoghlan set dependencies: + PEP 538: silence locale coercion and compatibility warnings by default?, PEP 538: Unexpected locale behaviour on *BSD (including Mac OS X)
    messages: + msg296077
    stage: resolved -> (no value)
2017-06-15 09:11:42 ncoghlan set messages: + msg296075
2017-06-15 04:42:06 ncoghlan set pull_requests: + pull_request2252
2017-06-15 04:18:39 ncoghlan set messages: + msg296064
2017-06-13 12:55:00 ncoghlan set dependencies: + Leak in test_c_locale_coercion, CODESET error on AMD64 FreeBSD 10.x Shared 3.x caused by the PEP 538
    messages: + msg295914
2017-06-13 12:49:47 ncoghlan set messages: + msg295913
2017-06-13 09:49:47 vstinner set messages: + msg295885
2017-06-13 09:22:42 vstinner set pull_requests: + pull_request2205
2017-06-13 09:21:13 vstinner set title: sys.getfilesystemencoding() should default to utf-8 -> Implementation of the PEP 538: coerce C locale to C.utf-8
2017-06-13 09:17:54 vstinner set messages: + msg295875
2017-06-13 09:02:55 vstinner set messages: + msg295872
2017-06-13 09:02:04 vstinner set messages: + msg295871
2017-06-12 13:28:51 ncoghlan set pull_requests: + pull_request2184
2017-06-11 16:13:32 ronaldoussoren set messages: + msg295722
2017-06-11 14:26:03 ncoghlan set nosy: + ronaldoussoren, ned.deily
    messages: + msg295713
2017-06-11 14:07:20 ncoghlan set messages: + msg295710
2017-06-11 10:13:32 vstinner set status: closed -> open
    resolution: fixed -> (no value)
    messages: + msg295698
2017-06-11 04:55:59 ncoghlan set status: open -> closed
    resolution: fixed
    messages: + msg295688
    stage: resolved
2017-06-11 03:16:17 ncoghlan set messages: + msg295683
2017-06-04 10:00:11 ncoghlan set messages: + msg295121
2017-03-13 07:22:30 ncoghlan set messages: + msg289534
2017-03-13 06:08:40 ncoghlan set pull_requests: + pull_request540
2017-03-05 07:12:56 ncoghlan set messages: + msg289002
2017-01-22 10:05:22 xdegaye set messages: + msg286001
2017-01-19 11:21:27 xdegaye link issue26865 dependencies
2017-01-18 15:16:00 xdegaye set files: + android_setlocale.patch
    nosy: + xdegaye
    messages: + msg285735
2017-01-08 02:22:40 ncoghlan set files: + pep538_coerce_legacy_c_locale_v3.diff
    messages: + msg284952
2017-01-07 22:20:07 Sworddragon set messages: + msg284943
2017-01-07 11:43:47 ncoghlan set files: + pep538-check-click.sh
    messages: + msg284908
2017-01-07 08:41:17 ncoghlan set messages: + msg284900
2017-01-07 05:24:57 methane set messages: + msg284887
2017-01-07 05:14:19 Sworddragon set messages: + msg284886
2017-01-07 02:53:55 vstinner set messages: + msg284884
2017-01-07 02:33:14 Sworddragon set messages: + msg284882
2017-01-06 07:51:04 Jan Niklas Hasse set messages: + msg284799
2017-01-06 03:07:56 ncoghlan set messages: + msg284795
2017-01-06 02:50:53 ncoghlan set messages: + msg284794
2017-01-05 22:44:13 vstinner set messages: + msg284782
2017-01-05 17:18:25 vstinner set messages: + msg284764
2017-01-05 14:44:31 barry set messages: + msg284747
2017-01-05 12:41:50 ncoghlan set messages: + msg284742
2017-01-05 11:36:20 lemburg set messages: + msg284736
2017-01-05 11:11:53 vstinner set messages: + msg284729
2017-01-05 10:54:48 ncoghlan set messages: + msg284725
2017-01-05 10:50:28 ncoghlan set messages: + msg284722
2017-01-05 10:10:46 methane set messages: + msg284720
2017-01-05 09:51:21 lemburg set messages: + msg284719
2017-01-05 09:42:56 methane set messages: + msg284718
2017-01-05 09:26:22 ncoghlan set messages: + msg284716
2017-01-05 03:32:16 methane set messages: + msg284697
2017-01-04 16:06:09 vstinner set messages: + msg284647
2017-01-04 14:46:13 ncoghlan set messages: + msg284641
2017-01-04 11:41:12 methane set messages: + msg284631
2017-01-04 08:29:24 methane set messages: + msg284621
2017-01-04 08:01:45 ncoghlan set messages: + msg284620
2017-01-04 01:02:24 methane set messages: + msg284605
2017-01-03 15:02:01 barry set nosy: + barry
2017-01-03 04:15:32 ncoghlan set files: + pep538_coerce_legacy_c_locale_v2.diff
    messages: + msg284537
2016-12-28 15:29:21 ncoghlan set messages: + msg284176
2016-12-28 12:23:31 Jan Niklas Hasse set messages: + msg284170
2016-12-28 02:45:48 ncoghlan set files: + pep538_coerce_legacy_c_locale.diff
    messages: + msg284150
2016-12-21 20:11:01 Sworddragon set nosy: + Sworddragon
2016-12-21 16:00:47 akira set nosy: + akira
2016-12-21 09:54:58 vstinner set messages: + msg283732
2016-12-18 07:20:38 ncoghlan set messages: + msg283543
2016-12-18 07:10:59 ncoghlan set files: + fedora-cpython-PYTHONALLOWCLOCALE.diff
2016-12-17 20:19:37 Jan Niklas Hasse set messages: + msg283515
2016-12-17 15:33:17 ncoghlan set messages: + msg283495
2016-12-17 10:15:25 lemburg set messages: + msg283482
2016-12-17 07:56:20 ncoghlan set messages: + msg283471
2016-12-17 07:46:52 ncoghlan set assignee: ncoghlan
    messages: + msg283469
2016-12-16 17:13:58 yan12125 set nosy: + yan12125
2016-12-16 15:15:34 vstinner set messages: + msg283409
2016-12-16 15:12:22 vstinner set messages: + msg283408
2016-12-15 06:15:51 ncoghlan set files: + fedora-cpython-force-c-utf-8.diff
    keywords: + patch
    messages: + msg283244
2016-12-12 11:49:23 ncoghlan set messages: + msg282984
2016-12-12 10:26:47 methane set messages: + msg282978
2016-12-12 10:17:21 lemburg set nosy: + lemburg
    messages: + msg282977
2016-12-12 08:45:01 Jan Niklas Hasse set messages: + msg282972
2016-12-12 08:10:30 ncoghlan set messages: + msg282971
2016-12-12 08:03:12 Jan Niklas Hasse set messages: + msg282970
2016-12-12 05:29:16 ncoghlan set messages: + msg282965
2016-12-12 05:26:55 ncoghlan set nosy: + ncoghlan
    messages: + msg282964
2016-09-23 12:59:13 methane set nosy: + methane
    messages: + msg277274
2016-09-23 12:47:00 Jan Niklas Hasse set messages: + msg277273
2016-09-16 17:18:03 vstinner set messages: + msg276729
2016-09-16 14:46:05 r.david.murray set versions: + Python 3.7, - Python 3.5
    nosy: + r.david.murray
    messages: + msg276722
    stage: resolved -> (no value)
2016-09-16 13:09:34 Jan Niklas Hasse set messages: + msg276709
2016-09-16 13:02:20 vstinner set status: closed -> open
    superseder: Change sys.getfilesystemencoding() on Windows to UTF-8 ->
    resolution: duplicate -> (no value)
    messages: + msg276707
2016-09-16 11:22:16 abarry set status: open -> closed
    superseder: Change sys.getfilesystemencoding() on Windows to UTF-8
    nosy: + abarry
    messages: + msg276694
    resolution: duplicate
    stage: resolved
2016-09-16 11:17:02 Jan Niklas Hasse create