Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale #64176

vstinner · 2013-12-13T16:40:20Z

BPO	19977
Nosy	@loewis, @atsuoishimoto, @ncoghlan, @pitrou, @vstinner, @larryhastings, @jwilk, @ezio-melotti, @abadger, @bitdancer, @vadmium, @serhiy-storchaka
Files	c_locale_surrogateescape.patch test_ls.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2014-04-27.23:57:39.901>
created_at = <Date 2013-12-13.16:40:20.152>
labels = ['type-bug', 'expert-unicode']
title = 'Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale'
updated_at = <Date 2017-12-18.14:31:51.323>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2017-12-18.14:31:51.323>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2014-04-27.23:57:39.901>
closer = 'vstinner'
components = ['Unicode']
creation = <Date 2013-12-13.16:40:20.152>
creator = 'vstinner'
dependencies = []
files = ['33122', '33124']
hgrepos = []
issue_num = 19977
keywords = ['patch']
message_count = 38.0
messages = ['206111', '206114', '206121', '206122', '206123', '206124', '206131', '206141', '206148', '206168', '207210', '207296', '207414', '210990', '213922', '213925', '213927', '213932', '213933', '215029', '215786', '215812', '215815', '217300', '217314', '217315', '217317', '217329', '217332', '217355', '217385', '217386', '217387', '284849', '284863', '284873', '284875', '308562']
nosy_count = 15.0
nosy_names = ['loewis', 'ishimoto', 'ncoghlan', 'pitrou', 'vstinner', 'larry', 'jwilk', 'ezio.melotti', 'a.badger', 'r.david.murray', 'Sworddragon', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'bkabrda']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue19977'
versions = ['Python 3.5']

vstinner · 2013-12-13T16:40:20Z

When LANG=C is used to get the English language (which is a mistake, LC_CTYPE=C should be used instead) or when Python is started with an empty environment (no environment variables), Python gets the POSIX locale (aka "C locale") for the LC_CTYPE (encoding) locale.

Standard streams use the locale encoding, which is usually ASCII with POSIX locale on most platforms (except on AIX: ISO 8859-1). In this case, data read from the OS (environment variables, command line arguments, filenames, etc.) may contain surrogate characters because of the internal usage of the surrogateescape error handler (see the PEP-383 for the rationale).

The problem is that standard output uses the strict error handler, and so print() fails to display OS data like filenames.

Example, "ls" command in Python:
---

import os
for name in sorted(os.listdir()): print(name)

Try it with "LANG=C python ls.py" in a directory containing non-ASCII characters and you will get unicode errors.
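The failure can be reproduced without touching the filesystem. The following is a minimal sketch, assuming the locale encoding is ASCII; `raw_name` and `encode_for_stream` are made-up names for illustration and are not part of the attached patch.

```python
# Hypothetical filename bytes as the OS might return them (Latin-1 encoded "é").
raw_name = b"caf\xe9.txt"

# OS data is decoded with surrogateescape (PEP 383), so the undecodable
# byte 0xE9 becomes the lone surrogate U+DCE9 instead of raising an error.
name = raw_name.decode("ascii", "surrogateescape")


def encode_for_stream(text, errors):
    """Encode text as an ASCII stream using the given error handler would.

    Returns the encoded bytes, or the exception if encoding fails.
    """
    try:
        return text.encode("ascii", errors)
    except UnicodeEncodeError as exc:
        return exc


strict_result = encode_for_stream(name, "strict")            # current stdout
escaped_result = encode_for_stream(name, "surrogateescape")  # proposed stdout
```

With the strict handler the lone surrogate cannot be encoded, while surrogateescape turns it back into the original byte.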

Issues bpo-19846 and bpo-19847 are examples of this annoyance.

I propose to also use the surrogateescape error handler for sys.stdout if the POSIX locale is used for LC_CTYPE at startup. The attached patch implements this idea.

With the patch, "LANG=C python ls.py" almost works as if filenames and stdout were byte streams, even though the Unicode type is used.
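The effect on a text stream can be sketched with an in-memory buffer standing in for the real stdout. This is only an illustration under the assumption of an ASCII locale encoding; `write_like_stdout` is a hypothetical helper, and the actual patch changes how the interpreter creates sys.stdin and sys.stdout at startup.

```python
import io


def write_like_stdout(text, errors):
    """Write text through an ASCII text stream and return the bytes produced."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return buffer.getvalue()


# A name as os.listdir() would return it for the raw bytes b"\xff.txt".
name = b"\xff.txt".decode("ascii", "surrogateescape")

# With surrogateescape the original bytes come out unchanged.
data = write_like_stdout(name, "surrogateescape")
```

Calling `write_like_stdout(name, "strict")` instead raises UnicodeEncodeError, which is the behaviour described above for unpatched versions.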

vstinner · 2013-12-13T16:42:19Z

Oh, in fact, sys.stdin is also modified by the patch (as I expected).

Sworddragon · 2013-12-13T16:58:58Z

What would happen if we call this example script with LANG=C on the patch?:

---

import os
for name in sorted(os.listdir('ä')):
	print(name)

Would it throw an exception on os.listdir('ä')?

vstinner · 2013-12-13T17:03:54Z

test_ls.py: test script producing invalid filenames and then trying to display them into stdout.

Output with UTF-8 locale, UTF-8 terminal and Python 3.3 (or unpatched 3.4, it's the same):

ascii.txt
<UnicodeError 'invalid_utf8:\udcff.txt'>
<UnicodeError 'latin1:\udce9.txt'>
utf8:é€.txt

Output with C locale (ASCII), UTF-8 terminal and Python 3.3:

ascii.txt
<UnicodeError 'invalid_utf8:\udcff.txt'>
<UnicodeError 'latin1:\udce9.txt'>
<UnicodeError 'utf8:\udcc3\udca9\udce2\udc82\udcac.txt'>

Output with C locale (ASCII), UTF-8 terminal and patched Python 3.4:

ascii.txt
invalid_utf8:�.txt
latin1:�.txt
utf8:é€.txt

You get no Unicode error with LANG=C, but you get mojibake instead.
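The mojibake in the last output can be modelled in a few lines. This is a sketch under two assumptions: the byte strings below are reconstructed from the output shown above (test_ls.py itself is not reproduced here), and the terminal is assumed to show U+FFFD for bytes that are not valid UTF-8.

```python
# Filenames as raw bytes, reconstructed from the output above (assumption).
names = [
    b"ascii.txt",
    b"invalid_utf8:\xff.txt",
    b"latin1:\xe9.txt",
    "utf8:é€.txt".encode("utf-8"),
]


def seen_on_utf8_terminal(raw):
    """Model the path: OS bytes -> str -> stdout bytes -> UTF-8 terminal."""
    decoded = raw.decode("ascii", "surrogateescape")      # os.listdir(), C locale
    written = decoded.encode("ascii", "surrogateescape")  # patched sys.stdout
    return written.decode("utf-8", "replace")             # what the terminal shows


displayed = [seen_on_utf8_terminal(raw) for raw in names]
```

The bytes reach the terminal unchanged, so valid UTF-8 names display correctly and the others show a replacement character.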

vstinner · 2013-12-13T17:08:28Z

os.fsencode(text) always fails if text cannot be encoded to sys.getfilesystemencoding(). surrogateescape doesn't help here.

Your example is "artificial", you should not get 'ä'. All OS data is decoded from the filesystem encoding using the surrogateescape error handler (except on Windows, where strict is used, but it's a different story, Python uses Unicode functions when available so don't worry). So all these data can always be encoded back to bytes using os.fsencode().

More generally, os.fsencode(os.fsdecode(read_data)) == read_data is always true on Unix, with any filesystem (locale) encoding.
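The round-trip property can be checked directly. The sketch below first tests it at the codec level for ASCII and UTF-8, then through os.fsdecode() and os.fsencode() on Unix only; the sample byte strings and the helper name `roundtrips` are made up for illustration.

```python
import os


def roundtrips(data, encoding):
    """Check the PEP 383 round trip for one codec, independent of the locale."""
    text = data.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape") == data


samples = [
    b"ascii.txt",
    b"latin1:\xe9.txt",
    b"invalid_utf8:\xff.txt",
    bytes(range(1, 256)),
]

codec_ok = all(roundtrips(s, enc) for s in samples for enc in ("ascii", "utf-8"))

# The same property through the os helpers. This is only checked on Unix,
# where os.fsdecode() and os.fsencode() use the surrogateescape handler.
if os.name == "posix":
    fs_ok = all(os.fsencode(os.fsdecode(s)) == s for s in samples)
else:
    fs_ok = None
```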

You may get Unicode data from other sources like files or a GUI, but I don't see what can be done here.

pitrou · 2013-12-13T17:21:06Z

> When LANG=C is used to get the english language (which is a mistake,
> LC_CTYPE=C should be used instead)

I think you mean LC_MESSAGES=C here.
(but it's not only about the English language; it's also about other locale parameters such as number formatting)

I think we should start thinking about making utf-8 the default filesystem encoding in 3.5 (under Unix).

bitdancer · 2013-12-13T17:59:51Z

Reintroducing moji-bake intentionally doesn't sound like a particularly good idea, wasn't that what python3 was supposed to help prevent?

It does seem like a utf-8 default is the Way of the Future. Or even the present, most places.

abadger · 2013-12-13T19:44:37Z

My impression was that python3 was supposed to help get rid of UnicodeError tracebacks, not mojibake. If mojibake was the problem then we should never have gone down the surrogateescape path for input.

serhiy-storchaka · 2013-12-13T21:27:01Z

Mojibake in input can cause a decoding error in another application that consumes the output of the Python script. In some cases this can be even worse than a UnicodeError in the producer.

But for C locale this makes sense. I think we should try this experiment in 3.5. There will be much time for testing before 3.5 beta 1.

ncoghlan · 2013-12-14T06:07:30Z

Getting rid of mojibake was the goal, surrogateescape was about dealing with cases where the "avoid mojibake" checks were spuriously breaking round-tripping between OS APIs due to other configuration errors (with LANG=C being set, or LANG not being set at all being the main problem). Other "high mojibake risk" power tools (like changing the encoding of an already open stream) are likely to return in the future, since there *are* cases where they're the right answer (e.g. you can't write an iconv equivalent in Python 3 at the moment, we need bpo-15216 implemented before that will be possible).

+1 for this solution - see bpo-19846 for the long discussion which got us to this point (there are a few unrelated tangents, a couple of them my fault), but this is definitely an improvement over the status quo.

ncoghlan · 2014-01-03T06:13:54Z

Larry: I'm assuming it's way too late to make a change like this for the 3.4 release?

Slavek: assuming this change is made for 3.5 upstream, we may want to look at backporting it as a 3.4 patch in Fedora (as part of the Python-3-by-default project). Otherwise it's very easy to provoke Python 3 into throwing Unicode errors when attempting to print data provided by the OS.

larryhastings · 2014-01-04T18:04:27Z

Yeah, unless there was a *huge* amount of support for changing this, it's way too late for 3.4.

bkabrda · 2014-01-06T07:39:50Z

Nick: Sure, once there is an upstream solution that people have agreed on, I'll look into backporting it, NP. Thanks for letting me know about this.

vstinner · 2014-02-11T17:47:46Z

> Reintroducing moji-bake intentionally doesn't sound like a particularly good idea, wasn't that what python3 was supposed to help prevent?

Sometimes practicality beats purity :-(

I tried to convince users that their computer was "not well configured", they always replied that Python 3 fails where Perl, PHP, Python 2, C, etc. "just work".

python-dev · 2014-03-18T00:27:01Z

New changeset bc06f67234d0 by Victor Stinner in branch 'default':
Issue bpo-19977: When the LC_TYPE locale is the POSIX locale (C locale),
http://hg.python.org/cpython/rev/bc06f67234d0

vstinner · 2014-03-18T00:49:10Z

Test failing on "x86 OpenIndiana 3.x" buildbot:

http://buildbot.python.org/all/builders/x86%20OpenIndiana%203.x/builds/7939/steps/test/logs/stdio

======================================================================
FAIL: test_forced_io_encoding (test.test_capi.EmbeddingTests)
----------------------------------------------------------------------

Traceback (most recent call last):
  File "/export/home/buildbot/32bits/3.x.cea-indiana-x86/build/Lib/test/test_capi.py", line 352, in test_forced_io_encoding
    self.assertEqual(out.strip(), expected_output)
AssertionError: '--- [79 chars]646:surrogateescape\nstdout: 646:surrogateesca[576 chars]lace' != '--- [79 chars]646:strict\nstdout: 646:strict\nstderr: 646:ba[540 chars]lace'
  --- Use defaults ---
  Expected encoding: default
  Expected errors: default
- stdin: 646:surrogateescape
- stdout: 646:surrogateescape
+ stdin: 646:strict
+ stdout: 646:strict
  stderr: 646:backslashreplace
  --- Set errors only ---
  Expected encoding: default
  Expected errors: surrogateescape
  stdin: 646:surrogateescape
  stdout: 646:surrogateescape
  stderr: 646:backslashreplace
  --- Set encoding only ---
  Expected encoding: latin-1
  Expected errors: default
- stdin: latin-1:surrogateescape
- stdout: latin-1:surrogateescape
+ stdin: latin-1:strict
+ stdout: latin-1:strict
  stderr: latin-1:backslashreplace
  --- Set encoding and errors ---
  Expected encoding: latin-1
  Expected errors: surrogateescape
  stdin: latin-1:surrogateescape
  stdout: latin-1:surrogateescape
  stderr: latin-1:backslashreplace
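Each stream line in this output has the form "name: encoding:errors" ("646" is the name this platform reports for ASCII). A rough sketch that prints the same information for the running interpreter is shown here; `describe` is a hypothetical helper and is not taken from test_capi.

```python
import sys


def describe(name, stream):
    """Format a stream as 'name: encoding:errors', as in the output above."""
    encoding = getattr(stream, "encoding", None)
    errors = getattr(stream, "errors", None)
    return "%s: %s:%s" % (name, encoding, errors)


# Report the three standard streams of the current process.
for stream_name in ("stdin", "stdout", "stderr"):
    print(describe(stream_name, getattr(sys, stream_name)))
```

Running it under different LC_CTYPE settings is one way to see which error handler the interpreter picked.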

vstinner · 2014-03-18T00:57:47Z

New behaviour:

$ mkdir z
$ touch z/abcé
$ LC_CTYPE=C ./python -c 'import os; print(os.listdir("z")[0])'
abcé

Old behaviour, before the change (test with Python 3.3):

$ LC_CTYPE=C python3 -c 'import os; print(os.listdir("z")[0])'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-4: ordinal not in range(128)

python-dev · 2014-03-18T01:32:24Z

New changeset 3589980c98de by Victor Stinner in branch 'default':
Issue bpo-19977, bpo-19036: Always include <locale.h> in pythonrun.c
http://hg.python.org/cpython/rev/3589980c98de

New changeset 94d5025c70a3 by Victor Stinner in branch 'default':
Issue bpo-19977: Enable test_c_locale_surrogateescape() on Windows
http://hg.python.org/cpython/rev/94d5025c70a3

python-dev · 2014-03-18T01:38:21Z

New changeset c9905e802042 by Victor Stinner in branch 'default':
Issue bpo-19977: Fix test_capi when LC_CTYPE locale is POSIX
http://hg.python.org/cpython/rev/c9905e802042

ncoghlan · 2014-03-28T10:01:25Z

This seems to be working on the buildbots for 3.5 now (buildbot failures appear to be due to other issues).

However, I'd still like to discuss the idea of backporting this to 3.4.1.

From a Fedora point of view, it's still *very* easy to flip an environment into POSIX mode, so even if the system is appropriately configured to use UTF-8 everywhere, Python 3.4 may still blow up if a script or application ends up running under the POSIX locale.

That has long made Toshio nervous about the migration of core services to Python 3 (https://fedoraproject.org/wiki/Changes/Python_3_as_Default), and his concerns make sense to me, as that migration covers little things like the installer, package manager, post-image install initialisation, etc. I'm not sure the Fedora team can deliver on the "Users shouldn't notice any changes, except that packages in minimal buildroot and on LiveCD will be python3-, not python-." aspect of the change proposal without this behavioural tweak in the 3.4 series as well.

Note that this *isn't* a blocker for the migration - if it was, it would be mentioned in the Fedora proposal. However, I think there's a risk to the Fedora user experience if the status quo remains in place for the life of Python 3.4, and I'd hate for the first encounter Fedora users have with Python 3 to be inexplicable tracebacks from components that have been migrated.

vstinner · 2014-04-09T00:28:32Z

"However, I'd still like to discuss the idea of backporting this to 3.4.1."

The reason for making this change in Python 3.5 is that I have no idea of the risk of regression. Before backporting such a change to a minor version (3.4.1), I would feel more confident with user tests of Python 3.5 or patched Python 3.4.

"That has long made Toshio nervous about the migration of core services to Python 3 (https://fedoraproject.org/wiki/Changes/Python_3_as_Default), and his concerns make sense to me, as that migration covers little things like the installer, package manager, post-image install initialisation, etc. "

Which programs in this test are or may be running with the POSIX locale?

Fedora doesn't use en_US.utf8 locale by default?

ncoghlan · 2014-04-09T10:52:05Z

The default locale on Fedora is indeed UTF-8 these days - the problem is that *users* are used to being able to use "LANG=C" to force the POSIX locale (whether for testing purposes or other reasons), and that currently means system utilities written in Python may fail in such situations if used with UTF-8 data from the filesystem (or elsewhere). (I believe there may also be other cases where POSIX mandates the use of the C locale, but Toshio would be in a better position than I am to confirm whether or not that is actually the case).

So perhaps this is best left in a "wait & see" mode for now - as the Fedora migration to Python 3 progresses, if the folks working on that find specific utilities where the Python 3.4 standard stream handling in the C locale appears problematic, then Slavek & Toshio can bring them up here.

The counterargument is that if we're going to change it, 3.4.1 would be a better time frame than 3.4.2. In that case, the task of identifying specific Fedora utilities of concern still falls back on Toshio & Slavek, but it would be a matter of going hunting for them specifically *now*, rather than waiting until they come up over the course of the migration.

vstinner · 2014-04-09T11:08:45Z

> The default locale on Fedora is indeed UTF-8 these days - the problem is that *users* are used to being able to use "LANG=C" to force the POSIX locale (whether for testing purposes or other reasons), and that currently means system utilities written in Python may fail in such situations if used with UTF-8 data from the filesystem (or elsewhere). (I believe there may also be other cases where POSIX mandates the use of the C locale, but Toshio would be in a better position than I am to confirm whether or not that is actually the case).

A common situation where you get a C locale is for programs started by
a crontab. If I remember correctly, these programs start with the C
locale, instead of the "system" (user?) locale.
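A script can check which LC_CTYPE locale it actually received, for example when started from cron. The sketch below is an assumption-laden illustration: `is_posix_locale` is a made-up helper, and newer Python versions may adjust the locale at startup so that the reported name is no longer "C".

```python
import locale


def is_posix_locale(name):
    """Return True if the locale name denotes the POSIX ("C") locale."""
    return name in ("C", "POSIX")


# Passing None queries the current LC_CTYPE setting without changing it.
ctype_name = locale.setlocale(locale.LC_CTYPE, None)
running_in_posix_locale = is_posix_locale(ctype_name)
```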

ncoghlan · 2014-04-27T18:13:18Z

Additional environments where the system misreports the encoding to use (courtesy of Armin Ronacher & Graham Dumpleton on Twitter): upstart, Salt, mod_wsgi.

Note that for more complex applications (e.g. integrated web UIs, socket servers, sending email), round tripping to the standard streams won't be enough - what we really need is a better "source of truth" as to the real system encoding when POSIX compliant systems provide incorrect configuration data to the interpreter.

ncoghlan · 2014-04-27T19:34:06Z

bpo-21368 now suggests looking for /etc/locale.conf before falling back to ASCII+surrogateescape.

pitrou · 2014-04-27T19:48:32Z

We should not overcomplicate this. I suggest that we simply use utf-8 under the C locale.

ncoghlan · 2014-04-27T20:47:40Z

If you can convince Stephen Turnbull that's a good idea, sure. It's
probably more likely to be the right thing than "ASCII" or "ASCII +
surrogateescape", but in the absence of hard data, he's in a better
position than we are to judge the likely impact of that, at least in Japan.

I'm also going to hunt around on freedesktop.org to see if there's anything
more general there on the topic of encodings.

vstinner · 2014-04-27T23:33:57Z

> We should not overcomplicate this. I suggest that we simply use utf-8 under the C locale.

Do you mean utf8/strict or utf8/surrogateescape?

utf8/strict doesn't work (os.listdir raises an unicode error) if your
system is configured to use latin1 (ex: filenames are stored in this
encoding), but unfortunately your program is running in an empty
environment (so will use the POSIX locale).
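At the codec level, the difference between the two options looks as follows. This is a minimal sketch with a made-up Latin-1 filename; it only exercises the codecs and does not call os.listdir().

```python
# Hypothetical filename stored on disk in Latin-1.
latin1_name = "café.txt".encode("latin-1")  # b'caf\xe9.txt'

# utf8/strict: the byte 0xE9 is not valid UTF-8 here, so decoding fails.
try:
    latin1_name.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# utf8/surrogateescape: the byte is kept as a lone surrogate and can be
# written back unchanged.
escaped = latin1_name.decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```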

vstinner · 2014-04-27T23:57:40Z

> We should not overcomplicate this. I suggest that we simply use utf-8 under the C locale.

Please open a new issue if you would prefer UTF-8. You will have to solve different technical issues. I tried to list some of them in issues bpo-19846 and bpo-19847.

In short, you should always decode and encode "OS data" with the same encoding. Python "file system encoding" is the locale encoding because in some places, PyUnicode_DecodeLocaleAndSize is used (ex: to decode PYTHONWARNINGS environment variable). A common location is PyUnicode_DecodeFSDefaultAndSize() before the Python codec is loaded. See also _Py_wchar2char() and _Py_char2wchar() functions which use the locale encoding and are used in many places.
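A small illustration of why the same encoding must be used in both directions, using Latin-1 and UTF-8 as stand-ins for two parts of the interpreter that disagree (the data is made up):

```python
# One byte of "OS data": the Latin-1 encoding of "é".
original = "é".encode("latin-1")  # b'\xe9'

# Decoded with one encoding but encoded with another: the bytes change.
mismatched = original.decode("latin-1").encode("utf-8")

# Decoded and encoded with the same encoding: the bytes are preserved.
matched = original.decode("latin-1").encode("latin-1")
```

In the mismatched case the data handed back to the OS no longer names the same file or variable.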

I'm now closing the issue because the initial point (use surrogateescape error handler) is implemented in Python 3.5, and backporting such major change in Python 3.4 branch is risky right now.

pitrou · 2014-04-28T10:07:29Z

> We should not overcomplicate this. I suggest that we simply use utf-8 under the C locale.

> Do you mean utf8/strict or utf8/surrogateescape?
>
> utf8/strict doesn't work (os.listdir raises an unicode error) if your
> system is configured to use latin1 (ex: filenames are stored in this
> encoding), but unfortunately your program is running in an empty
> environment (so will use the POSIX locale).

The issue is about stdin and stdout, I'm not sure why os.listdir would
be affected.

ncoghlan · 2014-04-28T17:11:25Z

Victor was referring to code like "print(os.listdir())". Those are the
motivating cases for ensuring round trips from system APIs to the standard
streams work correctly.

There's also the problem that sys.argv currently relies on the locale
encoding directly, because the filesystem encoding hasn't been worked out
at that point (see bpo-8776). So this current change will also make
"print(sys.argv)" work more reliably in the POSIX locale.

The conclusion I have come to is that any further decoupling of Python 3
from the locale encoding will actually depend on getting the PEP-432
bootstrapping changes implemented, reviewed and the PEP approved, so we
have more interpreter infrastructure in place by the time the interpreter
starts trying to figure out all these boundary encoding issues.

pitrou · 2014-04-28T17:13:17Z

> The conclusion I have come to is that any further decoupling of Python 3
> from the locale encoding will actually depend on getting the PEP-432
> bootstrapping changes implemented, reviewed and the PEP approved, so we
> have more interpreter infrastructure in place by the time the interpreter
> starts trying to figure out all these boundary encoding issues.

Yeah. My proposal had more to do with the fact that we should some day
switch to utf-8 by default on all POSIX systems, regardless of what the
system advertises as "best encoding".

ncoghlan · 2014-04-28T17:19:55Z

Antoine Pitrou added the comment:

> Yeah. My proposal had more to do with the fact that we should some day
> switch to utf-8 by default on all POSIX systems, regardless of what the
> system advertises as "best encoding".

Yeah, that seems like a plausible future to me as well, and knowing it's a
step along that path actually gives me more motivation to get back to
working on the startup issues :)

Sworddragon · 2017-01-06T20:55:11Z

Bug bpo-28180 has caused me to take a look at the "encoding" issue that this and the earlier tickets have tried, more or less, to solve. Being a bit unsure what the root cause and intention for all this was, I'm now at a point to actually check this ticket. Here is some example code (executed with Python 3.5.3 RC1 with LANG set to C):

import sys
sys.stdout.write('ä')

I thought with the surrogateescape error handler now being used for sys.stdout this would not throw an exception but I'm getting this:

UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 0: ordinal not in range(128)

vstinner · 2017-01-06T22:23:46Z

"I thought with the surrogateescape error handler now being used for sys.stdout this would not throw an exception but I'm getting this: (...)"

Please see the two recently proposed PEPs, Nick's PEP-538 and my PEP-540: both propose (two different) solutions to your issue, especially for the POSIX locale (aka "C" locale).

Sworddragon · 2017-01-06T23:41:30Z

The point is this ticket claims to be using the surrogateescape error handler for sys.stdout and sys.stdin for the C locale. I have never used surrogateescape explicitly before and thus have no experience for it and consulting the documentation mentions throwing an exception only for the strict error handler. I don't see anything that would make me think that surrogateescape would throw here an exception too. But maybe I'm just missing something.

vstinner · 2017-01-06T23:53:11Z

> But maybe I'm just missing something.

This issue fixed exactly one use case: "List a directory into stdout" (similar to the UNIX "ls" or Windows "dir" commands):
https://www.python.org/dev/peps/pep-0540/#list-a-directory-into-stdout

Your use case is more "Display Unicode characters into stdout":
https://www.python.org/dev/peps/pep-0540/#display-unicode-characters-into-stdout

This use case is not supported by the issue. It should be fixed by PEP-538 or PEP-540.
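The difference between the two use cases shows up at the codec level: surrogateescape only maps the lone surrogates U+DC80 to U+DCFF back to bytes, and any other non-ASCII character still fails with the ASCII codec. A minimal sketch (`try_encode` is a made-up helper):

```python
def try_encode(text):
    """Encode to ASCII with surrogateescape; return bytes or the exception."""
    try:
        return text.encode("ascii", "surrogateescape")
    except UnicodeEncodeError as exc:
        return exc


# "List a directory": a lone surrogate produced by decoding the byte 0xE4.
from_os = try_encode("\udce4")

# "Display Unicode characters": the real character 'ä' (U+00E4).
typed_text = try_encode("\xe4")
```

Here `from_os` is the original byte, while `typed_text` is a UnicodeEncodeError like the one reported above.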

Please join the happy discussion on the python-ideas mailing list to discuss how to "force UTF-8": this issue is closed, you shouldn't add new comments (other people will not see your comments).

vstinner · 2017-12-18T14:31:51Z

Follow-up: the PEP-538 (bpo-28180) and PEP-540 (bpo-29240) have been accepted and implemented in Python 3.7!

vstinner added the topic-unicode label Dec 13, 2013

vstinner changed the title ~~Use "surrogateescape" error handler for sys.stdout on UNIX for the C locale~~ Use "surrogateescape" error handler for sys.stdin and sys.stdout on UNIX for the C locale Dec 13, 2013

ncoghlan added the type-bug An unexpected behavior, bug, or error label Jan 3, 2014

vstinner closed this as completed Apr 27, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022
