Issue 23993: Use surrogateescape error handler by default in open() if the LC_CTYPE locale is C at startup

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/68181

classification

Title:	Use surrogateescape error handler by default in open() if the LC_CTYPE locale is C at startup
Type:		Stage:
Components:	Unicode	Versions:	Python 3.5

process

Status:	closed	Resolution:	postponed
Dependencies:		Superseder:
Assigned To:		Nosy List:	ezio.melotti, ncoghlan, r.david.murray, vstinner
Priority:	normal	Keywords:	patch

Created on 2015-04-18 09:25 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description
default_error_handler.patch	vstinner, 2015-04-18 09:25	
default_error_handler-2.patch	vstinner, 2015-04-18 14:51	

Messages (13)
msg241405 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-04-18 09:25
As a follow-up to issue #19977, I propose to also use the surrogateescape error handler in open() by default if the locale is C. The attached patch adds a new sys.getdefaulterrorhandler() function and uses it in io.TextIOWrapper (and _pyio.TextIOWrapper). We may use sys.getdefaulterrorhandler() in more places. I don't think that it would be correct to use it for str.encode() or bytes.decode().
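A minimal sketch of the selection logic described in this message, not the attached patch itself. `default_error_handler()` and `open_text()` are hypothetical names standing in for the proposed sys.getdefaulterrorhandler(), which is not part of any released Python. The sketch also assumes it is enough to look at the current LC_CTYPE value, whereas the proposal refers to the value at startup.

```python
import locale


def default_error_handler(lc_ctype=None):
    """Hypothetical helper: pick an error handler from the LC_CTYPE locale name."""
    if lc_ctype is None:
        # Passing None queries the current setting without changing it.
        lc_ctype = locale.setlocale(locale.LC_CTYPE, None)
    # Treat the C/POSIX locale as "probably misconfigured" and stay lenient.
    return "surrogateescape" if lc_ctype in ("C", "POSIX") else "strict"


def open_text(path, mode="r", encoding=None):
    """Hypothetical wrapper: open a text file with the locale-dependent handler."""
    return open(path, mode, encoding=encoding, errors=default_error_handler())
```

Callers that pass an explicit `errors` argument to open() would be unaffected; only the default would change.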
msg241406 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-04-18 09:26
The patch is a work in progress: I didn't have time to run the unit tests, and the documentation is not complete.
msg241416 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-04-18 14:30
I am -1 on this. (Or maybe more.) What's the rationale? I could see using utf-8 by default if the locale is C, but I don't think we want to encourage going back to a world where people don't pay attention to the encoding of their data. A more productive approach to solving the problem that I think you are trying to solve here would be to work on including chardet in the standard library, something that was brought up, and seemed to receive a positive reception (or at least not a negative one), during the Requests segment of the PyCon language summit.
msg241417 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-04-18 14:51
Updated and better patch: version 2.
- revert changes on fileutils.c: it's not useful to check check_force_ascii(), because this function is stricter than checking whether LC_CTYPE is "C"
- fix _pyio.py: add sys import
- complete the documentation
- tests pass
msg241418 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-04-18 15:00
> I am -1 on this. (Or maybe more.) What's the rationale?

See issue #19977.

In many cases you get the C locale by mistake. For example, by setting the LANG environment variable to an empty string to run a program in English (whereas LC_MESSAGES is the appropriate variable).

For daemons, in many cases you get the C locale, and it's hard to configure all systems to run the daemon with the user locale. I read that systemd runs daemons with the user locale, but I'm not sure.

The idea is to reduce the pain caused by this locale. When porting an application from Python 2 to Python 3, it's annoying to start getting Unicode errors everywhere. This change would make Python 3 more convenient.

> I could see using utf-8 by default if the locale is C,

This has been proposed many times, but I'm opposed to it. Python must be interoperable with other programs, and other programs use the locale encoding. For example, you get the ASCII locale encoding when LC_CTYPE is the POSIX locale ("C"). If Python writes UTF-8, other applications will be unable to decode the UTF-8 data. Maybe I'm wrong and you should continue to investigate this option.

This issue is very specific to "OS" data: environment variables, filenames, command line arguments, standard streams (stdin, stdout, stderr). You may make other choices for other kinds of data unrelated to the locale encoding. For example, JSON must use UTF-8; it's well defined. XML announces its encoding. Etc.
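The interoperability concern can be illustrated with a small sketch. `ascii_consumer()` is a hypothetical stand-in for another program that strictly decodes its input with the ASCII locale encoding; real programs may behave differently (for example, by passing bytes through untouched).

```python
text = "caf\u00e9"  # contains one non-ASCII character

# What Python would emit if it wrote UTF-8 regardless of the locale.
utf8_bytes = text.encode("utf-8")


def ascii_consumer(data):
    """Hypothetical peer program: decode strictly as ASCII, or give up."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        return None


result = ascii_consumer(utf8_bytes)      # the UTF-8 bytes are rejected
ascii_result = ascii_consumer(b"cafe")   # pure ASCII is accepted
```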
msg241419 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-04-18 15:03
For a more concrete use case, see the "makefile problem" in Mercurial wiki page: http://mercurial.selenic.com/wiki/EncodingStrategy#The_.22makefile_problem.22
msg241443 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-04-18 19:01
Hmm. Upon reflection I guess I can see the validity of "if you are using the C locale you or the OS are broken anyway, so we'll just pass the bytes through". I'm not entirely convinced this won't cause issues, but I suppose it might not cause any more issues than having things break due to the C locale does. It is, however, going to return us to the days when a program that works fine most of the time suddenly blows up in the face of non-ASCII data, and that's my biggest concern. I'd certainly be fine with it if it wasn't the default (that is, programs that need this have to opt in to it).
msg241460 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-04-18 22:29
> "if you are using the C locale you or the OS are broken anyway, so we'll just pass the bytes through"

Exactly. Even if you use Unicode, the Python 3 str type, you store text as raw bytes (in a custom format, as surrogate characters).

> I'm not entirely convinced this won't cause issues, but I suppose it might not cause any more issues that having things break due to the C locale does.

The most obvious issue is the comeback of mojibake. Since you manipulate raw bytes, it's easy to concatenate two byte strings encoded in two different encodings.
https://unicodebook.readthedocs.org/definitions.html#mojibake

The question is not how bad it is to manipulate text as bytes. The problem is that a working application written for Python 2 starts to randomly fail (on non-ASCII characters) on Python 3 when the LC_CTYPE locale is the POSIX locale ("C"). The first question is: should I keep Python 2 or write my application in a language which doesn't force me to understand Unicode?
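A short sketch of both points, assuming an ASCII locale encoding: undecodable bytes are stored in a str as lone surrogates, and joining strings that came from differently encoded sources silently produces mixed-encoding output. The variable names are illustrative only.

```python
latin1_bytes = "\u00e9".encode("latin-1")  # one byte: 0xE9
utf8_bytes = "\u00e9".encode("utf-8")      # two bytes: 0xC3 0xA9

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
a = latin1_bytes.decode("ascii", errors="surrogateescape")
b = utf8_bytes.decode("ascii", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = a.encode("ascii", errors="surrogateescape")

# Concatenation raises no error, but the result mixes two encodings.
mixed = (a + b).encode("ascii", errors="surrogateescape")

try:
    mixed.decode("utf-8")
    mixed_is_valid_utf8 = True
except UnicodeDecodeError:
    mixed_is_valid_utf8 = False
```

Read back as Latin-1, `mixed` shows the second character as two unrelated characters, which is the mojibake described above.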
msg241519 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-04-19 15:33
Well, previously our answer has been "you have to understand unicode". If we are going to change that, it probably needs a python-dev discussion. But like I said, providing the tools to make it possible to easily do this, just not as a default, seems like mostly a no-brainer. It's making it the default that is controversial, IMO. (Call me -0.5 at this point in the discussion, as regards making it the default).
msg241564 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-04-19 20:30
Related issues and discussions:
- [Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3? https://mail.python.org/pipermail/python-dev/2011-June/112086.html
- Issue #12451: open: avoid the locale encoding when possible https://bugs.python.org/issue12451
msg242000 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-04-25 07:15
If a Linux distro is using systemd (which is essentially all recent versions of popular distros, including RHEL/CentOS, although it won't land in Ubuntu LTS until 16.04), then cron jobs and service daemons will get their locale set properly based on the contents of /etc/locale.conf. Thus "use an init system that reliably sets the locale correctly for cron jobs and service daemons" is the correct fix for this problem.

Unfortunately, there are still an awful lot of Linux systems out there using other init systems that don't reliably set the locale, and for those "Python 3 shouldn't be worse than Python 2" is a desirable behavioural goal here.

Thus, I think it makes sense for Python to special case the C locale by assuming it's always the wrong setting, and thus surrogateescape is going to be needed on all system interfaces. While it won't be a perfect fix, at least we'll be able to roundtrip data within the system appropriately, even if it still gets corrupted in the face of encoding conversions.
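The round-trip property, and its limit, can be sketched with explicit arguments to open(). This assumes an ASCII encoding to mimic the C locale; the file names are placeholders.

```python
import os
import tempfile

raw = b"caf\xe9\n"  # Latin-1 bytes, undecodable as ASCII

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in.txt")
    dst = os.path.join(tmp, "out.txt")
    with open(src, "wb") as f:
        f.write(raw)

    # newline="" disables newline translation so the bytes stay identical.
    with open(src, encoding="ascii", errors="surrogateescape", newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="ascii", errors="surrogateescape", newline="") as f:
        f.write(text)
    with open(dst, "rb") as f:
        round_tripped = f.read()

# Converting to a different encoding with the default strict handler fails,
# because lone surrogates cannot be encoded.
try:
    text.encode("utf-8")
    conversion_failed = False
except UnicodeEncodeError:
    conversion_failed = True
```

Data read and written with the same encoding and handler survives unchanged; the trouble starts only when the escaped string crosses into a different encoding.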
msg244051 - (view)	Author: STINNER Victor (vstinner) *	Date: 2015-05-25 22:32
Without strong support, I don't want to put this in Python 3.5. It's too late (we reached the feature freeze). For Python 3.6, we may experiment with using UTF-8 for the Python filesystem encoding when the LC_CTYPE locale is POSIX ("C").
msg249440 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2015-09-01 00:03
I found this discussion again while looking for issue #19977 to reference from issue #24968. "fixed" wasn't the right resolution, so I've moved it to "postponed" - the SSH locale forwarding problem highlighted again in #24968 means I think there's a discussion worth having about reading /etc/locale.conf when it's available, rather than always trusting the glibc locale settings.

History
Date	User	Action	Args
2022-04-11 14:58:15	admin	set	github: 68181
2015-09-01 00:03:33	ncoghlan	set	resolution: fixed -> postponed messages: + msg249440
2015-05-25 22:33:08	vstinner	set	status: open -> closed resolution: fixed
2015-05-25 22:32:58	vstinner	set	messages: + msg244051
2015-04-25 07:15:07	ncoghlan	set	messages: + msg242000
2015-04-19 20:30:44	vstinner	set	messages: + msg241564
2015-04-19 15:33:43	r.david.murray	set	messages: + msg241519
2015-04-18 22:29:23	vstinner	set	messages: + msg241460
2015-04-18 19:01:09	r.david.murray	set	messages: + msg241443
2015-04-18 15:03:07	vstinner	set	messages: + msg241419
2015-04-18 15:01:21	vstinner	set	title: Use surrogateescape error handler by default in open() if the locale is C -> Use surrogateescape error handler by default in open() if the LC_CTYPE locale is C at startup
2015-04-18 15:00:04	vstinner	set	messages: + msg241418
2015-04-18 14:51:10	vstinner	set	files: + default_error_handler-2.patch messages: + msg241417
2015-04-18 14:30:46	r.david.murray	set	nosy: + r.david.murray messages: + msg241416
2015-04-18 09:26:16	vstinner	set	messages: + msg241406
2015-04-18 09:25:51	vstinner	create