Author vstinner
Recipients Sworddragon, a.badger, ezio.melotti, loewis, ncoghlan, vstinner
Date 2013-12-13.16:40:19
Message-id <1386952820.2.0.77364136667.issue19977@psf.upfronthosting.co.za>
In-reply-to
Content
When LANG=C is used to get English-language messages (which is a mistake: LC_MESSAGES=C should be used instead), or when Python is started with an empty environment (no environment variables), Python gets the POSIX locale (aka "C locale") for the LC_CTYPE (encoding) locale category.

Standard streams use the locale encoding, which is ASCII with the POSIX locale on most platforms (AIX is an exception: it uses ISO 8859-1). In this case, data read from the OS (environment variables, command line arguments, filenames, etc.) may contain surrogate characters, because Python decodes such data internally with the surrogateescape error handler (see PEP 383 for the rationale).
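
As a minimal sketch of what the surrogateescape handler does (the filename bytes below are a made-up example, not taken from a real directory), undecodable bytes become lone surrogates and can be turned back into the original bytes:

```python
# Hypothetical filename bytes: "café.txt" encoded as UTF-8.
raw = b"caf\xc3\xa9.txt"

# Decoding as ASCII with surrogateescape maps each undecodable byte 0xNN
# to the lone surrogate U+DCNN instead of raising an error.
name = raw.decode("ascii", "surrogateescape")
assert name == "caf\udcc3\udca9.txt"

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("ascii", "surrogateescape")
assert roundtrip == raw
```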

The problem is that standard output uses the strict error handler, and so print() fails to display OS data like filenames.

Example, "ls" command in Python:
---
import os
for name in sorted(os.listdir()):
    print(name)
---

Try it with "LANG=C python ls.py" in a directory containing filenames with non-ASCII characters and you will get a UnicodeEncodeError.
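
The failure can also be reproduced without touching the locale. The sketch below uses an in-memory stream as a stand-in for sys.stdout under the POSIX locale (ASCII encoding, strict error handler); the filename and the stand-in stream are assumptions for illustration only:

```python
import io

# Hypothetical filename as Python would decode it under the POSIX locale.
name = b"caf\xc3\xa9.txt".decode("ascii", "surrogateescape")

# Stand-in for sys.stdout: ASCII encoding with the strict error handler.
strict_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii", errors="strict")

try:
    print(name, file=strict_out)
    strict_out.flush()
    failed = False
except UnicodeEncodeError:
    # The lone surrogates cannot be encoded to ASCII in strict mode.
    failed = True
```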

Issues #19846 and #19847 are examples of this annoyance.

I propose also using the surrogateescape error handler for sys.stdout when the POSIX locale is used for LC_CTYPE at startup. The attached patch implements this idea.

With the patch, "LANG=C python ls.py" works almost as if filenames and stdout were byte streams, even though the Unicode type is used.
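
The intended effect can be sketched with an in-memory stream. This is not the attached patch, only an illustration under the assumption that stdout behaves like an ASCII text stream using surrogateescape; the filename is again made up:

```python
import io

# Hypothetical filename bytes and their decoded form under the POSIX locale.
raw = b"caf\xc3\xa9.txt"
name = raw.decode("ascii", "surrogateescape")

# Stand-in for sys.stdout with the proposed error handler.
# newline="\n" disables newline translation so the output is predictable.
buffer = io.BytesIO()
out = io.TextIOWrapper(buffer, encoding="ascii",
                       errors="surrogateescape", newline="\n")

print(name, file=out)
out.flush()

# The surrogates are written back as the original bytes.
written = buffer.getvalue()
assert written == raw + b"\n"
```

In Python 3.7 and newer, a similar effect can be obtained on the real stream with sys.stdout.reconfigure(errors="surrogateescape").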
History
Date                 User      Action  Args
2013-12-13 16:40:20  vstinner  set     recipients: + vstinner, loewis, ncoghlan, ezio.melotti, a.badger, Sworddragon
2013-12-13 16:40:20  vstinner  set     messageid: <1386952820.2.0.77364136667.issue19977@psf.upfronthosting.co.za>
2013-12-13 16:40:20  vstinner  link    issue19977 messages
2013-12-13 16:40:19  vstinner  create