Author ncoghlan
Recipients Arfrever, a.badger, abadger1999, benjamin.peterson, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date 2013-08-22.14:42:04
Note that the specific case I'm really interested in is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as restricting it to sys.stdout when the codec used matches the default filesystem encoding.

To double check the current behaviour, I created a directory to tinker with this. Filenames were created with the following:

>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
>>> open(b"\xd0\xd1\xd2\xd3", "w")

That last call creates a filename that is not valid UTF-8. "ls" actually degrades less gracefully than I thought, and just prints question marks for the bad file:

$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ

Python 2 & 3 both work OK if you just print the directory listing directly, since repr() happily displays the surrogate escaped string:

$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3', '\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']
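
The escaped entry in the Python 3 listing is what the surrogateescape error handler (PEP 383) produces: each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN. A minimal sketch of that round trip, assuming UTF-8 is named explicitly instead of relying on the filesystem encoding (the variable names are purely illustrative):

```python
# Bytes of the invalid filename created above.
raw_name = b"\xd0\xd1\xd2\xd3"

# Decode the way a UTF-8 configured Python 3 decodes filenames:
# undecodable bytes become lone surrogates instead of raising.
decoded = raw_name.decode("utf-8", errors="surrogateescape")
print(ascii(decoded))  # '\udcd0\udcd1\udcd2\udcd3'

# Encoding with the same handler restores the original bytes exactly,
# which is why the name can still be passed back to open().
round_tripped = decoded.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw_name)  # True
```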

Where it falls down is when you try to print the strings directly in Python 3:

$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed
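
The same failure can be reproduced without touching the filesystem, which may help when deciding how to handle it. The sketch below assumes the surrogate-escaped name from the listing above and compares the strict handler (the default for str.encode) with two lossy but always encodable alternatives; it illustrates the available options rather than recommending one:

```python
# What os.listdir() returned for the badly encoded file.
name = "\udcd0\udcd1\udcd2\udcd3"

# Strict encoding refuses lone surrogates, as in the traceback above.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)  # strict: surrogates not allowed

# Lossy alternatives that never raise.
escaped = name.encode("utf-8", errors="backslashreplace")
replaced = name.encode("utf-8", errors="replace")
print(escaped)   # b'\\udcd0\\udcd1\\udcd2\\udcd3'
print(replaced)  # b'????'
```

The "replace" result matches what "ls" displayed for the bad file, while "backslashreplace" keeps enough information to identify the offending bytes.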

Setting the IO encoding, by contrast, produces behaviour closer to that of the native tools:

$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"

On the other hand, setting PYTHONIOENCODING as shown provides an environmental workaround, and an improved programmatic workaround should also become available (which tools could use to configure surrogateescape by default).
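
As an illustration of what a programmatic equivalent of the PYTHONIOENCODING setting could look like, here is a minimal sketch that wraps a binary stream in an io.TextIOWrapper using the surrogateescape handler. The helper name make_escaping_writer is hypothetical, and io.BytesIO stands in for sys.stdout.buffer so the result can be inspected; this is not necessarily the mechanism referred to above.

```python
import io


def make_escaping_writer(binary_stream, encoding="utf-8"):
    """Hypothetical helper: text writer that turns surrogate-escaped
    characters back into the original bytes instead of raising."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="surrogateescape",
        newline="\n",  # keep line endings identical on every platform
    )


# BytesIO is a stand-in for sys.stdout.buffer in this sketch.
fake_stdout = io.BytesIO()
writer = make_escaping_writer(fake_stdout)

print("\udcd0\udcd1\udcd2\udcd3", file=writer)
writer.flush()

written = fake_stdout.getvalue()
print(written)  # b'\xd0\xd1\xd2\xd3\n'
```

In a real script the same wrapper could be applied to sys.stdout.buffer. On Python 3.7 and later, sys.stdout.reconfigure(errors="surrogateescape") achieves a similar effect without rewrapping the stream.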

So perhaps pursuing #15216 further would be a better approach than selectively changing the default behaviour? And better documentation for ways to handle the surrogate escape error when it arises?
Date | User | Action | Args
2013-08-22 14:42:05 | ncoghlan | set | recipients: + ncoghlan, lemburg, pitrou, vstinner, abadger1999, benjamin.peterson, ezio.melotti, a.badger, Arfrever, r.david.murray, serhiy.storchaka
2013-08-22 14:42:05 | ncoghlan | set | messageid: <>
2013-08-22 14:42:05 | ncoghlan | link | issue18713 messages
2013-08-22 14:42:04 | ncoghlan | create |