Author ncoghlan
Recipients Arfrever, a.badger, abadger1999, benjamin.peterson, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date 2013-08-22.14:42:04
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1377182525.34.0.504141141625.issue18713@psf.upfronthosting.co.za>
In-reply-to
Content
Note that the specific case I'm really interested is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as restricting it to sys.stdout when the codec used matches the default filesystem encoding.

To double check the current behaviour, I created a directory to tinker with this. Filenames were created with the following:

>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
'ÐÑÒÓ'
>>> open(b"\xd0\xd1\xd2\xd3", "w")

That last generates an invalid UTF-8 filename. "ls" actually degrades less gracefully than I thought, and just prints question marks for the bad file:

$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ

Python 2 & 3 both work OK if you just print the directory listing directly, since repr() happily displays the surrogate escaped string:

$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3', '\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']

Where it falls down is when you try to print the strings directly in Python 3:

$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed

While setting the IO encoding produces behaviour closer to that of the native tools:
$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
����
ℙƴ☂ℌøἤ

On the other hand, setting PYTHONIOENCODING as shown provides an environmental workaround, and http://bugs.python.org/issue15216 will provide an improved programmatic workaround (which tools like http://code.google.com/p/pyp/ could use to configure surrogateescape by default).

So perhaps pursuing #15216 further would be a better approach than selectively changing the default behaviour? And better documentation for ways to handle the surrogate escape error when it arises?
History
Date User Action Args
2013-08-22 14:42:05ncoghlansetrecipients: + ncoghlan, lemburg, pitrou, vstinner, abadger1999, benjamin.peterson, ezio.melotti, a.badger, Arfrever, r.david.murray, serhiy.storchaka
2013-08-22 14:42:05ncoghlansetmessageid: <1377182525.34.0.504141141625.issue18713@psf.upfronthosting.co.za>
2013-08-22 14:42:05ncoghlanlinkissue18713 messages
2013-08-22 14:42:04ncoghlancreate