Message195897
Note that the specific case I'm really interested is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as restricting it to sys.stdout when the codec used matches the default filesystem encoding.
To double check the current behaviour, I created a directory to tinker with this. Filenames were created with the following:
>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
'ÐÑÒÓ'
>>> open(b"\xd0\xd1\xd2\xd3", "w")
That last generates an invalid UTF-8 filename. "ls" actually degrades less gracefully than I thought, and just prints question marks for the bad file:
$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ
Python 2 & 3 both work OK if you just print the directory listing directly, since repr() happily displays the surrogate escaped string:
$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3', '\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']
Where it falls down is when you try to print the strings directly in Python 3:
$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed
While setting the IO encoding produces behaviour closer to that of the native tools:
$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
����
ℙƴ☂ℌøἤ
On the other hand, setting PYTHONIOENCODING as shown provides an environmental workaround, and http://bugs.python.org/issue15216 will provide an improved programmatic workaround (which tools like http://code.google.com/p/pyp/ could use to configure surrogateescape by default).
So perhaps pursuing #15216 further would be a better approach than selectively changing the default behaviour? And better documentation for ways to handle the surrogate escape error when it arises? |
|
Date |
User |
Action |
Args |
2013-08-22 14:42:05 | ncoghlan | set | recipients:
+ ncoghlan, lemburg, pitrou, vstinner, abadger1999, benjamin.peterson, ezio.melotti, a.badger, Arfrever, r.david.murray, serhiy.storchaka |
2013-08-22 14:42:05 | ncoghlan | set | messageid: <1377182525.34.0.504141141625.issue18713@psf.upfronthosting.co.za> |
2013-08-22 14:42:05 | ncoghlan | link | issue18713 messages |
2013-08-22 14:42:04 | ncoghlan | create | |
|