Message 195897 - Python tracker


Author	ncoghlan
Recipients	Arfrever, a.badger, abadger1999, benjamin.peterson, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, vstinner
Date	2013-08-22.14:42:04
Message-id	<1377182525.34.0.504141141625.issue18713@psf.upfronthosting.co.za>
In-reply-to

Content

Note that the specific case I'm really interested in is printing on systems that are properly configured to use UTF-8, but are getting bad metadata from an OS API. I'm OK with the idea of *only* changing it for UTF-8 rather than for arbitrary encodings, as well as restricting it to sys.stdout when the codec used matches the default filesystem encoding.

To double check the current behaviour, I created a directory to tinker with this. Filenames were created with the following:

>>> open("ℙƴ☂ℌøἤ".encode("utf-8"), "w")
>>> open("basic_ascii".encode("utf-8"), "w")
>>> b"\xd0\xd1\xd2\xd3".decode("latin-1")
'ÐÑÒÓ'
>>> open(b"\xd0\xd1\xd2\xd3", "w")

That last call creates a filename that is not valid UTF-8. "ls" actually degrades less gracefully than I thought, and just prints question marks for the bad file:

$ ls -l
total 0
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:04 ????
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 basic_ascii
-rw-rw-r--. 1 ncoghlan ncoghlan 0 Aug 23 00:01 ℙƴ☂ℌøἤ

Python 2 & 3 both work OK if you just print the directory listing directly, since repr() happily displays the raw bytes (Python 2) or the surrogate-escaped string (Python 3):

$ python -c "import os; print(os.listdir('.'))"
['basic_ascii', '\xd0\xd1\xd2\xd3', '\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4']
$ python3 -c "import os; print(os.listdir('.'))"
['basic_ascii', '\udcd0\udcd1\udcd2\udcd3', 'ℙƴ☂ℌøἤ']
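The '\udcd0\udcd1\udcd2\udcd3' entry in the Python 3 listing is what the surrogateescape error handler produces for the undecodable bytes. A minimal sketch of that mapping, assuming UTF-8 is the filesystem encoding (the variable names are illustrative only):

```python
# The invalid UTF-8 filename bytes used in the experiment above.
raw = b"\xd0\xd1\xd2\xd3"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw.decode("utf-8", "surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly,
# which is why the name can still be passed back to OS APIs.
roundtrip = name.encode("utf-8", "surrogateescape")
print(roundtrip == raw)
```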

Where it falls down is when you try to print the strings directly in Python 3:

$ python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "<string>", line 1, in <listcomp>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd0' in position 0: surrogates not allowed
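The failure does not need a terminal or a real file to reproduce. As a rough sketch, encoding the surrogate-escaped name with the default strict error handler raises the same exception that print() hits when sys.stdout uses strict UTF-8:

```python
# The surrogate-escaped name as returned by os.listdir() in Python 3.
name = "\udcd0\udcd1\udcd2\udcd3"

error = None
try:
    # errors="strict" is the default; lone surrogates are rejected by UTF-8.
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    print(exc.encoding, exc.start, exc.reason)
```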

While setting the IO encoding produces behaviour closer to that of the native tools:

$ PYTHONIOENCODING=utf-8:surrogateescape python3 -c "import os; [print(fname) for fname in os.listdir('.')]"
basic_ascii
����
ℙƴ☂ℌøἤ

On the other hand, setting PYTHONIOENCODING as shown provides an environmental workaround, and http://bugs.python.org/issue15216 will provide an improved programmatic workaround (which tools like http://code.google.com/p/pyp/ could use to configure surrogateescape by default).
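Until something like #15216 is available, one programmatic approximation is to wrap a binary stream in a new io.TextIOWrapper that uses surrogateescape. The sketch below is an assumption about how a tool might do this, not an agreed API; make_escaping_writer is a hypothetical helper, and an in-memory buffer stands in for sys.stdout.buffer so the result can be inspected:

```python
import io


def make_escaping_writer(binary_stream, encoding="utf-8"):
    """Hypothetical helper: wrap a binary stream in a text layer that
    writes surrogate-escaped characters back out as their original bytes."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="surrogateescape",
        newline="\n",  # no newline translation, so the bytes are predictable
        line_buffering=True,
    )


# Stand-in for sys.stdout.buffer.
buffer = io.BytesIO()
writer = make_escaping_writer(buffer)

print("\udcd0\udcd1\udcd2\udcd3", file=writer)
writer.flush()

written = buffer.getvalue()
print(written)
```

In a real tool the same helper could presumably be applied as sys.stdout = make_escaping_writer(sys.stdout.buffer). Python 3.7 and later also offer sys.stdout.reconfigure(errors="surrogateescape"), which avoids replacing the stream object.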

So perhaps pursuing #15216 further, together with better documentation of ways to handle the surrogate escape error when it arises, would be a better approach than selectively changing the default behaviour?
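As one example of what such documentation might show, an application that only wants to display names (not round-trip them) can convert the escaped bytes to U+FFFD before printing. This is a sketch under the assumption that the names came from a UTF-8 filesystem; printable_name is a hypothetical helper, not an existing API:

```python
def printable_name(name, encoding="utf-8"):
    """Hypothetical helper: return a version of name that is safe to print.

    Surrogate-escaped characters are turned back into their original bytes,
    then decoded again with "replace" so invalid bytes become U+FFFD.
    """
    raw = name.encode(encoding, "surrogateescape")
    return raw.decode(encoding, "replace")


for fname in ["basic_ascii", "\udcd0\udcd1\udcd2\udcd3", "ℙƴ☂ℌøἤ"]:
    print(printable_name(fname))
```

The result is for display only: the original surrogate-escaped string is still the one to pass back to open() or os.stat().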
