Author vstinner
Recipients a.badger, abadger1999, benjamin.peterson, ezio.melotti, lemburg, ncoghlan, pitrou, r.david.murray, vstinner
Date 2013-08-21.10:38:52
Message-id <CAMpsgwZSk2k41uRgph3y9fF3jc75Wbss+9wz3wqa2ADHRHoP0A@mail.gmail.com>
In-reply-to <1377078267.22.0.222957122817.issue18713@psf.upfronthosting.co.za>
Content
Currently, Python 3 fails miserably when it reads a non-ASCII
character from stdin that it cannot decode, or when it tries to write
a byte encoded as a Unicode surrogate to stdout.

It works fine when OS data can be decoded from and encoded to the
locale encoding. Example on Linux with UTF-8 data and UTF-8 locale
encoding:

$ mkdir test
$ cd test
$ touch héhé.txt
$ ls
héhé.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
$ echo "héhé"|python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
héhé

It fails miserably when OS data cannot be decoded from or encoded to
the locale encoding. Example on Linux with UTF-8 data and ASCII locale
encoding:

$ mkdir test
$ cd test
$ touch héhé.txt
$ export LANG=  # switch to ASCII locale encoding
$ ls
h??h??.txt
$ python3 -c 'import os; print(", ".join(os.listdir()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)

$ echo "héhé"|LANG= python3 -c 'import sys; sys.stdout.write(sys.stdin.read())'|cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/vstinner/prog/python/default/Lib/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

The ls output is not the expected "héhé" string, but that is an issue
with the console output, not with the ls program: ls just writes raw
bytes to stdout:

$ ls|hexdump -C
00000000  68 c3 a9 68 c3 a9 2e 74  78 74 0a                 |h..h...txt.|
0000000b

("héhé" encoded to UTF-8 gives b'h\xc3\xa9h\xc3\xa9')
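
The two tracebacks above can be reproduced without touching the locale.
The following is only a sketch: it assumes that os.listdir() decodes
file names with the surrogateescape error handler (PEP 383), and it uses
the codecs directly instead of the real standard streams.

```python
# UTF-8 bytes of the file name, as shown by hexdump.
raw_name = b"h\xc3\xa9h\xc3\xa9.txt"

# stdin case: a strict ASCII decoder rejects the first non-ASCII byte.
try:
    raw_name.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# os.listdir() case (assumption: names are decoded with surrogateescape),
# so each undecodable byte becomes a lone surrogate character.
name = raw_name.decode("ascii", errors="surrogateescape")

# stdout case: a strict ASCII encoder rejects those surrogates.
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

# Encoding with surrogateescape gives the original bytes back.
round_trip = name.encode("ascii", errors="surrogateescape")
```

The error positions match the tracebacks: byte 0xc3 at position 1 for
decoding, characters 1-2 for encoding.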

I agree that we can do something to improve the situation on standard
streams, but only on standard streams. It is already possible to
work around the issue by forcing the surrogateescape error handler on
stdout:

$ LANG= PYTHONIOENCODING=utf-8:surrogateescape python3 -c 'import os; print(", ".join(os.listdir()))'
héhé.txt
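
Why this works can be sketched with an in-memory stream standing in for
stdout. This is an illustration, not what the interpreter does
internally: io.BytesIO replaces the real file descriptor, and the file
name is built by hand rather than read from os.listdir().

```python
import io

# File name as seen by Python under an ASCII locale: the UTF-8 bytes
# are decoded with surrogateescape and become lone surrogates.
raw_name = b"h\xc3\xa9h\xc3\xa9.txt"
name = raw_name.decode("ascii", errors="surrogateescape")

# Stand-in for stdout with PYTHONIOENCODING=utf-8:surrogateescape.
buffer = io.BytesIO()
out = io.TextIOWrapper(buffer, encoding="utf-8",
                       errors="surrogateescape", newline="\n")
out.write(name + "\n")
out.flush()

# The surrogates are turned back into the original bytes.
written = buffer.getvalue()
```

The bytes written are the same ones ls produced, so a UTF-8 terminal
displays the name correctly.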

Something similar can be done in Python. For example,
test.regrtest reopens sys.stdout to set the error handler to
"backslashreplace". Excerpt from the replace_stdout() function:

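    import sys

    # Keep the original stream; the new wrapper shares its file descriptor.
    stdout = sys.stdout
    sys.stdout = open(stdout.fileno(), 'w',
        encoding=sys.stdout.encoding,
        errors="backslashreplace",
        closefd=False,
        newline='\n')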
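
The effect of that handler can be checked on an ordinary file instead of
the real stdout. The sketch below is an assumption-laden illustration:
reopen_with_errors() is a hypothetical helper, not part of regrtest, and
a temporary file opened with the ASCII encoding stands in for stdout
under an ASCII locale.

```python
import tempfile


def reopen_with_errors(stream, errors):
    """Hypothetical helper: new text wrapper on the same file descriptor."""
    return open(stream.fileno(), "w",
                encoding=stream.encoding,
                errors=errors,
                closefd=False,
                newline="\n")


# File name containing lone surrogates, as os.listdir() would return it.
name = b"h\xc3\xa9h\xc3\xa9.txt".decode("ascii", errors="surrogateescape")

with tempfile.TemporaryFile("w+", encoding="ascii") as stream:
    safe = reopen_with_errors(stream, "backslashreplace")
    safe.write(name + "\n")
    safe.close()  # flushes; closefd=False keeps the descriptor open
    stream.seek(0)
    escaped = stream.read()
```

Unlike surrogateescape, backslashreplace does not restore the original
bytes: it never raises UnicodeEncodeError, but the unencodable
characters are written as \udcXX escape sequences.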