Message 205727 - Python tracker

Author	a.badger
Recipients	Sworddragon, a.badger, bkabrda, larry, lemburg, loewis, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, terry.reedy, vstinner
Date	2013-12-09.18:50:38
Message-id	<1386615038.62.0.477150221488.issue19846@psf.upfronthosting.co.za>

Content
Ahh... added to the nosy list and bug closed all before I got up for the day ;-)

A few words:

I do think that python is broken here.

I do not think that translating everything to utf-8 if ascii is the locale's encoding is the solution.

As I would state it, the problem is that python's boundary with the OS is not yet uniform.  If you set LC_ALL=C then python will still *read* non-ascii data from the OS through some interfaces, but it won't output it back to the OS.  (Note that LC_ALL=C is just one of multiple ways to break things.  For instance, LC_ALL=en_US.utf8 when dealing with latin-1 data will also break.)  For example:

$ mkdir unicode && cd unicode
$ python3 -c 'open("ñ.txt".encode("latin-1"), "w").close()'
$ LC_ALL=en_US.utf8 python3
>>> import os
>>> dir_listing = os.listdir('.')
>>> for entry in dir_listing: print(entry)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf1' in position 0: surrogates not allowed
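
The same asymmetry can be reproduced without touching the filesystem or the locale. The following is a minimal sketch, assuming a utf-8 locale encoding: it decodes the latin-1 bytes the way the filename interfaces do (with the surrogateescape error handler) and then tries to encode the result strictly, which is roughly what print() ends up doing. The variable names are only for illustration.

```python
# Bytes of the filename as written to disk in latin-1.
raw_name = "ñ.txt".encode("latin-1")  # b'\xf1.txt'

# Filename interfaces decode undecodable bytes with surrogateescape,
# so the invalid byte 0xf1 becomes the lone surrogate U+DCF1.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# A strict encode (no surrogateescape) rejects the lone surrogate.
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with surrogateescape restores the original bytes exactly.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")
```

The string read from the OS is only usable on the way back out if the output side also applies surrogateescape.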

Note that currently, input() and sys.stdin.read() won't read undecodable data, so this is somewhat symmetrical.  Still, it seems to me that saying "everything that interfaces with the OS except the standard streams will use surrogateescape on undecodable bytes" is drawing a line in an unintuitive location.
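
To illustrate where that line sits, here is a rough sketch that models a standard stream with an in-memory buffer instead of the real sys.stdin and sys.stdout. It assumes a utf-8 stream encoding; the stream objects and variable names are hypothetical stand-ins, not how the interpreter actually sets up its streams.

```python
import io

# One latin-1 byte that is not valid utf-8 on its own.
raw = "ñ".encode("latin-1")  # b'\xf1'

# A strict text stream refuses to read the byte at all.
strict_in = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", errors="strict")
try:
    strict_in.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With surrogateescape, the byte is smuggled through as U+DCF1.
lenient_in = io.TextIOWrapper(
    io.BytesIO(raw), encoding="utf-8", errors="surrogateescape"
)
text = lenient_in.read()

# An output stream using the same handler writes the original byte back.
sink = io.BytesIO()
lenient_out = io.TextIOWrapper(sink, encoding="utf-8", errors="surrogateescape")
lenient_out.write(text)
lenient_out.flush()
written = sink.getvalue()
```

With the same error handler on both sides the data passes through unchanged; with strict streams it is rejected in both directions.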

(A further note to serhiy.storchaka: your examples are not showing anything broken in other programs.  xterm is refusing both input and output that is non-ascii.  This is symmetric behaviour.  ls is doing its best to display a *human-readable* representation of bytes that it cannot convert in the current encoding.  It also provides the -b switch to see the octal values if you actually care.  Think of this like opening a binary file in less or another pager.)

(Further note for haypo -- On Fedora, the default of en_US is utf8, not ISO8859-1.)
