Message 205545 - Python tracker

Author	ncoghlan
Recipients	Sworddragon, larry, lemburg, loewis, ncoghlan, pitrou, r.david.murray, terry.reedy, vstinner
Date	2013-12-08.11:16:01
Message-id	<1386501362.54.0.660344740078.issue19846@psf.upfronthosting.co.za>

Content
Yes, that's the point. *Every* case I've seen where the locale encoding was reported as ASCII on a modern Linux system was because the environment had been configured to use the C locale, and that locale has a silly, antiquated encoding setting.
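For anyone wanting to check what their own environment reports, here is a minimal sketch. The helper name `report_encodings` is made up for illustration; the calls it wraps are standard library functions. The values depend entirely on the environment and interpreter version, so no expected output is shown.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python derived from the current environment.

    Hypothetical helper for illustration only.
    """
    return {
        # Encoding implied by the active locale (LANG / LC_ALL / LC_CTYPE).
        "locale": locale.getpreferredencoding(False),
        # Encoding Python uses to decode filenames, argv and environ.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of the standard output stream, if it has one.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

Running it once normally and once with `LANG=C` exported in the shell is a quick way to see whether a given setup is affected.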

This is particularly problematic when people remotely access a system with ssh and are given the C locale instead of something sensible, and then can't properly read the filesystem on that server.

The idea of using UTF-8 instead in that case is to *change* (and hopefully reduce) the number of cases where things go wrong.

- if no non-ASCII data is encountered, the choice of ASCII vs UTF-8 doesn't matter
- if it's a modern Linux distro, then the real filesystem encoding is UTF-8, and the setting it provides for LANG=C is just plain *wrong*
- there may be other cases where ASCII actually *is* the filesystem encoding (in which case they're going to have trouble anyway), or the real filesystem encoding is something other than UTF-8
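The three cases above can be sketched at the codec level, without touching a real filesystem. This is only an illustration: the filenames and the helper `decode_filename` are made up, and it assumes the `surrogateescape` error handler, which is what Python 3 uses when decoding filenames on POSIX systems.

```python
def decode_filename(raw, encoding):
    """Decode filename bytes the way Python 3 does on POSIX (hypothetical helper)."""
    return raw.decode(encoding, errors="surrogateescape")


# Case 1: pure ASCII data is byte-for-byte identical under both codecs.
plain = "report.txt"
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# Case 2: the bytes on disk are UTF-8, but the locale claims ASCII.
raw_utf8 = "café.txt".encode("utf-8")           # b'caf\xc3\xa9.txt'
as_utf8 = decode_filename(raw_utf8, "utf-8")    # readable name
as_ascii = decode_filename(raw_utf8, "ascii")   # lone surrogates replace the é

# The escaped name still round-trips to the original bytes, so the file can be
# opened, but it cannot be encoded strictly (for example when printing it).
round_trips = as_ascii.encode("ascii", errors="surrogateescape") == raw_utf8
try:
    as_ascii.encode("utf-8")
    printable = True
except UnicodeEncodeError:
    printable = False

# Case 3: the real encoding is something else (Latin-1 here); assuming UTF-8
# is also wrong, but the bytes are still preserved via surrogate escapes.
raw_latin1 = "café.txt".encode("latin-1")       # b'caf\xe9.txt'
guessed = decode_filename(raw_latin1, "utf-8")
still_round_trips = guessed.encode("utf-8", errors="surrogateescape") == raw_latin1

if __name__ == "__main__":
    print(same_bytes, repr(as_utf8), repr(as_ascii), printable, repr(guessed))
```

In the second case, assuming UTF-8 gives the readable name, while assuming ASCII gives a string that works for file access but fails as soon as it has to be displayed or written out strictly.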

We're already approximating things on Linux by assuming every filesystem uses the *same* encoding, when that's not necessarily the case. GLib applications also assume UTF-8, regardless of the locale (http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux).

At the moment, setting "LANG=C" on a Linux system *fundamentally breaks Python 3*, and that's not OK.
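One way to observe the effect of "LANG=C" is to start a child interpreter with that setting and ask which filesystem encoding it picked. This is a rough sketch: the function name `filesystem_encoding_under` is made up, and it assumes an interpreter recent enough to support `capture_output` in `subprocess.run`. The answer depends on the interpreter version and platform, so no expected output is shown.

```python
import os
import subprocess
import sys

# Variables that would override or mask the effect of LANG (illustrative list).
_OVERRIDES = ("LANG", "PYTHONUTF8", "PYTHONIOENCODING", "PYTHONCOERCECLOCALE")


def filesystem_encoding_under(lang):
    """Return sys.getfilesystemencoding() from a child interpreter run with LANG=lang."""
    env = {
        key: value
        for key, value in os.environ.items()
        if not key.startswith("LC_") and key not in _OVERRIDES
    }
    env["LANG"] = lang
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.getfilesystemencoding())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("LANG=C ->", filesystem_encoding_under("C"))
```

Stripping the `LC_*` variables matters because `LC_ALL` and `LC_CTYPE` take precedence over `LANG`, so leaving them set would hide the behaviour being tested.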
