Message 205675 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	vstinner
Recipients	Sworddragon, a.badger, bkabrda, larry, lemburg, loewis, ncoghlan, pitrou, r.david.murray, serhiy.storchaka, terry.reedy, vstinner
Date	2013-12-09.10:42:16
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1386585736.77.0.335520394132.issue19846@psf.upfronthosting.co.za>
In-reply-to

Content
I'm closing the issue as invalid, because Python 3 behaviour is correct and must not be changed. Standard streams (sys.stdin, sys.stdout, sys.stderr) uses the locale encoding. sys.stdin and sys.stdout use the strict error handler, sys.stderr uses the backslashreplace error handler. These encodings and error handlers can be overriden by the PYTHONIOENCODING. Since Python 3.3, it's possible to only set the error handler using ":errors" syntax (ex: PYTHONIOENCODING=":replace"). Python uses sys.getfilesystemencoding() to decode data from / encode data to the operating system. Example of operating system data: command line arguments, environment variables, host names, filenames, user names, etc. On Windows, Python tries to use the wide character (Unicode) API of Windows anywhere to avoid any conversion, to not loose data. The MBCS codec (ANSI code page) of Windows uses a replace error handler by default, it looses data. Try for example os.listdir() in a directory containing filenames not encodable to the ANSI code page in Python 2 (or os.listdir(b'.') in Python 3). On Mac OS X, Python always use UTF-8 for sys.getfilesystemencoding() (with the surrogateescape error handler, see the PEP 383). The locale encoding is ignored for sys.getfilesystemencoding() (the locale encoding is still used in some functions). On other operating systems... it's more complex. Python uses the locale encoding for sys.getfilesystemencoding() (with the surrogateescape error handler, see the PEP 383). For the POSIX locale (aka the "C" locale), you may get the ASCII encoding on Linux, ASCII on FreeBSD and Solaris (whereas these operating systems announce an alias of the ISO 8859-1 encoding, but use ASCII in practice), ISO 8859-1 on AIX etc. Using the locale encoding is the best choice for interoperability with other applications (which use also the locale encoding). Even if an application uses "raw bytes" (like Python 2), these bytes are still "locale aware". For example, when "raw bytes" are written to the standard output, bytes are decoded to find the appropriate character in the font of the terminal. When "raw bytes" are written into a socket to generate a HTML document (ex: listing of a directory, so a list of filenames), the web brower will decode them from them encoding announced in the HTML page. Even if the encoding is not explicit, it does still exist. Read other comments of this issue for other examples. Forcing the POSIX locale to get an user interface in english is wrong if you also expect from your application to still generate valid "raw bytes" in your "system" encoding (ISO 8859-1, ShiftJIS, UTF-8, whatever). To change the language, the correct environment variable is LC_CTYPE: use LC_CTYPE=C. Or better, use the real english locale which will probably handle better currency, numbers, etc. Example: LC_CTYPE=en_US.utf8 (on Fedora, "en_US" locale uses the ISO 8859-1 encoding).

I'm closing the issue as invalid, because Python 3 behaviour is correct and must not be changed.


Standard streams (sys.stdin, sys.stdout, sys.stderr) uses the locale encoding. sys.stdin and sys.stdout use the strict error handler, sys.stderr uses the backslashreplace error handler. These encodings and error handlers can be overriden by the PYTHONIOENCODING. Since Python 3.3, it's possible to only set the error handler using ":errors" syntax (ex: PYTHONIOENCODING=":replace").


Python uses sys.getfilesystemencoding() to decode data from / encode data to the operating system. Example of operating system data: command line arguments, environment variables, host names, filenames, user names, etc.

On Windows, Python tries to use the wide character (Unicode) API of Windows anywhere to avoid any conversion, to not loose data. The MBCS codec (ANSI code page) of Windows uses a replace error handler by default, it looses data. Try for example os.listdir() in a directory containing filenames not encodable to the ANSI code page in Python 2 (or os.listdir(b'.') in Python 3).

On Mac OS X, Python always use UTF-8 for sys.getfilesystemencoding() (with the surrogateescape error handler, see the PEP 383). The locale encoding is ignored for sys.getfilesystemencoding() (the locale encoding is still used in some functions).

On other operating systems... it's more complex. Python uses the locale encoding for sys.getfilesystemencoding() (with the surrogateescape error handler, see the PEP 383). For the POSIX locale (aka the "C" locale), you may get the ASCII encoding on Linux, ASCII on FreeBSD and Solaris (whereas these operating systems announce an alias of the ISO 8859-1 encoding, but use ASCII in practice), ISO 8859-1 on AIX etc. Using the locale encoding is the best choice for interoperability with other applications (which use also the locale encoding).

Even if an application uses "raw bytes" (like Python 2), these bytes are still "locale aware". For example, when "raw bytes" are written to the standard output, bytes are decoded to find the appropriate character in the font of the terminal. When "raw bytes" are written into a socket to generate a HTML document (ex: listing of a directory, so a list of filenames), the web brower will decode them from them encoding announced in the HTML page. Even if the encoding is not explicit, it does still exist. Read other comments of this issue for other examples.

Forcing the POSIX locale to get an user interface in english is wrong if you also expect from your application to still generate valid "raw bytes" in your "system" encoding (ISO 8859-1, ShiftJIS, UTF-8, whatever). To change the language, the correct environment variable is LC_CTYPE: use LC_CTYPE=C. Or better, use the real english locale which will probably handle better currency, numbers, etc. Example: LC_CTYPE=en_US.utf8 (on Fedora, "en_US" locale uses the ISO 8859-1 encoding).

History
Date	User	Action	Args
2013-12-09 10:42:16	vstinner	set	recipients: + vstinner, lemburg, loewis, terry.reedy, ncoghlan, pitrou, larry, a.badger, r.david.murray, Sworddragon, serhiy.storchaka, bkabrda
2013-12-09 10:42:16	vstinner	set	messageid: <1386585736.77.0.335520394132.issue19846@psf.upfronthosting.co.za>
2013-12-09 10:42:16	vstinner	link	issue19846 messages
2013-12-09 10:42:16	vstinner	create