Message 150031 - Python tracker


Author	vstinner
Recipients	benjamin.peterson, gz, pitrou, poolie, r.david.murray, vila, vstinner
Date	2011-12-21.18:27:38
Message-id	<4EF22598.6080900@haypocalc.com>
In-reply-to	<1324429950.17.0.0585650740961.issue13643@psf.upfronthosting.co.za>

Content

> Having more than one encoding on unix is already a reality, there's nothing to stop someone setting LANG=de_DE.UTF-8 and LC_MESSAGES=C say.

Nope. The locale encoding is chosen from the LC_ALL, LC_CTYPE and LANG 
variables: the first non-empty one is used. LC_MESSAGES doesn't affect 
the encoding. Example:

$ LANG=de_DE.iso88591 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('ISO-8859-1', "'Trop de fichiers ouverts dans le syst\\xe8me'")

$ LANG=de_DE.UTF-8 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('UTF-8', "'Trop de fichiers ouverts dans le syst\\xc3\\xa8me'")
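The selection rule above can be sketched in a few lines of Python. This is only an illustration of the "first non-empty variable" rule; the helper name effective_ctype_source is hypothetical, and the real lookup is done by the C library in setlocale(), not by Python code.

```python
def effective_ctype_source(env):
    """Return (name, value) of the variable that decides the locale
    encoding, or None if none of them is set to a non-empty value.

    Hypothetical helper: it only mirrors the LC_ALL > LC_CTYPE > LANG
    precedence described above.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # an empty string counts as "not set"
            return name, value
    return None


# LC_MESSAGES is never consulted, so LANG decides here.
print(effective_ctype_source(
    {"LANG": "de_DE.iso88591", "LC_MESSAGES": "fr_FR.UTF-8"}))
```

Passing os.environ instead of a literal dict shows which variable decides for the current process.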

 > The real lesson is not that having more than one encoding
 > is dangerous, but that having incompatible encodings is dangerous.

Yes, and ASCII and UTF-8 are incompatible. The ASCII codec is unable to 
decode a UTF-8 encoded string as soon as it contains a non-ASCII 
character.
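A small sketch of that incompatibility, using only the standard codecs. The helper name decodes_as_ascii is hypothetical and exists only for this example.

```python
def decodes_as_ascii(data):
    """Return True if the bytes can be decoded with the strict ASCII codec."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII text has the same bytes in both encodings...
print(decodes_as_ascii("systeme".encode("utf-8")))
# ...but a single non-ASCII character makes the ASCII decoder fail.
print(decodes_as_ascii("syst\u00e8me".encode("utf-8")))
```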

 > Expanding the filesystem default encoding to utf-8
 > should be a very narrow change, mostly just affecting io
 > and os operations.

It affects everything because filenames are used everywhere.

 > On modern systems, this problem is solved by making the
 > standard encoding UTF-8.  So it is unfortunate that, when
 > no locale is set, Python3 defaults to ascii for the filesystem.

Python doesn't invent an encoding: ASCII is the result of 
nl_langinfo(CODESET). Example:

$ python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
UTF-8
$ LANG=C python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
ANSI_X3.4-1968

 >> $ LANG=C python3 -c "import os; print(os.listdir())"
 >> ['h\udcc3\udca9h\udcc3\udca9']

 > It's possible to work around this in some cases, such as listdir,
 > by coping with the result including some byte strings, and then
 > manually decoding them.  But there are, iirc, other cases where
 > the call just fails and there is no easy workaround.

In Python 3, os.listdir(str) *CANNOT* fail because of a Unicode decode 
error, thanks to PEP 383 (the surrogateescape error handler). In 
Python 2, it works differently (it returns the raw bytes filename if 
decoding fails).
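As an illustration, the listdir() output quoted above can be reproduced with the surrogateescape error handler alone. This sketch assumes the filename on disk is the UTF-8 encoding of "héhé" and that the filesystem encoding is ASCII, as under LANG=C in the example; the variable names are only for this example.

```python
# Raw filename bytes as stored on disk (assumed: UTF-8 for "héhé").
raw = b"h\xc3\xa9h\xc3\xa9"

# Decoding with ASCII + surrogateescape never fails: each undecodable
# byte 0xNN is mapped to the lone surrogate U+DCNN.
name = raw.decode("ascii", "surrogateescape")
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly,
# so the str can still be passed back to os functions.
roundtrip = name.encode("ascii", "surrogateescape")
print(roundtrip == raw)

# With the raw bytes back, the name can be decoded with the right codec.
recovered = roundtrip.decode("utf-8")
print(ascii(recovered))
```

os.fsdecode() and os.fsencode() apply the same handler with the actual filesystem encoding.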

 > Windows and Mac have annoying bugs too, even bugs specifically
 > about Unicode.

Windows has supported Unicode since Windows 95 and has fully supported 
all Unicode characters since Windows 2000.

Mac enforces UTF-8. For example, it is not possible to *create* a 
file whose name is invalid UTF-8. It looks like it always uses UTF-8 on 
the command line.

On Linux, we cannot rely on anything except the locale encoding. We 
try to use Unicode APIs when possible (e.g. wcsftime() instead of 
strftime()), but almost all functions use byte strings and so rely on 
the locale encoding.

History
Date	User	Action	Args
2011-12-21 18:27:39	vstinner	set	recipients: + vstinner, pitrou, vila, benjamin.peterson, r.david.murray, gz, poolie
2011-12-21 18:27:38	vstinner	link	issue13643 messages
2011-12-21 18:27:38	vstinner	create