Author vstinner
Recipients benjamin.peterson, gz, pitrou, poolie, r.david.murray, vila, vstinner
Date 2011-12-21.18:27:38
Message-id <4EF22598.6080900@haypocalc.com>
In-reply-to <1324429950.17.0.0585650740961.issue13643@psf.upfronthosting.co.za>
Content
> Having more than one encoding on unix is already a reality, there's nothing to stop someone setting LANG=de_DE.UTF-8 and LC_MESSAGES=C say.

Nope. The locale encoding comes from the first non-empty variable among 
LC_ALL, LC_CTYPE and LANG, checked in that order. LC_MESSAGES doesn't 
affect the encoding. Example:

$ LANG=de_DE.iso88591 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('ISO-8859-1', "'Trop de fichiers ouverts dans le syst\\xe8me'")

$ LANG=de_DE.UTF-8 LC_MESSAGES=fr_FR.UTF-8 python -c 'import os, locale; locale.setlocale(locale.LC_ALL, ""); print(locale.getpreferredencoding(), repr(os.strerror(23)))'
('UTF-8', "'Trop de fichiers ouverts dans le syst\\xc3\\xa8me'")
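
The lookup order can be sketched in a few lines of Python. This is only an illustration of the rule, not how the C library implements it: the helper name ctype_locale() is made up, and falling back to "C" when nothing is set is an assumption (it is the usual behaviour, but POSIX leaves it to the implementation).

```python
def ctype_locale(env):
    """Return the locale name that decides the encoding.

    Hypothetical helper: picks the first non-empty variable among
    LC_ALL, LC_CTYPE and LANG. LC_MESSAGES is never consulted.
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    # Assumption: with nothing set, the "C" locale applies.
    return "C"


# Same environment as the first shell example above.
env = {"LANG": "de_DE.iso88591", "LC_MESSAGES": "fr_FR.UTF-8"}
print(ctype_locale(env))
```

Passing os.environ instead of the dict shows which variable wins in the current shell.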

 > The real lesson is not that having more than one encoding
 > is dangerous, but that having incompatible encodings is dangerous.

Yes, and ASCII and UTF-8 are incompatible: the ASCII codec is unable to 
decode a UTF-8 encoded string as soon as it contains a non-ASCII 
character.
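
A minimal sketch of the failure, using the explicit codec names so the result does not depend on the current locale. The helper decodes_as_ascii() is only there for the illustration.

```python
def decodes_as_ascii(data):
    """Return True if the bytes can be decoded with the ASCII codec."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# "système": the è (U+00E8) becomes two bytes >= 0x80 in UTF-8.
utf8_bytes = "syst\u00e8me".encode("utf-8")

print(decodes_as_ascii(b"system"))    # pure ASCII bytes
print(decodes_as_ascii(utf8_bytes))   # contains 0xc3 0xa8
```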

 > Expanding the filesystem default encoding to utf-8
 > should be a very narrow change, mostly just affecting io
 > and os operations.

It affects everything because filenames are used everywhere.

 > On modern systems, this problem is solved by making the
 > standard encoding UTF-8.  So it is unfortunate that, when
 > no locale is set, Python3 defaults to ascii for the filesystem.

Python doesn't invent an encoding: ASCII is the result of 
nl_langinfo(CODESET). Example:

$ python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
UTF-8
$ LANG=C python3 -c "import locale; print(locale.nl_langinfo(locale.CODESET))"
ANSI_X3.4-1968

 >> $ LANG=C python3 -c "import os; print(os.listdir())"
 >> ['h\udcc3\udca9h\udcc3\udca9']

 > It's possible to work around this in some cases, such as listdir,
 > by coping with the result including some byte strings, and then
 > manually decoding them.  But there are, iirc, other cases where
 > the call just fails and there is no easy workaround.

In Python 3, os.listdir(str) *CANNOT* fail because of a Unicode decode 
error, thanks to PEP 383 (the surrogateescape error handler). In 
Python 2, it works differently (it returns the raw bytes filename if 
decoding fails).
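
As a sketch of why the decode cannot fail: with the surrogateescape error handler, each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN, and encoding with the same handler restores the original bytes. The example below passes the codec and handler explicitly rather than going through os.listdir(), so it does not depend on the locale; the bytes are the UTF-8 filename from the listdir() output quoted above.

```python
# UTF-8 bytes of the filename from the LANG=C listdir() example.
raw = b"h\xc3\xa9h\xc3\xa9"

# Decoding as ASCII with surrogateescape never raises: bytes >= 0x80
# become lone surrogates in the range U+DC80..U+DCFF.
name = raw.decode("ascii", "surrogateescape")
print(ascii(name))

# Encoding with the same handler gives back the original bytes, so the
# name can still be passed to open(), os.stat(), etc.
roundtrip = name.encode("ascii", "surrogateescape")
print(roundtrip == raw)

# Once the real encoding is known, the bytes decode normally.
print(roundtrip.decode("utf-8"))
```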

 > Windows and Mac have annoying bugs too, even bugs specifically
 > about Unicode.

Windows has supported Unicode since Windows 95 and has fully supported 
all Unicode characters since Windows 2000.

Mac enforces UTF-8. For example, it is not possible to *create* a file 
whose name is not valid UTF-8. It looks like it always uses UTF-8 on 
the command line.

On Linux, we cannot rely on anything except the locale encoding. We 
try to use Unicode APIs when possible (e.g. wcsftime() instead of 
strftime()), but almost all functions use byte strings and so rely on 
the locale encoding.