classification
Title: locale documentation doesn't mention that LC_CTYPE is changed at startup
Type: behavior Stage: resolved
Components: Documentation, Unicode Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Arfrever, alexis, ezio.melotti, georg.brandl, lemburg, loewis, ned.deily, petri.lehtinen, pitrou, python-dev, r.david.murray, sdaoden, vstinner
Priority: high Keywords: patch

Created on 2009-06-05 10:56 by ned.deily, last changed 2012-06-05 23:39 by python-dev. This issue is now closed.

Files
File name Uploaded Description Edit
locale_doc.patch vstinner, 2012-06-05 12:02 review
Messages (27)
msg88932 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2009-06-05 10:56
In the Library Reference section 22.2.1 for locale, it states:

"Initially, when a program is started, the locale is the C locale, no 
matter what the user’s preferred locale is. The program must explicitly 
say that it wants the user’s preferred locale settings by calling 
setlocale(LC_ALL, '')."

This is the case for python2.x:

$ export LANG=en_US.UTF-8
$ python2.5
Python 2.5.4 (r254:67916, Feb 17 2009, 20:16:45) 
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale; locale.getlocale()
(None, None)
>>> locale.getdefaultlocale()
('en_US', 'UTF8')
>>> 

but not for 3.1:
$ python3.1
Python 3.1a1+ (py3k, Mar 23 2009, 00:12:12) 
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale; locale.getlocale()
('en_US', 'UTF8')
>>> locale.getdefaultlocale()
('en_US', 'UTF8')
>>> 

Either the code is incorrect in 3.1 or the documentation should be 
updated.
msg89016 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-06-06 21:00
Confirmed for 3.1, 3.0 still returns (None, None).
msg89077 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-06-08 13:29
Deferring to Martin which one is correct :)
msg89084 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-06-08 16:01
This is definitely a bug in 3.1, for the same reason that a C program
uses the C locale until an explicit setlocale is done: otherwise, a
non-locale-aware program can run into bugs resulting from locale issues
when run under a different locale than that of the program author.

I have a memory of this being reported before somewhere and someone
tracking it down to a change in python initialization, but I can't find
a bug report and my google-fu is failing me.
msg89088 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-08 16:17
For some reason only LC_CTYPE is affected:

>>> locale.getlocale(locale.LC_CTYPE)
('fr_FR', 'UTF8')
>>> locale.getlocale(locale.LC_MESSAGES)
(None, None)
>>> locale.getlocale(locale.LC_TIME)
(None, None)
>>> locale.getlocale(locale.LC_NUMERIC)
(None, None)
>>> locale.getlocale(locale.LC_COLLATE)
(None, None)
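
For anyone reproducing the per-category check above, a minimal sketch follows. It is not part of the original message; the helper name `startup_locale_report` is made up for illustration. It relies only on the fact that `locale.setlocale()` called without a second argument queries the current setting instead of changing it.

```python
import locale

def startup_locale_report():
    """Return the current setting of a few locale categories without changing them."""
    names = ["LC_CTYPE", "LC_COLLATE", "LC_TIME", "LC_NUMERIC", "LC_MONETARY"]
    report = {}
    for name in names:
        category = getattr(locale, name)
        # With the second argument omitted, setlocale() only queries.
        report[name] = locale.setlocale(category)
    return report

if __name__ == "__main__":
    for name, value in startup_locale_report().items():
        print(f"{name}: {value}")
```

Run in a fresh interpreter, this should show whether LC_CTYPE alone differs from "C" on a given build; the exact strings depend on the platform's C library.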
msg89089 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-06-08 16:22
Ah, I can tell you exactly why that is, then.  I noticed this in
pythonrun.c while grepping the source:

#ifdef HAVE_SETLOCALE
        /* Set up the LC_CTYPE locale, so we can obtain
           the locale's charset without having to switch
           locales. */
        setlocale(LC_CTYPE, "");
#endif

SVN blames Martin in r56922, so this case is assigned appropriately. 
Perhaps changing only LC_CTYPE is safe?  I must admit to ignorance as to
what all the LC variables mean/control.
msg89090 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-08 16:26
It would still be better if it was unset afterwards. Third-party
extensions could have LC_CTYPE-dependent behaviour.
msg89101 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-06-08 19:39
> It would still be better if it was unset afterwards. Third-party
> extensions could have LC_CTYPE-dependent behaviour.

In principle, they could, yes - but what specific behavior might that
be? What will change is character classification, which I consider
fairly harmless. Also, multi-byte conversion routines will change, which
is the primary reason for leaving it modified.
msg89102 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-06-08 19:43
> In principle, they could, yes - but what specific behavior might that
> be? What will change is character classification, which I consider
> fairly harmless. Also, multi-byte conversion routines will change, which
> is the primary reason for leaving it modified.

Ok, so I suppose we could leave the code as-is.
msg89120 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-06-08 21:51
Since it controls what is considered to be whitespace, it is possible
this will lead to subtle bugs, but I agree that it seems relatively
benign, especially considering 3.x's unicode orientation.  So, this
becomes a doc bug...
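
As a small illustration of the "unicode orientation" point (an editorial sketch, not from the original message): in Python 3, `str` classification methods consult the Unicode character database rather than the C library's LC_CTYPE tables, so their results should not vary with the locale. The helper name `isspace_under` is hypothetical. C extensions that call `isspace()` from `<ctype.h>` directly are a different matter and are not covered here.

```python
import locale

def isspace_under(locale_name, ch="\u00a0"):
    """Report str.isspace() for ch with LC_CTYPE temporarily set to locale_name."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return ch.isspace()
    finally:
        # Restore whatever was in effect before the call.
        locale.setlocale(locale.LC_CTYPE, saved)

# NO-BREAK SPACE counts as whitespace per Unicode, even in the "C" locale.
print(isspace_under("C"))
```

Passing `""` instead of `"C"` would test the user's preferred locale, but that call can raise `locale.Error` if the environment names a locale the system does not provide.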
msg89136 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-06-09 07:10
To add a little bit more analysis: posix.device_encoding requires that
the LC_CTYPE is set. Setting it just in this function would not be
possible, as setlocale is not thread-safe.

So for 3.1, it seems that Python must set LC_CTYPE. If somebody can
propose a patch that avoids that for 3.2, I'd be certainly in favor.
msg127180 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-27 11:40
> To add a little bit more analysis: posix.device_encoding requires that
> the LC_CTYPE is set. Setting it just in this function would not be
> possible, as setlocale is not thread-safe.

open() indirectly (via locale.getpreferredencoding()) changes the locale temporarily (sets LC_CTYPE to "") if the file is not a TTY (if it is a TTY, device_encoding() calls nl_langinfo(CODESET) without changing the current locale). If setlocale() is not thread-safe, we (maybe?) have a problem here. See also #11022: a report from a user who did not understand why setlocale() doesn't affect the open() (TextIOWrapper) encoding. A quick solution is to call locale.getpreferredencoding(False), which doesn't change the locale.

Do you really need os.device_encoding()? If we change TextIOWrapper to call locale.getpreferredencoding(False), os.device_encoding() and locale.getpreferredencoding(False) will give the same result. Except on Windows: os.device_encoding() uses GetConsoleCP() if fd==0 and GetConsoleOutputCP() if fd in (1, 2). But we can use GetConsoleCP() and GetConsoleOutputCP() directly in initstdio(). If someone closes sys.std* and recreates them later, os.device_encoding() can be used explicitly to keep the previous behaviour.

> It would still be better if it was unset afterwards. Third-party
> extensions could have LC_CTYPE-dependent behaviour.

If Python is embedded, it should not change the locale. Even if it is not embedded, it is maybe better to never set LC_CTYPE.

It is too late to touch such a critical point in Python 3.2, but we may change it in Python 3.3.
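
The "quick solution" mentioned in this message can be tried with a short sketch (editorial, not from the original message). It assumes only the documented `do_setlocale` parameter of `locale.getpreferredencoding()`: passing `False` reads the encoding of the current LC_CTYPE and is documented as not invoking `setlocale()`.

```python
import locale

# Query LC_CTYPE before and after, to confirm the call leaves it alone.
before = locale.setlocale(locale.LC_CTYPE)
encoding = locale.getpreferredencoding(False)  # do_setlocale=False
after = locale.setlocale(locale.LC_CTYPE)

print("encoding:", encoding)
print("LC_CTYPE unchanged:", before == after)
```

The printed encoding depends on the environment and on the Python version, so no particular value should be expected.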
msg127262 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-01-28 09:27
Python can be embedded into other applications and unconditionally
changing the locale (esp. the LC_CTYPE) is not good practice, since
it's not thread-safe and affects the entire process. An application
may have set LC_CTYPE (or the locale) to something completely
different.

If at all, Python should be more careful using this call (pseudo
code):

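/* Assumes <locale.h>, <stdlib.h>, <string.h>. Copy the results:
   a later setlocale() call may overwrite the returned string. */
char *cur = setlocale(LC_CTYPE, NULL);
char *lc_ctype = cur ? strdup(cur) : NULL;
if (lc_ctype == NULL || strcmp(lc_ctype, "") == 0 || strcmp(lc_ctype, "C") == 0) {
    char *env = setlocale(LC_CTYPE, "");       /* query the environment's setting */
    char *env_lc_ctype = env ? strdup(env) : NULL;
    if (lc_ctype != NULL)
        setlocale(LC_CTYPE, lc_ctype);         /* restore the original setting */
    free(lc_ctype);
    lc_ctype = env_lc_ctype;
}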

Then use lc_ctype to figure out encodings, etc.

While this is not thread-safe, it at least reverts the change back
to the original setting and only applies the change if needed. That's
still not optimal, but better than nothing.

A clean alternative would be adding LC_* variable parsing code to
Python to avoid the setlocale() call altogether.
msg127265 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-28 09:33
> A clean alternative would be adding LC_* variable parsing code to
> Python to avoid the setlocale() call altogether.

That would be highly non-portable, and repeat the mistakes of
getdefaultlocale.
msg127283 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2011-01-28 11:05
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> A clean alternative would be adding LC_* variable parsing code to
>> Python to avoid the setlocale() call altogether.
> 
> That would be highly non-portable, and repeat the mistakes of
> getdefaultlocale.

You say that often, but I don't really know why. It's certainly portable
between various Unix platforms, perhaps not Windows, but then i18n
on Windows is a different story altogether.

BTW: For Windows, you can adjust setlocale() to work thread-based
using: _configthreadlocale()
(http://msdn.microsoft.com/de-de/library/26c0tb7x(v=vs.80).aspx)

Perhaps we ought to expose this in _locale and use it in
getdefaultlocale() on Windows to query the locale settings
via the pseudocode I posted.
msg127347 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-28 21:22
>> That would be highly non-portable, and repeat the mistakes of
>> getdefaultlocale.
> 
> You say that often, but I don't really know why. It's certainly portable
> between various Unix platforms, perhaps not Windows, but then i18n
> on Windows is a different story altogether.

No, it's absolutely not portable across Unix platforms. Looking at
LANG or LC_ALL does *not* allow you to infer the region name, or
the locale's character set. For example, using glibc, in some
installations, /etc/locale.alias is considered to map a value of LANG
to the final locale name. As an option, glibc also considers a
LOCALE_ALIAS_PATH that may point to a (colon-separated) path of
files to search for locale aliases.

Other systems may use other databases to map a locale name to locale
properties.

Unless you know exactly what version of C library is running on
a system, parsing environment variables yourself is doomed to fail.
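
To make the objection concrete, here is a hedged sketch (editorial, not from the original message) of the kind of environment-variable parsing being discussed. The function name `naive_codeset_from_env` is made up for illustration. It follows the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and simply splits the string, which is exactly why it cannot resolve an alias-style name that carries no ".codeset" part.

```python
import locale
import os

def naive_codeset_from_env(environ=os.environ):
    """Guess the codeset from LC_ALL / LC_CTYPE / LANG by string splitting alone."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(var)
        if value:
            # "en_US.UTF-8@euro" -> "UTF-8"; an alias-style name such as
            # "german" has no ".codeset" part, so nothing can be inferred.
            _, dot, rest = value.partition(".")
            return rest.split("@")[0] if dot else None
    return None

print(naive_codeset_from_env({"LANG": "en_US.UTF-8"}))  # UTF-8
print(naive_codeset_from_env({"LANG": "german"}))       # None

# Where available, the C library itself can be asked instead.
if hasattr(locale, "nl_langinfo"):
    print(locale.nl_langinfo(locale.CODESET))
```

The last call reports the codeset of the current LC_CTYPE as the C library sees it, after whatever alias databases the platform applies.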
msg127350 - (view) Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * (Python triager) Date: 2011-01-28 21:36
Martin v. Löwis:
It seems that your web browser replaces ", " with ",\t" in the title (where "\t" is a tab character) each time you add a comment.
msg127351 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-28 21:38
More likely, it's my email reader. Sorry about that.
msg127417 - (view) Author: Steffen Daode Nurpmeso (sdaoden) Date: 2011-01-29 13:51
User lemburg pointed me to this, but no, I've posted msg127416 to Issue 11022.
msg141830 - (view) Author: Alexis Metaireau (alexis) * (Python triager) Date: 2011-08-09 15:53
Maybe it would be useful to specify in the documentation that getlocale() is not intended for finding out what the system's locale is?

It's not explained currently and thus it's a bit weird to have getlocale returning (None, None) even if you have your locales set.
msg141847 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-08-10 00:24
This issue is about the fact that it doesn't return (None, None).  We should probably decide what we are going to do about that before changing the docs if they need it.
msg141872 - (view) Author: Alexis Metaireau (alexis) * (Python triager) Date: 2011-08-10 16:05
I see two different things here:

1) the fact that getlocale() doesn't return (None, None) on some python 
versions
2) the fact that having it returning (None, None) by default is a bit 
misleading as users may think that getlocale() is tied to environment 
variables. That's what was at the origin of #12699

My last remark is about the second bit. Maybe I should start a new issue 
for this?
msg141890 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-08-11 01:25
Yes a new issue would be more appropriate.
msg147174 - (view) Author: Petri Lehtinen (petri.lehtinen) * (Python committer) Date: 2011-11-06 19:48
If the thread safety of setlocale() is a problem, does anybody know how portable uselocale() is? It sets the locale of the current thread only, so it's safe to temporarily change the locale and then set it back.
msg162340 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-06-05 12:02
> Either the code is incorrect in 3.1
> or the documentation should be updated.

Leaving LC_CTYPE unchanged (use the "C" locale, which is ASCII in most
cases) at Python startup would be a major change in Python 3. I don't
want to change this. You would see a lot of mojibake in your GUIs and
get a lot of ugly surrogate characters in filenames (because of PEP 383)
if we don't set LC_CTYPE to the user's preferred encoding at startup anymore.

Setting LC_CTYPE to the user's preferred encoding is just very
convenient and helps Python to speak to the user through the console,
to the filesystem, to pass arguments on a command line of a
subprocess, etc. For example, you cannot pass non-ASCII characters to
a subprocess, characters written by the user in your GUI, if your
current LC_CTYPE locale is C (ASCII): you get a Unicode encode error.

So it's just a documentation issue: see my attached patch.
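
The two symptoms described in this message can be reproduced in isolation with a short sketch (editorial, not from the original message). It does not touch the locale at all; it only uses the `ascii` codec as a stand-in for what an ASCII-only LC_CTYPE implies, together with the `surrogateescape` error handler defined by PEP 383.

```python
name = "caf\u00e9"

# Encoding non-ASCII text with an ASCII codec fails outright.
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)

# Decoding UTF-8 bytes as ASCII with surrogateescape keeps the data,
# but each undecodable byte shows up as a lone surrogate code point.
raw = name.encode("utf-8")
decoded = raw.decode("ascii", errors="surrogateescape")
print(ascii(decoded))

# The escape is reversible: the original bytes come back unchanged.
assert decoded.encode("ascii", errors="surrogateescape") == raw
```

The round trip is lossless, which is why filenames survive, but the intermediate string is what a user would see as "ugly surrogate characters".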
msg162355 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2012-06-05 16:24
LGTM
msg162380 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-06-05 23:39
New changeset 113cdce4663c by Victor Stinner in branch 'default':
Close #6203: Document that Python 3 sets LC_CTYPE at startup to the user's preferred locale encoding
http://hg.python.org/cpython/rev/113cdce4663c
History
Date User Action Args
2012-06-05 23:39:39 python-dev set status: open -> closed; nosy: + python-dev; messages: + msg162380; resolution: fixed; stage: needs patch -> resolved
2012-06-05 16:24:06 ned.deily set messages: + msg162355
2012-06-05 12:03:57 vstinner set title: 3.x locale does not default to C, contrary to the documentation and to 2.x behavior -> locale documentation doesn't mention that LC_CTYPE is changed at startup; components: + Unicode; versions: + Python 3.2
2012-06-05 12:02:58 vstinner set files: + locale_doc.patch; keywords: + patch; messages: + msg162340
2011-11-06 19:48:10 petri.lehtinen set nosy: + petri.lehtinen; messages: + msg147174
2011-08-11 01:25:33 r.david.murray set messages: + msg141890
2011-08-10 16:05:48 alexis set messages: + msg141872; title: 3.x locale does not default to C, contrary to the documentation and to 2.x behavior -> 3.x locale does not default to C, contrary to the documentation and to 2.x behavior
2011-08-10 00:24:02 r.david.murray set messages: + msg141847
2011-08-09 15:53:51 alexis set nosy: + alexis; messages: + msg141830
2011-08-05 21:34:37 ned.deily link issue12699 superseder
2011-01-29 13:51:48 sdaoden set nosy: + sdaoden; messages: + msg127417
2011-01-28 21:38:45 loewis set nosy: lemburg, loewis, georg.brandl, pitrou, vstinner, ned.deily, ezio.melotti, Arfrever, r.david.murray; messages: + msg127351
2011-01-28 21:36:54 Arfrever set nosy: lemburg, loewis, georg.brandl, pitrou, vstinner, ned.deily, ezio.melotti, Arfrever, r.david.murray; messages: + msg127350
2011-01-28 21:22:14 loewis set nosy: lemburg, loewis, georg.brandl, pitrou, vstinner, ned.deily, ezio.melotti, Arfrever, r.david.murray; messages: + msg127347; title: 3.x locale does not default to C, contrary to the documentation and to 2.x behavior -> 3.x locale does not default to C, contrary to the documentation and to 2.x behavior
2011-01-28 15:01:17 Arfrever set nosy: lemburg, loewis, georg.brandl, pitrou, vstinner, ned.deily, ezio.melotti, Arfrever, r.david.murray; title: 3.x locale does not default to C, contrary to the documentation and to 2.x behavior -> 3.x locale does not default to C, contrary to the documentation and to 2.x behavior
2011-01-28 11:05:45 lemburg set nosy: lemburg, loewis, georg.brandl, pitrou, vstinner, ned.deily, ezio.melotti, Arfrever, r.david.murray; messages: + msg127283
2011-01-28 09:33:39 loewis set nosy: lemburg, loewis, georg.brandl, pitrou, vstinner, ned.deily, ezio.melotti, Arfrever, r.david.murray; messages: + msg127265; title: 3.x locale does not default to C, contrary to the documentation and to 2.x behavior -> 3.x locale does not default to C, contrary to the documentation and to 2.x behavior
2011-01-28 09:27:54 lemburg set nosy: + lemburg; messages: + msg127262
2011-01-27 16:58:10 Arfrever set nosy: + Arfrever
2011-01-27 11:40:07 vstinner set nosy: + vstinner; messages: + msg127180; versions: + Python 3.3, - Python 3.2
2010-10-29 10:07:21 admin set assignee: georg.brandl -> docs@python
2009-12-30 01:46:52 r.david.murray set versions: + Python 3.2, - Python 3.1
2009-06-09 10:43:42 pitrou set assignee: georg.brandl
2009-06-09 07:10:25 loewis set assignee: loewis -> (no value); messages: + msg89136
2009-06-08 21:51:50 r.david.murray set priority: release blocker -> high; messages: + msg89120; components: - Library (Lib); nosy: loewis, georg.brandl, pitrou, ned.deily, ezio.melotti, r.david.murray
2009-06-08 19:43:09 pitrou set messages: + msg89102; title: 3.x locale does not default to C, contrary to the documentation and to 2.x behavior -> 3.x locale does not default to C, contrary to the documentation and to 2.x behavior
2009-06-08 19:39:29 loewis set messages: + msg89101; title: 3.x locale does not default to C, contrary to the documentation and to 2.x behavior -> 3.x locale does not default to C, contrary to the documentation and to 2.x behavior
2009-06-08 16:26:25 pitrou set messages: + msg89090
2009-06-08 16:22:10 r.david.murray set messages: + msg89089
2009-06-08 16:17:53 pitrou set nosy: + pitrou; messages: + msg89088
2009-06-08 16:01:05 r.david.murray set priority: normal -> release blocker; nosy: + r.david.murray; messages: + msg89084; stage: needs patch
2009-06-08 13:29:54 georg.brandl set assignee: georg.brandl -> loewis; messages: + msg89077; nosy: + loewis
2009-06-06 21:00:39 ezio.melotti set priority: normal; nosy: + ezio.melotti; messages: + msg89016; components: + Library (Lib)
2009-06-05 10:56:37 ned.deily create