classification
Title: locale.getdefaultlocale() missing corner case
Type: behavior Stage: needs patch
Components: Documentation, Library (Lib) Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: georg.brandl, groodt, loewis, r.david.murray, rg3, serhiy.storchaka
Priority: normal Keywords: easy, patch

Created on 2009-04-22 18:20 by rg3, last changed 2012-10-06 15:15 by serhiy.storchaka.

Files
File name Uploaded Description
locale.diff rg3, 2009-04-22 18:20
locale_parse.patch serhiy.storchaka, 2012-07-14 13:27
Messages (12)
msg86312 - Author: (rg3) Date: 2009-04-22 18:20
A recent issue with one of my programs has shown that
locale.getdefaultlocale() does not correctly handle a corner case. The
issue URL is:

http://bitbucket.org/rg3/youtube-dl/issue/7/

Essentially, some users have LANG set to something like
es_CA.UTF-8@valencia. In that case, locale.getdefaultlocale() returns,
as the encoding, the string "utf_8_valencia", which cannot be used as an
argument to the string encode() function. The obvious correct encoding
in this case is UTF-8.

I have traced the problem and it seems that it could be fixed by the
attached patch. It checks if the encoding, at that point, contains the
'@' symbol and, in that case, removes everything starting at that point,
leaving only "UTF-8".

I am not sure if this patch or a similar one should be applied to other
Python versions. My system has Python 2.5.2 and that's what I have patched.

Explanation as to why I put the code there:

* The simple case, es_CA.UTF-8 goes through that point too and enters
the "if".
* I wanted to remove what goes after the '@' symbol at that point, so it
either needed to be removed before the call to the normalizing function
or inside the normalization.
* As this is not what I would consider a normalization, I put the code
before the function call.

Thanks for your hard work. I hope my patch is valid.

Regards.
msg86313 - Author: (rg3) Date: 2009-04-22 18:26
I just realized that the "if" I introduced is not really needed.
"encoding = encoding.split('@')[0]" works whether the '@' symbol is
present or not.
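
A minimal sketch of the simplification described above. The helper name strip_modifier is hypothetical and is not part of locale.py or of the attached patch; it only shows why no '@' check is needed before the split.

```python
def strip_modifier(encoding):
    """Drop an '@modifier' suffix from a codeset string, if one is present."""
    # str.split always returns at least one element, so element 0 is the
    # whole string when there is no '@' in it.
    return encoding.split('@')[0]


print(strip_modifier('UTF-8@valencia'))  # UTF-8
print(strip_modifier('UTF-8'))           # UTF-8
```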
msg86317 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2009-04-22 18:52
I wasn't able to reproduce this by just setting my LC_ALL environment
variable to es_CA.UTF-8@valencia and calling getdefaultlocale.  Can you
provide more complete steps to reproduce?
msg86318 - Author: (rg3) Date: 2009-04-22 19:20
You are right. The issue is not reproduced with es_CA.UTF-8@valencia but
with ca_ES.UTF-8@valencia. The fact that the first case works makes me
think maybe there's another way to solve the problem. Can you check that?
msg86319 - Author: (rg3) Date: 2009-04-22 19:30
Further investigation:

The guy who had this issue may be from Valencia, Spain. According to the
manpage for setlocale(3) in my system, the form is usually
language[_territory][.codeset][@modifier]. So, in this case, it would
make sense for the language to be "ca" (Catalan) and territory "ES" (Spain).

My patch may be fine after all. Because, if at that point the @modifier
is still present (I have seen code that removes it before that point),
you'd still want to remove it and keep only the "codeset", which is the
interesting part.
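
To illustrate the language[_territory][.codeset][@modifier] form quoted from setlocale(3), here is a rough sketch of splitting a locale name into its parts. The regular expression and the split_locale name are assumptions made for this example; this is not how locale.py parses names, and it only covers names where the codeset comes before the modifier.

```python
import re

# Hypothetical pattern for language[_territory][.codeset][@modifier].
_LOCALE_RE = re.compile(
    r'^(?P<language>[A-Za-z]+)'
    r'(?:_(?P<territory>[A-Za-z0-9]+))?'
    r'(?:\.(?P<codeset>[^@]+))?'
    r'(?:@(?P<modifier>.+))?$'
)


def split_locale(name):
    """Return a dict with language, territory, codeset and modifier.

    Parts that are absent from the name are returned as None.
    """
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError('unrecognized locale name: %r' % (name,))
    return match.groupdict()


parts = split_locale('ca_ES.UTF-8@valencia')
print(parts['language'])   # ca
print(parts['territory'])  # ES
print(parts['codeset'])    # UTF-8
print(parts['modifier'])   # valencia
```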
msg86327 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2009-04-22 20:26
OK, it turns out that this is one of a class of long-standing known
bugs (see issue554676 and issue1080864, for example).  The
recommended solution is to not use locale.getdefaultlocale, but to use
locale.getpreferredencoding.  I have confirmed that that works for the
case of ca_ES.UTF-8@valencia in python2.5.

There is at least a doc bug here, since no mention of this
fragility/recommendation is made in the getdefaultlocale documentation.

Using getpreferredencoding seems to be the correct solution to your
problem.  However, the locale.py module contains a number of examples of
modifiers in the locale_alias table.  Presumably this case could be
added, but it is not clear to me what the policy is on that at this
time, so I'm adding Martin to the nosy list looking for some guidance.
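
A short sketch of the recommended approach, assuming the goal is to encode text for output. The function name encode_for_output and the choice of errors='replace' are illustrative assumptions, not something prescribed in this thread.

```python
import locale


def encode_for_output(text):
    """Encode text using the encoding reported by getpreferredencoding()."""
    # getpreferredencoding() asks the platform for the user's encoding
    # instead of parsing the LANG / LC_* variables itself.
    encoding = locale.getpreferredencoding()
    # 'replace' is an assumption: it avoids UnicodeEncodeError for
    # characters the encoding cannot represent.
    return text.encode(encoding, errors='replace')


data = encode_for_output('hola')
print(type(data).__name__)  # bytes
```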
msg86332 - Author: (rg3) Date: 2009-04-22 20:52
Excellent. Thanks for the tip. I'll now proceed to modify my code to use
getpreferredencoding. Still, I think getdefaultlocale should work
because it could be used in other situations, I suppose.
msg164859 - Author: Greg Roodt (groodt) Date: 2012-07-07 14:34
Bumping this as part of a bug scrub at EuroPython. Is this still an issue? Should we fix in docs or in code?
msg165264 - Author: (rg3) Date: 2012-07-11 16:45
I don't know if the behavior is considered a bug or just undocumented, but under Python 2.7.3 it's still the same. locale.getpreferredencoding() does return UTF-8, but the second element in the tuple locale.getdefaultlocale() is "utf_8_valencia", which is not a valid encoding despite the documentation saying it's supposed to be an encoding name.

From my terminal:

$ python -V
Python 2.7.3

$ LANG=ca_ES.UTF-8@valencia python -c 'import locale; print locale.getpreferredencoding()'
UTF-8

$ LANG=ca_ES.UTF-8@valencia python -c 'import locale; print locale.getdefaultlocale()'
('ca_ES', 'utf_8_valencia')

$ LANG=ca_ES.UTF-8 python -c 'import locale; print locale.getpreferredencoding()'
UTF-8

$ LANG=ca_ES.UTF-8 python -c 'import locale; print locale.getdefaultlocale()'
('ca_ES', 'UTF-8')
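
One way to check whether a returned name is usable as an encoding is to look it up in the codec registry. The sketch below assumes a hypothetical helper named is_known_encoding and does not call getdefaultlocale itself; it only tests the two strings shown in the terminal session above.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python's codec registry recognizes the name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


print(is_known_encoding('UTF-8'))           # True
print(is_known_encoding('utf_8_valencia'))  # False
```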
msg165267 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-07-11 19:11
The patch does not work for the "ca_ES@valencia" locale.

There are also issues with locales such as "ks_in@devanagari", "ks_IN@devanagari.UTF-8", "sd", and "sd_IN@devanagari.UTF-8" ("ks_in@devanagari" in locale_alias maps to "ks_IN@devanagari.UTF-8" and "sd" to "sd_IN@devanagari.UTF-8").
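
The locale names listed here place the modifier before the codeset, so cutting at '@' alone is not enough. The following is only a sketch of one way to pull out the codeset in either ordering; extract_codeset is a hypothetical name and this is not the logic of locale_parse.patch.

```python
def extract_codeset(name):
    """Return the codeset part of a locale name, or None if there is none.

    Handles both 'lang_TERR.codeset@modifier' and
    'lang_TERR@modifier.codeset'.
    """
    if '.' not in name:
        return None
    # Everything after the first '.' starts with the codeset ...
    tail = name.split('.', 1)[1]
    # ... and a trailing '@modifier', if any, is cut off.
    codeset = tail.split('@', 1)[0]
    return codeset or None


print(extract_codeset('ca_ES.UTF-8@valencia'))    # UTF-8
print(extract_codeset('ks_IN@devanagari.UTF-8'))  # UTF-8
print(extract_codeset('ca_ES@valencia'))          # None
```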
msg165447 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-07-14 13:25
Here are some more inconsistencies:

$ LANG=uk_ua.microsoftcp1251 ./python -c "import locale; print(locale.getdefaultlocale())"
('uk_UA', 'CP1251')
$ LANG=uk_ua.microsoft-cp1251 ./python -c "import locale; print(locale.getdefaultlocale())"
('uk_UA', 'microsoft_cp1251')

$ ./python -c "import locale; print(locale.normalize('ka_ge.georgianacademy'))"
ka_GE.GEORGIAN-ACADEMY
$ ./python -c "import locale; print(locale.normalize('ka_GE.GEORGIAN-ACADEMY'))"
ka_GE.georgian_academy
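
The second pair of commands shows that normalizing an already normalized name can change it again. A small check for that property could look like the sketch below; the helper name is hypothetical, and the result depends on the locale_alias table of the Python version in use, so no expected output is shown.

```python
import locale


def is_normalize_idempotent(name):
    """Return True if normalizing the name twice gives the same result as once."""
    once = locale.normalize(name)
    twice = locale.normalize(once)
    return once == twice


# Result varies with the Python version's alias table.
print(is_normalize_idempotent('ka_ge.georgianacademy'))
```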
msg165448 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-07-14 13:27
Here is a complex patch for more careful locale parsing.
History
Date User Action Args
2012-10-06 15:15:36 serhiy.storchaka set versions: + Python 3.4
2012-07-14 13:27:21 serhiy.storchaka set files: + locale_parse.patch; messages: + msg165448
2012-07-14 13:25:44 serhiy.storchaka set messages: + msg165447
2012-07-11 19:11:00 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg165267
2012-07-11 16:45:38 rg3 set messages: + msg165264
2012-07-07 14:34:45 groodt set nosy: + groodt; messages: + msg164859
2011-11-29 06:14:21 ezio.melotti set keywords: + easy; versions: + Python 3.2, Python 3.3, - Python 2.6, Python 3.0, Python 3.1
2010-10-29 10:07:21 admin set assignee: georg.brandl -> docs@python
2009-04-22 20:52:01 rg3 set messages: + msg86332
2009-04-22 20:26:45 r.david.murray set assignee: georg.brandl; components: + Documentation; versions: + Python 2.6, Python 3.0, Python 3.1, Python 2.7, - Python 2.5; nosy: + loewis, georg.brandl; messages: + msg86327; stage: test needed -> needs patch
2009-04-22 19:30:42 rg3 set messages: + msg86319
2009-04-22 19:20:23 rg3 set messages: + msg86318
2009-04-22 18:52:33 r.david.murray set priority: normal; nosy: + r.david.murray; messages: + msg86317; stage: test needed
2009-04-22 18:26:24 rg3 set messages: + msg86313
2009-04-22 18:20:44 rg3 create