classification
Title: locale.getdefaultlocale() missing corner case
Type: behavior Stage: needs patch
Components: Documentation, Library (Lib) Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: georg.brandl, loewis, r.david.murray, rg3
Priority: normal Keywords: easy, patch

Created on 2009-04-22 18:20 by rg3, last changed 2011-11-29 06:14 by ezio.melotti.

Files
File name Uploaded Description
locale.diff rg3, 2009-04-22 18:20
Messages (7)
msg86312 - Author: (rg3) Date: 2009-04-22 18:20
A recent issue with one of my programs has shown that
locale.getdefaultlocale() does not correctly handle a corner case. The
issue URL is:

http://bitbucket.org/rg3/youtube-dl/issue/7/

Essentially, some users have LANG set to something like
es_CA.UTF-8@valencia. In that case, locale.getdefaultlocale() returns,
as the encoding, the string "utf_8_valencia", which cannot be used as an
argument to the string encode() function. The obvious correct encoding
in this case is UTF-8.

I have traced the problem and it seems that it could be fixed by the
attached patch. It checks if the encoding, at that point, contains the
'@' symbol and, in that case, removes everything starting at that point,
leaving only "UTF-8".

I am not sure if this patch or a similar one should be applied to other
Python versions. My system has Python 2.5.2 and that's what I have patched.

Explanation as to why I put the code there:

* The simple case, es_CA.UTF-8 goes through that point too and enters
the "if".
* I wanted to remove what goes after the '@' symbol at that point, so it
either needed to be removed before the call to the normalizing function
or inside the normalization.
* As this is not what I would consider a normalization, I put the code
before the function call.

Thanks for your hard work. I hope my patch is valid.

Regards.
msg86313 - Author: (rg3) Date: 2009-04-22 18:26
I just realized that the "if" I introduced is not really needed.
"encoding = encoding.split('@')[0]" works whether the '@' symbol is
present or not.
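
A minimal sketch of the stripping step described above, written as a standalone helper rather than as the actual change to locale.py. The function name strip_modifier is hypothetical and exists only for illustration; the attached locale.diff may place the expression differently.

```python
def strip_modifier(encoding):
    """Drop an '@modifier' suffix from a codeset string, if one is present.

    str.split always returns at least one element, so no prior check
    for '@' is needed.
    """
    return encoding.split('@')[0]


# Both forms yield a name that str.encode() accepts.
with_modifier = strip_modifier('UTF-8@valencia')
without_modifier = strip_modifier('UTF-8')
encoded = u'valencia'.encode(with_modifier)
```

Because split() returns the whole string as its first element when the separator is absent, the plain es_CA.UTF-8 case passes through unchanged.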
msg86317 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2009-04-22 18:52
I wasn't able to reproduce this by just setting my LC_ALL environment
variable to es_CA.UTF-8@valencia and calling getdefaultlocale.  Can you
provide more complete steps to reproduce?
msg86318 - Author: (rg3) Date: 2009-04-22 19:20
You are right. The issue does not reproduce with es_CA.UTF-8@valencia,
but it does with ca_ES.UTF-8@valencia. The fact that the first case works
makes me think there may be another way to solve the problem. Can you
check that?
msg86319 - Author: (rg3) Date: 2009-04-22 19:30
Further investigation:

The user who had this issue may be from Valencia, Spain. According to the
manpage for setlocale(3) on my system, the form is usually
language[_territory][.codeset][@modifier]. So, in this case, it would
make sense for the language to be "ca" (Catalan) and territory "ES" (Spain).

My patch may be fine after all: if the @modifier is still present at that
point (I have seen code that removes it before that point), you'd still
want to remove it and keep only the "codeset", which is the interesting
part.
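
To make the language[_territory][.codeset][@modifier] form concrete, here is a rough sketch that splits a locale name into its four parts with a regular expression. The helper name split_locale and the pattern are assumptions made for illustration; this is not how locale.py parses names internally.

```python
import re

# language[_territory][.codeset][@modifier], as described in setlocale(3).
_LOCALE_RE = re.compile(
    r'^(?P<language>[^_.@]+)'
    r'(?:_(?P<territory>[^.@]+))?'
    r'(?:\.(?P<codeset>[^@]+))?'
    r'(?:@(?P<modifier>.+))?$'
)


def split_locale(name):
    """Return a dict with language, territory, codeset and modifier.

    Parts that are absent from the name are reported as None.
    """
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError('unrecognized locale name: %r' % (name,))
    return match.groupdict()


parts = split_locale('ca_ES.UTF-8@valencia')
# parts['codeset'] is 'UTF-8' and parts['modifier'] is 'valencia'
```

Under this reading, the modifier is a separate field and never part of the codeset, which is why dropping everything from '@' onward leaves a usable encoding name.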
msg86327 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2009-04-22 20:26
OK, it turns out that this is one of a class of known bugs of long
standing (see issue554676 and issue1080864, for example).  The
recommended solution is to not use locale.getdefaultlocale, but to use
locale.getpreferredencoding.  I have confirmed that that works for the
case of ca_ES.UTF-8@valencia in python2.5.

There is at least a doc bug here, since no mention of this
fragility/recommendation is made in the getdefaultlocale documentation.

Using getpreferredencoding seems to be the correct solution to your
problem.  However, the locale.py module contains a number of examples of
modifiers in the locale_alias table.  Presumably this case could be
added, but it is not clear to me what the policy is on that at this
time, so I'm adding Martin to the nosy list looking for some guidance.
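
For reference, a small sketch of the recommended approach from the caller's side. locale.getpreferredencoding and codecs.lookup are real standard-library calls; the wrapper name usable_encoding and the 'utf-8' fallback are assumptions added here for illustration, not part of any proposed patch.

```python
import codecs
import locale


def usable_encoding(fallback='utf-8'):
    """Return the preferred encoding, or a fallback if no codec matches.

    Passing False avoids calling setlocale() as a side effect.
    """
    name = locale.getpreferredencoding(False)
    try:
        codecs.lookup(name)
    except LookupError:
        # Hypothetical safety net for names Python has no codec for.
        return fallback
    return name


data = u'test'.encode(usable_encoding())
```

The codecs.lookup check guards against exactly the failure reported in this issue: an encoding name that cannot be passed to encode().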
msg86332 - Author: (rg3) Date: 2009-04-22 20:52
Excellent. Thanks for the tip. I'll now proceed to modify my code to use
getpreferredencoding. Still, I think getdefaultlocale should work
because it could be used in other situations, I suppose.
History
Date                 User            Action  Args
2011-11-29 06:14:21  ezio.melotti    set     keywords: + easy
                                             versions: + Python 3.2, Python 3.3, - Python 2.6, Python 3.0, Python 3.1
2010-10-29 10:07:21  admin           set     assignee: georg.brandl -> docs@python
2009-04-22 20:52:01  rg3             set     messages: + msg86332
2009-04-22 20:26:45  r.david.murray  set     nosy: + loewis, georg.brandl
                                             versions: + Python 2.6, Python 3.0, Python 3.1, Python 2.7, - Python 2.5
                                             messages: + msg86327
                                             assignee: georg.brandl
                                             components: + Documentation
                                             stage: test needed -> needs patch
2009-04-22 19:30:42  rg3             set     messages: + msg86319
2009-04-22 19:20:23  rg3             set     messages: + msg86318
2009-04-22 18:52:33  r.david.murray  set     priority: normal
                                             nosy: + r.david.murray
                                             messages: + msg86317
                                             stage: test needed
2009-04-22 18:26:24  rg3             set     messages: + msg86313
2009-04-22 18:20:44  rg3             create