classification
Title: normalize() in locale.py fails for sr_RS.UTF-8@latin
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: locale.getdefaultlocale() missing corner case
View: 5815
Assigned To: Nosy List: mfabian, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-11-09 08:02 by mfabian, last changed 2014-01-14 21:38 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
mike-test.py mfabian, 2013-11-09 08:05 test program to see what goes wrong in locale normalization
0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch mfabian, 2013-11-09 08:22 0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch
Messages (8)
msg202467 - (view) Author: Mike FABIAN (mfabian) Date: 2013-11-09 08:02
Originally reported here: 

https://bugzilla.redhat.com/show_bug.cgi?id=1024667

I found that Serbian translations in Latin script do not work when the
locale name is written as sr_RS.UTF-8@latin (one gets the Cyrillic
translations instead), but they *do* work when the locale name is
written as sr_RS@latin (i.e. omitting the '.UTF-8'):

$ LANG='sr_RS.UTF-8'  python2 -c 'import gettext; print(gettext.ldgettext("anaconda", "What language would you like to use during the installation process?").decode("UTF-8"))'
Који језик бисте желели да користите током процеса инсталације?
mfabian@ari:~
$ LANG='sr_RS.UTF-8@latin'  python2 -c 'import gettext; print(gettext.ldgettext("anaconda", "What language would you like to use during the installation process?").decode("UTF-8"))'
Који језик бисте желели да користите током процеса инсталације?
mfabian@ari:~
$ LANG='sr_RS@latin'  python2 -c 'import gettext; print(gettext.ldgettext("anaconda", "What language would you like to use during the installation process?").decode("UTF-8"))'
Koji jezik biste želeli da koristite tokom procesa instalacije?
mfabian@ari:~
$ 

The “gettext” command line tool does not have this problem:

mfabian@ari:~
$ LANG='sr_RS@latin' gettext anaconda "What language would you like to use during the installation process?"
Koji jezik biste želeli da koristite tokom procesa instalacije?mfabian@ari:~
$ LANG='sr_RS.UTF-8@latin' gettext anaconda "What language would you like to use during the installation process?"
Koji jezik biste želeli da koristite tokom procesa instalacije?mfabian@ari:~
$ LANG='sr_RS.UTF-8' gettext anaconda "What language would you like to use during the installation process?"
Који језик бисте желели да користите током процеса инсталације?mfabian@ari:~
$
msg202468 - (view) Author: Mike FABIAN (mfabian) Date: 2013-11-09 08:05
The cause turns out to be a bug in the normalization of
the locale name; see the output of this test program:

mfabian@ari:~
$ cat ~/tmp/mike-test.py
#!/usr/bin/python2

import sys
import os
import locale
import encodings
import encodings.aliases

test_locales = [
    'ja_JP.UTF-8',
    'de_DE.SJIS',
    'de_DE.foobar',
    'sr_RS.UTF-8@latin',
    'sr_rs@latin',
    'sr@latin',
    'sr_yu',
    'sr_yu.SJIS@devanagari',
    'sr@foobar',
    'sR@foObar',
    'sR',
]

for test_locale in test_locales:
    print("%(orig)s -> %(norm)s"
          %{'orig': test_locale,
            'norm': locale.normalize(test_locale)}
    )

mfabian@ari:~
$ python2 ~/tmp/mike-test.py
ja_JP.UTF-8 -> ja_JP.UTF-8
de_DE.SJIS -> de_DE.SJIS
de_DE.foobar -> de_DE.foobar
sr_RS.UTF-8@latin -> sr_RS.utf_8_latin
sr_rs@latin -> sr_RS.UTF-8@latin
sr@latin -> sr_RS.UTF-8@latin
sr_yu -> sr_RS.UTF-8@latin
sr_yu.SJIS@devanagari -> sr_RS.sjis_devanagari
sr@foobar -> sr@foobar
sR@foObar -> sR@foObar
sR -> sr_RS.UTF-8
mfabian@ari:~
$ 

I.e. “sr_RS.UTF-8@latin” is normalized to “sr_RS.utf_8_latin”, which
is clearly wrong. When using gettext, this causes a fallback to sr_RS,
which gives the Cyrillic translations.
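
A minimal sketch of how the modifier can end up inside the encoding part, assuming the name is split at the first '.' and punctuation in the encoding is then collapsed to '_'. The helper names (naive_split, squash, modifier_aware_split) are hypothetical illustrations, not the functions in locale.py:

```python
import re


def naive_split(name):
    """Split at the first '.', so any '@modifier' stays glued to the encoding."""
    lang, _, encoding = name.lower().partition('.')
    return lang, encoding


def squash(encoding):
    """Hypothetical stand-in for encoding normalization:
    every run of non-alphanumeric characters becomes a single '_'."""
    return re.sub(r'[^0-9a-z]+', '_', encoding)


def modifier_aware_split(name):
    """Peel off '@modifier' first, then split the language from the encoding."""
    rest, _, modifier = name.lower().partition('@')
    lang, _, encoding = rest.partition('.')
    return lang, encoding, modifier


print(naive_split('sr_RS.UTF-8@latin'))           # ('sr_rs', 'utf-8@latin')
print(squash('utf-8@latin'))                      # utf_8_latin
print(modifier_aware_split('sr_RS.UTF-8@latin'))  # ('sr_rs', 'utf-8', 'latin')
```

Under these assumptions the '@latin' modifier is treated as part of the encoding, which is consistent with the “utf_8_latin” and “sjis_devanagari” results in the output above.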
msg202469 - (view) Author: Mike FABIAN (mfabian) Date: 2013-11-09 08:09
A simple fix for that problem could look like this:

mfabian@ari:~
$ diff -u /usr/lib64/python2.7/locale.py.orig /usr/lib64/python2.7/locale.py
--- /usr/lib64/python2.7/locale.py.orig 2013-11-09 09:08:24.807331535 +0100
+++ /usr/lib64/python2.7/locale.py      2013-11-09 09:08:34.526390646 +0100
@@ -377,7 +377,7 @@
     # First lookup: fullname (possibly with encoding)
     norm_encoding = encoding.replace('-', '')
     norm_encoding = norm_encoding.replace('_', '')
-    lookup_name = langname + '.' + encoding
+    lookup_name = langname + '.' + norm_encoding
     code = locale_alias.get(lookup_name, None)
     if code is not None:
         return code
@@ -1457,6 +1457,7 @@
     'sr_cs@latn':                           'sr_RS.UTF-8@latin',
     'sr_me':                                'sr_ME.UTF-8',
     'sr_rs':                                'sr_RS.UTF-8',
+    'sr_rs.utf8@latin':                      'sr_RS.UTF-8@latin',
     'sr_rs.utf8@latn':                      'sr_RS.UTF-8@latin',
     'sr_rs@latin':                          'sr_RS.UTF-8@latin',
     'sr_rs@latn':                           'sr_RS.UTF-8@latin',
mfabian@ari:~
$
msg202470 - (view) Author: Mike FABIAN (mfabian) Date: 2013-11-09 08:15
In locale.py, the comment above “locale_alias = {” says:

# Note that the normalize() function which uses this tables
# removes '_' and '-' characters from the encoding part of the
# locale name before doing the lookup. This saves a lot of
# space in the table.

But in normalize(), this is actually not done:

    # First lookup: fullname (possibly with encoding)
    norm_encoding = encoding.replace('-', '')
    norm_encoding = norm_encoding.replace('_', '')
    lookup_name = langname + '.' + encoding
    code = locale_alias.get(lookup_name, None)

“norm_encoding” holds the encoding with these replacements,
but it is then not used in the lookup.
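
The effect on the lookup key can be sketched with a toy alias table. This is an illustration only: toy_alias and lookup() are hypothetical names, and the single table entry is copied from the diff context quoted earlier in this thread:

```python
# Toy excerpt of an alias table; the real locale_alias is much larger.
toy_alias = {
    'sr_rs.utf8@latn': 'sr_RS.UTF-8@latin',
}


def lookup(fullname, use_norm_encoding):
    """Build the lookup key with either the raw or the stripped encoding."""
    langname, _, encoding = fullname.lower().partition('.')
    # Remove '-' and '_' from the encoding, as the table comment describes.
    norm_encoding = encoding.replace('-', '').replace('_', '')
    key = langname + '.' + (norm_encoding if use_norm_encoding else encoding)
    return key, toy_alias.get(key)


print(lookup('sr_RS.UTF-8@latn', False))  # ('sr_rs.utf-8@latn', None)
print(lookup('sr_RS.UTF-8@latn', True))   # ('sr_rs.utf8@latn', 'sr_RS.UTF-8@latin')
```

With the raw encoding the key keeps its '-' and misses the table entry; with the stripped encoding it matches.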

The patch in http://bugs.python.org/msg202469
fixes that by using norm_encoding in the lookup. Together with the
added alias

+    'sr_rs.utf8@latin':                      'sr_RS.UTF-8@latin',

this makes it work for sr_RS.UTF-8@latin. My test program then outputs:

mfabian@ari:~
$ python2 ~/tmp/mike-test.py
ja_JP.UTF-8 -> ja_JP.UTF-8
de_DE.SJIS -> de_DE.SJIS
de_DE.foobar -> de_DE.foobar
sr_RS.UTF-8@latin -> sr_RS.UTF-8@latin
sr_rs@latin -> sr_RS.UTF-8@latin
sr@latin -> sr_RS.UTF-8@latin
sr_yu -> sr_RS.UTF-8@latin
sr_yu.SJIS@devanagari -> sr_RS.sjis_devanagari
sr@foobar -> sr@foobar
sR@foObar -> sR@foObar
sR -> sr_RS.UTF-8
mfabian@ari:~
$ 

But note that the normalization of the “sr_yu.SJIS@devanagari”
locale is still weird. Of course, a “sr_yu.SJIS@devanagari” locale
is quite silly and does not exist anyway, but the code in normalize()
does not seem to work as intended.
msg202471 - (view) Author: Mike FABIAN (mfabian) Date: 2013-11-09 08:22
I think the patch I attach here is a better fix than the
patch in http://bugs.python.org/msg202469 because
it makes the normalize() function behave more logically overall.
With this patch, my test program prints:

mfabian@ari:/local/mfabian/src/cpython (2.7-mike %)
$ ./python ~/tmp/mike-test.py
ja_JP.UTF-8 -> ja_JP.UTF-8
de_DE.SJIS -> de_DE.SJIS
de_DE.foobar -> de_DE.foobar
sr_RS.UTF-8@latin -> sr_RS.UTF-8@latin
sr_rs@latin -> sr_RS.UTF-8@latin
sr@latin -> sr_RS.UTF-8@latin
sr_yu -> sr_RS.UTF-8@latin
sr_yu.SJIS@devanagari -> sr_RS.SJIS@devanagari
sr@foobar -> sr_RS.UTF-8@foobar
sR@foObar -> sr_RS.UTF-8@foobar
sR -> sr_RS.UTF-8
[18995 refs]
mfabian@ari:/local/mfabian/src/cpython (2.7-mike %)
$ 

The patch also contains a small fix for the “ks” and “sd”
locales in the locale_alias dictionary; they had the “.UTF-8”
in the wrong place:

-    'ks_in@devanagari':                     'ks_IN@devanagari.UTF-8',
+    'ks_in@devanagari':                     'ks_IN.UTF-8@devanagari',

-    'sd':                                   'sd_IN@devanagari.UTF-8',
+    'sd':                                   'sd_IN.UTF-8@devanagari',

(This error is inherited from the locale.alias file from X.org,
from which the locale_alias dictionary is generated.)
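
A check for this kind of misplaced encoding could look roughly like the following sketch. The function names are hypothetical, and the sample values are the two entries shown above:

```python
def misplaced_encoding(value):
    """True when '@modifier' precedes '.encoding', e.g. 'ks_IN@devanagari.UTF-8'."""
    at = value.find('@')
    dot = value.find('.')
    return at != -1 and dot != -1 and at < dot


def move_encoding_first(value):
    """Rewrite 'lang@modifier.encoding' as 'lang.encoding@modifier'."""
    if not misplaced_encoding(value):
        return value
    lang, _, rest = value.partition('@')
    modifier, _, encoding = rest.partition('.')
    return lang + '.' + encoding + '@' + modifier


for value in ('ks_IN@devanagari.UTF-8', 'sd_IN@devanagari.UTF-8'):
    print(value, '->', move_encoding_first(value))
# ks_IN@devanagari.UTF-8 -> ks_IN.UTF-8@devanagari
# sd_IN@devanagari.UTF-8 -> sd_IN.UTF-8@devanagari
```

Running such a check over all values of the alias table would flag any other entries with the same problem.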
msg202472 - (view) Author: Mike FABIAN (mfabian) Date: 2013-11-09 08:24
The patch

http://bugs.python.org/file32552/0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch

is against the current HEAD of the 2.7 branch, but
Python 3.3 has exactly the same problem; the same patch fixes it for
Python 3.3 as well.
msg202473 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-09 08:44
This seems to be a duplicate of issue5815.
msg208116 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-14 21:38
locale.normalize() was fixed in issue5815 (and a new entry for 'sr_RS.UTF-8@latin' is no longer needed). The Devanagari entries were fixed in issue20027. In any case, thank you, Mike, for your report and proposed patch.
History
Date User Action Args
2014-01-14 21:38:10 serhiy.storchaka set status: open -> closed; type: behavior; messages: + msg208116; resolution: duplicate; stage: resolved
2013-11-09 08:44:39 serhiy.storchaka set superseder: locale.getdefaultlocale() missing corner case; messages: + msg202473; nosy: + serhiy.storchaka
2013-11-09 08:24:28 mfabian set messages: + msg202472
2013-11-09 08:22:39 mfabian set files: + 0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch; keywords: + patch; messages: + msg202471
2013-11-09 08:15:40 mfabian set messages: + msg202470
2013-11-09 08:09:25 mfabian set messages: + msg202469
2013-11-09 08:05:26 mfabian set files: + mike-test.py; messages: + msg202468
2013-11-09 08:02:59 mfabian create