Issue 19534: normalize() in locale.py fails for sr_RS.UTF-8@latin

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63733

classification

Title:	normalize() in locale.py fails for sr_RS.UTF-8@latin
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	locale.getdefaultlocale() missing corner case (issue 5815)
Assigned To:		Nosy List:	mfabian, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2013-11-09 08:02 by mfabian, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
mike-test.py	mfabian, 2013-11-09 08:05	test program to see what goes wrong in locale normalization
0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch	mfabian, 2013-11-09 08:22	0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch

Messages (8)
msg202467	Author: Mike FABIAN (mfabian)	Date: 2013-11-09 08:02
Originally reported here:
https://bugzilla.redhat.com/show_bug.cgi?id=1024667

I found that Serbian translations in Latin do not work when the locale name is written as sr_RS.UTF-8@latin (one gets the Cyrillic translations instead), but they do work when the locale name is written as sr_RS@latin (i.e. omitting the '.UTF-8'):

$ LANG='sr_RS.UTF-8' python2 -c 'import gettext; print(gettext.ldgettext("anaconda", "What language would you like to use during the installation process?").decode("UTF-8"))'
Који језик бисте желели да користите током процеса инсталације?
mfabian@ari:~ $ LANG='sr_RS.UTF-8@latin' python2 -c 'import gettext; print(gettext.ldgettext("anaconda", "What language would you like to use during the installation process?").decode("UTF-8"))'
Који језик бисте желели да користите током процеса инсталације?
mfabian@ari:~ $ LANG='sr_RS@latin' python2 -c 'import gettext; print(gettext.ldgettext("anaconda", "What language would you like to use during the installation process?").decode("UTF-8"))'
Koji jezik biste želeli da koristite tokom procesa instalacije?
mfabian@ari:~ $

The “gettext” command line tool does not have this problem:

mfabian@ari:~ $ LANG='sr_RS@latin' gettext anaconda "What language would you like to use during the installation process?"
Koji jezik biste želeli da koristite tokom procesa instalacije?
mfabian@ari:~ $ LANG='sr_RS.UTF-8@latin' gettext anaconda "What language would you like to use during the installation process?"
Koji jezik biste želeli da koristite tokom procesa instalacije?
mfabian@ari:~ $ LANG='sr_RS.UTF-8' gettext anaconda "What language would you like to use during the installation process?"
Који језик бисте желели да користите током процеса инсталације?
mfabian@ari:~ $
msg202468	Author: Mike FABIAN (mfabian)	Date: 2013-11-09 08:05
The problem turns out to be caused by a bug in normalizing the locale name; see the output of this test program:

mfabian@ari:~ $ cat ~/tmp/mike-test.py
#!/usr/bin/python2

import sys
import os
import locale
import encodings
import encodings.aliases

test_locales = [
    'ja_JP.UTF-8',
    'de_DE.SJIS',
    'de_DE.foobar',
    'sr_RS.UTF-8@latin',
    'sr_rs@latin',
    'sr@latin',
    'sr_yu',
    'sr_yu.SJIS@devanagari',
    'sr@foobar',
    'sR@foObar',
    'sR',
]

# Print each locale name next to its normalized form.
for test_locale in test_locales:
    print("%(orig)s -> %(norm)s"
          % {'orig': test_locale,
             'norm': locale.normalize(test_locale)})
mfabian@ari:~ $ python2 ~/tmp/mike-test.py
ja_JP.UTF-8 -> ja_JP.UTF-8
de_DE.SJIS -> de_DE.SJIS
de_DE.foobar -> de_DE.foobar
sr_RS.UTF-8@latin -> sr_RS.utf_8_latin
sr_rs@latin -> sr_RS.UTF-8@latin
sr@latin -> sr_RS.UTF-8@latin
sr_yu -> sr_RS.UTF-8@latin
sr_yu.SJIS@devanagari -> sr_RS.sjis_devanagari
sr@foobar -> sr@foobar
sR@foObar -> sR@foObar
sR -> sr_RS.UTF-8
mfabian@ari:~ $

I.e. “sr_RS.UTF-8@latin” is normalized to “sr_RS.utf_8_latin”, which is clearly wrong and causes a fallback to sr_RS when using gettext, which gives the Cyrillic translations.
msg202469	Author: Mike FABIAN (mfabian)	Date: 2013-11-09 08:09
A simple fix for that problem could look like this:

mfabian@ari:~ $ diff -u /usr/lib64/python2.7/locale.py.orig /usr/lib64/python2.7/locale.py
--- /usr/lib64/python2.7/locale.py.orig 2013-11-09 09:08:24.807331535 +0100
+++ /usr/lib64/python2.7/locale.py 2013-11-09 09:08:34.526390646 +0100
@@ -377,7 +377,7 @@
     # First lookup: fullname (possibly with encoding)
     norm_encoding = encoding.replace('-', '')
     norm_encoding = norm_encoding.replace('_', '')
-    lookup_name = langname + '.' + encoding
+    lookup_name = langname + '.' + norm_encoding
     code = locale_alias.get(lookup_name, None)
     if code is not None:
         return code
@@ -1457,6 +1457,7 @@
     'sr_cs@latn': 'sr_RS.UTF-8@latin',
     'sr_me': 'sr_ME.UTF-8',
     'sr_rs': 'sr_RS.UTF-8',
+    'sr_rs.utf8@latin': 'sr_RS.UTF-8@latin',
     'sr_rs.utf8@latn': 'sr_RS.UTF-8@latin',
     'sr_rs@latin': 'sr_RS.UTF-8@latin',
     'sr_rs@latn': 'sr_RS.UTF-8@latin',
mfabian@ari:~ $
msg202470	Author: Mike FABIAN (mfabian)	Date: 2013-11-09 08:15
In locale.py, the comment above “locale_alias = {” says:

# Note that the normalize() function which uses this tables
# removes '_' and '-' characters from the encoding part of the
# locale name before doing the lookup. This saves a lot of
# space in the table.

But in normalize(), this is actually not done:

    # First lookup: fullname (possibly with encoding)
    norm_encoding = encoding.replace('-', '')
    norm_encoding = norm_encoding.replace('_', '')
    lookup_name = langname + '.' + encoding
    code = locale_alias.get(lookup_name, None)

“norm_encoding” holds the encoding part with these replacements applied, but then it is not used in the lookup.

The patch in http://bugs.python.org/msg202469 fixes that. Using norm_encoding in the lookup, together with adding the alias

+    'sr_rs.utf8@latin': 'sr_RS.UTF-8@latin',

makes it work for sr_RS.UTF-8@latin. My test program then outputs:

mfabian@ari:~ $ python2 ~/tmp/mike-test.py
ja_JP.UTF-8 -> ja_JP.UTF-8
de_DE.SJIS -> de_DE.SJIS
de_DE.foobar -> de_DE.foobar
sr_RS.UTF-8@latin -> sr_RS.UTF-8@latin
sr_rs@latin -> sr_RS.UTF-8@latin
sr@latin -> sr_RS.UTF-8@latin
sr_yu -> sr_RS.UTF-8@latin
sr_yu.SJIS@devanagari -> sr_RS.sjis_devanagari
sr@foobar -> sr@foobar
sR@foObar -> sR@foObar
sR -> sr_RS.UTF-8
mfabian@ari:~ $

But note that the normalization of the “sr_yu.SJIS@devanagari” locale is still weird (of course a “sr_yu.SJIS@devanagari” is quite silly and does not exist anyway), so the code in normalize() still does not seem to work as intended.
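The mismatch described in msg202470 can be illustrated with a minimal sketch. It is not the actual locale.py code: TOY_ALIAS is a hypothetical one-entry stand-in for locale_alias, and split_name() and lookup() are illustrative helpers that assume the lowercased name is split at the first '.', so that the '@latin' modifier stays attached to the encoding part.

```python
# Hypothetical one-entry stand-in for locale_alias. Keys store the
# encoding part with '-' and '_' removed, as the table comment describes.
TOY_ALIAS = {'sr_rs.utf8@latin': 'sr_RS.UTF-8@latin'}


def split_name(localename):
    """Lowercase the name and split it at the first '.' (assumed behavior)."""
    langname, _, encoding = localename.lower().partition('.')
    return langname, encoding


def lookup(localename, use_normalized):
    """Look the name up using either the raw or the normalized encoding."""
    langname, encoding = split_name(localename)
    norm_encoding = encoding.replace('-', '').replace('_', '')
    chosen = norm_encoding if use_normalized else encoding
    return TOY_ALIAS.get(langname + '.' + chosen)


# Raw encoding: the key 'sr_rs.utf-8@latin' is not in the table.
print(lookup('sr_RS.UTF-8@latin', use_normalized=False))  # None
# Normalized encoding: the key 'sr_rs.utf8@latin' is found.
print(lookup('sr_RS.UTF-8@latin', use_normalized=True))   # sr_RS.UTF-8@latin
```

With the raw encoding the first lookup misses, which is consistent with the report that normalize() then falls through to a later code path and produces “sr_RS.utf_8_latin”.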
msg202471	Author: Mike FABIAN (mfabian)	Date: 2013-11-09 08:22
I think the patch I attach here is a better fix than the patch in http://bugs.python.org/msg202469 because it makes the normalize() function behave more logically overall. With this patch, my test program prints:

mfabian@ari:/local/mfabian/src/cpython (2.7-mike %) $ ./python ~/tmp/mike-test.py
ja_JP.UTF-8 -> ja_JP.UTF-8
de_DE.SJIS -> de_DE.SJIS
de_DE.foobar -> de_DE.foobar
sr_RS.UTF-8@latin -> sr_RS.UTF-8@latin
sr_rs@latin -> sr_RS.UTF-8@latin
sr@latin -> sr_RS.UTF-8@latin
sr_yu -> sr_RS.UTF-8@latin
sr_yu.SJIS@devanagari -> sr_RS.SJIS@devanagari
sr@foobar -> sr_RS.UTF-8@foobar
sR@foObar -> sr_RS.UTF-8@foobar
sR -> sr_RS.UTF-8
[18995 refs]
mfabian@ari:/local/mfabian/src/cpython (2.7-mike %) $

The patch also contains a small fix for the “ks” and “sd” locales in the locale_alias dictionary, which had the “.UTF-8” in the wrong place:

-    'ks_in@devanagari': 'ks_IN@devanagari.UTF-8',
+    'ks_in@devanagari': 'ks_IN.UTF-8@devanagari',

-    'sd': 'sd_IN@devanagari.UTF-8',
+    'sd': 'sd_IN.UTF-8@devanagari',

(This error is inherited from the locale.alias file from X.org, from which the locale_alias dictionary is generated.)
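The “ks” and “sd” correction follows from the usual locale name layout, language[_territory][.codeset][@modifier], in which the codeset comes before the modifier. As a rough sketch of how such misordered alias values could be detected and repaired, the following uses two hypothetical helpers, is_well_ordered() and fix_order(), which are not part of locale.py or of the attached patch; the regular expression is a simplification that covers only names like those in this report.

```python
import re

# Simplified pattern for language[_territory][.codeset][@modifier].
_LOCALE_RE = re.compile(
    r'^(?P<lang>[A-Za-z]+)'
    r'(?:_(?P<territory>[A-Za-z0-9]+))?'
    r'(?:\.(?P<codeset>[A-Za-z0-9_-]+))?'
    r'(?:@(?P<modifier>[A-Za-z0-9]+))?$'
)


def is_well_ordered(name):
    """Return True if the codeset (if any) comes before the modifier."""
    return _LOCALE_RE.match(name) is not None


def fix_order(name):
    """Move a codeset that follows the modifier back in front of it."""
    base, at, rest = name.partition('@')
    modifier, dot, codeset = rest.partition('.')
    if at and dot:
        return base + '.' + codeset + '@' + modifier
    return name


for value in ('ks_IN@devanagari.UTF-8', 'sd_IN@devanagari.UTF-8'):
    print("%s -> %s" % (value, fix_order(value)))
```

Applied to the two values quoted above, fix_order() yields the same strings as the patch, and names that are already well ordered are returned unchanged.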
msg202472	Author: Mike FABIAN (mfabian)	Date: 2013-11-09 08:24
The patch http://bugs.python.org/file32552/0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch is against the current HEAD of the 2.7 branch, but Python 3.3 has exactly the same problem; the same patch fixes it for Python 3.3 as well.
msg202473	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-11-09 08:44
Seems this is a duplicate of issue5815.
msg208116	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-01-14 21:38
locale.normalize() was fixed in issue5815 (and a new entry for 'sr_RS.UTF-8@latin' is not needed anymore). Devanagari entries were fixed in issue20027. In any case, thank you Mike for your report and proposed patch.
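To check whether a particular interpreter still shows the reported symptom, a reduced version of the test program from msg202468 could look like the sketch below. The helper lost_modifier() and the NAMES list are illustrative assumptions; locale.normalize() is the only library call used. No expected output is shown because the result depends on the Python version.

```python
import locale

# Locale names taken from the report that carry an '@modifier' part.
NAMES = ['sr_RS.UTF-8@latin', 'sr_rs@latin', 'sr@latin']


def lost_modifier(original, normalized):
    """Return True if the '@modifier' part disappeared during normalization."""
    return '@' in original and '@' not in normalized


# Map each name to its normalized form and a flag for the symptom.
report = {}
for name in NAMES:
    norm = locale.normalize(name)
    report[name] = (norm, lost_modifier(name, norm))

for name in NAMES:
    norm, lost = report[name]
    print("%s -> %s%s" % (name, norm, "  (modifier lost)" if lost else ""))
```

A result such as “sr_RS.utf_8_latin” would be flagged, since the “@latin” modifier has been folded into the encoding part.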

History
Date	User	Action	Args
2022-04-11 14:57:53	admin	set	github: 63733
2014-01-14 21:38:10	serhiy.storchaka	set	status: open -> closed type: behavior messages: + msg208116 resolution: duplicate stage: resolved
2013-11-09 08:44:39	serhiy.storchaka	set	superseder: locale.getdefaultlocale() missing corner case messages: + msg202473 nosy: + serhiy.storchaka
2013-11-09 08:24:28	mfabian	set	messages: + msg202472
2013-11-09 08:22:39	mfabian	set	files: + 0001-Issue-19534-fix-normalize-in-locale.py-to-make-it-wo.patch keywords: + patch messages: + msg202471
2013-11-09 08:15:40	mfabian	set	messages: + msg202470
2013-11-09 08:09:25	mfabian	set	messages: + msg202469
2013-11-09 08:05:26	mfabian	set	files: + mike-test.py messages: + msg202468
2013-11-09 08:02:59	mfabian	create