This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Optimize locale aliases table
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: 20027 20033 Superseder:
Assigned To: serhiy.storchaka Nosy List: lemburg, loewis, python-dev, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2013-12-21 19:20 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
locale_optimize_aliases.patch serhiy.storchaka, 2013-12-21 19:20 review
locale_optimize_aliases_2.patch serhiy.storchaka, 2013-12-26 20:19 review
locale_optimize_aliases_3.patch serhiy.storchaka, 2013-12-26 21:43 review
locale_optimize_aliases_4.patch serhiy.storchaka, 2013-12-26 22:19 review
Messages (13)
msg206769 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-21 19:20
Proposed patch removes over 400 entities from locale alias tables. They are redundant because they can be calculated on fly.

Also it enables utf8 aliases. Now this adds not hundreds of redundant aliases, but only 8 new locales:

+    'be_bg.utf8':                           'bg_BG.UTF-8',
+    'c.utf8':                               'en_US.UTF-8',
+    'en_dl.utf8':                           'en_DL.UTF-8',
+    'en_zw.utf8':                           'en_ZS.UTF-8',
+    'pa_pk.utf8':                           'pa_PK.UTF-8',
+    'sr_yu.utf8':                           'sr_RS.UTF-8',
+    'te_in.utf8':                           'te_IN.UTF-8',
+    'zh_sg.utf8':                           'zh_SG.UTF-8',
msg206770 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-21 19:33
Example.

     'br_fr':                                'br_FR.ISO8859-1',
-    'br_fr.iso88591':                       'br_FR.ISO8859-1',
-    'br_fr.iso885914':                      'br_FR.ISO8859-14',
-    'br_fr.iso885915':                      'br_FR.ISO8859-15',
-    'br_fr.iso885915@euro':                 'br_FR.ISO8859-15',
-    'br_fr.utf8@euro':                      'br_FR.UTF-8',
-    'br_fr@euro':                           'br_FR.ISO8859-15',

Only one of 7 br_fr entities are left. For br_fr.iso88591, br_fr.iso885914 and br_fr.iso885915 just replaced encoding of base br_fr locale. For br_fr.iso885915@euro and br_fr.utf8@euro the @euro modifier is dropped because ISO8859-15 and UTF-8 already contains the euro character. For br_fr@euro default ISO8859-1 encoding replaced to ISO8859-15 and the @euro modifier is dropped.

So now the table contains only base entities which map lang_country to lang_country.encoding and special cases for deprecated and obscure aliases.
msg206787 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-12-21 21:38
On 21.12.2013 20:33, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
> Example.
> 
>      'br_fr':                                'br_FR.ISO8859-1',
> -    'br_fr.iso88591':                       'br_FR.ISO8859-1',
> -    'br_fr.iso885914':                      'br_FR.ISO8859-14',
> -    'br_fr.iso885915':                      'br_FR.ISO8859-15',
> -    'br_fr.iso885915@euro':                 'br_FR.ISO8859-15',
> -    'br_fr.utf8@euro':                      'br_FR.UTF-8',
> -    'br_fr@euro':                           'br_FR.ISO8859-15',
> 
> Only one of 7 br_fr entities are left. For br_fr.iso88591, br_fr.iso885914 and br_fr.iso885915 just replaced encoding of base br_fr locale. For br_fr.iso885915@euro and br_fr.utf8@euro the @euro modifier is dropped because ISO8859-15 and UTF-8 already contains the euro character. For br_fr@euro default ISO8859-1 encoding replaced to ISO8859-15 and the @euro modifier is dropped.
> 
> So now the table contains only base entities which map lang_country to lang_country.encoding and special cases for deprecated and obscure aliases.

Looks like an interesting approach, but I'd like to think about it
a day or two.

Some notes:
* the patch seems to include some unrelated changes, e.g. the
  devanagari fixes and a few new mappings
* the optimize step is called twice for some reason - is this
  intended ? if yes, please add a comment why this is done
* the patch would need some tests to make sure that the removed
  aliases indeed still map to the correct C locale strings

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
msg206796 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-21 23:38
> * the patch seems to include some unrelated changes, e.g. the
>   devanagari fixes and a few new mappings

May be. In any case I have added issue20027 as dependency. New mappings were 
added when enable UTF-8 locales in makelocalealias.py (I can split this in 
separate issue).

> * the optimize step is called twice for some reason - is this
>   intended ? if yes, please add a comment why this is done

Actually we should call it in a loop while the size of the table is decreased.

> * the patch would need some tests to make sure that the removed
>   aliases indeed still map to the correct C locale strings

Currently the makelocalealias.py is such manual test. test_locale contains 
multiple tests for different locales. I'll add several new cases.
msg206932 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-12-26 00:36
I thought some more about this approach. I'm +1 on it. The locale
lookup is not time critical, so the table optimization makes sense.
Nice idea, Serhiy !

On 22.12.2013 00:38, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
>> * the patch seems to include some unrelated changes, e.g. the
>>   devanagari fixes and a few new mappings
> 
> May be. In any case I have added issue20027 as dependency. New mappings were 
> added when enable UTF-8 locales in makelocalealias.py (I can split this in 
> separate issue).

I think it's better to apply the patches separately, if that's
possible.

>> * the optimize step is called twice for some reason - is this
>>   intended ? if yes, please add a comment why this is done
> 
> Actually we should call it in a loop while the size of the table is decreased.

Ok, that makes sense. Still, please add a comment on why this is necessary.

>> * the patch would need some tests to make sure that the removed
>>   aliases indeed still map to the correct C locale strings
> 
> Currently the makelocalealias.py is such manual test. test_locale contains 
> multiple tests for different locales. I'll add several new cases.

Great.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
msg206959 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-26 20:19
Here is updated patch. It doesn't include any additions to locale alias table (including devanagari). Added several tests cases (many other test cases for removed aliases already exist). optimize() is called only once, looks as second run has no effect until UTF-8 aliases are enabled.
msg206964 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-12-26 21:09
On 26.12.2013 21:19, Serhiy Storchaka wrote:
> 
> Here is updated patch. It doesn't include any additions to locale alias table (including devanagari). Added several tests cases (many other test cases for removed aliases already exist). optimize() is called only once, looks as second run has no effect until UTF-8 aliases are enabled.

Looks good. Could you add a test for the optimization function ?

It should make sure that a complete set of now removed locale
names are properly optimized away, e.g.

     'nl_nl':                                'nl_NL.ISO8859-1',
-    'nl_nl.88591':                          'nl_NL.ISO8859-1',
-    'nl_nl.iso88591':                       'nl_NL.ISO8859-1',
-    'nl_nl.iso885915':                      'nl_NL.ISO8859-15',
-    'nl_nl.iso885915@euro':                 'nl_NL.ISO8859-15',
-    'nl_nl.utf8@euro':                      'nl_NL.UTF-8',
-    'nl_nl@euro':                           'nl_NL.ISO8859-15',

and

     'ja_jp':                                'ja_JP.eucJP',
-    'ja_jp.ajec':                           'ja_JP.eucJP',
     'ja_jp.euc':                            'ja_JP.eucJP',
-    'ja_jp.eucjp':                          'ja_JP.eucJP',
-    'ja_jp.iso-2022-jp':                    'ja_JP.JIS7',
-    'ja_jp.iso2022jp':                      'ja_JP.JIS7',
-    'ja_jp.jis':                            'ja_JP.JIS7',
-    'ja_jp.jis7':                           'ja_JP.JIS7',
     'ja_jp.mscode':                         'ja_JP.SJIS',
     'ja_jp.pck':                            'ja_JP.SJIS',
-    'ja_jp.sjis':                           'ja_JP.SJIS',
-    'ja_jp.ujis':                           'ja_JP.eucJP',

This should then cover all the Latin-1 encodings and also
an Asian one.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
msg206966 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-26 21:43
> Could you add a test for the optimization function ?

I have no ideas. The optimization function is a part of the makelocalealias.py 
which ran manually very rarely (last time 3.5 year ago). It isn't exposed 
outside the script and I'm not sure that it worths the complication of the 
testing.

> It should make sure that a complete set of now removed locale
> names are properly optimized away, e.g.
> 
>      'nl_nl':                                'nl_NL.ISO8859-1',
> -    'nl_nl.88591':                          'nl_NL.ISO8859-1',
> -    'nl_nl.iso88591':                       'nl_NL.ISO8859-1',
> -    'nl_nl.iso885915':                      'nl_NL.ISO8859-15',
> -    'nl_nl.iso885915@euro':                 'nl_NL.ISO8859-15',
> -    'nl_nl.utf8@euro':                      'nl_NL.UTF-8',
> -    'nl_nl@euro':                           'nl_NL.ISO8859-15',

These cases are covered by test_english and test_euro_modifier.

>      'ja_jp':                                'ja_JP.eucJP',
> -    'ja_jp.ajec':                           'ja_JP.eucJP',
>      'ja_jp.euc':                            'ja_JP.eucJP',
> -    'ja_jp.eucjp':                          'ja_JP.eucJP',
> -    'ja_jp.iso-2022-jp':                    'ja_JP.JIS7',
> -    'ja_jp.iso2022jp':                      'ja_JP.JIS7',
> -    'ja_jp.jis':                            'ja_JP.JIS7',
> -    'ja_jp.jis7':                           'ja_JP.JIS7',
>      'ja_jp.mscode':                         'ja_JP.SJIS',
>      'ja_jp.pck':                            'ja_JP.SJIS',
> -    'ja_jp.sjis':                           'ja_JP.SJIS',
> -    'ja_jp.ujis':                           'ja_JP.eucJP',

Here is a test which includes tests for japanese cases end tests for the euc 
encoding (it maps to different encodings depending on language).
msg206967 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-12-26 21:47
On 26.12.2013 22:43, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
>> Could you add a test for the optimization function ?
> 
> I have no ideas. The optimization function is a part of the makelocalealias.py 
> which ran manually very rarely (last time 3.5 year ago). It isn't exposed 
> outside the script and I'm not sure that it worths the complication of the 
> testing.

I probably wasn't clear: I meant some tests that show that the
alias definitions (on the left) from the X11 file are actually mapped to the
correct alias locales (on the right).

>> It should make sure that a complete set of now removed locale
>> names are properly optimized away, e.g.
>>
>>      'nl_nl':                                'nl_NL.ISO8859-1',
>> -    'nl_nl.88591':                          'nl_NL.ISO8859-1',
>> -    'nl_nl.iso88591':                       'nl_NL.ISO8859-1',
>> -    'nl_nl.iso885915':                      'nl_NL.ISO8859-15',
>> -    'nl_nl.iso885915@euro':                 'nl_NL.ISO8859-15',
>> -    'nl_nl.utf8@euro':                      'nl_NL.UTF-8',
>> -    'nl_nl@euro':                           'nl_NL.ISO8859-15',
> 
> These cases are covered by test_english and test_euro_modifier.

Ok.

>>      'ja_jp':                                'ja_JP.eucJP',
>> -    'ja_jp.ajec':                           'ja_JP.eucJP',
>>      'ja_jp.euc':                            'ja_JP.eucJP',
>> -    'ja_jp.eucjp':                          'ja_JP.eucJP',
>> -    'ja_jp.iso-2022-jp':                    'ja_JP.JIS7',
>> -    'ja_jp.iso2022jp':                      'ja_JP.JIS7',
>> -    'ja_jp.jis':                            'ja_JP.JIS7',
>> -    'ja_jp.jis7':                           'ja_JP.JIS7',
>>      'ja_jp.mscode':                         'ja_JP.SJIS',
>>      'ja_jp.pck':                            'ja_JP.SJIS',
>> -    'ja_jp.sjis':                           'ja_JP.SJIS',
>> -    'ja_jp.ujis':                           'ja_JP.eucJP',
> 
> Here is a test which includes tests for japanese cases end tests for the euc 
> encoding (it maps to different encodings depending on language).

Thanks. I think this is good enough now.
msg206970 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-26 22:19
> I probably wasn't clear: I meant some tests that show that the
> alias definitions (on the left) from the X11 file are actually mapped to the
> correct alias locales (on the right).

This is how the optimization function works. It updates alias table with the 
X11 file, and then removes the alias entities one by one and checks that the 
alias definition are mapped to to correct alias locales.

Here is a patch which adds a self-check in the makelocalealias.py script.
msg206971 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-12-26 22:47
On 26.12.2013 23:19, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
>> I probably wasn't clear: I meant some tests that show that the
>> alias definitions (on the left) from the X11 file are actually mapped to the
>> correct alias locales (on the right).
> 
> This is how the optimization function works. It updates alias table with the 
> X11 file, and then removes the alias entities one by one and checks that the 
> alias definition are mapped to to correct alias locales.
> 
> Here is a patch which adds a self-check in the makelocalealias.py script.

Thanks, this makes that part of the implementation waterproof as well.

Good to go, I guess.
msg206972 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-12-26 22:57
New changeset 63bc68d7f449 by Serhiy Storchaka in branch 'default':
Issue #20046: Locale alias table no longer contains entities which can be
http://hg.python.org/cpython/rev/63bc68d7f449
msg206973 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-26 23:00
Thank you Marc-Andre for your reviews.
History
Date User Action Args
2022-04-11 14:57:55adminsetgithub: 64245
2013-12-26 23:00:11serhiy.storchakasetstatus: open -> closed
messages: + msg206973

assignee: serhiy.storchaka
resolution: fixed
stage: patch review -> resolved
2013-12-26 22:57:05python-devsetnosy: + python-dev
messages: + msg206972
2013-12-26 22:47:30lemburgsetmessages: + msg206971
2013-12-26 22:19:51serhiy.storchakasetfiles: + locale_optimize_aliases_4.patch

messages: + msg206970
2013-12-26 21:47:36lemburgsetmessages: + msg206967
2013-12-26 21:43:12serhiy.storchakasetfiles: + locale_optimize_aliases_3.patch

messages: + msg206966
2013-12-26 21:09:16lemburgsetmessages: + msg206964
2013-12-26 20:19:14serhiy.storchakasetfiles: + locale_optimize_aliases_2.patch

messages: + msg206959
versions: + Python 3.4
2013-12-26 00:36:03lemburgsetmessages: + msg206932
2013-12-21 23:38:34serhiy.storchakasetmessages: + msg206796
2013-12-21 21:38:44lemburgsetmessages: + msg206787
2013-12-21 19:33:26serhiy.storchakasetmessages: + msg206770
2013-12-21 19:22:08serhiy.storchakasetdependencies: + Fixed support for Indian locales, Fix makelocalealias.py for Python 3
2013-12-21 19:20:57serhiy.storchakacreate