Issue 20046: Optimize locale aliases table

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64245

classification

Title:	Optimize locale aliases table
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.4

process

Status:	closed	Resolution:	fixed
Dependencies:	20027 20033	Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	lemburg, loewis, python-dev, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2013-12-21 19:20 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
locale_optimize_aliases.patch	serhiy.storchaka, 2013-12-21 19:20		review
locale_optimize_aliases_2.patch	serhiy.storchaka, 2013-12-26 20:19		review
locale_optimize_aliases_3.patch	serhiy.storchaka, 2013-12-26 21:43		review
locale_optimize_aliases_4.patch	serhiy.storchaka, 2013-12-26 22:19		review

Messages (13)
msg206769 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-21 19:20
Proposed patch removes over 400 entities from locale alias tables. They are redundant because they can be calculated on fly. Also it enables utf8 aliases. Now this adds not hundreds of redundant aliases, but only 8 new locales: + 'be_bg.utf8': 'bg_BG.UTF-8', + 'c.utf8': 'en_US.UTF-8', + 'en_dl.utf8': 'en_DL.UTF-8', + 'en_zw.utf8': 'en_ZS.UTF-8', + 'pa_pk.utf8': 'pa_PK.UTF-8', + 'sr_yu.utf8': 'sr_RS.UTF-8', + 'te_in.utf8': 'te_IN.UTF-8', + 'zh_sg.utf8': 'zh_SG.UTF-8',
msg206770 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-21 19:33
Example. 'br_fr': 'br_FR.ISO8859-1', - 'br_fr.iso88591': 'br_FR.ISO8859-1', - 'br_fr.iso885914': 'br_FR.ISO8859-14', - 'br_fr.iso885915': 'br_FR.ISO8859-15', - 'br_fr.iso885915@euro': 'br_FR.ISO8859-15', - 'br_fr.utf8@euro': 'br_FR.UTF-8', - 'br_fr@euro': 'br_FR.ISO8859-15', Only one of 7 br_fr entities are left. For br_fr.iso88591, br_fr.iso885914 and br_fr.iso885915 just replaced encoding of base br_fr locale. For br_fr.iso885915@euro and br_fr.utf8@euro the @euro modifier is dropped because ISO8859-15 and UTF-8 already contains the euro character. For br_fr@euro default ISO8859-1 encoding replaced to ISO8859-15 and the @euro modifier is dropped. So now the table contains only base entities which map lang_country to lang_country.encoding and special cases for deprecated and obscure aliases.
msg206787 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-12-21 21:38
On 21.12.2013 20:33, Serhiy Storchaka wrote: > > Serhiy Storchaka added the comment: > > Example. > > 'br_fr': 'br_FR.ISO8859-1', > - 'br_fr.iso88591': 'br_FR.ISO8859-1', > - 'br_fr.iso885914': 'br_FR.ISO8859-14', > - 'br_fr.iso885915': 'br_FR.ISO8859-15', > - 'br_fr.iso885915@euro': 'br_FR.ISO8859-15', > - 'br_fr.utf8@euro': 'br_FR.UTF-8', > - 'br_fr@euro': 'br_FR.ISO8859-15', > > Only one of 7 br_fr entities are left. For br_fr.iso88591, br_fr.iso885914 and br_fr.iso885915 just replaced encoding of base br_fr locale. For br_fr.iso885915@euro and br_fr.utf8@euro the @euro modifier is dropped because ISO8859-15 and UTF-8 already contains the euro character. For br_fr@euro default ISO8859-1 encoding replaced to ISO8859-15 and the @euro modifier is dropped. > > So now the table contains only base entities which map lang_country to lang_country.encoding and special cases for deprecated and obscure aliases. Looks like an interesting approach, but I'd like to think about it a day or two. Some notes: * the patch seems to include some unrelated changes, e.g. the devanagari fixes and a few new mappings * the optimize step is called twice for some reason - is this intended ? if yes, please add a comment why this is done * the patch would need some tests to make sure that the removed aliases indeed still map to the correct C locale strings Thanks, -- Marc-Andre Lemburg eGenix.com
msg206796 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-21 23:38
> * the patch seems to include some unrelated changes, e.g. the > devanagari fixes and a few new mappings May be. In any case I have added issue20027 as dependency. New mappings were added when enable UTF-8 locales in makelocalealias.py (I can split this in separate issue). > * the optimize step is called twice for some reason - is this > intended ? if yes, please add a comment why this is done Actually we should call it in a loop while the size of the table is decreased. > * the patch would need some tests to make sure that the removed > aliases indeed still map to the correct C locale strings Currently the makelocalealias.py is such manual test. test_locale contains multiple tests for different locales. I'll add several new cases.
msg206932 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-12-26 00:36
I thought some more about this approach. I'm +1 on it. The locale lookup is not time critical, so the table optimization makes sense. Nice idea, Serhiy ! On 22.12.2013 00:38, Serhiy Storchaka wrote: > > Serhiy Storchaka added the comment: > >> * the patch seems to include some unrelated changes, e.g. the >> devanagari fixes and a few new mappings > > May be. In any case I have added issue20027 as dependency. New mappings were > added when enable UTF-8 locales in makelocalealias.py (I can split this in > separate issue). I think it's better to apply the patches separately, if that's possible. >> * the optimize step is called twice for some reason - is this >> intended ? if yes, please add a comment why this is done > > Actually we should call it in a loop while the size of the table is decreased. Ok, that makes sense. Still, please add a comment on why this is necessary. >> * the patch would need some tests to make sure that the removed >> aliases indeed still map to the correct C locale strings > > Currently the makelocalealias.py is such manual test. test_locale contains > multiple tests for different locales. I'll add several new cases. Great. Thanks, -- Marc-Andre Lemburg eGenix.com
msg206959 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-26 20:19
Here is updated patch. It doesn't include any additions to locale alias table (including devanagari). Added several tests cases (many other test cases for removed aliases already exist). optimize() is called only once, looks as second run has no effect until UTF-8 aliases are enabled.
msg206964 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-12-26 21:09
On 26.12.2013 21:19, Serhiy Storchaka wrote: > > Here is updated patch. It doesn't include any additions to locale alias table (including devanagari). Added several tests cases (many other test cases for removed aliases already exist). optimize() is called only once, looks as second run has no effect until UTF-8 aliases are enabled. Looks good. Could you add a test for the optimization function ? It should make sure that a complete set of now removed locale names are properly optimized away, e.g. 'nl_nl': 'nl_NL.ISO8859-1', - 'nl_nl.88591': 'nl_NL.ISO8859-1', - 'nl_nl.iso88591': 'nl_NL.ISO8859-1', - 'nl_nl.iso885915': 'nl_NL.ISO8859-15', - 'nl_nl.iso885915@euro': 'nl_NL.ISO8859-15', - 'nl_nl.utf8@euro': 'nl_NL.UTF-8', - 'nl_nl@euro': 'nl_NL.ISO8859-15', and 'ja_jp': 'ja_JP.eucJP', - 'ja_jp.ajec': 'ja_JP.eucJP', 'ja_jp.euc': 'ja_JP.eucJP', - 'ja_jp.eucjp': 'ja_JP.eucJP', - 'ja_jp.iso-2022-jp': 'ja_JP.JIS7', - 'ja_jp.iso2022jp': 'ja_JP.JIS7', - 'ja_jp.jis': 'ja_JP.JIS7', - 'ja_jp.jis7': 'ja_JP.JIS7', 'ja_jp.mscode': 'ja_JP.SJIS', 'ja_jp.pck': 'ja_JP.SJIS', - 'ja_jp.sjis': 'ja_JP.SJIS', - 'ja_jp.ujis': 'ja_JP.eucJP', This should then cover all the Latin-1 encodings and also an Asian one. Thanks, -- Marc-Andre Lemburg eGenix.com
msg206966 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-26 21:43
> Could you add a test for the optimization function ? I have no ideas. The optimization function is a part of the makelocalealias.py which ran manually very rarely (last time 3.5 year ago). It isn't exposed outside the script and I'm not sure that it worths the complication of the testing. > It should make sure that a complete set of now removed locale > names are properly optimized away, e.g. > > 'nl_nl': 'nl_NL.ISO8859-1', > - 'nl_nl.88591': 'nl_NL.ISO8859-1', > - 'nl_nl.iso88591': 'nl_NL.ISO8859-1', > - 'nl_nl.iso885915': 'nl_NL.ISO8859-15', > - 'nl_nl.iso885915@euro': 'nl_NL.ISO8859-15', > - 'nl_nl.utf8@euro': 'nl_NL.UTF-8', > - 'nl_nl@euro': 'nl_NL.ISO8859-15', These cases are covered by test_english and test_euro_modifier. > 'ja_jp': 'ja_JP.eucJP', > - 'ja_jp.ajec': 'ja_JP.eucJP', > 'ja_jp.euc': 'ja_JP.eucJP', > - 'ja_jp.eucjp': 'ja_JP.eucJP', > - 'ja_jp.iso-2022-jp': 'ja_JP.JIS7', > - 'ja_jp.iso2022jp': 'ja_JP.JIS7', > - 'ja_jp.jis': 'ja_JP.JIS7', > - 'ja_jp.jis7': 'ja_JP.JIS7', > 'ja_jp.mscode': 'ja_JP.SJIS', > 'ja_jp.pck': 'ja_JP.SJIS', > - 'ja_jp.sjis': 'ja_JP.SJIS', > - 'ja_jp.ujis': 'ja_JP.eucJP', Here is a test which includes tests for japanese cases end tests for the euc encoding (it maps to different encodings depending on language).
msg206967 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-12-26 21:47
On 26.12.2013 22:43, Serhiy Storchaka wrote: > > Serhiy Storchaka added the comment: > >> Could you add a test for the optimization function ? > > I have no ideas. The optimization function is a part of the makelocalealias.py > which ran manually very rarely (last time 3.5 year ago). It isn't exposed > outside the script and I'm not sure that it worths the complication of the > testing. I probably wasn't clear: I meant some tests that show that the alias definitions (on the left) from the X11 file are actually mapped to the correct alias locales (on the right). >> It should make sure that a complete set of now removed locale >> names are properly optimized away, e.g. >> >> 'nl_nl': 'nl_NL.ISO8859-1', >> - 'nl_nl.88591': 'nl_NL.ISO8859-1', >> - 'nl_nl.iso88591': 'nl_NL.ISO8859-1', >> - 'nl_nl.iso885915': 'nl_NL.ISO8859-15', >> - 'nl_nl.iso885915@euro': 'nl_NL.ISO8859-15', >> - 'nl_nl.utf8@euro': 'nl_NL.UTF-8', >> - 'nl_nl@euro': 'nl_NL.ISO8859-15', > > These cases are covered by test_english and test_euro_modifier. Ok. >> 'ja_jp': 'ja_JP.eucJP', >> - 'ja_jp.ajec': 'ja_JP.eucJP', >> 'ja_jp.euc': 'ja_JP.eucJP', >> - 'ja_jp.eucjp': 'ja_JP.eucJP', >> - 'ja_jp.iso-2022-jp': 'ja_JP.JIS7', >> - 'ja_jp.iso2022jp': 'ja_JP.JIS7', >> - 'ja_jp.jis': 'ja_JP.JIS7', >> - 'ja_jp.jis7': 'ja_JP.JIS7', >> 'ja_jp.mscode': 'ja_JP.SJIS', >> 'ja_jp.pck': 'ja_JP.SJIS', >> - 'ja_jp.sjis': 'ja_JP.SJIS', >> - 'ja_jp.ujis': 'ja_JP.eucJP', > > Here is a test which includes tests for japanese cases end tests for the euc > encoding (it maps to different encodings depending on language). Thanks. I think this is good enough now.
msg206970 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-26 22:19
> I probably wasn't clear: I meant some tests that show that the > alias definitions (on the left) from the X11 file are actually mapped to the > correct alias locales (on the right). This is how the optimization function works. It updates alias table with the X11 file, and then removes the alias entities one by one and checks that the alias definition are mapped to to correct alias locales. Here is a patch which adds a self-check in the makelocalealias.py script.
msg206971 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-12-26 22:47
On 26.12.2013 23:19, Serhiy Storchaka wrote: > > Serhiy Storchaka added the comment: > >> I probably wasn't clear: I meant some tests that show that the >> alias definitions (on the left) from the X11 file are actually mapped to the >> correct alias locales (on the right). > > This is how the optimization function works. It updates alias table with the > X11 file, and then removes the alias entities one by one and checks that the > alias definition are mapped to to correct alias locales. > > Here is a patch which adds a self-check in the makelocalealias.py script. Thanks, this makes that part of the implementation waterproof as well. Good to go, I guess.
msg206972 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-12-26 22:57
New changeset 63bc68d7f449 by Serhiy Storchaka in branch 'default': Issue #20046: Locale alias table no longer contains entities which can be http://hg.python.org/cpython/rev/63bc68d7f449
msg206973 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-26 23:00
Thank you Marc-Andre for your reviews.

History
Date	User	Action	Args
2022-04-11 14:57:55	admin	set	github: 64245
2013-12-26 23:00:11	serhiy.storchaka	set	status: open -> closed messages: + msg206973 assignee: serhiy.storchaka resolution: fixed stage: patch review -> resolved
2013-12-26 22:57:05	python-dev	set	nosy: + python-dev messages: + msg206972
2013-12-26 22:47:30	lemburg	set	messages: + msg206971
2013-12-26 22:19:51	serhiy.storchaka	set	files: + locale_optimize_aliases_4.patch messages: + msg206970
2013-12-26 21:47:36	lemburg	set	messages: + msg206967
2013-12-26 21:43:12	serhiy.storchaka	set	files: + locale_optimize_aliases_3.patch messages: + msg206966
2013-12-26 21:09:16	lemburg	set	messages: + msg206964
2013-12-26 20:19:14	serhiy.storchaka	set	files: + locale_optimize_aliases_2.patch messages: + msg206959 versions: + Python 3.4
2013-12-26 00:36:03	lemburg	set	messages: + msg206932
2013-12-21 23:38:34	serhiy.storchaka	set	messages: + msg206796
2013-12-21 21:38:44	lemburg	set	messages: + msg206787
2013-12-21 19:33:26	serhiy.storchaka	set	messages: + msg206770
2013-12-21 19:22:08	serhiy.storchaka	set	dependencies: + Fixed support for Indian locales, Fix makelocalealias.py for Python 3
2013-12-21 19:20:57	serhiy.storchaka	create