Message 350823 - Python tracker


Author	eryksun
Recipients	eryksun, paul.moore, steve.dower, tim.golden, xtreak, zach.ware
Date	2019-08-29.19:53:49
Message-id	<1567108430.23.0.317693421427.issue37945@roundup.psfhosted.org>

Content

If normalize() is implemented for Windows, then the tests should be split out into POSIX and Windows versions. Currently, most of the tests in NormalizeTest are not checking a result that's properly normalized for ucrt.

A useful implementation of locale.normalize should allow a script to use ("en_US", "iso8859_1") in Windows without having to know that Latin-1 is Windows codepage 28591, or that ucrt requires a classic locale name if the encoding isn't UTF-8. The required result for setlocale() is "English_United States.28591".
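To illustrate the encoding half of that example, here is a minimal sketch of a codepage lookup. The table name WINDOWS_CODEPAGE_ALIAS, the helper encoding_to_codepage, and the handful of entries are assumptions for illustration only; a real table would need to be much larger. It leans on codecs.lookup() to fold Python's encoding aliases into one canonical codec name first.

```python
import codecs

# Hypothetical table: canonical Python codec name -> Windows codepage.
# Only a few entries are shown; a real table would be far more complete.
WINDOWS_CODEPAGE_ALIAS = {
    "iso8859-1": 28591,
    "iso8859-2": 28592,
    "iso8859-15": 28605,
    "koi8-r": 20866,
    "cp1252": 1252,
}


def encoding_to_codepage(encoding):
    """Return the Windows codepage for an encoding name, or None if unknown."""
    try:
        # Folds aliases such as "latin-1" and "iso8859_1" into "iso8859-1".
        canonical = codecs.lookup(encoding).name
    except LookupError:
        return None
    return WINDOWS_CODEPAGE_ALIAS.get(canonical)
```

With this, a script could pass "iso8859_1" or "latin-1" and get 28591 back without knowing the codepage number itself.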

As far as aliases are concerned, at a minimum, we need to map "posix" and "c" to "C". We can also support "C.UTF-8" as "en_US.UTF-8". Do we need to support the Unix locale_alias mappings from X.org? If so, I suppose we could use a double mapping. First try the Unix locale_alias mapping. Then try that result in a windows_locale_alias mapping that includes additional mappings from Unix to Windows. For example:

sr_CS.UTF-8 -> sr_Cyrl_CS.UTF-8
sr_CS.UTF-8@latin -> sr_Latn_CS.UTF-8
ca_ES.UTF-8@valencia -> ca_ES_valencia.UTF-8

Note that the last one doesn't currently work. "ca-ES-valencia" is a valid Windows locale name for the Valencian variant of Catalan (ca), which lacks an ISO 639 code of its own since it's officially (and somewhat controversially) designated as a dialect of Catalan. This is an unusual case that has a subtag after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)
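The double mapping could look roughly like the sketch below. Both tables here are small hypothetical stand-ins: UNIX_LOCALE_ALIAS only holds the minimum "posix"/"c" entries rather than the full X.org-derived table, and WINDOWS_LOCALE_ALIAS holds just the examples from this message. The function name map_aliases is also made up.

```python
# Hypothetical stand-in for the Unix alias table (keys are lowercase).
UNIX_LOCALE_ALIAS = {
    "posix": "C",
    "c": "C",
}

# Hypothetical second-stage table mapping Unix names to Windows names.
WINDOWS_LOCALE_ALIAS = {
    "C.UTF-8": "en_US.UTF-8",
    "sr_CS.UTF-8": "sr_Cyrl_CS.UTF-8",
    "sr_CS.UTF-8@latin": "sr_Latn_CS.UTF-8",
    # Listed for completeness; ucrt cannot currently parse this result.
    "ca_ES.UTF-8@valencia": "ca_ES_valencia.UTF-8",
}


def map_aliases(name):
    """Apply the Unix alias table first, then the Windows alias table."""
    unix_name = UNIX_LOCALE_ALIAS.get(name.lower(), name)
    return WINDOWS_LOCALE_ALIAS.get(unix_name, unix_name)
```

Names that appear in neither table pass through unchanged, so the later steps still get a chance to handle them.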

After mapping aliases, if the result still has "@" in it, normalize() should fail. We don't know what the "@" modifier means.
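A minimal sketch of this check, combined with splitting the name from the encoding once the check passes. Raising ValueError is an assumption, since the message doesn't say how normalize() should fail, and split_locale is a hypothetical helper name.

```python
def split_locale(name):
    """Reject leftover "@" modifiers, then split "name.encoding" in two."""
    if "@" in name:
        # No alias consumed the modifier, so its meaning is unknown.
        raise ValueError(f"unsupported locale modifier in {name!r}")
    locale_name, _, encoding = name.partition(".")
    return locale_name, encoding or None
```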

Otherwise, split the locale name and encoding parts. If the encoding isn't UTF-8, try to map it to a codepage. For this we need a windows_codepage_alias dict that maps IANA official and Python-specific encoding names to Windows codepages. Next, check the locale name via WINAPI IsValidLocaleName. If it's not valid, try replacing underscores with hyphens and check again. If it's still not valid, assume it's a classic ucrt locale name. (It may not be valid, but implementing all of the work ucrt does to parse a classic locale name is too much, I think.) If it is a valid Windows locale name, and we have a codepage encoding, then try to translate it to a classic ucrt locale name. This requires two WINAPI GetLocaleInfoEx calls to look up the English versions of the language and country names.
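The validation and translation steps could be sketched as follows for the case where a codepage is already known. This is only an outline under some assumptions: classic_name, winapi_is_valid and winapi_english_name are hypothetical names, the ctypes wrappers only work on Windows, and the two WINAPI calls are passed in as parameters so that the logic can be exercised elsewhere with stand-ins.

```python
import ctypes

# LCTYPE constants from winnls.h.
LOCALE_SENGLISHLANGUAGENAME = 0x1001
LOCALE_SENGLISHCOUNTRYNAME = 0x1002


def winapi_is_valid(name):
    """Windows only: wrap IsValidLocaleName."""
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    return bool(kernel32.IsValidLocaleName(name))


def winapi_english_name(name, lctype):
    """Windows only: wrap GetLocaleInfoEx for a single LCTYPE."""
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    buf = ctypes.create_unicode_buffer(128)
    if not kernel32.GetLocaleInfoEx(name, lctype, buf, len(buf)):
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.value


def classic_name(locale_name, codepage,
                 is_valid=winapi_is_valid, english_name=winapi_english_name):
    """Build a classic 'Language_Country.codepage' name where possible."""
    if not is_valid(locale_name):
        hyphenated = locale_name.replace("_", "-")
        if not is_valid(hyphenated):
            # Assume it is already a classic ucrt name; it is not verified.
            return f"{locale_name}.{codepage}"
        locale_name = hyphenated
    # Two lookups: English language name and English country name.
    language = english_name(locale_name, LOCALE_SENGLISHLANGUAGENAME)
    country = english_name(locale_name, LOCALE_SENGLISHCOUNTRYNAME)
    return f"{language}_{country}.{codepage}"
```

On Windows, classic_name("en_US", 28591) would be expected to follow the hyphen-retry path and then build the name from the two GetLocaleInfoEx results. Anything that fails both validity checks is passed through untouched, in line with the point above about not reimplementing ucrt's classic-name parsing.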
