Author eryksun
Recipients eryksun, paul.moore, steve.dower, tim.golden, xtreak, zach.ware
Date 2019-08-29.19:53:49
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1567108430.23.0.317693421427.issue37945@roundup.psfhosted.org>
In-reply-to
Content
If normalize() is implemented for Windows, then the tests should be split out into POSIX and Windows versions. Currently, most of the tests in NormalizeTest are not checking a result that's properly normalized for ucrt.

A useful implementation of locale.normalize should allow a script to use ("en_US", "iso8859_1") in Windows without having to know that Latin-1 is Windows codepage 28591, or that ucrt requires a classic locale name if the encoding isn't UTF-8. The required result for setlocale() is "English_United States.28591". 

As far as aliases are concerned, at a minimum, we need to map "posix" and "c" to "C". We can also support "C.UTF-8" as "en_US.UTF-8". Do we need to support the Unix locale_alias mappings from X.org? If so, I suppose we could use a double mapping. First try the Unix locale_alias mapping. Then try that result in a windows_locale_alias mapping that includes additional mappings from Unix to Windows. For example: 

    sr_CS.UTF-8          -> sr_Cyrl_CS.UTF-8
    sr_CS.UTF-8@latin    -> sr_Latn_CS.UTF-8
    ca_ES.UTF-8@valencia -> ca_ES_valencia.UTF-8

Note that the last one doesn't currently work. "ca-ES-valencia" is a valid Windows locale name for the Valencian variant of Catalan (ca), which lacks an ISO 639 code of its own since it's officially (and somewhat controversially) designated as a dialect of Catalan. This is an unusual case that has a subtag after the region, which ucrt's manual BCP-47 parsing cannot handle. (It tries to parse "ES" as the script and "valencia" as an ISO 3166-1 country code.)

After mapping aliases, if the result still has "@" in it, normalize() should fail. We don't know what the "@" modifier means.

Otherwise, split the locale name and encoding parts. If the encoding isn't UTF-8, try to map it to a codepage. For this we need a  windows_codepage_alias dict that maps IANA official and Python-specific encoding names to Windows codepages. Next, check the locale name via WINAPI IsValidLocaleName. If it's not valid, try replacing underscore with hyphen and check again. Otherwise assume it's a classic ucrt locale name. (It may not be valid, but implementing all of the work ucrt does to parse a classic locale name is too much I think.) If it's a valid Windows locale name, and we have a codepage encoding, then try to translate it as a classic ucrt locale name. This requires two WINAPI GetLocaleInfoEx calls to look up the English versions of the language and country name.
History
Date User Action Args
2019-08-29 19:53:50eryksunsetrecipients: + eryksun, paul.moore, tim.golden, zach.ware, steve.dower, xtreak
2019-08-29 19:53:50eryksunsetmessageid: <1567108430.23.0.317693421427.issue37945@roundup.psfhosted.org>
2019-08-29 19:53:50eryksunlinkissue37945 messages
2019-08-29 19:53:49eryksuncreate