Message 387256 - Python tracker


Author	eryksun
Recipients	AndersMunch, eryksun, lemburg, paul.moore, steve.dower, swt2c, tim.golden, zach.ware
Date	2021-02-18.18:04:53
Message-id	<1613671493.97.0.572146174769.issue43115@roundup.psfhosted.org>

Content

> The APIs were written at a time where locale modifiers 
> simply did not exist.

Technically, locale modifiers did exist circa 2000, but I suppose you mean that they were uncommon to the point of being unheard of at the time.

The modifier field was specified in the X/Open Portability Guide Issue 3 (XPG3) in 1989, and again in XPG4 in 1992 as "language[_territory][.codeset][@modifier]". I can't provide links to the specifications (they're not freely available), but here's a link to X/Open "Internationalisation Guide Version 2" (1993), which defines the modifier field in section 5.1.2 (pages 88-89):

https://pubs.opengroup.org/onlinepubs/009269599/toc.pdf
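
To make the four fields concrete, here is a minimal sketch that splits a name of the form "language[_territory][.codeset][@modifier]". The function name split_xpg_locale and the regular expression are my own illustration; they are not part of the locale module.

```python
import re

# Pattern for the XPG form: language[_territory][.codeset][@modifier]
_XPG_LOCALE = re.compile(
    r"^(?P<language>[^_.@]+)"
    r"(?:_(?P<territory>[^.@]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


def split_xpg_locale(name):
    """Split an XPG-style locale name into its four fields.

    Hypothetical helper for illustration; absent fields are None.
    """
    match = _XPG_LOCALE.match(name)
    if match is None:
        raise ValueError(f"not an XPG-style locale name: {name!r}")
    return match.groupdict()


fields = split_xpg_locale("ca_ES.UTF-8@valencia")
print(fields["language"], fields["territory"], fields["codeset"], fields["modifier"])
```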

> Support could be added via a special locale tuple return
> object, which looks like 2-tuple, but comes with extra attributes
> to store the modifier

That's a good idea and worth implementing. But the _strptime and calendar modules have no need to call getlocale(LC_TIME). IMO, it adds fragility for no benefit. All they need to save is the result of setlocale(LC_TIME). 
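
For reference, the quoted idea could look roughly like the following. This is only a sketch: the class name LocaleTuple and its attribute names are assumptions of mine, not an existing or proposed API.

```python
class LocaleTuple(tuple):
    """Hypothetical return type: behaves as (code, encoding), plus a modifier."""

    def __new__(cls, code, encoding, modifier=None):
        # Only the two classic items are stored in the tuple itself, so
        # existing code that unpacks or compares a 2-tuple keeps working.
        self = super().__new__(cls, (code, encoding))
        self.modifier = modifier
        return self

    @property
    def code(self):
        return self[0]

    @property
    def encoding(self):
        return self[1]


loc = LocaleTuple("ca_ES", "UTF-8", modifier="valencia")
code, encoding = loc  # unpacks like the current getlocale() result
print(code, encoding, loc.modifier)
```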

Also, the default locale for calendar.LocaleTextCalendar has no need to use getdefaultlocale() instead of using an empty string, i.e. setlocale(LC_TIME, ""). The latter is simpler and more reliable.
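
A minimal sketch of what I mean, assuming a hypothetical context manager named temporary_lc_time (this is not the code currently in calendar or _strptime). It saves the string returned by setlocale(LC_TIME) and restores exactly that string, without a round trip through getlocale().

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_lc_time(name=""):
    """Temporarily switch LC_TIME; "" selects the user's default locale."""
    # Calling setlocale() without a new value only queries the current setting.
    saved = locale.setlocale(locale.LC_TIME)
    locale.setlocale(locale.LC_TIME, name)
    try:
        yield
    finally:
        # Restore the exact string the C runtime gave us earlier.
        locale.setlocale(locale.LC_TIME, saved)


with temporary_lc_time("C"):
    print(locale.setlocale(locale.LC_TIME))
```

Because the saved value is never parsed, a modifier or an unusual codeset in the original setting can't get lost on the way back.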

---

> support for Windows is only partial, due to the
> completely different approach Windows' CRT took to locales.

Using the same implementation for POSIX and Windows is needlessly complicated, and it makes it difficult to reason about how the code behaves in all cases.

I suggest implementing separate versions of normalize() and _parse_localename() for Windows, making use of direct queries via _winapi.GetLocaleInfoEx() (to be added). 

The mapping in encodings.aliases also needs comprehensive coverage for Windows code pages (e.g. cp20127 -> ascii, cp28591 -> latin_1, etc). A poor match, such as code page 20932 and euc_JP, should not be aliased. (For all 3-byte sequences in standard euc-JP, code page 20932 encodes 2-byte sequences by dropping the lead byte and masking the third byte as ASCII.)
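
As a rough illustration of the alias idea, the sketch below keeps a small supplementary table next to the codec lookup instead of touching encodings.aliases itself. The table contents beyond the two examples above, the exclusion set, and the name codec_for_codepage are all assumptions for the example.

```python
import codecs

# Hypothetical supplementary aliases: Windows code page name -> Python codec.
_CODEPAGE_ALIASES = {
    "cp20127": "ascii",
    "cp28591": "latin_1",
}

# Code pages deliberately left unaliased because the nearest codec is only
# a poor match (code page 20932 is not equivalent to euc_jp).
_NO_ALIAS = {"cp20932"}


def codec_for_codepage(name):
    """Return the CodecInfo for a Windows code page name such as "cp28591"."""
    key = name.lower()
    if key in _NO_ALIAS:
        raise LookupError(f"no exact Python codec for {name}")
    return codecs.lookup(_CODEPAGE_ALIASES.get(key, key))


encoded, _ = codec_for_codepage("cp28591").encode("é")
print(encoded)
```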

If the locale string doesn't include a codeset, then normalize() shouldn't do anything to obtain one. It's not necessary in Windows. If there's a codeset, normalize() should ensure it's "UTF-8", "OCP", "ACP", or a Windows code page in the right form, e.g. "ascii" -> "20127". ucrt supports "ACP" and "OCP" codesets for the locale's ANSI and OEM code pages. These must be in uppercase, e.g. "hindi.acp" -> "hindi.ACP". ucrt will set the latter as "Hindi_India.utf8" (it's a Unicode-only locale), which should parse as ("Hindi_India", "UTF-8").
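
The codeset rule could be sketched as follows. The helper name normalize_codeset and the tiny name-to-code-page table are assumptions for illustration; a real implementation would need a complete table.

```python
import re

# Hypothetical, deliberately tiny table: codec name -> Windows code page.
_CODEPAGE_BY_NAME = {
    "ascii": "20127",
    "latin1": "28591",
    "iso88591": "28591",
}


def normalize_codeset(codeset):
    """Return "UTF-8", "ACP", "OCP", or a code page number as a string."""
    # Compare names without regard to case, hyphens, underscores or spaces.
    key = re.sub(r"[-_ ]", "", codeset).lower()
    if key == "utf8":
        return "UTF-8"
    if key in ("acp", "ocp"):
        return key.upper()  # ucrt requires these two in uppercase
    if key.startswith("cp") and key[2:].isdigit():
        return key[2:]  # "cp1250" -> "1250"
    if key.isdigit():
        return key
    try:
        return _CODEPAGE_BY_NAME[key]
    except KeyError:
        raise ValueError(f"unsupported codeset: {codeset!r}") from None


for name in ("ascii", "acp", "utf8", "cp1250"):
    print(name, "->", normalize_codeset(name))
```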

If the locale without the codeset isn't a valid Windows BCP-47 locale name, as determined by the NLS API, then normalize() should only care about case-insensitive normalization of "C" and "POSIX" as "C", e.g. "c" -> "C". No other normalization is necessary. ucrt supports case-insensitive "language[_country[.codepage]]" and ".codepage" forms, where language and country are either the full English names, LOCALE_SENGLISHLANGUAGENAME and LOCALE_SENGLISHCOUNTRYNAME, or 3-letter abbreviations, LOCALE_SABBREVLANGNAME and LOCALE_SABBREVCTRYNAME, such as "enu_USA". It also supports locale aliases such as "american[.codeset]". If the result isn't "C" or a BCP-47 locale name, ucrt setlocale() always returns the "language_country.codepage" form with full English names.
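
For names that aren't BCP-47 locale names, the whole normalization step reduces to something like this sketch (the function name is made up, and the BCP-47 check is assumed to have happened before it is called):

```python
def normalize_non_bcp47(name):
    """Map "C" and "POSIX" (any case) to "C"; pass everything else through."""
    if name.upper() in ("C", "POSIX"):
        return "C"
    # Forms such as "enu_USA" or "american" are left for ucrt to resolve.
    return name


print(normalize_non_bcp47("c"), normalize_non_bcp47("enu_USA"))
```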

In ucrt, a BCP-47 locale name such as "en" or "en_US" cannot be combined with an explicit codeset other than UTF-8. If no codeset is specified, ucrt implicitly uses the locale's ANSI code page.

If a BCP-47 locale name is paired with a codeset that's neither the given locale's ANSI codepage nor UTF-8, then normalize it to the "language_country.codepage" form. For example, "fr_FR.latin-1" -> "French_France.28591". Parse the latter as ("French_France", "ISO-8859-1"). 

If a BCP-47 locale name is paired with the locale's ANSI code page, then normalize it without the code page, e.g. "sr_Latn_RS.cp1250" -> "sr_Latn_RS". Look up the locale's ANSI code page when parsing the latter, e.g. "sr_Latn_RS" -> ("sr_Latn_RS", "cp1250"). 

If a BCP-47 locale name is paired with UTF-8, then there isn't much to do other than normalize the locale name and encoding name, e.g. "en_us.utf8" -> "en_US.UTF-8".
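
Putting the three BCP-47 cases together, a sketch might look like the following. Since _winapi.GetLocaleInfoEx() doesn't exist yet, the NLS queries are replaced by a hard-coded stand-in table; the table values, the encoding table, and both function names are assumptions for the example, and the codeset is assumed to be already normalized to "UTF-8" or a code page number.

```python
# Stand-in for NLS queries, keyed by lowercase name:
# (canonical BCP-47 name, English "language_country" name, ANSI code page).
# The values are illustrative; a real implementation would ask Windows.
_FAKE_NLS = {
    "fr_fr": ("fr_FR", "French_France", "1252"),
    "sr_latn_rs": ("sr_Latn_RS", "Serbian (Latin)_Serbia", "1250"),
    "en_us": ("en_US", "English_United States", "1252"),
}

# Hypothetical code page -> encoding name table used when parsing.
_ENCODING_BY_CODEPAGE = {
    "28591": "ISO-8859-1",
    "1250": "cp1250",
    "1252": "cp1252",
}


def normalize_bcp47(name, codepage=None):
    """Normalize a BCP-47 name plus an optional, already normalized codeset."""
    canonical, english, ansi = _FAKE_NLS[name.lower()]
    if codepage is None or codepage == ansi:
        return canonical  # the ANSI code page is implied, so drop it
    if codepage == "UTF-8":
        return f"{canonical}.UTF-8"
    # Any other code page requires the "language_country.codepage" form.
    return f"{english}.{codepage}"


def parse_normalized(localename):
    """Parse a normalized locale name into (name, encoding)."""
    name, dot, codepage = localename.partition(".")
    if not dot:
        # No codeset: look up the locale's ANSI code page.
        canonical, _, ansi = _FAKE_NLS[name.lower()]
        return canonical, _ENCODING_BY_CODEPAGE[ansi]
    if codepage == "UTF-8":
        return name, "UTF-8"
    return name, _ENCODING_BY_CODEPAGE[codepage]


print(normalize_bcp47("fr_FR", "28591"))
print(parse_normalized("sr_Latn_RS"))
print(normalize_bcp47("en_us", "UTF-8"))
```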

History
Date	User	Action	Args
2021-02-18 18:04:53	eryksun	set	recipients: + eryksun, lemburg, paul.moore, tim.golden, zach.ware, steve.dower, swt2c, AndersMunch
2021-02-18 18:04:53	eryksun	set	messageid: <1613671493.97.0.572146174769.issue43115@roundup.psfhosted.org>
2021-02-18 18:04:53	eryksun	link	issue43115 messages
2021-02-18 18:04:53	eryksun	create