Message 289179 - Python tracker


Author	lemburg
Recipients	Arfrever, lemburg, loewis, serhiy.storchaka
Date	2017-03-07.18:29:01
Message-id	<ed4a6a5d-b5cc-1c13-b35a-a703ab83c406@egenix.com>
In-reply-to	<1488907384.39.0.343309274831.issue20087@psf.upfronthosting.co.za>

Content

On 07.03.2017 18:23, Serhiy Storchaka wrote:
> 
> Serhiy Storchaka added the comment:
> 
>> 'cy_GB.ISO8859-1' to 'cy_GB.ISO8859-14'
> 
> Looks as just fixing an error. The default West-European ISO8859-1 is changed to Celtic cy_GB.ISO8859-14. This looks better option for Welsh.
> 
>> 'tg_TJ.KOI8-C' to 'tg_TJ.KOI8-T'
> 
> KOI8-C is not supported by Python, but KOI8-T is supported. I don't know what KOI8-C means, there are several rarely used incompatible encodings with this name.
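The codec claim in the quoted comment can be checked directly. This is only a minimal sketch, assuming a Python 3.5+ interpreter (where the KOI8-T codec is available); `has_codec` is a hypothetical helper for illustration, not part of the locale module:

```python
import codecs


def has_codec(name):
    """Return True if Python can find a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# KOI8-T is expected to be found, KOI8-C is expected to be unknown.
for name in ("koi8-t", "koi8-c"):
    print(name, has_codec(name))
```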

While all of this may make sense, I'm still missing the reasoning
behind the differences between X.org and glibc.

This change also looks strange:

-    'ka_ge':                                'ka_GE.GEORGIAN-ACADEMY',
+    'ka_ge':                                'ka_GE.GEORGIAN_PS',
     'ka_ge.georgianacademy':                'ka_GE.GEORGIAN-ACADEMY',
     'ka_ge.georgianps':                     'ka_GE.GEORGIAN-PS',
     'ka_ge.georgianrs':                     'ka_GE.GEORGIAN-ACADEMY',

Why is GEORGIAN_PS written with an underscore, whereas the other
mappings use dashes?
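Inconsistencies like this one could be caught mechanically. The following is a rough sketch, not an existing check in the locale module: `underscore_encodings` is a hypothetical helper, and `sample` just repeats the entries quoted above.

```python
def underscore_encodings(aliases):
    """Return alias entries whose encoding part contains an underscore."""
    suspicious = {}
    for key, value in aliases.items():
        # The encoding is whatever follows the first '.', minus any @modifier.
        _, dot, encoding = value.partition(".")
        encoding = encoding.partition("@")[0]
        if dot and "_" in encoding:
            suspicious[key] = value
    return suspicious


# Entries copied from the diff under discussion.
sample = {
    "ka_ge": "ka_GE.GEORGIAN_PS",
    "ka_ge.georgianacademy": "ka_GE.GEORGIAN-ACADEMY",
    "ka_ge.georgianps": "ka_GE.GEORGIAN-PS",
    "ka_ge.georgianrs": "ka_GE.GEORGIAN-ACADEMY",
}

print(underscore_encodings(sample))
```

The same function could be run over the full `locale.locale_alias` dictionary to list every value that deviates from the dash spelling.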

Or this one:

-    'fi_fi':                                'fi_FI.ISO8859-15',
+    'fi_fi':                                'fi_FI.ISO8859-1',

Why would a locale switch from an encoding that has
the Euro sign to one that doesn't?
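The Euro sign difference between the two encodings is easy to demonstrate. A minimal sketch, where `can_encode` is a hypothetical helper introduced only for this illustration:

```python
def can_encode(text, encoding):
    """Return True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


EURO = "\u20ac"  # EURO SIGN

# ISO8859-15 has a code point for the Euro sign, ISO8859-1 does not.
for encoding in ("iso8859-15", "iso8859-1"):
    print(encoding, can_encode(EURO, encoding))
```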

Or why was this @latin variant removed:

-    'nan_tw@latin':                         'nan_TW.UTF-8@latin',

Why should Russian locales switch back from UTF-8 to ISO?

-    'ru_ru':                                'ru_RU.UTF-8',
+    'ru_ru':                                'ru_RU.ISO8859-5',

or from ISO to KOI?

-    'russian':                              'ru_RU.ISO8859-5',
+    'russian':                              'ru_RU.KOI8-R',

The more I look at these changes, the more I believe we
should not simply take everything we find in the X.org and
glibc files for granted. Both obviously have bugs.

>> I also don't understand why some "xx.utf-8" locale mappings were removed - I don't think we should remove those, unless they are no longer needed due to some other logic implying these mappings.
> 
> The aliases table is a table of exceptions. Removed entries no longer are exceptional.

It's not a table of exceptions; it's a table mapping commonly
used locale settings to ones that the C library understands :-)

But regardless, I checked the code and it is already
smart enough to convert spellings the C library does not
accept, such as "utf8", to "UTF-8", so these entries can
indeed be removed, but only if the locale is otherwise listed.
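A short sketch of that behaviour using `locale.normalize()`. It assumes that en_US has an entry in the alias table and that qq_QQ (a made-up locale name chosen for this example) does not:

```python
import locale

# Listed locale: the "utf8" spelling is expected to be rewritten to "UTF-8".
listed = locale.normalize("en_US.utf8")

# Unlisted (made-up) locale: no alias entry to fall back on, so the
# name is expected to come back unchanged.
unlisted = locale.normalize("qq_QQ.utf8")

print(listed)
print(unlisted)
```

This is why dropping an "xx.utf8" entry is only safe when the plain "xx" locale still has its own entry in the table.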

In some cases, it's probably better to drop the ".utf8"
to have more generic mappings, e.g.

+    'bhb_in.utf8':                          'bhb_IN.UTF-8',

or

     'de_li.utf8':                           'de_LI.UTF-8',

though I'd expect that mapping to be:

     'de_li':                                'de_LI.ISO8859-1',

as for all other "de" entries.

History
Date	User	Action	Args
2017-03-07 18:29:01	lemburg	set	recipients: + lemburg, loewis, Arfrever, serhiy.storchaka
2017-03-07 18:29:01	lemburg	link	issue20087 messages
2017-03-07 18:29:01	lemburg	create