Message 361473 - Python tracker


Author	eryksun
Recipients	Carsten Fuchs, CharlieClark, Dominik Geldmacher, Manjusaka, eryksun, jeremy.kloth, jkloth, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Date	2020-02-06.08:26:08
Message-id	<1580977568.81.0.984499067128.issue36792@roundup.psfhosted.org>

Content

> Even some well known locale names still use the utf-8 code page. Most
> seem to be uncommon, but at least es-BR (Brazil) does and would still
> fall victim to these UCRT bugs.

es-BR is a custom locale for the Spanish language in Brazil, as opposed to the common Portuguese locale (pt-BR). It's a Unicode-only locale, which means its ANSI codepage is 0. Since 0 is CP_ACP, its effective ANSI codepage is the system or process ANSI codepage. 

For example:

    >>> import ctypes, locale
    >>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    >>> # 0x1004 in the calls that follow is LOCALE_IDEFAULTANSICODEPAGE
    >>> buf = (ctypes.c_wchar * 10)()

Portuguese in Brazil uses codepage 1252 as its ANSI codepage:

    >>> n = kernel32.GetLocaleInfoEx('pt-BR', 0x1004, buf, 10)
    >>> buf.value
    '1252'

Spanish in Brazil uses CP_ACP:

    >>> n = kernel32.GetLocaleInfoEx('es-BR', 0x1004, buf, 10)
    >>> buf.value
    '0'

hi-IN (Hindi, India) is a common Unicode-only locale:

    >>> n = kernel32.GetLocaleInfoEx('hi-IN', 0x1004, buf, 10)
    >>> buf.value
    '0'
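The three queries above could be wrapped in a small helper. This is only a sketch: the function names `default_ansi_codepage` and `effective_ansi_codepage` are made up for illustration, and the second function just restates the "0 is CP_ACP" rule, taking the system or process ANSI codepage as an argument instead of asking Windows for it.

```python
import ctypes
import sys

LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # the LCType value used in the calls above
CP_ACP = 0  # means "use the system or process ANSI codepage"


def default_ansi_codepage(locale_name):
    """Return the locale's default ANSI codepage as an int (0 if Unicode-only).

    Hypothetical helper; it only works on Windows.
    """
    if sys.platform != "win32":
        raise OSError("GetLocaleInfoEx is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    buf = ctypes.create_unicode_buffer(10)
    n = kernel32.GetLocaleInfoEx(
        locale_name, LOCALE_IDEFAULTANSICODEPAGE, buf, len(buf))
    if n == 0:
        # The call failed, e.g. because the locale name is unknown.
        raise ctypes.WinError(ctypes.get_last_error())
    return int(buf.value)


def effective_ansi_codepage(locale_codepage, active_codepage):
    """Resolve CP_ACP (0) to the given system or process ANSI codepage."""
    if locale_codepage == CP_ACP:
        return active_codepage
    return locale_codepage
```

With a system ANSI codepage of 1252, `effective_ansi_codepage(0, 1252)` gives 1252 for es-BR and hi-IN, while pt-BR keeps its own 1252 regardless of the active codepage.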

ucrt has switched to using UTF-8 for Unicode-only locales:

    >>> locale.setlocale(locale.LC_CTYPE, 'hi_IN')
    'hi_IN'
    >>> ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
    >>> ucrt.___lc_codepage_func()
    65001

Note that ucrt uses UTF-8 for Unicode-only locales only when using an explicitly named locale such as "hi_IN", "Hindi_India" or even just "Hindi". On the other hand, if a Unicode-only locale is used implicitly, ucrt instead uses the system ANSI codepage:

    >>> locale.setlocale(locale.LC_CTYPE, '')
    'Hindi_India.1252'
    >>> ucrt.___lc_codepage_func()
    1252

I suppose this is for backwards compatibility. Windows 10 at least supports setting the system ANSI codepage to UTF-8, or overriding the process ANSI codepage to UTF-8 via the application manifest "activeCodePage" setting. For the latter, I modified the manifest in a "python_utf8.exe" copy of the normal "python.exe" binary, which is simpler than having to reboot to change the system locale:

    C:\>python_utf8 -q

    >>> import locale
    >>> locale.setlocale(locale.LC_CTYPE, '')
    'Hindi_India.utf8'
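To summarize the three results for a Unicode-only locale, here is a tiny model of the observed behavior. It is not how ucrt is implemented, just a restatement of the sessions above. The function name is hypothetical, and it assumes the locale string carries no explicit ".codepage" suffix.

```python
CP_UTF8 = 65001


def ucrt_codepage_for_unicode_only_locale(locale_arg, active_ansi_codepage):
    """Predict the ucrt LC_CTYPE codepage for a Unicode-only locale.

    locale_arg: the string passed to setlocale(); '' selects the locale
        implicitly, anything else names it explicitly (e.g. 'hi_IN').
    active_ansi_codepage: the system ANSI codepage, or the process ANSI
        codepage if the manifest overrides it.
    """
    if locale_arg == "":
        # Implicit use falls back to the active ANSI codepage.
        return active_ansi_codepage
    # An explicitly named Unicode-only locale gets UTF-8.
    return CP_UTF8
```

For example, `('hi_IN', 1252)` maps to 65001, `('', 1252)` maps to 1252, and `('', 65001)` maps to 65001, which corresponds to the "python_utf8.exe" case.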
