Author eryksun
Recipients Carsten Fuchs, CharlieClark, Dominik Geldmacher, Manjusaka, eryksun, jeremy.kloth, jkloth, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Date 2020-02-06.08:26:08
Message-id <1580977568.81.0.984499067128.issue36792@roundup.psfhosted.org>
Content
> Even some well-known locale names still use the UTF-8 code page.  Most
> seem to be uncommon, but at least es-BR (Brazil) does and would still
> fall victim to these UCRT bugs.

es-BR is a custom locale for the Spanish language in Brazil, as opposed to the common Portuguese locale (pt-BR). It's a Unicode-only locale, which means its ANSI codepage is 0. Since 0 is CP_ACP, its effective ANSI codepage is the system or process ANSI codepage. 

For example:

    >>> import ctypes, locale
    >>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    >>> # 0x1004 below is LOCALE_IDEFAULTANSICODEPAGE
    >>> buf = (ctypes.c_wchar * 10)()

Portuguese in Brazil uses codepage 1252 as its ANSI codepage:

    >>> n = kernel32.GetLocaleInfoEx('pt-BR', 0x1004, buf, 10)
    >>> buf.value
    '1252'

Spanish in Brazil uses CP_ACP:

    >>> n = kernel32.GetLocaleInfoEx('es-BR', 0x1004, buf, 10)
    >>> buf.value
    '0'

hi-IN (Hindi, India) is a common Unicode-only locale:

    >>> n = kernel32.GetLocaleInfoEx('hi-IN', 0x1004, buf, 10)
    >>> buf.value
    '0'
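
The lookups above can be wrapped in a small helper. This is a minimal sketch rather than anything from the issue itself: `ansi_codepage` and `effective_ansi_codepage` are hypothetical names, and the Windows-only calls are guarded so the module still imports on other platforms.

```python
import ctypes
import sys

LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # the LCType used in the examples above
CP_ACP = 0


def effective_ansi_codepage(locale_acp, process_acp):
    """Resolve a locale's ANSI codepage; 0 (CP_ACP) defers to the process."""
    return process_acp if locale_acp == CP_ACP else locale_acp


def ansi_codepage(locale_name):
    """Return the effective ANSI codepage of a named locale (Windows only)."""
    if sys.platform != 'win32':
        raise OSError('GetLocaleInfoEx is only available on Windows')
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    buf = (ctypes.c_wchar * 10)()
    n = kernel32.GetLocaleInfoEx(
        locale_name, LOCALE_IDEFAULTANSICODEPAGE, buf, len(buf))
    if n == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    # A Unicode-only locale reports '0', so fall back to GetACP().
    return effective_ansi_codepage(int(buf.value), kernel32.GetACP())
```

On Windows, `ansi_codepage('hi-IN')` would return whatever `GetACP()` reports for the process, since the locale itself reports 0.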

ucrt has switched to using UTF-8 for Unicode-only locales:

    >>> locale.setlocale(locale.LC_CTYPE, 'hi_IN')
    'hi_IN'
    >>> ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
    >>> ucrt.___lc_codepage_func()
    65001

Note that ucrt uses UTF-8 for Unicode-only locales only when using an explicitly named locale such as "hi_IN", "Hindi_India" or even just "Hindi". On the other hand, if a Unicode-only locale is used implicitly, ucrt instead uses the system ANSI codepage:

    >>> locale.setlocale(locale.LC_CTYPE, '')
    'Hindi_India.1252'
    >>> ucrt.___lc_codepage_func()
    1252
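
The behavior observed so far can be summarized as a small decision function. This only models what the examples show, not ucrt's actual implementation. `ucrt_codepage` and its parameters are hypothetical, and the first branch (a locale with its own ANSI codepage keeps it) is an assumption that the examples above do not demonstrate directly.

```python
CP_ACP = 0
CP_UTF8 = 65001


def ucrt_codepage(locale_acp, explicit_name, system_acp):
    """Model the LC_CTYPE codepage ucrt appears to pick.

    locale_acp: the locale's ANSI codepage (0 for a Unicode-only locale)
    explicit_name: True for setlocale(..., 'hi_IN'), False for setlocale(..., '')
    system_acp: the system (or process) ANSI codepage
    """
    if locale_acp != CP_ACP:
        return locale_acp   # assumption: e.g. pt-BR keeps 1252
    if explicit_name:
        return CP_UTF8      # explicitly named Unicode-only locale
    return system_acp       # implicit locale falls back to the ANSI codepage
```

For instance, `ucrt_codepage(0, True, 1252)` models the "hi_IN" case and `ucrt_codepage(0, False, 1252)` models the `''` case shown above.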

I suppose this is for backwards compatibility. Windows 10 supports setting the system ANSI codepage to UTF-8, or overriding the process ANSI codepage to UTF-8 via the "activeCodePage" setting in the application manifest. To test the latter, I modified the manifest in a "python_utf8.exe" copy of the normal "python.exe" binary, which is simpler than rebooting to change the system locale:

    C:\>python_utf8 -q

    >>> import locale
    >>> locale.setlocale(locale.LC_CTYPE, '')
    'Hindi_India.utf8'
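
To compare results like the ones above in a script, the codepage part of the string returned by `setlocale` can be split off and checked against the process ANSI codepage. A minimal sketch: `codepage_suffix` and `process_ansi_codepage` are hypothetical helpers, and the `GetACP` call is guarded so that it only runs on Windows.

```python
import ctypes
import sys


def codepage_suffix(locale_string):
    """Return the part after the last '.' in a setlocale() result, or None."""
    name, dot, suffix = locale_string.rpartition('.')
    return suffix if dot else None


def process_ansi_codepage():
    """Return GetACP() on Windows, or None elsewhere."""
    if sys.platform != 'win32':
        return None
    return ctypes.WinDLL('kernel32').GetACP()
```

With the modified manifest, `process_ansi_codepage()` would be expected to report 65001, matching the "utf8" suffix in the `setlocale` result.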