Message 361476 - Python tracker

Author	eryksun
Recipients	Carsten Fuchs, CharlieClark, Dominik Geldmacher, Manjusaka, eryksun, jeremy.kloth, jkloth, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Date	2020-02-06.09:02:01
Message-id	<1580979722.66.0.188032500066.issue36792@roundup.psfhosted.org>

Content

That the CRT caches the tzname strings as ANSI multibyte strings is frustrating -- whether or not it's buggy. I would expect there to be a _wtzname cache of the native OS strings that wcsftime uses directly, with no potential for failed encodings (e.g. empty strings or mojibake).

It's also strange that it encodes the time-zone name using the system ANSI codepage in the C locale. Normally LC_CTYPE in the C locale uses Latin-1, due to simple casting between WCHAR and CHAR. This leads to mojibake when the ANSI time-zone name gets decoded as Latin-1 by an internal mbstowcs call in wcsftime. I'm not saying one or the other is necessarily right, but more care should have gone into this. At the very least, if we're stuck with system ANSI tzname strings in the C locale, then a flag should be set that tells wcsftime to decode them as system ANSI strings instead of via mbstowcs.
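To make the Latin-1 point concrete, here's a minimal sketch in pure Python that simulates the mismatch without touching the CRT. The name 'Москва' and codepage 1251 are assumptions chosen only for illustration; they stand in for any time-zone name and any non-Latin system ANSI codepage.

```python
# Hypothetical time-zone name and system ANSI codepage, for illustration only.
name = 'Москва'
ansi_bytes = name.encode('cp1251')        # what gets cached as an ANSI tzname string

# Decoding byte-for-byte as Latin-1 (a simple CHAR -> WCHAR cast) gives mojibake.
as_latin1 = ansi_bytes.decode('latin-1')

# Decoding with the codepage that produced the bytes recovers the name.
as_ansi = ansi_bytes.decode('cp1251')

print(repr(as_latin1))   # 'Ìîñêâà'
print(repr(as_ansi))     # 'Москва'
```

This only models the byte-level effect; it says nothing about how wcsftime is actually implemented.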

Also, the time-zone name is determined by the preferred UI language of the current user, which is not necessarily compatible with the system ANSI codepage. It's not even necessarily compatible with the user-locale ANSI codepage, as used by setlocale(LC_CTYPE, ""). Windows 10 at least provides an option to sync the user locale with the user's preferred UI language. IMO, this is a strong argument in favor of using _wtzname wide-character strings. UI Language (MUI) and locale are not tightly coupled in Windows NLS.
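A rough way to see how loosely these settings are coupled is to query them side by side. The sketch below assumes a Windows host and uses the kernel32 functions GetACP, GetUserDefaultLCID and GetUserDefaultUILanguage via ctypes; the helper name query_nls_settings is hypothetical, and it returns None on other platforms.

```python
import ctypes
import sys

def query_nls_settings():
    """Return the ANSI codepage, user locale ID and UI language ID (Windows only)."""
    if sys.platform != 'win32':
        return None  # these NLS settings only exist on Windows
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetACP.restype = ctypes.c_uint
    kernel32.GetUserDefaultLCID.restype = ctypes.c_uint32
    kernel32.GetUserDefaultUILanguage.restype = ctypes.c_ushort
    return {
        'ansi_codepage': kernel32.GetACP(),                 # e.g. 1252
        'user_locale_id': kernel32.GetUserDefaultLCID(),
        'ui_language_id': kernel32.GetUserDefaultUILanguage(),
    }

print(query_nls_settings())
```

Nothing forces the UI language ID to correspond to a language that the reported ANSI codepage can encode.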

Here's an example where the user's preferred language is Hindi, and the time zone name is "समन्वित वैश्विक समय" (i.e. Coordinated Universal Time), but the system locale is English with codepage 1252 (for Western European languages). This is a normal configuration if the system locale doesn't have beta UTF-8 support enabled, or if the process ANSI codepage isn't overridden to UTF-8 via the "activeCodePage" manifest setting.

The tzname strings normally get set by a one-time _tzset call, and they're only reset if tzset is called manually. tzset uses the system ANSI encoding if LC_CTYPE is the "C" locale (again, normally ucrt uses Latin-1 in the "C" locale). Since encoding the Hindi time-zone name to codepage 1252 would require the default character ("?"), which is not allowed, tzset sets the tzname strings to empty strings.

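    import ctypes, locale, time
    # Load the Universal CRT to inspect its cached tzname strings directly.
    ucrt = ctypes.CDLL('ucrtbase', use_errno=True)
    # __tzname() returns a pointer to the two-element array of char* names.
    ucrt.__tzname.restype = ctypes.POINTER(ctypes.c_char_p)
    tzname = ucrt.__tzname()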

    >>> locale.setlocale(locale.LC_CTYPE, 'C')
    'C'
    >>> ucrt._tzset()
    0
    >>> tzname[0], tzname[1]
    (b'', b'')
    >>> time.strftime('%Z')
    ''
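The empty strings above follow from the fact that codepage 1252 cannot represent any Devanagari character. As a rough pure-Python analogue (Python's codec error handlers are only an approximation of the CRT's conversion, not the same code path), a strict encode fails outright and a lossy encode replaces every letter with the default character:

```python
name = 'समन्वित वैश्विक समय'

# Strict conversion: no default character allowed, so the encode fails.
try:
    name.encode('cp1252')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Lossy conversion: every unencodable character becomes '?'.
replaced = name.encode('cp1252', errors='replace')

print(strict_ok)   # False
print(replaced)
```

The lossy form is the same shape as the '?' output that strftime produces later in this example.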

If we update the LC_CTYPE category to use UTF-8, the cached tzname value doesn't get automatically updated, and strftime still returns an empty string:

    >>> locale.setlocale(locale.LC_CTYPE, '.utf8')
    'Hindi_India.utf8'

    >>> tzname[0], tzname[1]
    (b'', b'')
    >>> time.strftime('%Z')
    ''

The tzname values get updated if we manually call tzset:

    >>> ucrt._tzset()
    0
    >>> tzname[0].decode('utf-8'), tzname[1].decode('utf-8')
    ('समन्वित वैश्विक समय', 'समन्वित वैश्विक समय')

However, LC_TIME is still in the "C" locale. strftime uses system ANSI (1252) in this case, so the encoded result from the CRT strftime call ends up using the default character (?):

    >>> time.strftime('%Z')
    '??????? ??????? ???'

If we set LC_TIME to UTF-8, we finally get a valid result:

    >>> locale.setlocale(locale.LC_TIME, '.utf8')
    'Hindi_India.utf8'
    >>> time.strftime('%Z')
    'समन्वित वैश्विक समय'

We wouldn't have to worry about LC_TIME here if Python called C wcsftime instead of C strftime. The problem that bpo-10653 was trying to work around is a design flaw in the C runtime library, and calling strftime is not a solution. 

Here's a variation on my example in msg243660, continuing with the current Hindi example. The setup in this example uses UTF-8 as the system ANSI codepage (via python_utf8.exe) and sets LC_CTYPE to the "C" locale. This yields the following monstrosity:

    >>> time.strftime('%Z')
    'Ã\xa0Â¤Â¸Ã\xa0Â¤Â®Ã\xa0Â¤Â¨Ã\xa0Â¥Â\x8dÃ\xa0Â¤ÂµÃ\xa0Â¤Â¿Ã\xa0Â¤Â¤ Ã\xa0Â¤ÂµÃ\xa0Â¥Â\x88Ã\xa0Â¤Â¶Ã\xa0Â¥Â\x8dÃ\xa0Â¤ÂµÃ\xa0Â¤Â¿Ã\xa0Â¤Â\x95 Ã\xa0Â¤Â¸Ã\xa0Â¤Â®Ã\xa0Â¤Â¯'

It's due to the following sequence of encoding and decoding operations:

    >>> mbs_lcctype_utf8 = 'समन्वित वैश्विक समय'.encode('utf-8')
    >>> wcs_lcctype_latin1 = mbs_lcctype_utf8.decode('latin-1')
    >>> mbs_lctime_utf8 = wcs_lcctype_latin1.encode('utf-8')

This last one is from PyUnicode_DecodeLocaleAndSize and mbstowcs:

    >>> py_str_lcctype_latin1 = mbs_lctime_utf8.decode('latin-1')
    >>> py_str_lcctype_latin1 == time.strftime('%Z')
    True
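Since each step in that chain is lossless, the garbled string can in principle be unwound by applying the inverse operations in reverse order. This is only a sketch of the round trip, using a hypothetical helper name undo_double_mojibake; it's not a suggested fix for strftime, and it only works if no bytes were dropped or replaced along the way.

```python
def undo_double_mojibake(text):
    """Reverse two rounds of 'UTF-8 bytes decoded as Latin-1'."""
    once = text.encode('latin-1').decode('utf-8')   # undo the final Latin-1 decode
    return once.encode('latin-1').decode('utf-8')   # undo the first Latin-1 decode

name = 'समन्वित वैश्विक समय'

# Reproduce the sequence of operations described above.
garbled = (name.encode('utf-8').decode('latin-1')
               .encode('utf-8').decode('latin-1'))

print(undo_double_mojibake(garbled) == name)   # True
```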

History
Date	User	Action	Args
2021-03-27 05:23:41	eryksun	unlink	issue36792 messages
2020-02-06 09:02:02	eryksun	set	recipients: + eryksun, paul.moore, vstinner, tim.golden, jkloth, jeremy.kloth, zach.ware, steve.dower, Manjusaka, CharlieClark, Dominik Geldmacher, Carsten Fuchs
2020-02-06 09:02:02	eryksun	set	messageid: <1580979722.66.0.188032500066.issue36792@roundup.psfhosted.org>
2020-02-06 09:02:02	eryksun	link	issue36792 messages
2020-02-06 09:02:01	eryksun	create