Message 251259 - Python tracker


Author	eryksun
Recipients	BreamoreBoy, amaury.forgeotdarc, belopolsky, eryksun, jcea, msmhrt, ocean-city, prikryl, vstinner
Date	2015-09-21.20:49:44
Message-id	<1442868585.23.0.356333355896.issue16322@psf.upfronthosting.co.za>
In-reply-to

Content

> local_encoding = locale.getdefaultlocale()[1]

Use locale.getpreferredencoding().

> b = eval('b' + ascii(result))
> result = b.decode(local_encoding)

It's simpler and more reliable to use 'latin-1' and 'mbcs' (ANSI). For example:

    result = result.encode('latin-1').decode('mbcs')

If setlocale(LC_CTYPE, "") is called before importing the time module, then tzname is already correct. In this case, the above is either harmless or raises a UnicodeEncodeError that can be handled. OTOH, your approach silently corrupts the value:

    >>> result = 'Střední Evropa (běžný čas)'
    >>> b = eval('b' + ascii(result))
    >>> b.decode('1251')
    'St\\u0159ednн Evropa (b\\u011b\\u017enэ \\u010das)'
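
A minimal sketch of the round trip suggested above, with the UnicodeEncodeError case handled. The helper name repair_tzname and its ansi_encoding parameter are hypothetical, not part of any API. On Windows the default 'mbcs' is the ANSI code page; the parameter exists only so the idea can be tried with an explicit code page such as 'cp1251' on other platforms:

```python
def repair_tzname(name, ansi_encoding='mbcs'):
    """Undo a Latin-1 mis-decoding of an ANSI-encoded tzname.

    Hypothetical helper. 'mbcs' is only available on Windows; pass an
    explicit code page (e.g. 'cp1251') elsewhere.
    """
    try:
        # Recover the original ANSI bytes. This fails if the name holds
        # characters outside Latin-1, which means it was never
        # mis-decoded as Latin-1 in the first place.
        raw = name.encode('latin-1')
    except UnicodeEncodeError:
        return name  # already correct, leave it alone
    return raw.decode(ansi_encoding)


# Mis-decoded Russian name is repaired.
broken = '\xc2\xf0\xe5\xec\xff \xe2 \xf4\xee\xf0\xec\xe0\xf2\xe5 UTC'
print(repair_tzname(broken, 'cp1251'))

# An already-correct Czech name is returned unchanged instead of
# being silently corrupted.
print(repair_tzname('Střední Evropa (běžný čas)', 'cp1250'))
```

With a strict codec such as 'cp1251' the final decode can still raise UnicodeDecodeError for bytes the code page leaves undefined, so a caller may want to catch that as well.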

Back to the issue. To review: on initial import of the time module, if the CRT is using the default "C" locale, there is an inconsistency in which the CRT's time functions encode/decode tzname as ANSI, while mbstowcs decodes tzname as Latin-1. (Plus, strftime in the new CRT calls wcsftime, which adds another transcoding layer to compound the mojibake goodness.)

If time.tzset were implemented on Windows, then at startup an application could set the locale (specifically LC_CTYPE for tzname, and LC_TIME for strftime) and then call time.tzset().

Example with Russian system locale:

Initially we're in the "C" locale and the CRT's tzname is in ANSI. time.tzname incorrectly decodes this as Latin-1 since that's what mbstowcs uses in the "C" locale:

    >>> time.tzname[0]
    '\xc2\xf0\xe5\xec\xff \xe2 \xf4\xee\xf0\xec\xe0\xf2\xe5 UTC'
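
The mis-decoding can be reproduced on any platform by doing the two steps by hand. This is only an illustration that assumes code page 1251 as the ANSI code page; it does not call the CRT:

```python
# What the CRT holds: the zone name encoded in the ANSI code page
# (cp1251 is assumed here, matching a Russian system locale).
ansi_bytes = 'Время в формате UTC'.encode('cp1251')

# What mbstowcs does in the "C" locale: each byte is widened to the
# code point with the same value, which is a Latin-1 decode.
mojibake = ansi_bytes.decode('latin-1')

print(ascii(mojibake))
```

The printed value matches the time.tzname[0] result shown above.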

The way the CRT's strftime is implemented compounds the problem:

    >>> time.strftime('%Z')
    'A?aiy a oi?iaoa UTC'

The CRT's strftime is implemented by calling the wide-character function, wcsftime. Just like Python, this gets a wide-character string by calling mbstowcs on the ANSI tzname. Then the CRT's strftime encodes the wide-character string back as a best-fit ANSI string, and finally time.strftime decodes the result as Latin-1 via mbstowcs. The result is mutated mojibake:

    >>> time.tzname[0].encode('mbcs', 'replace').decode('latin-1')
    'A?aiy a oi?iaoa UTC'

Ironically, Python stopped calling wcsftime on Windows because of these problems, but changes to the code since then, plus the new CRT, have brought the problem back, and worse. See my comment in issue 10653, msg243660.

Fix this by setting the locale and calling _tzset:

    >>> import ctypes, locale
    >>> locale.setlocale(locale.LC_ALL, '')
    'Russian_Russia.1251'
    >>> ctypes.cdll.ucrtbase._tzset()
    0
    >>> time.strftime('%Z')
    'Время в формате UTC'

If time.tzset were implemented on Windows, calling it would reload the time.tzname tuple.
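
Until then, an application could wrap the workaround in a small startup function. This is a rough sketch, not a tested fix: the function name init_timezone_names is made up, and the Windows branch assumes a Python build that links against the universal CRT (ucrtbase), as in the session above:

```python
import locale
import sys
import time


def init_timezone_names():
    """Set the user's locale, then ask the C runtime to reload tzname.

    Hypothetical helper. Returns a short string naming the mechanism
    used, or None if no reload was possible.
    """
    try:
        locale.setlocale(locale.LC_ALL, '')
    except locale.Error:
        # The environment names a locale that is not available;
        # carry on in the current locale.
        pass

    if sys.platform == 'win32':
        import ctypes
        # Assumes the universal CRT; older CRTs export _tzset from a
        # different DLL.
        ctypes.cdll.ucrtbase._tzset()
        return 'ucrt'
    if hasattr(time, 'tzset'):
        time.tzset()
        return 'time.tzset'
    return None


print(init_timezone_names())
```

On Windows this only refreshes the CRT's own state, so time.strftime('%Z') benefits, but the time.tzname tuple computed at import is not reloaded.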

History
Date	User	Action	Args
2015-09-21 20:49:45	eryksun	set	recipients: + eryksun, jcea, amaury.forgeotdarc, prikryl, belopolsky, vstinner, ocean-city, BreamoreBoy, msmhrt
2015-09-21 20:49:45	eryksun	set	messageid: <1442868585.23.0.356333355896.issue16322@psf.upfronthosting.co.za>
2015-09-21 20:49:45	eryksun	link	issue16322 messages
2015-09-21 20:49:44	eryksun	create