Message 269462 - Python tracker

Author	eryksun
Recipients	abarry, eryksun, ezio.melotti, martin.panter, paul.moore, r.david.murray, steve.dower, tim.golden, vstinner, zach.ware
Date	2016-06-29.04:00:09
Message-id	<1467172810.39.0.948866181211.issue26226@psf.upfronthosting.co.za>

Content

time.strftime calls the CRT's strftime function, which the Windows universal CRT implements by calling wcsftime and encoding the result. The timezone name is actually stored as a char string (tzname), so wcsftime has to decode it via mbstowcs. 

The problem is that in the C locale tzname is an ANSI (1252) string, while mbstowcs simply casts each char to wchar_t, which is the same as decoding as Latin-1. This works fine for "é" (U+00E9), which has the same value in both encodings. But the right single quote character (U+2019) is "\x92" in 1252, and a simple cast maps it to the C1 control character U+0092.
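The difference between the two decodings can be reproduced with Python's own codecs. This is only an illustration of the byte-level mismatch, not the CRT's code; the byte string is written out by hand to stand in for the tzname value.

```python
# Stand-in for the ANSI (code page 1252) bytes of the French DST zone name.
raw = b"Est (heure d\x92\xe9t\xe9)"

# Decoding as code page 1252 gives the intended text.
as_cp1252 = raw.decode("cp1252")

# Decoding as Latin-1 is equivalent to casting each byte to a wide character.
as_latin1 = raw.decode("latin-1")

# "é" (0xE9) agrees in both encodings; the quote (0x92) does not.
print(hex(ord(as_cp1252[12])))  # 0x2019
print(hex(ord(as_latin1[12])))  # 0x92
print(as_cp1252[13] == as_latin1[13] == "\xe9")  # True
```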

When the CRT's strftime encodes this back as an ANSI string, it maps U+0092 to the replacement character for 1252, a question mark. Similarly, time.tzname decodes the tzname ANSI strings using mbstowcs, with the same mismatch between LC_CTYPE and LC_TIME, resulting in the string "Est (heure d\x92été)".
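The round trip back to ANSI can be sketched the same way. Python's cp1252 codec with errors="replace" is used here as an approximation of the CRT's behavior described above; the CRT itself goes through its own conversion routines, so treat this as an analogy.

```python
# The string as it looks after the Latin-1 style cast (U+0092 in place of U+2019).
mis_decoded = "Est (heure d\x92\xe9t\xe9)"

# U+0092 has no code page 1252 byte, so the encoder substitutes "?".
round_tripped = mis_decoded.encode("cp1252", errors="replace")

print(round_tripped)  # b'Est (heure d?\xe9t\xe9)'
print(round_tripped.decode("cp1252"))  # Est (heure d?été)
```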

In summary, the problem is that LC_TIME uses ANSI in the C locale, while LC_CTYPE uses Latin-1. A workaround (in most cases) is to delay importing the time module until after setting LC_CTYPE; setting LC_TIME as well should cover all cases. For example:

    >>> import sys, locale
    >>> 'time' in sys.modules
    False
    >>> locale.setlocale(locale.LC_CTYPE, '')
    'French_France.1252'
    >>> import time
    >>> time.tzname
    ('Est', 'Est (heure d’été)')
    >>> time.strftime('%Z')
    'Est (heure d’été)'

Note that Unix Python 3 sets LC_CTYPE at startup, so doing the same on Windows would actually improve cross-platform consistency.
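In a script, the workaround could be wrapped up roughly as follows. The helper name import_time_with_user_locale is hypothetical, and the sketch assumes nothing else has imported time yet; if something has, it only reports that fact and does not try to undo it.

```python
import importlib
import locale
import sys


def import_time_with_user_locale():
    """Set LC_CTYPE and LC_TIME from the environment, then import time.

    Hypothetical helper, not part of the standard library.
    Returns (time_module, was_preloaded).
    """
    # If time is already in sys.modules, it was imported before the
    # locale was set, so the workaround may come too late.
    was_preloaded = "time" in sys.modules
    for category in (locale.LC_CTYPE, locale.LC_TIME):
        try:
            locale.setlocale(category, "")
        except locale.Error:
            # The environment names a locale the C library does not
            # support; leave this category unchanged.
            pass
    return importlib.import_module("time"), was_preloaded


time_module, was_preloaded = import_time_with_user_locale()
print(was_preloaded, time_module.tzname)
```

If was_preloaded is True, the import order has to be fixed earlier in the program for the workaround to apply.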