classification
Title: time.strftime() and Unicode characters on Windows
Type: behavior Stage:
Components: Extension Modules, Library (Lib), Unicode, Windows Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: AndiDog_old, belopolsky, eric.smith, eryksun, ezio.melotti, paul.moore, shimizukawa, steve.dower, terry.reedy, tim.golden, vstinner, zach.ware
Priority: normal Keywords:

Created on 2010-04-03 15:08 by AndiDog_old, last changed 2021-03-08 19:17 by eryksun.

Messages (16)
msg102269 - (view) Author: Andidog_old (AndiDog_old) Date: 2010-04-03 15:08
There is inconsistent behavior in time.strftime, comparing Python 2.6 and 3.1. In 3.1, non-ASCII Unicode characters seem to get dropped whereas in 2.6 you can keep them using the necessary Unicode-to-UTF8 workaround.

This should be fixed if it isn't intended behavior.

Python 2.6

>>> time.strftime(u"%d\u200F%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03\u200fSaturday'
>>> time.strftime(u"%d\u0041%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03ASaturday'

Python 3.1

>>> time.strftime("%d\u200F%A", time.gmtime())
''
>>> len(time.strftime("%d\u200F%A", time.gmtime()))
0
>>> time.strftime("%d\u0041%A", time.gmtime())
'03ASaturday'
msg102298 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-03 21:49
This seems to be fixed now, on both 3.1 and 3.2.
Can you try with 3.1.2 and see if it works?
What operating system are you using?
msg102310 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-04 00:08
Actually the bug seems related to Windows.
msg102332 - (view) Author: Andidog_old (AndiDog_old) Date: 2010-04-04 11:33
Just installed Python 3.1.2, same problem. I'm using Windows XP SP2 with two Python installations (2.6.4 and now 3.1.2).
msg102335 - (view) Author: Andidog_old (AndiDog_old) Date: 2010-04-04 12:07
Definitely a Windows problem. I did this on Visual Studio 2008:

    wchar_t out[1000];
    time_t currentTime;
    time(&currentTime);
    tm *timeStruct = gmtime(&currentTime);

    size_t ret = wcsftime(out, 1000, L"%d%A", timeStruct);
    wprintf(L"ret = %d, out = (%s)\n", ret, out);

    ret = wcsftime(out, 1000, L"%d\u200f%A", timeStruct);
    wprintf(L"ret = %d, out = (%s)\n", ret, out);

and the output was

    ret = 8, out = (04Sunday)
    ret = 0, out = ()

Python really shouldn't use any so-called standard functions on Windows. They never work as expected ^^...
msg159341 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-04-25 22:58
> Actually the bug seems related to Windows.

See also issue #10653: wcsftime() doesn't format time zones correctly, so Python 3 uses strftime() instead.
msg222667 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-10 14:08
Using 3.4.1 and 3.5.0 I get:-

time.strftime("%d\u200F%A", time.gmtime())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\u200f' in position 2: Illegal byte sequence
msg226114 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-08-30 02:14
I verified Mark's 3.4.1 result with IDLE.

It strikes me as a bug that a function that maps a unicode format string to a unicode string with interpolations added should ever encode the format to bytes, let alone using an encoding that fails or loses information.  It is especially weird given that % formatting does not even work (at present) for bytes.

It seems to me that strftime should never encode the non-special parts of the format text.  Instead, it could split the format (re.split) into a list of alternating '%x' pairs and running text segments, replace the '%x' entries with the proper entries, and return the list joined back into a string. Some replacements would be locale dependent, others not.

(Just wondering, are the locale names of days and months restricted to ASCII bytes, or are they unrestricted Unicode using native characters?)
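
The splitting approach described above could be sketched roughly as follows. This is only an illustration, not the fix adopted by CPython: the helper name `strftime_split` and the directive pattern are assumptions made for the example, and platform-specific flags such as `%-d` or `%#d` are not handled.

```python
import re
import time

# Hypothetical pattern: a directive is '%' followed by one letter, or '%%'.
_DIRECTIVE = re.compile(r"(%[A-Za-z%])")

def strftime_split(fmt, t=None):
    """Format only the '%x' directives; pass literal text through untouched."""
    if t is None:
        t = time.localtime()
    # With one capturing group, re.split alternates literal text (even
    # indexes) and matched directives (odd indexes).
    pieces = _DIRECTIVE.split(fmt)
    out = []
    for index, piece in enumerate(pieces):
        if index % 2:
            # Each directive is pure ASCII, so the locale encoding of the
            # surrounding text never comes into play.
            out.append(time.strftime(piece, t))
        else:
            out.append(piece)
    return "".join(out)

print(ascii(strftime_split("%d\u200F%A", time.gmtime())))
```

The literal segments never reach the C library, so characters such as U+200F survive regardless of the locale encoding. Locale-dependent directives such as %A still go through time.strftime and keep whatever limitations it has.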
msg251554 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2015-09-24 22:37
@Alexander what is your take on this please?  I can confirm that it is still a problem on Windows in 3.5.0.
msg251558 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2015-09-25 00:22
Mark, I am no expert on Windows.  I believe Victor is most knowledgeable in this area.
msg251560 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-09-25 01:05
The problem is definitely that:
format = PyUnicode_EncodeLocale(format_arg, "surrogateescape");
fails on Windows.
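
The failure can be approximated in pure Python. The sketch below uses Python's own codecs as a stand-in for the C-level locale encoding, which is an assumption: PyUnicode_EncodeLocale goes through the C runtime, not these codecs. The helper name `can_encode_format` is hypothetical.

```python
def can_encode_format(fmt, encoding):
    """Return True if fmt can be encoded with the given codec."""
    try:
        # "surrogateescape" only helps with lone surrogates (U+DC80..U+DCFF);
        # it does not rescue ordinary characters the codec cannot represent.
        fmt.encode(encoding, "surrogateescape")
    except UnicodeEncodeError:
        return False
    return True

fmt = "%d\u200F%A"
print(can_encode_format(fmt, "cp1252"))  # code page 1252 has no U+200F
print(can_encode_format(fmt, "utf-8"))   # UTF-8 can represent any code point
```

This prints False and then True, matching the observation that the format only breaks when the locale encoding cannot represent one of its characters.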

Windows is using strftime, not wcsftime. It's not using wcsftime because of issue 10653.

If I force Windows to use wcsftime, this particular example works:
>>> time.strftime("%d\u200F%A", time.gmtime())
'25\u200fFriday'

I haven't looked at issue 10653 enough to understand if it's still a problem with the new Visual C++. Maybe it is: I only tested with my default US locale.
msg255043 - (view) Author: Takayuki SHIMIZUKAWA (shimizukawa) Date: 2015-11-21 05:41
I've implemented a workaround for Sphinx:


>>> time.strftime(u'%Y 年'.encode('unicode-escape').decode(), *args).encode().decode('unicode-escape')
2015 年

https://github.com/sphinx-doc/sphinx/blob/8ae43b9fd/sphinx/util/osutil.py#L175
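
For readers who want to reuse the idea, the one-liner above can be wrapped in a small function. This is a sketch and not the Sphinx code itself: the name `strftime_escaped` is hypothetical, and unlike the original it encodes the result as ASCII, so that a non-ASCII locale string (for example a localized day name) raises UnicodeEncodeError rather than being silently misdecoded.

```python
import time

def strftime_escaped(fmt, t=None):
    """Smuggle non-ASCII format text through strftime as \\uXXXX escapes."""
    if t is None:
        t = time.localtime()
    # '%Y 年' becomes the pure-ASCII text '%Y \\u5e74'.
    ascii_fmt = fmt.encode("unicode-escape").decode("ascii")
    result = time.strftime(ascii_fmt, t)
    # Assumption: strftime output is ASCII here. Decoding with
    # "unicode-escape" would treat any other bytes as Latin-1.
    return result.encode("ascii").decode("unicode-escape")

print(strftime_escaped("%Y \u5e74", time.gmtime()))
```

The trade-off is that the format string becomes fully portable while locale-dependent directives are only safe when they produce ASCII output.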
msg255133 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-11-23 07:05
The problem from issue 10653 is that internally the CRT encodes the time zone name using the ANSI codepage (i.e. the default system codepage). wcsftime decodes this string using mbstowcs (i.e. multibyte string to wide-character string), which uses Latin-1 in the C locale. In other words, in the C locale on Windows, mbstowcs just casts the byte values to wchar_t. 

With the new Universal CRT, strftime is implemented by calling wcsftime, so the accepted solution for issue 10653 is broken in 3.5+. A simple way around the problem is to switch back to using wcsftime and temporarily (or permanently) set the thread's LC_CTYPE locale to the system default. This makes the internal mbstowcs call use the ANSI codepage. Note that on POSIX platforms 3.x already sets the default via setlocale(LC_CTYPE, "") in Python/pylifecycle.c. Why not set this for all platforms that have setlocale?

> I only tested with my default US locale.

If your system locale uses codepage 1252 (a superset of Latin-1), then you can still test this on a per thread basis if your system has additional language packs. For example:

    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    if kernel32.GetModuleHandleW('ucrtbased'):  # debug build
        crt = ctypes.CDLL('ucrtbased', use_errno=True)
    else:
        crt = ctypes.CDLL('ucrtbase', use_errno=True)

    MUI_LANGUAGE_NAME = 8
    LC_CTYPE = 2

    class tm(ctypes.Structure):
        pass

    crt._gmtime64.restype = ctypes.POINTER(tm)

    # set a Russian locale for the current thread    
    kernel32.SetThreadPreferredUILanguages(MUI_LANGUAGE_NAME,
                                           'ru-RU\0', None)
    crt._wsetlocale(LC_CTYPE, 'ru-RU')
    # update the time zone name based on the thread locale
    crt._tzset() 

    # get a struct tm *
    ltime = ctypes.c_int64()
    crt._time64(ctypes.byref(ltime))
    tmptr = crt._gmtime64(ctypes.byref(ltime))

    # call wcsftime using C and Russian locales 
    buf = (ctypes.c_wchar * 100)()
    crt._wsetlocale(LC_CTYPE, 'C')
    size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
    tz1 = buf[:size]
    crt._wsetlocale(LC_CTYPE, 'ru-RU')
    size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
    tz2 = buf[:size]

    hcon = kernel32.GetStdHandle(-11)
    pn = ctypes.pointer(ctypes.c_uint())

    >>> _ = kernel32.WriteConsoleW(hcon, tz1, len(tz1), pn, None)
    Âðåìÿ â ôîðìàòå UTC
    >>> _ = kernel32.WriteConsoleW(hcon, tz2, len(tz2), pn, None)
    Время в формате UTC

The first result demonstrates the ANSI => Latin-1 mojibake problem in the C locale. You can encode this result as Latin-1 and then decode it back as codepage 1251:

    >>> tz1.encode('latin-1').decode('1251') == tz2
    True

But transcoding isn't a general workaround since the format string shouldn't be restricted to ANSI, unless you can smuggle the Unicode through like Takayuki showed.
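
The ANSI => Latin-1 mojibake can also be reproduced without ctypes or a Russian language pack. The sketch below simulates what the CRT does, using Python codecs in place of mbstowcs; the variable names are illustrative only.

```python
# The time zone name as the CRT stores it internally: ANSI (cp1251) bytes.
native = "Время в формате UTC"
ansi_bytes = native.encode("cp1251")

# In the C locale, mbstowcs effectively casts each byte to wchar_t,
# which is the same as decoding the bytes as Latin-1.
mojibake = ansi_bytes.decode("latin-1")
print(mojibake)

# Transcoding back recovers the original text.
repaired = mojibake.encode("latin-1").decode("cp1251")
print(repaired == native)
```

The first line printed is the same garbled string as tz1 above, and the round trip succeeds only because every byte maps to exactly one Latin-1 character.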
msg388241 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-03-07 16:09
Update since msg255133:

Python 3.8+ now calls setlocale(LC_CTYPE, "") at startup in Windows, as 3.x has always done in POSIX. So decoding the output of C strftime("%Z") with PyUnicode_DecodeLocaleAndSize() 'works' again, since both default to the process code page. The latter is usually the system code page, unless overridden to UTF-8 in the application manifest.

But calling C strftime() as a workaround is still a fragile solution, since it requires that the process code page is able to encode the process or thread UI language. In general, the system code page, the current user locale, and current user preferred language are independent settings in Windows. 

Calling C strftime() also unnecessarily limits the format string to characters in the current LC_CTYPE locale encoding, which requires hacky workarounds.

Starting with Windows 10 v2004 (build 19041), ucrt uses an internal wide-character version of the time-zone name that gets returned by an internal __wide_tzname() call and used for "%Z" in wcsftime(). The wide-character value gets updated by _tzset() and kept in sync with _tzname.

If Python switched to using wcsftime() in Windows 10 2004+, then the current locale encoding would no longer be a problem for any UI language. 

Also, bpo-36779 switched to setting time.tzname by directly calling WinAPI GetTimeZoneInformation(). time.tzset() should be implemented in order to keep the value of time.tzname in sync with time.strftime("%Z").
msg388277 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-08 18:17
>  time.tzset() should be implemented

I'm not sure of what you mean. The function is implemented:

static PyObject *
time_tzset(PyObject *self, PyObject *unused)
{
    PyObject* m;

    m = PyImport_ImportModuleNoBlock("time");
    if (m == NULL) {
        return NULL;
    }

    tzset();

    /* Reset timezone, altzone, daylight and tzname */
    if (init_timezone(m) < 0) {
         return NULL;
    }
    Py_DECREF(m);
    if (PyErr_Occurred())
        return NULL;

    Py_RETURN_NONE;
}
msg388286 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-03-08 19:17
> I'm not sure of what you mean. The function is implemented:

My comment was limited to Windows, for which time.tzset() has never been implemented. Since Python has its own implementation of time.tzname in Windows, it should also implement time.tzset() to allow refreshing the value. Actually, ucrt implements C _tzset(), so the implementation of time.tzset() in Windows also has to call C _tzset() to update _tzname (and also ucrt's new private __wide_tzname), in addition to calling GetTimeZoneInformation() to update its own time.tzname value. 

Another difference between Python's time.tzname and C strftime("%Z") is that ucrt will use the TZ environment variable, but Python's implementation of time.tzname in Windows does not.
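
One way to observe whether the two sources of the time zone name agree on a given system is to print them side by side. This is a minimal sketch; the helper name `timezone_name_report` is hypothetical, and the values depend entirely on the platform, the locale, and the TZ environment variable.

```python
import time

def timezone_name_report(t=None):
    """Collect the time zone names reported by time.tzname and by %Z."""
    if t is None:
        t = time.localtime()
    return {
        # (standard name, DST name), set when the time module is initialized
        "tzname": time.tzname,
        # name produced by the C library's strftime for this struct_time
        "strftime_Z": time.strftime("%Z", t),
    }

report = timezone_name_report()
print(report["tzname"])
print(report["strftime_Z"])
```

On Windows, per the message above, setting TZ before starting Python may change the %Z result without changing time.tzname.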
History
Date User Action Args
2021-03-08 19:17:29  eryksun  set  messages: + msg388286
2021-03-08 18:17:58  vstinner  set  messages: + msg388277
2021-03-07 16:09:30  eryksun  set  versions: + Python 3.8, Python 3.9, Python 3.10, - Python 3.4, Python 3.5, Python 3.6
    nosy: + paul.moore, tim.golden, zach.ware, steve.dower
    messages: + msg388241
    components: + Extension Modules
2019-02-24 22:16:35  BreamoreBoy  set  nosy: - BreamoreBoy
2015-11-23 07:05:54  eryksun  set  nosy: + eryksun
    messages: + msg255133
    versions: + Python 3.6
2015-11-21 05:41:43  shimizukawa  set  nosy: + shimizukawa
    messages: + msg255043
2015-09-25 01:05:09  eric.smith  set  messages: + msg251560
2015-09-25 00:22:24  belopolsky  set  messages: + msg251558
2015-09-24 22:37:09  BreamoreBoy  set  nosy: + belopolsky
    messages: + msg251554
2014-10-01 00:36:52  vstinner  set  title: strftime and Unicode characters -> time.strftime() and Unicode characters on Windows
2014-08-30 02:14:42  terry.reedy  set  nosy: + terry.reedy
    messages: + msg226114
    versions: + Python 3.4, Python 3.5, - Python 3.1, Python 3.2
2014-07-10 14:08:50  BreamoreBoy  set  nosy: + BreamoreBoy
    messages: + msg222667
2012-04-25 23:00:30  brian.curtin  set  nosy: - brian.curtin
2012-04-25 22:58:12  vstinner  set  nosy: + vstinner
    messages: + msg159341
2010-04-04 13:54:05  brian.curtin  set  versions: - Python 3.3
2010-04-04 12:07:22  AndiDog_old  set  messages: + msg102335
    versions: + Python 3.3
2010-04-04 11:33:11  AndiDog_old  set  messages: + msg102332
2010-04-04 00:08:45  ezio.melotti  set  status: pending -> open
    versions: + Python 3.2
    nosy: + brian.curtin
    messages: + msg102310
    components: + Windows
2010-04-03 21:49:46  ezio.melotti  set  status: open -> pending
    priority: normal
    nosy: + ezio.melotti
    versions: - Python 2.6
    messages: + msg102298
2010-04-03 16:05:03  eric.smith  set  nosy: + eric.smith
2010-04-03 15:08:42  AndiDog_old  create