classification
Title: time.strftime() and Unicode characters on Windows
Type: behavior Stage:
Components: Library (Lib), Unicode, Windows Versions: Python 3.6, Python 3.4, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: AndiDog, belopolsky, eric.smith, eryksun, ezio.melotti, shimizukawa, terry.reedy, vstinner
Priority: normal Keywords:

Created on 2010-04-03 15:08 by AndiDog, last changed 2019-02-24 22:16 by BreamoreBoy.

Messages (13)
msg102269 - Author: (AndiDog) Date: 2010-04-03 15:08
There is inconsistent behavior in time.strftime, comparing Python 2.6 and 3.1. In 3.1, non-ASCII Unicode characters seem to get dropped whereas in 2.6 you can keep them using the necessary Unicode-to-UTF8 workaround.

This should be fixed if it isn't intended behavior.

Python 2.6

>>> time.strftime(u"%d\u200F%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03\u200fSaturday'
>>> time.strftime(u"%d\u0041%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03ASaturday'

Python 3.1

>>> time.strftime("%d\u200F%A", time.gmtime())
''
>>> len(time.strftime("%d\u200F%A", time.gmtime()))
0
>>> time.strftime("%d\u0041%A", time.gmtime())
'03ASaturday'
msg102298 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-03 21:49
This seems to be fixed now, on both 3.1 and 3.2.
Can you try with 3.1.2 and see if it works?
What operating system are you using?
msg102310 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-04-04 00:08
Actually the bug seems related to Windows.
msg102332 - Author: (AndiDog) Date: 2010-04-04 11:33
Just installed Python 3.1.2, same problem. I'm using Windows XP SP2 with two Python installations (2.6.4 and now 3.1.2).
msg102335 - Author: (AndiDog) Date: 2010-04-04 12:07
Definitely a Windows problem. I did this on Visual Studio 2008:

    wchar_t out[1000];
    time_t currentTime;
    time(&currentTime);
    tm *timeStruct = gmtime(&currentTime);

    size_t ret = wcsftime(out, 1000, L"%d%A", timeStruct);
    wprintf(L"ret = %d, out = (%s)\n", ret, out);

    ret = wcsftime(out, 1000, L"%d\u200f%A", timeStruct);
    wprintf(L"ret = %d, out = (%s)\n", ret, out);

and the output was

    ret = 8, out = (04Sunday)
    ret = 0, out = ()

Python really shouldn't use any so-called standard functions on Windows. They never work as expected ^^...
msg159341 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-04-25 22:58
> Actually the bug seems related to Windows.

See also issue #10653: wcsftime() doesn't format time zones correctly, so Python 3 uses strftime() instead.
msg222667 - Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-10 14:08
Using 3.4.1 and 3.5.0 I get:

time.strftime("%d\u200F%A", time.gmtime())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\u200f' in position 2: Illegal byte sequence
msg226114 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-08-30 02:14
I verified Mark's 3.4.1 result with Idle.

It strikes me as a bug that a function that maps a unicode format string to a unicode string with interpolations added should ever encode the format to bytes, let alone using an encoding that fails or loses information.  It is especially weird given that % formatting does not even work (at present) for bytes.

It seems to me that strftime should never encode the non-special parts of the format text.  Instead, it could split the format (re.split) into a list of alternating '%x' pairs and running text segments, replace the '%x' entries with the proper entries, and return the list joined back into a string. Some replacements would be locale dependent, others not.

(Just wondering, are the locale names of days and months restricted to ASCII bytes, or are they unrestricted Unicode using native characters?)
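
The split-and-join idea described in this message could be sketched roughly as follows. This is only an illustration, not what CPython does: the name strftime_split and the regular expression are hypothetical, and the pattern assumes single-letter directives only (platform-specific flags such as "%-d" or "%#d" are not handled).

```python
import re
import time

# Hypothetical pattern: a literal "%%" or a single-letter directive like "%d".
_DIRECTIVE = re.compile(r"(%%|%[A-Za-z])")


def strftime_split(fmt, t=None):
    """Format only the %x directives and pass literal text through untouched.

    Because the literal segments never reach time.strftime(), they are never
    encoded to the locale encoding.
    """
    if t is None:
        t = time.localtime()
    parts = []
    # With a capturing group, re.split returns text and directives alternately.
    for piece in _DIRECTIVE.split(fmt):
        if piece == "%%":
            parts.append("%")
        elif _DIRECTIVE.fullmatch(piece):
            parts.append(time.strftime(piece, t))
        else:
            parts.append(piece)
    return "".join(parts)
```

With this sketch, a call such as strftime_split("%d\u200F%A", time.gmtime()) only ever hands "%d" and "%A" to time.strftime(). Locale-dependent replacements (day and month names) would still come from the C library, so any decoding problems on that side are not addressed here.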
msg251554 - Author: Mark Lawrence (BreamoreBoy) * Date: 2015-09-24 22:37
@Alexander what is your take on this please?  I can confirm that it is still a problem on Windows in 3.5.0.
msg251558 - Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2015-09-25 00:22
Mark, I am no expert on Windows.  I believe Victor is most knowledgeable in this area.
msg251560 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-09-25 01:05
The problem is definitely that:
format = PyUnicode_EncodeLocale(format_arg, "surrogateescape");
fails on Windows.

Windows is using strftime, not wcsftime. It's not using wcsftime because of issue 10653.

If I force Windows to use wcsftime, this particular example works:
>>> time.strftime("%d\u200F%A", time.gmtime())
'25\u200fFriday'

I haven't looked at issue 10653 enough to understand if it's still a problem with the new Visual C++. Maybe it is: I only tested with my default US locale.
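
The encoding failure described in this message can be approximated in pure Python. The sketch below is an analogy only: PyUnicode_EncodeLocale is a C API that uses the current locale encoding, whereas this example assumes an explicit ANSI code page (cp1252 by default), and the helper name is hypothetical.

```python
def encodable_in_codepage(fmt, encoding="cp1252"):
    """Return True if the format string survives encoding to the code page.

    The default "cp1252" is an assumption standing in for a typical Western
    European ANSI code page; it is not read from the running system.
    """
    try:
        fmt.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# U+200F (RIGHT-TO-LEFT MARK) has no cp1252 representation, "A" does.
print(encodable_in_codepage("%d\u200F%A"))
print(encodable_in_codepage("%d\u0041%A"))
```

Under these assumptions the first format cannot be encoded and the second can, which lines up with the two results in the original report.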
msg255043 - Author: Takayuki SHIMIZUKAWA (shimizukawa) Date: 2015-11-21 05:41
I've implemented a workaround for Sphinx:


>>> time.strftime(u'%Y 年'.encode('unicode-escape').decode(), *args).encode().decode('unicode-escape')
2015 年

https://github.com/sphinx-doc/sphinx/blob/8ae43b9fd/sphinx/util/osutil.py#L175
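
The one-liner above could be wrapped in a small helper along these lines. This is a sketch of the same unicode-escape round trip, not the Sphinx code itself; the function name is hypothetical.

```python
import time


def strftime_unicode_escape(fmt, t=None):
    """Smuggle non-ASCII format text through strftime as ASCII escapes."""
    if t is None:
        t = time.localtime()
    # '%Y 年' becomes the pure-ASCII string '%Y \\u5e74'.
    escaped = fmt.encode("unicode-escape").decode("ascii")
    result = time.strftime(escaped, t)
    # Turn the escape sequences back into the original characters.
    return result.encode().decode("unicode-escape")
```

A call such as strftime_unicode_escape('%Y 年', time.gmtime()) then passes only ASCII to the C library. One caveat to keep in mind: the unicode-escape codec decodes non-escape bytes as Latin-1, so if strftime itself returns non-ASCII text (for example localized day names), the final step may garble it.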
msg255133 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-11-23 07:05
The problem from issue 10653 is that internally the CRT encodes the time zone name using the ANSI codepage (i.e. the default system codepage). wcsftime decodes this string using mbstowcs (i.e. multibyte string to wide-character string), which uses Latin-1 in the C locale. In other words, in the C locale on Windows, mbstowcs just casts the byte values to wchar_t. 

With the new Universal CRT, strftime is implemented by calling wcsftime, so the accepted solution for issue 10653 is broken in 3.5+. A simple way around the problem is to switch back to using wcsftime and temporarily (or permanently) set the thread's LC_CTYPE locale to the system default. This makes the internal mbstowcs call use the ANSI codepage. Note that on POSIX platforms 3.x already sets the default via setlocale(LC_CTYPE, "") in Python/pylifecycle.c. Why not set this for all platforms that have setlocale?

> I only tested with my default US locale.

If your system locale uses codepage 1252 (a superset of Latin-1), then you can still test this on a per thread basis if your system has additional language packs. For example:

    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    if kernel32.GetModuleHandleW('ucrtbased'):  # debug build
        crt = ctypes.CDLL('ucrtbased', use_errno=True)
    else:
        crt = ctypes.CDLL('ucrtbase', use_errno=True)

    MUI_LANGUAGE_NAME = 8
    LC_CTYPE = 2

    class tm(ctypes.Structure):
        pass

    crt._gmtime64.restype = ctypes.POINTER(tm)

    # set a Russian locale for the current thread    
    kernel32.SetThreadPreferredUILanguages(MUI_LANGUAGE_NAME,
                                           'ru-RU\0', None)
    crt._wsetlocale(LC_CTYPE, 'ru-RU')
    # update the time zone name based on the thread locale
    crt._tzset() 

    # get a struct tm *
    ltime = ctypes.c_int64()
    crt._time64(ctypes.byref(ltime))
    tmptr = crt._gmtime64(ctypes.byref(ltime))

    # call wcsftime using C and Russian locales 
    buf = (ctypes.c_wchar * 100)()
    crt._wsetlocale(LC_CTYPE, 'C')
    size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
    tz1 = buf[:size]
    crt._wsetlocale(LC_CTYPE, 'ru-RU')
    size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
    tz2 = buf[:size]

    hcon = kernel32.GetStdHandle(-11)
    pn = ctypes.pointer(ctypes.c_uint())

    >>> _ = kernel32.WriteConsoleW(hcon, tz1, len(tz1), pn, None)
    Âðåìÿ â ôîðìàòå UTC
    >>> _ = kernel32.WriteConsoleW(hcon, tz2, len(tz2), pn, None)
    Время в формате UTC

The first result demonstrates the ANSI => Latin-1 mojibake problem in the C locale. You can encode this result as Latin-1 and then decode it back as codepage 1251:

    >>> tz1.encode('latin-1').decode('1251') == tz2
    True

But transcoding isn't a general workaround since the format string shouldn't be restricted to ANSI, unless you can smuggle the Unicode through like Takayuki showed.
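
The ANSI => Latin-1 mojibake and its repair can be reproduced without the CRT, on any platform, using only Python's codecs. This sketch takes the Russian time zone string from the output above and assumes codepage 1251 as the ANSI codepage, as in that example.

```python
original = "Время в формате UTC"

# The CRT stores the name encoded in the ANSI codepage (1251 assumed here).
ansi_bytes = original.encode("cp1251")

# In the C locale, mbstowcs effectively casts each byte to a wchar_t,
# which is the same as decoding the bytes as Latin-1.
mojibake = ansi_bytes.decode("latin-1")
print(mojibake)

# Reversing the cast and decoding with the right codepage repairs the text.
repaired = mojibake.encode("latin-1").decode("cp1251")
print(repaired == original)
```

The round trip only works because every byte value maps to exactly one Latin-1 character; it does nothing for format strings containing characters outside the ANSI codepage.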
History
Date                 User          Action  Args
2019-02-24 22:16:35  BreamoreBoy   set     nosy: - BreamoreBoy
2015-11-23 07:05:54  eryksun       set     nosy: + eryksun
                                           messages: + msg255133
                                           versions: + Python 3.6
2015-11-21 05:41:43  shimizukawa   set     nosy: + shimizukawa
                                           messages: + msg255043
2015-09-25 01:05:09  eric.smith    set     messages: + msg251560
2015-09-25 00:22:24  belopolsky    set     messages: + msg251558
2015-09-24 22:37:09  BreamoreBoy   set     nosy: + belopolsky
                                           messages: + msg251554
2014-10-01 00:36:52  vstinner      set     title: strftime and Unicode characters -> time.strftime() and Unicode characters on Windows
2014-08-30 02:14:42  terry.reedy   set     nosy: + terry.reedy
                                           messages: + msg226114
                                           versions: + Python 3.4, Python 3.5, - Python 3.1, Python 3.2
2014-07-10 14:08:50  BreamoreBoy   set     nosy: + BreamoreBoy
                                           messages: + msg222667
2012-04-25 23:00:30  brian.curtin  set     nosy: - brian.curtin
2012-04-25 22:58:12  vstinner      set     nosy: + vstinner
                                           messages: + msg159341
2010-04-04 13:54:05  brian.curtin  set     versions: - Python 3.3
2010-04-04 12:07:22  AndiDog       set     messages: + msg102335
                                           versions: + Python 3.3
2010-04-04 11:33:11  AndiDog       set     messages: + msg102332
2010-04-04 00:08:45  ezio.melotti  set     status: pending -> open
                                           versions: + Python 3.2
                                           nosy: + brian.curtin
                                           messages: + msg102310
                                           components: + Windows
2010-04-03 21:49:46  ezio.melotti  set     status: open -> pending
                                           priority: normal
                                           nosy: + ezio.melotti
                                           versions: - Python 2.6
                                           messages: + msg102298
2010-04-03 16:05:03  eric.smith    set     nosy: + eric.smith
2010-04-03 15:08:42  AndiDog       create