classification
Title: Locale-based date formatting crashes on non-ASCII data
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.0, Python 3.1
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: time.strftime() always decodes result with UTF-8 (issue 3061)
Assigned To:
Nosy List: loewis, pitrou
Priority: high
Keywords:

Created on 2009-03-25 19:46 by pitrou, last changed 2009-05-29 16:37 by loewis. This issue is now closed.

Messages (4)
msg84163 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-25 19:46
Locale-based date formatting in py3k (using strftime) crashes when asked
to format a month name (or day, I assume) containing non-ASCII characters:

>>> import time
>>> import locale
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
'February'
>>> locale.setlocale(locale.LC_TIME, "fr_FR")
'fr_FR'
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3:
invalid data

It works if I specify the encoding explicitly in the locale name so as
to coincide with the encoding specified in the error message above (but
that's assuming the given encoding-specific locale *is* installed):

>>> locale.setlocale(locale.LC_TIME, "fr_FR.UTF-8")
'fr_FR.UTF-8'
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
'février'
msg84164 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-25 19:48
(However, if I explicitly set another encoding, it does not work:

>>> locale.setlocale(locale.LC_TIME, "fr_FR.ISO-8859-1")
'fr_FR.ISO-8859-1'
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3:
invalid data

)
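The failure in the tracebacks above can be illustrated without any French locale installed. This is only a sketch: it assumes that the C library returns the month name as ISO-8859-1 bytes under "fr_FR" and "fr_FR.ISO-8859-1", which matches the reported error positions but is not confirmed in the report itself.

```python
# Assumed bytes returned by the C strftime() for "%B" under an
# ISO-8859-1 French locale (an assumption, see the note above).
raw = "février".encode("iso-8859-1")  # b'f\xe9vrier'

# Decoding those bytes as UTF-8 fails: 0xe9 announces a three-byte
# sequence, but the following bytes are plain ASCII letters.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the bytes were produced in succeeds.
text = raw.decode("iso-8859-1")
```

With "fr_FR.UTF-8" the same month name arrives as valid UTF-8 bytes, which is why the hard-coded UTF-8 decode happens to succeed in that one case.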
msg84170 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-03-26 00:43
I think the problem is that creation of the Unicode string defaults to 
UTF-8. It should instead use the locale's encoding.

You are right that it could be an issue that there is no Python codec 
for the locale's encoding. To be robust against this case, I think the 
locale's mbcs->wcs routines should be used (i.e. mbstowcs). Better yet, 
use wcsftime in the first place. AFAICT, wcsftime is C99, so not all 
systems might support it. However, it appears that MSVC has it, so we 
could assume it exists and wait until someone complains. One issue 
apparently is that some implementations of wcsftime expect the format as 
char* (and again, I would defer dealing with that until somebody 
complains).

In either case, you end up with a wchar_t string. In principle, the locale 
might use a non-Unicode wide charset for wchar_t, but these fell out of 
use some time ago, and Python has always assumed that wchar_t is 
Unicode.
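The first suggestion above (decode with the locale's encoding rather than UTF-8) can be sketched in pure Python. This is an illustration of the idea only, not the C-level fix discussed here: decode_strftime_bytes and has_python_codec are hypothetical helper names, and locale.getpreferredencoding() reports the encoding for LC_CTYPE, so the sketch assumes LC_CTYPE and LC_TIME agree.

```python
import codecs
import locale


def has_python_codec(encoding):
    """Return True if Python ships a codec for the given encoding name."""
    try:
        codecs.lookup(encoding)
        return True
    except LookupError:
        return False


def decode_strftime_bytes(raw, encoding=None):
    """Hypothetical helper: decode strftime() output with the locale's encoding."""
    if encoding is None:
        # Encoding of the current locale; False avoids a temporary setlocale() call.
        encoding = locale.getpreferredencoding(False)
    if not has_python_codec(encoding):
        # The robustness concern raised above: a C-level fallback such as
        # mbstowcs() would be needed here, which this sketch does not attempt.
        raise LookupError("no Python codec for locale encoding %r" % encoding)
    return raw.decode(encoding)
```

For example, decode_strftime_bytes(b"f\xe9vrier", "iso-8859-1") yields the expected month name, whereas an encoding name unknown to Python raises LookupError instead of silently falling back to UTF-8.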
msg88516 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-05-29 16:37
This is a duplicate of issue 3061.
History
Date                 User    Action  Args
2009-05-29 16:37:51  loewis  set     status: open -> closed
                                     resolution: duplicate
                                     superseder: time.strftime() always decodes result with UTF-8
                                     messages: + msg88516
2009-03-26 00:43:22  loewis  set     nosy: + loewis
                                     messages: + msg84170
2009-03-25 19:48:17  pitrou  set     messages: + msg84164
2009-03-25 19:46:17  pitrou  create