Issue 5562: Locale-based date formatting crashes on non-ASCII data


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49812

classification

Title:	Locale-based date formatting crashes on non-ASCII data
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.0, Python 3.1

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	time.strftime() always decodes result with UTF-8 View: 3061
Assigned To:		Nosy List:	loewis, pitrou
Priority:	high	Keywords:

Created on 2009-03-25 19:46 by pitrou, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg84163 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-03-25 19:46
Locale-based date formatting in py3k (using strftime) crashes when asked to format a month name (or day, I assume) containing non-ASCII characters:

>>> import time
>>> import locale
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
'February'
>>> locale.setlocale(locale.LC_TIME, "fr_FR")
'fr_FR'
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data

It works if I specify the encoding explicitly in the locale name so as to coincide with the encoding specified in the error message above (but that's assuming the given encoding-specific locale is installed):

>>> locale.setlocale(locale.LC_TIME, "fr_FR.UTF-8")
'fr_FR.UTF-8'
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
'février'
msg84164 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-03-25 19:48
If I explicitly set another encoding, however, it doesn't work:

>>> locale.setlocale(locale.LC_TIME, "fr_FR.ISO-8859-1")
'fr_FR.ISO-8859-1'
>>> time.strftime("%B", (2009,2,1,0,0,0,0,0,0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data
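The failure in the two reports above can be illustrated without installing any French locale. The sketch below assumes the C library returned the month name encoded in ISO-8859-1 (the report does not show the raw bytes, so this is an assumption, although it is consistent with the error position in the traceback). The variable names are illustrative only.

```python
# Assumption: under the "fr_FR" locale the C strftime() returned the month
# name as ISO-8859-1 bytes. We build those bytes by hand instead of calling
# setlocale(), so the example does not depend on installed locales.
raw = "f\u00e9vrier".encode("iso-8859-1")   # b'f\xe9vrier'

# Decoding those bytes as UTF-8 fails: 0xE9 announces a multi-byte sequence,
# but the bytes that follow are plain ASCII letters.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding the bytes were actually written in succeeds.
as_latin1 = raw.decode("iso-8859-1")
```

This also suggests why the "fr_FR.UTF-8" locale works in the report: in that case the bytes handed back are already valid UTF-8, so the hard-coded decoding step happens to match.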
msg84170 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-03-26 00:43
I think the problem is that creation of the Unicode string defaults to UTF-8. It should instead use the locale's encoding.

You are right that it could be an issue that there is no Python codec for the locale's encoding. To be robust against this case, I think the locale's mbcs->wcs routines should be used (i.e. mbstowcs).

Better yet, use wcsftime in the first place. AFAICT, wcsftime is C99, so not all systems might support it. However, it appears that MSVC has it, so we could assume it exists and wait until someone complains. One issue apparently is that some implementations of wcsftime expect the format as char* (and again, I would defer dealing with that until somebody complains).

In either case, you end up with a wchar_t string. In principle, the locale might use a non-Unicode wide charset for wchar_t, but these fell out of use some time ago, and Python has always assumed that wchar_t is Unicode.
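The first suggestion above (decode with the locale's encoding rather than UTF-8) can be approximated at the Python level as a rough sketch. This is not the fix that went into CPython, which would live in the C implementation of the time module. The helper name `decode_with_locale_encoding` is hypothetical, and the fallback to UTF-8 with `surrogateescape` is a substitution chosen for this sketch to cover the "no Python codec for the locale's encoding" case; the message itself proposes mbstowcs or wcsftime for that.

```python
import locale
from typing import Optional


def decode_with_locale_encoding(raw: bytes, encoding: Optional[str] = None) -> str:
    """Hypothetical helper: decode strftime-style bytes using the locale's encoding.

    If no encoding is given, ask the locale module for the preferred one.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # LookupError: Python has no codec for the reported encoding.
        # UnicodeDecodeError: the bytes do not match the reported encoding.
        # Fallback chosen for this sketch only: keep undecodable bytes as
        # lone surrogates so the original bytes can still be recovered.
        return raw.decode("utf-8", errors="surrogateescape")
```

With this helper, the ISO-8859-1 bytes for the French month name decode cleanly when the matching encoding is supplied, and an unknown codec name no longer raises.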
msg88516 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2009-05-29 16:37
This is a duplicate of issue 3061.

History
Date	User	Action	Args
2022-04-11 14:56:46	admin	set	github: 49812
2009-05-29 16:37:51	loewis	set	status: open -> closed resolution: duplicate superseder: time.strftime() always decodes result with UTF-8 messages: + msg88516
2009-03-26 00:43:22	loewis	set	nosy: + loewis messages: + msg84170
2009-03-25 19:48:17	pitrou	set	messages: + msg84164
2009-03-25 19:46:17	pitrou	create