Issue 5240: time.strptime fails to match data and format with Unicode whitespaces (Py3)



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49490

classification

Title:	time.strptime fails to match data and format with Unicode whitespaces (Py3)
Type:	behavior	Stage:
Components:	Library (Lib), Unicode	Versions:	Python 3.0, Python 3.1

process

Status:	closed	Resolution:	duplicate
Dependencies:	2834	Superseder:	Change time.strptime() to make it work with Unicode chars View: 5239
Assigned To:		Nosy List:	brett.cannon, ezio.melotti, ocean-city
Priority:	normal	Keywords:

Created on 2009-02-13 03:16 by ezio.melotti, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg81859 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-02-13 03:16
On Python3, strptime raises a ValueError with some "Unicode whitespaces" even if they are present both in the 'string' and 'format' args in the same position:

>>> strptime("Thu\x20Feb", "%a\x20%b") # normal space, works fine
time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
>>> strptime("Thu\xa0Feb", "%a\xa0%b") # no-break space, fails
ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'

I wrote a small script to find out other chars where it fails (it needs ~5 minutes to run):

>>> l = []
>>> for char in map(chr, range(0xFFFF)):
...     try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
...     except ValueError: l.append(char)
...
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029', '\u202f', '\u205f', '\u3000']
>>> [char.strip() for char in l]
['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
>>> [unicodedata.category(char) for char in l]
['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs', 'Zs']
>>> [unicodedata.name(char, '???') for char in l]
['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE', 'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE', 'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE', 'FIGURE SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE', 'ZERO WIDTH SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR', 'NARROW NO-BREAK SPACE', 'MEDIUM MATHEMATICAL SPACE', 'IDEOGRAPHIC SPACE']

All these chars (except % and some control chars) are whitespace and they are removed by the .strip() method, so I guess that something similar happens in strptime too.
The Unicode categories are:
"Cc" = "Other, Control"
"Zs" = "Separator, Space"
"Cf" = "Other, Format"
"Zl" = "Separator, Line"
"Zp" = "Separator, Paragraph"

Everything seems to work fine on Py2.x (tested on 2.4 and 2.6)
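The three per-character lookups in the report (strip, category, name) can be folded into one helper. This is a minimal sketch and not part of the original report: `describe` is a hypothetical name, and the values it returns depend on the Unicode database shipped with whichever interpreter runs it.

```python
import unicodedata

def describe(char):
    """Return (code point, category, name, removed-by-strip) for one character."""
    return (
        "U+{0:04X}".format(ord(char)),
        unicodedata.category(char),
        unicodedata.name(char, "???"),  # control chars have no name, hence the default
        char.strip() == "",             # True when str.strip() removes the char
    )

# A few of the characters listed in the report.
for char in ["\x1c", "%", "\xa0", "\u3000"]:
    print(describe(char))
```

Only the percent sign should come back with the last field set to False, which fits the guess that the failures track whitespace handling.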
msg81916 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2009-02-13 12:34
From a quick look, I found that this problem goes away if the following change is applied.

Index: Lib/_strptime.py
===================================================================
--- Lib/_strptime.py	(revision 69496)
+++ Lib/_strptime.py	(working copy)
@@ -262,7 +262,7 @@
 
     def compile(self, format):
         """Return a compiled re object for the format string."""
-        return re_compile(self.pattern(format), IGNORECASE | ASCII)
+        return re_compile(self.pattern(format), IGNORECASE)
 
 _cache_lock = _thread_allocate_lock()
 # DO NOT modify _TimeRE_cache or _regex_cache without acquiring the cache lock

But this is just an observation. I wouldn't call this a fix, because I'm not familiar with Unicode and this change might cause another problem if applied.
msg81919 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-02-13 13:16
I think you have found the problem: strptime probably uses \s with the re.ASCII flag and fails to match all the Unicode whitespaces:

>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', '\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029', '\u202f', '\u205f', '\u3000']
>>> [bool(re.match('^\s$', char, re.ASCII)) for char in l]
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
>>> [bool(re.match('^\s$', char)) for char in l]
[True, True, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]

This bug is then related to #5239, and the proposed fix should work for both. We can close this as a duplicate and include this problem in #5239. Good work!
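The hypothesis above can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual Lib/_strptime.py code: it assumes whitespace runs in the format are rewritten to `\s+` before compilation, which the thread suggests but does not confirm, and `toy_format_regex` is a hypothetical name. Literal text stands in for the `%a` and `%b` directives.

```python
import re

def toy_format_regex(fmt, flags):
    """Toy model (assumption): turn whitespace runs in fmt into \\s+ and compile."""
    # A callable replacement avoids escape processing of the backslash.
    pattern = re.sub(r"\s+", lambda m: r"\s+", fmt)
    return re.compile(pattern, flags)

data = "Thu\xa0Feb"  # NO-BREAK SPACE between the two words

# Flags as before the proposed patch: \s is limited to ASCII whitespace.
before = toy_format_regex(data, re.IGNORECASE | re.ASCII)
# Flags as after the proposed patch: \s follows the Unicode definition.
after = toy_format_regex(data, re.IGNORECASE)

print(before.match(data) is None)     # no match with re.ASCII
print(after.match(data) is not None)  # match without re.ASCII
```

In this model a plain space still matches under both flag sets, which is consistent with the "\x20" case in the report working fine.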
msg81924 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2009-02-13 13:43
OK, I'll close this entry as a duplicate.

History
Date	User	Action	Args
2022-04-11 14:56:45	admin	set	github: 49490
2009-02-13 19:06:47	brett.cannon	set	nosy: + brett.cannon
2009-02-13 13:43:17	ocean-city	set	status: open -> closed resolution: duplicate superseder: Change time.strptime() to make it work with Unicode chars messages: + msg81924
2009-02-13 13:16:17	ezio.melotti	set	messages: + msg81919
2009-02-13 12:34:49	ocean-city	set	nosy: + ocean-city dependencies: + re.IGNORECASE not Unicode-ready messages: + msg81916
2009-02-13 03:16:37	ezio.melotti	create