classification
Title: time.strptime fails to match data and format with Unicode whitespaces (Py3)
Type: behavior Stage:
Components: Library (Lib), Unicode Versions: Python 3.0, Python 3.1
process
Status: closed Resolution: duplicate
Dependencies: 2834 Superseder: Change time.strptime() to make it work with Unicode chars
View: 5239
Assigned To: Nosy List: brett.cannon, ezio.melotti, ocean-city
Priority: normal Keywords:

Created on 2009-02-13 03:16 by ezio.melotti, last changed 2009-02-13 19:06 by brett.cannon. This issue is now closed.

Messages (4)
msg81859 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-13 03:16
On Python 3, strptime raises a ValueError for some "Unicode whitespaces"
even when they appear at the same position in both the 'string' and
'format' args:
>>> from time import strptime
>>> strptime("Thu\x20Feb", "%a\x20%b") # normal space, works fine
time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0,
tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
>>> strptime("Thu\xa0Feb", "%a\xa0%b") # no-break space, fails
ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'

I wrote a small script to find out other chars where it fails (it needs
~5 minutes to run):
>>> import unicodedata
>>> l = []
>>> for char in map(chr, range(0xFFFF)):
...   try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
...   except ValueError: l.append(char)
...
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> [char.strip() for char in l]
['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '']
>>> [unicodedata.category(char) for char in l]
['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs',
'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs',
'Zs']
>>> [unicodedata.name(char, '???') for char in l]
['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE',
'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE',
'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE', 'FIGURE
SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE', 'ZERO WIDTH
SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR', 'NARROW NO-BREAK
SPACE', 'MEDIUM MATHEMATICAL SPACE', 'IDEOGRAPHIC SPACE']

All these chars (except % and some control chars) are whitespace and
they are removed by the .strip() method, so I guess that something
similar happens in strptime too.
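
A quicker way to list the candidate characters, as a minimal sketch. It assumes, as guessed above, that the failing set is simply what str.isspace() reports as whitespace, minus the ASCII whitespace that works. The helper name unicode_whitespace is made up for this illustration, and the exact set depends on the Unicode database shipped with the interpreter.

```python
import unicodedata

ASCII_WHITESPACE = " \t\n\r\f\v"

def unicode_whitespace(limit=0x10000):
    """Return the characters below `limit` that str.isspace() treats as whitespace."""
    return [chr(cp) for cp in range(limit) if chr(cp).isspace()]

ws = unicode_whitespace()
# Keep only the characters that an ASCII-only \s would not match.
non_ascii_ws = [c for c in ws if c not in ASCII_WHITESPACE]

for c in non_ascii_ws:
    print(repr(c), unicodedata.category(c), unicodedata.name(c, "???"))
```

This runs in well under a second because it never calls strptime; it only checks the whitespace property that .strip() relies on.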

The Unicode categories are:
"Cc" = "Other, Control"
"Zs" = "Separator, Space"
"Cf" = "Other, Format"
"Zl" = "Separator, Line"
"Zp" = "Separator, Paragraph"

Everything seems to work fine on Py2.x (tested on 2.4 and 2.6)
msg81916 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-02-13 12:34
From a quick look, I found that this problem goes away if the following
change is applied.

Index: Lib/_strptime.py
===================================================================
--- Lib/_strptime.py	(revision 69496)
+++ Lib/_strptime.py	(working copy)
@@ -262,7 +262,7 @@
 
     def compile(self, format):
         """Return a compiled re object for the format string."""
-        return re_compile(self.pattern(format), IGNORECASE | ASCII)
+        return re_compile(self.pattern(format), IGNORECASE)
 
 _cache_lock = _thread_allocate_lock()
 # DO NOT modify _TimeRE_cache or _regex_cache without acquiring the cache lock

But this is just an observation. I wouldn't call this a *fix* because I'm
not familiar with Unicode, and this change might cause another problem if
applied.
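
One possible side effect of dropping the ASCII flag, shown as a small sketch: without re.ASCII, \d in a str pattern also matches non-ASCII decimal digits. Whether this matters for strptime depends on the directive patterns using \d, which is an assumption here and is not checked against Lib/_strptime.py.

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # ARABIC-INDIC DIGIT THREE, Unicode category Nd

ascii_digits = re.compile(r"\d+", re.ASCII)   # \d limited to [0-9]
unicode_digits = re.compile(r"\d+")           # \d matches any Unicode decimal digit

ascii_on_ascii = ascii_digits.fullmatch("13")
ascii_on_arabic = ascii_digits.fullmatch(ARABIC_INDIC_THREE)
unicode_on_arabic = unicode_digits.fullmatch(ARABIC_INDIC_THREE)

print(bool(ascii_on_ascii), bool(ascii_on_arabic), bool(unicode_on_arabic))
```

So a change that widens \s to Unicode whitespace would widen \d and \w in the same way.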
msg81919 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-13 13:16
I think you have found the problem: strptime probably uses \s with the
re.ASCII flag, so it fails to match the Unicode whitespace characters:
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> import re
>>> [bool(re.match(r'^\s$', char, re.ASCII)) for char in l]
[False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False]
>>> [bool(re.match(r'^\s$', char)) for char in l]
[True, True, True, True, False, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True, True, True, True,
True, True]
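
To see how this could break strptime, here is a simplified sketch of the suspected mechanism. It is not the code from Lib/_strptime.py: the DIRECTIVES table, format_to_pattern and parse are hypothetical names, and it assumes that whitespace runs in the format are turned into \s+ (with a Unicode-aware match) before the pattern is compiled with the flags from the patch above.

```python
import re

# Hypothetical, English-only subset of the directive table.
DIRECTIVES = {
    "%a": r"(?P<a>mon|tue|wed|thu|fri|sat|sun)",
    "%b": r"(?P<b>jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)",
}

def format_to_pattern(fmt):
    """Turn a strptime-like format into a regex source string."""
    # No re.ASCII here, so any whitespace run in the format,
    # including U+00A0, is rewritten to the regex token \s+.
    pattern = re.sub(r"\s+", r"\\s+", fmt)
    for directive, regex in DIRECTIVES.items():
        pattern = pattern.replace(directive, regex)
    return pattern

def parse(data, fmt, flags):
    """Match `data` against the pattern built from `fmt` with the given flags."""
    return re.compile(format_to_pattern(fmt), flags).match(data)

with_ascii = re.IGNORECASE | re.ASCII
without_ascii = re.IGNORECASE

print(bool(parse("Thu Feb", "%a %b", with_ascii)))            # ASCII space
print(bool(parse("Thu\xa0Feb", "%a\xa0%b", with_ascii)))      # no-break space
print(bool(parse("Thu\xa0Feb", "%a\xa0%b", without_ascii)))   # no-break space
```

In this sketch the no-break space in the format becomes \s+, but with re.ASCII that \s+ can no longer match the no-break space in the data, which would explain the ValueError.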

This bug is then related to #5239 and the proposed fix should work for both.
We can close this as a duplicate and include this problem in #5239.

Good work!
msg81924 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-02-13 13:43
OK, I'll close this entry as a duplicate.
History
Date                 User          Action  Args
2009-02-13 19:06:47  brett.cannon  set     nosy: + brett.cannon
2009-02-13 13:43:17  ocean-city    set     status: open -> closed
                                           resolution: duplicate
                                           superseder: Change time.strptime() to make it work with Unicode chars
                                           messages: + msg81924
2009-02-13 13:16:17  ezio.melotti  set     messages: + msg81919
2009-02-13 12:34:49  ocean-city    set     nosy: + ocean-city
                                           dependencies: + re.IGNORECASE not Unicode-ready
                                           messages: + msg81916
2009-02-13 03:16:37  ezio.melotti  create