New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
time.strptime fails to match data and format with Unicode whitespaces (Py3) #49490
Comments
On Python3, strptime raises a ValueError with some "Unicode whitespaces"
even if they are present both in the 'string' and 'format' args in the
same position:
>>> strptime("Thu\x20Feb", "%a\x20%b") # normal space, works fine
time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0,
tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
>>> strptime("Thu\xa0Feb", "%a\xa0%b") # no-break space, fails
ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'
I wrote a small script to find out other chars where it fails (it needs
~5 minutes to run):
>>> l = []
>>> for char in map(chr, range(0xFFFF)):
... try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
... except ValueError: l.append(char)
...
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> [char.strip() for char in l]
['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '']
>>> [unicodedata.category(char) for char in l]
['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs',
'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs',
'Zs']
>>> [unicodedata.name(char, '???') for char in l]
['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE',
'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE',
'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE', 'FIGURE
SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE', 'ZERO WIDTH
SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR', 'NARROW NO-BREAK
SPACE', 'MEDIUM MATHEMATICAL SPACE', 'IDEOGRAPHIC SPACE'] All these chars (except % and some control chars) are whitespace and The Unicode categories are: Everything seems to work fine on Py2.x (tested on 2.4 and 2.6) |
By quick observation, I found this problem goes away if following change Index: Lib/_strptime.py --- Lib/_strptime.py (revision 69496)
+++ Lib/_strptime.py (working copy)
@@ -262,7 +262,7 @@
def compile(self, format):
"""Return a compiled re object for the format string."""
- return re_compile(self.pattern(format), IGNORECASE | ASCII)
+ return re_compile(self.pattern(format), IGNORECASE)
_cache_lock = _thread_allocate_lock()
# DO NOT modify _TimeRE_cache or _regex_cache without acquiring the
cache lock But this is just an observation. I don't call this *fix* because I'm not |
I think you have found the problem, strptime probably uses \s with the
re.ASCII flag and fails to match all the Unicode whitespaces:
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> [bool(re.match('^\s$', char, re.ASCII)) for char in l]
[False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False]
>>> [bool(re.match('^\s$', char)) for char in l]
[True, True, True, True, False, True, True, True, True, True, True,
True, True,True, True, True, True, True, True, True, True, True, True,
True, True] This bug is then related bpo-5239 and the proposed fix should work for both. Good work! |
OK, I'll close this entry as duplicated. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: