time.strptime fails to match data and format with Unicode whitespaces (Py3) #49490

ezio-melotti · 2009-02-13T03:16:37Z

BPO	5240
Nosy	@brettcannon, @ezio-melotti
Dependencies	bpo-2834: re.IGNORECASE not Unicode-ready
Superseder	bpo-5239: Change time.strptime() to make it work with Unicode chars

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2009-02-13.13:43:17.942>
created_at = <Date 2009-02-13.03:16:37.134>
labels = ['type-bug', 'library', 'expert-unicode']
title = 'time.strptime fails to match data and format with Unicode whitespaces (Py3)'
updated_at = <Date 2009-02-13.19:06:47.973>
user = 'https://github.com/ezio-melotti'

bugs.python.org fields:

activity = <Date 2009-02-13.19:06:47.973>
actor = 'brett.cannon'
assignee = 'none'
closed = True
closed_date = <Date 2009-02-13.13:43:17.942>
closer = 'ocean-city'
components = ['Library (Lib)', 'Unicode']
creation = <Date 2009-02-13.03:16:37.134>
creator = 'ezio.melotti'
dependencies = ['2834']
files = []
hgrepos = []
issue_num = 5240
keywords = []
message_count = 4.0
messages = ['81859', '81916', '81919', '81924']
nosy_count = 3.0
nosy_names = ['brett.cannon', 'ocean-city', 'ezio.melotti']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = None
status = 'closed'
superseder = '5239'
type = 'behavior'
url = 'https://bugs.python.org/issue5240'
versions = ['Python 3.0', 'Python 3.1']

ezio-melotti · 2009-02-13T03:16:36Z

On Python 3, time.strptime raises a ValueError for some Unicode whitespace
characters, even when the same character appears at the same position in
both the 'string' and 'format' arguments:
>>> from time import strptime
>>> strptime("Thu\x20Feb", "%a\x20%b") # normal space, works fine
time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0,
tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
>>> strptime("Thu\xa0Feb", "%a\xa0%b") # no-break space, fails
ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'

I wrote a small script to find the other characters for which it fails
(it takes about 5 minutes to run):
>>> l = []
>>> for char in map(chr, range(0xFFFF)):
...   try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
...   except ValueError: l.append(char)
...
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> [char.strip() for char in l]
['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '']
>>> import unicodedata
>>> [unicodedata.category(char) for char in l]
['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs',
'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs',
'Zs']
>>> [unicodedata.name(char, '???') for char in l]
['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE',
'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE',
'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE',
'FIGURE SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE',
'ZERO WIDTH SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR',
'NARROW NO-BREAK SPACE', 'MEDIUM MATHEMATICAL SPACE', 'IDEOGRAPHIC SPACE']

All of these characters except % are removed by the .strip() method, even
the ones that are not in a separator ("Z*") category, so I guess that
strptime treats whitespace in a similar way.

The Unicode categories are:
"Cc" = "Other, Control"
"Zs" = "Separator, Space"
"Cf" = "Other, Format"
"Zl" = "Separator, Line"
"Zp" = "Separator, Paragraph"

Everything seems to work fine on Py2.x (tested on 2.4 and 2.6).
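
A faster variant of the scan above, as a sketch only: it tests just the characters that str.isspace() reports as whitespace, plus '%', instead of the whole range. The helper name failing_separators and the numeric "%Y{0}%m" format are illustrative choices that are not part of the original report; the numeric directives are used so that the result does not depend on the locale's day and month names.

```python
import time

def failing_separators(candidates):
    """Return the candidates that make strptime raise ValueError when the
    same character is used as separator in both the data and the format."""
    failures = []
    for char in candidates:
        try:
            # Numeric directives keep the check independent of the locale.
            time.strptime("2009{0}02".format(char), "%Y{0}%m".format(char))
        except ValueError:
            failures.append(char)
    return failures

# Only test characters that Python itself treats as whitespace, plus '%'.
candidates = [c for c in map(chr, range(0x10000)) if c.isspace()] + ["%"]
print([ascii(c) for c in failing_separators(candidates)])
```

The resulting list depends on the Python version: '%' always fails because "%%" in a format stands for a literal percent sign, while the whitespace entries should only appear on versions affected by this bug.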

ocean-city · 2009-02-13T12:34:48Z

From a quick look, I found that this problem goes away if the following
change is applied.

Index: Lib/_strptime.py
===================================================================

--- Lib/_strptime.py	(revision 69496)
+++ Lib/_strptime.py	(working copy)
@@ -262,7 +262,7 @@
 
     def compile(self, format):
         """Return a compiled re object for the format string."""
-        return re_compile(self.pattern(format), IGNORECASE | ASCII)
+        return re_compile(self.pattern(format), IGNORECASE)
 
 _cache_lock = _thread_allocate_lock()
 # DO NOT modify _TimeRE_cache or _regex_cache without acquiring the cache lock

But this is just an observation. I wouldn't call it a *fix*: I'm not
familiar with Unicode, and this change might cause other problems if applied.
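
To illustrate what dropping the ASCII flag changes, here is a minimal sketch that uses only the re module. PATTERN is a hypothetical, simplified stand-in for the regex that strptime builds; it assumes that whitespace in the format ends up as \s+ in the compiled pattern. It is not the actual _strptime code.

```python
import re

# Hypothetical, simplified stand-in for the regex built from "%a\xa0%b";
# assumes whitespace in the format is turned into \s+.
PATTERN = r"(?P<a>thu)\s+(?P<b>feb)"

with_ascii = re.compile(PATTERN, re.IGNORECASE | re.ASCII)
without_ascii = re.compile(PATTERN, re.IGNORECASE)

data = "Thu\xa0Feb"  # NO-BREAK SPACE between the two names
print(with_ascii.match(data))     # None: ASCII-only \s rejects \xa0
print(without_ascii.match(data))  # a match object

# Possible side effect of dropping re.ASCII: \d becomes Unicode-aware too.
arabic_indic_three = "\u0663"
print(bool(re.fullmatch(r"\d", arabic_indic_three)))            # True
print(bool(re.fullmatch(r"\d", arabic_indic_three, re.ASCII)))  # False
```

The last two lines show one way the change could affect other behavior: without re.ASCII, \d also accepts non-ASCII decimal digits.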

ezio-melotti · 2009-02-13T13:16:17Z

I think you have found the problem: strptime probably uses \s with the
re.ASCII flag, which fails to match the Unicode whitespace characters:
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> import re
>>> [bool(re.match(r'^\s$', char, re.ASCII)) for char in l]
[False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False, False,
False, False, False, False, False]
>>> [bool(re.match(r'^\s$', char)) for char in l]
[True, True, True, True, False, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True, True, True, True,
True, True]

This bug is therefore related to bpo-5239, and the proposed fix should work
for both. We can close this as a duplicate and cover this problem in bpo-5239.

Good work!

ocean-city · 2009-02-13T13:43:18Z

OK, I'll close this entry as a duplicate.

ezio-melotti added the stdlib, topic-unicode, and type-bug labels Feb 13, 2009

ocean-city mannequin closed this as completed Feb 13, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022
