
Author ezio.melotti
Recipients ezio.melotti
Date 2009-02-13.03:16:35
Message-id <1234494999.4.0.610886804768.issue5240@psf.upfronthosting.co.za>
Content
On Python 3, time.strptime raises a ValueError for some Unicode whitespace
characters, even when they appear at the same position in both the
'string' and 'format' arguments:
>>> from time import strptime
>>> strptime("Thu\x20Feb", "%a\x20%b") # normal space, works fine
time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0,
tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
>>> strptime("Thu\xa0Feb", "%a\xa0%b") # no-break space, fails
ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'
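
The two calls above can be wrapped in a small helper that reports whether a given separator is accepted. This is only a minimal sketch: the name separator_parses is made up for illustration, and it assumes an English (or C) locale so that 'Thu' and 'Feb' are recognized. The result for '\xa0' depends on the Python version in use.

```python
from time import strptime

def separator_parses(sep):
    """Return True if strptime accepts `sep` between %a and %b.

    Hypothetical helper for illustration, not part of the time module.
    """
    try:
        strptime('Thu{0}Feb'.format(sep), '%a{0}%b'.format(sep))
    except ValueError:
        return False
    return True

print(separator_parses('\x20'))  # normal space
print(separator_parses('\xa0'))  # no-break space; False on affected versions
```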

I wrote a small script to find the other characters for which it fails
(it takes ~5 minutes to run):
>>> import unicodedata
>>> l = []
>>> for char in map(chr, range(0xFFFF)):
...   try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
...   except ValueError: l.append(char)
...
>>> l
['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
'\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
'\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
'\u202f', '\u205f', '\u3000']
>>> [char.strip() for char in l]
['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '']
>>> [unicodedata.category(char) for char in l]
['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs',
'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs',
'Zs']
>>> [unicodedata.name(char, '???') for char in l]
['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE',
'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE',
'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE',
'FIGURE SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE',
'ZERO WIDTH SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR',
'NARROW NO-BREAK SPACE', 'MEDIUM MATHEMATICAL SPACE',
'IDEOGRAPHIC SPACE']
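
Since every failing character except '%' is removed by .strip(), the scan can probably be narrowed to the characters that Python already treats as whitespace, which avoids the long run over the whole range. The following is a sketch under that assumption; the names candidates and failing are illustrative, and the set of whitespace characters varies with the Unicode database bundled with the interpreter.

```python
import unicodedata
from time import strptime

# Only test characters that str.isspace() reports as whitespace.
candidates = [chr(cp) for cp in range(0x10000) if chr(cp).isspace()]

failing = []
for char in candidates:
    try:
        strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
    except ValueError:
        failing.append(char)

# Report code point, category and name for each failing character.
for char in failing:
    print('U+{0:04X}'.format(ord(char)), unicodedata.category(char),
          unicodedata.name(char, '???'))
```

The narrowed scan cannot report '%', which is not whitespace, and it will miss any character that fails without being whitespace according to str.isspace().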

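All of these characters except '%' are removed by the .strip() method,
control characters included, which means Python treats them as
whitespace. I guess that strptime handles whitespace in a similar way.
The '%' failure looks unrelated: '%' starts a directive, so the generated
format '%a%%b' expects a literal '%b' after the weekday.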

The Unicode categories are:
"Cc" = "Other, Control"
"Po" = "Punctuation, Other"
"Zs" = "Separator, Space"
"Cf" = "Other, Format"
"Zl" = "Separator, Line"
"Zp" = "Separator, Paragraph"
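
These category codes can be looked up with unicodedata.category. As a brief sketch, the following groups a few of the characters from the list above by category; the sample and the name by_category are chosen here purely for illustration.

```python
import unicodedata
from collections import defaultdict

# A handful of the failing characters reported above.
sample = ['\x1c', '\x85', '\xa0', '\u1680', '\u2028', '\u2029', '\u3000']

by_category = defaultdict(list)
for char in sample:
    code_point = 'U+{0:04X}'.format(ord(char))
    by_category[unicodedata.category(char)].append(code_point)

for category, chars in sorted(by_category.items()):
    print(category, chars)
```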

Everything seems to work fine on Python 2.x (tested on 2.4 and 2.6).
History
Date User Action Args
2009-02-13 03:16:39 ezio.melotti set recipients: + ezio.melotti
2009-02-13 03:16:39 ezio.melotti set messageid: <1234494999.4.0.610886804768.issue5240@psf.upfronthosting.co.za>
2009-02-13 03:16:37 ezio.melotti link issue5240 messages
2009-02-13 03:16:35 ezio.melotti create