time.strptime fails to match data and format with Unicode whitespaces (Py3) #49490

Closed
ezio-melotti opened this issue Feb 13, 2009 · 4 comments
Labels
stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@ezio-melotti
Member

BPO 5240
Nosy @brettcannon, @ezio-melotti
Dependencies
  • bpo-2834: re.IGNORECASE not Unicode-ready
Superseder
  • bpo-5239: Change time.strptime() to make it work with Unicode chars
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2009-02-13.13:43:17.942>
    created_at = <Date 2009-02-13.03:16:37.134>
    labels = ['type-bug', 'library', 'expert-unicode']
    title = 'time.strptime fails to match data and format with Unicode whitespaces (Py3)'
    updated_at = <Date 2009-02-13.19:06:47.973>
    user = 'https://github.com/ezio-melotti'

    bugs.python.org fields:

    activity = <Date 2009-02-13.19:06:47.973>
    actor = 'brett.cannon'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-02-13.13:43:17.942>
    closer = 'ocean-city'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2009-02-13.03:16:37.134>
    creator = 'ezio.melotti'
    dependencies = ['2834']
    files = []
    hgrepos = []
    issue_num = 5240
    keywords = []
    message_count = 4.0
    messages = ['81859', '81916', '81919', '81924']
    nosy_count = 3.0
    nosy_names = ['brett.cannon', 'ocean-city', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '5239'
    type = 'behavior'
    url = 'https://bugs.python.org/issue5240'
    versions = ['Python 3.0', 'Python 3.1']

    @ezio-melotti
    Member Author

    On Python 3, time.strptime raises a ValueError for some Unicode
    whitespace characters even when they appear at the same position in
    both the 'string' and 'format' args:
    >>> strptime("Thu\x20Feb", "%a\x20%b") # normal space, works fine
    time.struct_time(tm_year=1900, tm_mon=2, tm_mday=1, tm_hour=0, tm_min=0,
    tm_sec=0, tm_wday=3, tm_yday=32, tm_isdst=-1)
    >>> strptime("Thu\xa0Feb", "%a\xa0%b") # no-break space, fails
    ValueError: time data 'Thu\xa0Feb' does not match format '%a\xa0%b'
    
    I wrote a small script to find out other chars where it fails (it needs
    ~5 minutes to run):
    >>> l = []
    >>> for char in map(chr, range(0xFFFF)):
    ...   try: x = strptime('Thu{0}Feb'.format(char), '%a{0}%b'.format(char))
    ...   except ValueError: l.append(char)
    ...
    >>> l
    ['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
    '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
    '\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
    '\u202f', '\u205f', '\u3000']
    >>> [char.strip() for char in l]
    ['', '', '', '', '%', '', '', '', '', '', '', '', '', '', '', '', '',
    '', '', '', '', '', '', '', '']
    >>> [unicodedata.category(char) for char in l]
    ['Cc', 'Cc', 'Cc', 'Cc', 'Po', 'Cc', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs',
    'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Zs', 'Cf', 'Zl', 'Zp', 'Zs', 'Zs',
    'Zs']
    >>> [unicodedata.name(char, '???') for char in l]
    ['???', '???', '???', '???', 'PERCENT SIGN', '???', 'NO-BREAK SPACE',
    'OGHAM SPACE MARK', 'EN QUAD', 'EM QUAD', 'EN SPACE', 'EM SPACE',
    'THREE-PER-EM SPACE', 'FOUR-PER-EM SPACE', 'SIX-PER-EM SPACE', 'FIGURE
    SPACE', 'PUNCTUATION SPACE', 'THIN SPACE', 'HAIR SPACE', 'ZERO WIDTH
    SPACE', 'LINE SEPARATOR', 'PARAGRAPH SEPARATOR', 'NARROW NO-BREAK
    SPACE', 'MEDIUM MATHEMATICAL SPACE', 'IDEOGRAPHIC SPACE']

    All these chars except % are removed by the .strip() method (they are
    whitespace, or control chars that Python treats as whitespace), so I
    guess that something similar happens in strptime too.
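
The guess above can be checked without the five-minute strptime loop. The following is a minimal sketch, assuming that `str.isspace()` is a reasonable stand-in for "removed by `.strip()`"; the helper names `bmp_whitespace` and `describe` are made up for illustration. The resulting list may differ from the one in this report depending on the Unicode database version shipped with the interpreter.

```python
import unicodedata

ASCII_WHITESPACE = set(" \t\n\r\f\v")


def bmp_whitespace():
    """Return every BMP character that str.isspace() treats as whitespace."""
    return [chr(cp) for cp in range(0x10000) if chr(cp).isspace()]


def describe(chars):
    """Pair each character's code point with its Unicode category and name."""
    return [
        ("U+{0:04X}".format(ord(c)), unicodedata.category(c),
         unicodedata.name(c, "???"))
        for c in chars
    ]


# Whitespace characters that an ASCII-only \s would not recognise.
non_ascii_ws = [c for c in bmp_whitespace() if c not in ASCII_WHITESPACE]

for row in describe(non_ascii_ws):
    print(*row)
```

'%' does not appear in this list; it presumably fails in the loop for a different reason, since '%' has its own meaning inside a format string.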

    The Unicode categories are:
    "Cc" = "Other, Control"
    "Zs" = "Separator, Space"
    "Cf" = "Other, Format"
    "Zl" = "Separator, Line"
    "Zp" = "Separator, Paragraph"

    Everything seems to work fine on Py2.x (tested on 2.4 and 2.6)

    @ezio-melotti ezio-melotti added stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error labels Feb 13, 2009
    @ocean-city
    Mannequin

    ocean-city mannequin commented Feb 13, 2009

    From a quick look, I found that this problem goes away if the following
    change is applied.

    Index: Lib/_strptime.py
    ===================================================================

    --- Lib/_strptime.py	(revision 69496)
    +++ Lib/_strptime.py	(working copy)
    @@ -262,7 +262,7 @@
     
         def compile(self, format):
             """Return a compiled re object for the format string."""
    -        return re_compile(self.pattern(format), IGNORECASE | ASCII)
    +        return re_compile(self.pattern(format), IGNORECASE)
     
     _cache_lock = _thread_allocate_lock()
     # DO NOT modify _TimeRE_cache or _regex_cache without acquiring the cache lock

    But this is just an observation. I don't call this a *fix* because I'm not
    familiar with Unicode; this change might cause another problem if applied.
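
The effect of dropping the ASCII flag can be sketched with the `re` module alone. This is an illustration, not the code in Lib/_strptime.py: `PATTERN` is a simplified, hypothetical stand-in for the regex built from a format such as "%a\xa0%b", and it assumes that whitespace in the format ends up as `\s+` in the generated pattern.

```python
import re

# Hypothetical, simplified stand-in for the pattern strptime builds from
# "%a\xa0%b"; the real pattern lists every weekday and month name.
# Assumption: whitespace in the format is turned into \s+ in the pattern.
PATTERN = r"(?P<a>thu)\s+(?P<b>feb)"

# Flags before the change: \s only matches ASCII whitespace.
ascii_re = re.compile(PATTERN, re.IGNORECASE | re.ASCII)
# Flags after the change: \s matches Unicode whitespace for str patterns.
unicode_re = re.compile(PATTERN, re.IGNORECASE)

for data in ("Thu\x20Feb", "Thu\xa0Feb"):
    print(repr(data),
          "ASCII:", bool(ascii_re.match(data)),
          "Unicode:", bool(unicode_re.match(data)))
```

Under that assumption, the no-break space in the data is rejected only when re.ASCII is set, which matches the ValueError reported above.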

    @ezio-melotti
    Member Author

    I think you have found the problem: strptime probably uses \s with the
    re.ASCII flag and so fails to match the Unicode whitespace chars:
    >>> l
    ['\x1c', '\x1d', '\x1e', '\x1f', '%', '\x85', '\xa0', '\u1680',
    '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006',
    '\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u2028', '\u2029',
    '\u202f', '\u205f', '\u3000']
    >>> [bool(re.match(r'^\s$', char, re.ASCII)) for char in l]
    [False, False, False, False, False, False, False, False, False, False,
    False, False, False, False, False, False, False, False, False, False,
    False, False, False, False, False]
    >>> [bool(re.match(r'^\s$', char)) for char in l]
    [True, True, True, True, False, True, True, True, True, True, True,
    True, True, True, True, True, True, True, True, True, True, True, True,
    True, True]

    This bug is then related to bpo-5239, and the proposed fix should work for both.
    We can close this as a duplicate and include this problem in bpo-5239.

    Good work!
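
On an interpreter that still has this problem, one possible workaround (not proposed in this thread) is to collapse Unicode whitespace to a plain space in both arguments before parsing. A minimal sketch, where `normalize_ws` and `strptime_normalized` are hypothetical helper names:

```python
import re
import time

# For str patterns in Python 3, \s is Unicode-aware unless re.ASCII is set.
_WS = re.compile(r"\s+")


def normalize_ws(text):
    """Collapse every run of (Unicode) whitespace into one ASCII space."""
    return _WS.sub(" ", text)


def strptime_normalized(data, fmt):
    """Hypothetical helper: normalize both arguments, then parse as usual."""
    return time.strptime(normalize_ws(data), normalize_ws(fmt))
```

For example, `strptime_normalized("13\xa002", "%d\xa0%m")` parses the same way as `time.strptime("13 02", "%d %m")`. The trade-off is that the data and the format no longer have to contain the same kind of whitespace.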

    @ocean-city
    Mannequin

    ocean-city mannequin commented Feb 13, 2009

    OK, I'll close this entry as a duplicate.

    @ocean-city ocean-city mannequin closed this as completed Feb 13, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022