Change time.strptime() to make it work with Unicode chars #49489

Closed
ezio-melotti opened this issue Feb 13, 2009 · 13 comments
Labels
stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@ezio-melotti
Member

BPO 5239
Nosy @brettcannon, @pitrou, @ezio-melotti
Dependencies
  • bpo-2834: re.IGNORECASE not Unicode-ready
  • bpo-5249: Fix strftime on windows.
Files
  • remove_ascii_flag.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-03-30.21:54:24.324>
    created_at = <Date 2009-02-13.01:46:34.730>
    labels = ['type-bug', 'library', 'expert-unicode']
    title = 'Change time.strptime() to make it work with Unicode chars'
    updated_at = <Date 2009-03-30.21:54:24.322>
    user = 'https://github.com/ezio-melotti'

    bugs.python.org fields:

    activity = <Date 2009-03-30.21:54:24.322>
    actor = 'brett.cannon'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-03-30.21:54:24.324>
    closer = 'brett.cannon'
    components = ['Library (Lib)', 'Unicode']
    creation = <Date 2009-02-13.01:46:34.730>
    creator = 'ezio.melotti'
    dependencies = ['2834', '5249']
    files = ['13074']
    hgrepos = []
    issue_num = 5239
    keywords = ['patch']
    message_count = 13.0
    messages = ['81847', '81928', '81932', '81934', '81938', '81939', '81940', '81941', '81948', '81949', '81952', '84665', '84669']
    nosy_count = 4.0
    nosy_names = ['brett.cannon', 'pitrou', 'ocean-city', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue5239'
    versions = ['Python 2.6', 'Python 2.5', 'Python 2.4', 'Python 3.0', 'Python 3.1', 'Python 2.7']

    @ezio-melotti
    Member Author

    On Py3 strptime("２００９", "%Y") fails:
    >>> strptime("２００９", "%Y")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.0/_strptime.py", line 454, in _strptime_time
        return _strptime(data_string, format)[0]
      File "/usr/local/lib/python3.0/_strptime.py", line 325, in _strptime
        (data_string, format))
    ValueError: time data '２００９' does not match format '%Y'
    
    but non-ASCII digits are supported elsewhere:
    >>> int("２００９")
    2009
    >>> re.match("^\d{4}$", "２００９").group()
    '２００９'
    
    The problem seems to be at line 265 of _strptime.py:
            return re_compile(self.pattern(format), IGNORECASE | ASCII)
    The ASCII flag prevents the regex from working properly with '２００９':
    >>> re.match("^\d{4}$", "２００９", re.ASCII)
    >>>

    I tried to remove the ASCII flag and it worked fine.
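
A minimal sketch of the effect described above, assuming a heavily simplified stand-in for the format-to-regex step. The names `DIRECTIVES`, `compile_format`, and `parse_year` are hypothetical and are not the actual `_strptime.py` code; only the `%Y` directive is handled, and the fullwidth digits are written as escapes.

```python
import re

# Hypothetical, simplified directive table: only %Y is supported here.
DIRECTIVES = {"Y": r"(?P<Y>\d\d\d\d)"}


def compile_format(fmt, ascii_only):
    """Translate a strptime-style format into a compiled regex.

    ascii_only=True mimics compiling with IGNORECASE | ASCII;
    ascii_only=False mimics dropping the ASCII flag.
    """
    parts = []
    i = 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt):
            # Unknown directives raise KeyError in this sketch.
            parts.append(DIRECTIVES[fmt[i + 1]])
            i += 2
        else:
            parts.append(re.escape(fmt[i]))
            i += 1
    flags = re.IGNORECASE
    if ascii_only:
        flags |= re.ASCII
    return re.compile("".join(parts), flags)


def parse_year(text, ascii_only):
    """Return the year as an int, or raise ValueError if text does not match."""
    match = compile_format("%Y", ascii_only).fullmatch(text)
    if match is None:
        raise ValueError(f"time data {text!r} does not match format '%Y'")
    # int() accepts any Unicode decimal digits, including fullwidth ones.
    return int(match.group("Y"))


FULLWIDTH_2009 = "\uff12\uff10\uff10\uff19"  # fullwidth digits 2, 0, 0, 9

print(parse_year("2009", ascii_only=True))
print(parse_year(FULLWIDTH_2009, ascii_only=False))
```

With `ascii_only=True`, `parse_year(FULLWIDTH_2009, ascii_only=True)` raises `ValueError` in this sketch, because `\d` is then restricted to `[0-9]`.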

    On Py2.x the problem is the same:
    >>> strptime(u"２００９", "%Y")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/_strptime.py", line 330, in strptime
        (data_string, format))
    ValueError>>>
    >>> int(u"２００９")
    2009
    >>> re.match("^\d{4}$", u"２００９")
    
    Here we probably need to add the re.UNICODE flag at line 265 (untested):
            return re_compile(self.pattern(format), IGNORECASE | UNICODE)
    in order to make it work:
    >>> re.match("^\d{4}$", u"２００９", re.U).group()
    u'\uff12\uff10\uff10\uff19'

    @ezio-melotti ezio-melotti added stdlib Python modules in the Lib dir topic-unicode type-bug An unexpected behavior, bug, or error labels Feb 13, 2009
    @ocean-city
    Mannequin

    ocean-city mannequin commented Feb 13, 2009

    This patch comes from bpo-5240. I think a test case is needed. I'll
    try to write one if I can.

    @ezio-melotti ezio-melotti changed the title from 'time.strptime("２００９", "%Y") raises a value error' to 'Change time.strptime() to make it work with Unicode chars' Feb 13, 2009
    @ocean-city
    Mannequin

    ocean-city mannequin commented Feb 13, 2009

    Hmm, this fails on python2 too. Maybe re.ASCII was added for backward
    compatibility? Again, I'm not familiar with unicode, so I won't call
    remove_ascii_flag.patch a *fix*.

    @pitrou
    Member

    pitrou commented Feb 13, 2009

    > Hmm, this fails on python2 too. Maybe re.ASCII was added for backward
    > compatibility? Again, I'm not familiar with unicode, so I won't call
    > remove_ascii_flag.patch a *fix*.

    re.ASCII was added to many stdlib modules because I wanted to minimize
    the potential for breakage when I converted the re library to use
    unicode matching by default.

    If it is desirable for strptime() and friends to match unicode digits
    as well as pure-ASCII digits (which sounds like a reasonable request to
    me), then re.ASCII can probably be dropped without any regret.

    (py3k doesn't have to be 100% compatible with python2 :-))

    @ezio-melotti
    Member Author

    I think Py3 with re.ASCII is the same as Py2 without re.UNICODE (and Py3
    without re.ASCII is the same as Py2 with re.UNICODE).

    It's probably a good idea to have a coherent behavior between Py2 and
    Py3, so if we remove re.ASCII from Py3 we should add re.UNICODE to Py2.
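
The equivalence suggested above can be checked on the Python 3 side with a short sketch. It only exercises the `re` module; the variable names are illustrative. The bytes pattern is included as a rough analogue of matching against an 8-bit string, which is an assumption about how the Py2 `str` case maps onto Python 3.

```python
import re

FULLWIDTH_2009 = "\uff12\uff10\uff10\uff19"  # fullwidth digits 2, 0, 0, 9
PATTERN = r"^\d{4}$"

# str patterns use Unicode matching by default in Python 3.
default_match = re.match(PATTERN, FULLWIDTH_2009)

# Passing re.UNICODE explicitly is redundant for str patterns.
explicit_unicode_match = re.match(PATTERN, FULLWIDTH_2009, re.UNICODE)

# re.ASCII restricts \d to [0-9], so the fullwidth digits no longer match.
ascii_match = re.match(PATTERN, FULLWIDTH_2009, re.ASCII)

# A bytes pattern only ever matches ASCII digits, even on UTF-8 encoded data.
bytes_match = re.match(rb"^\d{4}$", FULLWIDTH_2009.encode("utf-8"))

for name, result in [
    ("default", default_match),
    ("re.UNICODE", explicit_unicode_match),
    ("re.ASCII", ascii_match),
    ("bytes pattern", bytes_match),
]:
    print(name, "matches" if result else "does not match")
```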

    @pitrou
    Member

    pitrou commented Feb 13, 2009

    On Friday, 13 February 2009 at 14:44 +0000, Ezio Melotti wrote:

    > It's probably a good idea to have a coherent behavior between Py2 and
    > Py3, so if we remove re.ASCII from Py3 we should add re.UNICODE to Py2.

    Removing re.ASCII in py3k is a no-brainer, because unicode is how
    strings work by default.
    On the other hand, strings in 2.x are 8-bit, so it would probably be
    better to keep strptime as is.
    As I said, py3k doesn't have to be compatible with 2.x, that's even the
    whole point of it.

    @ezio-melotti
    Member Author

    > Removing re.ASCII in py3k is a no-brainer, because unicode is how
    > strings work by default.

    I meant from the line 265 of _strptime.py, not from Python :P

    @pitrou
    Member

    pitrou commented Feb 13, 2009

    >> Removing re.ASCII in py3k is a no-brainer, because unicode is how
    >> strings work by default.
    >
    > I meant from the line 265 of _strptime.py, not from Python :P

    That's what I understood.

    @ezio-melotti
    Member Author

    Sorry, I misunderstood the meaning of "no-brainer".

    If we add re.UNICODE on Py2, strptime should work fine with unicode
    strings, but it could fail somehow with normal strings. Is it more
    important to provide a way to use Unicode chars that works only with
    unicode strings or to have a coherent behavior between str and unicode?

    I don't think that adding re.UNICODE will break any existing code, but
    it may cause problems if someone tries to use an encoded str instead of
    unicode (though that shouldn't work even now).

    Also note that encoded strings should be a problem only if they have to
    match a strptime directive (e.g. %Y); the other chars should be compared
    as they are, so it should work with str and unicode as long as they are
    not mixed (I think that whitespace is treated differently, though).

    I'll try to add re.UNICODE and see what happens.

    @ocean-city
    Mannequin

    ocean-city mannequin commented Feb 13, 2009

    I added a test, but it requires the bpo-5249 fix in order to pass on
    Windows.

    (I used "\u3000" instead of "\xa0" because "\xa0" cannot be decoded
    with the Windows mbcs codec.)
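
For context on the two characters mentioned above, here is a small sketch of how `\s` treats them with and without `re.ASCII` in Python 3. It assumes that whitespace handling in the parser ultimately relies on `\s`, which is not confirmed in this thread; the helper name `whitespace_matches` is hypothetical.

```python
import re

# The two candidate whitespace characters discussed for the test.
CANDIDATES = {
    "ideographic space": "\u3000",
    "no-break space": "\xa0",
}


def whitespace_matches(ch, ascii_only):
    """Return True if ch is matched by \\s+ under the chosen flag."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\s+", ch, flags) is not None


for name, ch in CANDIDATES.items():
    print(
        name,
        "| Unicode mode:", whitespace_matches(ch, ascii_only=False),
        "| ASCII mode:", whitespace_matches(ch, ascii_only=True),
    )
```

Both characters count as whitespace only in Unicode mode, so either one would distinguish the two flag settings; the choice between them in the test is about what the platform codec can handle.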

    @pitrou
    Member

    pitrou commented Feb 13, 2009

    > If we add re.UNICODE on Py2, strptime should work fine with unicode
    > strings, but it could fail somehow with normal strings. Is it more
    > important to provide a way to use Unicode chars that works only with
    > unicode strings or to have a coherent behavior between str and unicode?

    I'd say the latter, since str and unicode are often interchangeable in
    2.x.

    @ocean-city
    Mannequin

    ocean-city mannequin commented Mar 30, 2009

    This issue seems to be fixed on py3k by r70755. (bpo-5236)

    @brettcannon
    Member

    As Hirokazu pointed out, this was fixed.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022