classification
Title: Change time.strptime() to make it work with Unicode chars
Type: behavior Stage:
Components: Library (Lib), Unicode Versions: Python 3.0, Python 2.4, Python 3.1, Python 2.7, Python 2.6, Python 2.5
process
Status: closed Resolution: fixed
Dependencies: 2834 5249 Superseder:
Assigned To: Nosy List: brett.cannon, ezio.melotti, ocean-city, pitrou
Priority: normal Keywords: patch

Created on 2009-02-13 01:46 by ezio.melotti, last changed 2009-03-30 21:54 by brett.cannon. This issue is now closed.

Files
File name Uploaded Description Edit
remove_ascii_flag.patch ocean-city, 2009-02-13 16:30
Messages (13)
msg81847 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-13 01:46
On Py3 strptime("２００９", "%Y") fails:
>>> strptime("２００９", "%Y")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.0/_strptime.py", line 454, in _strptime_time
    return _strptime(data_string, format)[0]
  File "/usr/local/lib/python3.0/_strptime.py", line 325, in _strptime
    (data_string, format))
ValueError: time data '２００９' does not match format '%Y'

but non-ASCII numbers are supported elsewhere:
>>> int("２００９")
2009
>>> re.match("^\d{4}$", "２００９").group()
'２００９'

The problem seems to be at line 265 of _strptime.py:
        return re_compile(self.pattern(format), IGNORECASE | ASCII)
The ASCII flag prevents the regex from working properly with '２００９':
>>> re.match("^\d{4}$", "２００９", re.ASCII)
>>>

I tried to remove the ASCII flag and it worked fine.

On Py2.x the problem is the same:
>>> strptime(u"２００９", "%Y")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/_strptime.py", line 330, in strptime
    (data_string, format))
ValueError>>>
>>> int(u"２００９")
2009
>>> re.match("^\d{4}$", u"２００９")

Here the re.UNICODE flag probably needs to be added at line 265 (untested):
        return re_compile(self.pattern(format), IGNORECASE | UNICODE)
in order to make it work:
>>> re.match("^\d{4}$", u"２００９", re.U).group()
u'\uff12\uff10\uff10\uff19'
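The effect described in this message can be reproduced on Python 3 with a minimal sketch like the one below. It assumes the year is written with fullwidth digits (the code points shown in the output above); the name FULLWIDTH_YEAR is only illustrative and does not appear in _strptime.py.

```python
import re

# Fullwidth digits spelling 2009 (same code points as in the output above).
FULLWIDTH_YEAR = "\uff12\uff10\uff10\uff19"

# Python 3 str patterns use Unicode matching by default, so \d accepts
# any Unicode decimal digit.
unicode_match = re.match(r"^\d{4}$", FULLWIDTH_YEAR)

# With re.ASCII, \d is restricted to [0-9], so the same text is rejected.
ascii_match = re.match(r"^\d{4}$", FULLWIDTH_YEAR, re.ASCII)

print(unicode_match is not None)  # True
print(ascii_match is None)        # True
print(int(FULLWIDTH_YEAR))        # 2009
```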
msg81928 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-02-13 13:52
This patch comes from issue5240. I think a test case is needed. I'll try
to write one if I can.
msg81932 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-02-13 14:13
Hmm, this fails on python2 too. Maybe re.ASCII was added for backward
compatibility? Again, I'm not familiar with unicode, so I won't call
remove_ascii_flag.patch a *fix*.
msg81934 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-13 14:24
> Hmm, this fails on python2 too. Maybe re.ASCII was added for backward
> compatibility? Again, I'm not familiar with unicode, so I won't call
> remove_ascii_flag.patch a *fix*.

re.ASCII was added to many stdlib modules because I wanted to minimize
the potential for breakage when I converted the re library to use
unicode matching by default.

If it is desirable for strptime() and friends to match unicode digits
as well as pure-ASCII digits (which sounds like a reasonable request to
me), then re.ASCII can probably be dropped without any regret.

(py3k doesn't have to be 100% compatible with python2 :-))
msg81938 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-13 14:44
I think Py3 with re.ASCII is the same as Py2 without re.UNICODE (and Py3
without re.ASCII is the same as Py2 with re.UNICODE).

It's probably a good idea to have a coherent behavior between Py2 and
Py3, so if we remove re.ASCII from Py3 we should add re.UNICODE to Py2.
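The Py3 half of this equivalence can be checked by inspecting the flags of compiled patterns. This is only a sketch for Python 3 (the Py2 half cannot be run there); the variable names are illustrative.

```python
import re

# A Python 3 str pattern compiled without flags gets re.UNICODE implicitly.
default_pat = re.compile(r"\d")

# Passing re.ASCII switches the pattern to ASCII-only character classes.
ascii_pat = re.compile(r"\d", re.ASCII)

has_unicode_by_default = bool(default_pat.flags & re.UNICODE)
has_unicode_with_ascii = bool(ascii_pat.flags & re.UNICODE)

print(has_unicode_by_default)  # True
print(has_unicode_with_ascii)  # False
```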
msg81939 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-13 14:50
On Friday 13 February 2009 at 14:44 +0000, Ezio Melotti wrote:
> It's probably a good idea to have a coherent behavior between Py2 and
> Py3, so if we remove re.ASCII from Py3 we should add re.UNICODE to Py2.

Removing re.ASCII in py3k is a no-brainer, because unicode is how
strings work by default.
On the other hand, strings in 2.x are 8-bit, so it would probably be
better to keep strptime as is.
As I said, py3k doesn't have to be compatible with 2.x, that's even the
whole point of it.
msg81940 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-13 15:27
> Removing re.ASCII in py3k is a no-brainer, because unicode is how
> strings work by default.

I meant from the line 265 of _strptime.py, not from Python :P
msg81941 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-13 15:30
> > Removing re.ASCII in py3k is a no-brainer, because unicode is how
> > strings work by default.
> 
> I meant from the line 265 of _strptime.py, not from Python :P

That's what I understood.
msg81948 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-02-13 16:26
Sorry, I misunderstood the meaning of "no-brainer".

If we add re.UNICODE on Py2, strptime should work fine with unicode
strings, but it could fail somehow with normal strings. Is it more
important to provide a way to use Unicode chars that works only with
unicode strings or to have a coherent behavior between str and unicode?

I don't think that adding re.UNICODE will break any existing code, but
it may cause problems if someone tries to use an encoded str instead of
unicode (but that shouldn't work already).

Also note that encoded strings should be a problem only if they have to
match a strptime directive (e.g. %Y); the other chars should be compared
as they are, so it should work with str and unicode as long as they are
not mixed (I think that whitespace is treated differently though).

I'll try to add re.UNICODE and see what happens.
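The distinction between directives and literal characters can be sketched on Python 3 as follows. This is not the code from _strptime.py: format_to_regex and the DIRECTIVES table are hypothetical, heavily simplified stand-ins that only illustrate why the flag matters for directives such as %Y but not for literal text.

```python
import re

# Hypothetical, simplified directive table (the real module has many more).
DIRECTIVES = {
    "Y": r"(?P<Y>\d\d\d\d)",
    "m": r"(?P<m>1[0-2]|0[1-9]|[1-9])",
}

def format_to_regex(fmt, flags=0):
    """Translate a strptime-like format into a compiled regex (sketch)."""
    parts = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt) and fmt[i + 1] in DIRECTIVES:
            # Directives become character-class patterns, which depend on flags.
            parts.append(DIRECTIVES[fmt[i + 1]])
            i += 2
        else:
            # Literal characters are escaped and compared as they are.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), flags)

fmt = "%Y\u5e74"  # a year followed by a literal non-ASCII character
fullwidth_text = "\uff12\uff10\uff10\uff19\u5e74"

loose = format_to_regex(fmt)             # Unicode matching (Py3 default)
strict = format_to_regex(fmt, re.ASCII)  # ASCII-only character classes

m = loose.match(fullwidth_text)
print(int(m.group("Y")))                       # 2009
print(strict.match(fullwidth_text))            # None
print(strict.match("2009\u5e74") is not None)  # True: the literal still matches
```

In this sketch the non-ASCII literal matches under both settings, while the fullwidth digits only match the %Y directive when Unicode matching is enabled.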
msg81949 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-02-13 16:30
I added a test, but it requires the issue5249 fix in order to pass on
Windows.

(I used "\u3000" instead of "\xa0" because "\xa0" cannot be decoded with
the Windows mbcs codec)
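For reference, both characters mentioned here count as whitespace for a Unicode regex but not for an ASCII one. The sketch below (Python 3) assumes this matters because whitespace in a format is matched through the \s class; it does not exercise the mbcs codec, and the names are illustrative.

```python
import re

CANDIDATES = {
    "NO-BREAK SPACE": "\xa0",
    "IDEOGRAPHIC SPACE": "\u3000",
}

# Map each name to (matches \s in Unicode mode, matches \s in ASCII mode).
results = {}
for name, ch in CANDIDATES.items():
    unicode_hit = re.fullmatch(r"\s", ch) is not None
    ascii_hit = re.fullmatch(r"\s", ch, re.ASCII) is not None
    results[name] = (unicode_hit, ascii_hit)
    print(name, unicode_hit, ascii_hit)
```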
msg81952 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-13 16:57
> If we add re.UNICODE on Py2, strptime should work fine with unicode
> strings, but it could fail somehow with normal strings. Is it more
> important to provide a way to use Unicode chars that works only with
> unicode strings or to have a coherent behavior between str and unicode?

I'd say the latter, since str and unicode are often interchangeable in
2.x.
msg84665 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2009-03-30 21:52
This issue seems to be fixed on py3k by r70755. (issue5236)
msg84669 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2009-03-30 21:54
As Hirokazu pointed out, this was fixed.
History
Date User Action Args
2009-03-30 21:54:24 brett.cannon set status: open -> closed
resolution: fixed
messages: + msg84669
2009-03-30 21:52:24 ocean-city set messages: + msg84665
2009-02-13 19:06:30 brett.cannon set nosy: + brett.cannon
2009-02-13 16:57:15 pitrou set messages: + msg81952
2009-02-13 16:30:50 ocean-city set files: - remove_ascii_flag.patch
2009-02-13 16:30:39 ocean-city set files: + remove_ascii_flag.patch
dependencies: + Fix strftime on windows.
messages: + msg81949
2009-02-13 16:26:07 ezio.melotti set messages: + msg81948
2009-02-13 15:30:08 pitrou set messages: + msg81941
2009-02-13 15:27:52 ezio.melotti set messages: + msg81940
2009-02-13 14:50:33 pitrou set messages: + msg81939
2009-02-13 14:44:18 ezio.melotti set messages: + msg81938
2009-02-13 14:24:37 pitrou set messages: + msg81934
2009-02-13 14:13:06 ocean-city set nosy: + pitrou
messages: + msg81932
2009-02-13 14:01:30 ezio.melotti set title: time.strptime("２００９", "%Y") raises a value error -> Change time.strptime() to make it work with Unicode chars
2009-02-13 13:52:31 ocean-city set files: + remove_ascii_flag.patch
keywords: + patch
dependencies: + re.IGNORECASE not Unicode-ready
messages: + msg81928
nosy: + ocean-city
2009-02-13 13:43:17 ocean-city link issue5240 superseder
2009-02-13 01:47:01 ezio.melotti set type: behavior
components: + Library (Lib), Unicode
versions: + Python 2.6, Python 2.5, Python 2.4, Python 3.0, Python 3.1, Python 2.7
2009-02-13 01:46:34 ezio.melotti create