classification
Title: re.IGNORECASE not Unicode-ready
Type: behavior
Stage:
Components: Regular Expressions
Versions: Python 3.0
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: pitrou
Nosy List: amaury.forgeotdarc, barry, ezio.melotti, gvanrossum, humitos, mark, pitrou, sven.siegmund
Priority: critical
Keywords: patch

Created on 2008-05-12 08:44 by sven.siegmund, last changed 2009-02-13 14:02 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
reunicode.patch pitrou, 2008-06-29 01:19
reunicode2.patch pitrou, 2008-06-29 20:21
reunicode3.patch pitrou, 2008-06-29 20:36
reunicode4.patch pitrou, 2008-07-05 21:09
reunicode5.patch pitrou, 2008-07-28 16:39
Messages (24)
msg66715 - (view) Author: Sven Siegmund (sven.siegmund) Date: 2008-05-12 08:43
re cannot ignore the case of non-ASCII Latin characters:

Python 3.0a5 (py3k:62932M, May  9 2008, 16:23:11) [MSC v.1500 32 bit (Intel)] on win32
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>> import re
>>> rx = re.compile('Á', re.IGNORECASE)
>>> rx.match('á') # should match but won't
>>> rx.match('Á') # will match
<_sre.SRE_Match object at 0x014B08A8>
>>> rx = re.compile('á', re.IGNORECASE)
>>> rx.match('Á') # should match but won't
>>> rx.match('á') # will match
<_sre.SRE_Match object at 0x014B08A8>
msg66727 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-05-12 14:44
Try adding re.LOCALE to the flags.  I'm not sure why that is needed but
it seems to fix this issue.

I still think this is a legitimate bug though.
msg67622 - (view) Author: Manuel Kaufmann (humitos) Date: 2008-06-02 00:23
I have the same error with the re.LOCALE flag...

[humitos] [~]$ python3.0
Python 3.0a5+ (py3k:63855, Jun  1 2008, 13:05:09)
[GCC 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> rx = re.compile('á', re.LOCALE | re.IGNORECASE)
>>> rx.match('Á')
>>> rx.match('á')
<_sre.SRE_Match object at 0x2b955e204d30>
>>> rx = re.compile('Á', re.IGNORECASE | re.LOCALE)
>>> rx.match('Á')
<_sre.SRE_Match object at 0x2b955e204e00>
>>> rx.match('á')
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>>
msg68901 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-28 19:40
Same here, re.LOCALE doesn't circumvent the problem.
msg68905 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-28 20:27
Uh, actually, it works if you specify re.UNICODE. If you don't, the
getlower() function in _sre.c falls back to the plain ASCII algorithm.

>>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>>> pat.match('á')
<_sre.SRE_Match object at 0xb7c66c28>
>>> pat.match('Á')
<_sre.SRE_Match object at 0xb7c66cd0>

I wonder if re.UNICODE shouldn't be the default in Py3k, at least when
the pattern is a string and not a bytes object. There could also be a
re.ASCII flag for those cases where people want to fall back to the old
behaviour.
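
A minimal sketch of the proposal above, for readers on a current Python 3. It assumes the behaviour that was eventually committed for this issue: str patterns use Unicode case folding by default, and re.ASCII restores the ASCII-only rules. The variable names are illustrative only.

```python
import re

# Assumption: a Python 3 interpreter in which str patterns are
# Unicode-aware by default (the outcome of this issue).
default_pat = re.compile('\u00c1', re.IGNORECASE)           # 'Á'
ascii_pat = re.compile('\u00c1', re.IGNORECASE | re.ASCII)  # 'Á', ASCII-only folding

unicode_hit = default_pat.match('\u00e1')    # 'á': case-insensitive match succeeds
ascii_hit = ascii_pat.match('\u00e1')        # 'á': no match, only ASCII letters are folded
same_case_hit = ascii_pat.match('\u00c1')    # 'Á': the identical character still matches

print(unicode_hit is not None, ascii_hit is None, same_case_hit is not None)
```

The re.ASCII variant reproduces the behaviour from the original report, which makes it a convenient regression check.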
msg68920 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-06-28 22:19
Sounds like re.UNICODE should be on by default when the pattern is a str
instance.

Also (per mailing list discussion) we should probably only allow
matching bytes when the pattern is bytes, and matching str when the
pattern is str.

Finally, is there a use case of re.LOCALE any more? I'm thinking not.
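
The type restriction suggested above can be observed with a small sketch. The helper name try_match is hypothetical, and the sketch assumes a Python 3 interpreter in which mixing str and bytes between pattern and subject raises TypeError.

```python
import re

def try_match(pattern, subject):
    """Return the matched text, or the TypeError message if the types are mixed."""
    try:
        m = re.match(pattern, subject)
    except TypeError as exc:
        return f"TypeError: {exc}"
    return m.group() if m else None

str_on_str = try_match('a', 'a')          # str pattern, str subject: allowed
bytes_on_bytes = try_match(b'a', b'a')    # bytes pattern, bytes subject: allowed
str_on_bytes = try_match('a', b'a')       # str pattern, bytes subject: rejected
bytes_on_str = try_match(b'a', 'a')       # bytes pattern, str subject: rejected

for result in (str_on_str, bytes_on_bytes, str_on_bytes, bytes_on_str):
    print(result)
```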
msg68922 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-28 22:35
On Saturday 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:
> Finally, is there a use case of re.LOCALE any more? I'm thinking not.

It's used for locale-specific case matching in the non-unicode case. But
it looks to me like a bad practice and we could probably remove it.

'C'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
>>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
'fr_FR.ISO-8859-1'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
<_sre.SRE_Match object at 0xb7b9ac28>
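
The locale-independent half of the session above can be sketched without changing the process locale. The second part rests on an assumption that goes beyond this thread: that the interpreter is CPython 3.6 or later, where re.LOCALE is only accepted for bytes patterns.

```python
import re

upper = '\u00c0'.encode('latin1')   # 'À' as the single byte b'\xc0'
lower = '\u00e0'.encode('latin1')   # 'à' as the single byte b'\xe0'

# Without re.LOCALE, a bytes pattern only folds case for ASCII letters.
non_ascii_hit = re.match(upper, lower, re.IGNORECASE)
ascii_hit = re.match(b'A', b'a', re.IGNORECASE)

# Assumption: CPython 3.6+, which refuses re.LOCALE on a str pattern.
try:
    re.compile('\u00c0', re.IGNORECASE | re.LOCALE)
    locale_on_str = "accepted"
except ValueError as exc:
    locale_on_str = f"ValueError: {exc}"

print(non_ascii_hit, ascii_hit is not None, locale_on_str)
```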
msg68932 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-29 01:15
Here is a preliminary patch which doesn't remove re.LOCALE, but adds
TypeErrors when the pattern and the matched object differ in type (str
versus bytes), a ValueError when specifying re.UNICODE with a bytes
pattern, and implies re.UNICODE for unicode patterns. The test suite
runs fine after a few fixes.

It also includes the patch for #3231 ("re.compile fails with some bytes
patterns").
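
Two of the behaviours described in this message (the implied re.UNICODE flag and the ValueError) can be checked on a current Python 3 with the sketch below. It assumes the patch's behaviour is what shipped; the variable names are illustrative.

```python
import re

# A str pattern gets re.UNICODE implicitly; a bytes pattern does not.
str_is_unicode = bool(re.compile('a').flags & re.UNICODE)
bytes_is_unicode = bool(re.compile(b'a').flags & re.UNICODE)

# Explicitly asking for re.UNICODE on a bytes pattern is refused.
try:
    re.compile(b'a', re.UNICODE)
    unicode_on_bytes = "accepted"
except ValueError as exc:
    unicode_on_bytes = f"ValueError: {exc}"

print(str_is_unicode, bytes_is_unicode, unicode_on_bytes)
```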
msg68966 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-29 20:21
This new patch also introduces re.ASCII as discussed on the mailing-list.
msg68967 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-29 20:36
Improved patch which also detects incompatibilities for "(?u)".
msg69298 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-07-05 21:09
This new patch adds re.ASCII in all sensitive places I could find in the
stdlib (except lib2to3 which as far as I understand is maintained in a
separate branch, and even has its own copy of tokenize.py...).

Also, I didn't get an answer to the following question on the ML: should
an inline flag "(?a)" be introduced to mirror the existing "(?u)", so
as to set the ASCII flag from inside a pattern string?
msg69301 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-07-05 21:30
http://codereview.appspot.com/2439
msg70354 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-07-28 16:39
Final patch adding the (?a) inline flag (equivalent to re.ASCII). Please
review:
http://codereview.appspot.com/2439
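
A short sketch of the equivalence stated above between the "(?a)" inline flag and re.ASCII. The sample word and variable names are illustrative, and the sketch assumes a Python 3 interpreter that includes this patch.

```python
import re

inline = re.compile(r'(?a)\w+')           # ASCII mode set from inside the pattern
explicit = re.compile(r'\w+', re.ASCII)   # ASCII mode set through the flag
default = re.compile(r'\w+')              # Unicode mode (default for str patterns)

word = 'caf\u00e9'  # 'café' with a precomposed e-acute

inline_hit = inline.match(word).group()      # stops before the non-ASCII letter
explicit_hit = explicit.match(word).group()  # same result as the inline form
default_hit = default.match(word).group()    # consumes the whole word

# The inline flag is reflected in the compiled pattern's flags.
inline_sets_flag = bool(inline.flags & re.ASCII)

print(inline_hit, explicit_hit, default_hit, inline_sets_flag)
```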
msg70370 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-07-28 20:41
Are all those re.ASCII flags mandatory, or are they here just for
theoretical correctness?
For example, the output of "gcc -dumpversion" is certainly plain ASCII.
I don't mind that \d also matches some exotic digit - it just won't happen.
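
To make the "\d also matches some exotic digit" point concrete, here is a hedged sketch. The version pattern is a made-up stand-in for the kind of stdlib regex under discussion, not the actual one, and the fullwidth input is contrived.

```python
import re

# Hypothetical version-number pattern, compiled with and without re.ASCII.
version_re = re.compile(r'(\d+)\.(\d+)')
ascii_version_re = re.compile(r'(\d+)\.(\d+)', re.ASCII)

plain = '4.1'
fullwidth = '\uff14.\uff11'  # FULLWIDTH DIGIT FOUR, '.', FULLWIDTH DIGIT ONE

plain_default = version_re.match(plain) is not None
plain_ascii = ascii_version_re.match(plain) is not None
fullwidth_default = version_re.match(fullwidth) is not None      # Unicode digits accepted
fullwidth_ascii = ascii_version_re.match(fullwidth) is not None  # rejected under re.ASCII

print(plain_default, plain_ascii, fullwidth_default, fullwidth_ascii)
```

For plain ASCII input the two patterns behave identically, which is the situation described for "gcc -dumpversion"; the flag only matters if non-ASCII digits can reach the pattern.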
msg70371 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-07-28 20:49
On Monday 28 July 2008 at 20:41 +0000, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
> Are all those re.ASCII flags mandatory, or are they here just for
> theoretical correctness?

For theoretical correctness. I just don't want to analyze each case
individually and I'm probably not competent for many of them.
msg70780 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-06 10:29
If nobody (except Amaury :-)) has anything to say about the current
patch, should it be committed?
msg70787 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-08-06 16:34
Let's make sure the release manager is OK with this.
msg71186 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-15 21:31
Barry?
msg71413 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-08-19 12:57
I haven't looked at the specific patch, but based on the description of
the behavior, I'm +1 on committing this before beta 3.  I'm fine with
leaving the re.ASCII flags in there -- it will be a marker to indicate
perhaps the code needs a closer examination (eventually).
msg71414 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2008-08-19 12:58
Make sure of course that the documentation is updated and a NEWS file
entry is added.
msg71455 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-19 17:59
Fixed in r65860. Someone should check the docs though (at least try to
generate them, and review my changes a bit since English isn't my mother
tongue).
msg71516 - (view) Author: Mark Summerfield (mark) Date: 2008-08-20 07:36
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).

I've revised the ASCII and LOCALE-related texts in re.rst in r65903.
msg71517 - (view) Author: Mark Summerfield (mark) Date: 2008-08-20 07:40
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).

And two more (tiny) fixes in r65904; that's my lot:-)
msg71519 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-20 08:49
Thanks a lot Mark!
History
Date                 User                Action  Args
2009-02-13 14:02:50  ezio.melotti        set     nosy: + ezio.melotti
2009-02-13 13:52:31  ocean-city          link    issue5239 dependencies
2009-02-13 12:34:49  ocean-city          link    issue5240 dependencies
2008-08-20 08:49:38  pitrou              set     messages: + msg71519
2008-08-20 07:40:55  mark                set     messages: + msg71517
2008-08-20 07:36:30  mark                set     messages: + msg71516
2008-08-19 17:59:29  pitrou              set     status: open -> closed
                                                 resolution: accepted -> fixed
                                                 messages: + msg71455
2008-08-19 12:58:07  barry               set     messages: + msg71414
2008-08-19 12:57:41  barry               set     resolution: accepted
                                                 messages: + msg71413
2008-08-15 21:31:11  pitrou              set     messages: + msg71186
2008-08-06 16:34:33  gvanrossum          set     nosy: + barry
                                                 messages: + msg70787
2008-08-06 10:29:16  pitrou              set     messages: + msg70780
2008-07-28 20:49:16  pitrou              set     messages: + msg70371
2008-07-28 20:41:56  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc
                                                 messages: + msg70370
2008-07-28 16:39:31  pitrou              set     files: + reunicode5.patch
                                                 messages: + msg70354
2008-07-24 15:07:53  pitrou              set     priority: critical
                                                 assignee: pitrou
2008-07-24 12:39:00  mark                set     nosy: + mark
2008-07-05 21:30:04  pitrou              set     messages: + msg69301
2008-07-05 21:10:11  pitrou              set     files: + reunicode4.patch
                                                 messages: + msg69298
2008-06-29 20:36:38  pitrou              set     files: + reunicode3.patch
                                                 messages: + msg68967
2008-06-29 20:21:07  pitrou              set     files: + reunicode2.patch
                                                 messages: + msg68966
2008-06-29 01:19:44  pitrou              set     files: + reunicode.patch
2008-06-29 01:19:17  pitrou              set     files: - reunicode.patch
2008-06-29 01:15:28  pitrou              set     files: + reunicode.patch
                                                 keywords: + patch
                                                 messages: + msg68932
2008-06-28 22:35:39  pitrou              set     messages: + msg68922
2008-06-28 22:19:03  gvanrossum          set     messages: + msg68920
2008-06-28 20:27:24  pitrou              set     messages: + msg68905
2008-06-28 19:40:35  pitrou              set     nosy: + pitrou
                                                 messages: + msg68901
2008-06-02 00:23:02  humitos             set     nosy: + humitos
                                                 messages: + msg67622
2008-05-12 14:44:03  gvanrossum          set     nosy: + gvanrossum
                                                 messages: + msg66727
2008-05-12 08:44:03  sven.siegmund       create