re.IGNORECASE not Unicode-ready #47083

svensiegmund · 2008-05-12T08:44:03Z

BPO	2834
Nosy	@gvanrossum, @warsaw, @amauryfa, @pitrou, @mark-summerfield, @humitos, @ezio-melotti
Files	reunicode.patch reunicode2.patch reunicode3.patch reunicode4.patch reunicode5.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = 'https://github.com/pitrou'
closed_at = <Date 2008-08-19.17:59:29.265>
created_at = <Date 2008-05-12.08:44:02.715>
labels = ['expert-regex', 'type-bug']
title = 're.IGNORECASE not Unicode-ready'
updated_at = <Date 2009-02-13.14:02:50.756>
user = 'https://bugs.python.org/svensiegmund'

bugs.python.org fields:

activity = <Date 2009-02-13.14:02:50.756>
actor = 'ezio.melotti'
assignee = 'pitrou'
closed = True
closed_date = <Date 2008-08-19.17:59:29.265>
closer = 'pitrou'
components = ['Regular Expressions']
creation = <Date 2008-05-12.08:44:02.715>
creator = 'sven.siegmund'
dependencies = []
files = ['10768', '10777', '10778', '10819', '10998']
hgrepos = []
issue_num = 2834
keywords = ['patch']
message_count = 24.0
messages = ['66715', '66727', '67622', '68901', '68905', '68920', '68922', '68932', '68966', '68967', '69298', '69301', '70354', '70370', '70371', '70780', '70787', '71186', '71413', '71414', '71455', '71516', '71517', '71519']
nosy_count = 8.0
nosy_names = ['gvanrossum', 'barry', 'amaury.forgeotdarc', 'pitrou', 'mark', 'humitos', 'ezio.melotti', 'sven.siegmund']
pr_nums = []
priority = 'critical'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue2834'
versions = ['Python 3.0']

svensiegmund · 2008-05-12T08:43:25Z

re cannot ignore the case of non-ASCII Latin characters:

Python 3.0a5 (py3k:62932M, May  9 2008, 16:23:11) [MSC v.1500 32 bit (Intel)] on win32
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>> import re
>>> rx = re.compile('Á', re.IGNORECASE)
>>> rx.match('á') # should match but won't
>>> rx.match('Á') # will match
<_sre.SRE_Match object at 0x014B08A8>
>>> rx = re.compile('á', re.IGNORECASE)
>>> rx.match('Á') # should match but won't
>>> rx.match('á') # will match
<_sre.SRE_Match object at 0x014B08A8>

gvanrossum · 2008-05-12T14:44:02Z

Try adding re.LOCALE to the flags. I'm not sure why that is needed but
it seems to fix this issue.

I still think this is a legitimate bug though.

humitos · 2008-06-02T00:23:01Z

I have the same error with the re.LOCALE flag...

[humitos] [~]$ python3.0
Python 3.0a5+ (py3k:63855, Jun  1 2008, 13:05:09)
[GCC 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> rx = re.compile('á', re.LOCALE | re.IGNORECASE)
>>> rx.match('Á')
>>> rx.match('á')
<_sre.SRE_Match object at 0x2b955e204d30>
>>> rx = re.compile('Á', re.IGNORECASE | re.LOCALE)
>>> rx.match('Á')
<_sre.SRE_Match object at 0x2b955e204e00>
>>> rx.match('á')
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>>

pitrou · 2008-06-28T19:40:35Z

Same here, re.LOCALE doesn't circumvent the problem.

pitrou · 2008-06-28T20:27:23Z

Uh, actually, it works if you specify re.UNICODE. If you don't, the
getlower() function in _sre.c falls back to the plain ASCII algorithm.

>>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>>> pat.match('á')
<_sre.SRE_Match object at 0xb7c66c28>
>>> pat.match('Á')
<_sre.SRE_Match object at 0xb7c66cd0>

I wonder if re.UNICODE shouldn't be the default in Py3k, at least when
the pattern is a string and not a bytes object. There may also be a
re.ASCII flag for those cases where people want to fall back to the old
behaviour.
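
For readers on a released Python 3, where this proposal was eventually adopted (see the fix later in this thread), the difference can be sketched as follows. The variable names are illustrative only and do not come from the patch.

```python
import re

# str patterns use Unicode case rules by default in released Python 3.
default_pat = re.compile('Á', re.IGNORECASE)
# re.ASCII restricts case-insensitive matching to ASCII letters.
ascii_pat = re.compile('Á', re.IGNORECASE | re.ASCII)

default_match = default_pat.match('á')  # a match object
ascii_match = ascii_pat.match('á')      # None: 'Á' and 'á' are not folded together

print(default_match is not None, ascii_match is None)
```

The original report corresponds to the second case: before the change, a str pattern without re.UNICODE behaved like the re.ASCII variant here.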

gvanrossum · 2008-06-28T22:19:03Z

Sounds like re.UNICODE should be on by default when the pattern is a str
instance.

Also (per mailing list discussion) we should probably only allow
matching bytes when the pattern is bytes, and matching str when the
pattern is str.
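
A small sketch of what that restriction looks like in released Python 3, assuming the behaviour that was eventually committed; `try_match` is a hypothetical helper written only for this illustration.

```python
import re

str_pat = re.compile('abc')
bytes_pat = re.compile(b'abc')

def try_match(pattern, subject):
    """Return the match result, or the TypeError raised on a type mismatch."""
    try:
        return pattern.match(subject)
    except TypeError as exc:
        return exc

same_type = try_match(str_pat, 'abc')         # match object
str_on_bytes = try_match(str_pat, b'abc')     # TypeError instance
bytes_on_str = try_match(bytes_pat, 'abc')    # TypeError instance

print(type(same_type).__name__, type(str_on_bytes).__name__, type(bytes_on_str).__name__)
```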

Finally, is there a use case of re.LOCALE any more? I'm thinking not.

pitrou · 2008-06-28T22:35:39Z

On Saturday 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:

> Finally, is there a use case of re.LOCALE any more? I'm thinking not.

It's used for locale-specific case matching in the non-unicode case. But
it looks to me like a bad practice and we could probably remove it.

'C'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
>>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
'fr_FR.ISO-8859-1'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
<_sre.SRE_Match object at 0xb7b9ac28>

pitrou · 2008-06-29T01:15:25Z

Here is a preliminary patch which doesn't remove re.LOCALE, but raises
TypeError for type-mismatched matching (a str pattern against bytes, or
vice versa), raises ValueError when re.UNICODE is specified with a bytes
pattern, and implies re.UNICODE for str patterns. The test suite runs
fine after a few fixes.

It also includes the patch for bpo-3231 ("re.compile fails with some bytes
patterns").
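
The two flag rules described above can be checked on a released Python 3 with a sketch like this one; the names `implied` and `rejected` are illustrative and not taken from the patch.

```python
import re

# A str pattern compiled without explicit flags still carries re.UNICODE.
implied = bool(re.compile('á').flags & re.UNICODE)

# Combining re.UNICODE with a bytes pattern is refused with ValueError.
try:
    re.compile(b'a', re.UNICODE)
    rejected = False
except ValueError:
    rejected = True

print(implied, rejected)
```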

pitrou · 2008-06-29T20:21:02Z

This new patch also introduces re.ASCII as discussed on the mailing-list.

pitrou · 2008-06-29T20:36:32Z

Improved patch which also detects incompatibilities for "(?u)".
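
As a rough illustration of the inline-flag case on a released Python 3: a bytes pattern containing "(?u)" is rejected at compile time. The exact exception type is an assumption that may vary between versions, so the sketch accepts either ValueError or re.error.

```python
import re

# "(?u)" inside a bytes pattern conflicts with the bytes/ASCII semantics.
try:
    re.compile(b'(?u)abc')
    inline_rejected = False
except (ValueError, re.error):
    inline_rejected = True

# The same inline flag in a str pattern is accepted (it is redundant there).
str_ok = re.compile('(?u)abc').match('abc') is not None

print(inline_rejected, str_ok)
```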

pitrou · 2008-07-05T21:09:59Z

This new patch adds re.ASCII in all sensitive places I could find in the
stdlib (except lib2to3 which as far as I understand is maintained in a
separate branch, and even has its own copy of tokenize.py...).

Also, I didn't get an answer to the following question on the ML: should
an inline flag "(?a)" be introduced to mirror the existing "(?u)" - so
as to set the ASCII flag from inside a pattern string.

pitrou · 2008-07-05T21:30:05Z

http://codereview.appspot.com/2439

pitrou · 2008-07-28T16:39:19Z

Final patch adding the (?a) inline flag (equivalent to re.ASCII). Please
review:
http://codereview.appspot.com/2439
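
A minimal sketch of the inline flag as it behaves in released Python 3, using a made-up sample string; "(?a)" at the start of a pattern has the same effect as passing re.ASCII.

```python
import re

text = 'été'

unicode_word = re.match(r'\w+', text)           # \w matches accented letters
inline_ascii = re.match(r'(?a)\w+', text)       # None: 'é' is not an ASCII word character
flag_ascii = re.match(r'\w+', text, re.ASCII)   # None, same as the inline form

print(unicode_word.group(), inline_ascii, flag_ascii)
```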

amauryfa · 2008-07-28T20:41:55Z

Are all those re.ASCII flags mandatory, or are they here just for
theoretical correctness?
For example, the output of "gcc -dumpversion" is certainly plain ASCII.
I don't mind that \d also matches some exotic digit - it just won't happen.
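
To make the "exotic digit" point concrete, here is a sketch on a released Python 3 using ARABIC-INDIC DIGIT THREE (U+0663) as an arbitrary example of a non-ASCII decimal digit.

```python
import re

arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE, Unicode category Nd

unicode_digit = re.fullmatch(r'\d', arabic_three)          # matches under Unicode rules
ascii_digit = re.fullmatch(r'\d', arabic_three, re.ASCII)  # None: only 0-9 count

print(unicode_digit is not None, ascii_digit is None)
```

For input that is known to be plain ASCII, such as the compiler version string mentioned above, both forms accept the same text.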

pitrou · 2008-07-28T20:49:16Z

On Monday 28 July 2008 at 20:41 +0000, Amaury Forgeot d'Arc wrote:

> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Are all those re.ASCII flags mandatory, or are they here just for
> theoretical correctness?

For theoretical correctness. I just don't want to analyze each case
individually and I'm probably not competent for many of them.

pitrou · 2008-08-06T10:29:16Z

If nobody (except Amaury :-)) has anything to say about the current
patch, should it be committed?

gvanrossum · 2008-08-06T16:34:32Z

Let's make sure the release manager is OK with this.

pitrou · 2008-08-15T21:31:11Z

Barry?

warsaw · 2008-08-19T12:57:41Z

I haven't looked at the specific patch, but based on the description of
the behavior, I'm +1 on committing this before beta 3. I'm fine with
leaving the re.ASCII flags in there -- it will be a marker to indicate
perhaps the code needs a closer examination (eventually).

warsaw · 2008-08-19T12:58:08Z

Make sure of course that the documentation is updated and a NEWS file
entry is added.

pitrou · 2008-08-19T17:59:28Z

Fixed in r65860. Someone should check the docs though (at least try to
generate them, and review my changes a bit since English isn't my mother
tongue).

mark-summerfield · 2008-08-20T07:36:29Z

On 2008-08-19, Antoine Pitrou wrote:

> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).

I've revised the ASCII and LOCALE-related texts in re.rst in r65903.

mark-summerfield · 2008-08-20T07:40:55Z

On 2008-08-19, Antoine Pitrou wrote:

> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).

And two more (tiny) fixes in r65904; that's my lot:-)

pitrou · 2008-08-20T08:49:39Z

Thanks a lot Mark!

svensiegmund mannequin added the topic-regex and type-bug labels May 12, 2008

pitrou self-assigned this Jul 24, 2008

pitrou closed this as completed Aug 19, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022
