IGNORECASE breaks unicode literal range matching #61583

acdha · 2013-03-07T20:52:32Z

BPO	17381
Nosy	@ezio-melotti, @serhiy-storchaka
Dependencies	bpo-22584: Get rid of SRE character tables
Files	re_ignore_case_range.patch re_ignore_case_range-3.5.patch re_ignore_case_range-3.4_2.patch re_ignore_case_range-3.5_2.patch re_ignore_case_range-3.5_3.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2014-10-31.16:12:16.319>
created_at = <Date 2013-03-07.20:52:32.442>
labels = ['expert-regex', 'type-bug']
title = 'IGNORECASE breaks unicode literal range matching'
updated_at = <Date 2014-10-31.16:12:16.318>
user = 'https://bugs.python.org/acdha'

bugs.python.org fields:

activity = <Date 2014-10-31.16:12:16.318>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2014-10-31.16:12:16.319>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2013-03-07.20:52:32.442>
creator = 'acdha'
dependencies = ['22584']
files = ['36575', '36636', '36712', '36839', '36842']
hgrepos = []
issue_num = 17381
keywords = ['patch', 'needs review']
message_count = 17.0
messages = ['183705', '183712', '183753', '183988', '183989', '183992', '183993', '184016', '226608', '226989', '227485', '228814', '228837', '229919', '230332', '230336', '230350']
nosy_count = 5.0
nosy_names = ['ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka', 'acdha']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue17381'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

acdha · 2013-03-07T20:52:32Z

I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text:

>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''

The same is true in Python 2.7, although you need to use ur'' patterns for the literals to be expanded:

>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|regex.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'

In contrast, the regex module behaves as expected:

>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''

(Transcript maintained at https://gist.github.com/acdha/5111687)

mrabarnett · 2013-03-07T23:19:56Z

The way the re handles ranges is to convert the two endpoints to lowercase and then check whether the lowercase form of the character in the text is in that range.

For example, [A-Z] is converted to the range [\x41-\x5A], and the lowercase form of 'Q' ('\x51') is 'q' ('\x7A'), which is in the range.

In your example, [\u0400-\u0527] is converted to the range [\u0450-\u0527], but the lowercase form of 'А' ('\u0410') is 'а' ('\u0430'), which isn't in the range.

This is the same as issue bpo-3511, but a worse failure.

acdha · 2013-03-08T18:22:51Z

Ah, that explains it - I'd been hoping based on the re.DEBUG output that the explicit unicode ranges were preserved.

I found bpo-3511 before opening this one but don't believe the decision should be the same since this isn't a mixed numeric/alphabetic range.

ezio-melotti · 2013-03-11T19:50:22Z

Matthew, should this be closed then?

acdha · 2013-03-11T19:59:43Z

Ezio: given the non-obvious failure, what do you think of at least documenting this and issuing a warning any time both re.UNICODE and re.IGNORECASE are set?

mrabarnett · 2013-03-11T21:00:25Z

In issue bpo-3511 the range was slightly unusual, so closing it seemed a reasonable approach, but the range in this issue is less clearly a problem. My preference would be to fix it, if possible.

serhiy-storchaka · 2013-03-11T21:24:54Z

I'm working on the patch.

ezio-melotti · 2013-03-12T08:11:32Z

Is this the same issue described in bpo-12728?

serhiy-storchaka · 2014-09-08T19:58:13Z

No, bpo-12728 is more complicate case.

Here is a patch which fixes this issue and bpo-3511.

serhiy-storchaka · 2014-09-17T08:57:11Z

This patch has a disadvantage - it slows down case-insensitive compiling of some very wide ranges, e.g. compile(r"[\x00-\U0010ffff]+", re.I) (this is worst case). In most cases this is not important, because such wide ranges are rare enough and compiled patterns are cached.

To get rid of this regression, we need new opcode. Due to preserving binary compatibility, this approach can't be applied to old releases. Here is a patch for 3.5.

Please make a review. This patches are needed to continue fixing of other re bugs.

serhiy-storchaka · 2014-09-24T19:17:11Z

Here is other patch for 3.4. It is more than 10 times faster than initial patch in worst case.

serhiy-storchaka · 2014-10-08T19:41:55Z

Actually 3.5 patch can be simpler.

serhiy-storchaka · 2014-10-09T07:50:41Z

Updated patch for 3.5 addresses Antoine's comments.

Note that 3.4 and 3.5 use different solutions of this issue.

serhiy-storchaka · 2014-10-24T12:28:34Z

Does the patch look good now for you Antoine? If there are no objections I'm going to commit it soon.

In order to apply 3.4 patch to 2.7 we need either significant modify the patch, or first backport bpo-19329 changes to 2.7 (it would be easier).

python-dev · 2014-10-31T10:42:54Z

New changeset 6f52a3d0f548 by Serhiy Storchaka in branch 'default':
Issue bpo-17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/6f52a3d0f548

New changeset 7981cb1556cf by Serhiy Storchaka in branch '3.4':
Issue bpo-17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/7981cb1556cf

python-dev · 2014-10-31T11:55:20Z

New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d

New changeset 6cd4b9827755 by Serhiy Storchaka in branch '2.7':
Issue bpo-17381: Fixed ranges handling in case-insensitive regular expressions.
https://hg.python.org/cpython/rev/6cd4b9827755

serhiy-storchaka · 2014-10-31T16:12:16Z

Thank you Antoine for your review.

acdha mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Mar 7, 2013

serhiy-storchaka self-assigned this Sep 8, 2014

serhiy-storchaka closed this as completed Oct 31, 2014

ezio-melotti transferred this issue from another repository Apr 10, 2022

serhiy-storchaka mentioned this issue Apr 15, 2022

Outdated Unicode data in the re module #91575

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IGNORECASE breaks unicode literal range matching #61583

IGNORECASE breaks unicode literal range matching #61583

acdha mannequin commented Mar 7, 2013

acdha mannequin commented Mar 7, 2013

mrabarnett mannequin commented Mar 7, 2013

acdha mannequin commented Mar 8, 2013

ezio-melotti commented Mar 11, 2013

acdha mannequin commented Mar 11, 2013

mrabarnett mannequin commented Mar 11, 2013

serhiy-storchaka commented Mar 11, 2013

ezio-melotti commented Mar 12, 2013

serhiy-storchaka commented Sep 8, 2014

serhiy-storchaka commented Sep 17, 2014

serhiy-storchaka commented Sep 24, 2014

serhiy-storchaka commented Oct 8, 2014

serhiy-storchaka commented Oct 9, 2014

serhiy-storchaka commented Oct 24, 2014

python-dev mannequin commented Oct 31, 2014

python-dev mannequin commented Oct 31, 2014

serhiy-storchaka commented Oct 31, 2014

IGNORECASE breaks unicode literal range matching #61583

IGNORECASE breaks unicode literal range matching #61583

Comments

acdha mannequin commented Mar 7, 2013

acdha mannequin commented Mar 7, 2013

mrabarnett mannequin commented Mar 7, 2013

acdha mannequin commented Mar 8, 2013

ezio-melotti commented Mar 11, 2013

acdha mannequin commented Mar 11, 2013

mrabarnett mannequin commented Mar 11, 2013

serhiy-storchaka commented Mar 11, 2013

ezio-melotti commented Mar 12, 2013

serhiy-storchaka commented Sep 8, 2014

serhiy-storchaka commented Sep 17, 2014

serhiy-storchaka commented Sep 24, 2014

serhiy-storchaka commented Oct 8, 2014

serhiy-storchaka commented Oct 9, 2014

serhiy-storchaka commented Oct 24, 2014

python-dev mannequin commented Oct 31, 2014

python-dev mannequin commented Oct 31, 2014

serhiy-storchaka commented Oct 31, 2014