classification
Title: IGNORECASE breaks unicode literal range matching
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 3.5, Python 3.4, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: 22584 Superseder:
Assigned To: serhiy.storchaka Nosy List: acdha, ezio.melotti, mrabarnett, python-dev, serhiy.storchaka
Priority: normal Keywords: needs review, patch

Created on 2013-03-07 20:52 by acdha, last changed 2014-10-31 16:12 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
re_ignore_case_range.patch serhiy.storchaka, 2014-09-08 19:58 review
re_ignore_case_range-3.5.patch serhiy.storchaka, 2014-09-17 08:57 review
re_ignore_case_range-3.4_2.patch serhiy.storchaka, 2014-09-24 19:17 review
re_ignore_case_range-3.5_2.patch serhiy.storchaka, 2014-10-08 19:41 review
re_ignore_case_range-3.5_3.patch serhiy.storchaka, 2014-10-09 07:50 review
Messages (17)
msg183705 - (view) Author: Chris Adams (acdha) Date: 2013-03-07 20:52
I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text:

>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''

The same is true in Python 2.7, although you need to use ur'' patterns for the literals to be expanded:

>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|re.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'


In contrast, the regex module behaves as expected:

>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''

(Transcript maintained at https://gist.github.com/acdha/5111687)
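
On an interpreter that includes the fix for this issue, the case-insensitive and case-sensitive substitutions should agree. A minimal regression check, assuming Python 3 and the sample string from the report (the helper name `strip_cyrillic` is made up for illustration):

```python
import re

# Same character class as in the report: whitespace plus U+0400..U+0527.
PATTERN = r'[\s\u0400-\u0527]+'
SAMPLE = 'Архангельская губерния'

def strip_cyrillic(text, flags=0):
    """Remove whitespace and characters in U+0400..U+0527 (hypothetical helper)."""
    return re.sub(PATTERN, '', text, flags=flags)

plain = strip_cyrillic(SAMPLE)
ignorecase = strip_cyrillic(SAMPLE, re.IGNORECASE)

# On an affected interpreter `ignorecase` would still contain the Cyrillic letters.
print(repr(plain), repr(ignorecase))
```

Anything left over in `ignorecase` but not in `plain` points at the range-handling bug described here.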
msg183712 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2013-03-07 23:19
The way the re handles ranges is to convert the two endpoints to lowercase and then check whether the lowercase form of the character in the text is in that range.

For example, [A-Z] is converted to the range [\x61-\x7A], and the lowercase form of 'Q' ('\x51') is 'q' ('\x71'), which is in the range.

In your example, [\u0400-\u0527] is converted to the range [\u0450-\u0527], but the lowercase form of 'А' ('\u0410') is 'а' ('\u0430'), which isn't in the range.

This is the same as issue #3511, but a worse failure.
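
The check described above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not the actual SRE code: the function name `old_range_match` is hypothetical, and it uses `str.lower()` as a stand-in for the engine's own lowercase mapping, which is an assumption.

```python
def old_range_match(lo, hi, ch):
    """Model of the pre-fix check: lowercase both endpoints and the candidate,
    then compare code points. All arguments are single characters.

    Assumption: str.lower() stands in for the engine's lowercase mapping; it is
    only valid here for characters whose lowercase form is a single code point.
    """
    return ord(lo.lower()) <= ord(ch.lower()) <= ord(hi.lower())

# [A-Z] becomes [a-z]; 'Q' lowercases to 'q', which is inside the range.
print(old_range_match('A', 'Z', 'Q'))

# [\u0400-\u0527] becomes [\u0450-\u0527]; 'А' (U+0410) lowercases to
# 'а' (U+0430), which falls below the lowered start of the range.
print(old_range_match('\u0400', '\u0527', '\u0410'))
```

The second call shows why the failure is not obvious: the uppercase letter is inside the range as written, but lowercasing the endpoints moves the start of the range past the lowercase form of the letter.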
msg183753 - (view) Author: Chris Adams (acdha) Date: 2013-03-08 18:22
Ah, that explains it - I'd been hoping based on the re.DEBUG output that the explicit unicode ranges were preserved.

I found #3511 before opening this one but don't believe the decision should be the same since this isn't a mixed numeric/alphabetic range.
msg183988 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-11 19:50
Matthew, should this be closed then?
msg183989 - (view) Author: Chris Adams (acdha) Date: 2013-03-11 19:59
Ezio: given the non-obvious failure, what do you think of at least documenting this and issuing a warning any time both re.UNICODE and re.IGNORECASE are set?
msg183992 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2013-03-11 21:00
In issue #3511 the range was slightly unusual, so closing it seemed a reasonable approach, but the range in this issue is less clearly a problem. My preference would be to fix it, if possible.
msg183993 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-03-11 21:24
I'm working on the patch.
msg184016 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-03-12 08:11
Is this the same issue described in #12728?
msg226608 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-08 19:58
No, issue12728 is a more complicated case.

Here is a patch which fixes this issue and issue3511.
msg226989 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-17 08:57
This patch has a disadvantage: it slows down compilation of case-insensitive patterns that contain very wide ranges, e.g. compile(r"[\x00-\U0010ffff]+", re.I) (this is the worst case). In most cases this is not important, because such wide ranges are rare and compiled patterns are cached.

To get rid of this regression, we need a new opcode. Because binary compatibility must be preserved, this approach can't be applied to old releases. Here is a patch for 3.5.

Please review. These patches are needed to continue fixing other re bugs.
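
To see how much compile time a wide case-insensitive range costs on a given interpreter, something like the following can be used. It is a rough sketch, not a benchmark from this issue: the helper name `compile_seconds` and the narrow comparison pattern are assumptions, and the numbers will vary by Python version and machine.

```python
import re
import time

WIDE = r"[\x00-\U0010ffff]+"   # the worst case named above
NARROW = r"[a-z]+"             # small range, for comparison only

def compile_seconds(pattern, flags):
    """Time a single uncached compile of `pattern` (hypothetical helper)."""
    re.purge()                  # clear the pattern cache so compile really runs
    start = time.perf_counter()
    re.compile(pattern, flags)
    return time.perf_counter() - start

for name, pattern in (("wide", WIDE), ("narrow", NARROW)):
    print(name, compile_seconds(pattern, re.IGNORECASE))
```

Because compiled patterns are cached, the cost is paid once per pattern unless the cache is purged or overflows.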
msg227485 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-09-24 19:17
Here is another patch for 3.4. It is more than 10 times faster than the initial patch in the worst case.
msg228814 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-08 19:41
Actually, the 3.5 patch can be simpler.
msg228837 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-09 07:50
The updated patch for 3.5 addresses Antoine's comments.

Note that 3.4 and 3.5 use different solutions to this issue.
msg229919 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-24 12:28
Does the patch look good to you now, Antoine? If there are no objections I'm going to commit it soon.

In order to apply the 3.4 patch to 2.7 we need either to modify the patch significantly, or to first backport the issue19329 changes to 2.7 (which would be easier).
msg230332 - (view) Author: Roundup Robot (python-dev) Date: 2014-10-31 10:42
New changeset 6f52a3d0f548 by Serhiy Storchaka in branch 'default':
Issue #17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/6f52a3d0f548

New changeset 7981cb1556cf by Serhiy Storchaka in branch '3.4':
Issue #17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/7981cb1556cf
msg230336 - (view) Author: Roundup Robot (python-dev) Date: 2014-10-31 11:55
New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d

New changeset 6cd4b9827755 by Serhiy Storchaka in branch '2.7':
Issue #17381: Fixed ranges handling in case-insensitive regular expressions.
https://hg.python.org/cpython/rev/6cd4b9827755
msg230350 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-10-31 16:12
Thank you Antoine for your review.
History
Date User Action Args
2014-11-08 12:14:26 serhiy.storchaka link issue3511 superseder
2014-10-31 16:12:16 serhiy.storchaka set status: open -> closed; resolution: fixed; messages: + msg230350; stage: patch review -> resolved
2014-10-31 11:55:20 python-dev set messages: + msg230336
2014-10-31 10:42:54 python-dev set nosy: + python-dev; messages: + msg230332
2014-10-24 12:28:34 serhiy.storchaka set messages: + msg229919
2014-10-09 07:50:41 serhiy.storchaka set files: + re_ignore_case_range-3.5_3.patch; dependencies: + Get rid of SRE character tables; messages: + msg228837
2014-10-08 19:41:57 serhiy.storchaka set files: + re_ignore_case_range-3.5_2.patch; messages: + msg228814
2014-09-24 19:17:11 serhiy.storchaka set files: + re_ignore_case_range-3.4_2.patch; messages: + msg227485
2014-09-21 20:45:06 serhiy.storchaka link issue12728 dependencies
2014-09-17 08:57:13 serhiy.storchaka set keywords: + needs review; files: + re_ignore_case_range-3.5.patch; messages: + msg226989
2014-09-08 19:58:14 serhiy.storchaka set files: + re_ignore_case_range.patch; versions: + Python 3.4, Python 3.5, - Python 3.3; messages: + msg226608; assignee: serhiy.storchaka; keywords: + patch; stage: patch review
2013-03-12 08:11:31 ezio.melotti set messages: + msg184016
2013-03-11 21:24:54 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg183993
2013-03-11 21:00:24 mrabarnett set messages: + msg183992
2013-03-11 19:59:42 acdha set messages: + msg183989
2013-03-11 19:50:21 ezio.melotti set messages: + msg183988
2013-03-08 18:22:51 acdha set messages: + msg183753
2013-03-07 23:19:56 mrabarnett set messages: + msg183712
2013-03-07 20:52:32 acdha create