IGNORECASE breaks unicode literal range matching #61583
I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text:
>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''
The same is true in Python 2.7, although you need to use ur'' patterns for the literals to be expanded:
>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|regex.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'
In contrast, the regex module behaves as expected:
>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''
(Transcript maintained at https://gist.github.com/acdha/5111687) |
The way the re module handles ranges is to convert the two endpoints to lowercase and then check whether the lowercase form of the character in the text is in that range. For example, [A-Z] is converted to the range [\x61-\x7A], and the lowercase form of 'Q' ('\x51') is 'q' ('\x71'), which is in the range. In your example, [\u0400-\u0527] is converted to the range [\u0450-\u0527], but the lowercase form of 'А' ('\u0410') is 'а' ('\u0430'), which isn't in the range. This is the same as issue bpo-3511, but a worse failure. |
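The range check described in this comment can be sketched in a few lines. This is only an illustration of the described behaviour, not the actual re implementation: old_range_match is a hypothetical helper, and str.lower() is assumed to stand in for the engine's internal lowercasing, which agrees with it for the characters used here.

```python
def old_range_match(lo: str, hi: str, ch: str) -> bool:
    """Simulate the pre-fix IGNORECASE range check described above.

    Hypothetical helper for illustration; not part of the re module.
    """
    # Both endpoints of the range are lowercased...
    lo_l, hi_l = lo.lower(), hi.lower()
    # ...and the lowercased text character is tested against that range.
    return lo_l <= ch.lower() <= hi_l


# [A-Z] becomes [a-z]; 'Q' lowercases to 'q', which is inside it.
print(old_range_match('A', 'Z', 'Q'))                  # True
# [\u0400-\u0527] becomes [\u0450-\u0527]; 'А' (\u0410) lowercases
# to 'а' (\u0430), which falls below the new lower endpoint.
print(old_range_match('\u0400', '\u0527', '\u0410'))   # False
```

The second call shows why the range in the report stops matching: lowercasing moves the lower endpoint past the lowercase letters that the original range contained.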
Ah, that explains it - I'd been hoping based on the re.DEBUG output that the explicit unicode ranges were preserved. I found bpo-3511 before opening this one but don't believe the decision should be the same since this isn't a mixed numeric/alphabetic range. |
Matthew, should this be closed then? |
Ezio: given the non-obvious failure, what do you think of at least documenting this and issuing a warning any time both re.UNICODE and re.IGNORECASE are set? |
In issue bpo-3511 the range was slightly unusual, so closing it seemed a reasonable approach, but the range in this issue is less clearly a problem. My preference would be to fix it, if possible. |
I'm working on the patch. |
Is this the same issue described in bpo-12728? |
This patch has a disadvantage: it slows down case-insensitive compilation of some very wide ranges, e.g. compile(r"[\x00-\U0010ffff]+", re.I) (this is the worst case). In most cases this is not important, because such wide ranges are rare and compiled patterns are cached. To get rid of this regression, we need a new opcode. To preserve binary compatibility, this approach can't be applied to old releases. Here is a patch for 3.5. Please review it. These patches are needed to continue fixing other re bugs. |
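The worst case mentioned here can be timed with a rough sketch like the following. It assumes Python 3; compile_time is a hypothetical helper, and the numbers depend entirely on the interpreter version and machine, so no expected output is given.

```python
import re
import time


def compile_time(pattern: str, flags: int = 0) -> float:
    """Return the seconds taken to compile `pattern` with an empty cache.

    Hypothetical helper for illustration only.
    """
    re.purge()  # clear the pattern cache so compilation really happens
    start = time.perf_counter()
    re.compile(pattern, flags)
    return time.perf_counter() - start


# The widest possible range, as in the comment above.
WIDE = r"[\x00-\U0010ffff]+"

plain = compile_time(WIDE)
icase = compile_time(WIDE, re.I)
print(f"plain: {plain:.6f}s, IGNORECASE: {icase:.6f}s")
```

Calling re.purge() before each measurement matters because, as the comment notes, compiled patterns are cached, and a cached pattern would hide the compilation cost.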
Here is another patch for 3.4. It is more than 10 times faster than the initial patch in the worst case. |
Actually, the 3.5 patch can be simpler. |
The updated patch for 3.5 addresses Antoine's comments. Note that 3.4 and 3.5 use different solutions to this issue. |
Does the patch look good to you now, Antoine? If there are no objections, I'm going to commit it soon. In order to apply the 3.4 patch to 2.7, we need to either significantly modify the patch or first backport the bpo-19329 changes to 2.7 (which would be easier). |
New changeset 6f52a3d0f548 by Serhiy Storchaka in branch 'default': New changeset 7981cb1556cf by Serhiy Storchaka in branch '3.4': |
New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7': New changeset 6cd4b9827755 by Serhiy Storchaka in branch '2.7': |
Thank you Antoine for your review. |
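For anyone wanting to confirm the behaviour from the original report on their own interpreter, a minimal check might look like this. It assumes a Python 3 release that includes the changesets above; on an unfixed interpreter, the first result would still contain the Cyrillic letters.

```python
import re

TEXT = 'Архангельская губерния'
PATTERN = r'[\s\u0400-\u0527]+'

# With the fix, IGNORECASE no longer changes which characters the range covers.
with_flag = re.sub(PATTERN, '', TEXT, flags=re.IGNORECASE)
without_flag = re.sub(PATTERN, '', TEXT)

print(repr(with_flag), repr(without_flag))
```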