Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IGNORECASE breaks unicode literal range matching #61583

Closed
acdha mannequin opened this issue Mar 7, 2013 · 17 comments
Closed

IGNORECASE breaks unicode literal range matching #61583

acdha mannequin opened this issue Mar 7, 2013 · 17 comments
Assignees
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@acdha
Copy link
Mannequin

acdha mannequin commented Mar 7, 2013

BPO 17381
Nosy @ezio-melotti, @serhiy-storchaka
Dependencies
  • bpo-22584: Get rid of SRE character tables
  • Files
  • re_ignore_case_range.patch
  • re_ignore_case_range-3.5.patch
  • re_ignore_case_range-3.4_2.patch
  • re_ignore_case_range-3.5_2.patch
  • re_ignore_case_range-3.5_3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2014-10-31.16:12:16.319>
    created_at = <Date 2013-03-07.20:52:32.442>
    labels = ['expert-regex', 'type-bug']
    title = 'IGNORECASE breaks unicode literal range matching'
    updated_at = <Date 2014-10-31.16:12:16.318>
    user = 'https://bugs.python.org/acdha'

    bugs.python.org fields:

    activity = <Date 2014-10-31.16:12:16.318>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2014-10-31.16:12:16.319>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2013-03-07.20:52:32.442>
    creator = 'acdha'
    dependencies = ['22584']
    files = ['36575', '36636', '36712', '36839', '36842']
    hgrepos = []
    issue_num = 17381
    keywords = ['patch', 'needs review']
    message_count = 17.0
    messages = ['183705', '183712', '183753', '183988', '183989', '183992', '183993', '184016', '226608', '226989', '227485', '228814', '228837', '229919', '230332', '230336', '230350']
    nosy_count = 5.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka', 'acdha']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue17381'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @acdha
    Copy link
    Mannequin Author

    acdha mannequin commented Mar 7, 2013

    I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text:

    >>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
    'Архангельская губерния'
    >>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
    ''

    The same is true in Python 2.7, although you need to use ur'' patterns for the literals to be expanded:

    >>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|regex.UNICODE)
    u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'

    In contrast, the regex module behaves as expected:

    >>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
    u''

    (Transcript maintained at https://gist.github.com/acdha/5111687)

    @acdha acdha mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Mar 7, 2013
    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Mar 7, 2013

    The way the re handles ranges is to convert the two endpoints to lowercase and then check whether the lowercase form of the character in the text is in that range.

    For example, [A-Z] is converted to the range [\x41-\x5A], and the lowercase form of 'Q' ('\x51') is 'q' ('\x7A'), which is in the range.

    In your example, [\u0400-\u0527] is converted to the range [\u0450-\u0527], but the lowercase form of 'А' ('\u0410') is 'а' ('\u0430'), which isn't in the range.

    This is the same as issue bpo-3511, but a worse failure.

    @acdha
    Copy link
    Mannequin Author

    acdha mannequin commented Mar 8, 2013

    Ah, that explains it - I'd been hoping based on the re.DEBUG output that the explicit unicode ranges were preserved.

    I found bpo-3511 before opening this one but don't believe the decision should be the same since this isn't a mixed numeric/alphabetic range.

    @ezio-melotti
    Copy link
    Member

    Matthew, should this be closed then?

    @acdha
    Copy link
    Mannequin Author

    acdha mannequin commented Mar 11, 2013

    Ezio: given the non-obvious failure, what do you think of at least documenting this and issuing a warning any time both re.UNICODE and re.IGNORECASE are set?

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Mar 11, 2013

    In issue bpo-3511 the range was slightly unusual, so closing it seemed a reasonable approach, but the range in this issue is less clearly a problem. My preference would be to fix it, if possible.

    @serhiy-storchaka
    Copy link
    Member

    I'm working on the patch.

    @ezio-melotti
    Copy link
    Member

    Is this the same issue described in bpo-12728?

    @serhiy-storchaka
    Copy link
    Member

    No, bpo-12728 is more complicate case.

    Here is a patch which fixes this issue and bpo-3511.

    @serhiy-storchaka serhiy-storchaka self-assigned this Sep 8, 2014
    @serhiy-storchaka
    Copy link
    Member

    This patch has a disadvantage - it slows down case-insensitive compiling of some very wide ranges, e.g. compile(r"[\x00-\U0010ffff]+", re.I) (this is worst case). In most cases this is not important, because such wide ranges are rare enough and compiled patterns are cached.

    To get rid of this regression, we need new opcode. Due to preserving binary compatibility, this approach can't be applied to old releases. Here is a patch for 3.5.

    Please make a review. This patches are needed to continue fixing of other re bugs.

    @serhiy-storchaka
    Copy link
    Member

    Here is other patch for 3.4. It is more than 10 times faster than initial patch in worst case.

    @serhiy-storchaka
    Copy link
    Member

    Actually 3.5 patch can be simpler.

    @serhiy-storchaka
    Copy link
    Member

    Updated patch for 3.5 addresses Antoine's comments.

    Note that 3.4 and 3.5 use different solutions of this issue.

    @serhiy-storchaka
    Copy link
    Member

    Does the patch look good now for you Antoine? If there are no objections I'm going to commit it soon.

    In order to apply 3.4 patch to 2.7 we need either significant modify the patch, or first backport bpo-19329 changes to 2.7 (it would be easier).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 31, 2014

    New changeset 6f52a3d0f548 by Serhiy Storchaka in branch 'default':
    Issue bpo-17381: Fixed handling of case-insensitive ranges in regular expressions.
    https://hg.python.org/cpython/rev/6f52a3d0f548

    New changeset 7981cb1556cf by Serhiy Storchaka in branch '3.4':
    Issue bpo-17381: Fixed handling of case-insensitive ranges in regular expressions.
    https://hg.python.org/cpython/rev/7981cb1556cf

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 31, 2014

    New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
    Backported the optimization of compiling charsets in regular expressions
    https://hg.python.org/cpython/rev/ebd48b4f650d

    New changeset 6cd4b9827755 by Serhiy Storchaka in branch '2.7':
    Issue bpo-17381: Fixed ranges handling in case-insensitive regular expressions.
    https://hg.python.org/cpython/rev/6cd4b9827755

    @serhiy-storchaka
    Copy link
    Member

    Thank you Antoine for your review.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-regex type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants