Python re lib fails case insensitive matches on Unicode data #56937

Closed
tchrist mannequin opened this issue Aug 11, 2011 · 9 comments
Assignees
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@tchrist
Mannequin

tchrist mannequin commented Aug 11, 2011

BPO 12728
Nosy @malemburg, @gvanrossum, @loewis, @terryjreedy, @pitrou, @ezio-melotti, @serhiy-storchaka
Dependencies
  • bpo-17381: IGNORECASE breaks unicode literal range matching
Files
  • sigmata.python: Test case proving Python lib re is erroneously using casemapping when it is supposed to use casefolding
  • re_ignore_case_2.patch
  • re_cases.py
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2014-11-10.10:52:04.273>
    created_at = <Date 2011-08-11.18:48:20.814>
    labels = ['expert-regex', 'type-bug']
    title = 'Python re lib fails case insensitive matches on Unicode data'
    updated_at = <Date 2014-11-10.10:52:04.272>
    user = 'https://bugs.python.org/tchrist'

    bugs.python.org fields:

    activity = <Date 2014-11-10.10:52:04.272>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2014-11-10.10:52:04.273>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2011-08-11.18:48:20.814>
    creator = 'tchrist'
    dependencies = ['17381']
    files = ['22879', '37086', '37146']
    hgrepos = []
    issue_num = 12728
    keywords = ['patch']
    message_count = 9.0
    messages = ['141916', '141987', '141988', '143034', '227236', '230349', '230830', '230951', '230952']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'gvanrossum', 'loewis', 'terry.reedy', 'pitrou', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'python-dev', 'tchrist', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12728'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @tchrist
    Mannequin Author

    tchrist mannequin commented Aug 11, 2011

    The Python re library is broken in its approach to case-insensitive matches. It erroneously attempts to compare lowercase mappings. This is wrong. You must compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get wrong answers. I include a small test case that illustrates this bug. The bug exists on both 2.7 and 3.2, and on both wide builds and narrow builds. For comparison, I also show results using Matthew Barnett's regex library, which gets all 5 tests correct where re gets all 5 tests wrong.

    A sample run is:

    FAIL: re pattern Ι is not the same as string ͅ
    PASS: regex pattern Ι is indeed the same as string ͅ
    FAIL: re pattern Μ is not the same as string µ
    PASS: regex pattern Μ is indeed the same as string µ
    FAIL: re pattern ſ is not the same as string s
    PASS: regex pattern ſ is indeed the same as string s
    FAIL: re pattern ΣΤΙΓΜΑΣ is not the same as string στιγμας
    PASS: regex pattern ΣΤΙΓΜΑΣ is indeed the same as string στιγμας
    FAIL: re pattern POST is not the same as string poſt
    PASS: regex pattern POST is indeed the same as string poſt

    re lib passed 0 of 5 tests
    regex lib passed 5 of 5 tests
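
The difference between the two comparisons can be reproduced without either regex library. The following is a minimal sketch, not the attached sigmata.python test case: it assumes Python 3.3 or later (where `str.casefold()` exists), and it assumes the single-character pairs in the sample run are U+0399/U+0345, U+039C/U+00B5, and U+017F/s. The helper names are made up for illustration.

```python
# The five pattern/string pairs from the sample run. The single-character
# pairs are written as escapes; the code points are an assumption based on
# how the characters appear in the report.
PAIRS = [
    ("\u0399", "\u0345"),   # GREEK CAPITAL LETTER IOTA vs COMBINING GREEK YPOGEGRAMMENI
    ("\u039c", "\u00b5"),   # GREEK CAPITAL LETTER MU vs MICRO SIGN
    ("\u017f", "s"),        # LATIN SMALL LETTER LONG S vs LATIN SMALL LETTER S
    ("ΣΤΙΓΜΑΣ", "στιγμας"),  # the lowercase word ends in a final sigma
    ("POST", "po\u017ft"),  # the string contains a long s
]


def same_by_lower(a, b):
    """Compare lowercase mappings (the approach the report calls wrong)."""
    return a.lower() == b.lower()


def same_by_casefold(a, b):
    """Compare Unicode case foldings (the approach the report asks for)."""
    return a.casefold() == b.casefold()


for pattern, string in PAIRS:
    print(ascii(pattern), ascii(string),
          "lower:", same_by_lower(pattern, string),
          "casefold:", same_by_casefold(pattern, string))
```

Every pair compares equal under case folding. Comparing lowercase mappings fails for at least the micro sign and the two long s pairs.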

    @tchrist tchrist mannequin added stdlib Python modules in the Lib dir topic-regex type-bug An unexpected behavior, bug, or error and removed stdlib Python modules in the Lib dir labels Aug 11, 2011
    @terryjreedy
    Member

    I am not sure that everyone will agree that this is a bug, rather than a feature request, or that if a bug, that it should be changed in existing releases and possibly break running code. The doc just says, somewhat vaguely, that IGNORECASE "works for Unicode characters as expected". I have added others as nosy for their opinions.

    The test file should have omitted the gratuitous and distracting warnings, especially the one that effectively scolds Windows users for running Windows. With those omitted, the test cases given would form the basis for an added TestCase.

    @tchrist
    Mannequin Author

    tchrist mannequin commented Aug 12, 2011

    Terry J. Reedy <tjreedy@udel.edu> added the comment:

    I am not sure that everyone will agree that this is a bug, rather than a feature request, or that if a bug, that it should be changed in existing releases and possibly break running code. The doc just says, somewhat vaguely, that IGNORECASE "works for Unicode characters as expected". I have added others as nosy for their opinions.

    Working as expected for Unicode characters means it must follow Unicode's
    rules for casefolding. Otherwise you don't have Unicode at all; you just
    have ISO 10646. Unicode is not merely a larger character repertoire; again,
    that is merely ISO 10646. Unicode is all about the rules for processing this
    larger repertoire. This is a very common mistake, so common that it is in the
    Unicode FAQ:

    Q: What is the relation between ISO/IEC 10646 and Unicode?
    
    A: In 1991, the ISO Working Group responsible for ISO/IEC 10646 (JTC
       1/SC 2/WG 2) and the Unicode Consortium decided to create one
       universal standard for coding multilingual text. Since then, the
       ISO 10646 Working Group (SC 2/WG 2) and the Unicode Consortium
       have worked together very closely to extend the standard and to
       keep their respective versions synchronized. [EH]
    
    Q: So are they the same thing?
    
    A: No. Although the character codes and encoding forms are
       synchronized between Unicode and ISO/IEC 10646, the Unicode
       Standard imposes additional constraints on implementations to
       ensure that they treat characters uniformly across platforms and
       applications. To this end, it supplies an extensive set of
       functional character specifications, character data, algorithms
       and substantial background material that is *not* in ISO/IEC 10646.
    
    http://unicode.org/faq/unicode_iso.html
    

    Part of those functional character specifications can be found in the three
    casefolding fields of the file UnicodeData.txt and also in two auxiliary
    files of the Unicode distribution, CaseFolding.txt and SpecialCasing.txt.
    The Unicode Character Database is not optional. If you do not use it, you
    do not have Unicode; instead you merely have ISO 10646, which is of zero
    practical use to anyone compared with Unicode. I'm sure that Python would
    not want to be stuck having something of no use to anyone when everyone
    else actually supports Unicode.
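
As a rough illustration of what the full foldings in CaseFolding.txt imply, the sketch below compares strings by their case foldings using `str.casefold()`, which assumes Python 3.3 or later. The function name `caseless_equal` is hypothetical. It shows that a folding can change the length of a string, which a one-to-one lowercase comparison cannot express.

```python
import unicodedata


def caseless_equal(a, b):
    """Caseless comparison by full case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


# Version of the Unicode Character Database bundled with this interpreter.
print("UCD version:", unicodedata.unidata_version)

# U+00DF LATIN SMALL LETTER SHARP S folds to "ss".
print(caseless_equal("Stra\u00dfe", "STRASSE"))

# U+FB01 LATIN SMALL LIGATURE FI folds to "fi".
print(caseless_equal("\ufb01sh", "FISH"))

# Lowercasing alone leaves the sharp s in place, so the strings differ.
print("Stra\u00dfe".lower() == "STRASSE".lower())
```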

    One is not allowed to make up one's own rules that run counter to Unicode's
    and still make the claim that one is working on Unicode, since that is in
    fact not what one is doing. Based on all that, Python does not do case
    insensitive matching on Unicode, a condition contrary to its documented
    claims. That clearly makes it a bug that needs fixing rather than a
    feature request to be summarily ignored.

    The test file should have omitted the gratuitous and distracting warnings, especially the one that effectively scolds Windows users for running Windows. With those omitted, the test cases given would form the basis for an added TestCase.

    I have absolutely no idea what on earth you could possibly be referring to.
    Honestly. I ran my tests on both releases (2.7 and 3.2), on both builds
    (wide and narrow), and on both platforms (Unix and Mac). The warnings are
    in there so I can make sure I have everything set up correctly to run the
    tests, and will understand why I get more failures than expected in the event
    that things are not set up appropriately.

    Let me make perfectly clear that I have never in my life come anywhere near a
    Microsoft system, let alone touched one, and that I furthermore never shall.
    I have not the foggiest notion what in the world you are complaining about.
    If the problem is that you are for some reason unable to create a Python with
    full Unicode support under Microsoft, that is hardly my fault. Render unto
    Caesar that which is Caesar's: complain to Microsoft about Microsoft's bugs,
    not to me, as I am wholly blameless of their problems.

    If you don't like my test cases, you know where to find vi.

    I suppose I could always send you the program that writes these programs
    for me, but as I knew you wouldn't like it, I withheld it. You already have
    all that you need to see exactly where the bugs are and how to fix them.

    --tom

    @gvanrossum
    Member

    This bug could do with a little less attitude. That said, I think it is a bug and should be fixed, at the very least for Python 3.3. As always, it is a matter of much debate to what extent bugs can be fixed in previous Python versions (specifically, 2.7 and 3.2) without breaking more code than it fixes, and I don't want to jump the gun on that issue. Let's first see what it takes to fix this for 3.3.

    @serhiy-storchaka
    Member

    Here is a preliminary patch that fixes case-insensitive regular expression matching of unicode strings. It is incomplete: it also needs the patches from bpo-17381, which fix other aspects of case-insensitive matching.

    One bug remains for the Turkish letters, where matching is not transitive. Three pairs of letters should match: ı ~ I, I ~ i, and i ~ İ. All other combinations should not match (ı !~ i, I !~ İ, ı !~ İ). This patch doesn't fix this bug.
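
To make the non-transitivity concrete, here is a small sketch that encodes only the three pairs named in this comment. It does not call the re module and does not reflect how the patch is implemented; all names are illustrative. It shows that grouping the letters into equivalence classes (following chains of matching pairs) would wrongly put ı and İ together.

```python
# The three pairs that should match, per the comment above.
I_PAIRS = {
    frozenset(("\u0131", "I")),  # LATIN SMALL LETTER DOTLESS I ~ I
    frozenset(("I", "i")),       # I ~ i
    frozenset(("i", "\u0130")),  # i ~ LATIN CAPITAL LETTER I WITH DOT ABOVE
}
LETTERS = ["\u0131", "I", "i", "\u0130"]


def should_match(a, b):
    """Pairwise relation: identical letters or one of the three listed pairs."""
    return a == b or frozenset((a, b)) in I_PAIRS


def closure_class(start):
    """All letters reachable from `start` through chains of matching pairs."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for other in LETTERS:
            if other not in seen and should_match(current, other):
                seen.add(other)
                stack.append(other)
    return seen


# Pairwise, dotless i and dotted capital I must not match ...
print(should_match("\u0131", "\u0130"))
# ... but following chains of pairs reaches all four letters.
print(sorted(ascii(c) for c in closure_class("\u0131")))
```

Any scheme that reduces each character to a single canonical form before comparing behaves like the closure, so it cannot express these three pairs without also matching the combinations that should fail.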

    @serhiy-storchaka serhiy-storchaka self-assigned this Sep 21, 2014
    @serhiy-storchaka
    Member

    Here are the complete patch and the script used to generate the equivalence table.

    @serhiy-storchaka
    Member

    Could anyone please review this?

    The script has been updated so that it is now compatible with 2.7. There are some differences in the equivalence table between 2.7 and 3.4 (e.g. 'ΐ' (U+0390) is not equivalent to 'ΐ' (U+1FD3) in 2.7).

    @python-dev
    Mannequin

    python-dev mannequin commented Nov 10, 2014

    New changeset 4caa695af94c by Serhiy Storchaka in branch '2.7':
    Issue bpo-12728: Different Unicode characters having the same uppercase but
    https://hg.python.org/cpython/rev/4caa695af94c

    New changeset 47b3084dd6aa by Serhiy Storchaka in branch '3.4':
    Issue bpo-12728: Different Unicode characters having the same uppercase but
    https://hg.python.org/cpython/rev/47b3084dd6aa

    New changeset 09ec09cfe539 by Serhiy Storchaka in branch 'default':
    Issue bpo-12728: Different Unicode characters having the same uppercase but
    https://hg.python.org/cpython/rev/09ec09cfe539

    @serhiy-storchaka
    Member

    This solution (with a hardcoded table of equivalent lowercases) is temporary. In the future the re engine will be changed to support correct caseless matching of different lowercase forms internally.
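
For readers curious what a table of "equivalent lowercases" might contain, the sketch below groups lowercase characters that share the same uppercase. It is not the attached re_cases.py script and not the table shipped in the patch: it is an independent approximation that uses Python 3's `str.upper()` and `str.lower()` (full case mappings), and the function name is hypothetical. Results depend on the Unicode version bundled with the interpreter.

```python
import sys
from collections import defaultdict


def lowercase_equivalences():
    """Group characters that are their own lowercase but share an uppercase."""
    groups = defaultdict(list)
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        upper = ch.upper()
        # Keep characters that are already lowercase and have a distinct uppercase.
        if ch.lower() == ch and upper != ch:
            groups[upper].append(ch)
    # Only uppercase forms shared by more than one lowercase character matter.
    return {up: chars for up, chars in groups.items() if len(chars) > 1}


table = lowercase_equivalences()
print(len(table), "uppercase forms are shared by several lowercase characters")
print("S:", [hex(ord(c)) for c in table["S"]])
print("I:", [hex(ord(c)) for c in table["I"]])
```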

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022