re.IGNORECASE not Unicode-ready #47083

Closed
svensiegmund mannequin opened this issue May 12, 2008 · 24 comments
Assignees: pitrou
Labels: topic-regex, type-bug (An unexpected behavior, bug, or error)

Comments

@svensiegmund
Mannequin

svensiegmund mannequin commented May 12, 2008

BPO 2834
Nosy @gvanrossum, @warsaw, @amauryfa, @pitrou, @mark-summerfield, @humitos, @ezio-melotti
Files
  • reunicode.patch
  • reunicode2.patch
  • reunicode3.patch
  • reunicode4.patch
  • reunicode5.patch
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/pitrou'
    closed_at = <Date 2008-08-19.17:59:29.265>
    created_at = <Date 2008-05-12.08:44:02.715>
    labels = ['expert-regex', 'type-bug']
    title = 're.IGNORECASE not Unicode-ready'
    updated_at = <Date 2009-02-13.14:02:50.756>
    user = 'https://bugs.python.org/svensiegmund'

    bugs.python.org fields:

    activity = <Date 2009-02-13.14:02:50.756>
    actor = 'ezio.melotti'
    assignee = 'pitrou'
    closed = True
    closed_date = <Date 2008-08-19.17:59:29.265>
    closer = 'pitrou'
    components = ['Regular Expressions']
    creation = <Date 2008-05-12.08:44:02.715>
    creator = 'sven.siegmund'
    dependencies = []
    files = ['10768', '10777', '10778', '10819', '10998']
    hgrepos = []
    issue_num = 2834
    keywords = ['patch']
    message_count = 24.0
    messages = ['66715', '66727', '67622', '68901', '68905', '68920', '68922', '68932', '68966', '68967', '69298', '69301', '70354', '70370', '70371', '70780', '70787', '71186', '71413', '71414', '71455', '71516', '71517', '71519']
    nosy_count = 8.0
    nosy_names = ['gvanrossum', 'barry', 'amaury.forgeotdarc', 'pitrou', 'mark', 'humitos', 'ezio.melotti', 'sven.siegmund']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2834'
    versions = ['Python 3.0']

    @svensiegmund
    Mannequin Author

    svensiegmund mannequin commented May 12, 2008

    re cannot ignore case of special latin characters:

    Python 3.0a5 (py3k:62932M, May  9 2008, 16:23:11) [MSC v.1500 32 bit 
    (Intel)] on win32
    >>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
    True
    >>> import re
    >>> rx = re.compile('Á', re.IGNORECASE)
    >>> rx.match('á') # should match but won't
    >>> rx.match('Á') # will match
    <_sre.SRE_Match object at 0x014B08A8>
    >>> rx = re.compile('á', re.IGNORECASE)
    >>> rx.match('Á') # should match but won't
    >>> rx.match('á') # will match
    <_sre.SRE_Match object at 0x014B08A8>

    @svensiegmund svensiegmund mannequin added the topic-regex and type-bug labels May 12, 2008
    @gvanrossum
    Member

    Try adding re.LOCALE to the flags. I'm not sure why that is needed but
    it seems to fix this issue.

    I still think this is a legitimate bug though.

    @humitos
    Mannequin

    humitos mannequin commented Jun 2, 2008

    I have the same error with the re.LOCALE flag...

    [humitos] [~]$ python3.0
    Python 3.0a5+ (py3k:63855, Jun  1 2008, 13:05:09)
    [GCC 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> rx = re.compile('á', re.LOCALE | re.IGNORECASE)
    >>> rx.match('Á')
    >>> rx.match('á')
    <_sre.SRE_Match object at 0x2b955e204d30>
    >>> rx = re.compile('Á', re.IGNORECASE | re.LOCALE)
    >>> rx.match('Á')
    <_sre.SRE_Match object at 0x2b955e204e00>
    >>> rx.match('á')
    >>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
    True
    >>>

    @pitrou
    Member

    pitrou commented Jun 28, 2008

    Same here, re.LOCALE doesn't circumvent the problem.

    @pitrou
    Member

    pitrou commented Jun 28, 2008

    Uh, actually, it works if you specify re.UNICODE. If you don't, the
    getlower() function in _sre.c falls back to the plain ASCII algorithm.

    >>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
    >>> pat.match('á')
    <_sre.SRE_Match object at 0xb7c66c28>
    >>> pat.match('Á')
    <_sre.SRE_Match object at 0xb7c66cd0>

    I wonder if re.UNICODE shouldn't be the default in Py3k, at least when
    the pattern is a string and not a bytes object. There may also be a
    re.ASCII flag for those cases where people want to fall back to the old
    behaviour.
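
A minimal sketch of the proposal above, assuming a Python 3 release that includes the fix from this issue (str patterns imply re.UNICODE, and re.ASCII opts back into ASCII-only matching). The helper name `matches_ignoring_case` is hypothetical and exists only for illustration.

```python
import re


def matches_ignoring_case(pattern, text, ascii_only=False):
    """Return True if `pattern` matches the start of `text`, ignoring case.

    With ascii_only=True, re.ASCII restricts case folding to ASCII letters.
    """
    flags = re.IGNORECASE
    if ascii_only:
        flags |= re.ASCII
    return re.match(pattern, text, flags) is not None


# '\u00c1' is 'Á' and '\u00e1' is 'á' (escapes avoid any encoding ambiguity).
print(matches_ignoring_case('\u00c1', '\u00e1'))                   # True
# Under re.ASCII the two accented letters are treated as different characters.
print(matches_ignoring_case('\u00c1', '\u00e1', ascii_only=True))  # False
# Plain ASCII letters still fold under re.ASCII.
print(matches_ignoring_case('A', 'a', ascii_only=True))            # True
```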

    @gvanrossum
    Member

    Sounds like re.UNICODE should be on by default when the pattern is a str
    instance.

    Also (per mailing list discussion) we should probably only allow
    matching bytes when the pattern is bytes, and matching str when the
    pattern is str.

    Finally, is there a use case of re.LOCALE any more? I'm thinking not.

    @pitrou
    Member

    pitrou commented Jun 28, 2008

    On Saturday, 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:

    Finally, is there a use case of re.LOCALE any more? I'm thinking not.

    It's used for locale-specific case matching in the non-unicode case. But
    it looks to me like a bad practice and we could probably remove it.

    'C'
    >>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
    >>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
    >>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
    'fr_FR.ISO-8859-1'
    >>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
    >>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
    <_sre.SRE_Match object at 0xb7b9ac28>

    @pitrou
    Member

    pitrou commented Jun 29, 2008

    Here is a preliminary patch which doesn't remove re.LOCALE, but raises
    a TypeError when the pattern and the matched object differ in type (str
    vs. bytes), raises a ValueError when re.UNICODE is specified with a bytes
    pattern, and implies re.UNICODE for unicode (str) patterns. The test suite
    runs fine after a few fixes.

    It also includes the patch for bpo-3231 ("re.compile fails with some bytes
    patterns").
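
The checks described in this patch can be sketched as follows, assuming a current Python 3 where the behaviour has landed. The helper `describe_error` is hypothetical and only reports which exception type, if any, a call raises.

```python
import re


def describe_error(func):
    """Call func() and return the name of the exception raised, or None."""
    try:
        func()
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


bytes_pat = re.compile(b'abc')
str_pat = re.compile('abc')

# Mixing pattern and subject types is rejected.
print(describe_error(lambda: bytes_pat.match('abc')))          # TypeError
print(describe_error(lambda: str_pat.match(b'abc')))           # TypeError
# re.UNICODE is rejected for a bytes pattern.
print(describe_error(lambda: re.compile(b'abc', re.UNICODE)))  # ValueError
# Matching types work as usual.
print(describe_error(lambda: bytes_pat.match(b'abc')))         # None
```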

    @pitrou
    Member

    pitrou commented Jun 29, 2008

    This new patch also introduces re.ASCII as discussed on the mailing list.

    @pitrou
    Member

    pitrou commented Jun 29, 2008

    Improved patch which also detects incompatibilities for "(?u)".
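
A small sketch of the "(?u)" incompatibility check, assuming a current Python 3. The exception type raised for the inline flag may differ between Python versions, so this hypothetical helper catches both ValueError and re.error rather than asserting one of them.

```python
import re


def compile_rejected(pattern):
    """Return True if re.compile() rejects `pattern`, False otherwise."""
    try:
        re.compile(pattern)
    except (ValueError, re.error):
        # Which of the two is raised is treated as version-dependent here.
        return True
    return False


# Inline (?u) inside a bytes pattern is an incompatible combination.
print(compile_rejected(b'(?u)abc'))
# The same inline flag is accepted in a str pattern.
print(compile_rejected('(?u)abc'))
```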

    @pitrou
    Member

    pitrou commented Jul 5, 2008

    This new patch adds re.ASCII in all sensitive places I could find in the
    stdlib (except lib2to3 which as far as I understand is maintained in a
    separate branch, and even has its own copy of tokenize.py...).

    Also, I didn't get an answer to the following question on the ML: should
    an inline flag "(?a)" be introduced to mirror the existing "(?u)", so
    as to set the ASCII flag from inside a pattern string?

    @pitrou
    Member

    pitrou commented Jul 5, 2008

    @pitrou pitrou self-assigned this Jul 24, 2008
    @pitrou
    Member

    pitrou commented Jul 28, 2008

    Final patch adding the (?a) inline flag (equivalent to re.ASCII). Please
    review:
    http://codereview.appspot.com/2439
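
As an illustration of the (?a) inline flag, here is a minimal sketch assuming a current Python 3. It compares the default Unicode-aware `\w` with the ASCII-only variant, set once inline and once through the `flags` argument; the sample word is arbitrary.

```python
import re

# '\u00e9' is 'é', a non-ASCII letter.
word = 'caf\u00e9'

# Default for str patterns: \w matches Unicode word characters.
unicode_match = re.match(r'\w+', word)
# Inline (?a): \w is limited to [a-zA-Z0-9_].
inline_match = re.match(r'(?a)\w+', word)
# Same effect using the flag argument instead of the inline form.
flag_match = re.match(r'\w+', word, re.ASCII)

print(unicode_match.group())  # café
print(inline_match.group())   # caf
print(flag_match.group())     # caf
```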

    @amauryfa
    Member

    Are all those re.ASCII flags mandatory, or are they here just for
    theoretical correctness?
    For example, the output of "gcc -dumpversion" is certainly plain ASCII.
    I don't mind that \d also matches some exotic digit - it just won't happen.
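
To make the "exotic digit" point concrete, a minimal sketch assuming a current Python 3: for str patterns `\d` matches any Unicode decimal digit unless re.ASCII is given. The helper `is_digit` and the sample character are chosen for illustration only.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE is a decimal digit but not ASCII.
arabic_three = '\u0663'


def is_digit(text, ascii_only=False):
    """Return True if `text` is exactly one character matched by \\d."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r'\d', text, flags) is not None


print(is_digit('3'))                            # True
print(is_digit(arabic_three))                   # True
print(is_digit(arabic_three, ascii_only=True))  # False
```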

    @pitrou
    Member

    pitrou commented Jul 28, 2008

    On Monday, 28 July 2008 at 20:41 +0000, Amaury Forgeot d'Arc wrote:

    Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

    Are all those re.ASCII flags mandatory, or are they here just for
    theoretical correctness?

    For theoretical correctness. I just don't want to analyze each case
    individually and I'm probably not competent for many of them.

    @pitrou
    Member

    pitrou commented Aug 6, 2008

    If nobody (except Amaury :-)) has anything to say about the current
    patch, should it be committed?

    @gvanrossum
    Member

    Let's make sure the release manager is OK with this.

    @pitrou
    Member

    pitrou commented Aug 15, 2008

    Barry?

    @warsaw
    Member

    warsaw commented Aug 19, 2008

    I haven't looked at the specific patch, but based on the description of
    the behavior, I'm +1 on committing this before beta 3. I'm fine with
    leaving the re.ASCII flags in there -- it will be a marker to indicate
    perhaps the code needs a closer examination (eventually).

    @warsaw
    Member

    warsaw commented Aug 19, 2008

    Make sure of course that the documentation is updated and a NEWS file
    entry is added.

    @pitrou
    Member

    pitrou commented Aug 19, 2008

    Fixed in r65860. Someone should check the docs though (at least try to
    generate them, and review my changes a bit since English isn't my mother
    tongue).

    @pitrou pitrou closed this as completed Aug 19, 2008
    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented Aug 20, 2008

    On 2008-08-19, Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Fixed in r65860. Someone should check the docs though (at least try to
    generate them, and review my changes a bit since English isn't my mother
    tongue).

    I've revised the ASCII and LOCALE-related texts in re.rst in r65903.

    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented Aug 20, 2008

    On 2008-08-19, Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Fixed in r65860. Someone should check the docs though (at least try to
    generate them, and review my changes a bit since English isn't my mother
    tongue).

    And two more (tiny) fixes in r65904; that's my lot:-)

    @pitrou
    Member

    pitrou commented Aug 20, 2008

    Thanks a lot Mark!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022