Author Leif Arne Storset
Recipients Leif Arne Storset, docs@python
Date 2015-08-19.12:38:10
Message-id <1439987891.31.0.547960058443.issue24896@psf.upfronthosting.co.za>
Content
In Python 2.7, a non-ASCII string does not match a regular expression
case-insensitively unless the UNICODE flag is set. This behavior seems
reasonable, but the documentation implies otherwise.

The example:

    import re
    # Does not match
    re.compile(u"неоднозначность", re.IGNORECASE) \
            .findall(u"Неоднозначность") 
    # Matches
    re.compile(u"неоднозначность", re.IGNORECASE | re.UNICODE) \
            .findall(u"Неоднозначность")

(In Python 3, it does not match if re.ASCII is given.)

The documentation (2.7) says:

    re.UNICODE
    
    Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character
    properties database.

(https://docs.python.org/2/library/re.html#re.UNICODE)

My regex does not use any of those escapes, yet the regex changes behavior with
the UNICODE flag. This leads to confusion when the regex doesn't match. The
documentation is very specific about the behavior that changes with the flag,
implying that behavior not mentioned is unaffected.

Of course, it's easy to guess the (hopefully) correct solution: add re.UNICODE.

Still, I suggest changing the documentation to mention that re.IGNORECASE is
affected. Looking at the source code, there seem to be further consequences
(it mentions "Unicode locale") which may also warrant a mention. If you do want
to avoid specifics, however, even a hand-wavy reference to something like "match
according to Unicode" would help, because it implies that not only the escapes
change behavior.



In Python 3, there is a counterpart to the 2.7 problem: re.ASCII makes our
Cyrillic string not match. Again, this behavior makes intuitive sense, but the
documentation seems to indicate something different:

    re.ASCII
    Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead
    of full Unicode matching. This is only meaningful for Unicode patterns, and
    is ignored for byte patterns.

    …

    re.IGNORECASE
    Perform case-insensitive matching; expressions like [A-Z] will match
    lowercase letters, too. This is not affected by the current locale and
    works for Unicode characters as expected.

re.ASCII does appear to affect re.IGNORECASE. Since this is the non-default
case, however, I'm not sure it's worth calling it out. I'd be happy even if
only the 2.7 docs change.
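
For reference, a minimal Python 3 sketch of the re.ASCII behavior described
above. The ASCII-only pattern "ambiguity" is an illustrative addition that is
not part of the original example; it is included only to show that re.ASCII
appears to narrow case-insensitive matching to ASCII letters rather than
disable it:

```python
import re

pattern = "неоднозначность"
text = "Неоднозначность"

# Default for str patterns in Python 3: Unicode-aware case-insensitive matching.
unicode_result = re.compile(pattern, re.IGNORECASE).findall(text)

# With re.ASCII, case-insensitive matching only covers ASCII letters,
# so the capitalized Cyrillic word is not found.
ascii_result = re.compile(pattern, re.IGNORECASE | re.ASCII).findall(text)

# Illustrative ASCII-only pattern (assumption, not from the report):
# ASCII letters are still matched case-insensitively under re.ASCII.
ascii_letters_result = re.compile(
    "ambiguity", re.IGNORECASE | re.ASCII
).findall("Ambiguity")

print(unicode_result)        # ['Неоднозначность']
print(ascii_result)          # []
print(ascii_letters_result)  # ['Ambiguity']
```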
History
Date                 User               Action  Args
2015-08-19 12:38:11  Leif Arne Storset  set     recipients: + Leif Arne Storset, docs@python
2015-08-19 12:38:11  Leif Arne Storset  set     messageid: <1439987891.31.0.547960058443.issue24896@psf.upfronthosting.co.za>
2015-08-19 12:38:11  Leif Arne Storset  link    issue24896 messages
2015-08-19 12:38:10  Leif Arne Storset  create