classification
Title: It is undocumented that re.UNICODE and re.LOCALE affect re.IGNORECASE
Type: enhancement
Stage: needs patch
Components: Documentation, Regular Expressions
Versions: Python 3.7, Python 3.6, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Brian Ward, Leif Arne Storset, docs@python, ezio.melotti, jwilk, mrabarnett, r.david.murray, serhiy.storchaka
Priority: normal
Keywords: easy

Created on 2015-08-19 12:38 by Leif Arne Storset, last changed 2017-10-06 15:06 by serhiy.storchaka.

Pull Requests
URL Status Linked Edit
PR 1781 merged python-dev, 2017-05-24 00:32
PR 1782 merged Brian Ward, 2017-05-24 00:44
PR 3313 merged gregory.p.smith, 2017-09-04 21:30
Messages (7)
msg248829 - Author: Leif Arne Storset (Leif Arne Storset) Date: 2015-08-19 12:38
A non-ASCII string does not match a regular expression case-insensitively
unless the UNICODE flag is set. This seems reasonable, but the documentation
seems to imply that this is not the case.

The example:

    import re
    # Does not match
    re.compile(u"неоднозначность", re.IGNORECASE) \
            .findall(u"Неоднозначность") 
    # Matches
    re.compile(u"неоднозначность", re.IGNORECASE | re.UNICODE) \
            .findall(u"Неоднозначность")

(In Python 3, it does not match if re.ASCII is given.)

The documentation (2.7) says:

    re.UNICODE
    
    Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character
    properties database.

(https://docs.python.org/2/library/re.html#re.UNICODE)

My regex does not use any of those escapes, yet the regex changes behavior with
the UNICODE flag. This leads to confusion when the regex doesn't match. The
documentation is very specific about the behavior that changes with the flag,
implying that behavior not mentioned is unaffected.

Of course, it's easy to guess the correct (hopefully) solution.

Still, I suggest changing the documentation to mention that re.IGNORECASE is
affected. Looking at the source code, there seem to be further consequences
(it mentions "Unicode locale") which may also warrant a mention. If you do want
to avoid specifics, however, even a hand-wavy reference to something like "match
according to Unicode" would help, because it implies that not only the escapes
change behavior.



In Python 3, there is a counterpart to the 2.7 problem: re.ASCII makes our
Cyrillic string not match. Again, this behavior makes intuitive sense, but the
documentation seems to indicate something different:

    re.ASCII
    Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead
    of full Unicode matching. This is only meaningful for Unicode patterns, and
    is ignored for byte patterns.

    …

    re.IGNORECASE
    Perform case-insensitive matching; expressions like [A-Z] will match
    lowercase letters, too. This is not affected by the current locale and
    works for Unicode characters as expected.

re.ASCII does appear to affect re.IGNORECASE. Since this is the non-default
case, however, I'm not sure it's worth calling it out. I'd be happy even if
only the 2.7 docs change.
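
For reference, the Python 3 behavior described above can be reproduced with a short sketch like the following. It reuses the Cyrillic strings from the 2.7 example; the variable names are only illustrative.

```python
import re

pattern = "неоднозначность"
text = "Неоднозначность"

# Default for str patterns in Python 3: Unicode matching, so IGNORECASE
# treats the capital "Н" and the small "н" as equal.
unicode_hits = re.findall(pattern, text, re.IGNORECASE)

# With re.ASCII, case-insensitive matching only folds ASCII letters,
# so the capitalized Cyrillic word is no longer found.
ascii_hits = re.findall(pattern, text, re.IGNORECASE | re.ASCII)

print(unicode_hits)
print(ascii_hits)
```

The first call finds the word and the second returns an empty list, even though the pattern uses none of the escapes listed under re.ASCII.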
msg248837 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2015-08-19 13:12
I think it would be reasonable to add re.IGNORECASE to the list of things affected, since it obviously does switch between using the Unicode database and not doing so.
msg293730 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-05-15 18:48
As was reported in issue30373 there is the same issue with re.LOCALE.

This documentation issue should be easy for everyone who is fluent in English and familiar with the re module.
msg294324 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-05-24 05:51
Actually, the locale does affect case-insensitive matching if the re.LOCALE flag is used. The set of characters matched by b'[A-Z]' is locale-dependent. For example, in a Turkish locale it can include the letters 'İ' and 'ı'. Only 8-bit locales are supported, not UTF-8 locales.
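
A rough way to inspect this locale dependence is sketched below. The helper name `locale_letter_class` is hypothetical, and the Turkish locale name in the final comment is an assumption: locale names vary by platform and the locale may not be installed, in which case the helper returns None.

```python
import locale
import re


def locale_letter_class(locale_name):
    """Return the single bytes matched by b'[A-Z]' under
    re.IGNORECASE | re.LOCALE in the given locale, or None if the
    locale is not available on this system."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None
    try:
        # re.LOCALE is only meaningful for bytes patterns.
        pattern = re.compile(b"[A-Z]", re.IGNORECASE | re.LOCALE)
        return [bytes([b]) for b in range(256) if pattern.fullmatch(bytes([b]))]
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the previous locale


c_class = locale_letter_class("C")
print(len(c_class))

# Assumed locale name; compare the result with c_class if it is installed:
# turkish_class = locale_letter_class("tr_TR.ISO8859-9")
```

In the "C" locale the helper returns only the ASCII letters; any extra bytes reported for an 8-bit locale come from that locale's case mapping.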

In Unicode case-insensitive mode, the expression '[A-Z]' matches not only the Latin uppercase and lowercase letters A-Z and a-z, but also the characters 'İ', 'ı', 'ſ', and 'K' (the Kelvin sign, U+212A).
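
The four extra characters can be checked with a short Python 3 sketch such as this one. The code points are written as escapes so that the Kelvin sign is not confused with the ASCII letter K; the variable names are illustrative only.

```python
import re

# U+0130 (İ), U+0131 (ı), U+017F (ſ) and U+212A (Kelvin sign)
extra = ["\u0130", "\u0131", "\u017f", "\u212a"]

# Unicode case-insensitive matching (the default for str patterns)
unicode_matches = [
    bool(re.fullmatch("[A-Z]", ch, re.IGNORECASE)) for ch in extra
]

# The same range restricted to ASCII-only case folding
ascii_matches = [
    bool(re.fullmatch("[A-Z]", ch, re.IGNORECASE | re.ASCII)) for ch in extra
]

print(unicode_matches)
print(ascii_matches)
```

All four characters match in Unicode mode, and none of them match once re.ASCII is added.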
msg294410 - Author: Brian Ward (Brian Ward) Date: 2017-05-24 23:04
OK, I'll look at this soon and come up with the next iteration.
msg301305 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-09-05 11:03
This wording is not correct as I noted in msg294324.
msg303823 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-10-06 15:06
I tried to correct the documentation in issue31714.
History
Date  User  Action  Args
2017-10-06 15:06:26  serhiy.storchaka  set  messages: + msg303823
2017-09-05 11:03:43  serhiy.storchaka  set  messages: + msg301305
    versions: - Python 3.5
2017-09-04 21:30:34  gregory.p.smith  set  pull_requests: + pull_request3340
2017-05-24 23:04:20  Brian Ward  set  messages: + msg294410
2017-05-24 05:51:07  serhiy.storchaka  set  messages: + msg294324
2017-05-24 00:44:21  Brian Ward  set  pull_requests: + pull_request1863
2017-05-24 00:32:26  python-dev  set  pull_requests: + pull_request1862
2017-05-23 21:31:25  Brian Ward  set  nosy: + Brian Ward
2017-05-15 18:51:23  jwilk  set  nosy: + jwilk
2017-05-15 18:48:04  serhiy.storchaka  set  title: It is undocumented that re.UNICODE affects re.IGNORECASE -> It is undocumented that re.UNICODE and re.LOCALE affect re.IGNORECASE
    nosy: + serhiy.storchaka
    messages: + msg293730
    keywords: + easy
2017-05-15 18:43:03  serhiy.storchaka  link  issue30373 superseder
2016-10-16 08:12:20  serhiy.storchaka  set  versions: + Python 3.7
2016-01-04 00:13:33  ezio.melotti  set  versions: + Python 3.5, Python 3.6, - Python 3.4
    nosy: + mrabarnett, ezio.melotti
    components: + Regular Expressions
    type: enhancement
    stage: needs patch
2015-08-19 13:12:34  r.david.murray  set  nosy: + r.david.murray
    messages: + msg248837
2015-08-19 12:38:11  Leif Arne Storset  create