Issue 24896: It is undocumented that re.UNICODE and re.LOCALE affect re.IGNORECASE

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69084

classification

Title:	It is undocumented that re.UNICODE and re.LOCALE affect re.IGNORECASE
Type:	enhancement	Stage:	resolved
Components:	Documentation, Regular Expressions	Versions:	Python 3.7, Python 3.6, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	Brian Ward, Leif Arne Storset, docs@python, ezio.melotti, jwilk, mrabarnett, r.david.murray, serhiy.storchaka
Priority:	normal	Keywords:	easy

Created on 2015-08-19 12:38 by Leif Arne Storset, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 1781	merged	python-dev, 2017-05-24 00:32
PR 1782	merged	Brian Ward, 2017-05-24 00:44
PR 3313	merged	gregory.p.smith, 2017-09-04 21:30

Messages (7)
msg248829 - (view)	Author: Leif Arne Storset (Leif Arne Storset)	Date: 2015-08-19 12:38
A non-ASCII string does not match a regular expression case-insensitively unless the UNICODE flag is set. This seems reasonable, but the documentation seems to imply that this is not the case.

The example:

    import re

    # Does not match
    re.compile(u"неоднозначность", re.IGNORECASE) \
        .findall(u"Неоднозначность")

    # Matches
    re.compile(u"неоднозначность", re.IGNORECASE | re.UNICODE) \
        .findall(u"Неоднозначность")

(In Python 3, it does not match if re.ASCII is given.)

The documentation (2.7) says:

    re.UNICODE
    Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.

(https://docs.python.org/2/library/re.html#re.UNICODE)

My regex does not use any of those escapes, yet the regex changes behavior with the UNICODE flag. This leads to confusion when the regex doesn't match. The documentation is very specific about the behavior that changes with the flag, implying that behavior not mentioned is unaffected.

Of course, it's easy to guess the correct (hopefully) solution. Still, I suggest changing the documentation to mention that re.IGNORECASE is affected. Looking at the source code, there seem to be further consequences (it mentions "Unicode locale") which may also warrant a mention. If you do want to avoid specifics, however, even a hand-wavy reference to something like "match according to Unicode" would help, because it implies that not only the escapes change behavior.

In Python 3, there is a counterpart to the 2.7 problem: re.ASCII makes our Cyrillic string not match. Again, this behavior makes intuitive sense, but the documentation seems to indicate something different:

    re.ASCII
    Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns.
    …
    re.IGNORECASE
    Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale and works for Unicode characters as expected.

re.ASCII does appear to affect re.IGNORECASE. Since this is the non-default case, however, I'm not sure it's worth calling it out. I'd be happy even if only the 2.7 docs change.
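The Python 3 side of this report can be reproduced with a minimal sketch like the one below. It assumes a Python 3 interpreter, where str patterns are Unicode by default; the variable names are illustrative only and do not come from the report.

```python
import re

pattern_text = "неоднозначность"
subject = "Неоднозначность"  # same word with a capitalized first letter

# In Python 3, a str pattern uses Unicode-aware case-insensitive matching by default.
default_matches = re.compile(pattern_text, re.IGNORECASE).findall(subject)

# Adding re.ASCII limits case-insensitive matching to ASCII letters,
# so the capitalized Cyrillic letter no longer matches its lowercase form.
ascii_matches = re.compile(pattern_text, re.IGNORECASE | re.ASCII).findall(subject)

print(default_matches)
print(ascii_matches)
```

The pattern contains none of the \w, \d, \s or \b escapes, so any difference between the two results comes from how re.ASCII interacts with re.IGNORECASE.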
msg248837 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2015-08-19 13:12
I think it would be reasonable to add re.IGNORECASE to the list of things affected, since it obviously does switch between using the unicode database and not doing so.
msg293730 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-05-15 18:48
As was reported in issue30373 there is the same issue with re.LOCALE. This documentation issue should be easy for everyone who is fluent in English and familiar with the re module.
msg294324 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-05-24 05:51
Actually, the locale does affect case-insensitive matching if the re.LOCALE flag is used. The set of characters matched by b'[A-Z]' is locale-dependent. For example, in a Turkish locale it can include the letters 'İ' and 'ı'. Only 8-bit locales are supported, not UTF-8 locales. In Unicode case-insensitive mode, the expression '[A-Z]' matches not only the Latin uppercase and lowercase letters A-Z and a-z, but also the characters 'İ', 'ı', 'ſ', and 'K' (the Kelvin sign, U+212A).
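The Unicode half of this message can be checked with a short sketch, assuming Python 3.7 or later and assuming that the last character listed is the Kelvin sign (U+212A) rather than the ASCII letter K. The list and variable names are hypothetical. The locale-dependent bytes case is left out because its result depends on which 8-bit locales are installed.

```python
import re

# The four non-ASCII letters named in the message, written as escapes:
# I with dot above, dotless i, long s, Kelvin sign (assumed).
extra_letters = ["\u0130", "\u0131", "\u017f", "\u212a"]

# Unicode case-insensitive mode: '[A-Z]' is expected to accept all four.
unicode_hits = [
    ch for ch in extra_letters
    if re.fullmatch("[A-Z]", ch, re.IGNORECASE)
]

# With re.ASCII, only the 52 ASCII letters are expected to match.
ascii_hits = [
    ch for ch in extra_letters
    if re.fullmatch("[A-Z]", ch, re.IGNORECASE | re.ASCII)
]

print([hex(ord(ch)) for ch in unicode_hits])
print([hex(ord(ch)) for ch in ascii_hits])
```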
msg294410 - (view)	Author: Brian Ward (Brian Ward) *	Date: 2017-05-24 23:04
OK, I'll look at this soon and come up with the next iteration.
msg301305 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-09-05 11:03
This wording is not correct as I noted in msg294324.
msg303823 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-10-06 15:06
I tried to correct the documentation in issue31714.

History
Date	User	Action	Args
2022-04-11 14:58:19	admin	set	github: 69084
2017-11-16 10:41:48	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: needs patch -> resolved
2017-10-06 15:06:26	serhiy.storchaka	set	messages: + msg303823
2017-09-05 11:03:43	serhiy.storchaka	set	messages: + msg301305 versions: - Python 3.5
2017-09-04 21:30:34	gregory.p.smith	set	pull_requests: + pull_request3340
2017-05-24 23:04:20	Brian Ward	set	messages: + msg294410
2017-05-24 05:51:07	serhiy.storchaka	set	messages: + msg294324
2017-05-24 00:44:21	Brian Ward	set	pull_requests: + pull_request1863
2017-05-24 00:32:26	python-dev	set	pull_requests: + pull_request1862
2017-05-23 21:31:25	Brian Ward	set	nosy: + Brian Ward
2017-05-15 18:51:23	jwilk	set	nosy: + jwilk
2017-05-15 18:48:04	serhiy.storchaka	set	title: It is undocumented that re.UNICODE affects re.IGNORECASE -> It is undocumented that re.UNICODE and re.LOCALE affect re.IGNORECASE nosy: + serhiy.storchaka messages: + msg293730 keywords: + easy
2017-05-15 18:43:03	serhiy.storchaka	link	issue30373 superseder
2016-10-16 08:12:20	serhiy.storchaka	set	versions: + Python 3.7
2016-01-04 00:13:33	ezio.melotti	set	versions: + Python 3.5, Python 3.6, - Python 3.4 nosy: + mrabarnett, ezio.melotti components: + Regular Expressions type: enhancement stage: needs patch
2015-08-19 13:12:34	r.david.murray	set	nosy: + r.david.murray messages: + msg248837
2015-08-19 12:38:11	Leif Arne Storset	create