Issue 17381: IGNORECASE breaks unicode literal range matching



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/61583

classification

Title:	IGNORECASE breaks unicode literal range matching
Type:	behavior	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.4, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:	22584	Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	acdha, ezio.melotti, mrabarnett, python-dev, serhiy.storchaka
Priority:	normal	Keywords:	needs review, patch

Created on 2013-03-07 20:52 by acdha, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
re_ignore_case_range.patch	serhiy.storchaka, 2014-09-08 19:58		review
re_ignore_case_range-3.5.patch	serhiy.storchaka, 2014-09-17 08:57		review
re_ignore_case_range-3.4_2.patch	serhiy.storchaka, 2014-09-24 19:17		review
re_ignore_case_range-3.5_2.patch	serhiy.storchaka, 2014-10-08 19:41		review
re_ignore_case_range-3.5_3.patch	serhiy.storchaka, 2014-10-09 07:50		review

Messages (17)
msg183705 - (view)	Author: Chris Adams (acdha)	Date: 2013-03-07 20:52
I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text:

>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''

The same is true in Python 2.7, although you need to use ur'' patterns for the literals to be expanded:

>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|regex.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'

In contrast, the regex module behaves as expected:

>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''

(Transcript maintained at https://gist.github.com/acdha/5111687)
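The report above can be turned into a small regression check. This is only a sketch for Python 3: the names CYRILLIC_OR_SPACE, SAMPLE, strip_cyrillic and flags_agree are hypothetical helpers introduced here, not part of the report, and the check is expected to pass only on an interpreter that already contains the fix for this issue.

```python
import re

# Pattern and sample text taken from the report; the names are illustrative.
CYRILLIC_OR_SPACE = r'[\s\u0400-\u0527]+'
SAMPLE = 'Архангельская губерния'


def strip_cyrillic(text, flags=0):
    """Remove every run of whitespace and U+0400..U+0527 characters."""
    return re.sub(CYRILLIC_OR_SPACE, '', text, flags=flags)


def flags_agree(text):
    """Return True if adding re.IGNORECASE does not change the result."""
    return strip_cyrillic(text) == strip_cyrillic(text, re.IGNORECASE)


if __name__ == '__main__':
    # On an affected interpreter this is expected to print False.
    print(flags_agree(SAMPLE))
```

Comparing the two flag settings against each other, rather than against a fixed string, keeps the check focused on the symptom reported here: the range should match the same text with or without re.IGNORECASE.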
msg183712 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2013-03-07 23:19
The way the re module handles ranges is to convert the two endpoints to lowercase and then check whether the lowercase form of the character in the text is in that range. For example, [A-Z] is converted to the range [\x61-\x7A], and the lowercase form of 'Q' ('\x51') is 'q' ('\x71'), which is in the range. In your example, [\u0400-\u0527] is converted to the range [\u0450-\u0527], but the lowercase form of 'А' ('\u0410') is 'а' ('\u0430'), which isn't in the range. This is the same as issue #3511, but a worse failure.
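The endpoint-lowering behaviour described above can be modelled in a few lines of Python 3. This is a simplified illustration, not the actual sre code: lower_one, old_range_match and variant_range_match are hypothetical names, and the second function is just one possible way to get the intended result, not necessarily what the eventual patches do.

```python
def lower_one(ch):
    """Lowercase a single character, keeping it if the result is longer."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def old_range_match(lo, hi, ch):
    """Model of the pre-fix check: lowercase both endpoints and the
    candidate character, then compare code points."""
    return ord(lower_one(lo)) <= ord(lower_one(ch)) <= ord(lower_one(hi))


def variant_range_match(lo, hi, ch):
    """Alternative check: keep the range as written and test the
    character together with its single-character case variants."""
    variants = {ch, ch.lower(), ch.upper()}
    return any(len(v) == 1 and ord(lo) <= ord(v) <= ord(hi) for v in variants)


if __name__ == '__main__':
    print(old_range_match('A', 'Z', 'Q'))                     # ASCII case
    print(old_range_match('\u0400', '\u0527', '\u0410'))      # Cyrillic case
    print(variant_range_match('\u0400', '\u0527', '\u0410'))
```

The model shows why the ASCII case works by accident: the lowercase letters form a contiguous block that the lowered endpoints still bracket, whereas the lowered Cyrillic range starts at U+0450 and excludes the basic lowercase letters at U+0430..U+044F.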
msg183753 - (view)	Author: Chris Adams (acdha)	Date: 2013-03-08 18:22
Ah, that explains it - I'd been hoping based on the re.DEBUG output that the explicit unicode ranges were preserved. I found #3511 before opening this one but don't believe the decision should be the same since this isn't a mixed numeric/alphabetic range.
msg183988 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-11 19:50
Matthew, should this be closed then?
msg183989 - (view)	Author: Chris Adams (acdha)	Date: 2013-03-11 19:59
Ezio: given the non-obvious failure, what do you think of at least documenting this and issuing a warning any time both re.UNICODE and re.IGNORECASE are set?
msg183992 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2013-03-11 21:00
In issue #3511 the range was slightly unusual, so closing it seemed a reasonable approach, but the range in this issue is less clearly a problem. My preference would be to fix it, if possible.
msg183993 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-03-11 21:24
I'm working on the patch.
msg184016 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-03-12 08:11
Is this the same issue described in #12728?
msg226608 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-09-08 19:58
No, issue12728 is a more complicated case. Here is a patch which fixes this issue and issue3511.
msg226989 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-09-17 08:57
This patch has a disadvantage: it slows down case-insensitive compilation of some very wide ranges, e.g. compile(r"[\x00-\U0010ffff]+", re.I) (this is the worst case). In most cases this is not important, because such wide ranges are rare and compiled patterns are cached. To get rid of this regression, we need a new opcode. Because binary compatibility must be preserved, this approach can't be applied to old releases. Here is a patch for 3.5. Please review it. These patches are needed to continue fixing other re bugs.
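One way to observe the compile-time cost mentioned above is to time re.compile with the pattern cache cleared before each run. This is a rough sketch, assuming Python 3; time_compile and WIDE_RANGE are hypothetical names, and the absolute numbers depend entirely on the interpreter version and machine.

```python
import re
import time

# Worst-case pattern quoted in the message above.
WIDE_RANGE = r"[\x00-\U0010ffff]+"


def time_compile(pattern, flags=0, repeat=3):
    """Return the best wall-clock time (seconds) for an uncached compile."""
    best = float('inf')
    for _ in range(repeat):
        re.purge()  # clear the pattern cache so the compile is not skipped
        start = time.perf_counter()
        re.compile(pattern, flags)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == '__main__':
    print('case-sensitive:  ', time_compile(WIDE_RANGE))
    print('case-insensitive:', time_compile(WIDE_RANGE, re.I))
```

Calling re.purge() matters here: without it, every compile after the first is served from the cache, which is exactly why the slowdown is described as unimportant in most cases.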
msg227485 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-09-24 19:17
Here is another patch for 3.4. It is more than 10 times faster than the initial patch in the worst case.
msg228814 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-08 19:41
Actually, the 3.5 patch can be simpler.
msg228837 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-09 07:50
The updated patch for 3.5 addresses Antoine's comments. Note that 3.4 and 3.5 use different solutions to this issue.
msg229919 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-24 12:28
Does the patch look good to you now, Antoine? If there are no objections, I'm going to commit it soon. In order to apply the 3.4 patch to 2.7, we need either to modify the patch significantly or to first backport the issue19329 changes to 2.7 (which would be easier).
msg230332 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-10-31 10:42
New changeset 6f52a3d0f548 by Serhiy Storchaka in branch 'default':
Issue #17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/6f52a3d0f548

New changeset 7981cb1556cf by Serhiy Storchaka in branch '3.4':
Issue #17381: Fixed handling of case-insensitive ranges in regular expressions.
https://hg.python.org/cpython/rev/7981cb1556cf
msg230336 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-10-31 11:55
New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
Backported the optimization of compiling charsets in regular expressions
https://hg.python.org/cpython/rev/ebd48b4f650d

New changeset 6cd4b9827755 by Serhiy Storchaka in branch '2.7':
Issue #17381: Fixed ranges handling in case-insensitive regular expressions.
https://hg.python.org/cpython/rev/6cd4b9827755
msg230350 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-31 16:12
Thank you Antoine for your review.

History
Date	User	Action	Args
2022-04-11 14:57:42	admin	set	github: 61583
2014-11-08 12:14:26	serhiy.storchaka	link	issue3511 superseder
2014-10-31 16:12:16	serhiy.storchaka	set	status: open -> closed resolution: fixed messages: + msg230350 stage: patch review -> resolved
2014-10-31 11:55:20	python-dev	set	messages: + msg230336
2014-10-31 10:42:54	python-dev	set	nosy: + python-dev messages: + msg230332
2014-10-24 12:28:34	serhiy.storchaka	set	messages: + msg229919
2014-10-09 07:50:41	serhiy.storchaka	set	files: + re_ignore_case_range-3.5_3.patch dependencies: + Get rid of SRE character tables messages: + msg228837
2014-10-08 19:41:57	serhiy.storchaka	set	files: + re_ignore_case_range-3.5_2.patch messages: + msg228814
2014-09-24 19:17:11	serhiy.storchaka	set	files: + re_ignore_case_range-3.4_2.patch messages: + msg227485
2014-09-21 20:45:06	serhiy.storchaka	link	issue12728 dependencies
2014-09-17 08:57:13	serhiy.storchaka	set	keywords: + needs review files: + re_ignore_case_range-3.5.patch messages: + msg226989
2014-09-08 19:58:14	serhiy.storchaka	set	files: + re_ignore_case_range.patch versions: + Python 3.4, Python 3.5, - Python 3.3 messages: + msg226608 assignee: serhiy.storchaka keywords: + patch stage: patch review
2013-03-12 08:11:31	ezio.melotti	set	messages: + msg184016
2013-03-11 21:24:54	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg183993
2013-03-11 21:00:24	mrabarnett	set	messages: + msg183992
2013-03-11 19:59:42	acdha	set	messages: + msg183989
2013-03-11 19:50:21	ezio.melotti	set	messages: + msg183988
2013-03-08 18:22:51	acdha	set	messages: + msg183753
2013-03-07 23:19:56	mrabarnett	set	messages: + msg183712
2013-03-07 20:52:32	acdha	create