Message 70799 - Python tracker


Author	mrabarnett
Recipients	mrabarnett
Date	2008-08-06.19:41:10
Message-id	<1218051735.51.0.548930988583.issue3511@psf.upfronthosting.co.za>
In-reply-to

Content

While working on the regex code in sre_compile.py I came across the
following code in the handling of charset ranges in _optimize_charset:

    for i in range(fixup(av[0]), fixup(av[1])+1):
        charmap[i] = 1

The function fixup converts the ends of the range to lower case if the
ignore-case flag is present. The problem with this approach is
illustrated below:

>>> import re
>>> print re.match(r'[9-A]', 'A')
<_sre.SRE_Match object at 0x00A78058>
>>> print re.match(r'[9-A]', 'a')
None
>>> print re.match(r'[9-A]', '_')
None
>>> print re.match(r'[9-A]', 'A', re.IGNORECASE)
<_sre.SRE_Match object at 0x00D0BFA8>
>>> print re.match(r'[9-A]', 'a', re.IGNORECASE)
<_sre.SRE_Match object at 0x00A78058>
>>> print re.match(r'[9-A]', '_', re.IGNORECASE)
<_sre.SRE_Match object at 0x00D0BFA8>

'_' (0x5F) doesn't lie between '9' (0x39) and 'A' (0x41), but it does lie
between '9' and 'a' (0x61).
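
To make the effect concrete, here is a minimal sketch that imitates the quoted
loop outside of sre_compile.py. It is written for Python 3, and build_charmap
and its ignore_case parameter are hypothetical names for illustration only; the
lambda standing in for fixup just lower-cases the endpoint, which is an
assumption based on the description above rather than the real code.

```python
def build_charmap(lo, hi, ignore_case=False):
    """Mark every code point from lo to hi inclusive, like the quoted loop."""
    # Hypothetical stand-in for fixup: lower-case the endpoint when ignoring
    # case, otherwise use the code point unchanged.
    fixup = (lambda ch: ord(ch.lower())) if ignore_case else ord
    charmap = [0] * 256
    for i in range(fixup(lo), fixup(hi) + 1):
        charmap[i] = 1
    return charmap


plain = build_charmap('9', 'A')
folded = build_charmap('9', 'A', ignore_case=True)

# Characters that only match once the endpoints have been lower-cased.
extra = [chr(i) for i in range(256) if folded[i] and not plain[i]]
print(''.join(extra))
```

Lower-casing the upper endpoint stretches the range from 0x39..0x41 to
0x39..0x61, so everything from 'B' through 'a' gets pulled in, including the
punctuation between 'Z' and 'a'.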

Surely the ignore-case flag should not affect whether non-letters are
matched or not?
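
One possible way to get the behaviour I would expect is to expand the range
first and only then add the case variants of each member, instead of folding
the endpoints. The following is just a sketch under that assumption, in
Python 3, with a hypothetical function name; it is not a patch for
sre_compile.py and says nothing about how the real code should be changed.

```python
def build_charmap_per_char(lo, hi, ignore_case=False):
    """Expand lo..hi as written, then add case variants of each member."""
    charmap = [0] * 256
    for i in range(ord(lo), ord(hi) + 1):
        charmap[i] = 1
        if ignore_case:
            ch = chr(i)
            for variant in (ch.lower(), ch.upper()):
                # Skip variants that are multi-character or fall outside
                # the 256-entry map used in this sketch.
                if len(variant) == 1 and ord(variant) < 256:
                    charmap[ord(variant)] = 1
    return charmap


folded = build_charmap_per_char('9', 'A', ignore_case=True)
print(folded[ord('a')], folded[ord('_')])
```

With this approach [9-A] under ignore-case picks up 'a' as the partner of 'A',
while '_' stays outside the set.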

History
Date	User	Action	Args
2008-08-06 19:42:15	mrabarnett	set	recipients: + mrabarnett
2008-08-06 19:42:15	mrabarnett	set	messageid: <1218051735.51.0.548930988583.issue3511@psf.upfronthosting.co.za>
2008-08-06 19:41:11	mrabarnett	link	issue3511 messages
2008-08-06 19:41:10	mrabarnett	create