Message 134700 - Python tracker

Author	RobM
Recipients	RobM, effbot, ezio.melotti, pitrou
Date	2011-04-28.17:02:55
Message-id	<1304010176.69.0.487843608928.issue11947@psf.upfronthosting.co.za>

Content
Regular expressions which are written to match literal underscores ("_", ASCII
ordinal 95) and which specify `re.IGNORECASE` during compilation do not
consistently match underscores: some occurrences are matched, but others are not.

The following session log shows the problem:

    Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"
    >>> print subject.encode("base64")  # In case my environment encoding is to blame
    W0NvbmNsYXZlLU1lbmRvaV1fZWZfLV9hX3RhbGVfb2ZfbWVtb3JpZXNfMDAtMTJfSDI2NA==

    >>> re.sub("_", "X", subject)  # No flags, does what I expect
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE)  # Misses some matches
    '[Conclave-Mendoi]XefX-_a_tale_of_memories_00-12_H264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE)  # Misses fewer matches
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemories_00-12_H264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE | re.UNICODE)  # Works OK
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE | re.UNICODE) # Works OK
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
    >>> 
    >>> type(subject)  # Don't think this is a unicode string
    <type 'str'>
    >>> 

Since my `subject` variable is of type `str` and contains only ASCII characters,
I do not believe that the `re.UNICODE` flag should be required.
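
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the fourth positional argument of `re.sub` is `count`, not `flags`, so the flag constants in the session above may have been read as replacement limits. The integer values of the constants (`re.IGNORECASE` is 2, `re.LOCALE` is 4, `re.UNICODE` is 32) would line up with the number of underscores replaced in each call. The sketch below reproduces the three results by passing those values as `count` explicitly, then shows two ways to pass the flag as a flag. The variable names are illustrative only, and the `flags=` keyword on `re.sub` is, as far as I can tell, only available from Python 2.7 onwards.

```python
import re

subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"

# Assumption: when a flag constant lands in the `count` slot, only its
# integer value matters. int() makes that value explicit.
limit_2 = re.sub("_", "X", subject, count=int(re.IGNORECASE))
limit_6 = re.sub("_", "X", subject, count=int(re.IGNORECASE | re.LOCALE))
limit_34 = re.sub("_", "X", subject, count=int(re.IGNORECASE | re.UNICODE))

print(limit_2)   # only the first 2 underscores are replaced
print(limit_6)   # only the first 6 underscores are replaced
print(limit_34)  # limit exceeds the 8 underscores present, so all are replaced

# Passing the flag as a flag: keyword argument (Python 2.7+ and 3.x).
by_keyword = re.sub("_", "X", subject, flags=re.IGNORECASE)

# Passing the flag as a flag: compile first (should also work on Python 2.6).
by_compile = re.compile("_", re.IGNORECASE).sub("X", subject)

print(by_keyword)
print(by_compile)
```

If this reading is right, the `re.UNICODE` calls only appear to work because 34 is larger than the number of underscores in `subject`.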
