classification
Title: re.IGNORECASE does not match literal "_" (underscore)
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 2.6
process
Status: closed Resolution: duplicate
Dependencies: Superseder: re.sub confusion between count and flags args (#11957)
Assigned To: Nosy List: RobM, effbot, ezio.melotti, mrabarnett, pitrou
Priority: normal Keywords:

Created on 2011-04-28 17:02 by RobM, last changed 2014-10-29 16:15 by vstinner. This issue is now closed.

Messages (6)
msg134700 - Author: Robert Meerman (RobM) Date: 2011-04-28 17:02
Regular expressions which are written to match literal underscores ("_", ASCII 
ordinal 95) and specify `re.IGNORECASE` during compilation do not consistently 
match underscores: it seems some occurrences are matched, but others are not.

The following session log shows the problem:

    Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41) 
    [GCC 4.4.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"
    >>> print subject.encode("base64")  # In case my environment encoding is to blame
    W0NvbmNsYXZlLU1lbmRvaV1fZWZfLV9hX3RhbGVfb2ZfbWVtb3JpZXNfMDAtMTJfSDI2NA==

    >>> re.sub("_", "X", subject)  # No flags, does what I expect
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE)  # Misses some matches
    '[Conclave-Mendoi]XefX-_a_tale_of_memories_00-12_H264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE)  # Misses fewer matches
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemories_00-12_H264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE | re.LOCALE | re.UNICODE)  # Works OK
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
    >>> 
    >>> re.sub("_", "X", subject, re.IGNORECASE | re.UNICODE) # Works OK
    '[Conclave-Mendoi]XefX-XaXtaleXofXmemoriesX00-12XH264'
    >>> 
    >>> type(subject)  # Don't think this is a unicode string
    <type 'str'>
    >>> 

Since my `subject` variable is of type `str` and only contains ASCII characters
I do not believe that the `re.UNICODE` flag should be required.
msg134716 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2011-04-28 19:54
help(re.sub) says:

    sub(pattern, repl, string, count=0)

and re.IGNORECASE has a value of 2.

Therefore this:

    re.sub("_", "X", subject, re.IGNORECASE)

is telling it to replace at most 2 occurrences of "_".
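
The same arithmetic accounts for the other outputs in the report. The following is a minimal sketch, written for Python 3 (an assumption; the session above is Python 2.6), that passes each flag value explicitly as `count` to reproduce the reported results, then attaches the flag through `re.compile`, where it cannot be mistaken for a count. The variable names are illustrative only.

```python
import re

subject = "[Conclave-Mendoi]_ef_-_a_tale_of_memories_00-12_H264"

# Flag constants are integers; these are the values that ended up in `count`.
count_ic = int(re.IGNORECASE)                    # 2
count_ic_loc = int(re.IGNORECASE | re.LOCALE)    # 2 | 4 == 6
count_ic_uni = int(re.IGNORECASE | re.UNICODE)   # 2 | 32 == 34

# Equivalent to the calls in the report: the fourth positional
# argument of re.sub is count, so each flag value acts as a limit.
two = re.sub("_", "X", subject, count=count_ic)
six = re.sub("_", "X", subject, count=count_ic_loc)
all_by_luck = re.sub("_", "X", subject, count=count_ic_uni)

# Intended behaviour: attach the flag to the compiled pattern instead.
intended = re.compile("_", re.IGNORECASE).sub("X", subject)

print(two)       # only the first 2 underscores replaced
print(six)       # only the first 6 underscores replaced
print(all_by_luck == intended)
```

The subject contains 8 underscores, so a limit of 34 happens to replace all of them, which is why the calls that included `re.UNICODE` appeared to work.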
msg134717 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-04-28 20:49
Closing as invalid.
I wonder if it would be better to have count as a keyword-only argument though, since this problem seems to come up pretty often and it's not easy to debug.
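
To illustrate what a keyword-only `count` would look like, here is a rough sketch using Python 3 syntax. `sub_kw` is a hypothetical wrapper invented for this example, not an existing or proposed `re` function; it only shows how a bare `*` in the signature turns the mistaken call into an immediate error.

```python
import re

def sub_kw(pattern, repl, string, *, count=0, flags=0):
    """Hypothetical wrapper: count and flags must be passed by keyword."""
    return re.sub(pattern, repl, string, count=count, flags=flags)

subject = "a_b_c"

# Keyword use works as intended.
print(sub_kw("_", "X", subject, flags=re.IGNORECASE))
print(sub_kw("_", "X", subject, count=1))

# Passing a flag positionally no longer silently becomes a count.
try:
    sub_kw("_", "X", subject, re.IGNORECASE)
except TypeError as exc:
    print("rejected:", exc)
```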
msg134723 - Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2011-04-28 22:21
I don't know how much code that might break. It might not be that much; I can't remember when I last used re.sub without the default count.
msg134752 - Author: Robert Meerman (RobM) Date: 2011-04-29 11:53
Oh, that's embarrassing. :-)

Could a type-check be used to alert the user to their mistake? I suppose that would require re.IGNORECASE (et al) to be of some new type (presumably sub-classed from Integer).
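
A type check along these lines can be sketched in Python 3.6 and later, where the flag constants are members of `re.RegexFlag` (an `enum.IntFlag`, so still an `int` subclass). In Python 2.6 the flags are plain integers and this check would not be possible. `checked_sub` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

def checked_sub(pattern, repl, string, count=0, flags=0):
    """Hypothetical wrapper that rejects a flag passed where count belongs."""
    # A RegexFlag member can be told apart from a plain integer count.
    if isinstance(count, re.RegexFlag):
        raise TypeError(
            "count looks like a regex flag (%r); pass flags=... instead" % (count,)
        )
    return re.sub(pattern, repl, string, count=count, flags=flags)

subject = "a_b_c"

# A genuine integer count is accepted.
print(checked_sub("_", "X", subject, 1))

# The mistake from this report is reported instead of silently limiting matches.
try:
    checked_sub("_", "X", subject, re.IGNORECASE)
except TypeError as exc:
    print("rejected:", exc)
```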

(Thanks for the quick response, and sorry to waste your time)
msg134831 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-04-30 02:24
See also #11957.
History
Date                 User          Action  Args
2014-10-29 16:15:11  vstinner      set     superseder: re.sub confusion between count and flags args; resolution: not a bug -> duplicate
2011-04-30 02:24:10  ezio.melotti  set     messages: + msg134831
2011-04-29 11:53:32  RobM          set     messages: + msg134752
2011-04-28 22:21:58  mrabarnett    set     messages: + msg134723
2011-04-28 20:49:19  ezio.melotti  set     status: open -> closed; resolution: not a bug; messages: + msg134717; stage: resolved
2011-04-28 19:54:48  mrabarnett    set     nosy: + mrabarnett; messages: + msg134716
2011-04-28 17:02:55  RobM          create