classification
Title: regex matches incorrectly on literal dot (99.9% confirmed)
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: Cal.Leeming, lehmannro
Priority: normal
Keywords:

Created on 2011-06-13 12:24 by Cal.Leeming, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg138234 - Author: Cal Leeming (Cal.Leeming) Date: 2011-06-13 12:24
I believe I might have found a bug in the Python re library. Here is a complete debug trace of what is happening (my apologies for the nature of the actual text). I have run this regex through RegexBuddy (and a few other tools), and all of them do the correct thing (which is to make no replacement), apart from Python. I haven't yet tried this in another language.

------------ ORIGINAL TEXT ------------
>>313229176 
me and a buddy and his girlfriend were watching tv once and this blabbering idiot starts talking about this scientific study she heard about where they built a fake city and only one guy didn't know that it was a fake. we all paused for a second and i said "the truman show?" and she says "yeah! that was the name of it!" me my buddy and his girlfriend all catch eyes and are baffled at how stupid she was
----------------------------------------

------------ TEXT AFTER REGEX SUB ------------

me and a buddy and his girlfriend were http://watching.tv once and this blabbering idiot starts talking about this scientific study she heard about where they built a fake city and only one guy didn't know that it was a fake.we all paused for a second and i said "the truman show?" and she says "yeah! that was the name of it!" me my buddy and his girlfriend all catch eyes and are baffled at how stupid she was
-----------------------------------------------

----------- REPLACED TEXT -----------
 watching tv 
 http://watching.tv 
-----------------------------------------------


---- REGEX ----
_t = re.compile(r"(^| )((?:[\w\-]{2,}?\.|)(?:[\w\-]{2,}?)(?:\.com|\.net|\.org|\.co\.uk|\.tv|\.ly))", flags = re.IGNORECASE | re.MULTILINE | re.DEBUG)

---- COMMAND ----
_t.sub("\\1http://\\2", original_message_here)


---- REGEX DEBUG ----

subpattern 1
  branch
    at at_beginning
  or
    literal 32
subpattern 2
  subpattern None
    branch
      min_repeat 2 65535
        in
          category category_word
          literal 45
      literal 46
    or
  subpattern None
    min_repeat 2 65535
      in
        category category_word
        literal 45
  subpattern None
    literal 46
    branch
      literal 99
      literal 111
      literal 109
    or
      literal 110
      literal 101
      literal 116
    or
      literal 111
      literal 114
      literal 103
    or
      literal 99
      literal 111
      literal 46
      literal 117
      literal 107
    or
      literal 116
      literal 118
    or
      literal 108
      literal 121
msg138236 - Author: Cal Leeming (Cal.Leeming) Date: 2011-06-13 12:30
Take particular note of the following:

\.co\.uk

    or
      literal 99
      literal 111
      literal 46
      literal 117
      literal 107


>>> map(lambda x: chr(x), [99,111,46,117,107])
['c', 'o', '.', 'u', 'k']

It would appear it is ignoring the first \. 

But why??
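
A likely reading of the debug listing, offered as a sketch rather than a confirmed diagnosis: every alternative in the final group starts with `\.`, and the listing shows a single `literal 46` just before the `branch`, which is consistent with the parser hoisting that shared prefix out of the alternation. Under that reading the leading dot is still required, and the `co.uk` branch only lists what is left after it. The check below tests the behaviour directly instead of relying on the debug format; the names `TLD` and `is_tld` are made up for this example, and the code is written for Python 3.

```python
import re

# Same alternation as in the reported pattern; every alternative starts with "\.".
TLD = re.compile(r"(?:\.com|\.net|\.org|\.co\.uk|\.tv|\.ly)", re.IGNORECASE)

def is_tld(text):
    """Return True only if the whole string is one of the listed suffixes."""
    return TLD.fullmatch(text) is not None

print(is_tld(".co.uk"))  # leading dot present -> True
print(is_tld("co.uk"))   # leading dot missing -> False
print(is_tld(" tv"))     # space where the dot should be -> False
```

If the first `\.` were really being ignored, the second call would be expected to return True.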
msg138239 - Author: Robert Lehmann (lehmannro) * Date: 2011-06-13 12:42
I cannot reproduce either of your findings. Could you provide us with your version information? I am using re version 2.2.1, _sre 2.2.2, Python 2.6.6, on Debian sid. I also tested with Python 2.7.2rc1 (same re version).

>>> import re
>>> re.compile(r"\.co\.uk", re.DEBUG)
literal 46
literal 99
literal 111
literal 46
literal 117
literal 107
<_sre.SRE_Pattern object at 0xb73b0860>
>>> re.compile(r"(^| )((?:[\w\-]{2,}?\.|)(?:[\w\-]{2,}?)(?:\.com|\.net|\.org|\.co\.uk|\.tv|\.ly))", flags = re.IGNORECASE | re.MULTILINE | re.DEBUG).sub("\\1http://\\2", """me and a buddy and his girlfriend were watching tv once and this blabbering idiot starts talking about this scientific study she heard about where they built a fake city and only one guy didn't know that it was a fake. we all paused for a second and i said "the truman show?" and she says "yeah! that was the name of it!" me my buddy and his girlfriend all catch eyes and are baffled at how stupid she was""")
subpattern 1
...
'me and a buddy and his girlfriend were watching tv once...'
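
The same check can be run as a standalone script, away from any application code. This is a minimal sketch for Python 3: the pattern and replacement are the ones from the report, while the names `URLISH` and `linkify` and the shortened sample strings are assumptions made for illustration. `re.DEBUG` is left out so that nothing is printed at compile time.

```python
import re

# Pattern copied from the report, compiled without re.DEBUG.
URLISH = re.compile(
    r"(^| )((?:[\w\-]{2,}?\.|)(?:[\w\-]{2,}?)(?:\.com|\.net|\.org|\.co\.uk|\.tv|\.ly))",
    flags=re.IGNORECASE | re.MULTILINE,
)

def linkify(text):
    """Prefix bare domain-like words with http://, as in the reported command."""
    return URLISH.sub(r"\1http://\2", text)

sample = "me and a buddy and his girlfriend were watching tv once"
print(linkify(sample) == sample)          # True: no literal dot, so no replacement
print(linkify("see example.co.uk now"))   # see http://example.co.uk now
```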
msg138240 - Author: Cal Leeming (Cal.Leeming) Date: 2011-06-13 12:53
Oh jeez, you're going to think I'm such an idiot. I just ran a completely fresh test in the CLI (away from the original source), and the issue disappeared (apparently it was caused by caching).

I'm really sorry to have bothered you guys, I should have thought and tested this outside the original code first. I'll make sure to do this before posting any bugs in the future.

Thank you for your extremely fast response though!

Cal
History
Date                 User            Action  Args
2022-04-11 14:57:18  admin           set     github: 56534
2011-06-13 14:32:57  r.david.murray  set     status: open -> closed; resolution: not a bug; stage: resolved
2011-06-13 12:53:44  Cal.Leeming     set     messages: + msg138240
2011-06-13 12:42:59  lehmannro       set     nosy: + lehmannro; messages: + msg138239
2011-06-13 12:30:52  Cal.Leeming     set     messages: + msg138236
2011-06-13 12:25:10  Cal.Leeming     set     type: behavior
2011-06-13 12:24:51  Cal.Leeming     create