classification
Title: re fails to match ^ when start index is specified ?
Type:
Stage: resolved
Components: Regular Expressions
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: bsdphk, ezio.melotti, mrabarnett, ned.deily
Priority: normal
Keywords:

Created on 2013-01-05 09:21 by bsdphk, last changed 2013-01-05 18:02 by mrabarnett. This issue is now closed.

Messages (5)
msg179116 - Author: Poul-Henning Kamp (bsdphk) Date: 2013-01-05 09:21
I'm surprised that this does not find any matches:

    import re
    r = re.compile("^abc")
    s = "0123abcxyz"
    for i in range(0,len(s)):
        print(i, r.search(s, i))

I would have expected the i == 4 case to match.

(This is on:
Python 2.7.3 (default, Dec 14 2012, 02:46:02) 
[GCC 4.2.1 Compatible FreeBSD Clang 3.2 (branches/release_32 168974)] on freebsd10
)
msg179117 - Author: Ned Deily (ned.deily) (Python committer) Date: 2013-01-05 10:01
Note the warning about '^' in the documentation for the re search method:

"The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start."

http://docs.python.org/2/library/re.html#re.RegexObject.search
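
The difference the documentation describes can be seen in a small sketch. The variable names are illustrative only; the pattern and string are the ones from the original report, and the multi-line string is an added assumption to show the "just after a newline" case.

```python
import re

text = "0123abcxyz"
anchored = re.compile("^abc")

# pos=4 starts scanning at index 4, but '^' still means index 0 of text.
with_pos = anchored.search(text, 4)

# Slicing builds a new string whose real beginning is the old index 4.
with_slice = anchored.search(text[4:])

# With re.MULTILINE, '^' also matches just after a newline, even when pos > 0.
multiline = re.compile("^abc", re.MULTILINE).search("0123\nabcxyz", 5)

print(with_pos)             # None
print(with_slice.span())    # (0, 3)
print(multiline.span())     # (5, 8)
```

Note that the slice reports offsets relative to the new string, while the pos form reports offsets relative to the original.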
msg179118 - Author: Ned Deily (ned.deily) (Python committer) Date: 2013-01-05 10:22
To expand a bit, rather than multiple calls to search, you can use the start and end methods of the match object to determine where the string (without the '^' anchor) matches.  For example:

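    import re
    r = re.compile("abc")   # no '^' anchor; search() finds the first occurrence
    s = "0123abcxyz"
    match = r.search(s)
    if match:
        # start() and end() give the offsets of the match within s
        print(match.start(), match.end())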
msg179121 - Author: Poul-Henning Kamp (bsdphk) Date: 2013-01-05 12:58
I have tried hard to figure out why you chose the semantics for ^ that you mention, and to come up with a plausible use case for them, but I have utterly failed at both.

I find it distinctly counterintuitive.

I think the definition of ^ and $ that complies with the Principle of Least Astonishment would be that they match the start and end of the string offered for matching, i.e. taking start+end into account.

The real use-case behind this is searching through a mmap'ed database file, for a particular regexp in a particular field of the records, with the minimum amount of copying.

The semantics you mention make ^ and $ useless in this scenario and, as far as I can tell, in any other scenario involving start+end arguments.
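
One possible way to get anchors that respect the field boundaries without copying is sketched below. It assumes Python 3, where re accepts bytes-like objects, and it uses a bytes literal as a stand-in for the mmap'ed file; the field offsets are hypothetical. This was not proposed in the thread itself.

```python
import re

# Stand-in for an mmap'ed file; this is an assumption for illustration.
data = b"0123abcxyz"
field_start, field_end = 4, 7   # hypothetical field boundaries

anchored = re.compile(b"^abc$")

# Slicing a memoryview does not copy the underlying bytes.
field = memoryview(data)[field_start:field_end]
m = anchored.search(field)

# Offsets are relative to the slice, so add field_start to map them back.
absolute_start = field_start + m.start()
print(absolute_start)
```

Because the view is a new object whose beginning is the field start, ^ and $ anchor to the field rather than to the whole buffer.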
msg179132 - Author: Matthew Barnett (mrabarnett) (Python triager) Date: 2013-01-05 18:02
The semantics of '^' are common to many different regex implementations, including those of Perl and C#.

The 'pos' argument merely gives the starting position of the search (C# also lets you provide a starting position, and behaves in exactly the same way).

Perhaps you should be using 'match' instead.
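
A minimal sketch of that suggestion, reusing the string from the original report: match() only succeeds if the pattern matches starting exactly at pos, so the '^' anchor can be dropped. The list name hits is illustrative.

```python
import re

pattern = re.compile("abc")   # no '^': match() already anchors at pos
text = "0123abcxyz"

# Collect every index at which the pattern matches starting exactly there.
hits = [i for i in range(len(text)) if pattern.match(text, i)]
print(hits)  # [4]
```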
History
Date                 User        Action  Args
2013-01-05 18:02:24  mrabarnett  set     messages: + msg179132
2013-01-05 12:58:44  bsdphk      set     messages: + msg179121
2013-01-05 10:22:02  ned.deily   set     messages: + msg179118
2013-01-05 10:01:41  ned.deily   set     status: open -> closed; nosy: + ned.deily; messages: + msg179117; resolution: not a bug; stage: resolved
2013-01-05 09:21:55  bsdphk      create