classification
Title: Proposal: add re.fullmatch() method
Type: enhancement Stage: committed/rejected
Components: Regular Expressions Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezberch, ezio.melotti, georg.brandl, gvanrossum, mrabarnett, pitrou, python-dev, serhiy.storchaka, tim.peters
Priority: normal Keywords: patch

Created on 2012-10-11 21:30 by gvanrossum, last changed 2013-11-23 22:46 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
issue16203.patch ezberch, 2012-11-27 22:19 patch, with doc and tests review
issue16203_mrab.patch mrabarnett, 2013-02-04 22:39 review
issue16203_mrab_2.patch georg.brandl, 2013-10-13 07:20 review
issue16203_mrab_3.patch serhiy.storchaka, 2013-10-26 12:33 review
Messages (36)
msg172695 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-10-11 21:30
I've noticed a subtle bug in some of our internal code.  Someone wants to ensure that a certain string (e.g. a URL path) matches a certain pattern in its entirety.  They use re.match() with a regex ending in $.  Fine.  Now someone else comes along and modifies the pattern.  Somehow the $ gets lost, or the pattern develops a set of toplevel choices that don't all end in $.  And now things that have a valid *prefix* suddenly (and unintentionally) start matching.

I think this is a common enough issue and propose a new API: a fullmatch() function and method that work just like the existing match() function and method but also check that the whole input string matches.  This can be implemented slightly awkwardly as follows in user code:

def fullmatch(regex, input, flags=0):
  m = re.match(regex, input, flags)
  if m is not None and m.end() == len(input):
    return m
  return None

(The corresponding method will have to be somewhat more complex because the underlying match() method takes optional pos and endpos arguments.)
msg172696 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-10-11 21:45
+1.  Note that this really can't be done in user-level code.  For example, consider matching the pattern

a|ab

against the string

ab

Without being _forced_ to consider the "ab" branch, the regexp will match just the "a" branch.  So, e.g., the example code you posted will say "nope, it didn't match (the whole thing)".
msg172697 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-11 21:46
What will be with non-greedy qualifiers? Should '.*?' full match any string?

>>> re.match('.*?$', 'abc').group()
'abc'
>>> re.match('.*?', 'abc').group()
''
msg172698 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-11 21:58
> Note that this really can't be done in user-level code.

Well, how about:

def fullmatch(regex, input, flags=0):
    return re.match("(:?" + regex + ")$", input, flags)
msg172704 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-10-11 22:42
Antoine, that's certainly the conceptual intent here.  Can't say whether your attempt works in all cases.  The docs don't guarantee it.  For example, if the original regexp started with (?x), the docs explicitly say the effect of (?x) is undefined "if there are non-whitespace characters before the [inline (?x)] flag".

Sure, you could parse the regexp is user code too, and move an initial (?...x...) before your non-capturing group.  For that matter, you could write your own regexp engine in user code too ;-)

The point is that it should be easy for the regexp engine to implement the desired functionality - and user attempts to "fake it" have pitfalls (even Guido didn't get it right - LOL ;-) ).
msg172775 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-10-12 19:45
'$' will match at the end of the string or just before the final '\n':

>>> re.match(r'abc$', 'abc\n')
<_sre.SRE_Match object at 0x00F15448>

So shouldn't you be using r'\Z' instead?

>>> re.match(r'abc\Z', 'abc')
<_sre.SRE_Match object at 0x00F15410>
>>> re.match(r'abc\Z', 'abc\n')
>>> 

And what happens if the MULTILINE flag is turned on?

>>> re.match(r'abc$', 'abc\ndef', flags=re.MULTILINE)
<_sre.SRE_Match object at 0x00F15448>
>>> re.match(r'abc\Z', 'abc\ndef', flags=re.MULTILINE)
>>>
msg172796 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-10-13 05:28
Matthew, Guido wrote "check that the whole input string matches" (or slice if pos and (possibly also) endpos is/are given).  So, yes, \Z is more to the point than $ if people want to continue wasting time trying to implement this as a Python-level function ;-)

I don't understand what you're asking about MULTILINE.  What's the issue there?  Focus on Guido's "whole input string matches", not on his motivational talk about "a regex ending in $".  $ and/or \Z aren't the point here; "whole input string matches" is the point.
msg172816 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-10-13 17:25
Tim, my point is that if the MULTILINE flag happens to be turned on, '$' won't just match at the end of the string (or slice), it'll also match at a newline, so wrapping the pattern in (?:...)$ in that case could give the wrong answer, but wrapping it in (?:...)\Z would give the right answer.
msg172819 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-13 17:45
> Tim, my point is that if the MULTILINE flag happens to be turned on,
> '$' won't just match at the end of the string (or slice), it'll also
> match at a newline, so wrapping the pattern in (?:...)$ in that case
> could give the wrong answer, but wrapping it in (?:...)\Z would give
> the right answer.

This means Tim and Guido are right that a dedicated fullmatch() method
is desireable.
msg172822 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-13 18:07
Definitely this is not easy issue.
msg172844 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-10-14 02:53
Serhiy, I expect this is easy to implement _inside_ the regexp engine.  The complications come from trying to do it outside the engine.  But even there, wrapping the original regexp <re> in

(?:<re>)\Z

is at worst very close.  The only insecurity with that I've thought of concerns the doc's warnings about what can appear before an inline re.VERBOSE flag.  It probably works fine even if <re> does begin with  (?...x....).
msg172845 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-10-14 03:11
It certainly appears to ignore the whitespace, even if the "(?x)" is at the end of the pattern or in the middle of a group.

Another point we need to consider is that the user might want to use a pre-compiled pattern.
msg173110 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-10-16 23:20
I'm about to add this to my regex implementation and, naturally, I want it to have the same name for compatibility.

However, I'm not that keen on "fullmatch" and would prefer "matchall" instead.

What do you think?
msg173113 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2012-10-16 23:45
I like "matchall" fine, but I can't channel Guido on names - he sometimes gets those wrong ;-)
msg173114 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-10-16 23:49
re.matchall() would appear to be related to re.findall(), which it isn't.

The re2 package has a FullMatch method:
http://code.google.com/p/re2/wiki/CplusplusAPI
msg173116 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-10-17 00:23
re2's FullMatch method contrasts with its PartialMatch method, which re doesn't have!
msg173121 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2012-10-17 02:02
But my other argument stands.
msg173122 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2012-10-17 02:17
OK, in order to avoid bikeshedding, "fullmatch" it is.
msg173206 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-17 21:16
FWIW, I prefer "fullmatch" as well :)
msg176498 - (view) Author: Ezra Berch (ezberch) Date: 2012-11-27 22:19
Patch attached. I've taken a slightly different approach than what has been discussed here: rather than define a new fullmatch() function and method, I've defined a new re.FULLMATCH flag for match(). So an example would be

re.match('abc','abc',re.FULLMATCH)

The implementation is basically what has been discussed here, except done when the regular expression is compiled rather than at the user level.
msg177553 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-12-15 19:38
Thanks for the patch. While an internal flag may be a reasonable implementation strategy, IMHO a dedicated method still makes sense: it's simply more readable than passing a flag.

As for the tests, they should probably exercise the interaction with re.MULTILINE - see MRAB's comment in msg172775.
msg181395 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2013-02-04 22:39
I've attached a patch.
msg181451 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-05 16:18
I did not analyze the patch deeply, only left on Rietveld comments on first sight. Need to update the documentation.
msg181468 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-05 19:15
The patch doesn't seem to include failure cases for fullmatch (i.e. cases where fullmatch returns None where match or search would return a match).
msg181496 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2013-02-06 02:12
3 of the tests expect None when using 'fullmatch'; they won't return None when using 'match'.
msg181552 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-02-06 18:27
> 3 of the tests expect None when using 'fullmatch'; they won't return 
> None when using 'match'.

Sorry, my bad. Like Serhiy, I can't comment on the changes to re internals, but we can trust you on this. The patch needs documentation, though.
msg181565 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-06 19:25
I can't comment right now, but I am going inspect thoroughly re internals. 
This is a new feature and we have enough time before 3.4 freeze.
msg199620 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-12 20:35
Serhiy, sorry to ping you, but do you think you're gonna look at this?
msg199661 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-10-13 07:20
I updated the patch to current tip, fixed three issues from the review, and added documentation updates.
msg199663 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-13 07:31
New changeset b51218966201 by Georg Brandl in branch 'default':
Add re.fullmatch() function and regex.fullmatch() method, which anchor the
http://hg.python.org/cpython/rev/b51218966201
msg199664 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-10-13 07:32
Sorry, accidental push, already reverted.
msg201340 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-26 12:32
Patch updated to current tip. I have added some changes from the review and have added some tests.

Matthew, why change for SRE_OP_REPEAT_ONE is needed? Tests are passed without it.
msg203059 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-16 16:30
Matthew, could you please answer my question?
msg203074 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2013-11-16 18:44
I don't know that it's not needed.
msg204106 - (view) Author: Roundup Robot (python-dev) Date: 2013-11-23 21:20
New changeset 89dfa2671c83 by Serhiy Storchaka in branch 'default':
Issue #16203: Add re.fullmatch() function and regex.fullmatch() method,
http://hg.python.org/cpython/rev/89dfa2671c83
msg204107 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-23 21:23
Committed with additional test (re.fullmatch('a+', 'ab')) which proves that change for SRE_OP_REPEAT_ONE are needed.

Thank you Matthew for your contribution.
History
Date User Action Args
2013-11-23 22:46:48serhiy.storchakasetstatus: open -> closed
2013-11-23 21:23:53serhiy.storchakasetresolution: fixed
stage: patch review -> committed/rejected
2013-11-23 21:23:21serhiy.storchakasetmessages: + msg204107
2013-11-23 21:20:53python-devsetmessages: + msg204106
2013-11-23 21:03:16serhiy.storchakasetassignee: serhiy.storchaka
2013-11-16 18:44:05mrabarnettsetmessages: + msg203074
2013-11-16 16:30:15serhiy.storchakasetmessages: + msg203059
2013-10-26 12:33:09serhiy.storchakasetkeywords: - easy
2013-10-26 12:33:02serhiy.storchakasetfiles: + issue16203_mrab_3.patch
2013-10-26 12:32:39serhiy.storchakasetmessages: + msg201340
2013-10-13 13:01:02ncoghlansetstage: committed/rejected -> patch review
2013-10-13 07:32:54georg.brandlsetstatus: closed -> open
resolution: fixed -> (no value)
messages: + msg199664
2013-10-13 07:31:12python-devsetstatus: open -> closed

nosy: + python-dev
messages: + msg199663

resolution: fixed
stage: patch review -> committed/rejected
2013-10-13 07:20:02georg.brandlsetfiles: + issue16203_mrab_2.patch
nosy: + georg.brandl
messages: + msg199661

2013-10-12 20:35:37pitrousetmessages: + msg199620
2013-02-06 19:25:58serhiy.storchakasetmessages: + msg181565
2013-02-06 18:27:15pitrousetmessages: + msg181552
2013-02-06 02:12:23mrabarnettsetmessages: + msg181496
2013-02-05 19:15:32pitrousetmessages: + msg181468
2013-02-05 16:18:33serhiy.storchakasetmessages: + msg181451
2013-02-04 22:39:42mrabarnettsetfiles: + issue16203_mrab.patch

messages: + msg181395
2012-12-15 19:38:52pitrousetmessages: + msg177553
stage: needs patch -> patch review
2012-11-27 22:19:10ezberchsetfiles: + issue16203.patch

nosy: + ezberch
messages: + msg176498

keywords: + patch
2012-11-12 08:55:41serhiy.storchakasetstage: needs patch
2012-10-18 08:33:48ezio.melottisetnosy: + ezio.melotti
components: + Regular Expressions
2012-10-17 21:16:26pitrousetmessages: + msg173206
2012-10-17 02:17:00mrabarnettsetmessages: + msg173122
2012-10-17 02:02:07gvanrossumsetmessages: + msg173121
2012-10-17 00:23:53mrabarnettsetmessages: + msg173116
2012-10-16 23:49:27gvanrossumsetmessages: + msg173114
2012-10-16 23:45:39tim.peterssetmessages: + msg173113
2012-10-16 23:20:01mrabarnettsetmessages: + msg173110
2012-10-14 03:11:51mrabarnettsetmessages: + msg172845
2012-10-14 02:53:14tim.peterssetmessages: + msg172844
2012-10-13 18:07:00serhiy.storchakasetmessages: + msg172822
2012-10-13 17:45:43pitrousetmessages: + msg172819
2012-10-13 17:25:51mrabarnettsetmessages: + msg172816
2012-10-13 05:28:21tim.peterssetmessages: + msg172796
2012-10-12 19:45:41mrabarnettsetnosy: + mrabarnett
messages: + msg172775
2012-10-11 22:42:52tim.peterssetmessages: + msg172704
2012-10-11 21:58:57pitrousetnosy: + pitrou
messages: + msg172698
2012-10-11 21:46:54serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg172697
2012-10-11 21:45:56tim.peterssetnosy: + tim.peters
messages: + msg172696
2012-10-11 21:30:41gvanrossumcreate