Proposal: add re.fullmatch() method #60407

gvanrossum · 2012-10-11T21:30:42Z

BPO	16203
Nosy	@gvanrossum, @tim-one, @birkenfeld, @pitrou, @ezio-melotti, @serhiy-storchaka
Files	issue16203.patch: patch, with doc and tests issue16203_mrab.patch issue16203_mrab_2.patch issue16203_mrab_3.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2013-11-23.22:46:48.282>
created_at = <Date 2012-10-11.21:30:41.951>
labels = ['expert-regex', 'type-feature']
title = 'Proposal: add re.fullmatch() method'
updated_at = <Date 2013-11-23.22:46:48.282>
user = 'https://github.com/gvanrossum'

bugs.python.org fields:

activity = <Date 2013-11-23.22:46:48.282>
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2013-11-23.22:46:48.282>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2012-10-11.21:30:41.951>
creator = 'gvanrossum'
dependencies = []
files = ['28140', '28955', '32076', '32370']
hgrepos = []
issue_num = 16203
keywords = ['patch']
message_count = 36.0
messages = ['172695', '172696', '172697', '172698', '172704', '172775', '172796', '172816', '172819', '172822', '172844', '172845', '173110', '173113', '173114', '173116', '173121', '173122', '173206', '176498', '177553', '181395', '181451', '181468', '181496', '181552', '181565', '199620', '199661', '199663', '199664', '201340', '203059', '203074', '204106', '204107']
nosy_count = 9.0
nosy_names = ['gvanrossum', 'tim.peters', 'georg.brandl', 'pitrou', 'ezio.melotti', 'mrabarnett', 'python-dev', 'ezberch', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue16203'
versions = ['Python 3.4']

gvanrossum · 2012-10-11T21:30:42Z

I've noticed a subtle bug in some of our internal code. Someone wants to ensure that a certain string (e.g. a URL path) matches a certain pattern in its entirety. They use re.match() with a regex ending in $. Fine. Now someone else comes along and modifies the pattern. Somehow the $ gets lost, or the pattern develops a set of toplevel choices that don't all end in $. And now things that have a valid *prefix* suddenly (and unintentionally) start matching.

I think this is a common enough issue and propose a new API: a fullmatch() function and method that work just like the existing match() function and method but also check that the whole input string matches. This can be implemented slightly awkwardly as follows in user code:

def fullmatch(regex, input, flags=0):
  m = re.match(regex, input, flags)
  if m is not None and m.end() == len(input):
    return m
  return None

(The corresponding method will have to be somewhat more complex because the underlying match() method takes optional pos and endpos arguments.)

tim-one · 2012-10-11T21:45:56Z

+1. Note that this really can't be done in user-level code. For example, consider matching the pattern

a|ab

against the string

ab

Without being _forced_ to consider the "ab" branch, the regexp will match just the "a" branch. So, e.g., the example code you posted will say "nope, it didn't match (the whole thing)".

serhiy-storchaka · 2012-10-11T21:46:54Z

What will be with non-greedy qualifiers? Should '.*?' full match any string?

>>> re.match('.*?$', 'abc').group()
'abc'
>>> re.match('.*?', 'abc').group()
''

pitrou · 2012-10-11T21:58:57Z

Note that this really can't be done in user-level code.

Well, how about:

def fullmatch(regex, input, flags=0):
    return re.match("(:?" + regex + ")$", input, flags)

tim-one · 2012-10-11T22:42:52Z

Antoine, that's certainly the conceptual intent here. Can't say whether your attempt works in all cases. The docs don't guarantee it. For example, if the original regexp started with (?x), the docs explicitly say the effect of (?x) is undefined "if there are non-whitespace characters before the [inline (?x)] flag".

Sure, you could parse the regexp is user code too, and move an initial (?...x...) before your non-capturing group. For that matter, you could write your own regexp engine in user code too ;-)

The point is that it should be easy for the regexp engine to implement the desired functionality - and user attempts to "fake it" have pitfalls (even Guido didn't get it right - LOL ;-) ).

mrabarnett · 2012-10-12T19:45:42Z

'$' will match at the end of the string or just before the final '\n':

>>> re.match(r'abc$', 'abc\n')
<_sre.SRE_Match object at 0x00F15448>

So shouldn't you be using r'\Z' instead?

>>> re.match(r'abc\Z', 'abc')
<_sre.SRE_Match object at 0x00F15410>
>>> re.match(r'abc\Z', 'abc\n')
>>>

And what happens if the MULTILINE flag is turned on?

>>> re.match(r'abc$', 'abc\ndef', flags=re.MULTILINE)
<_sre.SRE_Match object at 0x00F15448>
>>> re.match(r'abc\Z', 'abc\ndef', flags=re.MULTILINE)
>>>

tim-one · 2012-10-13T05:28:21Z

Matthew, Guido wrote "check that the whole input string matches" (or slice if pos and (possibly also) endpos is/are given). So, yes, \Z is more to the point than $ if people want to continue wasting time trying to implement this as a Python-level function ;-)

I don't understand what you're asking about MULTILINE. What's the issue there? Focus on Guido's "whole input string matches", not on his motivational talk about "a regex ending in $". $ and/or \Z aren't the point here; "whole input string matches" is the point.

mrabarnett · 2012-10-13T17:25:52Z

Tim, my point is that if the MULTILINE flag happens to be turned on, '$' won't just match at the end of the string (or slice), it'll also match at a newline, so wrapping the pattern in (?:...)$ in that case could give the wrong answer, but wrapping it in (?:...)\Z would give the right answer.

pitrou · 2012-10-13T17:45:44Z

Tim, my point is that if the MULTILINE flag happens to be turned on,
'$' won't just match at the end of the string (or slice), it'll also
match at a newline, so wrapping the pattern in (?:...)$ in that case
could give the wrong answer, but wrapping it in (?:...)\Z would give
the right answer.

This means Tim and Guido are right that a dedicated fullmatch() method
is desireable.

serhiy-storchaka · 2012-10-13T18:07:00Z

Definitely this is not easy issue.

tim-one · 2012-10-14T02:53:14Z

Serhiy, I expect this is easy to implement _inside_ the regexp engine. The complications come from trying to do it outside the engine. But even there, wrapping the original regexp <re> in

(?:<re>)\Z

is at worst very close. The only insecurity with that I've thought of concerns the doc's warnings about what can appear before an inline re.VERBOSE flag. It probably works fine even if <re> does begin with (?...x....).

mrabarnett · 2012-10-14T03:11:51Z

It certainly appears to ignore the whitespace, even if the "(?x)" is at the end of the pattern or in the middle of a group.

Another point we need to consider is that the user might want to use a pre-compiled pattern.

mrabarnett · 2012-10-16T23:20:01Z

I'm about to add this to my regex implementation and, naturally, I want it to have the same name for compatibility.

However, I'm not that keen on "fullmatch" and would prefer "matchall" instead.

What do you think?

tim-one · 2012-10-16T23:45:39Z

I like "matchall" fine, but I can't channel Guido on names - he sometimes gets those wrong ;-)

gvanrossum · 2012-10-16T23:49:28Z

re.matchall() would appear to be related to re.findall(), which it isn't.

The re2 package has a FullMatch method:
http://code.google.com/p/re2/wiki/CplusplusAPI

mrabarnett · 2012-10-17T00:23:53Z

re2's FullMatch method contrasts with its PartialMatch method, which re doesn't have!

gvanrossum · 2012-10-17T02:02:07Z

But my other argument stands.

mrabarnett · 2012-10-17T02:17:00Z

OK, in order to avoid bikeshedding, "fullmatch" it is.

pitrou · 2012-10-17T21:16:27Z

FWIW, I prefer "fullmatch" as well :)

ezberch · 2012-11-27T22:19:07Z

Patch attached. I've taken a slightly different approach than what has been discussed here: rather than define a new fullmatch() function and method, I've defined a new re.FULLMATCH flag for match(). So an example would be

re.match('abc','abc',re.FULLMATCH)

The implementation is basically what has been discussed here, except done when the regular expression is compiled rather than at the user level.

pitrou · 2012-12-15T19:38:52Z

Thanks for the patch. While an internal flag may be a reasonable implementation strategy, IMHO a dedicated method still makes sense: it's simply more readable than passing a flag.

As for the tests, they should probably exercise the interaction with re.MULTILINE - see MRAB's comment in msg172775.

mrabarnett · 2013-02-04T22:39:42Z

I've attached a patch.

serhiy-storchaka · 2013-02-05T16:18:34Z

I did not analyze the patch deeply, only left on Rietveld comments on first sight. Need to update the documentation.

pitrou · 2013-02-05T19:15:32Z

The patch doesn't seem to include failure cases for fullmatch (i.e. cases where fullmatch returns None where match or search would return a match).

mrabarnett · 2013-02-06T02:12:23Z

3 of the tests expect None when using 'fullmatch'; they won't return None when using 'match'.

pitrou · 2013-02-06T18:27:16Z

3 of the tests expect None when using 'fullmatch'; they won't return
None when using 'match'.

Sorry, my bad. Like Serhiy, I can't comment on the changes to re internals, but we can trust you on this. The patch needs documentation, though.

serhiy-storchaka · 2013-02-06T19:25:59Z

I can't comment right now, but I am going inspect thoroughly re internals.
This is a new feature and we have enough time before 3.4 freeze.

pitrou · 2013-10-12T20:35:38Z

Serhiy, sorry to ping you, but do you think you're gonna look at this?

birkenfeld · 2013-10-13T07:20:01Z

I updated the patch to current tip, fixed three issues from the review, and added documentation updates.

python-dev · 2013-10-13T07:31:13Z

New changeset b51218966201 by Georg Brandl in branch 'default':
Add re.fullmatch() function and regex.fullmatch() method, which anchor the
http://hg.python.org/cpython/rev/b51218966201

birkenfeld · 2013-10-13T07:32:55Z

Sorry, accidental push, already reverted.

serhiy-storchaka · 2013-10-26T12:32:40Z

Patch updated to current tip. I have added some changes from the review and have added some tests.

Matthew, why change for SRE_OP_REPEAT_ONE is needed? Tests are passed without it.

serhiy-storchaka · 2013-11-16T16:30:16Z

Matthew, could you please answer my question?

mrabarnett · 2013-11-16T18:44:06Z

I don't know that it's not needed.

python-dev · 2013-11-23T21:20:54Z

New changeset 89dfa2671c83 by Serhiy Storchaka in branch 'default':
Issue bpo-16203: Add re.fullmatch() function and regex.fullmatch() method,
http://hg.python.org/cpython/rev/89dfa2671c83

serhiy-storchaka · 2013-11-23T21:23:21Z

Committed with additional test (re.fullmatch('a+', 'ab')) which proves that change for SRE_OP_REPEAT_ONE are needed.

Thank you Matthew for your contribution.

gvanrossum added easy type-feature A feature request or enhancement labels Oct 11, 2012

ezio-melotti added the topic-regex label Oct 18, 2012

python-dev mannequin closed this as completed Oct 13, 2013

birkenfeld reopened this Oct 13, 2013

serhiy-storchaka removed the easy label Oct 26, 2013

serhiy-storchaka self-assigned this Nov 23, 2013

serhiy-storchaka closed this as completed Nov 23, 2013

ezio-melotti transferred this issue from another repository Apr 10, 2022

Proposal: add re.fullmatch() method #60407

Proposal: add re.fullmatch() method #60407

Comments

gvanrossum commented Oct 11, 2012

gvanrossum commented Oct 11, 2012

tim-one commented Oct 11, 2012

serhiy-storchaka commented Oct 11, 2012

pitrou commented Oct 11, 2012

tim-one commented Oct 11, 2012

mrabarnett mannequin commented Oct 12, 2012

tim-one commented Oct 13, 2012

mrabarnett mannequin commented Oct 13, 2012

pitrou commented Oct 13, 2012

serhiy-storchaka commented Oct 13, 2012

tim-one commented Oct 14, 2012

mrabarnett mannequin commented Oct 14, 2012

mrabarnett mannequin commented Oct 16, 2012

tim-one commented Oct 16, 2012

gvanrossum commented Oct 16, 2012

mrabarnett mannequin commented Oct 17, 2012

gvanrossum commented Oct 17, 2012

mrabarnett mannequin commented Oct 17, 2012

pitrou commented Oct 17, 2012

ezberch mannequin commented Nov 27, 2012

pitrou commented Dec 15, 2012

mrabarnett mannequin commented Feb 4, 2013

serhiy-storchaka commented Feb 5, 2013

pitrou commented Feb 5, 2013

mrabarnett mannequin commented Feb 6, 2013

pitrou commented Feb 6, 2013

serhiy-storchaka commented Feb 6, 2013

pitrou commented Oct 12, 2013

birkenfeld commented Oct 13, 2013

python-dev mannequin commented Oct 13, 2013

birkenfeld commented Oct 13, 2013

serhiy-storchaka commented Oct 26, 2013

serhiy-storchaka commented Nov 16, 2013

mrabarnett mannequin commented Nov 16, 2013

python-dev mannequin commented Nov 23, 2013

serhiy-storchaka commented Nov 23, 2013