classification
Title: MatchObject should offer __getitem__()
Type: enhancement
Stage: resolved
Components: Regular Expressions
Versions: Python 3.4
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Improve the usability of the match object named group API (issue 24454)
Assigned To:
Nosy List: berker.peksag, brandon-rhodes, christian.heimes, ezio.melotti, gward, mrabarnett, serhiy.storchaka
Priority: normal
Keywords:

Created on 2013-11-09 15:15 by brandon-rhodes, last changed 2016-06-27 06:41 by berker.peksag. This issue is now closed.

Messages (6)
msg202480 - Author: Brandon Rhodes (brandon-rhodes) Date: 2013-11-09 15:15
Regular expression re.MatchObject objects are, conceptually, sequences.
They contain at least one “group” string, possibly more,
which are integer-indexed starting at zero.
Today, groups can be accessed in one of two ways.

(1) You can call the method match.group(N).

(2) You can call glist = match.groups()
    and then access each group as glist[N-1].
    Note the obvious off-by-one error:
    .groups() does not include “group zero”,
    which contains the entire match,
    and therefore its indexes are off-by-one
    from the values you would pass to .group().
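
The relationship between the two access styles can be seen in a short sketch. The date string is a made-up sample, and the pattern is the one used later in this message:

```python
import re

# Hypothetical sample input, chosen only for illustration.
match = re.match(r'(\d\d\d\d)/(\d\d)/(\d\d)', '2013/11/09')

whole = match.group(0)    # group zero: the entire match
year = match.group(1)     # first parenthesized group
glist = match.groups()    # tuple of groups 1..N, without group zero

# For every N >= 1, glist[N-1] holds the same string as match.group(N).
pairs = [(match.group(n), glist[n - 1]) for n in range(1, len(glist) + 1)]
```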

I propose that MatchObject gain a __getitem__(N) method
whose return value for every N is the same as .group(N)
as I think that match[N] is a quite obvious syntax for
asking for one particular group of an RE match.
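
A minimal sketch of the proposed semantics, written as a wrapper class so that it runs on interpreters whose match objects lack indexing (Python 3.6 and later support match[N] directly). The class name IndexableMatch and the sample string are assumptions for illustration, not part of the re module:

```python
import re


class IndexableMatch:
    """Hypothetical wrapper; not part of the re module."""

    def __init__(self, match):
        self._match = match

    def __getitem__(self, n):
        # Delegate to .group(), so m[n] == match.group(n) for every valid n.
        return self._match.group(n)


# Sample input is made up for illustration.
m = IndexableMatch(re.match(r'(\d\d\d\d)/(\d\d)/(\d\d)', '2013/11/09'))
whole, month = m[0], m[2]
```

Because .group() raises IndexError for a group number that does not exist, the wrapper behaves the same way for out-of-range indexes.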

The only objection I can see to this proposal
is the obvious asymmetry between Group Zero and all
subsequent groups of a regular expression pattern:
zero means “the whole thing” whereas each of the others
holds the content of a particular explicit set of parens.
Looping over the elements match[0], match[1], ... of a
pattern like this:

    r'(\d\d\d\d)/(\d\d)/(\d\d)'

will give you *first* the *entire* match, and only then
turn its attention to the three parenthesized substrings.
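
The ordering described here can be reproduced with the existing .group() API. This is only a sketch with a made-up input string:

```python
import re

# Hypothetical sample input, chosen only for illustration.
match = re.match(r'(\d\d\d\d)/(\d\d)/(\d\d)', '2013/11/09')

# match.re.groups is the number of parenthesized groups in the pattern,
# so the valid group numbers run from 0 through match.re.groups.
ordered = [match.group(n) for n in range(match.re.groups + 1)]
# ordered == ['2013/11/09', '2013', '11', '09']
```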

My retort is that concentric groups can happen anyway:
that Group Zero, holding the entire match, is not really
as special as the newcomer might suspect, because you can
always wind up with groups inside of other groups; it is
simply part of the semantics of regular expressions that
groups might overlap or might contain one another, as in:

    r'((\d\d)/(\d\d)) Description: (.*)'

Here, we see that concentricity is not a special property
of Group Zero, but in fact something that can happen quite
naturally with other groups.
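
A short sketch of that nesting, using the pattern above and a made-up input string. The spans show that groups 2 and 3 lie inside group 1 in the same way that every group lies inside group zero:

```python
import re

# Hypothetical sample input, chosen only for illustration.
text = '11/09 Description: release day'
match = re.match(r'((\d\d)/(\d\d)) Description: (.*)', text)

count = match.re.groups + 1                      # groups 0 through 4
groups = [match.group(n) for n in range(count)]
spans = [match.span(n) for n in range(count)]    # (start, end) per group


def contains(outer, inner):
    """Return True if span `inner` lies within span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Group 1 encloses groups 2 and 3; group 0 encloses all the others.
nested_in_one = contains(spans[1], spans[2]) and contains(spans[1], spans[3])
nested_in_zero = all(contains(spans[0], s) for s in spans[1:])
```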

The caller simply needs to imagine every regular expression
being surrounded by an “automatic set of parentheses” to
understand where Group Zero comes from, and how it will be
ordered in the resulting sequence of groups relative to
the subordinate groups within the string.

If one or two people voice agreement here in this issue,
I will be very happy to offer a patch.
msg202481 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-11-09 15:29
This is something that the regex module already has, and since it is/was supposed to replace the re module in the stdlib, I've been holding off on adding it to re for a long time.  We also discussed this recently on #python-dev, and I think it's OK to add it, as long as it behaves the same way as it does in the regex module.
If others agree it would be great to do it before the 3.4 feature freeze (there aren't many days left).
msg202488 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2013-11-09 17:02
We discussed this recently on #python-dev, and I don't think it's worth adding indexing to the match object. It would be confusing that len(match) != len(match.groups()). I don't know of any use case for indexing; it doesn't add anything new except one more way to access a group. This feature not only increases the maintenance burden, but also increases the number of things a Python programmer has to learn and remember.
msg202588 - Author: Greg Ward (gward) (Python committer) Date: 2013-11-11 00:00
>>> import this
[...]
There should be one-- and preferably only one --obvious way to do it.
msg202693 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-11-12 14:50
I think the idea is to eventually deprecate the .group() API.
msg269353 - Author: Berker Peksag (berker.peksag) (Python committer) Date: 2016-06-27 06:41
Thanks for the detailed report! Issue 24454 is actually a duplicate of this but it has a patch and the idea was discussed by several core developers there. I'm going to close this one.
History
Date                 User              Action  Args
2016-06-27 06:41:43  berker.peksag     set     status: open -> closed; superseder: Improve the usability of the match object named group API; nosy: + berker.peksag; messages: + msg269353; resolution: duplicate; stage: needs patch -> resolved
2013-11-12 14:50:47  ezio.melotti      set     messages: + msg202693
2013-11-11 00:00:07  gward             set     nosy: + gward; messages: + msg202588
2013-11-09 17:02:41  serhiy.storchaka  set     messages: + msg202488
2013-11-09 15:29:23  ezio.melotti      set     nosy: + christian.heimes, serhiy.storchaka; messages: + msg202481; stage: needs patch
2013-11-09 15:20:29  brandon-rhodes    set     versions: + Python 3.4, - Python 3.5
2013-11-09 15:15:27  brandon-rhodes    create