classification
Title: Improve the usability of the match object named group API
Type: enhancement Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: eric.smith Nosy List: barry, eric.smith, ezio.melotti, irdb, mrabarnett, python-dev, rhettinger, serhiy.storchaka
Priority: low Keywords: easy, needs review, patch

Created on 2015-06-15 03:47 by rhettinger, last changed 2016-10-06 08:43 by irdb. This issue is now closed.

Files
File name Uploaded Description Edit
issue-24454-0.diff eric.smith, 2015-12-07 13:31 review
issue-24454-1.diff eric.smith, 2016-09-10 03:30 review
Messages (19)
msg245361 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-06-15 03:47
The usability, learnability, and readability of match object code would be improved by giving it a more Pythonic API (inspired by ElementTree).

Given a search like:

   data = 'Answer found on row 8 column 12.'
   mo = re.search(r'row (?P<row>\d+) column (?P<col>\d+)', data)

We currently access results with:

  print(mo.group('col'))
  print(len(mo.groups())

A better way would look like this:

  print(mo['col'])
  print(len(mo))

This would work nicely with string formatting:

  print('Located coordinate at (%(row)s, %(col)s)' % mo)
  print('Located coordinate at ({row}, {col})'.format_map(mo))
msg245362 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-15 04:29
You can use mo.groupdict().

  print('Located coordinate at (%(row)s, %(col)s)' % mo.groupdict())
  print('Located coordinate at ({row}, {col})'.format_map(mo.groupdict()))

As for len(mo), this is ambiguous, as well as indexing with integer indices. You suggest len(mo) be equal len(mo.groups()) and this looks reasonable to me, but in regex len(mo) equals to len(mo.groups())+1 (because mo[1] equals to mo.group(1) or mo.groups()[0]). If indexing will work only with named groups, it would be expected that len(mo) will be equal to len(mo.groupdict()).
msg245363 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2015-06-15 04:39
> print(mo['col'])
> print(len(mo))

This has already been discussed in another issue.  As Serhiy mentioned, len(mo) and mo[num] would be ambiguous because of the group 0, but mo[name] might be ok.
msg245375 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-06-15 13:50
I'd definitely be for mo['col']. I can't say I've ever used len(mo.groups()).

I do have lots of code like:
return mo.group('col'), mo.group('row'), mo.group('foo')

Using groupdict there is doable but not great. But:
return mo['col'], mo['row'], mo['foo']
would be a definite improvement.
msg245376 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2015-06-15 14:14
I agree that it would be nice if len(mo) == len(mo.groups()), but Serhiy has explained why that's not the case in the regex module.

The regex module does support mo[name], so:

  print('Located coordinate at (%(row)s, %(col)s)' % mo)
  print('Located coordinate at ({row}, {col})'.format_map(mo))

already work.
msg245379 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-15 14:47
The disadvantage of supporting len() is its ambiguousness. Supporting indexing with group name also has disadvantages (benefits already was mentioned above).

1. Indexing with string keys is a part of mapping protocol, and it would be expected that other parts of the protocol if not all are supported (at least len() and iteration), but they are not.

2. If indexing with group names would be supported, it would be expected the support of integer indexes. But this is ambiguous too.

This feature would improve the access to named groups (6 characters less to type for every case and better readability), but may be implementing access via attributes would be even better? mo.groupnamespace().col or mo.ns.col?
msg245381 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-06-15 15:36
> but may be implementing access via attributes would 
> be even better? mo.groupnamespace().col or mo.ns.col?

The whole point is to eliminate the unnecessary extra level.
Contrast using DOM with using ElementTree.  The difference
in usability and readability is huge.  If you do much in the
way of regex work, this will be a win:

   mo['name'], mo['rank'], mo['serialnumber']

There are several problems with trying to turn this into attribute access.  One of the usual ones are the conflict between the user fieldnames and the actual methods and attributes of the objects (that is why named tuples have the irritating leading underscore for its own attributes and methods).  The other problem is that it interferes with usability when the fieldname is stored in a variable.   Contrast, "fieldname='rank'; print(mo[fieldname])" with "fieldname='rank'; print(getattr(mo, fieldname))".

I'm happy to abandon the "len(mo)" suggestion, but mo[groupname] would be really nice.
msg256030 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-12-06 20:54
I've been playing around with this. My implementation is basically the naive:

def __getitem__(self, value):
    return self.group(value)

I have the following tests passing:

    def test_match_getitem(self):
        pat = re.compile('(?:(?P<a1>a)|(?P<b2>b))(?P<c3>c)?')

        m = pat.match('a')
        self.assertEqual(m['a1'], 'a')
        self.assertEqual(m['b2'], None)
        self.assertEqual(m['c3'], None)
        self.assertEqual(m[0], 'a')
        self.assertEqual(m[1], 'a')
        self.assertEqual(m[2], None)
        self.assertEqual(m[3], None)
        with self.assertRaises(IndexError):
            m['X']

        m = pat.match('ac')
        self.assertEqual(m['a1'], 'a')
        self.assertEqual(m['b2'], None)
        self.assertEqual(m['c3'], 'c')
        self.assertEqual(m[0], 'ac')
        self.assertEqual(m[1], 'a')
        self.assertEqual(m[2], None)
        self.assertEqual(m[3], 'c')
        with self.assertRaises(IndexError):
            m['X']

        # Cannot assign.
        with self.assertRaises(TypeError):
             m[0] = 1

        # No len().
        self.assertRaises(TypeError, len, m)

But because I'm just calling group(), you'll notice a few oddities. Namely:

- indexing with integers works, as would group(i). See the discussion of omitting len().
- for defined but missing group names, you get None instead of KeyError
- for invalid group names, you get IndexError instead of KeyError

I can't decide if these are good (because they're consistent with group()), or bad (because they're surprising. I'm interested in your opinions.

I'll attach the patch when I'm at a better computer.
msg256061 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2015-12-07 13:31
Here's the patch.

I added some more tests, including tests for ''.format_map(). I think the format_map() tests convince me that keeping None for non-matched groups makes sense.
msg275557 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-09-10 03:30
Updated patch, with docs. I'd like to get this in to 3.6. Can anyone take a look?
msg275785 - (view) Author: Roundup Robot (python-dev) Date: 2016-09-11 12:55
New changeset ac0643314d12 by Eric V. Smith in branch 'default':
Issue 24454: Improve the usability of the re match object named group API
https://hg.python.org/cpython/rev/ac0643314d12
msg275788 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 13:19
Please document this feature in What's News Eric.

There is no need to add the __getitem__ method explicitly the the list of methods match_methods. It would be added automatically.
msg275789 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-09-11 13:27
I added a note in Misc/NEWS. Is that not what you mean?

I'll look at match_methods.
msg275790 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 13:40
I meant Doc/whatsnew/3.6.rst.
msg275792 - (view) Author: Roundup Robot (python-dev) Date: 2016-09-11 13:50
New changeset 3265247e08f0 by Eric V. Smith in branch 'default':
Issue 24454: Added whatsnew entry, removed __getitem__ from match_methods. Thanks Serhiy Storchaka.
https://hg.python.org/cpython/rev/3265247e08f0
msg275793 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-09-11 13:51
Fixed. Thanks, Serhiy.
msg275794 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 14:05
./Modules/_sre.c:2425:14: warning: ‘match_getitem_doc’ defined but not used [-
Wunused-variable]
msg275797 - (view) Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2016-09-11 14:18
On 9/11/2016 10:05 AM, Serhiy Storchaka wrote:
>
> Serhiy Storchaka added the comment:
>
> ./Modules/_sre.c:2425:14: warning: ‘match_getitem_doc’ defined but not used [-
> Wunused-variable]

Not in Visual Studio!

Standby.
msg275798 - (view) Author: Roundup Robot (python-dev) Date: 2016-09-11 14:20
New changeset 9eb38e0f1cad by Eric V. Smith in branch 'default':
Issue 24454: Removed unused match_getitem_doc.
https://hg.python.org/cpython/rev/9eb38e0f1cad
History
Date User Action Args
2016-10-06 08:43:48irdbsetnosy: + irdb
2016-09-11 14:20:39python-devsetmessages: + msg275798
2016-09-11 14:18:16eric.smithsetmessages: + msg275797
2016-09-11 14:05:13serhiy.storchakasetmessages: + msg275794
2016-09-11 13:51:17eric.smithsetmessages: + msg275793
2016-09-11 13:50:56python-devsetmessages: + msg275792
2016-09-11 13:40:33serhiy.storchakasetmessages: + msg275790
2016-09-11 13:27:10eric.smithsetmessages: + msg275789
2016-09-11 13:19:28serhiy.storchakasetmessages: + msg275788
2016-09-11 12:59:03eric.smithsetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2016-09-11 12:55:49python-devsetnosy: + python-dev
messages: + msg275785
2016-09-10 03:30:12eric.smithsetfiles: + issue-24454-1.diff

messages: + msg275557
2016-06-27 06:41:43berker.peksaglinkissue19536 superseder
2015-12-07 13:31:21eric.smithsetfiles: + issue-24454-0.diff
messages: + msg256061

assignee: eric.smith
keywords: + patch, easy, needs review
stage: patch review
2015-12-06 20:54:29eric.smithsetmessages: + msg256030
2015-06-15 15:36:54rhettingersetmessages: + msg245381
2015-06-15 14:47:43serhiy.storchakasetmessages: + msg245379
2015-06-15 14:14:00mrabarnettsetmessages: + msg245376
2015-06-15 13:50:13eric.smithsetnosy: + eric.smith
messages: + msg245375
2015-06-15 11:20:35barrysetnosy: + barry
2015-06-15 04:39:52ezio.melottisetmessages: + msg245363
2015-06-15 04:29:28serhiy.storchakasetnosy: + ezio.melotti, serhiy.storchaka, mrabarnett
messages: + msg245362
components: + Regular Expressions
2015-06-15 03:47:52rhettingercreate