Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the usability of the match object named group API #68642

Closed
rhettinger opened this issue Jun 15, 2015 · 19 comments
Closed

Improve the usability of the match object named group API #68642

rhettinger opened this issue Jun 15, 2015 · 19 comments
Assignees
Labels
easy stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement

Comments

@rhettinger
Copy link
Contributor

BPO 24454
Nosy @warsaw, @rhettinger, @ericvsmith, @ezio-melotti, @serhiy-storchaka
Files
  • issue-24454-0.diff
  • issue-24454-1.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ericvsmith'
    closed_at = <Date 2016-09-11.12:59:03.975>
    created_at = <Date 2015-06-15.03:47:52.503>
    labels = ['expert-regex', 'easy', 'type-feature', 'library']
    title = 'Improve the usability of the match object named group API'
    updated_at = <Date 2016-10-06.08:43:48.049>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2016-10-06.08:43:48.049>
    actor = 'THRlWiTi'
    assignee = 'eric.smith'
    closed = True
    closed_date = <Date 2016-09-11.12:59:03.975>
    closer = 'eric.smith'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2015-06-15.03:47:52.503>
    creator = 'rhettinger'
    dependencies = []
    files = ['41262', '44522']
    hgrepos = []
    issue_num = 24454
    keywords = ['patch', 'easy', 'needs review']
    message_count = 19.0
    messages = ['245361', '245362', '245363', '245375', '245376', '245379', '245381', '256030', '256061', '275557', '275785', '275788', '275789', '275790', '275792', '275793', '275794', '275797', '275798']
    nosy_count = 8.0
    nosy_names = ['barry', 'rhettinger', 'eric.smith', 'ezio.melotti', 'mrabarnett', 'THRlWiTi', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue24454'
    versions = ['Python 3.6']

    @rhettinger
    Copy link
    Contributor Author

    The usability, learnability, and readability of match object code would be improved by giving it a more Pythonic API (inspired by ElementTree).

    Given a search like:

       data = 'Answer found on row 8 column 12.'
       mo = re.search(r'row (?P<row>\d+) column (?P<col>\d+)', data)

    We currently access results with:

      print(mo.group('col'))
      print(len(mo.groups())

    A better way would look like this:

      print(mo['col'])
      print(len(mo))

    This would work nicely with string formatting:

    print('Located coordinate at (%(row)s, %(col)s)' % mo)
    print('Located coordinate at ({row}, {col})'.format_map(mo))

    @rhettinger rhettinger added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jun 15, 2015
    @serhiy-storchaka
    Copy link
    Member

    You can use mo.groupdict().

    print('Located coordinate at (%(row)s, %(col)s)' % mo.groupdict())
    print('Located coordinate at ({row}, {col})'.format_map(mo.groupdict()))

    As for len(mo), this is ambiguous, as well as indexing with integer indices. You suggest len(mo) be equal len(mo.groups()) and this looks reasonable to me, but in regex len(mo) equals to len(mo.groups())+1 (because mo[1] equals to mo.group(1) or mo.groups()[0]). If indexing will work only with named groups, it would be expected that len(mo) will be equal to len(mo.groupdict()).

    @ezio-melotti
    Copy link
    Member

    print(mo['col'])
    print(len(mo))

    This has already been discussed in another issue. As Serhiy mentioned, len(mo) and mo[num] would be ambiguous because of the group 0, but mo[name] might be ok.

    @ericvsmith
    Copy link
    Member

    I'd definitely be for mo['col']. I can't say I've ever used len(mo.groups()).

    I do have lots of code like:
    return mo.group('col'), mo.group('row'), mo.group('foo')

    Using groupdict there is doable but not great. But:
    return mo['col'], mo['row'], mo['foo']
    would be a definite improvement.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Jun 15, 2015

    I agree that it would be nice if len(mo) == len(mo.groups()), but Serhiy has explained why that's not the case in the regex module.

    The regex module does support mo[name], so:

    print('Located coordinate at (%(row)s, %(col)s)' % mo)
    print('Located coordinate at ({row}, {col})'.format_map(mo))

    already work.

    @serhiy-storchaka
    Copy link
    Member

    The disadvantage of supporting len() is its ambiguousness. Supporting indexing with group name also has disadvantages (benefits already was mentioned above).

    1. Indexing with string keys is a part of mapping protocol, and it would be expected that other parts of the protocol if not all are supported (at least len() and iteration), but they are not.

    2. If indexing with group names would be supported, it would be expected the support of integer indexes. But this is ambiguous too.

    This feature would improve the access to named groups (6 characters less to type for every case and better readability), but may be implementing access via attributes would be even better? mo.groupnamespace().col or mo.ns.col?

    @rhettinger
    Copy link
    Contributor Author

    but may be implementing access via attributes would
    be even better? mo.groupnamespace().col or mo.ns.col?

    The whole point is to eliminate the unnecessary extra level.
    Contrast using DOM with using ElementTree. The difference
    in usability and readability is huge. If you do much in the
    way of regex work, this will be a win:

    mo['name'], mo['rank'], mo['serialnumber']

    There are several problems with trying to turn this into attribute access. One of the usual ones are the conflict between the user fieldnames and the actual methods and attributes of the objects (that is why named tuples have the irritating leading underscore for its own attributes and methods). The other problem is that it interferes with usability when the fieldname is stored in a variable. Contrast, "fieldname='rank'; print(mo[fieldname])" with "fieldname='rank'; print(getattr(mo, fieldname))".

    I'm happy to abandon the "len(mo)" suggestion, but mo[groupname] would be really nice.

    @ericvsmith
    Copy link
    Member

    I've been playing around with this. My implementation is basically the naive:

    def __getitem__(self, value):
        return self.group(value)

    I have the following tests passing:

        def test_match_getitem(self):
            pat = re.compile('(?:(?P<a1>a)|(?P<b2>b))(?P<c3>c)?')
    
            m = pat.match('a')
            self.assertEqual(m['a1'], 'a')
            self.assertEqual(m['b2'], None)
            self.assertEqual(m['c3'], None)
            self.assertEqual(m[0], 'a')
            self.assertEqual(m[1], 'a')
            self.assertEqual(m[2], None)
            self.assertEqual(m[3], None)
            with self.assertRaises(IndexError):
                m['X']
    
            m = pat.match('ac')
            self.assertEqual(m['a1'], 'a')
            self.assertEqual(m['b2'], None)
            self.assertEqual(m['c3'], 'c')
            self.assertEqual(m[0], 'ac')
            self.assertEqual(m[1], 'a')
            self.assertEqual(m[2], None)
            self.assertEqual(m[3], 'c')
            with self.assertRaises(IndexError):
                m['X']
    
            # Cannot assign.
            with self.assertRaises(TypeError):
                 m[0] = 1
    
            # No len().
            self.assertRaises(TypeError, len, m)

    But because I'm just calling group(), you'll notice a few oddities. Namely:

    • indexing with integers works, as would group(i). See the discussion of omitting len().
    • for defined but missing group names, you get None instead of KeyError
    • for invalid group names, you get IndexError instead of KeyError

    I can't decide if these are good (because they're consistent with group()), or bad (because they're surprising. I'm interested in your opinions.

    I'll attach the patch when I'm at a better computer.

    @ericvsmith
    Copy link
    Member

    Here's the patch.

    I added some more tests, including tests for ''.format_map(). I think the format_map() tests convince me that keeping None for non-matched groups makes sense.

    @ericvsmith ericvsmith added the easy label Dec 7, 2015
    @ericvsmith ericvsmith self-assigned this Dec 7, 2015
    @ericvsmith
    Copy link
    Member

    Updated patch, with docs. I'd like to get this in to 3.6. Can anyone take a look?

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 11, 2016

    New changeset ac0643314d12 by Eric V. Smith in branch 'default':
    bpo-24454: Improve the usability of the re match object named group API
    https://hg.python.org/cpython/rev/ac0643314d12

    @serhiy-storchaka
    Copy link
    Member

    Please document this feature in What's News Eric.

    There is no need to add the __getitem__ method explicitly the the list of methods match_methods. It would be added automatically.

    @ericvsmith
    Copy link
    Member

    I added a note in Misc/NEWS. Is that not what you mean?

    I'll look at match_methods.

    @serhiy-storchaka
    Copy link
    Member

    I meant Doc/whatsnew/3.6.rst.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 11, 2016

    New changeset 3265247e08f0 by Eric V. Smith in branch 'default':
    bpo-24454: Added whatsnew entry, removed __getitem__ from match_methods. Thanks Serhiy Storchaka.
    https://hg.python.org/cpython/rev/3265247e08f0

    @ericvsmith
    Copy link
    Member

    Fixed. Thanks, Serhiy.

    @serhiy-storchaka
    Copy link
    Member

    ./Modules/_sre.c:2425:14: warning: ‘match_getitem_doc’ defined but not used [-
    Wunused-variable]

    @ericvsmith
    Copy link
    Member

    On 9/11/2016 10:05 AM, Serhiy Storchaka wrote:

    Serhiy Storchaka added the comment:

    ./Modules/_sre.c:2425:14: warning: ‘match_getitem_doc’ defined but not used [-
    Wunused-variable]

    Not in Visual Studio!

    Standby.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 11, 2016

    New changeset 9eb38e0f1cad by Eric V. Smith in branch 'default':
    bpo-24454: Removed unused match_getitem_doc.
    https://hg.python.org/cpython/rev/9eb38e0f1cad

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants