Author akoumjian
Recipients akitada, akoumjian, alex, amaury.forgeotdarc, belopolsky, brian.curtin, collinwinter, davide.rizzo, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jacques, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, ronnix, rsc, sjmachin, stiv, timehorse, vbr, zdwiel
Date 2011-07-11.05:19:47
Message-id <1310361589.38.0.000599511922664.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
I apologize if this is the wrong place for this message. I did not see the link to a separate list.

First let me explain what I am trying to accomplish. I would like to take an arbitrary regular expression that contains both named and unnamed groups and tag the location of each group in the original string where a match was found. Take the following deliberately simple example:

>>> a_string = r"This is a demo sentence."
>>> pattern = r"(?<a_thing>\w+) (\w+) (?<another_thing>\w+)"
>>> m = regex.search(pattern, a_string)

What I want is a way to insert named/numbered tags into the original string, so that it looks something like this:

r"<a_thing>This</a_thing> <2>is</2> <another_thing>a</another_thing> demo sentence."

The syntax doesn't have to be exactly like that, but you get the idea. I have inserted the names and/or indices of the groups into the original string, around the span that each group occupies.

This task is exceedingly difficult with the current implementation, unless I am missing something obvious. We can retrieve a single group by index, all the groups as a tuple, or the named groups as a dict:

>>> m.group(1)
'This'
>>> m.groups()
('This', 'is', 'a')
>>> m.groupdict()
{'another_thing': 'a', 'a_thing': 'This'}

If all I wanted was to tag the groups by index, it would be a simple function. I could call m.span(i) for each index i from 1 to len(m.groups()) and insert the <i> and </i> tags around the right offsets.
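For reference, the index-only case might look roughly like the sketch below. It uses the standard library re module rather than regex, the function name tag_by_index is made up for illustration, and it assumes the groups do not nest or overlap.

```python
import re

def tag_by_index(m):
    """Wrap each matched group in <i>...</i> tags, using group numbers only.

    Hypothetical helper; assumes groups are not nested and do not overlap.
    """
    s = m.string
    # Collect (start, end, index) for every group that took part in the match.
    spans = [(m.start(i), m.end(i), i)
             for i in range(1, m.re.groups + 1)
             if m.start(i) != -1]
    # Insert tags from right to left so the earlier offsets stay valid.
    for start, end, i in sorted(spans, reverse=True):
        s = s[:start] + "<%d>" % i + s[start:end] + "</%d>" % i + s[end:]
    return s

m = re.search(r"(\w+) (\w+) (\w+)", "This is a demo sentence.")
print(tag_by_index(m))
# <1>This</1> <2>is</2> <3>a</3> demo sentence.
```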

The hard part is working out the spans of the named groups, and which name belongs to which index. Do any of you have a suggestion?
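One possible approach, sketched here with the standard library re module (so the pattern uses the (?P<name>...) spelling): the compiled pattern's groupindex maps each name to its group number, and inverting it gives a name for every numbered group that has one. I am assuming regex exposes the same attribute, since it aims to be compatible with re, but I have not verified that. The helper name tag_groups is made up, and the sketch assumes the groups appear left to right without nesting.

```python
import re

def tag_groups(m):
    """Wrap each group in <name>...</name>, falling back to its number.

    Hypothetical helper; assumes groups are in left-to-right order
    and are not nested.
    """
    # Pattern.groupindex maps name -> number; invert it to look up names.
    names = dict((index, name) for name, index in m.re.groupindex.items())
    s = m.string
    pieces = []
    pos = 0
    for i in range(1, m.re.groups + 1):
        start, end = m.span(i)
        if start == -1:  # this group did not participate in the match
            continue
        label = names.get(i, str(i))
        pieces.append(s[pos:start])
        pieces.append("<%s>%s</%s>" % (label, s[start:end], label))
        pos = end
    pieces.append(s[pos:])
    return "".join(pieces)

pattern = r"(?P<a_thing>\w+) (\w+) (?P<another_thing>\w+)"
m = re.search(pattern, "This is a demo sentence.")
print(tag_groups(m))
# <a_thing>This</a_thing> <2>is</2> <another_thing>a</another_thing> demo sentence.
```

In re, m.span() also accepts a group name directly, so m.span('a_thing') gives (0, 4) here. The inverted mapping is only needed to go from an index back to a name.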

It would make more sense, from my perspective, if each group were an object that knew its own name and span. It would work like this with the above example:

>>> first = m.group(1)
>>> first.name()
'a_thing'
>>> second = m.group(2)
>>> second.name()
None
>>>

You could still call .span() on the Match object itself, but it would query its child group objects for the data. Overall I think this would be a much more Pythonic approach, especially given that you have added subscripting and key lookup.

So instead of this:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
<type 'str'>

You could have:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
<'regex.Match.Group object'>

With the noted benefit of this:
>>> m['a_thing'].span()
(0, 4)
>>> m['a_thing'].index()
1
>>>
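To make the proposal concrete, here is a rough sketch of how such group objects could be emulated today as a thin wrapper around an ordinary match. The names GroupView and group_views are hypothetical and are not part of re or regex. The sketch uses the standard library re module and relies on the pattern's groupindex mapping.

```python
import re

class GroupView(object):
    """Hypothetical per-group object; not part of re or regex."""

    def __init__(self, match, index, name=None):
        self._match = match
        self._index = index
        self._name = name

    def name(self):
        # None for groups that were not given a name in the pattern.
        return self._name

    def index(self):
        return self._index

    def span(self):
        # Delegates to the underlying match; (-1, -1) if the group is unmatched.
        return self._match.span(self._index)

    def text(self):
        return self._match.group(self._index)

def group_views(m):
    """Return one GroupView per group in the pattern, in index order."""
    names = dict((index, name) for name, index in m.re.groupindex.items())
    return [GroupView(m, i, names.get(i))
            for i in range(1, m.re.groups + 1)]

m = re.search(r"(?P<a_thing>\w+) (\w+) (?P<another_thing>\w+)",
              "This is a demo sentence.")
first, second, third = group_views(m)
print(first.name())    # a_thing
print(first.span())    # (0, 4)
print(second.name())   # None
```

A wrapper like this leaves m['a_thing'] returning a plain string, so existing code keeps working. Whether the Match object itself should return such objects is the design question raised above.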

Maybe I'm missing a major point or functionality here, but I've been poring over the docs and don't currently think what I'm trying to achieve is possible.

Thank you for taking the time to read all this.

-Alec