Author verdy_p
Recipients verdy_p
Date 2009-10-14.20:08:13
Message-id <1255550896.44.0.457470031322.issue7132@psf.upfronthosting.co.za>
In-reply-to
Content
For now, when capturing groups are used within repetitions, it is impossible to capture what they match 
individually within the list of matched repetitions.

E.g. the following regular expression:

(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}

is a regexp that contains two capturing groups (\1 and \2), the second of which is repeated (3 times) to 
match an IPv4 address in dotted decimal format. We'd like to be able to get the individual matches 
for the second group.

For now, capturing groups don't record the full list of matches: each repetition overrides the previous one, 
so only the last occurrence matched by the capturing group is kept (which may be an earlier one if the 
repetition is not greedy; that is not the case here because the repetition "{3}" is not followed by a "?"). 
So \1 will effectively return the first decimal component of the IPv4 address, but \2 will just return the 
last (fourth) decimal component.
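
The current behavior described above can be checked with the standard "re" module. This is a minimal sketch; 
the sample address "192.168.0.1" and the name IPV4 are my own choices for illustration. The address was picked 
so that the pattern accepts it: as written, the pattern does not accept every valid address (for example, 
components beginning with 3 to 9).

```python
import re

# The pattern from the report, split over two string literals for readability.
IPV4 = re.compile(
    r"(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"
    r"(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}"
)

m = IPV4.fullmatch("192.168.0.1")
print(m.group(1))  # 192  (group 1 is not repeated)
print(m.group(2))  # 1    (only the last of the three repetitions survives)
print(m.span(2))   # (10, 11)
```

The second and third components ("168" and "0") are matched but cannot be retrieved from the match object.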


I'd like to have a compilation flag "R" that would indicate that capturing groups 
will not just return a single occurrence, but all occurrences of the same group. If this "R" flag is enabled, 
then:

- Match.group(index) will not just return a single string but a list of strings, with as many occurrences 
as the number of effective repetitions of the same capturing group. The last element in that list will be the 
one equal to the current behavior

- Match.start(index) and Match.end(index) will also both return a list of positions, those lists having 
the same length as the list returned by Match.group(index).

- for consistency, the returned values as lists of strings (instead of just single strings) will apply to all 
capturing groups, even if they are not repeated.
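
The proposed "R" flag does not exist in the "re" module, so the lists described above cannot be obtained 
directly. As a rough illustration of the shape of the result (three lists of equal length), here is a sketch 
that emulates it for the IPv4 pattern in two passes. The names OCTET, OUTER, INNER and repeated_captures are 
hypothetical, and re-scanning the tail with a second pattern is an assumption that works for this pattern; 
it is not guaranteed to reproduce the original iterations for arbitrary patterns.

```python
import re

OCTET = r"(?:0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"
# Group 1: first component; group 2: the whole repeated tail (e.g. ".168.0.1").
OUTER = re.compile(rf"({OCTET})((?:\.{OCTET}){{3}})")
# One iteration of the repeated part, with the component captured.
INNER = re.compile(rf"\.({OCTET})")

def repeated_captures(text):
    """Return (strings, starts, ends) for each iteration, or None if no match."""
    m = OUTER.fullmatch(text)
    if m is None:
        return None
    start, end = m.span(2)
    # Second pass: scan only the span covered by the repeated part.
    hits = list(INNER.finditer(text, start, end))
    return ([h.group(1) for h in hits],
            [h.start(1) for h in hits],
            [h.end(1) for h in hits])

strings, starts, ends = repeated_captures("192.168.0.1")
print(strings)  # ['168', '0', '1']
print(starts)   # [4, 8, 10]
print(ends)     # [7, 9, 11]
```

The last element of each list is what group(2), start(2) and end(2) return today for the original pattern.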


Effectively, with the same regexp above, we will be able to retrieve (and possibly substitute):

- the first decimal component of the IPv4 address with "{\1:1}" (or "{\1:}" or "{\1}" or "\1" as before), 
i.e. the 1st (and last) occurrence of capturing group 1, or in Match.group(1)[1], or between string positions 
Match.start(1)[1] and Match.end(1)[1];

- the second decimal component of the IPv4 address with "{\2:1}", i.e. the 1st occurrence of capturing group 
2, or in Match.group(2)[1], or between string positions Match.start(2)[1] and Match.end(2)[1];

- the third decimal component of the IPv4 address with "{\2:2}", i.e. the 2nd occurrence of capturing group 2, 
or in Match.group(2)[2], or between string positions Match.start(2)[2] and Match.end(2)[2];

- the fourth decimal component of the IPv4 address with "{\2:3}" (or "{\2:}" or "{\2}" or "\2"), i.e. the 3rd 
(and last) occurrence of capturing group 2, or in Match.group(2)[3], or between string positions 
Match.start(2)[3] and Match.end(2)[3].
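
Until something like this exists, the four components listed above can be obtained only by giving each 
repetition its own group, which works only when the repetition count is fixed. A minimal sketch, assuming the 
sample address "192.168.0.1" (the names OCTET and UNROLLED are hypothetical):

```python
import re

OCTET = r"(?:0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)"
# The "{3}" repetition written out by hand: one capturing group per component.
UNROLLED = re.compile(rf"({OCTET})\.({OCTET})\.({OCTET})\.({OCTET})")

m = UNROLLED.fullmatch("192.168.0.1")
components = list(m.groups())
spans = [m.span(i) for i in range(1, 5)]
print(components)  # ['192', '168', '0', '1']
print(spans)       # [(0, 3), (4, 7), (8, 9), (10, 11)]
```

Here components[1], components[2] and components[3] correspond to the proposed "{\2:1}", "{\2:2}" and 
"{\2:3}". This rewriting is not possible for "*", "+" or "{m,n}" repetitions, which is where the proposal 
matters most.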


This should work with all repetition patterns (both greedy and not greedy, atomic or not, or possessive), in 
which the repeated pattern contains any capturing group.


This idea should also be submitted to the developers of the PCRE library (and of Perl, from which its syntax 
originates, and PHP, where PCRE is also used), so that they adopt a similar behavior in their regular expressions.

If there's already a candidate syntax or compilation flag in those libraries, this syntax should be used for 
repeated capturing groups.
History
Date User Action Args
2009-10-14 20:08:16 verdy_p set recipients: + verdy_p
2009-10-14 20:08:16 verdy_p set messageid: <1255550896.44.0.457470031322.issue7132@psf.upfronthosting.co.za>
2009-10-14 20:08:14 verdy_p link issue7132 messages
2009-10-14 20:08:14 verdy_p create