Message 94050 - Python tracker


Author	verdy_p
Recipients	ezio.melotti, r.david.murray, verdy_p
Date	2009-10-14.23:26:57
Message-id	<1255562819.18.0.953967906207.issue7132@psf.upfronthosting.co.za>
In-reply-to

Content

I carefully read ALL of what ezio said; this is clear from the fact that
I have summarized my responses to ALL 4 points given by ezio.

Capturing groups are a VERY useful feature of regular expressions, but
they currently DON'T work as expected (in a useful way) when they are
used within repetitions (unless you don't need any captures at all, for
example when just using find() and not performing substitutions on the
groups).

My proposal would have absolutely NO performance impact when capturing
groups are not used (find only, without replacement, so in that case the
R flag can be safely ignored).

It would also not affect the case where capturing groups are used in the
regexp, but these groups are not referenced in the substitution or in
the code using MatchObject.group(index): these indexes are already not
used (or should not be, because using them is most of the time a bug
when group() just returns the last occurrence).
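
To illustrate the current behavior being discussed, here is a minimal
sketch using the standard re module. The pattern and the sample string
are made up for illustration only:

```python
import re

# Made-up sample: three comma-terminated items matched by one repeated group.
m = re.match(r'(\w+,)*', 'a,b,c,')

whole = m.group(0)   # the full match covers every repetition
last = m.group(1)    # but the group keeps only its last occurrence

print(whole)  # a,b,c,
print(last)   # c,
```

The earlier occurrences ('a,' and 'b,') were matched, but they are not
reachable through the match object.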

Using multiple parsing operations with multiple regexps is really
tricky, when everything could be done directly from the original regexp,
without modifying it. In addition, using split() or similar will not
work as expected when the splitting operation does not correctly parse
the context in which the multiple occurrences are safely separated (this
context is only correctly specified in the original regexp, where the
groups, capturing or not, are specified).
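
As a rough sketch of that problem with today's re module (the record
format, FIELD and RECORD below are hypothetical, chosen only to show the
effect): a plain split() ignores the quoting context, so the usual
workaround is a second pass that re-applies the inner pattern by hand.

```python
import re

# Hypothetical record format: comma-separated fields, optionally quoted.
line = 'a,"b,c",d'
FIELD = r'"[^"]*"|[^,"]+'
RECORD = re.compile(r'(?:(%s)(?:,|$))+' % FIELD)

m = RECORD.fullmatch(line)
last_only = m.group(1)        # only the last field survives in group 1

# A plain split ignores the quoting context and cuts inside "b,c".
naive = line.split(',')

# Workaround: a second pass that repeats the inner pattern separately.
fields = re.findall(FIELD, m.group(0))

print(last_only)  # d
print(naive)      # ['a', '"b', 'c"', 'd']
print(fields)     # ['a', '"b,c"', 'd']
```

The second pass only works here because FIELD can be applied on its own;
when the inner pattern depends on the surrounding regexp, it cannot be
reused this simply.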

This extension will also NOT affect the non-capturing groups like:
 (?:X){m,n}
 (?:X)*
 (?:X)+
It will ONLY affect the CAPTURING groups like:
 (X){m,n}
 (X)*
 (X)+
and only if the R flag is set (in which case this will NOT affect the
backtracking behavior, or which strings will actually be matched, but
only the values of the returned "\n" indexed groups).
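
A small check with the existing re module (made-up pattern and input)
shows the point about matching: the capturing and non-capturing forms of
a repeated group select exactly the same text, and differ only in what
the match object reports for the group afterwards.

```python
import re

text = 'ab12cd34'  # made-up input: two repetitions of letters+digits

capturing = re.match(r'([a-z]+\d+)+', text)
non_capturing = re.match(r'(?:[a-z]+\d+)+', text)

same_span = capturing.span() == non_capturing.span()

print(same_span)               # True
print(capturing.group(1))      # cd34
print(non_capturing.groups())  # ()
```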

If my suggestion to keep the existing MatchObject.function(index) API
looks too dangerous to you, because it would change the type of the
returned values when the R flag is set, you could just as well add
separately named methods to get the occurrences of a group. Such as:

 MatchObject.groupOccurences(index)
 MatchObject.startOccurences(index)
 MatchObject.endOccurences(index)
 MatchObject.spanOccurences(index)
 MatchObject.groupsOccurences(index)

But I don't think this is necessary; it will already be expected that
they return lists of values (or lists of pairs), instead of just
single values (or single pairs) for each group: Python (as well as PHP
or Perl) can already manage return values with varying datatypes.
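
To give an idea of what such list-returning accessors could produce,
here is a pure-Python sketch. It is NOT the proposed implementation:
span_occurrences() and group_occurrences() are hypothetical helpers that
only emulate the simple case of a regexp of the form (X)*, by matching X
again at each successive position, without any backtracking between
repetitions.

```python
import re

def span_occurrences(inner, text, pos=0):
    """Hypothetical helper: (start, end) of each consecutive match of
    `inner` starting at `pos`, emulating a list of spans for (inner)*."""
    pat = re.compile(inner)
    spans = []
    while True:
        m = pat.match(text, pos)
        if m is None or m.end() == pos:  # stop on failure or empty match
            break
        spans.append(m.span())
        pos = m.end()
    return spans

def group_occurrences(inner, text, pos=0):
    """Hypothetical helper: the matched substrings for those spans."""
    return [text[s:e] for s, e in span_occurrences(inner, text, pos)]

print(group_occurrences(r'\w+,', 'a,b,c,'))  # ['a,', 'b,', 'c,']
print(span_occurrences(r'\w+,', 'a,b,c,'))   # [(0, 2), (2, 4), (4, 6)]
```

With the existing API, re.match(r'(\w+,)*', 'a,b,c,').group(1) gives
only 'c,', the last element of the first list.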

Maybe only PCRE (written for C/C++) would need a new API name to return
lists of values instead of single values for each group, due to existing
datatype restrictions.

My proposal is not inconsistent: it returns consistent datatypes when
the R flag is set, for ALL capturing groups (not just those that are
repeated).

Anyway, I'll submit my idea to other groups, if I can find where to post
it. Note that I've already implemented it in my own local
implementation of PCRE, and this works perfectly with effectively very
few changes (so far I have only had to change the datatypes of the match
objects so that they can return varying types), and I have used it to
create a modified version of 'sed' to perform massive filtering of data.

It really reduces the number of transformation steps needed to process
such data correctly, because a single regexp (exactly the same one that
is already used in the first step to match the substrings we are
interested in, when using existing 'sed' implementations) can be used to
perform the substitutions using indexes within captured groups. And I
would like to have it incorporated in Python (and also Perl or PHP) as
well.

History
Date	User	Action	Args
2009-10-14 23:26:59	verdy_p	set	recipients: + verdy_p, ezio.melotti, r.david.murray
2009-10-14 23:26:59	verdy_p	set	messageid: <1255562819.18.0.953967906207.issue7132@psf.upfronthosting.co.za>
2009-10-14 23:26:57	verdy_p	link	issue7132 messages
2009-10-14 23:26:57	verdy_p	create