classification
Title: Regexp: capturing groups in repetitions
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: languishing Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, davidchambers, ezio.melotti, mrabarnett, verdy_p
Priority: low Keywords:

Created on 2009-10-14 20:08 by verdy_p, last changed 2014-03-22 23:46 by BreamoreBoy.

Messages (27)
msg94022 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 20:08
For now, when capturing groups are used within repetitions, it is impossible to capure what they match 
individually within the list of matched repetitions.

E.g. the following regular expression:

(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)(?:\.(0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?)?)){3}

is a regexp that contains two capturing groups (\1 and \2), but whose the second one is repeated (3 times) to 
match an IPv4 address in dotted decimal format. We'd like to be able to get the individual multiple matchs 
for the second group.

For now, capturing groups don't record the full list of matches, but just override the last occurence of the 
capturing group (or just the first if the repetition is not greedy, which is not the case here because the 
repetition "{3}" is not followed by a "?"). So \1 will effectively return the first decimal component of the 
IPv4 address, but \2 will just return the last (fourth) decimal component.


I'd like to have the possibility to have a compilation flag "R" that would indicate that capturing groups 
will not just return a single occurence, but all occurences of the same group. If this "R" flag is enabled, 
then:

- the Match.group(index) will not just return a single string but a list of strings, with as many occurences 
as the number of effective repetitions of the same capturing group. The last element in that list will be the 
one equal to the current behavior

- the Match.start(index) and Match.end(index) will also both return a list of positions, those lists having 
the same length as the list returned by Match.group(index).

- for consistency, the returned values as lists of strings (instead of just single strings) will apply to all 
capturing groups, even if they are not repeated.


Effectively, with the same regexp above, we will be able to retreive (and possibily substitute):

- the first decimal component of the IPv4 address with "{\1:1}" (or "{\1:}" or "{\1}" or "\1" as before), 
i.e. the 1st (and last) occurence of capturing group 1, or in Match.group(1)[1], or between string positions Match.start(1)[1] and Match.end(1)[1] ;

- the second decimal component of the IPv4 address with "{\2:1}", i.e. the 1st occurence of capturing group 
2, or in Match.group(2)[1], or between string positions Match.start(2)[1] and Match.end(2)[1] ;

- the third decimal component of the IPv4 address with "{\2:2}", i.e. the 2nd occurence of capturing group 2, 
or in Match.group(2)[2], or between string positions Match.start(2)[2] and Match.end(2)[2] ;

- the fourth decimal component of the IPv4 address with "{\2:3}" (or "{\2:}" or "{\2}" or "\2"), i.e. the 3rd 
(and last) occurence of capturing group 2, or in Match.group(2)[2], or between string positions 
Match.start(2)[3] and Match.end(2)[3] ;


This should work with all repetition patterns (both greedy and not greedy, atomic or not, or possessive), in 
which the repeated pattern contains any capturing group.


This idea should also be submitted to the developers of the PCRE library (and Perl from which they originate, 
and PHP where PCRE is also used), so that they adopt a similar behavior in their regular expressions.

If there's already a candidate syntax or compilation flag in those libraries, this syntax should be used for 
repeated capturing groups.
msg94025 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 20:36
I'd like to add that the same behavior should also affect the span(index) 
method of MatchObject, that should also not just return a single (start, 
end) pair, but that should in this case return a list of pairs, one for 
each occurence, when the "R" compilation flag is specified.

This also means that the regular expression compilation flag R should be 
supported as these constants:
Regexp.R, or Regexp.REPETITIONS
msg94029 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 20:49
Rationale for the compilation flag:

You could think that the compilation flag should not be needed. However, 
not using it would mean that a LOT of existing regular expressions that 
already contain capturing groups in repetitions, and for which the 
caputiring group is effectively not used and should have been better 
encoded as a non-capuring group like (?:X) instead of (X), will suffer a 
negative performance impact and a higher memory usage.

The reason is that the MatchObject will have to store lists of 
(start,end) pairs instead of just a single pair: using a list will not 
be the default, so MatchObject.group(groupIndex), 
MatchObject.start(groupIndex), MatchObject.end(groupIndex), and 
MatchObject.span(groupIndex) will continue to return a single value or 
single pair, when the R compilation flag is not set (these values will 
continue to return only the last occurence, that will be overriden after 
each matched occurence of the capturing group.

The MatchObject.groups() will also continue to return a list of single 
strings without that flag set (i.e. a list of the last occurence of each 
capturing group). But when the R flag will be specified, it will return, 
instead, a list of lists, each element being for the group with the same 
index and being itself a list of strings, one for each occurence of the 
capturing group.
msg94030 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 20:55
Implementation details:

Currently, the capturing groups behave quite randomly in the values returned by MachedObject, when backtracking occurs in a repetition. This 
proposal will help fix the behavior, because it will also be much easier 
to backtrack cleanly, occurence by occurence, by just dropping the last 
element in the list of (start,end) pairs stored in the MatchedObject for 
all capturing groups specified WITHIN the repeated sub-expression.
msg94034 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-10-14 21:36
I'm skeptical about what you are proposing for the following reasons:
1) it doesn't exist in any other implementation that I know;
2) if implemented as default behavior:
   * it won't be backward-compatible;
   * it will increase the complexity;
3) it will be a proprietary extension and it will reduce the
compatibility with other implementations;
4) I can't think to any real word situation where this would be really
useful.

Using a flag like re.R to change the behavior might solve the issue 2),
but I'll explain why I don't think this is useful.

Let's take a simpler ipv4 address as example: you may want to use
'^(\d{1,3})(?:\.(\d{1,3})){3}$' to capture the digits (without checking
if they are in range(256)).
This currently only returns:
>>> re.match('^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')

If I understood correctly what you are proposing, you would like it to
return (['192'], ['168', '0', '1']) instead. This will also require an
additional step to join the two lists to get the list with the 4 values.

In these situations where some part is repeating, it's usually easier to
use re.findall() or re.split() (or just a plain str.split for simple
cases like this):
>>> addr = '192.168.0.1'
>>> re.findall('(?:^|\.)(\d{1,3})', addr)
['192', '168', '0', '1']
>>> re.split('\.', addr) # no need to use re.split here
['192', '168', '0', '1']

In both the examples a single step is enough to get what you want
without changing the existing behavior.

'^(\d{1,3})(?:\.(\d{1,3})){3}$' can still be used to check if the string
has the right "format", before using the other methods to extract the data.

So I'm -1 about the whole idea and -0.8 about an additional flag.
Maybe you should discuss about this on the python-ideas ML.
msg94038 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 22:26
You're wrong, it WILL be compatible, because it is only conditioned by a 
FLAG. The flag is there specifically for instructing the parser to 
generate lists of values rather than single values.

Without the regular compilation flag set, as I said, there will be NO 
change.

Reopening the proposal, which is perfectly valid !
msg94040 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 22:29
Note that I used the IPv4 address format only as an example. There are 
plenty of other more complex cases for which we really need to capture the 
multiple occurences of a capturing group within a repetition.

I'm NOT asking you how to parse it using MULTIPLE regexps and functions. 
Of course you can, but this is a distinct problem, but certinaly NOT a 
general solution (your solution using split() will NOT work with really A 
LOT of other regular expressions).
msg94043 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 22:33
In addition, your suggested regexp for IPv4:

'^(\d{1,3})(?:\.(\d{1,3})){3}$'

is completely WRONG ! It will match INVALID IPv4 address formats like 
"000.000.000.000". Reread the RFCs... because "000.000.000.000" is 
CERTAINLY NOT an IPv4 address (if it is found in an URL) but a domain name 
that must be resolved into an IP address using domain name resolution 
requests.
msg94045 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 22:47
Summary of your points with my responses :


> 1) it doesn't exist in any other implementation that I know;

That's exactly why I proposed to discuss it with the developers of other 
implementations (I cited PCRE, Perl and PHP developers, there are 
others).


> 2) if implemented as default behavior:
>   * it won't be backward-compatible;

Wrong. This does not even change the syntax of regualr expressions 
themselves.

>   * it will increase the complexity;

Wrong. All the mechanic is already implemented: when the parser will 
store string index positions for a matched group it will just have to 
append a pair in the list stored in MatchObject.group(index) (it will 
create the list if it is not ealready there, but it should be already 
initialized to an empty list by the compiler) when the flag.R is set, 
instead of overwriting the existing pair without wondering if there was 
already another occurence of the same capturing group.


> 3) it will be a proprietary extension and it will reduce the
compatibility with other implementations;

Already suggested above. This will hovever NOT affect the compatibility 
of existing implementation that don't have the R flag.


> 4) I can't think to any real word situation where this would be really
useful.

There are really a lot ! Using multiple split operations and multiple 
parsing on partly parsed regular expressions will not be a solution in 
many situations (think about how you would perform matches and using 
them that in 'vi' or 'ed' with a single "s/regexp/replacement/flag" 
instruction, if there's no extension with a flag and a syntax for 
accesing the individual elements the replacement string).
msg94046 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 22:52
And anyway, my suggestion is certainly much more useful than atomic groups 
and possessive groups that have much lower use, and which are already 
being tested in Perl but that Python (or PCRE, PHP, and most 
implementations of 'vi'/'ed', or 'sed') still does not support.
msg94047 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-10-14 22:53
If you read what Ezio wrote carefully you will see that he addressed
both of your points: he acknowledged that a flag would solve (2) (but
disagreed that it was worth it), and he said you could use the first
expression to validate the string before using the split to obtain the
data.  Doing it all in one regex might seem more efficient, but at least
in the single use case you have presented it would lead to more
complicated code.

Simple. obvious feature requests can be opened and acted upon directly
in the tracker, but more complicated requests should be made as
proposals on the python-ideas list, and if they are well received there
then an issue can be opened (or in this case you could reopen this one)
in the tracker with a pointer to the python-ideas thread.  In most
cases, such an issue would need to include a proposed patch.

Note that we don't really have a resolution that says 'sent to
python-ideas', thus the use of the 'rejected' status.
msg94049 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-10-14 23:00
Just to clarify, when I said "in most cases such an issue would need to
include a proposed patch", I mean that even if everyone agrees it is a
good idea it isn't likely to happen unless there is a proposed patch :)
msg94050 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 23:26
I had read carefully ALL what ezio said, this is clear in the fact that 
I have summarized my responses to ALL the 4 points given by ezio.

Capturing groups is a VERY useful feature of regular expressions, but 
they currently DON'T work as expected (in a useful way) when they are 
used within repetitions (unless you don't need any captures at all, for 
example when just using find(), and not performing substitutions on the 
groups.

My proposal woul have absolutely NO performance impact when capturing 
groups are not used (find only, without replacement, so there the R flag 
can be safely ignored).

It would also not affect the case where capturing groups are used in the 
regexp, but these groups are not referenced in the substitution or in 
the code using MatchObject.group(index) : these indexes are already not 
used (or should not, because this is most of the time a bug when it just 
returns the last occurence).

Using multiple parsing operations with multiple regexps is really 
tricky, when all could be done directly from the original regexp, 
without modifying it. In addition, using split() or similar will not 
work as expected, when the splitting operations will not correctly parse 
the context in which the multiple occurences are safely separated (this 
context is only correctly specified in the original regexp where the 
groups, capturing or not, are specified).

This extension will also NOT affect the non-capturing groups like:
 (?:X){m,n}
 (?:X)*
 (?:X)+
It will ONLY affect the CAPTURING groups like:
 (X){m,n}
 (X)*
 (X)+
and only if the R flag is set (in which case this will NOT affect the 
backtracking behavior, or which strings that will be effectively 
matched, but only the values of the returned "\n" indexed group.

If my suggestion to keep the existing MatchObject.function(index) API 
looks too dangerous for you, because it would change the type of the 
returned values when the R flag is set, you can as well rename them to 
get a specific occurence of a group. Such as:

 MatchObject.groupOccurences(index)
 MatchObject.startOccurences(index)
 MatchObject.endOccurences(index)
 MatchObject.spanOccurences(index)
 MatchObject.groupsOccurences(index)

But I don't think this is necessary; it will be already expected that 
they will return lists of values (or lists of pairs), instead of just 
single values (or single pairs) for each group: Python (as well as PHP 
or Perl) can already manage return values with varying datatypes.

May be only PCRE (written for C/C++) would need a new API name to return 
lists of values instead of single values for each group, due to existing 
datatype restrictions.

My proposal is not inconsistant: it returns consistant datatypes when 
the R flag is set, for ALL capturing groups (not just those that are 
repeated.

Anyway I'll submit my idea to other groups, if I can find where to post 
them. Note that I've already implemented it in my own local 
implementation of PCRE, and this works perfectly with effectively very 
few changes (currently I have had to change the datatypes for matching 
objects so that they can return varying types), and I have used it to 
create a modified version of 'sed' to perform massive filtering of data:

It really reduces the number of transformation steps needed to process 
such data correctly, because a single regexp (exactly the same that is 
already used in the first step used to match the substrings we are 
interested in, when using existing 'sed' implementations) can be used to 
perform the substitutions using indexes within captured groups. And I 
would like to have it incoporated in Python (and also Perl or PHP) as 
well.
msg94051 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-10-14 23:28
> You're wrong, it WILL be compatible, because it is only conditioned
> by a FLAG.

Sorry, I missed that you mentioned the flag already in the first
message, but what I said in 1), 3) and 4) is still valid.

> There are plenty of other more complex cases for which we really
> need to capture the multiple occurences of a capturing group within
> a repetition.

Can you provide some example where your solution is better than the
other available way of doing it? During the years lot of extensions have
been added to the regex engines, if no one added this is probably
because these problems can be already solved in other ways.

> I'm NOT asking you how to parse it using MULTIPLE regexps and
> functions. 
> Of course you can, but this is a distinct problem, but certinaly NOT
> a general solution (your solution using split() will NOT work with
> really A LOT of other regular expressions).

Even with your solution, in most of the cases you will need additional
steps to assemble the results (at least in the cases with some kind of
separator, where you have to join the first element with the followings).

I can see a very limited set of hypothetical corner cases where your
proposal may save a few line of codes but I don't think it's worth
implementing all this just for them.
An example could be:
>>> re.match('^([0-9A-F]{2}){4} ([a-z]\d){5}$', '3FB52A0C a2c4g3k9d3',
re.R).groups()
(['3F', 'B5', '2A', '0C'], ['a2', 'c4', 'g3', 'k9', 'd3'])
but it's not really a real-world case, if you have some real-world
example I'd like to see it.

> In addition, your suggested regexp for IPv4:
> '^(\d{1,3})(?:\.(\d{1,3})){3}$'
> is completely WRONG !

That's why I wrote 'without checking if they are in range(256)'; the
fact that this regex matches invalid digits was not relevant in my
example (and it's usually easier to convert the digits to int and check
if 0 <= digits <= 255). :)


>> 1) it doesn't exist in any other implementation that I know;
>
> That's exactly why I proposed to discuss it with the developers of
> other implementations (I cited PCRE, Perl and PHP developers, there
> are others).

So maybe this is not the right place to ask.

>> 3) it will be a proprietary extension and it will reduce the
>> compatibility with other implementations;
>
> Already suggested above. This will hovever NOT affect the
> compatibility of existing implementation that don't have the R flag.

What I meant is that a regex that uses the re.R flag in Python won't
work in other languages/implementations because they don't have it, and
a "general" regex (e.g. for an ipv6 address) will have to be
adapted/rewritten in order to take advantage of re.R.

>> 4) I can't think to any real word situation where this would be 
>> really useful.
>
> There are really a lot ! Using multiple split operations and multiple 
> parsing on partly parsed regular expressions will not be a solution 
> in many situations (think about how you would perform matches and 
> using  them that in 'vi' or 'ed' with a single
> "s/regexp/replacement/flag" instruction, if there's no extension
> with a flag and a syntax for accesing the individual elements the 
> replacement string).

Usually when the text to be parsed starts to be too complex is better to
use another approach, e.g. using a real parser or dividing the text in
smaller units and work on them independently. Even if re.R could make
this easier I would still prefer to have a few more line of code that do
different things that a single big regex that does everything.

> And anyway, my suggestion is certainly much more useful than atomic 
> groups and possessive groups that have much lower use [...]

Then why no one implemented it yet? :)
msg94053 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 23:44
ezio said:
>>> re.match('^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')
> If I understood correctly what you are proposing, you would like it to
return (['192'], ['168', '0', '1']) instead.

Yes, exactly ! That's the correct answer that should be returned, when 
the R flag is set.

> This will also require an additional step to join the two lists to get 
the list with the 4 values.

Yes, but this is necessary for full consistency of the group indexes. 
The current return value is clearly inconsistant (generally it returns 
the last occurence of the capturing group, but I've discovered that this 
is not always the case, because of matches that are returned after 
backtracking...)

It is then assumed that when the R flag is set, ALL occurences of 
repeated groups will be returned individually, instead of just a 
'random' one. Note that for full generalization of the concept, there 
should even be lists of lists if a capturing group contains itself 
another inner capturing group (with its own index), in order to 
associate them correctly with each occurence of the outer capturing 
group (however I've still not experimented this in my local 
experimentation, so all occurences are grouped in the same returned 
list, independantly of the occurence of the outer capturing group in 
which they were found).
msg94056 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-14 23:52
> That's why I wrote 'without checking if they are in range(256)'; the
fact that this regex matches invalid digits was not relevant in my
example (and it's usually easier to convert the digits to int and check
if 0 <= digits <= 255). :)

NO ! You have to check also the number of digits for values below 100 (2 
digits only) or below 10 (1 digit only)

And when processing web log files for example, or when parsing Wiki 
pages or emails in which you want to autodetect the presence of ONLY 
valid IP addresses within some contexts, where you want to transform 
them to another form (for example when converting them to links or to 
differentiate 'anonymous' users in wiki pages from registered named 
users, you need to correctly match these IP addresses. In addition, 
these files will often contain many other occurences that you don't want 
to transform, but just some of them in specific contexts given by the 
regexp. for this reason, your suggestion will often not work as 
expected.

The real need is to match things exactly, within their context, and 
capturing all occurences of capturing groups.

I gave the IPv4 regexp only as a simple example to show the need, but 
there are of course much more complex cases, and that's exactly for 
those cases that I would like the extension: using alternate code with 
partial matches and extra split() operations give a code that becomes 
tricky, and most often bogous. Only the original regexp is precise 
enough to parse the content correctly, find only the matches we want, 
and capturing all the groups that we really want, in a single operation, 
and with a near-zero cost (and without complication in the rest of the 
code using it).
msg94057 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-15 00:00
> Even with your solution, in most of the cases you will need additional
steps to assemble the results (at least in the cases with some kind of
separator, where you have to join the first element with the 
followings).

Yes, but this step is trivial and fully predictable. Much more viable 
than the other solutions proposed which gives tricky and often complex 
and bogous code.

How many bugs have been found in code using split() for example to parse 
URLs ? There are countlesss in many softwares (and it is not terminated 
!)

And in fine, the only solution is to simply rewrite the parser 
completely, without regexps at all, or to reduce the generality of the 
problems that the program was supposed to solve (i.e. asserting in the 
code some implementation limits, to reject some forms that were 
previously considered valid). Think about it, the capturing groups are 
the perfect solution for solving the problem cleanly, provided that they 
work as intended and return all their occurences.
msg94058 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-15 00:13
> a "general" regex (e.g. for an ipv6 address)

I know this problem, and I have already written about this. It is not 
possible to parse it in a single regexp if it is written without using 
repetitions. But in that case, the regexp becomes really HUGE, and the 
number of groups in the returned match object is prohibitive. That's why 
CPAN has had to write a specific module for IPv6 addresses in Perl.

Such module can be reduced to just a couple of lines with a single 
regexp, if its capturing groups correctly return ALL their occurences in 
the regexp engine: it requires no further processing and analysis, and 
the data can effectively be reassembled cleanly, just from the returned 
groups (as lists):
- \1 and \2 (for hex components of IPv6 in hex format only, where \1 can 
occur 0 or 1 time, and \2 can occur 0 to 7 times)
- or from \1 to \2 and \3 to \4 (for hex components in \1..\2, where \1 
occurs 0 or 1 time and \2 occurs 0 to 5 times, and for decimal 
components in \3..\4, where \3 occurs 1 time and \4 occurs exactly 3 
times).
msg94060 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-15 00:26
>> And anyway, my suggestion is certainly much more useful than atomic 
>> groups and possessive groups that have much lower use [...]
>Then why no one implemented it yet? :)

That's because they had to use something else than regexps to do their 
parsing. All those that had to do that *pested* that the regexps were 
not capturing all occurences.

And then later they regretted it, because they had to fix their 
alternate code (such as those using the bogous split() alternatives...) 
and finally rewrote their own parsers (sometimes with a combination of 
(f)lex+yacc/bison, even if the full expression was given in a single 
regexp which was expressive enough to match only the exact match they 
wanted, but without using the returned captured groups): this means an 
extra parsing for the found substring (in their correct context) in 
order to process it.
msg94065 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-15 00:45
>>> re.match('^(\d{1,3})(?:\.(\d{1,3})){3}$', '192.168.0.1').groups()
('192', '1')
> If I understood correctly what you are proposing, you would like it to
return (['192'], ['168', '0', '1']) instead.

In fact it can be assembled in a single array directly in the regexp, by 
naming the destination capturing group (with the same name, it would get 
the same group index, instead of of allocating a new one). E.g., with 
someting like:

>>> re.match('^(?P<parts>=\d{1,3})(?:\.(?P<parts>=\d{1,3})){3}$', 
'192.168.0.1').groups()

would return ("parts": ['192', '168', '0', '1']) in the same first 
group.

This could be used as well in PHP (which supports associative arrays for 
named groups which are also indexed positionnaly).
msg94066 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-10-15 00:50
Instead of a new flag, a '*' could be put after the quantifier, eg:

    (\d+)(?:\.(\d+)){3}*

MatchObject.group(1) would be a string and MatchObject.group(2) would be
a list of strings.

The group references could be \g<1>, \g<2:0>, \g<2:1>, \g<2:2>.

However, I think that it's extending regexes too far; something else
should be used, eg pyparsing or some type of context-free grammar with
optional constraints.

-1 from me
msg94067 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-15 02:04
You said that this extension was not implemented anywhere, and you were 
wrong.

I've found that it IS implemented in Perl 6! Look at this discussion:

http://www.perlmonks.org/?node_id=602361

Look at how the matches in quantified capture groups are returned as 
arrayref's (i.e. references to a numbered array).

So my idea is not stupid. Given that Perl rules the world of the Regexp 
language, it will be implemented elsewhere sonner or later, notably in 
PCRE, awk, vi, sed, PHP, .NET library...

Already, this is used in CPAN libraries for Perl v6... (when the X flag 
is set for extensions)
msg94068 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-15 02:41
Anyway, there are ways to speedup regexps, even without instructing the 
regexps with anti-backtracking syntaxes.

See http://swtch.com/~rsc/regexp/regexp1.html
(article dated January 2007)
Which discusses how Perl, PCRE (and PHP), Python, Java, Ruby, .NET 
library... are slow because they are using backtracking a single state 
in the NFA, instead of using simultaneously active states (which 
correctly emulates the DFA, without having to actually build the DFA 
transition states table, which could grow combinatorially, as seen in 
yacc and Bison).

Java uses now the Thomson approach in its latest releases, but I wonder 
how Python works: does it use the DFA simulation? Do you use PCRE?

Note that I've been using the DFA simulation since more than 20 years in 
1987, when I built my first regexp matcher (because the existing 
implementation at that time were really damn slow), after I read the 
Aho/Sethi/Ullman book which already demonstrated the algorithm.

This algorithm has been implemented in some tools replacing the old 
yacc/Bison tools, because they generate huge DFA transition tables (and 
this was the reason why Lex and Yacc were maintained as separate tools, 
Lex using the single-state NFA approach with excessive backtracking, and 
Yacc/Bison trying to generate the full DFA transition tables.) : the 
first language to use this approach was the Purdue Univerity Compiler 
Construction Tools Set (PCCTS) which was initially written in C and is 
now fully written and supported in Java.

The Perl 6 extension for quantified capturing groups will have a slow 
adoption, as long as Perl will continue the slow single-state NFA 
approach with excessive backtracking, instead of the Aho/Sethi/Ullman 
(ASU) approach (that some are attributing to Thomson due to the article 
in 2007, but this is false) using simultaneously active states. But 
anyway, it already exists (and Perl developers are already working on 
rewriting the engine using the ASU approach).

But my suggstion is much more general, as it should not just apply to 
quantified capturing groups, but to any capturing group that is part of 
a subexpression which is quantified.

And the way I specified it, it does not depend on the way the engine is 
written (whever it uses a single-state NFA or multi-state NFA or 
generates and uses a DFA transition state with single-state like in 
Yacc/Bison), because capturing groups are just used to store position 
pairs, and regexp engines already have to count them for effectively 
matching the greedy or non-greedy quantifiers, so this immediately 
provides a usable index for storing at successive positions in a 
numbered array for captured groups.

The simple test case is effectively to try to match /(aa?)*a+/ with 
strings longer than a few dozens of 'a'.
msg94071 - (view) Author: Philippe Verdy (verdy_p) Date: 2009-10-15 03:37
Umm.... I saif that the attribution to Thompson was wrong, in fact it 
was correct. Thompson designed and documented the algorithm in 1968, 
long before the Aho/Seti/Ullman green book... so the algorithm is more 
than 40 years old, and still not in Python, Perl and PCRE (but it is 
present in GNU awk...)

The paper published in swtch.com is effectively written in 2007, but its 
conclusions are perfectly valid. The interesting aspect of this paper is 
that it demonstrates that the Thompson's multi-state NFA approach is 
still the best one, and better than what Perl, PCRE (and PHP), Python, 
Ruby and others are using, but that it can be also optimized further by 
caching the DFA construction "on the fly" (see the blue curve on the 
displayed diagram) while parsing the the already precompiled multi-state 
NFA.

The cache for DFA states will fill up while matching the regexp against 
actual strings, so it will be much faster (and much less memory-and-
time-greedy than generating the full DFA transition table at compilation 
time like in Bison/Yacc).

However the paper still does not discusses how to make the DFA states 
cache limited in size. Notably because the longer the input text will 
be, the more the DFA cache will contain DFA states. One simple rule is 
to limit the number of cached DFA states, and then to allow all stored 
transitions to go all multiple-states in the NFA, and optionally to a 
single DFA state in the cache.

Then the DFA cache can be used in a LIFO manner, to purge it 
automatically from unused states, in order to reuse them (for caching 
another new DFA state which is still not present in the cache, when the 
cache has reached its maximum size): if this occurs, the other existing 
DFA states that point to it must be cleaned (their DFA state pointer or 
reference, stored in their NFA or DFA transitions, must be cleared/set 
to null, so that they will only contain the list of pointers to outgoing 
NFA states). The problem is how to look for a multistate in the DFA 
cache: this requires some fast lookup, but this can be implemented in a 
fast way using hash tables (by hashing the list of NFA states 
represented in the DFA state).

Apparently, GNU awk does not use the cached DFA approach: it just uses 
the NFA directly when the input text is lower than two dozens of 
characters, then it builds the full DFA as soon as the input text 
becomes larger (this explains the sudden, but moderate increase of 
time). But I've seen that GNU awk has the defect of this unlimited DFA 
generation approach: its excessive use of memory when the input text 
increases, because the number of DFA states added to the cache is 
contantly growing with the input, and the time to match each characer 
from the input slowly increases too. At some point, it will crash and 
exit due to memory limits exhaustion, when no more DFA states can be 
stored. That's why the DFA cache MUST be maintained under some level.

I'll try to implement this newer approach first in Java (just because I 
better know this language than Python, and beacause I think it is more 
solid in terms of type-safety, so it can reduce the number of bugs to 
correct before getting something stable).

In Java, there's a clean way to automatically cleanup objects from 
collections, when they are no longer needed: you just need to use weak 
references (the garbage collector will automatically cleanup the older 
objects, when needed). But this approach is not easy to port, and in 
fact, if it can effectively solve some problems, it will still not 
forbif the cache to use up to the maximum VM size. for performance 
reasons, I see little interest in storing more than about 1 million DFA 
states in the cache (also because the hash table that would be used 
would be less efficient when looking up for the key of a NFA multi-state 
where the DFA state is stored). So I will probably not use weak 
references, but will first use a maximum size (even if weak references 
could help maintain the cache at even lower bounds than the allowed 
maximum, if VM memory size is more constrained: it is a good idea in all 
Java programs to allow caches introduced only for performance reasons to 
also collaborate with the garbage collector, in order to avoid the 
explosion of all caches used in various programs or libraries). I don't 
know if Python supports the concept of weak references for handling 
performance caches.

May be I'll port it later in Python, but don't expect that I'll port it 
to C/C++ (as a replacement of PCRE), because I now hate these unsafe 
languages despite I have practiced them for many years: others would do 
that for me, when I'll have published my Java implementation.
msg102080 - (view) Author: David Chambers (davidchambers) Date: 2010-04-01 10:55
I would find this functionality very useful. While I agree that it's often simpler to extract the relevant information in several steps, there are situations in which I'd prefer to do it all in one go.

The application I'm writing at the moment needs to extract metadata from text files. This metadata actually appears as text at the top of each file. For example:

title: Example title
tags: Django, Python, regular expressions

Example title
=============

Here is the first paragraph.

I had expected something like this to get the job done:

meta = re.match(r'(?ms)(?:^(\S+):\s*(.*?)$\n)+^\s*$', contents_of_file)

Ideally in this case, meta.groups() would return:

('title', 'Example title', 'tags', 'Django, Python, regular expressions')
msg121484 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-18 18:37
Earlier this week I discovered that .Net supports repeated capture and its API suggested a much cleaner approach than what Perl offered, so I'll be adding it to the regex module at:

    http://pypi.python.org/pypi/regex

The new methods will follow the example of .group() & co.

Given a match object m, m.group(i) returns the last match of group i (or None if there's no match), so I'll be adding m.captures(i) to return a tuple of the captures (an empty tuple if there's no match). I'll also be adding m.starts(i), m.ends(i) and m.spans(i).

The issue for this work is #2636.

Units tests are welcome.
msg214523 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-03-22 23:46
Can this be closed as has happened with numerous other issues as a result of work done on the new regex module via #2636?
History
Date User Action Args
2014-03-22 23:46:07BreamoreBoysetnosy: + BreamoreBoy
messages: + msg214523
2010-11-18 18:37:31mrabarnettsetmessages: + msg121484
2010-04-09 04:07:40ezio.melottisetstatus: open -> languishing
2010-04-01 10:55:13davidchamberssetnosy: + davidchambers
messages: + msg102080
2009-10-15 03:37:14verdy_psetmessages: + msg94071
2009-10-15 02:41:18verdy_psetmessages: + msg94068
2009-10-15 02:04:30verdy_psetmessages: + msg94067
2009-10-15 00:50:32mrabarnettsetnosy: + mrabarnett
messages: + msg94066
2009-10-15 00:45:29verdy_psetmessages: + msg94065
2009-10-15 00:26:55verdy_psetmessages: + msg94060
2009-10-15 00:22:40r.david.murraysetnosy: - r.david.murray
2009-10-15 00:13:58verdy_psetmessages: + msg94058
2009-10-15 00:00:18verdy_psetmessages: + msg94057
2009-10-14 23:52:10verdy_psetmessages: + msg94056
2009-10-14 23:44:01verdy_psetmessages: + msg94053
2009-10-14 23:28:30ezio.melottisetmessages: + msg94051
2009-10-14 23:26:57verdy_psetmessages: + msg94050
2009-10-14 23:00:22r.david.murraysetmessages: + msg94049
2009-10-14 22:53:36r.david.murraysetnosy: + r.david.murray

messages: + msg94047
stage: resolved
2009-10-14 22:52:59verdy_psetmessages: + msg94046
2009-10-14 22:47:29verdy_psetmessages: + msg94045
2009-10-14 22:33:56verdy_psetmessages: + msg94043
2009-10-14 22:29:25verdy_psetmessages: + msg94040
2009-10-14 22:26:46verdy_psetstatus: pending -> open

messages: + msg94038
2009-10-14 21:36:24ezio.melottisetstatus: open -> pending
resolution: rejected
messages: + msg94034
2009-10-14 21:01:23ezio.melottisetpriority: low
nosy: + ezio.melotti
2009-10-14 20:55:53verdy_psetmessages: + msg94030
2009-10-14 20:49:18verdy_psetmessages: + msg94029
2009-10-14 20:36:50verdy_psetmessages: + msg94025
2009-10-14 20:08:14verdy_pcreate