Message 94051 - Python tracker


Author	ezio.melotti
Recipients	ezio.melotti, r.david.murray, verdy_p
Date	2009-10-14.23:28:30
Message-id	<1255562912.36.0.0208341925607.issue7132@psf.upfronthosting.co.za>
In-reply-to

Content

> You're wrong, it WILL be compatible, because it is only conditioned
> by a FLAG.

Sorry, I missed that you mentioned the flag already in the first
message, but what I said in 1), 3) and 4) is still valid.

> There are plenty of other more complex cases for which we really
> need to capture the multiple occurrences of a capturing group within
> a repetition.

Can you provide an example where your solution is better than the
other available ways of doing it? Over the years a lot of extensions
have been added to regex engines; if no one has added this one, it is
probably because these problems can already be solved in other ways.

> I'm NOT asking you how to parse it using MULTIPLE regexps and
> functions. 
> Of course you can, but this is a distinct problem, but certainly NOT
> a general solution (your solution using split() will NOT work with
> really A LOT of other regular expressions).

Even with your solution, in most cases you will need additional steps
to assemble the results (at least in the cases with some kind of
separator, where you have to join the first element with the ones that
follow).

I can see a very limited set of hypothetical corner cases where your
proposal may save a few lines of code, but I don't think it's worth
implementing all this just for them.
An example (using the proposed re.R flag) could be:
>>> re.match(r'^([0-9A-F]{2}){4} ([a-z]\d){5}$', '3FB52A0C a2c4g3k9d3',
...          re.R).groups()
(['3F', 'B5', '2A', '0C'], ['a2', 'c4', 'g3', 'k9', 'd3'])
but it's not really a real-world case; if you have a real-world
example I'd like to see it.
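
For comparison, here is a minimal sketch of how the same two lists can be
built with the existing re module and no new flag. The function name
repeated_groups is made up for illustration; the idea is to capture each
whole repetition with an outer group and then re-apply the inner pattern
with findall():

```python
import re

def repeated_groups(text):
    # Hypothetical helper, not part of the re module.
    # The outer groups capture each complete repetition as one string.
    m = re.match(r'^((?:[0-9A-F]{2}){4}) ((?:[a-z]\d){5})$', text)
    if m is None:
        return None
    hex_part, pair_part = m.groups()
    # Re-apply the inner patterns to recover every single occurrence.
    return (re.findall(r'[0-9A-F]{2}', hex_part),
            re.findall(r'[a-z]\d', pair_part))

print(repeated_groups('3FB52A0C a2c4g3k9d3'))
# (['3F', 'B5', '2A', '0C'], ['a2', 'c4', 'g3', 'k9', 'd3'])
```

This costs one extra findall() call per repeated group, which is the
"additional step" mentioned above.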

> In addition, your suggested regexp for IPv4:
> '^(\d{1,3})(?:\.(\d{1,3})){3}$'
> is completely WRONG !

That's why I wrote 'without checking if they are in range(256)'; the
fact that this regex also matches out-of-range numbers was not relevant
in my example (and it's usually easier to convert the digits to int and
check if 0 <= digits <= 255). :)
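
A rough sketch of that "match first, check the range afterwards" approach
follows. The names IPV4_RE and parse_ipv4 are invented for this example,
and the pattern spells out the four groups explicitly so that each one is
available from groups():

```python
import re

# Illustrative pattern: four groups of 1-3 digits separated by dots.
# It deliberately does not try to enforce the 0-255 range itself.
IPV4_RE = re.compile(r'^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$')

def parse_ipv4(text):
    # Hypothetical helper: returns a list of four ints, or None.
    m = IPV4_RE.match(text)
    if m is None:
        return None
    octets = [int(g) for g in m.groups()]
    # The range check is done on integers, not inside the regex.
    if all(0 <= o <= 255 for o in octets):
        return octets
    return None

print(parse_ipv4('192.168.0.1'))   # [192, 168, 0, 1]
print(parse_ipv4('256.1.1.1'))     # None
```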


>> 1) it doesn't exist in any other implementation that I know;
>
> That's exactly why I proposed to discuss it with the developers of
> other implementations (I cited PCRE, Perl and PHP developers, there
> are others).

So maybe this is not the right place to ask.

>> 3) it will be a proprietary extension and it will reduce the
>> compatibility with other implementations;
>
> Already suggested above. This will however NOT affect the
> compatibility of existing implementation that don't have the R flag.

What I meant is that a regex that uses the re.R flag in Python won't
work in other languages/implementations because they don't have it, and
a "general" regex (e.g. for an IPv6 address) will have to be
adapted/rewritten in order to take advantage of re.R.

>> 4) I can't think to any real word situation where this would be 
>> really useful.
>
> There are really a lot ! Using multiple split operations and multiple 
> parsing on partly parsed regular expressions will not be a solution 
> in many situations (think about how you would perform matches and 
> using  them that in 'vi' or 'ed' with a single
> "s/regexp/replacement/flag" instruction, if there's no extension
> with a flag and a syntax for accessing the individual elements the 
> replacement string).

Usually, when the text to be parsed starts to be too complex, it is
better to use another approach, e.g. using a real parser or dividing the
text into smaller units and working on them independently. Even if re.R
could make this easier, I would still prefer to have a few more lines of
code that do different things than a single big regex that does
everything.
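
As a small illustration of "dividing the text into smaller units", this
sketch splits on a separator first and then validates each piece with a
tiny regex. The names HEX_PAIR and parse_hex_pairs, and the ':' separated
input format, are assumptions chosen only for the example:

```python
import re

# Illustrative pattern for one unit: exactly two hex digits.
HEX_PAIR = re.compile(r'[0-9A-Fa-f]{2}')

def parse_hex_pairs(text, sep=':'):
    # Hypothetical helper: step 1 splits the text into small units,
    # step 2 checks each unit on its own with a simple pattern.
    parts = text.split(sep)
    if all(HEX_PAIR.fullmatch(p) for p in parts):
        return parts
    return None

print(parse_hex_pairs('3F:B5:2A:0C'))   # ['3F', 'B5', '2A', '0C']
print(parse_hex_pairs('3F:B5:ZZ'))      # None
```

Each step does one thing, so it can be changed or tested without touching
the other.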

> And anyway, my suggestion is certainly much more useful than atomic 
> groups and possessive groups that have much lower use [...]

Then why has no one implemented it yet? :)

History
Date	User	Action	Args
2009-10-14 23:28:32	ezio.melotti	set	recipients: + ezio.melotti, r.david.murray, verdy_p
2009-10-14 23:28:32	ezio.melotti	set	messageid: <1255562912.36.0.0208341925607.issue7132@psf.upfronthosting.co.za>
2009-10-14 23:28:30	ezio.melotti	link	issue7132 messages
2009-10-14 23:28:30	ezio.melotti	create