Issue10139
Created on 2010-10-18 22:26 by tzot, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (6) | |||
---|---|---|---|
msg119088 - (view) | Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * | Date: 2010-10-18 22:25 | |
This is based on this StackOverflow answer: http://stackoverflow.com/questions/3957164/3963443#3963443. It also applies to Python 2.6.

Searching for a regular expression that satisfies the mentioned SO question (a regular expression that matches strings with an initial A and/or final Z and returns everything except said initial A and final Z), I discovered something that I consider a bug. I've tried to thoroughly verify that this is not a PEBCAK before reporting the issue here.

Given:

    >>> import re
    >>> text= 'A***Z'

then:

    >>> re.compile('(?<=^A).*(?=Z$)').search(text).group(0) # regex_1
    '***'
    >>> re.compile('(?<=^A).*').search(text).group(0) # regex_2
    '***Z'
    >>> re.compile('.*(?=Z$)').search(text).group(0) # regex_3
    'A***'
    >>> re.compile('(?<=^A).*(?=Z$)|(?<=^A).*').search(text).group(0) # regex_1|regex_2
    '***'
    >>> re.compile('(?<=^A).*(?=Z$)|.*(?=Z$)').search(text).group(0) # regex_1|regex_3
    'A***'
    >>> re.compile('(?<=^A).*|.*(?=Z$)').search(text).group(0) # regex_2|regex_3
    'A***'
    >>> re.compile('(?<=^A).*(?=Z$)|(?<=^A).*|.*(?=Z$)').search(text).group(0) # regex_1|regex_2|regex_3
    'A***'

regex_1 returns '***'. Based on the documentation (http://docs.python.org/py3k/library/re.html#regular-expression-syntax), I assert that, likewise, '***' should be returned by:

    regex_1|regex_2
    regex_1|regex_3
    regex_1|regex_2|regex_3

And yet, regex_3 ( ".*(?=Z$)" ) seems to take precedence over both regex_1 and regex_2, even though it's the last alternative. This happens even if I substitute "(?:regex_n)" for every "regex_n", so it's not a matter of precedence.

I really hope that this is a PEBCAK; if that is true, I apologize for any time lost on the issue by anyone; but I really don't think it is. |
|||
msg119090 - (view) | Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * | Date: 2010-10-18 22:38 | |
For completeness' sake, I also provide the "(?:regex_n)" results:

    >>> text= 'A***Z'
    >>> re.compile('(?:(?<=^A).*(?=Z$))').search(text).group(0) # regex_1
    '***'
    >>> re.compile('(?:(?<=^A).*)').search(text).group(0) # regex_2
    '***Z'
    >>> re.compile('(?:.*(?=Z$))').search(text).group(0) # regex_3
    'A***'
    >>> re.compile('(?:(?<=^A).*(?=Z$))|(?:(?<=^A).*)').search(text).group(0) # regex_1|regex_2
    '***'
    >>> re.compile('(?:(?<=^A).*(?=Z$))|(?:.*(?=Z$))').search(text).group(0) # regex_1|regex_3
    'A***'
    >>> re.compile('(?:(?<=^A).*)|(?:.*(?=Z$))').search(text).group(0) # regex_2|regex_3
    'A***'
    >>> re.compile('(?:(?<=^A).*(?=Z$))|(?:(?<=^A).*)|(?:.*(?=Z$))').search(text).group(0) # regex_1|regex_2|regex_3
    'A***' |
|||
msg119125 - (view) | Author: Georg Brandl (georg.brandl) * | Date: 2010-10-19 07:33 | |
I'm not sure this is valid. First, I think I have a much easier example:

    >>> import re
    >>> re.search('bc|abc', 'abc').group()
    'abc'

I assume you'd expect this to give 'bc' as well. However, for a string s, "search" looks for matches starting at s, then at s[1:], then s[2:], and so on. For s, it tries both branches, and the second branch matches.

This can be inferred from the docs of "search": """Scan through string looking for a location where the regular expression pattern produces a match;""". At the first location, a match is produced by the second branch. |
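The scanning behaviour described above can be sketched in a few lines. This is only an illustration of the observable semantics, not how the re engine is implemented; the helper name `first_match_by_scanning` is hypothetical. It relies on the `pos` argument of a compiled pattern's `match` method, which, unlike slicing, still lets `^` and lookbehind see the real beginning of the string.

```python
import re


def first_match_by_scanning(pattern, s):
    """Try every start position from left to right; the first hit wins.

    Hypothetical helper that mimics what re.search() is documented to do.
    """
    compiled = re.compile(pattern)
    for start in range(len(s) + 1):
        # match() anchors the attempt at `start`; all alternatives of the
        # pattern are tried there before the scan moves one position on.
        m = compiled.match(s, start)
        if m is not None:
            return m
    return None


# 'bc' cannot match at index 0, but 'abc' can, so the scan stops at index 0.
print(first_match_by_scanning('bc|abc', 'abc').group())  # abc

# regex_1|regex_3 from the report: only the second branch matches at index 0.
print(first_match_by_scanning('(?<=^A).*(?=Z$)|.*(?=Z$)', 'A***Z').group())  # A***
```

The order of the alternatives only matters among branches that can match at the same start position; an earlier start position always wins first.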
|||
msg119126 - (view) | Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * | Date: 2010-10-19 07:50 | |
As I see it, it's more like:

    >>> re.search('a.*c|a.*|.*c', 'abc').group()

producing 'bc' instead of 'abc'. Substitute "(?<=^A)" for "a" and "(?=Z$)" for "c" in the pattern above.

In your example, the first part ('bc') does not match the whole string ('abc'). In my example, the first part ('(?<=^A).*(?=Z$)') matches the whole string ('A***Z'). |
|||
msg119127 - (view) | Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * | Date: 2010-10-19 07:54 | |
Georg, please re-open it. Focus on the difference between example regex_1|regex_2 (both matching; regex_1 is used as it should be), and regex_1|regex_3 (both matching; regex_3 is used incorrectly). |
|||
msg119128 - (view) | Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * | Date: 2010-10-19 08:11 | |
No, my mistake, you did well to close it. The more explicit version of the explanation: both regex_1 and regex_2 actually start matching at index 1, while regex_3 starts matching at index 0. |
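The start indices mentioned in the last message can be checked directly with `Match.start()` and `Match.span()`. A minimal sketch, reusing the text and patterns from the report; the variable names `patterns`, `starts` and `combined` are illustrative only.

```python
import re

text = 'A***Z'

patterns = {
    'regex_1': r'(?<=^A).*(?=Z$)',
    'regex_2': r'(?<=^A).*',
    'regex_3': r'.*(?=Z$)',
}

# Index at which each individual pattern starts matching.
starts = {name: re.search(p, text).start() for name, p in patterns.items()}
print(starts)  # {'regex_1': 1, 'regex_2': 1, 'regex_3': 0}

# In regex_1|regex_3, only regex_3 can match at index 0, so it is used
# even though it is listed last.
combined = re.search(patterns['regex_1'] + '|' + patterns['regex_3'], text)
print(combined.span(), combined.group())  # (0, 4) A***
```

Because the lookbehind `(?<=^A)` consumes no characters, a match of regex_1 or regex_2 can only begin at index 1, after the scan has already succeeded at index 0 with regex_3.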
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:07 | admin | set | github: 54348 |
2010-10-19 08:11:32 | tzot | set | messages: + msg119128 |
2010-10-19 07:54:25 | tzot | set | messages: + msg119127 |
2010-10-19 07:50:13 | tzot | set | messages: + msg119126 |
2010-10-19 07:33:25 | georg.brandl | set | status: open -> closed nosy: + georg.brandl messages: + msg119125 resolution: wont fix |
2010-10-18 22:38:15 | tzot | set | messages: + msg119090 |
2010-10-18 22:26:00 | tzot | create |