Python 3.4.1, Windows.
help(re.findall) shows me:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
It seems like there is missing information regarding greedy groups, i.e. (regular_expression)*
Please take a look at my example:
-------------EXAMPLE-------------
import re
text = 'To configure your editing environment, use the Editor settings page and its child pages. There is also a ' \
'Quick Switch Scheme command that lets you change color schemes, themes, keymaps, etc. with a couple of ' \
'keystrokes.'
print('Text to be searched: \n' + text)
print('\nSarching method: re.findall()')
regexp_result = re.findall(r'\w+(\s+\w+)', text)
print('\nRegexp rule: r\'\w+(\s+\w+)\' \nFound: ' + str(regexp_result))
print('This works as expected: findall() returns a list of groups (\s+\w+), and the groups are from non-overlapping matches.')
regexp_result = re.findall(r'\w+(\s+\w+)*', text)
print('\nHow about making the group greedy? Here we go: \nRegexp rule: r\'\w+(\s+\w+)*\' \nFound: ' + str(regexp_result))
print('This is a little bit unexpected for me: findall() returns THE LAST MATCHING group only, parsing from-left-to-righ.')
regexp_result_list = re.findall(r'(\w+(\s+\w+)*)', text)
first_group = list(i for i, j in regexp_result_list)
print('\nThe solution is to put an extra group aroung the whole RE: \nRegexp rule: r\'(\w+(\s+\w+)*)\' \nFound: ' + str(first_group))
print('So finally I can get all strings I am looking for, just like expected from the FINDALL method, by accessing first elements in tuples.')
----------END OF EXAMPLE-------------
I found the solution when practicing on this page:
http://regex101.com/#python
Entering:
REGULAR EXPRESSION: \w+(\s+\w+)*
TEST STRING: To configure your editing environment, use the Editor settings page and its child pages. There is also a Quick Switch Scheme command that lets you change color schemes, themes, keymaps, etc. with a couple of keystrokes.
it showed me on the right side with nice color-coding:
1st Capturing group (\s+\w+)*
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
I think some information regarding repeated groups should be included as well in Python documentation.
BTW: I have one extra question.
Searching for 'findall' in this tracker I found this issue:
http://bugs.python.org/issue3384
It looks like information about ordering information is no longer in 3.4.1 documentation. Shouldn't this be there?
Kind Regards
|