classification
Title: re.findall() documentation lacks information about finding THE LAST iteration of repeated capturing group (greedy)
Type:
Stage:
Components: Documentation
Versions: Python 3.4
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: docs@python
Nosy List: Mateusz.Dobrowolny, docs@python, gvanrossum
Priority: normal
Keywords:

Created on 2014-09-07 12:35 by Mateusz.Dobrowolny, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (4)
msg226534 - Author: Mateusz Dobrowolny (Mateusz.Dobrowolny) Date: 2014-09-07 12:35
Python 3.4.1, Windows.
help(re.findall) shows me:
findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.
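
A minimal sketch of the three return shapes the docstring describes. The sample string and patterns are made up for illustration and are not part of the original report.

```python
import re

sample = "a1 b2 c3"  # hypothetical input, chosen only for illustration

# No capturing group: each item is the text of the whole match.
no_group = re.findall(r"[a-z]\d", sample)

# One capturing group: each item is the text of that group.
one_group = re.findall(r"([a-z])\d", sample)

# Two capturing groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"([a-z])(\d)", sample)

print(no_group)    # ['a1', 'b2', 'c3']
print(one_group)   # ['a', 'b', 'c']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```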

It seems that information about repeated (greedy) capturing groups, i.e. (regular_expression)*, is missing.
Please take a look at my example:

-------------EXAMPLE-------------
import re

text = 'To configure your editing environment, use the Editor settings page and its child pages. There is also a ' \
       'Quick Switch Scheme command that lets you change color schemes, themes, keymaps, etc. with a couple of ' \
       'keystrokes.'
print('Text to be searched: \n' + text)
print('\nSearching method: re.findall()')

# One capturing group, not repeated: findall() returns that group for each match.
regexp_result = re.findall(r'\w+(\s+\w+)', text)
print("\nRegexp rule: r'\\w+(\\s+\\w+)' \nFound: " + str(regexp_result))
print(r'This works as expected: findall() returns a list of groups (\s+\w+), and the groups are from non-overlapping matches.')

# The same group, now repeated with *: only its last iteration is captured.
regexp_result = re.findall(r'\w+(\s+\w+)*', text)
print("\nHow about making the group greedy? Here we go: \nRegexp rule: r'\\w+(\\s+\\w+)*' \nFound: " + str(regexp_result))
print('This is a little bit unexpected for me: findall() returns THE LAST MATCHING group only, parsing from left to right.')

# An extra outer group captures the whole match; findall() now returns tuples.
regexp_result_list = re.findall(r'(\w+(\s+\w+)*)', text)
first_group = [whole for whole, _last in regexp_result_list]
print("\nThe solution is to put an extra group around the whole RE: \nRegexp rule: r'(\\w+(\\s+\\w+)*)' \nFound: " + str(first_group))
print('So finally I can get all strings I am looking for, just like expected from the FINDALL method, by accessing first elements in tuples.')
----------END OF EXAMPLE-------------
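
For reference, a condensed sketch of the same three patterns on a shorter string, with the results written out. The sample string is an assumption made for brevity; the patterns are the ones from the example above.

```python
import re

sample = "one two three, four five"  # hypothetical, shorter than the original text

# Group used once: each match is two words, and the group holds the second one.
once = re.findall(r"\w+(\s+\w+)", sample)

# Group repeated with *: each match spans a whole run of words,
# but the group keeps only its last iteration.
repeated = re.findall(r"\w+(\s+\w+)*", sample)

# Outer group added: each item is (whole match, last iteration of the inner group).
wrapped = re.findall(r"(\w+(\s+\w+)*)", sample)

print(once)      # [' two', ' five']
print(repeated)  # [' three', ' five']
print(wrapped)   # [('one two three', ' three'), ('four five', ' five')]
```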


I found the solution when practicing on this page:
http://regex101.com/#python
Entering:
REGULAR EXPRESSION: \w+(\s+\w+)*
TEST STRING: To configure your editing environment, use the Editor settings page and its child pages. There is also a Quick Switch Scheme command that lets you change color schemes, themes, keymaps, etc. with a couple of keystrokes.

it showed me on the right side with nice color-coding:
1st Capturing group (\s+\w+)*
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
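
The non-capturing alternative mentioned in the note could look like the sketch below. The sample string is made up, and the re.finditer() variant is an additional option that the note itself does not mention.

```python
import re

sample = "one two three, four five"  # hypothetical input for illustration

# Non-capturing group (?:...): with no capturing groups in the pattern,
# findall() returns the whole matches.
whole = re.findall(r"\w+(?:\s+\w+)*", sample)

# finditer() yields match objects; group(0) is the whole match
# regardless of any capturing groups in the pattern.
via_iter = [m.group(0) for m in re.finditer(r"\w+(\s+\w+)*", sample)]

print(whole)     # ['one two three', 'four five']
print(via_iter)  # ['one two three', 'four five']
```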




I think some information about repeated groups should also be included in the Python documentation.

BTW: I have one extra question.
Searching for 'findall' in this tracker I found this issue:
http://bugs.python.org/issue3384

It looks like the information about ordering is no longer in the 3.4.1 documentation. Shouldn't it be there?

Kind Regards
msg226543 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2014-09-07 20:30
Do you have a specific sentence or paragraph in mind that could be added?

Be aware that help() just shows what's in the docstring, which is typically abbreviated.  The full docs are on docs.python.org.  Can you find what you need there?
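
A small sketch of reading that same docstring directly, assuming the interpreter was not started with -OO (which strips docstrings):

```python
import inspect
import re

# inspect.getdoc() returns the cleaned-up docstring of the object,
# which is the text help(re.findall) is based on.
doc = inspect.getdoc(re.findall)

# Print only the summary line of the docstring.
print(doc.splitlines()[0])
```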
msg226567 - Author: Mateusz Dobrowolny (Mateusz.Dobrowolny) Date: 2014-09-08 10:15
The official documentation
https://docs.python.org/3/library/re.html?highlight=findall#re.findall
does in fact contain more information, in particular what is mentioned in http://bugs.python.org/issue3384.

Regarding my issue: I am afraid it was my misunderstanding. It looks like a repeated capturing group in a regular expression always keeps only the LAST match, and Python's re.findall returns what it is documented to return: the list of groups.
And since I repeat a captured group, I get only the last match.
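
A minimal sketch of that behaviour on a single match object. The pattern and input are made up for illustration; the second step is one possible way to recover every iteration, not something the thread prescribes.

```python
import re

# The group (\w) is repeated three times over "abc".
m = re.match(r"(\w)+", "abc")

whole = m.group(0)  # the whole match
last = m.group(1)   # only the last iteration of the repeated group

# To see every iteration, apply the group's own pattern to the whole match.
iterations = re.findall(r"\w", whole)

print(whole)       # abc
print(last)        # c
print(iterations)  # ['a', 'b', 'c']
```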

More on this, for example, here:
http://www.regular-expressions.info/captureall.html

I was learning regexps yesterday, and I first reported this without knowing everything about capturing groups.

If returning the last match for a repeated capturing group is defined by regular expressions themselves, then there is no need to mention it in the Python documentation...
msg226592 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2014-09-08 17:01
Then let's close this issue.
History
Date                 User                Action  Args
2022-04-11 14:58:07  admin               set     github: 66549
2014-09-08 17:01:52  gvanrossum          set     status: open -> closed
                                                 resolution: not a bug
                                                 messages: + msg226592
2014-09-08 10:15:45  Mateusz.Dobrowolny  set     messages: + msg226567
2014-09-07 20:30:31  gvanrossum          set     nosy: + gvanrossum
                                                 messages: + msg226543
2014-09-07 12:49:20  Mateusz.Dobrowolny  set     title: re.findall() documentation lacks information about finding THE LAST iteration of reoeated capturing group (greedy) -> re.findall() documentation lacks information about finding THE LAST iteration of repeated capturing group (greedy)
2014-09-07 12:35:32  Mateusz.Dobrowolny  create