Title: Clarify the documentation of re.findall()
Components: Documentation Versions: Python 3.11, Python 3.10, Python 3.9
PR 27849 merged serhiy.storchaka, 2021-08-20 08:22
PR 27879 merged miss-islington, 2021-08-22 07:24
PR 27880 merged miss-islington, 2021-08-22 07:24
Messages (13)
msg399799 - (view) Author: Rondevous (rondevous) Date: 2021-08-17 22:24
Can it please be hinted in the docs of re.findall to use (?:...) for non-capturing groups?

>>> re.findall('(foo)?bar|cool', 'cool')
### I expected the result: ['cool']

After hours of frustration, I learnt that I should use a non-capturing group (?:foo) in the pattern. This was not obvious.

P.S. Making the groups non-capturing in such a pattern is not needed in javascript (as I tested). Could this be an issue with the | operator in re.findall?
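A minimal sketch of the behaviour reported above, using the pattern from this message. The variable names are illustrative only. It compares the capturing form with the non-capturing form on the same input:

```python
import re

text = 'cool'

# One capturing group: findall returns what group 1 captured, not the whole match.
# Group 1 did not take part in the match ('cool' came from the right-hand
# alternative), so it is reported as an empty string.
with_capture = re.findall(r'(foo)?bar|cool', text)
print(with_capture)      # ['']

# Non-capturing group: the pattern has no groups, so findall returns whole matches.
without_capture = re.findall(r'(?:foo)?bar|cool', text)
print(without_capture)   # ['cool']
```

The | operator behaves the same in both calls; only what findall reports differs.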
msg399907 - (view) Author: Vedran Čačić (veky) * Date: 2021-08-19 11:22
It currently says:

...matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups...

I'm not quite sure how it could be clearer. Maybe "Alternatively" at the start of the second sentence?

regexr does the same thing, as far as I can see. Match is 'cool', group 1 is empty. Matches are not the same as groups.
msg399908 - (view) Author: Vedran Čačić (veky) * Date: 2021-08-19 11:28
Ah, now I see. When is called, the whole match is returned. So match can be considered kinda group (quasigroup?:). I see how it can be confusing: python usually starts indexing at 0, and someone might think that a .group(0) would be included in "a list of groups" returned.

I'm not sure how best to fix it. Maybe: Alternatively, if grouping parentheses are present in the pattern, return a list of groups captured by them...
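To illustrate the distinction between the whole match and the captured groups, here is a small sketch using re.search (chosen here only for illustration) with the pattern from the original report:

```python
import re

m = re.search(r'(foo)?bar|cool', 'cool')

whole = m.group(0)    # the whole match, sometimes thought of as "group 0"
first = m.group(1)    # the only capturing group; it did not participate
groups = m.groups()   # only the capturing groups, numbered from 1

print(whole)    # cool
print(first)    # None
print(groups)   # (None,)

# findall reports the capturing groups, never group 0, and it turns a
# non-participating group into an empty string.
print(re.findall(r'(foo)?bar|cool', 'cool'))   # ['']
```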
msg399976 - (view) Author: Rondevous (rondevous) Date: 2021-08-20 16:01
To clarify in short: the pattern I mentioned doesn't give the result I expected in re.findall() unlike

Given pattern:  (foo)?bar|cool

Maybe my approach in testing the regex first using and then using re.findall() to return all matches was wrong.

Initially, after going through help(re) I had associated re.findall with the 'global' flag used in javascript regex which would return all the matches. Without the global flag (in javascript) only the first match is returned, like in python.
msg399977 - (view) Author: Rondevous (rondevous) Date: 2021-08-20 16:02
From my understanding, "|" should match either the RegEx on the left or the RegEx on the right of the pipe
>>> help(re):
        "|"      A|B, creates an RE that will match either A or B.

With re.search, the pattern below matches 'cool' as well as 'foo'
>>> re.search('(foo)|cool?', 'foo bar cool foobar coolbar')
<re.Match object; span=(0, 3), match='foo'>
>>> re.search('(foo)|cool?', 'cool')
<re.Match object; span=(0, 4), match='cool'>

But, the same pattern and strings won't match 'cool' if used with re.findall() or re.finditer() because of how they work when capture-groups are present in the pattern.
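A short sketch of the difference described here, assuming re.search is the function used for the single-match calls (the function name is missing from the snippet above). The pattern finds 'cool' in both cases; with a capturing group present, re.findall reports only the group's content:

```python
import re

pattern = r'(foo)|cool?'
text = 'foo bar cool foobar coolbar'

# A single search returns a match object whose group(0) is the whole match.
first_match = re.search(pattern, text).group(0)
cool_match = re.search(pattern, 'cool').group(0)
print(first_match)   # foo
print(cool_match)    # cool

# findall locates the same four matches, but reports group 1 for each one,
# so the two 'cool' matches show up as empty strings.
found = re.findall(pattern, text)
print(found)         # ['foo', '', 'foo', '']
```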
msg399978 - (view) Author: Rondevous (rondevous) Date: 2021-08-20 16:02
To produce the same results that you'd get by using the global flag in javascript regex, and to make re.findall return the whole matches instead of only the captured groups, all the groups in the pattern need to be of the non-capturing (?:) type.

If the distinction about capturing and non-capturing groups is mentioned in the docs of re.findall, it would help those who have learnt regex from another language (like javascript), where the global flag in regex is allowed.

I want the docs of re.findall and re.finditer to somehow suggest the use of (?:group) to return the original matches and not the captured groups.
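As a rough illustration of how the number of capturing groups changes what re.findall returns, here is a sketch with a made-up key=value pattern (the pattern and sample string are hypothetical, not taken from this report):

```python
import re

text = 'a=1 b=22'

# Two capturing groups: findall returns a list of tuples, one item per group.
pairs = re.findall(r'(\w+)=(\d+)', text)
print(pairs)    # [('a', '1'), ('b', '22')]

# Same pattern with non-capturing groups: findall returns the whole matches.
whole = re.findall(r'(?:\w+)=(?:\d+)', text)
print(whole)    # ['a=1', 'b=22']
```

With the non-capturing form, the result is closest to what a global regex match returns in javascript.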
msg399979 - (view) Author: Rondevous (rondevous) Date: 2021-08-20 16:03
Maybe the functionality of re.findall and re.finditer is limited because, e.g. I can't do something like this:

The workaround for doing that might need me to eventually write a parser O_O
msg399980 - (view) Author: Vedran Čačić (veky) * Date: 2021-08-20 16:59
Have you seen the patch? In the patched docs, non-capturing grouping is explicitly mentioned. (Though I myself wouldn't include even that, as it's superfluous with what's said before, obviously it's needed.:)
msg399981 - (view) Author: Vedran Čačić (veky) * Date: 2021-08-20 17:01
Also, maybe you should read the following sentence (also in the docs):

> If one wants more information about all matches of a pattern than the matched text, finditer() is useful as it provides match objects instead of strings.

It seems that's what you wanted in the first place.
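A minimal sketch of that suggestion, reusing the pattern and string from the earlier message. re.finditer yields match objects, so the whole match stays available even though the pattern contains a capturing group:

```python
import re

pattern = r'(foo)|cool?'
text = 'foo bar cool foobar coolbar'

matches = list(re.finditer(pattern, text))

whole = [m.group(0) for m in matches]      # whole matched text
spans = [m.span() for m in matches]        # (start, end) of each match
captured = [m.group(1) for m in matches]   # group 1, None when it did not participate

print(whole)      # ['foo', 'cool', 'foo', 'cool']
print(spans)      # [(0, 3), (8, 12), (13, 16), (20, 24)]
print(captured)   # ['foo', None, 'foo', None]
```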
msg400050 - (view) Author: Rondevous (rondevous) Date: 2021-08-22 04:42
Oops, I was wrong about re.finditer :D
Sorry, I think I didn't check that properly.

Just saw the changes. The patch looks good :)

Thanks a lot!
msg400052 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-08-22 07:24
New changeset 64f9e7b19dc1603fcbd07c17c9860085b9d21465 by Serhiy Storchaka in branch 'main':
bpo-44940: Clarify the documentation of re.findall() (GH-27849)
msg400054 - (view) Author: miss-islington (miss-islington) Date: 2021-08-22 07:45
New changeset 519bcc698c436e12bd6c1ff6f2517060719c60d5 by Miss Islington (bot) in branch '3.10':
bpo-44940: Clarify the documentation of re.findall() (GH-27849)
msg400085 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-08-22 18:15
New changeset d006392245c904547e5727144235c2f9d7948e96 by Miss Islington (bot) in branch '3.9':
bpo-44940: Clarify the documentation of re.findall() (GH-27849) (GH-27880)
