classification
Title: Completeness and symmetry in RE, avoid `findall(...)[0]`
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: apalala, ezio.melotti, mrabarnett, rhettinger, serhiy.storchaka, terry.reedy
Priority: normal Keywords:

Created on 2019-12-30 13:19 by apalala, last changed 2020-01-25 21:14 by rhettinger. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 17735 closed apalala, 2019-12-30 13:19
Messages (11)
msg359039 - (view) Author: Juancarlo Añez (apalala) * Date: 2019-12-30 13:19
The problematic `findall(...)[0]` is a common anti-pattern in Python programs. The reason is lack of symmetry and completeness in the `re` module.

The original proposal in `python-ideas` was to add `re.findfirst(pattern, string, flags=0, default=_mark)` with more or less the semantics of `next(findall(pattern, string, flags=flags), default=default)`. 

The referenced PR adds `findalliter(pattern, string, flags=0)` with the value semantics of `findall()` over a generator, implements `findall()` as `return list(findalliter(...))`, and implements `findfirst()`. 

Consistency and correctness are likely because all tests pass with the redefined `findall()`.
msg359153 - (view) Author: Juancarlo Añez (apalala) * Date: 2020-01-01 12:43
The discussion on python-ideas favored the inclusion of `findfirst()`. At any rate, not having a generator version of `findall()` is an important omission.

Another user's search of Github public repositories found that `findall(...)[0]` is prevalent. python-ideas agreed that the cause was the incompleteness/asymmetry in `re`.
msg359160 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-01-01 14:47
This was discussed recently on the Python-ideas mailing list.

https://mail.python.org/archives/list/python-ideas@python.org/thread/M2OZCN5C26YUJJ4EXLIIXHQBGF6IM5GW/#H3GURL35C7AZ3ZBK6CQZGGCISUZ42WDV

I agree with Raymond that this is an education problem. There are occurrences of `findall(...)[0]` in real code, but most of this code has not very high quality. In all cases re.search() can be used. In some cases the code can be rewritten in much more efficient way, using a single regular expression, for example:

-   if re.search('^(\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) & (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+) &amp: (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) & (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+) &amp: (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) & (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first = re.findall( '(\w+) (\w+)',parcel.owner )[0]
-   elif re.search('^(\w+) (\w+) (\w+) &amp: (\w+) (\w+) (\w+)$', parcel.owner):
-     last, first, middle = re.findall( '(\w+) (\w+) (\w+)', parcel.owner     )[0]
+   m = re.fullmatch('(\w+) (\w+)(?: (\w+))?(?: &(?: \w+){1,3})?', parcel.owner)
+   if m:
+     last, first, middle = m.groups()

But even using `findall(...)[0]` is not always so bad, because in many cases findall() returns a list containing a single string.

Adding re.findfirst() will not automatically fix all existing code which uses `findall(...)[0]`. This is an education problem, you need to teach people to use the more appropriate function, and if you can teach about re.findfirst(), why can't you teach about re.search()?
msg359278 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-01-04 06:19
I currently agree with Serhiy and am not persuaded that either new function is needed.

1. findalliter: For 3.0, I might have been more aggressive than we were in turning list-returning functions into iterators, as done with map, etc.  But the collective decision, led by Guido, was that for some, such as str.split, the efficiency gain was not worth the disruption.  The thought was that returned lists were typically 'small';  they are definitely finite (whereas map(func, infinite_iterator) is itself an infinite iterator).  It is too late to reverse that decision, and I would be wary of adding iterator versions.  There is no particular justification for this one.

2. findfirst: This seems to be a near duplicate of search.  

I don't know why people use findall(...)[0] instead of search (which I learned first), but I don't see 'lack of symmetry and completeness' as a reason. 'python-ideas' is a mailing list, not a person or even a defined group of people.  The respondents on any thread tend to be a biased selection of the community.  Such discussions suggest but do not determine.
msg359283 - (view) Author: Juancarlo Añez (apalala) * Date: 2020-01-04 12:46
There's no way to assert that `findall(...)[0]` is efficient enough in most cases. It is easy to see that that it is risky in every case, as runtime may be exponential, and memory O(len(input)). A mistake in the regular expression may easily result in an out-of-memory, which can only be debugged with a series of tests using `search()`.

A problem with `re.search(...)` is that id doesn't have the return value semantics of `findall(...)[0]`, and those semantics seem to be what appeal to Python programmers. It takes several lines of code (the ones in `findalliter()`) to have the same result as `findal(...)[0]` when using `search()`. `findall()` is the sole, lonely function in `re` with its return-value semantics.

Also this proposal embeds `first()` within the body of `findfirst(...)`, but by the implementation one should consider if `first()` shouldn't be part of `itertools`, perhaps with a different name, like `take_one()`.

One should also consider that although third-party extensions to `itertools` already provide the equivalent of `first()`, `findalliter()` and `findfirst()` do not belong there, and there are no mainstream third-party extensions to `re` where they would fit.
msg359285 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-01-04 13:14
Your implementation takes several lines of code only because of complex semantic of findall() which you want to preserve in findfirst(). Actually findall() is an equivalent of one of three expressions, depending on the number of capturing groups in the regular expression:

    [m[0] for m in finditer(...)]
    [m[1] for m in finditer(...)]
    [m.groups() for m in finditer(...)]

so findfirst() is an equivalent of one of three expressions:

    search(...)[0]
    search(...)[1]
    search(...).groups()

In real code the number of capturing groups in the regular expression you use is known statically, so you should not dispatch one of above variants at run time. You just use the appropriate one.

findall() is older interface. More modern, generator and object interface is finditer(). findall() was not deprecated and was not converted to returning an iterator because many old code uses it, and we do not want to break it.
msg359304 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2020-01-04 19:11
After reading Serhiy's clarification of the old and wonky findall return api, I agree that reusing  it in an unneeded new function would be a retrograde move.  We should instead upgrade the documentation of the more flexible current functions.

For search, give these examples of common uses:
  search(...)[0]  # Get entire matched string.
  search(...)[1]  # Get first captured group string.
  search(...).groups()  # Get tuple of captured group strings.

Note that the first two (unlike findfirst) may be used even when there are 1/many groups.  If one has a pattern with multiple groups, one might want different return values for different applications of the pattern. Or one may add a group to a pattern to use as a backreference in the matching process, not to mark the substring one want to see.

For find: list and explain finditer ***first***.  Then list and explain findall as the old less-flexible list interface, kept for back compatibility, that is equivalent to one of the following, depending on whether the pattern has 0, 1, or multiple groups.

    [m[0] for m in finditer(...)]  # List of full-match strings.
    [m[1] for m in finditer(...)]  # List of captured strings.
    [m.groups() for m in finditer(...)]  # List of tuples of strings.

To me, the above is clearer than "If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group."  The first clause implies tuples even with one group; the second denies that.  One must infer that for one group, the match is extracted from the tuple.

Again, using finditer is more flexible than using findall in that the return, which limited by the number of groups, is not determined by it.

Given that each function is documented twice, as method and combined module function, where should the expanded explanation go?  Currently, expanded explanations of search, match, and fullmatch are on the methods, which the same for split, sub, and escape are on the functions.  Instead, all should go on whichever set of listings is first.

I think https://docs.python.org/3/library/re.html#module-contents should be split in two sections: re.compile and compile flags; combined compile + method functions.  Then consider moving the functions below the method listings (and move explanations of all functions that are methods to the methods.  Or maybe interleave the methods and functions.
msg360186 - (view) Author: Juancarlo Añez (apalala) * Date: 2020-01-17 13:41
The analysis done by Terry bypasses the fact that `search(...)` returns `None` when there is no match, so indexing or calling methods in its result is not safe code. 

`findall()` returns an empty list when there is no match.

`findalliter()` returns an empty iterator when there is no match.

`findfirst()` may return a `default` value when there is no match.

If `search()` is proposed to replace `findall()[0]`, then the idiom has to be (depending on the case):

m[0] if (m := re.search(...)) else '0'
m.groups() if (m := re.search(...)) else '0'

In contrast, `findfirst()` returns a value that is the same as `findall()` when there is a match, or a `default` if there is no match.

m[0] if (m := re.findall(...)) else '0'

Compare with:

findfirst(..., default='0')
msg360187 - (view) Author: Juancarlo Añez (apalala) * Date: 2020-01-17 13:46
The bottom problem, as I see it, is that, historically, `re.search()` returns `None` when there is no match, instead of returning a `Match` object that is consistent with "no match" (evaluates to `False`, etc.)

The above seems too difficult to repair as so much existing code relies on those semantics (`if match is None` is the risky bit). 

Hence, `findall()`, `findalliter()`, and `findfirst()`.
msg360189 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-01-17 13:52
Most of examples do not test whether findall() returns an empty list. So there is no significant difference with using search() -- just different type of exception if fails (IndexError, TypeError or AttributeError). Since most examples do not handle errors, this will only affect a traceback if you use the script improperly.

If it is important to you, you can write (re.search(...) or [])[0] and get the same IndexError.
msg360712 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2020-01-25 21:14
I concur with Serhiy and Terry.  Marking this as "closed/rejected".  

Adding more-ways-to-it usually doesn't make users better-off (more to learn, more to remember, more to choose from, creating a new question about whether or not to use re.search(), creating another inter-version API difference, etc.)
History
Date User Action Args
2020-01-25 21:14:23rhettingersetstatus: open -> closed
resolution: rejected
messages: + msg360712

stage: resolved
2020-01-17 13:52:55serhiy.storchakasetmessages: + msg360189
2020-01-17 13:46:46apalalasetmessages: + msg360187
2020-01-17 13:41:51apalalasetmessages: + msg360186
2020-01-04 19:11:17terry.reedysetmessages: + msg359304
2020-01-04 13:14:30serhiy.storchakasetmessages: + msg359285
2020-01-04 12:46:26apalalasetmessages: + msg359283
2020-01-04 06:19:57terry.reedysetnosy: + terry.reedy
messages: + msg359278
2020-01-01 14:47:08serhiy.storchakasetmessages: + msg359160
2020-01-01 12:43:20apalalasetmessages: + msg359153
2020-01-01 02:10:44rhettingersetmessages: - msg359133
2019-12-31 22:02:17rhettingersetnosy: + rhettinger
messages: + msg359133
2019-12-30 13:29:42xtreaksetnosy: + ezio.melotti, mrabarnett, serhiy.storchaka
2019-12-30 13:19:05apalalacreate