classification
Title: Make re match object iterable
Type: enhancement
Stage: resolved
Components: Regular Expressions
Versions: Python 3.5
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To:
Nosy List: MizardX, ezio.melotti, moreati, mrabarnett, serhiy.storchaka, timehorse
Priority: normal
Keywords: patch

Created on 2010-08-06 05:41 by MizardX, last changed 2019-12-07 09:31 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
re_matchobj_iterable.patch serhiy.storchaka, 2014-08-01 12:02 review
Messages (12)
msg113074 - (view) Author: MizardX (MizardX) Date: 2010-08-06 05:41
re.findall and re.finditer have very different signatures. One iterates over match objects, the other returns a list of tuples.

I can think of two ways to make them more similar:

1) Make match objects iterable over their captures. With this, you could write something like the following:

for key,value in re.finditer(r'(\w+):(\w+)', text):
  data[key] = value

2) Make re.findall return an iterator over tuples. This would decrease the memory footprint.
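
To make the difference concrete, here is a minimal sketch. The sample `text` and the variable names are invented for illustration and are not taken from the report.

```python
import re

# Hypothetical sample input, invented for this illustration.
text = "alpha:one beta:two"
pattern = r"(\w+):(\w+)"

# findall() builds the whole result list up front; with two groups in the
# pattern, each element is a tuple of the captured strings.
found = re.findall(pattern, text)

# finditer() is lazy and yields match objects, which also carry positions.
matches = list(re.finditer(pattern, text))
spans = [m.span() for m in matches]
```

Calling `m.groups()` on each match object gives the same tuples that `findall()` returns for this pattern.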
msg113121 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2010-08-06 17:46
(1) would break existing code. It would also mean that you wouldn't have access to the start and end positions of the matches.

(2) would also break existing code which is expecting a list. It's like the change that happened when some methods which return a list in Python 2 return a generator in Python 3. I think it's too late now because we're already at Python 3.1. If you want to reduce the memory footprint then you can still do:

items = (m.groups() for m in re.finditer(r'(\w+):(\w+)', text))
for key,value in items:
    data[key] = value
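
A self-contained version of the same workaround might look like the sketch below; the sample `text` is an assumption made up for the example.

```python
import re

text = "alpha:one beta:two"  # hypothetical sample input
data = {}

# The generator expression pulls one match at a time from finditer(), so no
# intermediate list of tuples is built.
items = (m.groups() for m in re.finditer(r"(\w+):(\w+)", text))
for key, value in items:
    data[key] = value
```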
msg113170 - (view) Author: MizardX (MizardX) Date: 2010-08-07 12:28
I don't think (1) would break any code. finditer() would still generate match-objects.

The only time you would discard the match object is if you do a destructuring bind, e.g. in a loop. This shouldn't be unexpected for the programmer.
msg113189 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2010-08-07 18:45
Ah, I see what you mean. I still think you're wrong, though! :-)

What the 'for' loop is doing is basically this:

    it = re.finditer(r'(\w+):(\w+)', text)
    try:
        while True:
            match_object = next(it)
            # body of loop
    except StopIteration:
        pass

re.finditer() returns a generator which yields match objects.

What I think you're actually requesting (but not realising) is for the 'for' loop not just to iterate over the generator, but also over what the generator yields.

If you want re.finditer() to yield the groups then it has to return a generator which yields those groups, not match objects.
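
As a rough sketch of one loop step with a tuple target, assuming a CPython where match objects are not iterable (the sample `text` is invented for illustration):

```python
import re

text = "alpha:one"  # hypothetical sample input
it = re.finditer(r"(\w+):(\w+)", text)

# `for key, value in it:` performs this kind of assignment on each iteration.
match_object = next(it)
try:
    key, value = match_object  # unpacking needs iter(match_object) to work
    unpack_error = None
except TypeError as exc:
    unpack_error = exc  # raised because the match object is not iterable

# Unpacking the groups explicitly works.
key, value = match_object.groups()
```

So the unpacking is applied to each object the iterator yields, and it only succeeds if that object is itself iterable.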
msg224486 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-08-01 11:21
I think MizardX means that match object should be iterable. This will allow sequence unpacking.

>>> import re
>>> m = re.match(r'(\w+):(\w+)', 'qwerty:asdfgh')
>>> k, v = m
>>> k
'qwerty'
>>> v
'asdfgh'

This idea looks reasonable to me. Here is a simple preliminary patch which implements it.
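
The requested behaviour can be emulated in pure Python. The sketch below is not the attached patch: `IterableMatch` and `iter_matches` are hypothetical names, and it assumes that iterating a match should yield the same values as `m.groups()`.

```python
import re


class IterableMatch:
    """Hypothetical wrapper; not part of the re module or of the patch."""

    def __init__(self, match):
        self.match = match

    def __iter__(self):
        # Iterate over the captured groups only (group 0 is left out).
        return iter(self.match.groups())


def iter_matches(pattern, string):
    """Hypothetical helper: like re.finditer(), but yields wrappers."""
    for m in re.finditer(pattern, string):
        yield IterableMatch(m)


k, v = IterableMatch(re.match(r"(\w+):(\w+)", "qwerty:asdfgh"))
pairs = [(key, value) for key, value in iter_matches(r"(\w+):(\w+)", "a:b c:d")]
```

The wrapped match stays reachable as `.match`, so start and end positions are still available when the object is not unpacked.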
msg224487 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2014-08-01 11:38
Match objects have a .groups method:

>>> import re
>>> m = re.match(r'(\w+):(\w+)', 'qwerty:asdfgh')
>>> m.groups()
('qwerty', 'asdfgh')
>>> k, v = m.groups()
>>> k
'qwerty'
>>> v
'asdfgh'
msg224489 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-08-01 12:02
Yes, but the purpose of this feature is to simplify the use of finditer() in the "for" loop.

>>> import re
>>> for k, v in re.finditer(r"(\w+):?(\w+)?", "ab:cd\nef\n"):
...     print(k, v)
... 
ab cd
ef None

Currently you should either unpack manually:

    for m in re.finditer(...):
        k, v = m.groups()
        ...

This way doesn't work well with comprehensions.

Or use the operator module:

    import operator
    for k, v in map(operator.methodcaller('groups'), re.finditer(...)):
        ...

This way is too verbose and unclear.
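
For comparison, here are the existing spellings in runnable form, using the pattern and input from the session above. The variable names are illustrative, and the nested generator expression is one more option that does fit in a comprehension.

```python
import operator
import re

pattern = r"(\w+):?(\w+)?"
text = "ab:cd\nef\n"

# Manual unpacking inside the loop body.
manual = []
for m in re.finditer(pattern, text):
    k, v = m.groups()
    manual.append((k, v))

# methodcaller("groups") calls m.groups() on each match as map() advances.
get_groups = operator.methodcaller("groups")
via_map = [(k, v) for k, v in map(get_groups, re.finditer(pattern, text))]

# A nested generator expression also works inside a comprehension.
as_dict = {k: v for k, v in (m.groups() for m in re.finditer(pattern, text))}
```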

Sorry, the previous version of the patch had a reference leak.
msg224494 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-08-01 12:51
See also #19536.

I still think that if we do something about these issues, we should try to be compatible with the regex module.

If we are going to add support for both iterability and __getitem__, they should be consistent, so that list(m) == [m[0], m[1], ..., m[N]].
This means that m[0] should be equal to m.group(0), rather than m.group(1).

Currently the Match object of the regex module supports __getitem__ (with m[0] == m.group(0)) but is not iterable:
>>> m = regex.match('([^:]+): (.*)', 'foo: bar')
>>> m[0], m[1], m[2]
('foo: bar', 'foo', 'bar')
>>> len(m)
3
>>> list(m)
TypeError: '_regex.Match' object is not iterable

I can see different possible solutions:
1) do what regex does, have m[X] == m.group(X) and live with m[0] == m.group(0) (this means that unpacking will be "_, key, value = m");
2) have m[0] == m.group(1), which makes unpacking easier, but is inconsistent with both m.group() and with what regex does; *
3) disregard regex compatibility and implement what we think is best;


* since regex already has a few incompatibilities with re, a global flag/function could be added to regex to make it behave like the re module (where possible).  If necessary, the re module could also include and ignore a similar flag/function.  This would make interoperability between the two easier.
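
The trade-off in option 1 can be tried with the stdlib alone on Python 3.6 and later, where match objects accept m[i] as shorthand for m.group(i). This is a minimal sketch with illustrative variable names:

```python
import re

m = re.match(r"([^:]+): (.*)", "foo: bar")

# m[i] is shorthand for m.group(i), so index 0 is the whole match.
indexed = [m[i] for i in range(m.re.groups + 1)]

# groups() starts at group 1, which is what "key, value = ..." wants.
key, value = m.groups()

# With indexing that starts at group 0, unpacking needs a throwaway slot.
_, key2, value2 = indexed
```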
msg224495 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-08-01 13:02
I think that if the regex module is adopted in the stdlib, not all of its features should be included. Regex is too complicated. In particular, indexing looks confusing (due to the ambiguity of the starting index and the redundant first item in unpacking). If we do not add support for indexing, there will be no incompatibility.

There is yet another solution:

0) Reject both iterating and indexing.
msg224497 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-08-01 13:05
That's indeed another valid solution, even though having indexing and iterability would be convenient (assuming we can figure out a reasonable way to implement them).
msg224506 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-08-01 14:43
Why worry about the "new" regex module?  It doesn't appear to be any closer to getting into the stdlib than it was when #2636 was first opened on 15th April 2008, so maybe Python 4.0?
msg224507 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-08-01 14:48
Even if it doesn't get included in the stdlib, some people might decide to switch from re to regex (or possibly vice versa) for their projects, so the closer they are the easier this will be.  Another reason is that afaik the authors of the regex module made a conscious effort to maintain compatibility with re, so returning the favor is the least we can do.
History
Date User Action Args
2019-12-07 09:31:53 serhiy.storchaka set keywords: + patch, - easy; status: open -> closed; resolution: rejected; stage: patch review -> resolved
2019-12-07 07:06:21 serhiy.storchaka set keywords: + easy, - patch
2019-04-26 20:22:13 BreamoreBoy set nosy: - BreamoreBoy
2014-08-01 14:48:47 ezio.melotti set messages: + msg224507
2014-08-01 14:43:47 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg224506
2014-08-01 14:36:10 serhiy.storchaka set files: - re_matchobj_iterable.patch
2014-08-01 13:05:27 ezio.melotti set messages: + msg224497
2014-08-01 13:02:53 serhiy.storchaka set messages: + msg224495
2014-08-01 12:51:00 ezio.melotti set messages: + msg224494
2014-08-01 12:02:14 serhiy.storchaka set files: + re_matchobj_iterable.patch; messages: + msg224489
2014-08-01 11:38:46 mrabarnett set messages: + msg224487
2014-08-01 11:21:24 serhiy.storchaka set files: + re_matchobj_iterable.patch; type: behavior -> enhancement; title: Converge re.findall and re.finditer -> Make re match object iterable; keywords: + patch; nosy: + serhiy.storchaka; versions: + Python 3.5, - Python 3.2; messages: + msg224486; stage: needs patch -> patch review
2010-08-07 18:45:07 mrabarnett set messages: + msg113189
2010-08-07 12:28:55 MizardX set messages: + msg113170
2010-08-06 17:46:38 mrabarnett set messages: + msg113121
2010-08-06 05:52:35 ezio.melotti set nosy: + timehorse, ezio.melotti, mrabarnett, moreati; stage: needs patch; type: behavior; versions: + Python 3.2
2010-08-06 05:41:03 MizardX create