Issue 9529: Make re match object iterable

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53738

classification

Title:	Make re match object iterable
Type:	enhancement	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.5

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	MizardX, ezio.melotti, moreati, mrabarnett, serhiy.storchaka, timehorse
Priority:	normal	Keywords:	patch

Created on 2010-08-06 05:41 by MizardX, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
re_matchobj_iterable.patch	serhiy.storchaka, 2014-08-01 12:02		review

Messages (12)
msg113074 - (view)	Author: MizardX (MizardX)	Date: 2010-08-06 05:41
re.findall and re.finditer have very different signatures. One iterates over match objects, the other returns a list of tuples.

I can think of two ways to make them more similar:

1) Make match objects iterable over their captures. With this, you could write something like the following:

    for key, value in re.finditer(r'(\w+):(\w+)', text):
        data[key] = value

2) Make re.findall return an iterator over tuples. This would decrease the memory footprint.
msg113121 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2010-08-06 17:46
(1) would break existing code. It would also mean that you wouldn't have access to the start and end positions of the matches.

(2) would also break existing code which is expecting a list. It's like the change that happened when some methods which return a list in Python 2 return a generator in Python 3. I think it's too late now because we're already at Python 3.1.

If you want to reduce the memory footprint then you can still do:

    items = (m.groups() for m in re.finditer(r'(\w+):(\w+)', text))
    for key, value in items:
        data[key] = value
msg113170 - (view)	Author: MizardX (MizardX)	Date: 2010-08-07 12:28
I don't think (1) would break any code. finditer() would still generate match objects. The only time you would discard the match object is if you do a destructuring bind, e.g. in a loop. This shouldn't be unexpected for the programmer.
msg113189 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2010-08-07 18:45
Ah, I see what you mean. I still think you're wrong, though! :-)

What the 'for' loop is doing is basically this:

    it = re.finditer(r'(\w+):(\w+)', text)
    try:
        while True:
            match_object = next(it)
            # body of loop
    except StopIteration:
        pass

re.finditer() returns a generator which yields match objects.

What I think you're actually requesting (but not realising) is for the 'for' loop not just to iterate over the generator, but also over what the generator yields.

If you want re.finditer() to yield the groups then it has to return a generator which yields those groups, not match objects.
msg224486 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-08-01 11:21
I think MizardX means that match objects should be iterable. This will allow sequence unpacking.

>>> import re
>>> m = re.match(r'(\w+):(\w+)', 'qwerty:asdfgh')
>>> k, v = m
>>> k
'qwerty'
>>> v
'asdfgh'

This idea looks reasonable to me. Here is a simple preliminary patch which implements it.
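The unpacking behaviour described in this message can be approximated today without changing the re module. The following is only a sketch: `IterableMatch` is a hypothetical wrapper invented for illustration, it is not the attached patch, and it assumes that iteration should yield the same values as `groups()`.

```python
import re


class IterableMatch:
    """Hypothetical wrapper; not part of the re module or of the attached patch."""

    def __init__(self, match):
        self._match = match

    def __iter__(self):
        # Iterate over capturing groups 1..N, i.e. the values groups() returns.
        return iter(self._match.groups())

    def __getattr__(self, name):
        # Delegate start(), end(), span(), group(), etc. to the wrapped match.
        return getattr(self._match, name)


m = IterableMatch(re.match(r'(\w+):(\w+)', 'qwerty:asdfgh'))
k, v = m            # sequence unpacking goes through __iter__
print(k, v)
print(m.span())     # positions remain available on the same object
```

Because the wrapper delegates everything else to the real match object, the start and end positions stay accessible even when the object is also unpackable.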
msg224487 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2014-08-01 11:38
Match objects have a .groups method:

>>> import re
>>> m = re.match(r'(\w+):(\w+)', 'qwerty:asdfgh')
>>> m.groups()
('qwerty', 'asdfgh')
>>> k, v = m.groups()
>>> k
'qwerty'
>>> v
'asdfgh'
msg224489 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-08-01 12:02
Yes, but the purpose of this feature is to simplify the use of finditer() in the "for" loop.

>>> import re
>>> for k, v in re.finditer(r"(\w+):?(\w+)?", "ab:cd\nef\n"):
...     print(k, v)
...
ab cd
ef None

Currently you should either unpack manually:

    for m in re.finditer(...):
        k, v = m.groups()
        ...

This way doesn't work well with comprehensions.

Or use the operator module:

    import operator
    for k, v in map(operator.methodcaller('groups'), re.finditer(...)):
        ...

This way is too verbose and unclear.

Sorry, the previous version of the patch had a reference leak.
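For comparison, manual unpacking can be fitted into a comprehension by nesting a generator expression, at the cost of some readability. This is a minimal sketch using only existing re calls; the names `text`, `pattern`, and `pairs` are illustrative and do not come from the patch.

```python
import re

text = "ab:cd\nef\n"
pattern = re.compile(r"(\w+):?(\w+)?")

# The inner generator turns each match object into its tuple of groups,
# so the outer comprehension can unpack it as (k, v).
pairs = {k: v for k, v in (m.groups() for m in pattern.finditer(text))}
print(pairs)
```

An optional group that does not participate in the match is reported as None, so the second entry here maps 'ef' to None.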
msg224494 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-08-01 12:51
See also #19536.

I still think that if we do something about these issues, we should try to be compatible with the regex module. If we are going to add support for both iterability and __getitem__, they should be consistent, so that list(m) == [m[0], m[1], ..., m[N]]. This means that m[0] should be equal to m.group(0), rather than m.group(1).

Currently the Match object of the regex module supports __getitem__ (with m[0] == m.group(0)) but is not iterable:

>>> m = regex.match('([^:]+): (.*)', 'foo: bar')
>>> m[0], m[1], m[2]
('foo: bar', 'foo', 'bar')
>>> len(m)
3
>>> list(m)
TypeError: '_regex.Match' object is not iterable

I can see different possible solutions:

1) do what regex does, have m[X] == m.group(X) and live with m[0] == m.group(0) (this means that unpacking will be "_, key, value = m");
2) have m[0] == m.group(1), which makes unpacking easier, but is inconsistent with both m.group() and with what regex does;
3) disregard regex compatibility and implement what we think is best;

* since regex already has a few incompatibilities with re, a global flag/function could be added to regex to make it behave like the re module (where possible). If necessary, the re module could also include and ignore a similar flag/function. This would make interoperability between the two easier.
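The consistency requirement list(m) == [m[0], m[1], ..., m[N]] under option 1 can be illustrated with a small wrapper around a standard re match. This is a sketch only: `IndexableMatch` is a hypothetical class written for this illustration, and it is neither how the regex module is implemented nor what the attached patch does.

```python
import re


class IndexableMatch:
    """Hypothetical wrapper illustrating option 1 (m[X] == m.group(X))."""

    def __init__(self, match):
        self._match = match

    def __len__(self):
        # Group 0 (the whole match) plus every capturing group in the pattern.
        return self._match.re.groups + 1

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._match.group(index)

    def __iter__(self):
        # Iteration is defined through indexing, so the two cannot disagree.
        return (self[i] for i in range(len(self)))


m = IndexableMatch(re.match(r'([^:]+): (.*)', 'foo: bar'))
print(list(m) == [m[0], m[1], m[2]])
_, key, value = m   # the whole match has to be discarded when unpacking
print(key, value)
```

The leading `_` in the unpacking line is the cost of option 1 mentioned above; option 2 would drop it but make m[0] differ from m.group(0).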
msg224495 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-08-01 13:02
I think that if the regex module is adopted into the stdlib, not all of its features should be included. Regex is too complicated. In particular, indexing looks confusing (due to the ambiguity of the starting index and the redundant first item in unpacking). If we do not add support for indexing, there will be no incompatibility.

There is one more solution:

0) Reject both iterating and indexing.
msg224497 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-08-01 13:05
That's indeed another valid solution, even though having indexing and iterability would be convenient (assuming we can figure out a reasonable way to implement them).
msg224506 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2014-08-01 14:43
Why worry about the "new" regex module? It doesn't appear to be any closer to getting into the stdlib than it was when #2636 was first opened on 15th April 2008, so maybe Python 4.0?
msg224507 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2014-08-01 14:48
Even if it doesn't get included in the stdlib, some people might decide to switch from re to regex (or possibly vice versa) for their projects, so the closer they are the easier this will be. Another reason is that afaik the authors of the regex module made a conscious effort to maintain compatibility with re, so returning the favor is the least we can do.

History
Date	User	Action	Args
2022-04-11 14:57:04	admin	set	github: 53738
2019-12-07 09:31:53	serhiy.storchaka	set	keywords: + patch, - easy status: open -> closed resolution: rejected stage: patch review -> resolved
2019-12-07 07:06:21	serhiy.storchaka	set	keywords: + easy, - patch
2019-04-26 20:22:13	BreamoreBoy	set	nosy: - BreamoreBoy
2014-08-01 14:48:47	ezio.melotti	set	messages: + msg224507
2014-08-01 14:43:47	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg224506
2014-08-01 14:36:10	serhiy.storchaka	set	files: - re_matchobj_iterable.patch
2014-08-01 13:05:27	ezio.melotti	set	messages: + msg224497
2014-08-01 13:02:53	serhiy.storchaka	set	messages: + msg224495
2014-08-01 12:51:00	ezio.melotti	set	messages: + msg224494
2014-08-01 12:02:14	serhiy.storchaka	set	files: + re_matchobj_iterable.patch messages: + msg224489
2014-08-01 11:38:46	mrabarnett	set	messages: + msg224487
2014-08-01 11:21:24	serhiy.storchaka	set	files: + re_matchobj_iterable.patch type: behavior -> enhancement title: Converge re.findall and re.finditer -> Make re match object iterable keywords: + patch nosy: + serhiy.storchaka versions: + Python 3.5, - Python 3.2 messages: + msg224486 stage: needs patch -> patch review
2010-08-07 18:45:07	mrabarnett	set	messages: + msg113189
2010-08-07 12:28:55	MizardX	set	messages: + msg113170
2010-08-06 17:46:38	mrabarnett	set	messages: + msg113121
2010-08-06 05:52:35	ezio.melotti	set	nosy: + timehorse, ezio.melotti, mrabarnett, moreati stage: needs patch type: behavior versions: + Python 3.2
2010-08-06 05:41:03	MizardX	create