msg113074 - (view) |
Author: MizardX (MizardX) |
Date: 2010-08-06 05:41 |
re.findall and re.finditer have very different signatures. One iterates over match objects, the other returns a list of tuples.
I can think of two ways to make them more similar:
1) Make match objects iterable over their captures. With this, you could write something like the following:
for key,value in re.finditer(r'(\w+):(\w+)', text):
    data[key] = value
2) Make re.findall return an iterator over tuples. This would decrease the memory footprint.
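To make the difference concrete, here is a minimal sketch of what the two functions return today. The sample text and variable names are made up for illustration.

```python
import re

text = "a:1 b:2"
pattern = r"(\w+):(\w+)"

# findall builds the complete list of group tuples up front.
pairs = re.findall(pattern, text)

# finditer lazily yields match objects; the groups have to be
# pulled out of each match explicitly.
matches = list(re.finditer(pattern, text))
groups_from_matches = [m.groups() for m in matches]
```

Both routes end up with the same tuples; only the intermediate objects and the laziness differ.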
|
msg113121 - (view) |
Author: Matthew Barnett (mrabarnett) * |
Date: 2010-08-06 17:46 |
(1) would break existing code. It would also mean that you wouldn't have access to the start and end positions of the matches.
(2) would also break existing code which is expecting a list. It's like the change that happened when some methods which return a list in Python 2 return a generator in Python 3. I think it's too late now because we're already at Python 3.1. If you want to reduce the memory footprint then you can still do:
items = (m.groups() for m in re.finditer(r'(\w+):(\w+)', text))
for key,value in items:
    data[key] = value
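The point about positions can be illustrated with a short sketch: a match object carries start() and end(), which a bare tuple of groups does not. The sample text is an assumption chosen for the example.

```python
import re

text = "a:1 b:2"

# Each match object knows where it matched, in addition to its groups.
spans = [
    (m.group(1), m.start(), m.end())
    for m in re.finditer(r"(\w+):(\w+)", text)
]
```

Here spans holds the key of each pair together with the offsets of the whole match in text.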
|
msg113170 - (view) |
Author: MizardX (MizardX) |
Date: 2010-08-07 12:28 |
I don't think (1) would break any code. finditer() would still generate match-objects.
The only time you would discard the match object is if you try to do a destructuring bind, e.g. in a loop. This shouldn't be unexpected for the programmer.
|
msg113189 - (view) |
Author: Matthew Barnett (mrabarnett) * |
Date: 2010-08-07 18:45 |
Ah, I see what you mean. I still think you're wrong, though! :-)
What the 'for' loop is doing is basically this:
it = re.finditer(r'(\w+):(\w+)', text)
try:
    while True:
        match_object = next(it)
        # body of loop
except StopIteration:
    pass
re.finditer() returns a generator which yields match objects.
What I think you're actually requesting (but not realising) is for the 'for' loop not just to iterate over the generator, but also over what the generator yields.
If you want re.finditer() to yield the groups then it has to return a generator which yields those groups, not match objects.
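A generator that yields groups instead of match objects could be sketched as follows. The helper name iter_groups is hypothetical and not part of the re module.

```python
import re

def iter_groups(pattern, text, flags=0):
    """Yield the tuple of captured groups for each match (hypothetical helper)."""
    for match in re.finditer(pattern, text, flags):
        yield match.groups()

data = {}
for key, value in iter_groups(r"(\w+):(\w+)", "a:1 b:2"):
    data[key] = value
```

With such a wrapper the 'for' loop can unpack directly, at the cost of losing access to the match objects themselves.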
|
msg224486 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-08-01 11:21 |
I think MizardX means that match objects should be iterable. This would allow sequence unpacking.
>>> import re
>>> m = re.match(r'(\w+):(\w+)', 'qwerty:asdfgh')
>>> k, v = m
>>> k
'qwerty'
>>> v
'asdfgh'
This idea looks reasonable to me. Here is a simple preliminary patch which implements it.
|
msg224487 - (view) |
Author: Matthew Barnett (mrabarnett) * |
Date: 2014-08-01 11:38 |
Match objects have a .groups method:
>>> import re
>>> m = re.match(r'(\w+):(\w+)', 'qwerty:asdfgh')
>>> m.groups()
('qwerty', 'asdfgh')
>>> k, v = m.groups()
>>> k
'qwerty'
>>> v
'asdfgh'
|
msg224489 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-08-01 12:02 |
Yes, but the purpose of this feature is to simplify the use of finditer() in
the "for" loop.
>>> import re
>>> for k, v in re.finditer(r"(\w+):?(\w+)?", "ab:cd\nef\n"):
...     print(k, v)
...
ab cd
ef None
Currently you have to either unpack manually:
for m in re.finditer(...):
    k, v = m.groups()
    ...
This way doesn't work well with comprehensions.
Or use the operator module:
import operator
for k, v in map(operator.methodcaller('groups'), re.finditer(...)):
    ...
This way is too verbose and unclear.
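For comparison, a comprehension can be written without the proposed feature by nesting a generator expression that calls groups(). This is only a sketch of the current workaround, reusing the pattern and sample text from the example above.

```python
import re

text = "ab:cd\nef\n"

# The inner generator turns each match object into its tuple of groups,
# so the outer dict comprehension can unpack key and value directly.
pairs = {
    k: v
    for k, v in (m.groups() for m in re.finditer(r"(\w+):?(\w+)?", text))
}
```

It works, but the extra layer is the kind of noise the proposal aims to remove.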
Sorry, the previous version of the patch had a reference leak.
|
msg224494 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2014-08-01 12:51 |
See also #19536.
I still think that if we do something about these issues, we should try to be compatible with the regex module.
If we are going to add support for both iterability and __getitem__, they should be consistent, so that list(m) == [m[0], m[1], ..., m[N]].
This means that m[0] should be equal to m.group(0), rather than m.group(1).
Currently the Match object of the regex module supports __getitem__ (with m[0] == m.group(0)) but is not iterable:
>>> m = regex.match('([^:]+): (.*)', 'foo: bar')
>>> m[0], m[1], m[2]
('foo: bar', 'foo', 'bar')
>>> len(m)
3
>>> list(m)
TypeError: '_regex.Match' object is not iterable
I can see different possible solutions:
1) do what regex does, have m[X] == m.group(X) and live with m[0] == m.group(0) (this means that unpacking will be "_, key, value = m");
2) have m[0] == m.group(1), which makes unpacking easier, but is inconsistent with both m.group() and with what regex does; *
3) disregard regex compatibility and implement what we think is best;
* since regex already has a few incompatibilities with re, a global flag/function could be added to regex to make it behave like the re module (where possible). If necessary, the re module could also include and ignore a similar flag/function. This would make interoperability between the two easier.
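For reference, the standard re module in Python 3.6 and later behaves like option 1 for indexing while leaving match objects non-iterable. The following sketch assumes such an interpreter and shows what that means for unpacking.

```python
import re

m = re.match(r"([^:]+): (.*)", "foo: bar")

# Since Python 3.6, m[i] is equivalent to m.group(i),
# so m[0] is the whole match rather than the first group.
whole, key, value = m[0], m[1], m[2]

# Match objects are still not iterable, so direct unpacking fails.
try:
    k, v = m
    unpacked = True
except TypeError:
    unpacked = False
```

Unpacking therefore still has to go through m.groups().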
|
msg224495 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2014-08-01 13:02 |
I think that if the regex module is adopted in the stdlib, not all of its features should be included. Regex is too complicated. In particular, indexing looks confusing (due to the ambiguity of the starting index and the redundant first item in unpacking). If we do not add support for indexing, there will be no incompatibility.
There is yet another solution:
0) Reject both iterating and indexing.
|
msg224497 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2014-08-01 13:05 |
That's indeed another valid solution, even though having indexing and iterability would be convenient (assuming we can figure out a reasonable way to implement them).
|
msg224506 - (view) |
Author: Mark Lawrence (BreamoreBoy) * |
Date: 2014-08-01 14:43 |
Why worry about the "new" regex module? It doesn't appear to be any closer to getting into the stdlib than it was when #2636 was first opened on 15th April 2008, so maybe Python 4.0?
|
msg224507 - (view) |
Author: Ezio Melotti (ezio.melotti) * |
Date: 2014-08-01 14:48 |
Even if it doesn't get included in the stdlib, some people might decide to switch from re to regex (or possibly vice versa) for their projects, so the closer they are the easier this will be. Another reason is that afaik the authors of the regex module made a conscious effort to maintain compatibility with re, so returning the favor is the least we can do.
|
|
Date | User | Action | Args |
2022-04-11 14:57:04 | admin | set | github: 53738 |
2019-12-07 09:31:53 | serhiy.storchaka | set | keywords: + patch, - easy; status: open -> closed; resolution: rejected; stage: patch review -> resolved |
2019-12-07 07:06:21 | serhiy.storchaka | set | keywords: + easy, - patch |
2019-04-26 20:22:13 | BreamoreBoy | set | nosy: - BreamoreBoy |
2014-08-01 14:48:47 | ezio.melotti | set | messages: + msg224507 |
2014-08-01 14:43:47 | BreamoreBoy | set | nosy: + BreamoreBoy; messages: + msg224506 |
2014-08-01 14:36:10 | serhiy.storchaka | set | files: - re_matchobj_iterable.patch |
2014-08-01 13:05:27 | ezio.melotti | set | messages: + msg224497 |
2014-08-01 13:02:53 | serhiy.storchaka | set | messages: + msg224495 |
2014-08-01 12:51:00 | ezio.melotti | set | messages: + msg224494 |
2014-08-01 12:02:14 | serhiy.storchaka | set | files: + re_matchobj_iterable.patch; messages: + msg224489 |
2014-08-01 11:38:46 | mrabarnett | set | messages: + msg224487 |
2014-08-01 11:21:24 | serhiy.storchaka | set | files: + re_matchobj_iterable.patch; type: behavior -> enhancement; title: Converge re.findall and re.finditer -> Make re match object iterable; keywords: + patch; nosy: + serhiy.storchaka; versions: + Python 3.5, - Python 3.2; messages: + msg224486; stage: needs patch -> patch review |
2010-08-07 18:45:07 | mrabarnett | set | messages: + msg113189 |
2010-08-07 12:28:55 | MizardX | set | messages: + msg113170 |
2010-08-06 17:46:38 | mrabarnett | set | messages: + msg113121 |
2010-08-06 05:52:35 | ezio.melotti | set | nosy: + timehorse, ezio.melotti, mrabarnett, moreati; stage: needs patch; type: behavior; versions: + Python 3.2 |
2010-08-06 05:41:03 | MizardX | create | |