Adding a new regex module (compatible with re) #46888
I am working on adding features to the current Regexp implementation. I will be posting regular patch updates to this thread when major
More items may come and suggestions are welcome.

-----

Currently, I have code which implements 5) and 7), and have done some work

In a few days, I will provide a patch with my interim results and will |
I am very sorry to report (at least for me) that as of this moment, item

It is this current conclusion that greatly saddens me, not that the

Anyway, all that being said, and keeping in mind that I am not 100%

Old Engine: 6.574s

This makes the old Engine 665ms faster over the entire first test_re.py |
Here are the modifications so far for item 9) in _sre.c plus some small |
Here is a patch to implement item 7) |
This simple patch adds (?P#...)-style comment support. |
Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?
because this also adds to the scope.
(2) and (3) would both be nice, but I'm not sure it makes sense to do
[handles parens in comments without turning on verbose, but is slower]

Why? It adds another incompatibility, so it has to be very useful or
Be careful on those, particularly on str/unicode and different compile options. |
5.10.0 comes after 5.8 and is the latest version (2007/12/18)! |
Thanks Jim for your thoughts! Amaury has already explained about Perl 5.10.0. I suppose it's like
At this point the only python-specific changes I am proposing would be
Well, I think named matches are better than numbered ones, so I'd
Well, Larry Wall and Guido agreed long ago that we, the python

As for speed, this all occurs in the parser and does not affect the

Verbose is generally a good idea for anything more than a trivial

r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't. That expression only matches "He ls)llo", so I created the (?P#...) to

r'He(?P# 2 (TWO) ls)llo' matches "Hello".
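For context, the limitation is easy to reproduce with the stdlib re. Note that in current Pythons the truncated comment actually produces a compile error (unbalanced parenthesis) rather than the "He ls)llo" match described above:

```python
import re

# A ')' inside a (?#...) comment ends the comment early: here the
# comment is just ' 2 (TWO', and the leftover ' ls)llo' is parsed as
# pattern text, so compilation fails on the stray ')'.
try:
    re.compile(r'He(?# 2 (TWO) ls)llo')
except re.error as exc:
    print('compile failed:', exc)

# Without the inner ')', the comment is skipped and 'Hello' matches.
print(bool(re.match(r'He(?# 2 TWO ls)llo', 'Hello')))  # True
```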
Will do; thanks for the advice! I have only observed the UNICODE flag

At some point, I hope to get my current changes on Launchpad if I can |
Python 2.6 isn't the last, but Guido has said that there won't be a 2.10.
I may be misunderstanding -- isn't this just a matter of writing the
Cool -- that reference should probably be added to the docs. For someone

Definitely put the example in the doc.
even without the change, as doco on the current situation.

Does VERBOSE really have to be the first flag, or does it just have to be on

I'm not sure I fully understand what you said about template. Is this a |
I don't know anything about regexp implementation, but if you replace a

So you might try to do the cleanup while keeping the switch-case |
Thank you and Merci Antoine! That is a good point. It is clearly specific to the compiler whether a |
I am making my changes in a Bazaar branch hosted on Launchpad. It took

Anyway, if anyone is interested in monitoring my progress, it is https://code.launchpad.net/~timehorse/

I will still post major milestones here, but one can monitor day-to-day

Thanks again for all the advice! |
I am finally making progress again, after a month of changing my |
AFAIK if you have a regex with named capture groups there is no direct

d = {v: k for k, v in match.groupdict().items()}
for i in range(1, len(match.groups()) + 1):
    print(i, match.group(i), d.get(match.group(i)))

One possible solution would be a grouptuples() function that returned a

Anyway, good luck with all your improvements, I will be especially glad |
Mark scribbled:
Hmm. Well, that's not a bad idea at all IMHO and would, AFAICT probably

My preference right now is to finish off the test cases for (7) because

Anyway, thanks for the input, Mark! |
Well, it's time for another update on my progress...

Some good news first: Atomic Grouping is now completed, tested and

Now, I want to also update my list of items. We left off at 11:

Other
16-1) Implement the FIXME such that if m is a MatchObject, del m.string

-----

Finally, I want to say a couple notes about Item 2:

Firstly, as noted in Item 14, I wish to add support for UNICODE match

Secondly, there is a FIXME which I discussed in Item 16; I gave that

Finally, I would like suggestions on how to handle name collisions when

I have 3 proposals as to how to handle this:

a) Simply disallow the exposure of match group name attributes if the

b) Expose the reserved names through a special prefix notation, and for

c) Don't expose the names directly; only expose them through a prefixed

Personally, I like a) because if Item 3 is implemented, it makes a fairly

-----

Now, rather than posting umpteen patch files I am posting one bz2-

bpo-2636(-\d\d|+\d\d)*(-only)?.patch

For instance, bpo-2636-01.patch is the p1 patch that is a difference between the

As noted above, Items 01, 02, 05, 07 and 12 should be considered more or |
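As a point of comparison for stdlib users: atomic grouping only reached re itself in Python 3.11 as (?>...), but it can be emulated in any version with the lookahead-plus-backreference trick (toy pattern chosen by me for the demo):

```python
import re

s = 'aaab'
# Plain greedy matching: a+ grabs 'aaa', then backtracks one 'a'
# so the trailing 'ab' can still match.
print(bool(re.search(r'a+ab', s)))          # True
# Emulated atomic group (?>a+): the lookahead captures the run of a's
# and the backreference consumes exactly that capture; lookarounds are
# never re-entered on backtracking, so nothing is given back and the
# match fails -- exactly atomic-group behaviour.
print(bool(re.search(r'(?=(a+))\1ab', s)))  # False
```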
Sorry, as I stated in the last post, I generated the patches then realized |
[snip]
:-) [snip]
I don't like the prefix ideas and now that you've spelt it out I don't

------------------------------------------------------------
It isn't hard to work round but it did highlight the fact that you can't |
Thanks for weighing in Mark! Actually, your point is valid and quite
d) (As Mark suggested) we drop Item 2 completely. I have not invested

e) Add an option, re.MATCH_ATTRIBUTES, that is used as a Match Creation

I really like the idea of e), so I'm taking Item 2 out of the "ready for

How does that sound to you, Mark and anyone else who wishes to weigh in? |
[snip]

It seems to me that both using a special prefix or adding an option are

The nice thing about (3) (even without slicing) is that it seems a v.

BTW I just noticed this:

'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!r}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!s}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!a}".format(rx)

That's fair enough, but maybe for !s the output should be rx.pattern? |
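For what it's worth, the pattern text is already reachable without changing __str__, via the compiled pattern's .pattern attribute:

```python
import re

rx = re.compile(r'He(ll)o')
# Pull the pattern text out explicitly instead of relying on str(rx):
print("{0.pattern}".format(rx))      # He(ll)o
print("pattern is %r" % rx.pattern)  # pattern is 'He(ll)o'
```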
See also bpo-3825. |
Update 16 Sep 2008: Based on the work for issue bpo-3825, I would like to simply update the
9-1) Single-loop Engine redesign that runs 8% slower than current.

9-1-1) 3-loop Engine redesign that runs 10% slower than current. [Complete]

9-2) Matthew Barnett's Engine redesign as per issue bpo-3825
16-1) Allow for the disassociation of a source string from a Match_Type,
---

Now, we have a combination of Items 1, 9-2 and 17 available in issue

I sadly admit I have made no progress on this since June because

01 is the child of bpo-2636

Which all seems rather simple until you wrap your head around:

01+09+10 is the child of 01, 09, 10, 01+09, 01+10 AND 09+10!

Keep in mind the reason for all this complex numbering is because many

Anyway, that's the state of things; this is me, signing out! |
Comparing item 2 and item 3, I think that item 3 is the Pythonic choice

Item 4: back-references in the pattern are like \1 and (?P=name), not |
Thanks for weighing in Matthew!

Yeah, I do get some flack for item 2 because originally item 3 wasn't

Your interpretation of 4 matches mine, though, and I would definitely

Oh, and as I suggested in bpo-3825, I have these new item proposals:

Item 18: Add a re.REVERSE, re.R (?r) flag for reversing the direction of

Item 19: Make various in-line flags positionally dependent, for example

Item 20: Allow the negation of in-line flags to cancel their effect in

Item 21: Allow for scoped flagged expressions, i.e. (?i:...), where the

Item 22: Zero-width regular expression split: when splitting via a

Item 23: Character class ranges over case-insensitive matches, i.e. does

And I shall create a bazaar repository for your current development line

Anyway, great work Matthew and I look forward to working with you on |
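For the record, Item 21's scoped inline flags did eventually land in the stdlib: since Python 3.6, re accepts groups like (?i:...) whose flag applies only inside the group. A quick demonstration:

```python
import re

# (?i:...) makes only the group case-insensitive (stdlib re, 3.6+).
pat = re.compile(r'a(?i:b)c')
print(bool(pat.fullmatch('aBc')))  # True: 'b' is inside the scoped flag
print(bool(pat.fullmatch('Abc')))  # False: 'a' outside remains case-sensitive
```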
Regarding item 22: there's also bpo-1647489 ("zero-length match confuses

This had me stumped for a while, but I might have a solution. I'll see

I wasn't planning on doing any more major changes on my branch, just |
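As a historical footnote on Item 22: zero-width splitting did eventually become the stdlib behaviour. Since Python 3.7, re.split honours empty matches (earlier versions skipped or rejected them):

```python
import re

# Split between a lowercase and an uppercase letter: the pattern
# matches only the empty string, which re.split accepts since 3.7.
parts = re.split(r'(?<=[a-z])(?=[A-Z])', 'fooBarBaz')
print(parts)  # ['foo', 'Bar', 'Baz']
```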
Not that it matters in any way, but if the regex semantics have to be distinguished via "non-standard" custom flags, I would prefer even less wordy flags: the short in-pattern forms should be one letter (like all the other flags) and preferably based on underlying plain English words, to give some mnemonics, which I don't see in the numbered versions that require one to keep track of the rather internal library versioning. |
Matthew Barnett wrote:
Seems reasonable to me. +1 |
Not sure if this is better as a separate feature request or a comment here, but... the new version of .NET includes an option to specify a time limit on the evaluation of regexes (not sure if this is a feature in other regex libs). This would be useful especially when you're executing regexes configured by the user and you don't know if/when they might go exponential. Something like this maybe:

# Raises an re.Timeout if not complete within 60 seconds
match = myregex.match(mystring, maxseconds=60.0) |
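Pending such an API, a timeout can be approximated today by running the match in a child process. This is only a sketch: match_with_timeout and the maxseconds behaviour are my stand-ins for the proposed feature, not an existing re API. A child process is used because neither signals nor threads can interrupt the C matching loop once it has started.

```python
import re
import multiprocessing as mp

def _do_match(pattern, string, conn):
    # Child-process worker: run the match, send back the span (or None).
    m = re.match(pattern, string)
    conn.send(m.span() if m else None)

def match_with_timeout(pattern, string, maxseconds):
    # Hypothetical helper mimicking the proposed maxseconds= argument.
    parent_conn, child_conn = mp.Pipe()
    proc = mp.Process(target=_do_match, args=(pattern, string, child_conn))
    proc.start()
    if parent_conn.poll(maxseconds):
        result = parent_conn.recv()
        proc.join()
        return result
    # Runaway pattern: kill the worker and signal the timeout.
    proc.terminate()
    proc.join()
    raise TimeoutError('regex evaluation exceeded %.1f s' % maxseconds)

if __name__ == '__main__':
    print(match_with_timeout(r'a+b', 'aaab', 5.0))  # (0, 4)
    print(match_with_timeout(r'a+c', 'aaab', 5.0))  # None
```

(The third-party regex module later grew a real timeout= parameter, making this workaround unnecessary there.)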
As part of the PEP-408 discussions, Guido approved the addition of 'regex' in 3.3 (using that name, rather than as a drop-in replacement for re) [1,2] That should greatly ease the backwards compatibility concerns, even if it isn't as transparent an upgrade path. [1] http://mail.python.org/pipermail/python-dev/2012-January/115961.html |
So, to my reading of the compatibility PEP this cannot be added wholesale,

On Sun, Jan 29, 2012 at 1:26 AM, Nick Coghlan <report@bugs.python.org> wrote:
I created a new sandbox branch to integrate regex into CPython, see "remote repo" field. I mainly had to adapt the test suite to use unittest. |
Alex has a valid point in relation to PEP-399, since, like lzma, regex will be coming in under the "special permission" clause that allows the addition of C extension modules without pure Python equivalents. Unlike lzma, though, the new regex engine isn't a relatively simple wrapper around an existing library - supporting the new API features on other implementations is going to mean a substantial amount of work. In practice, I expect that a pure Python implementation of a regular expression engine would only be fast enough to be usable on PyPy. So while we'd almost certainly accept a patch that added a parallel Python implementation, I doubt it would actually help Jython or IronPython all that much - they're probably going to need versions written in Java and C# to be effective (as I believe they already have for the re module). |
Not sure why this is necessarily true. I'd expect a pure-Python implementation to be maybe 200 times as slow. Many queries (those on relatively short strings that backtrack little) finish within microseconds. On this scale, a couple of orders of magnitude is not noticeable by humans (unless it adds up), and even where it gets noticeable, it's better than having nothing at all or a non-working program (up until a point).

python -m timeit -n 1000000 -s "import re; x = re.compile(r'.*<\s*help\s*>([^\<])<\s/\s*help.*>'); data = ' '*1000 + '< help >' + 'abc'*100 + '</help>'" "x.match(data)"
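The flip side is catastrophic backtracking, where even the C engine degrades exponentially and a 200x-slower pure-Python engine would be hopeless. A tiny illustration (pattern and sizes chosen by me for the demo):

```python
import re
import time

# (a+)+b fails on a string of only a's, but the nested quantifier
# makes the engine try exponentially many ways of carving up the run
# before giving up -- each extra character roughly doubles the work.
for n in (16, 18, 20):
    start = time.perf_counter()
    result = re.match(r'(a+)+b', 'a' * n)
    print(n, result, round(time.perf_counter() - start, 5))
```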
Well, REs are very often used to process large chunks of text by repeated application. So if the whole operation takes 0.1 or 20 seconds you're going to notice :) |
It'd be nice if we had some sort of representative benchmark for real-world uses of Python regexps. The JS guys have all pitched in to create such a thing for uses of regexps on the web. I don't know of any such thing for Python. I agree that a Python implementation wouldn't be useful for some cases. On the other hand, I believe it would be fine (or at least tolerable) for some others. I don't know the ratio between the two.
See http://hg.python.org/benchmarks/, there are regex benchmarks there.
I think the ratio would be something like 2% tolerable :)

As I said to Ezio and Georg, I think adding the regex module needs a |
I've just uploaded regex into Debian: this will hopefully give us some more eyes looking at the module and reporting feedback.
I've been working through the "known crashers" list in the stdlib. The recursive import one was fixed with the migration to importlib in 3.3, the compiler one will be fixed in 3.3.1 (with an enforced nesting limit). One of those remaining is actually a pathological failure in the re module rather than a true crasher (i.e. it doesn't segfault, and in 2.7 and 3.3 you can interrupt it with Ctrl-C): I mention it here as another problem that adopting the regex module could resolve (as regex promptly returns None for this case). |
Will we actually get regex into the standard library on this pass? |
Even with in principle approval from Guido, this idea still depends on |
Here is my (slowly implemented) plan:
In the best case, in 3.7 or 3.8 we could replace re with a simplified regex. Or by that time re will be free of bugs and warts. |
Exciting. Perhaps you should post your plan on python-dev. In any case, huge thanks for your work on the re module. |
Thank you Antoine. I think all interested core developers are already aware |
So you are suggesting to fix bugs in re to make it closer to regex, and then replace re with a forked subset of regex that doesn't include advanced features, or just to fix/improve re until it matches the behavior of regex? |
Depends on what will be easier. May be some bugs are so hard to fix that |
Ok, regardless of what will happen, increasing test coverage is a worthy goal. We might start by looking at the regex test suite to see if we can import some tests from there. |
Thanks for pushing this one forward Serhiy! Your approach sounds like a |
If I recall, I started this thread with a plan to update re itself with implementations of various features listed in the top post. If you look at the list of files uploaded by me, there are some complete patches for re to add various features like Atomic Grouping. If we wish to bring re up to the regex standard, we could therefore start with those features.
Well, I found a bug with this module on Python 2.7(.5), on Windows 7 64-bit: when you try to compile a regex with the flags V1|DEBUG, the module crashes as if it wanted to call a builtin called "ascii". The bug has happened to me several times; this is the regexp from the last occurrence: http://paste.ubuntu.com/8993680/

I hope it's fixed. I really love the module and have found it very useful to have PCRE-style regexes in Python.
@mateon1: "I hope it's fixed"? Did you report it? |
Well, I am reporting it here, is this not the correct place? Sorry if it is. |
The page on PyPI says where the project's homepage is located:

Home Page: https://code.google.com/p/mrab-regex-hg/

The bug was fixed in the last release.
It's now a third party project: https://pypi.org/project/regex/ If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first. I close the issue as REJECTED. |