classification
Title: Major reworking of Python 2.5.2 re module
Type: behavior Stage:
Components: Regular Expressions Versions: Python 2.7
process
Status: closed Resolution: duplicate
Dependencies: Superseder: Adding a new regex module (compatible with re)
View: 2636
Assigned To: Nosy List: amaury.forgeotdarc, effbot, georg.brandl, giampaolo.rodola, gregory.p.smith, mrabarnett, pitrou, terry.reedy, timehorse
Priority: normal Keywords: patch

Created on 2008-09-09 21:07 by mrabarnett, last changed 2010-08-21 23:46 by georg.brandl. This issue is now closed.

Files
File name Uploaded Description Edit
regex_2.5.2.diff mrabarnett, 2008-09-13 15:20
test_re.py timehorse, 2008-09-15 18:58 Regexp Test Suite (from Lib/test/test_re.py) with Atomic Grouping
regex_2.6rc2.diff mrabarnett, 2008-09-20 20:32
regex_2.6rc2+1.diff mrabarnett, 2008-09-21 19:51
regex_2.6rc2+2.diff mrabarnett, 2008-09-22 00:43
regex_2.6rc2+3.diff mrabarnett, 2008-09-22 00:49
regex_2.6rc2+4.diff mrabarnett, 2008-09-22 16:04
regex_2.6rc2+5.diff mrabarnett, 2008-09-24 00:37
regex_2.6rc2+6.diff mrabarnett, 2008-09-24 02:20
Messages (31)
msg72910 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-09 21:07
This is a major reworking of the re module in Python 2.5.2.

Added atomic groups.
Added possessive quantifiers.
Lookbehinds can now be variable length.
Typically x2 faster.

More changes to follow.
msg72915 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-09 21:16
Very interesting! Have you seen #3626? Another thing to note is that
this will have to wait for 2.7 before it could potentially be integrated
into the trunk.
msg72916 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-09 21:16
Hi,

This looks impressive. You should really work from the current SVN
trunk, not from the 2.5 sources, as there were some additions to the re
module (mainly, a bytecode verifier contributed by Google).

Also, if it can be split into several functionally independent patches,
it will certainly help reviewing and (perhaps) integrating.

Last remark: we are currently in the last phases of the release process
for 2.6 and 3.0, so your work will probably not get a lot of attention
in the next few weeks. Don't be discouraged though!
msg72920 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-09-09 21:26
This sounds really neat.  but as Anotine said it'll be several weeks 
before any of us can give this serious attention.  Definitely update to 
trunk and base your work off of that.

quick comments:

Your _sre.c diff appears to remove and replace the entire file.  I don't 
think thats actually what you did.  Chances are that diff is a result of 
changed newline f lea.  Can you please make sure your file uses the 
previous line ending style so that the diff is more managable.

Also, please include unit test updates to validate all new functionality 
in Lib/test/test_re.py.

thanks for your efforts!  -gps
msg72921 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-09-09 21:27
weird typo:  s/f lea/formats/
msg72933 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-10 00:10
Corrected the diff file. I worked from Python 2.5.2 because that's what
I'm currently using. I'll work from the trunk in future.
msg72953 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-10 10:10
The correct link is #2636.
Is it the same work?
msg72965 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-10 14:51
This is different work from a different author than #2636. I've
submitted what I've done so far in case my computer gets hit by a bus.
:-) I still have more work to do on it, so I'm not concerned that it
might not get any attention for a while.
msg73162 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2008-09-13 02:29
Atomic groups and possessive quantifiers appear to be relatively new:
http://en.wikipedia.org/wiki/Regular_expressions
for instance, has no mention of either that I found.

http://www.regular-expressions.info/atomic.html
http://www.regular-expressions.info/possessive.html
seemed pretty clear to me.  If they accurately describe what you are
adding, they might be a starting point for the needed new manual
sections.  They should include a warning about carefully ordering
alternatives lest one prune too much.
msg73184 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-13 13:40
By the way, the patch must be pretty incomplete, since there are almost
no changes to _sre.c. Am I missing something?
msg73189 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-13 15:20
Corrected the diff file, again. :-(

The atomic groups and possessive quantifiers are as described at
http://www.regular-expressions.info.
msg73191 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2008-09-13 15:35
A bit more information on the changes to the core engine that are
responsible for the 2x speedup (on what?) would be nice to have, I think
(especially since you seem to have removed the KMP prefix scanner).

(Isn't there a RE benchmark suite somewhere under tests?)
msg73255 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-15 11:21
Well, I implemented this months ago, but have been busy with other
things so I haven't updated in a while.  I noticed that the current
version is missing my patches for Atomic Grouping / Possessive
Qualifiers and a number of other patches I added in #2636 , but I do
have working test cases and documentation updates for all the features
I've so far implemented as well as splitting my work into separate
sub-issues to make individual selection easier -- though with a number
of my modifications, I found that there are SO MANY co-dependencies
between, say, an engine modification (item 9) and adding Atomic Grouping
/ Possessive Qualifiers (item 1) and using shared Engine Constants (item
10) that I need a branch for Atomic, a branch for Atomic + Engine Mod 1,
Atomic + Engine Mod 2, Atomic + Shared Constants, Atomic + Engine Mod 1
+ Shared Constants AND Atomic + Engine Mod 2 + Shared Constants, and
those were just THREE item co-dependencies.  My code is all off of the
trunk line for 2.6 and is currently available via my Bazaar repository
under https://code.launchpad.net/~timehorse, where you can access any
source tree via the bazaar version control client.  The main reason I
got stumped in my development which might otherwise have implemented ALL
the issues I intended by now is that very situation I just described
where development of new features is NOT mutually independent.  You can
see by all my branches that the multiplicity of A or B or C is just
nightmarish, but what had to be done to keep issues independent.

Anyway, I'm looking forward to having a look at your suggestions and
think we may take best advantage with combining our work visa vi these
things; after all, there's no point re-inventing the wheel.

Thanks again for your contribution, Matthew!
msg73261 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-15 13:19
I know what you mean about the dependencies!

My current problem is that now I'm working with the current trunk, which
means using Visual C++ Express 2008 instead of 2005. When debugging it's
behaving like the debug info is out of date (showing the wrong values
for variables in the debugger, single-stepping to the wrong line, etc).
Rebuilding doesn't fix it. This might take a while. :-(
msg73262 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-15 13:24
If you use Visual C++ Express 2005, you can build python from the
PC\VS8.0 directory.
msg73272 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-15 17:18
Used Visual C++ Express 2005 and the PC\VS8.0 directory. Same problem.
msg73274 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-15 17:30
Do you have some big source files, of more than 10000 lines?
msg73279 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-15 17:55
_sre.c is over 6000, but it does contain macros. I didn't have this
problem when based on Python 2.5.2 in Express 2005.
msg73280 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-15 18:58
I have uploaded my test cases for Atomic Grouping / Possessive
Qualifier, which is the common code we seem to have developed, as this
may be of use to you.  I also have documentation, but for now, would you
mind running these tests against your code to see what the test outputs
and also, how did you come up with the 2x result?  Was that running the
test suite?  Usually, the regexp module is benchmarked against its test
suite and there are timings built into that, so it may be useful if you
could run the unmodified Lib/test/test_re.py you got from the trunk
against the original code before modification and your modification, and
do so a few times to get a good average result on multi-tasking systems,
and post the results here so we can get a good statistical feel for how
your new engine improves efficiency.  Certainly, I support any Engine
that works faster, as I myself have tried to make it faster but ended up
with something 8% slower instead, alas.

Also, good thinking on fixing the Negative Look-behind variable-width
issue; I wish I'd thought of that, but I am curious about something: did
you remove the optimization for fixed-width look-behind?  The old code
only allowed fixed with because that test can be done quickly; I noticed
your code adds a lot of new REV opcodes to handle back-tracking and I
assume look-behind logic for variable-width look-behind.  It would be
handy if the compiler and engine would be able to differentiate between
fixed-width look-behind (optimized as was originally) and variable-width
(using your advanced code).

Thanks to AMK for some of these suggestions.  Your changes are quite
radical though so I am still trying to wade through them all and I still
don't have a full-picture of how you've changed things, but there are
some good ideas here, IMHO, especially if you do indeed get 2x speedup.
msg73466 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-20 16:43
This patch is now based on Python 2.6rc2.

I've reduced the number of macros and used functions instead, provided
that it didn't cost much in terms of speed. In many cases it should be
faster than the current release, and at worst no slower. More speed
tests and tweaks needed.

BTW, the impression I got was that look-behind was fixed width because
the matching operations could only match forwards through the text, so
in order to look behind it had to step back through the text and then
match forwards. For simplicity and speed it insisted that it must be
able to determine the size of the step beforehand, hence fixed-width. My
addition was to add matching operations which worked matched backwards
and also reverse the order of the matching for look-behinds, an idea
which I got from a page on how it could be implemented in Perl 6!
msg73472 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-20 20:32
Bugfix.
msg73524 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-21 19:51
Fixed the matching of word boundaries when searching and matching in
substrings.
msg73543 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-22 00:15
regex_2.6rc2+2.diff is a bugfix for capture groups in look-behinds.
msg73546 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-22 00:43
Needed to correct regex_2.6rc2+2.diff.
msg73548 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-22 00:49
regex_2.6rc2+3.diff adds reverse searching with the re.REVERSE/re.R and
"(?r)" flag.

This gives results such as:

>>> re.findall("(\w+)", "one two three")
['one', 'two', 'three']
>>> re.findall("(?r)(\w+)", "one two three")
['three', 'two', 'one']

See #516762.
msg73583 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-22 16:02
regex_2.6rc2+4.diff fixes the ordering of the capture groups for reverse
searching.
msg73584 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-22 16:04
Correction of regex_2.6rc2+4.diff. (Aargh!)
msg73684 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-24 00:37
Patch regex_2.6rc2+5.diff adds scoped and 'negative' flags for (?i),
(?m) and (?s). The other flags remain unchanged in behaviour.

See #433024, #433027 and #433028.
msg73690 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-24 02:20
Patch regex_2.6rc2+6.diff is a bugfix.
msg73708 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-24 12:06
Matthew,

I am really happy that you are making such progress on your engine, but
can I PLEASE ask you to slow down for a moment?  We have a lot of issues
already listed in issue 2636 that is a catch-all for any Python 2.7
Regexp improvements, including your new engine, and I have been working
frantically to try and document all the changes YOU are making here into
the general Regexp 2.7 modification thread and setting up development
trees in my Bazaar VCS repository for your work.

There is also a recommended process for patching which makes it easier
for the moderators to accept your patches which is to provide
dis-entangled functionality and letting each improvement stand on its
own two feet.  In other words, let your engine stand ONLY on it's 2x
speed improvements.  We already have an implementation of Atomic
Grouping / Possessive Qualifiers in issue 2636 but you have a version of
your engine with both.  We have no such 'feature-only' implementation
for Variable-Length Look-Behind, for a Reverse flag, for Positionally
Dependent modifier flags or modifier negation flags, as well as the
zero-width Regular Expression split feature, though you and I completely
agree these would all be great things to have!  The more features you
add to your engine as an all-or-nothing proposition, the less likely the
moderators are going to be to adapt it because it's harder for them to
examine the merits of each individual piece.  That is why issue 2636 is
broken up into items (currently 1 - 18, with your proposals bringing
that up toward 22) and where alternate, combined features are provided
if implementing 1 features would affect the implementation of another.

Please understand that I personally have no problem with you redesigning
large swaths of the Python Regular Expression engine.  I would
personally, like to see one of the design goals of any new engine not
only be speed but better source comments because my main beef with the
current engine is that it took me a month to understand and part of my
redesign in issue 2636 9-1 was to add copious comments to the engine so
that future developers would understand what was going on and be able to
pick up from my work.  I am not proposing we use my 9-1 engine because
it is 8% slower than the current engine and I don't intend to propose
anything slower.  But it would be nice if you could add lots of comments
to your engine so that others could help develop features against it. 
None the less, I will fully support your engine if it does indeed
perform substantially and measurably faster and am happy to see all the
Regexp issues you are finding are finally being implemented, all be it
entangled with your engine.  But let's return to the fundamentals of
what you propose IN THIS THREAD, which simply to propose a new Regexp
Engine which is 2x faster than the existing engine (Which I have
allocated item 9-2 in the issue 2636 thread).  I am not trying to put
more work on your hands -- in fact, what I am trying to do is get us to
co-operate on a better python Regexp Engine so that I can help you to
achieve your goals.  Please read issue 2636 and join the discussion
there; feel free to add any new items you feel are missing from my
existing list.  And remember, each new feature needs tests and
documentation changes.  I have been doing each for any feature I
undertake and would be happy to share those skills with you.  

Let's work together to see your engine be the new model, okay?

Thanks.
msg114610 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-08-21 23:46
Work has gone on in #2636.
History
Date User Action Args
2014-11-08 12:19:22serhiy.storchakaunlinkissue433028 superseder
2010-08-22 00:03:35georg.brandlunlinkissue433027 dependencies
2010-08-22 00:03:21georg.brandlunlinkissue433024 superseder
2010-08-21 23:46:11georg.brandlsetstatus: open -> closed

nosy: + georg.brandl
messages: + msg114610

superseder: Adding a new regex module (compatible with re)
resolution: duplicate
2010-08-07 21:00:30terry.reedylinkissue433024 superseder
2009-06-04 10:00:11georg.brandllinkissue433028 superseder
2009-02-12 19:36:06ajaksu2linkissue433027 dependencies
2009-02-12 19:35:13ajaksu2linkissue433024 dependencies
2008-09-24 12:06:42timehorsesetmessages: + msg73708
2008-09-24 02:20:22mrabarnettsetfiles: + regex_2.6rc2+6.diff
messages: + msg73690
2008-09-24 00:37:27mrabarnettsetfiles: + regex_2.6rc2+5.diff
messages: + msg73684
2008-09-22 16:04:30mrabarnettsetfiles: + regex_2.6rc2+4.diff
messages: + msg73584
2008-09-22 16:03:30mrabarnettsetfiles: - regex_2.6rc2+4.diff
2008-09-22 16:02:58mrabarnettsetfiles: + regex_2.6rc2+4.diff
messages: + msg73583
2008-09-22 01:00:03benjamin.petersonlinkissue516762 superseder
2008-09-22 00:50:03mrabarnettsetfiles: + regex_2.6rc2+3.diff
messages: + msg73548
2008-09-22 00:43:27mrabarnettsetfiles: + regex_2.6rc2+2.diff
messages: + msg73546
2008-09-22 00:42:20mrabarnettsetfiles: - regex_2.6rc2+2.diff
2008-09-22 00:15:54mrabarnettsetfiles: + regex_2.6rc2+2.diff
messages: + msg73543
2008-09-21 19:51:31mrabarnettsetfiles: + regex_2.6rc2+1.diff
messages: + msg73524
2008-09-20 20:32:41mrabarnettsetfiles: + regex_2.6rc2.diff
messages: + msg73472
2008-09-20 20:31:41mrabarnettsetfiles: - regex_2.6rc2.diff
2008-09-20 16:43:23mrabarnettsetfiles: + regex_2.6rc2.diff
messages: + msg73466
2008-09-15 18:58:28timehorsesetfiles: + test_re.py
messages: + msg73280
2008-09-15 17:55:09mrabarnettsetmessages: + msg73279
2008-09-15 17:30:03amaury.forgeotdarcsetmessages: + msg73274
2008-09-15 17:18:06mrabarnettsetmessages: + msg73272
2008-09-15 13:24:35amaury.forgeotdarcsetmessages: + msg73262
2008-09-15 13:19:54mrabarnettsetmessages: + msg73261
2008-09-15 11:21:08timehorsesetnosy: + timehorse
messages: + msg73255
2008-09-14 18:25:19giampaolo.rodolasetnosy: + giampaolo.rodola
2008-09-13 15:35:55effbotsetnosy: + effbot
messages: + msg73191
2008-09-13 15:26:28benjamin.petersonsetnosy: - benjamin.peterson
2008-09-13 15:20:46mrabarnettsetfiles: + regex_2.5.2.diff
messages: + msg73189
2008-09-13 15:16:46mrabarnettsetfiles: - regex_2.5.2.diff
2008-09-13 13:40:06pitrousetmessages: + msg73184
2008-09-13 02:29:21terry.reedysetnosy: + terry.reedy
messages: + msg73162
2008-09-10 14:51:27mrabarnettsetmessages: + msg72965
2008-09-10 10:10:26amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg72953
2008-09-10 00:11:05mrabarnettsetfiles: + regex_2.5.2.diff
messages: + msg72933
2008-09-10 00:06:45mrabarnettsetfiles: - regex_2.5.2.diff
2008-09-09 21:27:21gregory.p.smithsetmessages: + msg72921
2008-09-09 21:26:08gregory.p.smithsetnosy: + gregory.p.smith
messages: + msg72920
2008-09-09 21:16:35pitrousetnosy: + pitrou
messages: + msg72916
2008-09-09 21:16:08benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg72915
versions: + Python 2.7, - Python 2.5
2008-09-09 21:07:52mrabarnettcreate