Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Major reworking of Python 2.5.2 re module #48075

Closed
mrabarnett mannequin opened this issue Sep 9, 2008 · 31 comments
Closed

Major reworking of Python 2.5.2 re module #48075

mrabarnett mannequin opened this issue Sep 9, 2008 · 31 comments
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@mrabarnett
Copy link
Mannequin

mrabarnett mannequin commented Sep 9, 2008

BPO 3825
Nosy @birkenfeld, @terryjreedy, @gpshead, @amauryfa, @pitrou, @giampaolo
Superseder
  • bpo-2636: Adding a new regex module (compatible with re)
  • Files
  • regex_2.5.2.diff
  • test_re.py: Regexp Test Suite (from Lib/test/test_re.py) with Atomic Grouping
  • regex_2.6rc2.diff
  • regex_2.6rc2+1.diff
  • regex_2.6rc2+2.diff
  • regex_2.6rc2+3.diff
  • regex_2.6rc2+4.diff
  • regex_2.6rc2+5.diff
  • regex_2.6rc2+6.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-08-21.23:46:11.931>
    created_at = <Date 2008-09-09.21:07:52.768>
    labels = ['expert-regex', 'type-bug']
    title = 'Major reworking of Python 2.5.2 re module'
    updated_at = <Date 2010-08-21.23:46:11.913>
    user = 'https://bugs.python.org/mrabarnett'

    bugs.python.org fields:

    activity = <Date 2010-08-21.23:46:11.913>
    actor = 'georg.brandl'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-08-21.23:46:11.931>
    closer = 'georg.brandl'
    components = ['Regular Expressions']
    creation = <Date 2008-09-09.21:07:52.768>
    creator = 'mrabarnett'
    dependencies = []
    files = ['11484', '11499', '11532', '11543', '11553', '11554', '11559', '11585', '11587']
    hgrepos = []
    issue_num = 3825
    keywords = ['patch']
    message_count = 31.0
    messages = ['72910', '72915', '72916', '72920', '72921', '72933', '72953', '72965', '73162', '73184', '73189', '73191', '73255', '73261', '73262', '73272', '73274', '73279', '73280', '73466', '73472', '73524', '73543', '73546', '73548', '73583', '73584', '73684', '73690', '73708', '114610']
    nosy_count = 9.0
    nosy_names = ['effbot', 'georg.brandl', 'terry.reedy', 'gregory.p.smith', 'amaury.forgeotdarc', 'pitrou', 'giampaolo.rodola', 'timehorse', 'mrabarnett']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '2636'
    type = 'behavior'
    url = 'https://bugs.python.org/issue3825'
    versions = ['Python 2.7']

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 9, 2008

    This is a major reworking of the re module in Python 2.5.2.

    Added atomic groups.
    Added possessive quantifiers.
    Lookbehinds can now be variable length.
    Typically x2 faster.

    More changes to follow.

    @mrabarnett mrabarnett mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Sep 9, 2008
    @benjaminp
    Copy link
    Contributor

    Very interesting! Have you seen bpo-3626? Another thing to note is that
    this will have to wait for 2.7 before it could potentially be integrated
    into the trunk.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 9, 2008

    Hi,

    This looks impressive. You should really work from the current SVN
    trunk, not from the 2.5 sources, as there were some additions to the re
    module (mainly, a bytecode verifier contributed by Google).

    Also, if it can be split into several functionally independent patches,
    it will certainly help reviewing and (perhaps) integrating.

    Last remark: we are currently in the last phases of the release process
    for 2.6 and 3.0, so your work will probably not get a lot of attention
    in the next few weeks. Don't be discouraged though!

    @gpshead
    Copy link
    Member

    gpshead commented Sep 9, 2008

    This sounds really neat. but as Anotine said it'll be several weeks
    before any of us can give this serious attention. Definitely update to
    trunk and base your work off of that.

    quick comments:

    Your _sre.c diff appears to remove and replace the entire file. I don't
    think thats actually what you did. Chances are that diff is a result of
    changed newline f lea. Can you please make sure your file uses the
    previous line ending style so that the diff is more managable.

    Also, please include unit test updates to validate all new functionality
    in Lib/test/test_re.py.

    thanks for your efforts! -gps

    @gpshead
    Copy link
    Member

    gpshead commented Sep 9, 2008

    weird typo: s/f lea/formats/

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 10, 2008

    Corrected the diff file. I worked from Python 2.5.2 because that's what
    I'm currently using. I'll work from the trunk in future.

    @amauryfa
    Copy link
    Member

    The correct link is bpo-2636.
    Is it the same work?

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 10, 2008

    This is different work from a different author than bpo-2636. I've
    submitted what I've done so far in case my computer gets hit by a bus.
    :-) I still have more work to do on it, so I'm not concerned that it
    might not get any attention for a while.

    @terryjreedy
    Copy link
    Member

    Atomic groups and possessive quantifiers appear to be relatively new:
    http://en.wikipedia.org/wiki/Regular_expressions
    for instance, has no mention of either that I found.

    http://www.regular-expressions.info/atomic.html
    http://www.regular-expressions.info/possessive.html
    seemed pretty clear to me. If they accurately describe what you are
    adding, they might be a starting point for the needed new manual
    sections. They should include a warning about carefully ordering
    alternatives lest one prune too much.

    @pitrou
    Copy link
    Member

    pitrou commented Sep 13, 2008

    By the way, the patch must be pretty incomplete, since there are almost
    no changes to _sre.c. Am I missing something?

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 13, 2008

    Corrected the diff file, again. :-(

    The atomic groups and possessive quantifiers are as described at
    http://www.regular-expressions.info.

    @effbot
    Copy link
    Mannequin

    effbot mannequin commented Sep 13, 2008

    A bit more information on the changes to the core engine that are
    responsible for the 2x speedup (on what?) would be nice to have, I think
    (especially since you seem to have removed the KMP prefix scanner).

    (Isn't there a RE benchmark suite somewhere under tests?)

    @timehorse
    Copy link
    Mannequin

    timehorse mannequin commented Sep 15, 2008

    Well, I implemented this months ago, but have been busy with other
    things so I haven't updated in a while. I noticed that the current
    version is missing my patches for Atomic Grouping / Possessive
    Qualifiers and a number of other patches I added in bpo-2636 , but I do
    have working test cases and documentation updates for all the features
    I've so far implemented as well as splitting my work into separate
    sub-issues to make individual selection easier -- though with a number
    of my modifications, I found that there are SO MANY co-dependencies
    between, say, an engine modification (item 9) and adding Atomic Grouping
    / Possessive Qualifiers (item 1) and using shared Engine Constants (item
    10) that I need a branch for Atomic, a branch for Atomic + Engine Mod 1,
    Atomic + Engine Mod 2, Atomic + Shared Constants, Atomic + Engine Mod 1
    + Shared Constants AND Atomic + Engine Mod 2 + Shared Constants, and
    those were just THREE item co-dependencies. My code is all off of the
    trunk line for 2.6 and is currently available via my Bazaar repository
    under https://code.launchpad.net/~timehorse, where you can access any
    source tree via the bazaar version control client. The main reason I
    got stumped in my development which might otherwise have implemented ALL
    the issues I intended by now is that very situation I just described
    where development of new features is NOT mutually independent. You can
    see by all my branches that the multiplicity of A or B or C is just
    nightmarish, but what had to be done to keep issues independent.

    Anyway, I'm looking forward to having a look at your suggestions and
    think we may take best advantage with combining our work visa vi these
    things; after all, there's no point re-inventing the wheel.

    Thanks again for your contribution, Matthew!

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 15, 2008

    I know what you mean about the dependencies!

    My current problem is that now I'm working with the current trunk, which
    means using Visual C++ Express 2008 instead of 2005. When debugging it's
    behaving like the debug info is out of date (showing the wrong values
    for variables in the debugger, single-stepping to the wrong line, etc).
    Rebuilding doesn't fix it. This might take a while. :-(

    @amauryfa
    Copy link
    Member

    If you use Visual C++ Express 2005, you can build python from the
    PC\VS8.0 directory.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 15, 2008

    Used Visual C++ Express 2005 and the PC\VS8.0 directory. Same problem.

    @amauryfa
    Copy link
    Member

    Do you have some big source files, of more than 10000 lines?

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 15, 2008

    _sre.c is over 6000, but it does contain macros. I didn't have this
    problem when based on Python 2.5.2 in Express 2005.

    @timehorse
    Copy link
    Mannequin

    timehorse mannequin commented Sep 15, 2008

    I have uploaded my test cases for Atomic Grouping / Possessive
    Qualifier, which is the common code we seem to have developed, as this
    may be of use to you. I also have documentation, but for now, would you
    mind running these tests against your code to see what the test outputs
    and also, how did you come up with the 2x result? Was that running the
    test suite? Usually, the regexp module is benchmarked against its test
    suite and there are timings built into that, so it may be useful if you
    could run the unmodified Lib/test/test_re.py you got from the trunk
    against the original code before modification and your modification, and
    do so a few times to get a good average result on multi-tasking systems,
    and post the results here so we can get a good statistical feel for how
    your new engine improves efficiency. Certainly, I support any Engine
    that works faster, as I myself have tried to make it faster but ended up
    with something 8% slower instead, alas.

    Also, good thinking on fixing the Negative Look-behind variable-width
    issue; I wish I'd thought of that, but I am curious about something: did
    you remove the optimization for fixed-width look-behind? The old code
    only allowed fixed with because that test can be done quickly; I noticed
    your code adds a lot of new REV opcodes to handle back-tracking and I
    assume look-behind logic for variable-width look-behind. It would be
    handy if the compiler and engine would be able to differentiate between
    fixed-width look-behind (optimized as was originally) and variable-width
    (using your advanced code).

    Thanks to AMK for some of these suggestions. Your changes are quite
    radical though so I am still trying to wade through them all and I still
    don't have a full-picture of how you've changed things, but there are
    some good ideas here, IMHO, especially if you do indeed get 2x speedup.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 20, 2008

    This patch is now based on Python 2.6rc2.

    I've reduced the number of macros and used functions instead, provided
    that it didn't cost much in terms of speed. In many cases it should be
    faster than the current release, and at worst no slower. More speed
    tests and tweaks needed.

    BTW, the impression I got was that look-behind was fixed width because
    the matching operations could only match forwards through the text, so
    in order to look behind it had to step back through the text and then
    match forwards. For simplicity and speed it insisted that it must be
    able to determine the size of the step beforehand, hence fixed-width. My
    addition was to add matching operations which worked matched backwards
    and also reverse the order of the matching for look-behinds, an idea
    which I got from a page on how it could be implemented in Perl 6!

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 20, 2008

    Bugfix.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 21, 2008

    Fixed the matching of word boundaries when searching and matching in
    substrings.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 22, 2008

    regex_2.6rc2+2.diff is a bugfix for capture groups in look-behinds.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 22, 2008

    Needed to correct regex_2.6rc2+2.diff.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 22, 2008

    regex_2.6rc2+3.diff adds reverse searching with the re.REVERSE/re.R and
    "(?r)" flag.

    This gives results such as:

    >>> re.findall("(\w+)", "one two three")
    ['one', 'two', 'three']
    >>> re.findall("(?r)(\w+)", "one two three")
    ['three', 'two', 'one']

    See bpo-516762.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 22, 2008

    regex_2.6rc2+4.diff fixes the ordering of the capture groups for reverse
    searching.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 22, 2008

    Correction of regex_2.6rc2+4.diff. (Aargh!)

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 24, 2008

    Patch regex_2.6rc2+5.diff adds scoped and 'negative' flags for (?i),
    (?m) and (?s). The other flags remain unchanged in behaviour.

    See bpo-433024, bpo-433027 and bpo-433028.

    @mrabarnett
    Copy link
    Mannequin Author

    mrabarnett mannequin commented Sep 24, 2008

    Patch regex_2.6rc2+6.diff is a bugfix.

    @timehorse
    Copy link
    Mannequin

    timehorse mannequin commented Sep 24, 2008

    Matthew,

    I am really happy that you are making such progress on your engine, but
    can I PLEASE ask you to slow down for a moment? We have a lot of issues
    already listed in bpo-2636 that is a catch-all for any Python 2.7
    Regexp improvements, including your new engine, and I have been working
    frantically to try and document all the changes YOU are making here into
    the general Regexp 2.7 modification thread and setting up development
    trees in my Bazaar VCS repository for your work.

    There is also a recommended process for patching which makes it easier
    for the moderators to accept your patches which is to provide
    dis-entangled functionality and letting each improvement stand on its
    own two feet. In other words, let your engine stand ONLY on it's 2x
    speed improvements. We already have an implementation of Atomic
    Grouping / Possessive Qualifiers in bpo-2636 but you have a version of
    your engine with both. We have no such 'feature-only' implementation
    for Variable-Length Look-Behind, for a Reverse flag, for Positionally
    Dependent modifier flags or modifier negation flags, as well as the
    zero-width Regular Expression split feature, though you and I completely
    agree these would all be great things to have! The more features you
    add to your engine as an all-or-nothing proposition, the less likely the
    moderators are going to be to adapt it because it's harder for them to
    examine the merits of each individual piece. That is why bpo-2636 is
    broken up into items (currently 1 - 18, with your proposals bringing
    that up toward 22) and where alternate, combined features are provided
    if implementing 1 features would affect the implementation of another.

    Please understand that I personally have no problem with you redesigning
    large swaths of the Python Regular Expression engine. I would
    personally, like to see one of the design goals of any new engine not
    only be speed but better source comments because my main beef with the
    current engine is that it took me a month to understand and part of my
    redesign in bpo-2636 9-1 was to add copious comments to the engine so
    that future developers would understand what was going on and be able to
    pick up from my work. I am not proposing we use my 9-1 engine because
    it is 8% slower than the current engine and I don't intend to propose
    anything slower. But it would be nice if you could add lots of comments
    to your engine so that others could help develop features against it.
    None the less, I will fully support your engine if it does indeed
    perform substantially and measurably faster and am happy to see all the
    Regexp issues you are finding are finally being implemented, all be it
    entangled with your engine. But let's return to the fundamentals of
    what you propose IN THIS THREAD, which simply to propose a new Regexp
    Engine which is 2x faster than the existing engine (Which I have
    allocated item 9-2 in the bpo-2636 thread). I am not trying to put
    more work on your hands -- in fact, what I am trying to do is get us to
    co-operate on a better python Regexp Engine so that I can help you to
    achieve your goals. Please read bpo-2636 and join the discussion
    there; feel free to add any new items you feel are missing from my
    existing list. And remember, each new feature needs tests and
    documentation changes. I have been doing each for any feature I
    undertake and would be happy to share those skills with you.

    Let's work together to see your engine be the new model, okay?

    Thanks.

    @birkenfeld
    Copy link
    Member

    Work has gone on in bpo-2636.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-regex type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants