
Adding a new regex module (compatible with re) #46888

Closed

timehorse mannequin opened this issue Apr 15, 2008 · 334 comments
Labels: stdlib (Python modules in the Lib dir), topic-regex, type-feature (A feature request or enhancement)

Comments

@timehorse
Mannequin

timehorse mannequin commented Apr 15, 2008

BPO 2636
Nosy @loewis, @birkenfeld, @gpshead, @amauryfa, @ncoghlan, @abalkin, @pitrou, @vstinner, @giampaolo, @mark-summerfield, @ssbr, @ezio-melotti, @akitada, @stevendaprano, @alex, @bitdancer, @sandrotosi, @sorcio, @ericsnowcurrently, @Fak3, @serhiy-storchaka, @jayvdb
Files
  • regex_test-20100316: Python 2.6.5 re test run against regex-20100305
  • issue2636-20101230.zip
  • remove_guards.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-01-25.22:30:27.300>
    created_at = <Date 2008-04-15.11:57:51.375>
    labels = ['expert-regex', 'type-feature', 'library']
    title = 'Adding a new regex module (compatible with re)'
    updated_at = <Date 2021-01-25.22:30:27.294>
    user = 'https://bugs.python.org/timehorse'

    bugs.python.org fields:

    activity = <Date 2021-01-25.22:30:27.294>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-01-25.22:30:27.300>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2008-04-15.11:57:51.375>
    creator = 'timehorse'
    dependencies = []
    files = ['16563', '20192', '20203']
    hgrepos = ['108']
    issue_num = 2636
    keywords = ['patch']
    message_count = 334.0
    messages = ['65513', '65593', '65613', '65614', '65617', '65725', '65726', '65727', '65734', '65838', '65841', '66033', '67309', '67447', '67448', '68336', '68339', '68358', '68399', '68409', '73185', '73295', '73714', '73717', '73721', '73730', '73752', '73766', '73779', '73780', '73782', '73791', '73794', '73798', '73801', '73803', '73805', '73827', '73848', '73853', '73854', '73855', '73861', '73875', '73955', '74025', '74026', '74058', '74104', '74174', '74203', '74204', '74904', '80916', '81112', '81236', '81238', '81239', '81240', '81359', '81473', '81475', '82673', '82739', '82950', '83271', '83277', '83390', '83411', '83427', '83428', '83429', '83988', '83989', '83993', '84350', '86004', '86032', '89632', '89634', '89643', '90954', '90961', '90985', '90986', '90989', '91028', '91035', '91038', '91245', '91250', '91437', '91439', '91448', '91450', '91460', '91462', '91463', '91473', '91474', '91490', '91495', '91496', '91497', '91500', '91535', '91598', '91607', '91610', '91671', '91917', '97860', '98809', '99072', '99132', '99148', '99186', '99190', '99462', '99470', '99479', '99481', '99494', '99548', '99552', '99665', '99668', '99835', '99863', '99872', '99888', '99890', '99892', '100066', '100076', '100080', '100134', '100152', '100359', '100362', '100370', '100452', '101172', '101181', '101193', '101557', '102042', '103003', '103060', '103064', '103078', '103095', '103096', '103097', '109358', '109363', '109372', '109384', '109401', '109403', '109404', '109405', '109406', '109407', '109408', '109409', '109410', '109413', '109447', '109460', '109461', '109463', '109474', '109657', '110233', '110237', '110701', '110704', '110761', '111519', '111531', '111643', '111652', '111656', '111660', '111921', '113927', '113931', '114034', '114766', '116133', '116151', '116223', '116227', '116229', '116231', '116238', '116248', '116252', '116276', '116749', '117008', '117046', '117050', '118243', '118631', '118636', '118640', '118674', '118682', '119887', '119930', '119947', '119951', '119956', '119958', '120013', '120037', '120038', '120164', '120202', '120203', '120204', '120206', '120215', '120216', '120243', '120571', '120969', '120976', '120984', '120986', '121136', '121145', '121149', '121589', '121832', '122221', '122225', '122228', '122880', '123518', '123527', '123747', '124581', '124582', '124585', '124614', '124626', '124627', '124746', '124750', '124759', '124816', '124821', '124833', '124834', '124891', '124900', '124904', '124905', '124906', '124909', '124912', '124929', '124931', '124936', '124950', '124959', '124971', '124988', '124990', '125291', '126294', '126372', '127045', '130886', '130905', '130906', '130999', '135700', '135703', '135704', '140102', '140152', '140154', '143090', '143333', '143334', '143337', '143340', '143343', '143350', '143352', '143355', '143366', '143367', '143374', '143377', '143389', '143423', '143442', '143445', '143447', '143448', '143467', '143471', '143619', '144110', '152210', '152211', '152212', '152214', '152215', '152217', '152218', '152246', '157445', '174888', '221629', '221657', '230846', '230862', '230866', '230867', '230873', '230877', '230885', '230887', '231137', '231141', '231159', '231165', '385674']
    nosy_count = 43.0
    nosy_names = ['loewis', 'georg.brandl', 'gregory.p.smith', 'jimjjewett', 'sjmachin', 'amaury.forgeotdarc', 'ncoghlan', 'belopolsky', 'pitrou', 'vstinner', 'nneonneo', 'giampaolo.rodola', 'rsc', 'timehorse', 'mark', 'vbr', 'Devin Jeanpierre', 'ezio.melotti', 'mrabarnett', 'jaylogan', 'akitada', 'moreati', 'steven.daprano', 'alex', 'r.david.murray', 'jacques', 'zdwiel', 'THRlWiTi', 'sandro.tosi', 'abacabadabacaba', 'stiv', 'davide.rizzo', 'mattchaput', 'ronnix', 'tshepang', 'eric.snow', 'akoumjian', 'Roman.Evstifeev', 'serhiy.storchaka', 'Mateon1', 'jayvdb', 'Socob', 'petros']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue2636'
    versions = []

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 15, 2008

    I am working on adding features to the current Regexp implementation,
    which is now set to 2.2.2. These features are to bring the Regexp code
    closer in line with Perl 5.10 as well as add a few python-specific
    niceties and potential speed-ups and clean-ups.

    I will be posting regular patch updates to this thread when major
    milestones have been reached, with a description of the feature(s)
    added. Currently, the list of proposed changes is (in no particular
    order):

    1. Fix bpo-433030 (http://bugs.python.org/issue433030) by
      adding support for Atomic Grouping and Possessive Qualifiers (see
      the example after this list)

    2. Make named matches direct attributes of the match object; i.e.
      instead of m.group('foo'), one will be able to write simply m.foo.

    3. (maybe) make Match objects subscriptable, such that m[n] is
      equivalent to m.group(n) and allow slicing.

    4. Implement Perl-style back-references including relative back-references.

    5. Add a well-formed, python-specific comment modifier, e.g. (?P#...);
      the difference between (?P#...) and Perl/Python's (?#...) is that the
      former will allow nested parentheses as well as parenthetical escaping,
      so that a pattern such as '(?P# Evaluate (the following) expression,
      3\) using some other technique)' is handled. The (?P#...) will
      interpret this entire expression as a comment, whereas with (?#...),
      everything following ' expression...' would be considered part of the
      match. (?P#...) will necessarily be slower than (?#...) and so should
      only be used if a richer commenting style is required but verbose mode
      is not desired.

    6. Add official support for fast, non-repeating capture groups with the
      Template option. Template is unofficially supported and disables all
      repeat operators (*, + and ?). This would mainly consist of documenting
      its behavior.

    7. Modify the re compiled-expression cache to better handle the
      thrashing condition. Currently, when regular expressions are compiled,
      the result is cached so that if the same expression is compiled again,
      it is retrieved from the cache and no extra work has to be done. This
      cache supports up to 100 entries. Once the 100th entry is reached, the
      cache is cleared and a new compile must occur. The danger, albeit
      rare, is that one may compile the 100th expression only to find that
      one recompiles it and has to do the same work all over again when it
      may have been done 3 expressions ago. By modifying this logic
      slightly, it is possible to establish an arbitrary counter that gives
      a time stamp to each compiled entry and, instead of clearing the
      entire cache when it reaches capacity, only eliminate the oldest half
      of the cache, keeping the half that is more recent. This should limit
      the possibility of thrashing to cases where a very large number of
      regular expressions are continually recompiled. In addition to this,
      I will update the limit to 256 entries, meaning that the 128 most
      recent are kept (see the sketch after this list).

    8. Emacs/Perl style character classes, e.g. [:alphanum:]. For instance,
      :alphanum: would not include the '_' in the character class.

    9. C-Engine speed-ups. I am commenting and cleaning up the _sre.c
      Regexp engine to make it flow more linearly, rather than with all the
      current gotos, and replacing the switch-case statements with lookup
      tables, which tests have shown to be faster. This will also include
      adding many more comments to the C code in order to make it easier for
      future developers to follow. These changes are subject to testing,
      and some modifications may not be included in the final release if
      they are shown to be slower than the existing code. Also, a number of
      macros are being eliminated where appropriate.

    10. Export any (not already) shared value between the Python Code and
      the C code, e.g. the default Maximum Repeat count (65536); this will
      allow those constants to be changed in 1 central place.

    11. Various other Perl 5.10 conformance modifications, TBD.

    More items may come and suggestions are welcome.
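
    To make item 1 concrete, here is roughly how atomic groups and
    possessive qualifiers behave. The snippet uses the third-party regex
    module (which later grew these features) purely as a stand-in to
    demonstrate the semantics, since re does not support them at this
    point:

        import regex  # third-party module, used here only to show the semantics

        # Atomic group: once (?>...) has matched, the engine never
        # backtracks back into it.
        print(regex.match(r'a(?>bc|b)c', 'abcc'))  # matches 'abcc'
        print(regex.match(r'a(?>bc|b)c', 'abc'))   # None: 'bc' is locked in,
                                                   # so 'b' is never retried

        # Possessive qualifier: '++' repeats greedily and gives nothing back.
        print(regex.match(r'\d++x', '123x'))       # matches '123x'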
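
    And here is a minimal sketch of the cache policy proposed in item 7;
    the names and structure are illustrative, not the actual sre
    internals. Each entry is stamped with a counter, and at capacity the
    older half is evicted instead of clearing everything:

        import re

        _MAXCACHE = 256
        _cache = {}      # pattern -> (stamp, compiled pattern)
        _counter = 0

        def cached_compile(pattern):
            global _counter
            entry = _cache.get(pattern)
            if entry is not None:
                return entry[1]
            if len(_cache) >= _MAXCACHE:
                # Keep only the most recently stamped half of the entries.
                newest = sorted(_cache.items(), key=lambda kv: kv[1][0])
                _cache.clear()
                _cache.update(newest[_MAXCACHE // 2:])
            _counter += 1
            compiled = re.compile(pattern)
            _cache[pattern] = (_counter, compiled)
            return compiled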

    -----

    Currently, I have code which implements 5) and 7), have done some work
    on 10), and am almost done with 9). When 9) is complete, I will work
    on 1), some of which, such as parsing, is already done, then probably
    8) and 4), because they should not require too much work -- 4) is
    parser-only AFAICT. Then, I will attempt 2) and 3), though those will
    require changes at the C-code level. Then I will investigate what
    additional elements of 11) I can easily implement. Finally, I will
    write documentation for all of these features, including 6).

    In a few days, I will provide a patch with my interim results and will
    update the patches with regular updates when Milestones are reached.

    @timehorse timehorse mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 15, 2008
    @akuchling akuchling added topic-regex and removed stdlib Python modules in the Lib dir labels Apr 15, 2008
    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 17, 2008

    I am very sorry to report the benchmark results below. First, the
    good news: as of this moment, item 9), although not yet complete, is
    stable and able to pass all the existing python regexp tests. Because
    these tests are timed, I am using the timings from the first suite of
    tests to benchmark the old code against the new. Based on discussion
    with Andrew Kuchling, I have decided, for the sake of simplicity, that
    the "timing" of each version is the absolute minimum time to execute
    observed, because that run presumably had the most continuous CPU
    cycles and thus most closely represents the true execution time.

    It is this conclusion that greatly saddens me -- not that the effort
    has not been valuable in understanding the current engine. Indeed, I
    now understand the current engine well enough that I could proceed
    with the other modifications against it as-is, rather than
    implementing them in the new engine. Mind you, I will likely not
    bring over the copious comments that the new engine received when I
    translated it to a form without C macros and gotos, as that would
    require too much effort IMHO.

    Anyway, all that being said, and keeping in mind that I am not 100%
    satisfied with the new engine and may still be able to wring some timing
    out of it -- not that I will spend much more time on this -- here is
    where we currently stand:

    Old Engine: 6.574s
    New Engine: 7.239s

    This makes the old Engine 665ms faster over the entire first test_re.py
    suite, or 9% faster than the New Engine.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 18, 2008

    Here are the modifications so far for item 9) in _sre.c, plus some
    small modifications to sre_constants.h which exist only to get _sre.c
    to compile; normally sre_constants.h is generated by sre_constants.py,
    so this is not the final version of that file. I had also intended to
    make SRE_CHARSET and SRE_COUNT use lookup tables, as well as maybe
    others, but likely no further lookup tables beyond those. I also want
    to move alloc_pos out of the self object and make it a parameter to
    the ALLOC macro, and probably get rid of the op_code attribute, since
    it is only used in 1 place to save one subtract in a very rare case.
    But I want to resolve the 10% problem first, so I would appreciate it
    if people could look at the REMOVE_SRE_MATCH_MACROS section of code,
    compare it to the non-REMOVE_SRE_MATCH_MACROS version of SRE_MATCH,
    and see if you can suggest anything to make the former (new code)
    faster, to get me that elusive 10%.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 18, 2008

    Here is a patch to implement item 7)

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 18, 2008

    This simple patch adds (?P#...)-style comment support.

    @jimjjewett
    Mannequin

    jimjjewett mannequin commented Apr 24, 2008

    > These features are to bring the Regexp code closer in line with Perl 5.10

    Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?

    > as well as add a few python-specific

    because this also adds to the scope.

    > 2. Make named matches direct attributes of the match object; i.e.
    >   instead of m.group('foo'), one will be able to write simply m.foo.
    >
    > 3. (maybe) make Match objects subscriptable, such that m[n] is
    >   equivalent to m.group(n) and allow slicing.

    (2) and (3) would both be nice, but I'm not sure it makes sense to do
    *both* instead of picking one.

    > 5. Add a well-formed, python-specific comment modifier,
    >   e.g. (?P#...);

    [handles parens in comments without turning on verbose, but is slower]

    Why? It adds another incompatibility, so it has to be very useful or
    clear. What exactly is the advantage over just turning on verbose?

    > 9. C-Engine speed-ups. ...
    >   a number of Macros are being eliminated where appropriate.

    Be careful on those, particularly on str/unicode and different compile options.

    @amauryfa
    Member

    > These features are to bring the Regexp code closer in line
    > with Perl 5.10

    > Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?

    5.10.0 comes after 5.8 and is the latest version (2007/12/18)!
    Yes it is confusing.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 24, 2008

    Thanks Jim for your thoughts!

    Amaury has already explained about Perl 5.10.0. I suppose it's like
    Macintosh version numbering, since Mac Tiger went from version 10.4.9 to
    10.4.10 and 10.4.11 a few years ago. Maybe we should call Python 2.6
    Python 2.06 just in case. But 2.6 is the known last in the 2 series, so
    it's not a problem for us! :)

    > as well as add a few python-specific
    >
    > because this also adds to the scope.

    At this point the only python-specific changes I am proposing would be
    items 2, 3 (discussed below), 5 (discussed below), 6 and 7. 6 is only a
    documentation change, the code is already implemented. 7 is just a
    better behavior. I think it is RARE one compiles more than 100 unique
    regular expressions, but you never know as projects tend to grow over
    time, and in the old code the 101st would be recompiled even if it was
    just compiled 2 minutes ago. The patch is available so I leave it to
    the community to judge for themselves whether it is worth it, but as you
    can see, it's not a very large change.

    > 2) Make named matches direct attributes
    > of the match object; i.e. instead of m.group('foo'),
    > one will be able to write simply m.foo.

    > 3) (maybe) make Match objects subscriptable, such
    > that m[n] is equivalent to m.group(n) and allow slicing.

    > (2) and (3) would both be nice, but I'm not sure it makes sense to do
    > *both* instead of picking one.

    Well, I think named matches are better than numbered ones, so I'd
    definitely go with 2. The problem with 2, though, is that it still
    leaves the rather typographically intense m.group(n), since I cannot
    write m.3. However, since capture groups are always numbered
    sequentially, a match models a list very nicely. So I think for
    indexing by group number, the subscripting operator makes sense. I was
    not originally suggesting m['foo'] be supported, but I can see how that
    may come out of 3. There is, though, a restriction that python match
    group names must be valid python identifiers, and that restriction
    belongs to 2 more than 3, because 3 would not require it but 2 would.
    So at least I want 2, but it seems IMHO m[1] is better than m.group(1)
    and not in the least a hard or confusing way of retrieving the given
    group. Mind you, the Match object is a C struct with python bindings,
    and I'm not exactly sure how to add either feature to it, but I'm sure
    the C-API manual will help with that.
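
    For the record, a tiny wrapper illustrates the item-3 behavior under
    discussion (IndexableMatch is a hypothetical name; Python 3.6
    eventually added exactly this subscripting to match objects):

        import re

        class IndexableMatch:
            """Wrap a match so m[n] and m['name'] delegate to m.group(...)."""
            def __init__(self, match):
                self._m = match
            def __getitem__(self, key):
                return self._m.group(key)

        m = IndexableMatch(re.match(r'(?P<word>\w+) (\d+)', 'abc 42'))
        print(m[1], m[2], m['word'])   # abc 42 abc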

    > 5) Add a well-formed, python-specific comment modifier,
    > e.g. (?P#...);

    > [handles parens in comments without turning on verbose, but is slower]
    >
    > Why? It adds another incompatibility, so it has to be very useful or
    > clear. What exactly is the advantage over just turning on verbose?

    Well, Larry Wall and Guido agreed long ago that we, the python
    community, own all expressions of the form (?P...), and although it
    would be my preference to make (?#...) itself understand nested
    parentheses, changing the logic behind THAT would make python
    non-standard. So as far as any conflicting design, we needn't worry.

    As for speed, this all occurs in the parser and does not affect the
    compiler or engine. It occurs only after a (?P has been read, and then
    only as the last check before failure, so it should not be much slower
    except when the expression is invalid. The actual execution time to
    find the closing brace of (?P#...) is a bit slower than that for
    (?#...), but not by much.

    Verbose is generally a good idea for anything more than a trivial
    regular expression. However, it can have overhead if not included as
    the first flag: an expression is always checked for verbose
    post-compilation, and if verbose is encountered, the expression is
    compiled a second time, which is somewhat wasteful. But the reason I
    like (?P#...) over (?#...) is that I think people would tend to assume:

    r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

    That expression only matches "He ls)llo", so I created the (?P#...) to
    make the comment match type more intuitive:

    r'He(?P# 2 (TWO) ls)llo' matches "Hello".
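
    A quick, runnable illustration of the existing behavior (a (?#...)
    comment ends at the first ')', so parentheses cannot appear inside it):

        import re

        # Fine: no ')' inside the comment, so it is skipped as expected.
        print(re.match(r'He(?# a simple comment)llo', 'Hello'))  # matches

        # With '(TWO)' inside, the comment ends right after '(TWO', and the
        # leftover ' ls)llo' leaks into the pattern -- hence the surprise.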

    > 9) C-Engine speed-ups. ...
    > a number of Macros are being eliminated where appropriate.

    > Be careful on those, particularly on str/unicode and different
    > compile options.

    Will do; thanks for the advice! I have only observed the UNICODE flag
    controlling whether certain code is used (besides the ones I've added)
    and have tried to stay true to that when I encounter it. Mind you,
    unless I can get my extra 10% it's unlikely I'd actually go with item 9
    here, even if it is easier to read IMHO. However, I want to run the new
    engine proposal through gprof to see if I can track down some bottlenecks.

    At some point, I hope to get my current changes on Launchpad if I can
    get that working. If I do, I'll give a link to how people can check out
    my working code here as well.

    @jimjjewett
    Mannequin

    jimjjewett mannequin commented Apr 24, 2008

    Python 2.6 isn't the last, but Guido has said that there won't be a 2.10.

    > Match object is a C-struct with python binding
    > and I'm not exactly sure how to add either feature to it

    I may be misunderstanding -- isn't this just a matter of writing the
    function and setting it in the tp_as_sequence and tp_as_mapping slots?

    > Larry Wall and Guido agreed long ago that we, the python
    > community, own all expressions of the form (?P...)

    Cool -- that reference should probably be added to the docs. For someone
    trying to learn or translate regular expressions, it helps to know that (?P
    ...) is explicitly a python extension (even if Perl adopts it later).

    Definitely put the example in the doc.

    > r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

    Maybe add that even without the change, as doco on the current situation.

    Does VERBOSE really have to be the first flag, or does it just have to be on
    the whole pattern instead of an internal switch?

    I'm not sure I fully understand what you said about template. Is this a
    special undocumented switch, or just an internal optimization mode that
    should be triggered whenever the repeat operators don't happen to occur?

    @pitrou
    Member

    pitrou commented Apr 26, 2008

    I don't know anything about regexp implementation, but if you replace a
    switch-case with a function lookup table, it isn't surprising that the
    new version ends up slower. A local jump is always faster than a
    function call, because of the setup overhead and stack manipulation the
    latter involves.

    So you might try to do the cleanup while keeping the switch-case
    structure, if possible.
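
    A rough Python analogue of the point (the effect is stronger in C,
    where an indirect call sets up a stack frame that an inline branch
    avoids):

        import timeit

        def run_branch(ops):
            acc = 0
            for op in ops:           # inline branching, no call overhead
                if op == 0:
                    acc += 1
                else:
                    acc -= 1
            return acc

        TABLE = {0: lambda acc: acc + 1, 1: lambda acc: acc - 1}

        def run_table(ops):
            acc = 0
            for op in ops:           # dispatch through a table of functions
                acc = TABLE[op](acc)
            return acc

        ops = [0, 1] * 50000
        print('branch:', timeit.timeit(lambda: run_branch(ops), number=20))
        print('table :', timeit.timeit(lambda: run_table(ops), number=20))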

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 26, 2008

    Thank you and Merci Antoine!

    That is a good point. It is clearly compiler-specific whether a
    switch-case will be turned into a series of conditional branches or
    into an internal jump table with lookup. And it is true that most
    compilers, if I understand correctly, use the jump-table approach for
    any switch-case over 2 or 3 entries when the cases are tightly grouped
    and near 0. That is probably why the original code was so fast. I'll
    see if I can combine the best of both approaches. Thanks again!

    @timehorse
    Mannequin Author

    timehorse mannequin commented May 1, 2008

    I am making my changes in a Bazaar branch hosted on Launchpad. It took
    me quite a while to get things set up more-or-less logically, but there
    they are, and I'm currently trying to re-apply my local changes up to
    today into the various branches I have. Each of the 11 issues I
    outlined originally has its own branch, with a root branch from which
    all these branches are derived. The root serves as a place a) to merge
    in concurrent python 2.6 alpha development and b) to apply any
    additional re changes that don't fall into any of the other categories,
    of which I have so far found only 2 small ones.

    Anyway, if anyone is interested in monitoring my progress, it is
    available at:

    https://code.launchpad.net/~timehorse/

    I will still post major milestones here, but one can monitor day-to-day
    progress on Launchpad. Also on Launchpad you will find more detail on
    the plans for each of the 11 modifications, for the curious.

    Thanks again for all the advice!

    @timehorse
    Mannequin Author

    timehorse mannequin commented May 24, 2008

    I am finally making progress again, after a month of moving my patches
    from my local svn repository to bazaar hosted on launchpad.net, as
    stated in my last update. I have also more or less finished probably
    the easiest item, #5, so I have a full patch for that available now.
    First, though, I want to update my "no matter what" patch, which is to
    say the changes I want to make if any changes are made to the Regexp
    code at all.

    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented May 28, 2008

    AFAIK if you have a regex with named capture groups there is no direct
    way to relate them to the capture group numbers.
    You could do (untested; Python 3 syntax):

        d = {v: k for k, v in match.groupdict().items()}
        for i in range(1, (match.lastindex or 0) + 1):
            print(i, match.group(i), d.get(match.group(i)))

    One possible solution would be a grouptuples() function that returned a
    tuple of 3-tuples (index, name, captured_text) with the name being None
    for unnamed groups.
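
    A pure-Python sketch of the suggested grouptuples() (a hypothetical
    API), built only on documented pattern attributes:

        import re

        def grouptuples(match):
            """Return (index, name_or_None, captured_text) for each group."""
            index_to_name = {v: k for k, v in match.re.groupindex.items()}
            return [(i, index_to_name.get(i), match.group(i))
                    for i in range(1, match.re.groups + 1)]

        m = re.match(r'(?P<word>\w+) (\d+)', 'abc 42')
        print(grouptuples(m))   # [(1, 'word', 'abc'), (2, None, '42')]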

    Anyway, good luck with all your improvements, I will be especially glad
    if you manage to do (2) and (8) (and maybe (3)).

    @timehorse
    Mannequin Author

    timehorse mannequin commented May 28, 2008

    Mark scribbled:

    > One possible solution would be a grouptuples() function that returned
    > a tuple of 3-tuples (index, name, captured_text) with the name being
    > None for unnamed groups.

    Hmm. Well, that's not a bad idea at all IMHO and would, AFAICT,
    probably be easier to do than (2). I would still do (2), but I will
    try to add this to one of the existing items, or spawn another item
    for it, since it is kind of a distinct feature.

    My preference right now is to finish off the test cases for (7) because
    it is already coded, then finish the work on (1) as that was the
    original reason for modification then on to (2) then (3) as they are
    related and then I don't mind tackling (8) because I think that one
    shouldn't be too hard. Interestingly, the existing engine code
    (sre_parse.py) has a place-holder, commented out, for character classes
    but it was never properly implemented. And I will warn that with
    Unicode, I THINK all the character classes exist as unicode functions or
    can be implemented as multiple unicode functions, but I'm not 100% sure
    so if I run into that problem, some character classes may initially be
    left out while I work on another item.

    Anyway, thanks for the input, Mark!

    @timehorse
    Mannequin Author

    timehorse mannequin commented Jun 17, 2008

    Well, it's time for another update on my progress...

    Some good news first: Atomic Grouping is now completed, tested and
    documented, and as stated above, is classified as bpo-2636-01 and
    related patches. Secondly, with caveats listed below, Named Match Group
    Attributes on a match object (item 2) is also more or less complete at
    bpo-2636-02 -- it only lacks documentation.

    Now, I want to also update my list of items. We left off at 11: Other
    Perl-specific modifications. Since that time, I have spawned a number
    of other branches, the first of which (bpo-2636-12) I am happy to
    announce is also complete!

    12. Implement the changes to the documentation of re as per Jim J.
      Jewett's suggestion from 2008-04-24 14:09. Again, this has been done.

    13. Implement a grouptuples(...) method as per Mark Summerfield's
      suggestion of 2008-05-28 09:38. grouptuples would take the same
      filtering parameters as the other group* functions, and would return
      a list of 3-tuples (unless only 1 group was requested). It should
      default to all match groups (1..n, not group 0, the matching string).

    14. As per PEP-3131 and the move to Python 3.0, python will begin to
      allow full UNICODE-compliant identifier names. Correspondingly, it
      would be the responsibility of this item to allow UNICODE names for
      match groups. This would allow retrieval of UNICODE names via the
      group* functions or, when combined with item 3, the getitem handler
      (m[u'...']) (03+14), and the attribute name itself (e.g. getattr(m,
      u'...')) when combined with item 2 (02+14).

    15. Change the Pattern_Type, Match_Type and Scanner_Type (experimental)
      to become richer Python Types. Specifically, add __doc__ strings to
      each of these types' methods and members.

    16. Implement various FIXMEs.

    16-1) Implement the FIXME such that if m is a MatchObject, del m.string
    will disassociate the original matched string from the match object;
    string would be the only member that would allow modification or
    deletion, and you will not be able to modify the m.string value, only
    delete it.

    -----

    Finally, I want to make a couple of notes about item 2:

    Firstly, as noted in Item 14, I wish to add support for UNICODE match
    group names, and the current version of the C-code would not allow that;
    it would only make sense to add UNICODE support if 14 is implemented, so
    adding support for UNICODE match object attributes would depend on both
    items 2 and 14. Thus, that would be implemented in bpo-2636-02+14.

    Secondly, there is a FIXME which I discussed in item 16; I gave that
    problem its own item and branch. Also, as stated in item 15, I would
    like to add more robust help code to the Match object and bind __doc__
    strings to the fixed attributes. Although this would not directly
    affect the item 2 implementation, it would probably involve moving some
    code around in its vicinity.

    Finally, I would like suggestions on how to handle name collisions when
    match group names are provided as attributes. For instance, an
    expression like '(?P<pos>.*)' would match more or less any string and
    assign it to the name "pos". But "pos" is already an attribute of the
    Match object, and therefore pos cannot be exposed as a named match
    group attribute, since match.pos will return the usual meaning of pos
    for a match object, not the value of the capture group named "pos".

    I have 3 proposals as to how to handle this:

    a) Simply disallow the exposure of match group name attributes if the
    names collide with an existing member of the basic Match Object
    interface.

    b) Expose the reserved names through a special prefix notation, and for
    forward compatibility, expose all names via this prefix notation. In
    other words, if the prefix was 'k', match.kpos could be used to access
    pos; if it was '_', match._pos would be used. If Item 3 is implemented,
    it may be sufficient to allow access via match['pos'] as the canonical
    way of handling match group names using reserved words.

    c) Don't expose the names directly; only expose them through a prefixed
    name, e.g. match._pos or match.kpos.

    Personally, I like a), because if item 3 is implemented, it makes a
    fairly useful shorthand for retrieving such groups when a reserved name
    is used. Also, we could put a deprecation warning in to inform users
    that match group names which collide with Match object attributes will
    eventually be disallowed. However, I don't support restricting the
    match group names any more than they already are (they must be a valid
    python identifier only), so again I would go with a) and nothing more,
    and that's what's implemented in bpo-2636-02.patch.
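
    To make proposal a) concrete, here is an illustrative pure-Python
    wrapper (AttrMatch is a hypothetical name; the real change would live
    in the C match type). Group names become attributes unless they
    collide with an existing Match attribute, in which case the real
    attribute wins:

        import re

        class AttrMatch:
            def __init__(self, match):
                self._match = match
            def __getattr__(self, name):
                groups = self._match.re.groupindex
                if name in groups and not hasattr(type(self._match), name):
                    return self._match.group(name)
                return getattr(self._match, name)

        m = AttrMatch(re.match(r'(?P<foo>\w+) (?P<pos>\w+)', 'hello world'))
        print(m.foo)   # 'hello' -- no collision, exposed as an attribute
        print(m.pos)   # 0 -- 'pos' collides, so the normal Match.pos wins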

    -----

    Now, rather than posting umpteen patch files, I am posting one
    bz2-compressed tar of ALL patch files for all threads, where each file
    is of the form:

    bpo-2636(-\d\d|+\d\d)*(-only)?.patch

    For instance,

    bpo-2636-01.patch is the p1 patch that is a difference between the
    current Python trunk and all that would need to be implemented to
    support Atomic Grouping / Possessive Qualifiers. Combined branches are
    joined with a PLUS ('+') and sub-branches concatenated with a DASH
    ('-'). Thus, "bpo-2636-01+09-01-01+10.patch" is a patch which combines
    the work from item 1: Atomic Grouping / Possessive Qualifiers, the
    sub-sub-branch of item 9: Engine Cleanups, and item 10: Shared
    Constants. Item 9 has both a child and a grandchild. The child
    (09-01) is my proposed engine redesign with the single loop; the
    grandchild (09-01-01) is the redesign with the triple loop. Finally,
    the optional "-only" flag means that the diff is against the core SRE
    modifications branch and thus does not include the core branch changes.

    As noted above, Items 01, 02, 05, 07 and 12 should be considered more or
    less complete and ready for merging assuming I don't identify in my
    implementation of the other items that I neglected something in these.
    The rest, including the combined items, are all provided in the given
    tarball.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Jun 17, 2008

    Sorry, as I stated in the last post, I generated the patches and then
    realized that I was missing the documentation for item 2, so I have
    updated the bpo-2636-02.patch file and am attaching that separately
    until the next release of the patch tarball. bpo-2636-02-only.patch
    should be ignored, and I will only regenerate it with the correct
    documentation in the next tarball release, so I can move on to either
    Character Classes or Relative Back-references. I want to pause item 3
    for the moment because 2, 3, 13, 14, 15 and 16 all seem closely
    related, and I need a break to allow my mind to wrap around the big
    picture before I try to tackle each one.

    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented Jun 18, 2008

    [snip]

    > 13. Implement a grouptuples(...) method as per Mark Summerfield's
    >   suggestion of 2008-05-28 09:38. grouptuples would take the same
    >   filtering parameters as the other group* functions, and would
    >   return a list of 3-tuples (unless only 1 group was requested). It
    >   should default to all match groups (1..n, not group 0, the matching
    >   string).

    :-)

    [snip]

    > Finally, I would like suggestions on how to handle name collisions
    > when match group names are provided as attributes. For instance, an
    > expression like '(?P<pos>.*)' would match more or less any string and
    > assign it to the name "pos". But "pos" is already an attribute of the
    > Match object, and therefore pos cannot be exposed as a named match
    > group attribute, since match.pos will return the usual meaning of pos
    > for a match object, not the value of the capture group named "pos".
    >
    > I have 3 proposals as to how to handle this:
    >
    > a) Simply disallow the exposure of match group name attributes if the
    > names collide with an existing member of the basic Match Object
    > interface.

    I don't like the prefix ideas, and now that you've spelt it out, I
    don't like that sometimes m.foo will work and sometimes it won't. So I
    prefer m['foo'] to be the canonical way, because that guarantees your
    code is always consistent.

    ------------------------------------------------------------
    BTW I wanted to do a simple regex to match a string that might or might
    not be quoted, and that could contain quotes (but not those used to
    delimit it). My first attempt was illegal:

    (?P<quote>['"])?([^(?=quote)])+(?(quote)(?=quote))
    

    It isn't hard to work round but it did highlight the fact that you can't
    use captures inside character classes. I don't know if Perl allows this;
    I guess if it doesn't then Python shouldn't either since GvR wants the
    engine to be Perl compatible.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Jun 19, 2008

    Thanks for weighing in, Mark! Actually, your point is valid and quite
    fair, though I would not assume that item 3 will be included just
    because item 2 isn't. I will do my best to develop both, but I do not
    make the final decision as to what python includes. That having been
    said, 3 seems very likely at this point, so we may be okay, but let me
    give this one more try, as I think I have a better solution to make
    item 2 more palatable. Let's say we have 5 choices here:

    a) Simply disallow the exposure of match group name attributes if the
    names collide with an existing member of the basic Match Object
    interface.

    b) Expose the reserved names through a special prefix notation, and
    for forward compatibility, expose all names via this prefix notation.
    In other words, if the prefix was 'k', match.kpos could be used to
    access pos; if it was '_', match._pos would be used. If Item 3 is
    implemented, it may be sufficient to allow access via match['pos'] as
    the canonical way of handling match group names using reserved words.

    c) Don't expose the names directly; only expose them through a
    prefixed name, e.g. match._pos or match.kpos.

    d) (As Mark suggested) we drop Item 2 completely. I have not invested
    much work in this so that would not bother me, but IMHO I actually
    prefer Item 2 to 3 so I would really like to see it preserved in some
    form.

    e) Add an option, re.MATCH_ATTRIBUTES, that is used as a Match-creation
    flag. When the re.MATCH_ATTRIBUTES or re.A flag is included in the
    compile, or (?a) is included in the pattern, it will do 2 things.
    First, it will raise an exception if either a) there exists an unnamed
    capture group or b) a capture group name is a reserved keyword. In
    addition to this, I would put in a hook to support a from __future__,
    so that any post-2.6 changes to the match object type can be smoothly
    integrated a version early, letting programmers adapt before any
    future changes arrive. Secondly, I would *conditionally* allow
    arbitrary capture group names via the __getattr__ handler IFF that flag
    was present; otherwise you could not access capture groups by name via
    match.foo.

    I really like the idea of e) so I'm taking Item 2 out of the "ready for
    merge" category and going to put it in the queue for the modifications
    spelled out above. I'm not too worried about our flags differing from
    Perl too much as we did base our first 4 on Perl (x, s, m, i), but
    subsequently added Unicode and Locale, which Perl does not have, and
    never implemented o (since our caching semantic already pretty much
    gives every expression that), e (which is specific to Perl syntax
    AFAICT) and g (which can be simulated via re.split). So I propose we
    take A and implement it as I've specified and that is the current goal
    of Item 2. Once this is done and working, we can decide whether it
    should be included in the python trunk.

    How does that sound to you, Mark and anyone else who wishes to weigh in?

    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented Jun 19, 2008

    [snip]

    It seems to me that both using a special prefix or adding an option are
    adding a lot of baggage and will increase the learning curve.

    The nice thing about (3) (even without slicing) is that it seems a v.
    natural extension. But (2) seems magical (i.e., Perl-like rather than
    Pythonic) which I really don't like.

    BTW I just noticed this:

        >>> "{0}".format(rx)
        '<_sre.SRE_Pattern object at 0x9ded020>'
        >>> "{0!r}".format(rx)
        '<_sre.SRE_Pattern object at 0x9ded020>'
        >>> "{0!s}".format(rx)
        '<_sre.SRE_Pattern object at 0x9ded020>'
        >>> "{0!a}".format(rx)

    That's fair enough, but maybe for !s the output should be rx.pattern?

    @pitrou
    Member

    pitrou commented Sep 13, 2008

    See also bpo-3825.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Sep 16, 2008

    Update 16 Sep 2008:

    Based on the work for issue bpo-3825, I would like to simply update the
    item list as follows:

    1. Atomic Grouping / Possessive Qualifiers (see also issue bpo-433030)
      [Complete]

    2. Match group names as attributes (e.g. match.foo) [Complete save
      issues outlined above]

    3. Match group indexing (e.g. match['foo'], match[3])

    4. Perl-style back-references (e.g. compile(r'(a)\g{-1}')), and
      possibly adding the r'\k' escape sequence for keywords.

    5. Parenthesis-aware python comment (e.g. r'(?P#...)') [Complete]

    6. Expose support for Template expressions (expressions without repeat
      operators), adding test cases and documentation for existing code.

    7. Larger compiled Regexp cache (256 vs. 100) and reduced thrashing
      risk. [Complete]

    8. Character classes (e.g. r'[:alphanum:]')

    9. Proposed engine redesigns and cleanups (the core item only contains
      cleanups and comments to the current design but does not modify the
      design).

    9-1) Single-loop engine redesign that runs 8% slower than current.
    [Complete]

    9-1-1) 3-loop engine redesign that runs 10% slower than current.
    [Complete]

    9-2) Matthew Barnett's engine redesign as per issue bpo-3825

    10. Have all C/Python shared constants stored in 1 place
      (sre_constants.py) and generated by that into C constants
      (sre_constants.h). [Complete AFAICT]

    11. Scan Perl 5.10.0 for other potential additions that could be
      implemented for Python.

    12. Documentation suggestions by Jim J. Jewett [Complete]

    13. Add a grouptuples method to the Match object (i.e.
      match.grouptuples() returns (<index>, <name or None>, <value>)
      tuples) suitable for iteration.

    14. UNICODE match group names, as per PEP-3131.

    15. Add __doc__ strings and other Python niceties to the Pattern_Type,
      Match_Type and Scanner_Type (experimental).

    16. Implement any remaining TODOs and FIXMEs in the Regexp modules.

    16-1) Allow for the disassociation of a source string from a
    Match_Type, assuming this will still leave the object in a
    "reasonable" state.

    17. Variable-length [positive and negative] look-behind assertions, as
      described and implemented in issue bpo-3825.

    ---

    Now, we have a combination of Items 1, 9-2 and 17 available in issue
    bpo-3825, so for now, refer to that issue for the 01+09-02+17 combined
    solution. Eventually, I hope to merge the work between this and that issue.

    I sadly admit I have made no progress on this since June, because I am
    managing 30-some lines of development, some of which have complex
    diamond branching, e.g.:

    01 is the child of bpo-2636
    09 is the child of bpo-2636
    10 is the child of bpo-2636
    09-01 is the child of 09
    09-01-01 is the child of 09-01
    01+09 is the child of 01 and 09
    01+10 is the child of 01 and 10
    09+10 is the child of 09 and 10
    01+09-01 is the child of 01 and 09-01
    01+09-01-01 is the child of 01 and 09-01-01
    09-01+10 is the child of 09-01 and 10
    09-01-01+10 is the child of 09-01-01 and 10

    Which all seems rather simple until you wrap your head around:

    01+09+10 is the child of 01, 09, 10, 01+09, 01+10 AND 09+10!

    Keep in mind that the reason for all this complex numbering is that
    many issues cannot be implemented in a vacuum: if you want Atomic
    Grouping, that's 1 implementation; if you want Shared Constants,
    that's a different implementation; but if you want BOTH Atomic
    Grouping and Shared Constants, that is a wholly other implementation,
    because each implementation affects the other. Thus, I end up with a
    plethora of branches and a nightmare when it comes to merging, which
    is why I've been so slow in making progress. Bazaar seems to be very
    confused when it comes to a merge in 6 parts between, for example, 01,
    09, 10, 01+09, 01+10 and 09+10, as above. It gets confused when it
    sees the same changes applied in a previous merge applied again:
    instead of simply realizing that the change in one branch since the
    last merge is EXACTLY the same change as in the other since the last
    merge, so effectively there is nothing to do, Bazaar treats code that
    did NOT change since the last merge as if it had changed, tries to
    roll back the 01+09+10-specific changes rather than doing nothing, and
    generates a conflict. Oh, that I could only have a version control
    system that understood the kind of complex branching that I require!

    Anyway, that's the state of things; this is me, signing out!

    @timehorse timehorse mannequin changed the title Regexp 2.6 (modifications to current re 2.2.2) Regexp 2.7 (modifications to current re 2.2.2) Sep 16, 2008
    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    Comparing item 2 and item 3, I think that item 3 is the Pythonic choice
    and item 2 is a bad idea.

    Item 4: back-references in the pattern are like \1 and (?P=name), not
    \g<1> or \g<name>, and in the replacement string are like \g<1> and
    \g<name>, not \1 (or (?P=name)). I'd like to suggest that
    back-references in the pattern also include \g<1> and \g<name> and
    \g<-1> for relative back-references. Interestingly, Perl names groups
    with (?<name>...) whereas Python uses (?P<name>...). A permissible
    alternative?

    @timehorse
    Mannequin Author

    timehorse mannequin commented Sep 24, 2008

    Thanks for weighing in Matthew!

    Yeah, I do get some flak for item 2, because originally item 3 wasn't
    supposed to cover named groups, but on investigation it made sense
    that it should. I still prefer 2 overall, but the nice thing about
    them being separate items is that we can accept 2 or 3 or both or
    neither, and for the most part development for the first phase of 2 is
    complete, though there is still IMHO the issue of UNICODE name groups
    (vis-à-vis item 14) and the name collision problem, which I propose
    fixing with an Attribute / re.A flag. So, I think it may end up that
    we could support both: 3 by default and 2 via a flag; or maybe 3 and 2
    both, but with 2 as is, with name collisions hidden (i.e. if you have
    r'(?P<string>...)' as your capture group, typing m.string will still
    give you the original comparison string, as per the current python
    documentation), but with collision-checking via the Attribute flag, so
    that r'(?A)(?P<string>...)' would not compile because string is a
    reserved word.

    Your interpretation of 4 matches mine, though, and I would definitely
    suggest using Perl's \g<-n> notation for relative back-references;
    further, I was thinking, if not as part of 4 then as part of the
    catch-all item 11, of adding support for Perl's (?<name>...) as a
    synonym for Python's (?P<name>...) and Perl's \k<name> for Python's
    (?P=name) notation. The evolution of Perl's named group is actually
    interesting. Years ago, Guido had a conversation with Larry Wall
    about using the (?P...) capture sequence for python-specific Regular
    Expression blocks. So Python went ahead and implemented named capture
    groups. Years later, the Perl folks thought named capture groups were
    a neat idea and adopted them in the (?<...>...) form, because Python
    had restricted the (?P...) notation to itself, so they couldn't use
    ours even if they wanted to. Now, though, with Perl adopting
    (?<...>...), I think it inevitable that Java and even C++ will see
    this as the de facto standard. So I 100% agree: we should consider
    supporting (?<name>...) in the parser.

    Oh, and as I suggested in bpo-3825, I have these new item proposals:

    Item 18: Add a re.REVERSE, re.R (?r) flag for reversing the direction of
    the String Evaluation against a given Regular Expression pattern. See
    bpo-516762, as implemented in bpo-3825.

    Item 19: Make various in-line flags positionally dependent; for
    example, (?i) makes the pattern before it case-sensitive but after it
    case-insensitive. See bpo-433024, as implemented in bpo-3825.

    Item 20: Allow the negation of in-line flags to cancel their effect in
    conditionally flagged expressions, for example (?-i). See bpo-433027,
    as implemented in bpo-3825.

    Item 21: Allow for scoped flagged expressions, i.e. (?i:...), where
    the flag(s) is applied only to the expression within the parentheses.
    See bpo-433028, as implemented in bpo-3825.

    Item 22: Zero-width regular expression split: when splitting via a
    regular expression that matches zero-length, this should return a
    result equivalent to splitting at each character boundary, with a null
    string at the beginning and end representing the space before the
    first and after the last character. See bpo-3262. (A short example of
    the intended behavior follows item 23 below.)

    Item 23: Character class ranges over case-insensitive matches, i.e.
    does "(?i)[9-A]" contain '_', whose ord is greater than the ord of 'A'
    and less than the ord of 'a'? See bpo-5311.
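
    For item 22, the intended behavior is easy to show with an example
    (this is, incidentally, the semantics that re.split itself adopted
    much later, in Python 3.7):

        import re

        print(re.split(r'', 'abc'))   # ['', 'a', 'b', 'c', ''] on 3.7+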

    And I shall create a bazaar repository for your current development line
    with the unfortunately unwieldy name of
    lp:~timehorse/python/issue2636-01+09-02+17+18+19+20+21 as that would,
    AFAICT, cover all the items you've fixed in your latest patch.

    Anyway, great work Matthew and I look forward to working with you on
    Regexp 2.7 as you do great work!

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    Regarding item 22: there's also bpo-1647489 ("zero-length match confuses
    re.finditer()").

    This had me stumped for a while, but I might have a solution. I'll see
    whether it'll fix item 22 too.

    I wasn't planning on doing any more major changes on my branch, just
    tweaking and commenting and seeing whether I've missed any tricks in the
    speed stakes. Half the task is finding out what's achievable, and how!

    @vbr
    Mannequin

    vbr mannequin commented Sep 3, 2011

    Not that it matters in any way, but if the regex semantics has to be
    distinguished via "non-standard" custom flags, I would prefer even
    less wordy flags, possibly such that the short forms for the
    in-pattern flag setting would be one-letter (like all the other
    flags), and preferably based on underlying plain English words, to get
    some mnemonics (which I don't see in the numbered versions, which
    require one to keep track of rather internal library versioning).
    Unfortunately, it might be difficult to find suitable names, given the
    objections expressed against the already discussed ones. (For what it
    is worth, I thought e.g. of [t]raditional and [e]nhanced, but these
    also suffer from some of the mentioned disadvantages...)
    vbr

    @stevendaprano
    Member

    Matthew Barnett wrote:

    > So, VERSION0 and VERSION1, with "(?V0)" and "(?V1)" in the pattern?

    Seems reasonable to me.

    +1

    @mattchaput
    Mannequin

    mattchaput mannequin commented Sep 15, 2011

    Not sure if this is better as a separate feature request or a comment here, but... the new version of .NET includes an option to specify a time limit on evaluation of regexes (not sure if this is a feature in other regex libs). This would be useful especially when you're executing regexes configured by the user and you don't know if/when they might go exponential. Something like this maybe:

    # Raises an re.Timeout if not complete within 60 seconds
    match = myregex.match(mystring, maxseconds=60.0)
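
    (There is no such parameter in re today; maxseconds above is the
    suggested, hypothetical API.) As a rough stopgap, the wait can be
    bounded with a worker thread, with the big caveat that the thread
    cannot be interrupted and keeps burning CPU after the timeout:

        import concurrent.futures
        import re

        def match_with_timeout(pattern, string, seconds):
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(re.match, pattern, string)
            try:
                return future.result(timeout=seconds)  # raises TimeoutError
            finally:
                executor.shutdown(wait=False)          # don't block on the worker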

    @ncoghlan
    Contributor

    As part of the PEP-408 discussions, Guido approved the addition of 'regex' in 3.3 (using that name, rather than as a drop-in replacement for re) [1,2]

    That should greatly ease the backwards compatibility concerns, even if it isn't as transparent an upgrade path.

    [1] http://mail.python.org/pipermail/python-dev/2012-January/115961.html
    [2] http://mail.python.org/pipermail/python-dev/2012-January/115962.html

    @alex
    Member

    alex commented Jan 29, 2012

    So, to my reading of the compatibility PEP, this cannot be added
    wholesale unless there is a pure Python version as well. However, if
    it replaced re (read: patched) it would be valid.



    @birkenfeld
    Member

    I created a new sandbox branch to integrate regex into CPython, see "remote repo" field.

    I mainly had to adapt the test suite to use unittest.

    @ncoghlan
    Contributor

    Alex has a valid point in relation to PEP-399, since, like lzma, regex will be coming in under the "special permission" clause that allows the addition of C extension modules without pure Python equivalents. Unlike lzma, though, the new regex engine isn't a relatively simple wrapper around an existing library - supporting the new API features on other implementations is going to mean a substantial amount of work.

    In practice, I expect that a pure Python implementation of a regular expression engine would only be fast enough to be usable on PyPy. So while we'd almost certainly accept a patch that added a parallel Python implementation, I doubt it would actually help Jython or IronPython all that much - they're probably going to need versions written in Java and C# to be effective (as I believe they already have for the re module).

    @ssbr
    Mannequin

    ssbr mannequin commented Jan 29, 2012

    > In practice, I expect that a pure Python implementation of a regular expression engine would only be fast enough to be usable on PyPy.

    Not sure why this is necessarily true. I'd expect a pure-Python implementation to be maybe 200 times as slow. Many queries (those on relatively short strings that backtrack little) finish within microseconds. On this scale, a couple of orders of magnitude is not noticeable by humans (unless it adds up), and even where it gets noticeable, it's better than having nothing at all or a non-working program (up to a point).

    python -m timeit -n 1000000 -s "import re; x = re.compile(r'.*<\s*help\s*>([^\<])<\s/\s*help.*>'); data = ' '*1000 + '< help >' + 'abc'*100 + '</help>'" "x.match(data)"
    1000000 loops, best of 3: 3.27 usec per loop

    @birkenfeld
    Member

    Well, REs are very often used to process large chunks of text by repeated application. So if the whole operation takes 0.1 or 20 seconds you're going to notice :)

    @ssbr
    Mannequin

    ssbr mannequin commented Jan 29, 2012

    It'd be nice if we had some sort of representative benchmark for real-world uses of Python regexps. The JS guys have all pitched in to create such a thing for uses of regexps on the web. I don't know of any such thing for Python.

    I agree that a Python implementation wouldn't be useful for some cases. On the other hand, I believe it would be fine (or at least tolerable) for some others. I don't know the ratio between the two.

    @pitrou
    Member

    pitrou commented Jan 29, 2012

    It'd be nice if we had some sort of representative benchmark for
    real-world uses of Python regexps. The JS guys have all pitched in to
    create such a thing for uses of regexps on the web. I don't know of
    any such thing for Python.

    See http://hg.python.org/benchmarks/; there are regex benchmarks there.

    I agree that a Python implementation wouldn't be useful for some
    cases. On the other hand, I believe it would be fine (or at least
    tolerable) for some others. I don't know the ratio between the two.

    I think the ratio would be something like 2% tolerable :)

    As I said to Ezio and Georg, I think adding the regex module needs a
    PEP, even if it ends up non-controversial.

    @sandrotosi
    Copy link
    Contributor

    I've just uploaded regex into Debian: this will hopefully give some more eyes looking at the module and reporting feedback.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Nov 5, 2012

    I've been working through the "known crashers" list in the stdlib. The recursive import one was fixed with the migration to importlib in 3.3, and the compiler one will be fixed in 3.3.1 (with an enforced nesting limit). One of the remaining entries is actually a pathological failure in the re module rather than a true crasher (i.e. it doesn't segfault, and in 2.7 and 3.3 you can interrupt it with Ctrl-C):
    http://hg.python.org/cpython/file/default/Lib/test/crashers/infinite_loop_re.py

    I mention it here as another problem that adopting the regex module could resolve (as regex promptly returns None for this case).
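
    For illustration only (the crasher file uses a different pattern), the classic shape of such pathological backtracking in re is a nested quantifier:

    # Illustrative only; not the exact pattern from the crasher file.
    # The nested quantifier makes re try exponentially many ways to split
    # the 'a's between the inner and outer repetition before failing.
    import re

    pattern = re.compile(r'(a+)+b')
    # Each extra 'a' roughly doubles the running time; at ~30 characters
    # this effectively never finishes (though Ctrl-C can interrupt it).
    pattern.match('a' * 30 + 'c')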

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jun 26, 2014

    Will we actually get regex into the standard library on this pass?

    @ncoghlan
    Copy link
    Contributor

    Even with in-principle approval from Guido, this idea still depends on
    volunteers to actually write up a concrete proposal as a PEP (which
    shouldn't be too controversial, given Guido already OK'ed the idea) and
    then do the integration work to incorporate the code, tests and docs into
    CPython (not technically *hard*, but not trivial either). "pip install
    regex" starts looking fairly attractive at that point :)

    @serhiy-storchaka
    Copy link
    Member

    Here is my (slowly implemented) plan:

    1. Recommend regex as an advanced replacement for re (bpo-22594).

    2. Fix all obvious bugs in the re module if this doesn't break backward compatibility (bpo-12728, bpo-14260, and many already closed issues).

    3. Deprecate and then forbid behavior which looks like a bug, doesn't match regex in V1 mode, and can't be fixed without breaking backward compatibility (bpo-22407, bpo-22493, bpo-22818).

    4. Unify minor details with regex (bpo-22364, bpo-22578).

    5. Fork regex and drop all advanced nonstandard features (such as fuzzy matching). Too many features make learning and using the module harder; they should stay in an advanced module (regex). Two such features are sketched after this list.

    6. Write benchmarks which cover all corner cases and compare re with regex case by case, then optimize the slower module. Currently re is faster than regex for all the simple examples I tried (maybe as a result of bpo-18685), but in the overall benchmark results (msg109447) re is slower.

    7. Maybe implement some standard features which were rejected in favor of this issue (bpo-433028, bpo-433030). re should conform at least to Level 1 of UTS #18 (http://www.unicode.org/reports/tr18/#Basic_Unicode_Support).

    In the best case, in 3.7 or 3.8 we could replace re with a simplified regex. Or by then re will be free of bugs and warts.
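
    To make step 5 concrete, here is a hedged sketch of two of the regex-only features mentioned there (requires the third-party regex module; behavior as per its documentation):

    # Sketch of two features present in regex but absent from re.
    import regex

    # Fuzzy matching: allow at most one error (insert/delete/substitute).
    print(regex.fullmatch(r'(?:cats){e<=1}', 'cat') is not None)  # True

    # V1 mode: set operations inside character classes, here difference.
    print(regex.findall(r'(?V1)[[a-z]--[aeiou]]+', 'python'))  # ['pyth', 'n']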

    @pitrou
    Copy link
    Member

    pitrou commented Nov 8, 2014

    Here is my (slowly implemented) plan:

    Exciting. Perhaps you should post your plan on python-dev.

    In any case, huge thanks for your work on the re module.

    @serhiy-storchaka
    Copy link
    Member

    Exciting. Perhaps you should post your plan on python-dev.

    Thank you, Antoine. I think all interested core developers are already aware
    of this issue. A disadvantage of posting on python-dev is that it would
    require manually copying the links and maybe the titles of all the mentioned
    issues, while here they are available automatically. Oh, I'm lazy.

    @ezio-melotti
    Copy link
    Member

    So you are suggesting to fix bugs in re to make it closer to regex, and then replace re with a forked subset of regex that doesn't include advanced features, or just to fix/improve re until it matches the behavior of regex?
    If you are suggesting the former, I would also suggest checking the coverage and bringing it as close as possible to 100%.

    @serhiy-storchaka
    Copy link
    Member

    So you are suggesting to fix bugs in re to make it closer to regex, and then
    replace re with a forked subset of regex that doesn't include advanced
    features, or just to fix/improve re until it matches the behavior of regex?

    Depends on what will be easier. Maybe some bugs are so hard to fix that
    replacing re with regex is the only solution. But if a fixed re turns out
    simpler and faster than a lightened regex and contains all the necessary
    features, there will be no need for the replacement. Currently the code of
    regex looks more high-level and better structured, but the code of re looks
    simpler and is much smaller. In any case, the closer re and regex are, the
    easier the migration will be.

    @ezio-melotti
    Copy link
    Member

    Ok, regardless of what will happen, increasing test coverage is a worthy goal. We might start by looking at the regex test suite to see if we can import some tests from there.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Nov 9, 2014

    Thanks for pushing this one forward Serhiy! Your approach sounds like a
    fine plan to me.

    @timehorse
    Copy link
    Mannequin Author

    timehorse mannequin commented Nov 9, 2014

    If I recall correctly, I started this thread with a plan to update re itself with implementations of the various features listed in the top post. If you look at the list of files I uploaded, there are some complete patches for re that add various features, like atomic grouping. If we wish to bring re up to the regex standard, we could therefore start with those features.

    @Mateon1
    Copy link
    Mannequin

    Mateon1 mannequin commented Nov 13, 2014

    Well, I found a bug in this module on Python 2.7(.5) on Windows 7 64-bit: when you try to compile a regex with the flags V1|DEBUG, the module crashes as if it wanted to call a builtin named "ascii".

    The bug has happened to me several times; this is the regexp that triggered it most recently: http://paste.ubuntu.com/8993680/

    I hope it's fixed; I really love the module and have found it very useful to have PCRE-style regexes in Python.
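
    For reference, a hedged reconstruction of the trigger, not a verified reproduction (the actual pattern is behind the paste link above):

    # Reconstruction of the report: on the affected regex releases under
    # Python 2.7, this flag combination reportedly raised a NameError for
    # "ascii", a builtin that exists only on Python 3.
    import regex

    regex.compile(r'placeholder', regex.V1 | regex.DEBUG)  # pattern is arbitrary here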

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Nov 13, 2014

    @mateon1: "I hope it's fixed"? Did you report it?

    @Mateon1
    Copy link
    Mannequin

    Mateon1 mannequin commented Nov 14, 2014

    Well, I am reporting it here; is this not the correct place? Sorry if it isn't.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Nov 14, 2014

    The page on PyPI says where the project's homepage is located:

    Home Page: https://code.google.com/p/mrab-regex-hg/

    The bug was fixed in the last release.

    @vstinner
    Copy link
    Member

    It's now a third party project: https://pypi.org/project/regex/

    If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first.

    I'm closing the issue as REJECTED.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022