
Adding a new regex module (compatible with re) #46888

Closed

timehorse mannequin opened this issue Apr 15, 2008 · 334 comments
Labels: stdlib (Python modules in the Lib dir), topic-regex, type-feature (A feature request or enhancement)

Comments

@timehorse
Mannequin

timehorse mannequin commented Apr 15, 2008

BPO 2636
Nosy @loewis, @birkenfeld, @gpshead, @amauryfa, @ncoghlan, @abalkin, @pitrou, @vstinner, @giampaolo, @mark-summerfield, @ssbr, @ezio-melotti, @akitada, @stevendaprano, @alex, @bitdancer, @sandrotosi, @sorcio, @ericsnowcurrently, @Fak3, @serhiy-storchaka, @jayvdb
Files
  • regex_test-20100316: Python 2.6.5 re test run against regex-20100305
  • issue2636-20101230.zip
  • remove_guards.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2021-01-25.22:30:27.300>
    created_at = <Date 2008-04-15.11:57:51.375>
    labels = ['expert-regex', 'type-feature', 'library']
    title = 'Adding a new regex module (compatible with re)'
    updated_at = <Date 2021-01-25.22:30:27.294>
    user = 'https://bugs.python.org/timehorse'

    bugs.python.org fields:

    activity = <Date 2021-01-25.22:30:27.294>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-01-25.22:30:27.300>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2008-04-15.11:57:51.375>
    creator = 'timehorse'
    dependencies = []
    files = ['16563', '20192', '20203']
    hgrepos = ['108']
    issue_num = 2636
    keywords = ['patch']
    message_count = 334.0
    messages = ['65513', '65593', '65613', '65614', '65617', '65725', '65726', '65727', '65734', '65838', '65841', '66033', '67309', '67447', '67448', '68336', '68339', '68358', '68399', '68409', '73185', '73295', '73714', '73717', '73721', '73730', '73752', '73766', '73779', '73780', '73782', '73791', '73794', '73798', '73801', '73803', '73805', '73827', '73848', '73853', '73854', '73855', '73861', '73875', '73955', '74025', '74026', '74058', '74104', '74174', '74203', '74204', '74904', '80916', '81112', '81236', '81238', '81239', '81240', '81359', '81473', '81475', '82673', '82739', '82950', '83271', '83277', '83390', '83411', '83427', '83428', '83429', '83988', '83989', '83993', '84350', '86004', '86032', '89632', '89634', '89643', '90954', '90961', '90985', '90986', '90989', '91028', '91035', '91038', '91245', '91250', '91437', '91439', '91448', '91450', '91460', '91462', '91463', '91473', '91474', '91490', '91495', '91496', '91497', '91500', '91535', '91598', '91607', '91610', '91671', '91917', '97860', '98809', '99072', '99132', '99148', '99186', '99190', '99462', '99470', '99479', '99481', '99494', '99548', '99552', '99665', '99668', '99835', '99863', '99872', '99888', '99890', '99892', '100066', '100076', '100080', '100134', '100152', '100359', '100362', '100370', '100452', '101172', '101181', '101193', '101557', '102042', '103003', '103060', '103064', '103078', '103095', '103096', '103097', '109358', '109363', '109372', '109384', '109401', '109403', '109404', '109405', '109406', '109407', '109408', '109409', '109410', '109413', '109447', '109460', '109461', '109463', '109474', '109657', '110233', '110237', '110701', '110704', '110761', '111519', '111531', '111643', '111652', '111656', '111660', '111921', '113927', '113931', '114034', '114766', '116133', '116151', '116223', '116227', '116229', '116231', '116238', '116248', '116252', '116276', '116749', '117008', '117046', '117050', '118243', '118631', '118636', '118640', '118674', '118682', '119887', '119930', '119947', '119951', '119956', '119958', '120013', '120037', '120038', '120164', '120202', '120203', '120204', '120206', '120215', '120216', '120243', '120571', '120969', '120976', '120984', '120986', '121136', '121145', '121149', '121589', '121832', '122221', '122225', '122228', '122880', '123518', '123527', '123747', '124581', '124582', '124585', '124614', '124626', '124627', '124746', '124750', '124759', '124816', '124821', '124833', '124834', '124891', '124900', '124904', '124905', '124906', '124909', '124912', '124929', '124931', '124936', '124950', '124959', '124971', '124988', '124990', '125291', '126294', '126372', '127045', '130886', '130905', '130906', '130999', '135700', '135703', '135704', '140102', '140152', '140154', '143090', '143333', '143334', '143337', '143340', '143343', '143350', '143352', '143355', '143366', '143367', '143374', '143377', '143389', '143423', '143442', '143445', '143447', '143448', '143467', '143471', '143619', '144110', '152210', '152211', '152212', '152214', '152215', '152217', '152218', '152246', '157445', '174888', '221629', '221657', '230846', '230862', '230866', '230867', '230873', '230877', '230885', '230887', '231137', '231141', '231159', '231165', '385674']
    nosy_count = 43.0
    nosy_names = ['loewis', 'georg.brandl', 'gregory.p.smith', 'jimjjewett', 'sjmachin', 'amaury.forgeotdarc', 'ncoghlan', 'belopolsky', 'pitrou', 'vstinner', 'nneonneo', 'giampaolo.rodola', 'rsc', 'timehorse', 'mark', 'vbr', 'Devin Jeanpierre', 'ezio.melotti', 'mrabarnett', 'jaylogan', 'akitada', 'moreati', 'steven.daprano', 'alex', 'r.david.murray', 'jacques', 'zdwiel', 'THRlWiTi', 'sandro.tosi', 'abacabadabacaba', 'stiv', 'davide.rizzo', 'mattchaput', 'ronnix', 'tshepang', 'eric.snow', 'akoumjian', 'Roman.Evstifeev', 'serhiy.storchaka', 'Mateon1', 'jayvdb', 'Socob', 'petros']
    pr_nums = []
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue2636'
    versions = []

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 15, 2008

    I am working on adding features to the current Regexp implementation,
    which is now set to 2.2.2. These features are to bring the Regexp code
    closer in line with Perl 5.10 as well as add a few python-specific
    niceties and potential speed-ups and clean-ups.

    I will be posting regular patch updates to this thread when major
    milestones have been reached, with a description of the feature(s)
    added. Currently, the list of proposed changes is (in no particular
    order):

    1. Fix bpo-433030 (http://bugs.python.org/issue433030) by
      adding support for Atomic Grouping and Possessive Qualifiers (see
      the example after this list)

    2. Make named matches direct attributes of the match object; i.e.
      instead of m.group('foo'), one will be able to write simply m.foo.

    3. (maybe) make Match objects subscriptable, such that m[n] is
      equivalent to m.group(n) and allow slicing.

    4. Implement Perl-style back-references including relative back-references.

    5. Add a well-formed, python-specific comment modifier, e.g. (?P#...);
      the difference between (?P#...) and Perl/Python's (?#...) is that the
      former will allow nested parentheses as well as parenthetical escaping,
      so that a pattern such as '(?P# Evaluate (the following) expression,
      3\) using some other technique)' is handled. The (?P#...) will
      interpret this entire expression as a comment, whereas with (?#...),
      everything following ' expression...' would be considered part of the
      match. (?P#...) will necessarily be slower than (?#...) and so should
      only be used if a richer commenting style is required but verbose mode
      is not desired.

    6. Add official support for fast, non-repeating capture groups with the
      Template option. Template is unofficially supported and disables all
      repeat operators (*, + and ?). This would mainly consist of documenting
      its behavior.

    7. Modify the re compiled-expression cache to better handle the
      thrashing condition. Currently, when regular expressions are compiled,
      the result is cached so that if the same expression is compiled again,
      it is retrieved from the cache and no extra work has to be done. This
      cache supports up to 100 entries. Once the 100th entry is reached, the
      cache is cleared and a new compile must occur. The danger, albeit
      rare, is that one may compile the 100th expression only to find that
      one recompiles it and has to do the same work all over again when it
      may have been done 3 expressions ago. By modifying this logic
      slightly, it is possible to establish an arbitrary counter that gives
      a time stamp to each compiled entry and, instead of clearing the
      entire cache when it reaches capacity, only eliminate the oldest half
      of the cache, keeping the half that is more recent. This should limit
      the possibility of thrashing to cases where a very large number of
      regular expressions are continually recompiled. In addition to this,
      I will update the limit to 256 entries, meaning that the 128 most
      recent are kept (see the sketch after this list).

    8. Emacs/Perl style character classes, e.g. [:alphanum:]. For instance,
      :alphanum: would not include the '_' in the character class.

    9. C-Engine speed-ups. I am commenting and cleaning up the _sre.c
      Regexp engine to make it flow more linearly, rather than with all the
      current gotos, and replacing the switch-case statements with lookup
      tables, which tests have shown to be faster. This will also include
      adding many more comments to the C code in order to make it easier for
      future developers to follow. These changes are subject to testing,
      and some modifications may not be included in the final release if
      they are shown to be slower than the existing code. Also, a number of
      macros are being eliminated where appropriate.

    10. Export any (not already) shared value between the Python Code and
      the C code, e.g. the default Maximum Repeat count (65536); this will
      allow those constants to be changed in 1 central place.

    11. Various other Perl 5.10 conformance modifications, TBD.

    More items may come and suggestions are welcome.
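
    To make item 1 concrete, here is roughly how atomic groups and
    possessive qualifiers behave. The snippet uses the third-party regex
    module (which later grew these features) purely as a stand-in to
    demonstrate the semantics, since re does not support them at this
    point:

        import regex  # third-party module, used here only to show the semantics

        # Atomic group: once (?>...) has matched, the engine never
        # backtracks back into it.
        print(regex.match(r'a(?>bc|b)c', 'abcc'))  # matches 'abcc'
        print(regex.match(r'a(?>bc|b)c', 'abc'))   # None: 'bc' is locked in,
                                                   # so 'b' is never retried

        # Possessive qualifier: '++' repeats greedily and gives nothing back.
        print(regex.match(r'\d++x', '123x'))       # matches '123x'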
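
    And here is a minimal sketch of the cache policy proposed in item 7;
    the names and structure are illustrative, not the actual sre
    internals. Each entry is stamped with a counter, and at capacity the
    older half is evicted instead of clearing everything:

        import re

        _MAXCACHE = 256
        _cache = {}      # pattern -> (stamp, compiled pattern)
        _counter = 0

        def cached_compile(pattern):
            global _counter
            entry = _cache.get(pattern)
            if entry is not None:
                return entry[1]
            if len(_cache) >= _MAXCACHE:
                # Keep only the most recently stamped half of the entries.
                newest = sorted(_cache.items(), key=lambda kv: kv[1][0])
                _cache.clear()
                _cache.update(newest[_MAXCACHE // 2:])
            _counter += 1
            compiled = re.compile(pattern)
            _cache[pattern] = (_counter, compiled)
            return compiled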

    -----

    Currently, I have code which implements 5) and 7), have done some work
    on 10), and am almost done with 9). When 9) is complete, I will work
    on 1), some of which, such as parsing, is already done, then probably
    8) and 4), because they should not require too much work -- 4) is
    parser-only AFAICT. Then, I will attempt 2) and 3), though those will
    require changes at the C-code level. Then I will investigate what
    additional elements of 11) I can easily implement. Finally, I will
    write documentation for all of these features, including 6).

    In a few days, I will provide a patch with my interim results and will
    update the patches with regular updates when Milestones are reached.

    @timehorse timehorse mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 15, 2008
    @akuchling akuchling added topic-regex and removed stdlib Python modules in the Lib dir labels Apr 15, 2008
    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 17, 2008

    I am very sorry to report the benchmark results below. First, the
    good news: as of this moment, item 9), although not yet complete, is
    stable and able to pass all the existing python regexp tests. Because
    these tests are timed, I am using the timings from the first suite of
    tests to benchmark the old code against the new. Based on discussion
    with Andrew Kuchling, I have decided, for the sake of simplicity, that
    the "timing" of each version is the absolute minimum time to execute
    observed, because that run presumably had the most continuous CPU
    cycles and thus most closely represents the true execution time.

    It is this conclusion that greatly saddens me -- not that the effort
    has not been valuable in understanding the current engine. Indeed, I
    now understand the current engine well enough that I could proceed
    with the other modifications against it as-is, rather than
    implementing them in the new engine. Mind you, I will likely not
    bring over the copious comments that the new engine received when I
    translated it to a form without C macros and gotos, as that would
    require too much effort IMHO.

    Anyway, all that being said, and keeping in mind that I am not 100%
    satisfied with the new engine and may still be able to wring some timing
    out of it -- not that I will spend much more time on this -- here is
    where we currently stand:

    Old Engine: 6.574s
    New Engine: 7.239s

    This makes the old Engine 665ms faster over the entire first test_re.py
    suite, or 9% faster than the New Engine.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 18, 2008

    Here are the modifications so far for item 9) in _sre.c, plus some
    small modifications to sre_constants.h which exist only to get _sre.c
    to compile; normally sre_constants.h is generated by sre_constants.py,
    so this is not the final version of that file. I had also intended to
    make SRE_CHARSET and SRE_COUNT use lookup tables, as well as maybe
    others, but likely no further lookup tables beyond those. I also want
    to move alloc_pos out of the self object and make it a parameter to
    the ALLOC macro, and probably get rid of the op_code attribute, since
    it is only used in 1 place to save one subtract in a very rare case.
    But I want to resolve the 10% problem first, so I would appreciate it
    if people could look at the REMOVE_SRE_MATCH_MACROS section of code,
    compare it to the non-REMOVE_SRE_MATCH_MACROS version of SRE_MATCH,
    and see if you can suggest anything to make the former (new code)
    faster, to get me that elusive 10%.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 18, 2008

    Here is a patch to implement item 7)

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 18, 2008

    This simple patch adds (?P#...)-style comment support.

    @jimjjewett
    Mannequin

    jimjjewett mannequin commented Apr 24, 2008

    > These features are to bring the Regexp code closer in line with Perl 5.10

    Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?

    > as well as add a few python-specific

    because this also adds to the scope.

    > 2. Make named matches direct attributes of the match object; i.e.
    >   instead of m.group('foo'), one will be able to write simply m.foo.
    >
    > 3. (maybe) make Match objects subscriptable, such that m[n] is
    >   equivalent to m.group(n) and allow slicing.

    (2) and (3) would both be nice, but I'm not sure it makes sense to do
    *both* instead of picking one.

    > 5. Add a well-formed, python-specific comment modifier,
    >   e.g. (?P#...);

    [handles parens in comments without turning on verbose, but is slower]

    Why? It adds another incompatibility, so it has to be very useful or
    clear. What exactly is the advantage over just turning on verbose?

    > 9. C-Engine speed-ups. ...
    >   a number of Macros are being eliminated where appropriate.

    Be careful on those, particularly on str/unicode and different compile options.

    @amauryfa
    Member

    > These features are to bring the Regexp code closer in line
    > with Perl 5.10

    > Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?

    5.10.0 comes after 5.8 and is the latest version (2007/12/18)!
    Yes it is confusing.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 24, 2008

    Thanks Jim for your thoughts!

    Amaury has already explained about Perl 5.10.0. I suppose it's like
    Macintosh version numbering, since Mac Tiger went from version 10.4.9 to
    10.4.10 and 10.4.11 a few years ago. Maybe we should call Python 2.6
    Python 2.06 just in case. But 2.6 is the known last in the 2 series, so
    it's not a problem for us! :)

    > as well as add a few python-specific
    >
    > because this also adds to the scope.

    At this point the only python-specific changes I am proposing would be
    items 2, 3 (discussed below), 5 (discussed below), 6 and 7. 6 is only a
    documentation change, the code is already implemented. 7 is just a
    better behavior. I think it is RARE one compiles more than 100 unique
    regular expressions, but you never know as projects tend to grow over
    time, and in the old code the 101st would be recompiled even if it was
    just compiled 2 minutes ago. The patch is available so I leave it to
    the community to judge for themselves whether it is worth it, but as you
    can see, it's not a very large change.

    > 2) Make named matches direct attributes
    > of the match object; i.e. instead of m.group('foo'),
    > one will be able to write simply m.foo.

    > 3) (maybe) make Match objects subscriptable, such
    > that m[n] is equivalent to m.group(n) and allow slicing.

    > (2) and (3) would both be nice, but I'm not sure it makes sense to do
    > *both* instead of picking one.

    Well, I think named matches are better than numbered ones, so I'd
    definitely go with 2. The problem with 2, though, is that it still
    leaves the rather typographically intense m.group(n), since I cannot
    write m.3. However, since capture groups are always numbered
    sequentially, a match models a list very nicely. So I think for
    indexing by group number, the subscripting operator makes sense. I was
    not originally suggesting m['foo'] be supported, but I can see how that
    may come out of 3. There is, though, a restriction that python match
    group names must be valid python identifiers, and that restriction
    belongs to 2 more than 3, because 3 would not require it but 2 would.
    So at least I want 2, but it seems IMHO m[1] is better than m.group(1)
    and not in the least a hard or confusing way of retrieving the given
    group. Mind you, the Match object is a C struct with python bindings,
    and I'm not exactly sure how to add either feature to it, but I'm sure
    the C-API manual will help with that.
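
    For the record, a tiny wrapper illustrates the item-3 behavior under
    discussion (IndexableMatch is a hypothetical name; Python 3.6
    eventually added exactly this subscripting to match objects):

        import re

        class IndexableMatch:
            """Wrap a match so m[n] and m['name'] delegate to m.group(...)."""
            def __init__(self, match):
                self._m = match
            def __getitem__(self, key):
                return self._m.group(key)

        m = IndexableMatch(re.match(r'(?P<word>\w+) (\d+)', 'abc 42'))
        print(m[1], m[2], m['word'])   # abc 42 abc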

    > 5) Add a well-formed, python-specific comment modifier,
    > e.g. (?P#...);

    > [handles parens in comments without turning on verbose, but is slower]
    >
    > Why? It adds another incompatibility, so it has to be very useful or
    > clear. What exactly is the advantage over just turning on verbose?

    Well, Larry Wall and Guido agreed long ago that we, the python
    community, own all expressions of the form (?P...), and although it
    would be my preference to make (?#...) itself understand nested
    parentheses, changing the logic behind THAT would make python
    non-standard. So as far as any conflicting design, we needn't worry.

    As for speed, this all occurs in the parser and does not affect the
    compiler or engine. It occurs only after a (?P has been read, and then
    only as the last check before failure, so it should not be much slower
    except when the expression is invalid. The actual execution time to
    find the closing brace of (?P#...) is a bit slower than that for
    (?#...), but not by much.

    Verbose is generally a good idea for anything more than a trivial
    regular expression. However, it can have overhead if not included as
    the first flag: an expression is always checked for verbose
    post-compilation, and if verbose is encountered, the expression is
    compiled a second time, which is somewhat wasteful. But the reason I
    like (?P#...) over (?#...) is that I think people would tend to assume:

    r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

    That expression only matches "He ls)llo", so I created the (?P#...) to
    make the comment match type more intuitive:

    r'He(?P# 2 (TWO) ls)llo' matches "Hello".
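
    A quick, runnable illustration of the existing behavior (a (?#...)
    comment ends at the first ')', so parentheses cannot appear inside it):

        import re

        # Fine: no ')' inside the comment, so it is skipped as expected.
        print(re.match(r'He(?# a simple comment)llo', 'Hello'))  # matches

        # With '(TWO)' inside, the comment ends right after '(TWO', and the
        # leftover ' ls)llo' leaks into the pattern -- hence the surprise.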

    > 9) C-Engine speed-ups. ...
    > a number of Macros are being eliminated where appropriate.

    > Be careful on those, particularly on str/unicode and different
    > compile options.

    Will do; thanks for the advice! I have only observed the UNICODE flag
    controlling whether certain code is used (besides the ones I've added)
    and have tried to stay true to that when I encounter it. Mind you,
    unless I can get my extra 10% it's unlikely I'd actually go with item 9
    here, even if it is easier to read IMHO. However, I want to run the new
    engine proposal through gprof to see if I can track down some bottlenecks.

    At some point, I hope to get my current changes on Launchpad if I can
    get that working. If I do, I'll give a link to how people can check out
    my working code here as well.

    @jimjjewett
    Mannequin

    jimjjewett mannequin commented Apr 24, 2008

    Python 2.6 isn't the last, but Guido has said that there won't be a 2.10.

    > Match object is a C-struct with python binding
    > and I'm not exactly sure how to add either feature to it

    I may be misunderstanding -- isn't this just a matter of writing the
    function and setting it in the tp_as_sequence and tp_as_mapping slots?

    > Larry Wall and Guido agreed long ago that we, the python
    > community, own all expressions of the form (?P...)

    Cool -- that reference should probably be added to the docs. For someone
    trying to learn or translate regular expressions, it helps to know that (?P
    ...) is explicitly a python extension (even if Perl adopts it later).

    Definitely put the example in the doc.

    > r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

    Maybe add that even without the change, as doco on the current situation.

    Does VERBOSE really have to be the first flag, or does it just have to be on
    the whole pattern instead of an internal switch?

    I'm not sure I fully understand what you said about template. Is this a
    special undocumented switch, or just an internal optimization mode that
    should be triggered whenever the repeat operators don't happen to occur?

    @pitrou
    Member

    pitrou commented Apr 26, 2008

    I don't know anything about regexp implementation, but if you replace a
    switch-case with a function lookup table, it isn't surprising that the
    new version ends up slower. A local jump is always faster than a
    function call, because of the setup overhead and stack manipulation the
    latter involves.

    So you might try to do the cleanup while keeping the switch-case
    structure, if possible.
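
    A rough Python analogue of the point (the effect is stronger in C,
    where an indirect call sets up a stack frame that an inline branch
    avoids):

        import timeit

        def run_branch(ops):
            acc = 0
            for op in ops:           # inline branching, no call overhead
                if op == 0:
                    acc += 1
                else:
                    acc -= 1
            return acc

        TABLE = {0: lambda acc: acc + 1, 1: lambda acc: acc - 1}

        def run_table(ops):
            acc = 0
            for op in ops:           # dispatch through a table of functions
                acc = TABLE[op](acc)
            return acc

        ops = [0, 1] * 50000
        print('branch:', timeit.timeit(lambda: run_branch(ops), number=20))
        print('table :', timeit.timeit(lambda: run_table(ops), number=20))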

    @timehorse
    Mannequin Author

    timehorse mannequin commented Apr 26, 2008

    Thank you and Merci Antoine!

    That is a good point. It is clearly compiler-specific whether a
    switch-case will be turned into a series of conditional branches or
    into an internal jump table with lookup. And it is true that most
    compilers, if I understand correctly, use the jump-table approach for
    any switch-case over 2 or 3 entries when the cases are tightly grouped
    and near 0. That is probably why the original code was so fast. I'll
    see if I can combine the best of both approaches. Thanks again!

    @timehorse
    Mannequin Author

    timehorse mannequin commented May 1, 2008

    I am making my changes in a Bazaar branch hosted on Launchpad. It took
    me quite a while to get things set up more-or-less logically, but there
    they are, and I'm currently trying to re-apply my local changes up to
    today into the various branches I have. Each of the 11 issues I
    outlined originally has its own branch, with a root branch from which
    all these branches are derived. The root serves as a place a) to merge
    in concurrent python 2.6 alpha development and b) to apply any
    additional re changes that don't fall into any of the other categories,
    of which I have so far found only 2 small ones.

    Anyway, if anyone is interested in monitoring my progress, it is
    available at:

    https://code.launchpad.net/~timehorse/

    I will still post major milestones here, but one can monitor day-to-day
    progress on Launchpad. Also on Launchpad you will find more detail on
    the plans for each of the 11 modifications, for the curious.

    Thanks again for all the advice!

    @timehorse
    Mannequin Author

    timehorse mannequin commented May 24, 2008

    I am finally making progress again, after a month of moving my patches
    from my local svn repository to bazaar hosted on launchpad.net, as
    stated in my last update. I have also more or less finished probably
    the easiest item, #5, so I have a full patch for that available now.
    First, though, I want to update my "no matter what" patch, which is to
    say the changes I want to make if any changes are made to the Regexp
    code at all.

    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented May 28, 2008

    AFAIK if you have a regex with named capture groups there is no direct
    way to relate them to the capture group numbers.
    You could do (untested; Python 3 syntax):

        d = {v: k for k, v in match.groupdict().items()}
        for i in range(1, (match.lastindex or 0) + 1):
            print(i, match.group(i), d.get(match.group(i)))

    One possible solution would be a grouptuples() function that returned a
    tuple of 3-tuples (index, name, captured_text) with the name being None
    for unnamed groups.
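
    A pure-Python sketch of the suggested grouptuples() (a hypothetical
    API), built only on documented pattern attributes:

        import re

        def grouptuples(match):
            """Return (index, name_or_None, captured_text) for each group."""
            index_to_name = {v: k for k, v in match.re.groupindex.items()}
            return [(i, index_to_name.get(i), match.group(i))
                    for i in range(1, match.re.groups + 1)]

        m = re.match(r'(?P<word>\w+) (\d+)', 'abc 42')
        print(grouptuples(m))   # [(1, 'word', 'abc'), (2, None, '42')]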

    Anyway, good luck with all your improvements, I will be especially glad
    if you manage to do (2) and (8) (and maybe (3)).

    @timehorse
    Mannequin Author

    timehorse mannequin commented May 28, 2008

    Mark scribbled:

    > One possible solution would be a grouptuples() function that returned
    > a tuple of 3-tuples (index, name, captured_text) with the name being
    > None for unnamed groups.

    Hmm. Well, that's not a bad idea at all IMHO and would, AFAICT,
    probably be easier to do than (2). I would still do (2), but I will
    try to add this to one of the existing items, or spawn another item
    for it, since it is kind of a distinct feature.

    My preference right now is to finish off the test cases for (7) because
    it is already coded, then finish the work on (1) as that was the
    original reason for modification then on to (2) then (3) as they are
    related and then I don't mind tackling (8) because I think that one
    shouldn't be too hard. Interestingly, the existing engine code
    (sre_parse.py) has a place-holder, commented out, for character classes
    but it was never properly implemented. And I will warn that with
    Unicode, I THINK all the character classes exist as unicode functions or
    can be implemented as multiple unicode functions, but I'm not 100% sure
    so if I run into that problem, some character classes may initially be
    left out while I work on another item.

    Anyway, thanks for the input, Mark!

    @timehorse
    Mannequin Author

    timehorse mannequin commented Jun 17, 2008

    Well, it's time for another update on my progress...

    Some good news first: Atomic Grouping is now completed, tested and
    documented, and as stated above, is classified as bpo-2636-01 and
    related patches. Secondly, with caveats listed below, Named Match Group
    Attributes on a match object (item 2) is also more or less complete at
    bpo-2636-02 -- it only lacks documentation.

    Now, I want to also update my list of items. We left off at 11: Other
    Perl-specific modifications. Since that time, I have spawned a number
    of other branches, the first of which (bpo-2636-12) I am happy to
    announce is also complete!

    12. Implement the changes to the documentation of re as per Jim J.
      Jewett's suggestion from 2008-04-24 14:09. Again, this has been done.

    13. Implement a grouptuples(...) method as per Mark Summerfield's
      suggestion of 2008-05-28 09:38. grouptuples would take the same
      filtering parameters as the other group* functions, and would return
      a list of 3-tuples (unless only 1 group was requested). It should
      default to all match groups (1..n, not group 0, the matching string).

    14. As per PEP-3131 and the move to Python 3.0, python will begin to
      allow full UNICODE-compliant identifier names. Correspondingly, it
      would be the responsibility of this item to allow UNICODE names for
      match groups. This would allow retrieval of UNICODE names via the
      group* functions or, when combined with item 3, the getitem handler
      (m[u'...']) (03+14), and the attribute name itself (e.g. getattr(m,
      u'...')) when combined with item 2 (02+14).

    15. Change the Pattern_Type, Match_Type and Scanner_Type (experimental)
      to become richer Python Types. Specifically, add __doc__ strings to
      each of these types' methods and members.

    16. Implement various FIXMEs.

    16-1) Implement the FIXME such that if m is a MatchObject, del m.string
    will disassociate the original matched string from the match object;
    string would be the only member that would allow modification or
    deletion, and you will not be able to modify the m.string value, only
    delete it.

    -----

    Finally, I want to make a couple of notes about item 2:

    Firstly, as noted in Item 14, I wish to add support for UNICODE match
    group names, and the current version of the C-code would not allow that;
    it would only make sense to add UNICODE support if 14 is implemented, so
    adding support for UNICODE match object attributes would depend on both
    items 2 and 14. Thus, that would be implemented in bpo-2636-02+14.

    Secondly, there is a FIXME which I discussed in item 16; I gave that
    problem its own item and branch. Also, as stated in item 15, I would
    like to add more robust help code to the Match object and bind __doc__
    strings to the fixed attributes. Although this would not directly
    affect the item 2 implementation, it would probably involve moving some
    code around in its vicinity.

    Finally, I would like suggestions on how to handle name collisions when
    match group names are provided as attributes. For instance, an
    expression like '(?P<pos>.*)' would match more or less any string and
    assign it to the name "pos". But "pos" is already an attribute of the
    Match object, and therefore pos cannot be exposed as a named match
    group attribute, since match.pos will return the usual meaning of pos
    for a match object, not the value of the capture group named "pos".

    I have 3 proposals as to how to handle this:

    a) Simply disallow the exposure of match group name attributes if the
    names collide with an existing member of the basic Match Object
    interface.

    b) Expose the reserved names through a special prefix notation, and for
    forward compatibility, expose all names via this prefix notation. In
    other words, if the prefix was 'k', match.kpos could be used to access
    pos; if it was '_', match._pos would be used. If Item 3 is implemented,
    it may be sufficient to allow access via match['pos'] as the canonical
    way of handling match group names using reserved words.

    c) Don't expose the names directly; only expose them through a prefixed
    name, e.g. match._pos or match.kpos.

    Personally, I like a), because if item 3 is implemented, it makes a
    fairly useful shorthand for retrieving such groups when a reserved name
    is used. Also, we could put a deprecation warning in to inform users
    that match group names which collide with Match object attributes will
    eventually be disallowed. However, I don't support restricting the
    match group names any more than they already are (they must be a valid
    python identifier only), so again I would go with a) and nothing more,
    and that's what's implemented in bpo-2636-02.patch.
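
    To make proposal a) concrete, here is an illustrative pure-Python
    wrapper (AttrMatch is a hypothetical name; the real change would live
    in the C match type). Group names become attributes unless they
    collide with an existing Match attribute, in which case the real
    attribute wins:

        import re

        class AttrMatch:
            def __init__(self, match):
                self._match = match
            def __getattr__(self, name):
                groups = self._match.re.groupindex
                if name in groups and not hasattr(type(self._match), name):
                    return self._match.group(name)
                return getattr(self._match, name)

        m = AttrMatch(re.match(r'(?P<foo>\w+) (?P<pos>\w+)', 'hello world'))
        print(m.foo)   # 'hello' -- no collision, exposed as an attribute
        print(m.pos)   # 0 -- 'pos' collides, so the normal Match.pos wins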

    -----

    Now, rather than posting umpteen patch files, I am posting one
    bz2-compressed tar of ALL patch files for all threads, where each file
    is of the form:

    bpo-2636(-\d\d|+\d\d)*(-only)?.patch

    For instance,

    bpo-2636-01.patch is the p1 patch that is a difference between the
    current Python trunk and all that would need to be implemented to
    support Atomic Grouping / Possessive Qualifiers. Combined branches are
    joined with a PLUS ('+') and sub-branches concatenated with a DASH
    ('-'). Thus, "bpo-2636-01+09-01-01+10.patch" is a patch which combines
    the work from item 1: Atomic Grouping / Possessive Qualifiers, the
    sub-sub-branch of item 9: Engine Cleanups, and item 10: Shared
    Constants. Item 9 has both a child and a grandchild. The child
    (09-01) is my proposed engine redesign with the single loop; the
    grandchild (09-01-01) is the redesign with the triple loop. Finally,
    the optional "-only" flag means that the diff is against the core SRE
    modifications branch and thus does not include the core branch changes.

    As noted above, Items 01, 02, 05, 07 and 12 should be considered more or
    less complete and ready for merging assuming I don't identify in my
    implementation of the other items that I neglected something in these.
    The rest, including the combined items, are all provided in the given
    tarball.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Jun 17, 2008

    Sorry, as I stated in the last post, I generated the patches and then
    realized that I was missing the documentation for item 2, so I have
    updated the bpo-2636-02.patch file and am attaching that separately
    until the next release of the patch tarball. bpo-2636-02-only.patch
    should be ignored, and I will only regenerate it with the correct
    documentation in the next tarball release, so I can move on to either
    Character Classes or Relative Back-references. I want to pause item 3
    for the moment because 2, 3, 13, 14, 15 and 16 all seem closely
    related, and I need a break to allow my mind to wrap around the big
    picture before I try to tackle each one.

    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented Jun 18, 2008

    [snip]

    > 13. Implement a grouptuples(...) method as per Mark Summerfield's
    >   suggestion of 2008-05-28 09:38. grouptuples would take the same
    >   filtering parameters as the other group* functions, and would
    >   return a list of 3-tuples (unless only 1 group was requested). It
    >   should default to all match groups (1..n, not group 0, the matching
    >   string).

    :-)

    [snip]

    > Finally, I would like suggestions on how to handle name collisions
    > when match group names are provided as attributes. For instance, an
    > expression like '(?P<pos>.*)' would match more or less any string and
    > assign it to the name "pos". But "pos" is already an attribute of the
    > Match object, and therefore pos cannot be exposed as a named match
    > group attribute, since match.pos will return the usual meaning of pos
    > for a match object, not the value of the capture group named "pos".
    >
    > I have 3 proposals as to how to handle this:
    >
    > a) Simply disallow the exposure of match group name attributes if the
    > names collide with an existing member of the basic Match Object
    > interface.

    I don't like the prefix ideas, and now that you've spelt it out, I
    don't like that sometimes m.foo will work and sometimes it won't. So I
    prefer m['foo'] to be the canonical way, because that guarantees your
    code is always consistent.

    ------------------------------------------------------------
    BTW I wanted to do a simple regex to match a string that might or might
    not be quoted, and that could contain quotes (but not those used to
    delimit it). My first attempt was illegal:

    (?P<quote>['"])?([^(?=quote)])+(?(quote)(?=quote))
    

    It isn't hard to work round but it did highlight the fact that you can't
    use captures inside character classes. I don't know if Perl allows this;
    I guess if it doesn't then Python shouldn't either since GvR wants the
    engine to be Perl compatible.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Jun 19, 2008

    Thanks for weighing in, Mark! Actually, your point is valid and quite
    fair, though I would not assume that item 3 will be included just
    because item 2 isn't. I will do my best to develop both, but I do not
    make the final decision as to what python includes. That having been
    said, 3 seems very likely at this point, so we may be okay, but let me
    give this one more try, as I think I have a better solution to make
    item 2 more palatable. Let's say we have 5 choices here:

    a) Simply disallow the exposure of match group name attributes if the
    names collide with an existing member of the basic Match Object
    interface.

    b) Expose the reserved names through a special prefix notation, and
    for forward compatibility, expose all names via this prefix notation.
    In other words, if the prefix was 'k', match.kpos could be used to
    access pos; if it was '_', match._pos would be used. If Item 3 is
    implemented, it may be sufficient to allow access via match['pos'] as
    the canonical way of handling match group names using reserved words.

    c) Don't expose the names directly; only expose them through a
    prefixed name, e.g. match._pos or match.kpos.

    d) (As Mark suggested) we drop Item 2 completely. I have not invested
    much work in this so that would not bother me, but IMHO I actually
    prefer Item 2 to 3 so I would really like to see it preserved in some
    form.

    e) Add an option, re.MATCH_ATTRIBUTES, that is used as a Match-creation
    flag. When the re.MATCH_ATTRIBUTES or re.A flag is included in the
    compile, or (?a) is included in the pattern, it will do 2 things.
    First, it will raise an exception if either a) there exists an unnamed
    capture group or b) a capture group name is a reserved keyword. In
    addition to this, I would put in a hook to support a from __future__,
    so that any post-2.6 changes to the match object type can be smoothly
    integrated a version early, letting programmers adapt before any
    future changes arrive. Secondly, I would *conditionally* allow
    arbitrary capture group names via the __getattr__ handler IFF that flag
    was present; otherwise you could not access capture groups by name via
    match.foo.

    I really like the idea of e) so I'm taking Item 2 out of the "ready for
    merge" category and going to put it in the queue for the modifications
    spelled out above. I'm not too worried about our flags differing from
    Perl too much as we did base our first 4 on Perl (x, s, m, i), but
    subsequently added Unicode and Locale, which Perl does not have, and
    never implemented o (since our caching semantic already pretty much
    gives every expression that), e (which is specific to Perl syntax
    AFAICT) and g (which can be simulated via re.split). So I propose we
    take A and implement it as I've specified and that is the current goal
    of Item 2. Once this is done and working, we can decide whether it
    should be included in the python trunk.

    How does that sound to you, Mark and anyone else who wishes to weigh in?

    @mark-summerfield
    Mannequin

    mark-summerfield mannequin commented Jun 19, 2008

    [snip]

    It seems to me that both using a special prefix or adding an option are
    adding a lot of baggage and will increase the learning curve.

    The nice thing about (3) (even without slicing) is that it seems a v.
    natural extension. But (2) seems magical (i.e., Perl-like rather than
    Pythonic) which I really don't like.

    BTW I just noticed this:

        >>> "{0}".format(rx)
        '<_sre.SRE_Pattern object at 0x9ded020>'
        >>> "{0!r}".format(rx)
        '<_sre.SRE_Pattern object at 0x9ded020>'
        >>> "{0!s}".format(rx)
        '<_sre.SRE_Pattern object at 0x9ded020>'
        >>> "{0!a}".format(rx)

    That's fair enough, but maybe for !s the output should be rx.pattern?

    @pitrou
    Member

    pitrou commented Sep 13, 2008

    See also bpo-3825.

    @timehorse
    Mannequin Author

    timehorse mannequin commented Sep 16, 2008

    Update 16 Sep 2008:

    Based on the work for issue bpo-3825, I would like to simply update the
    item list as follows:

    1. Atomic Grouping / Possessive Qualifiers (see also issue bpo-433030)
      [Complete]

    2. Match group names as attributes (e.g. match.foo) [Complete save
      issues outlined above]

    3. Match group indexing (e.g. match['foo'], match[3])

    4. Perl-style back-references (e.g. compile(r'(a)\g{-1}')), and
      possibly adding the r'\k' escape sequence for keywords.

    5. Parenthesis-aware python comment (e.g. r'(?P#...)') [Complete]

    6. Expose support for Template expressions (expressions without repeat
      operators), adding test cases and documentation for existing code.

    7. Larger compiled Regexp cache (256 vs. 100) and reduced thrashing
      risk. [Complete]

    8. Character classes (e.g. r'[:alphanum:]')

    9. Proposed engine redesigns and cleanups (the core item only contains
      cleanups and comments to the current design but does not modify the
      design).

    9-1) Single-loop engine redesign that runs 8% slower than current.
    [Complete]

    9-1-1) 3-loop engine redesign that runs 10% slower than current.
    [Complete]

    9-2) Matthew Barnett's engine redesign as per issue bpo-3825

    10. Have all C/Python shared constants stored in 1 place
      (sre_constants.py) and generated by that into C constants
      (sre_constants.h). [Complete AFAICT]

    11. Scan Perl 5.10.0 for other potential additions that could be
      implemented for Python.

    12. Documentation suggestions by Jim J. Jewett [Complete]

    13. Add a grouptuples method to the Match object (i.e.
      match.grouptuples() returns (<index>, <name or None>, <value>)
      tuples) suitable for iteration.

    14. UNICODE match group names, as per PEP-3131.

    15. Add __doc__ strings and other Python niceties to the Pattern_Type,
      Match_Type and Scanner_Type (experimental).

    16. Implement any remaining TODOs and FIXMEs in the Regexp modules.

    16-1) Allow for the disassociation of a source string from a
    Match_Type, assuming this will still leave the object in a
    "reasonable" state.

    17. Variable-length [positive and negative] look-behind assertions, as
      described and implemented in issue bpo-3825.

    ---

    Now, we have a combination of Items 1, 9-2 and 17 available in issue
    bpo-3825, so for now, refer to that issue for the 01+09-02+17 combined
    solution. Eventually, I hope to merge the work between this and that issue.

    I sadly admit I have made no progress on this since June, because I am
    managing 30-some lines of development, some of which have complex
    diamond branching, e.g.:

    01 is the child of bpo-2636
    09 is the child of bpo-2636
    10 is the child of bpo-2636
    09-01 is the child of 09
    09-01-01 is the child of 09-01
    01+09 is the child of 01 and 09
    01+10 is the child of 01 and 10
    09+10 is the child of 09 and 10
    01+09-01 is the child of 01 and 09-01
    01+09-01-01 is the child of 01 and 09-01-01
    09-01+10 is the child of 09-01 and 10
    09-01-01+10 is the child of 09-01-01 and 10

    Which all seems rather simple until you wrap your head around:

    01+09+10 is the child of 01, 09, 10, 01+09, 01+10 AND 09+10!

    Keep in mind that the reason for all this complex numbering is that
    many issues cannot be implemented in a vacuum: if you want Atomic
    Grouping, that's 1 implementation; if you want Shared Constants,
    that's a different implementation; but if you want BOTH Atomic
    Grouping and Shared Constants, that is a wholly other implementation,
    because each implementation affects the other. Thus, I end up with a
    plethora of branches and a nightmare when it comes to merging, which
    is why I've been so slow in making progress. Bazaar seems to be very
    confused when it comes to a merge in 6 parts between, for example, 01,
    09, 10, 01+09, 01+10 and 09+10, as above. It gets confused when it
    sees the same changes applied in a previous merge applied again:
    instead of simply realizing that the change in one branch since the
    last merge is EXACTLY the same change as in the other since the last
    merge, so effectively there is nothing to do, Bazaar treats code that
    did NOT change since the last merge as if it had changed, tries to
    roll back the 01+09+10-specific changes rather than doing nothing, and
    generates a conflict. Oh, that I could only have a version control
    system that understood the kind of complex branching that I require!

    Anyway, that's the state of things; this is me, signing out!

    @timehorse timehorse mannequin changed the title Regexp 2.6 (modifications to current re 2.2.2) Regexp 2.7 (modifications to current re 2.2.2) Sep 16, 2008
    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    Comparing item 2 and item 3, I think that item 3 is the Pythonic choice
    and item 2 is a bad idea.

    Item 4: back-references in the pattern are like \1 and (?P=name), not
    \g<1> or \g<name>, and in the replacement string are like \g<1> and
    \g<name>, not \1 (or (?P=name)). I'd like to suggest that
    back-references in the pattern also include \g<1> and \g<name> and
    \g<-1> for relative back-references. Interestingly, Perl names groups
    with (?<name>...) whereas Python uses (?P<name>...). A permissible
    alternative?

    @timehorse
    Mannequin Author

    timehorse mannequin commented Sep 24, 2008

    Thanks for weighing in Matthew!

    Yeah, I do get some flak for item 2, because originally item 3 wasn't
    supposed to cover named groups, but on investigation it made sense
    that it should. I still prefer 2 overall, but the nice thing about
    them being separate items is that we can accept 2 or 3 or both or
    neither, and for the most part development for the first phase of 2 is
    complete, though there is still IMHO the issue of UNICODE name groups
    (vis-à-vis item 14) and the name collision problem, which I propose
    fixing with an Attribute / re.A flag. So, I think it may end up that
    we could support both: 3 by default and 2 via a flag; or maybe 3 and 2
    both, but with 2 as is, with name collisions hidden (i.e. if you have
    r'(?P<string>...)' as your capture group, typing m.string will still
    give you the original comparison string, as per the current python
    documentation), but with collision-checking via the Attribute flag, so
    that r'(?A)(?P<string>...)' would not compile because string is a
    reserved word.

    Your interpretation of 4 matches mine, though, and I would definitely
    suggest using Perl's \g<-n> notation for relative back-references;
    further, I was thinking, if not as part of 4 then as part of the
    catch-all item 11, of adding support for Perl's (?<name>...) as a
    synonym for Python's (?P<name>...) and Perl's \k<name> for Python's
    (?P=name) notation. The evolution of Perl's named group is actually
    interesting. Years ago, Guido had a conversation with Larry Wall
    about using the (?P...) capture sequence for python-specific Regular
    Expression blocks. So Python went ahead and implemented named capture
    groups. Years later, the Perl folks thought named capture groups were
    a neat idea and adopted them in the (?<...>...) form, because Python
    had restricted the (?P...) notation to itself, so they couldn't use
    ours even if they wanted to. Now, though, with Perl adopting
    (?<...>...), I think it inevitable that Java and even C++ will see
    this as the de facto standard. So I 100% agree: we should consider
    supporting (?<name>...) in the parser.

    Oh, and as I suggested in bpo-3825, I have these new item proposals:

    Item 18: Add a re.REVERSE, re.R (?r) flag for reversing the direction of
    the String Evaluation against a given Regular Expression pattern. See
    bpo-516762, as implemented in bpo-3825.

    Item 19: Make various in-line flags positionally dependent; for
    example, (?i) makes the pattern before it case-sensitive but after it
    case-insensitive. See bpo-433024, as implemented in bpo-3825.

    Item 20: Allow the negation of in-line flags to cancel their effect in
    conditionally flagged expressions, for example (?-i). See bpo-433027,
    as implemented in bpo-3825.

    Item 21: Allow for scoped flagged expressions, i.e. (?i:...), where
    the flag(s) is applied only to the expression within the parentheses.
    See bpo-433028, as implemented in bpo-3825.

    Item 22: Zero-width regular expression split: when splitting via a
    regular expression that matches zero-length, this should return a
    result equivalent to splitting at each character boundary, with a null
    string at the beginning and end representing the space before the
    first and after the last character. See bpo-3262. (A short example of
    the intended behavior follows item 23 below.)

    Item 23: Character class ranges over case-insensitive matches, i.e.
    does "(?i)[9-A]" contain '_', whose ord is greater than the ord of 'A'
    and less than the ord of 'a'? See bpo-5311.
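
    For item 22, the intended behavior is easy to show with an example
    (this is, incidentally, the semantics that re.split itself adopted
    much later, in Python 3.7):

        import re

        print(re.split(r'', 'abc'))   # ['', 'a', 'b', 'c', ''] on 3.7+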

    And I shall create a bazaar repository for your current development line
    with the unfortunately unwieldy name of
    lp:~timehorse/python/issue2636-01+09-02+17+18+19+20+21 as that would,
    AFAICT, cover all the items you've fixed in your latest patch.

    Anyway, great work Matthew and I look forward to working with you on
    Regexp 2.7 as you do great work!

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 24, 2008

    Regarding item 22: there's also bpo-1647489 ("zero-length match confuses
    re.finditer()").

    This had me stumped for a while, but I might have a solution. I'll see
    whether it'll fix item 22 too.

    I wasn't planning on doing any more major changes on my branch, just
    tweaking and commenting and seeing whether I've missed any tricks in the
    speed stakes. Half the task is finding out what's achievable, and how!

    @vbr
    Mannequin

    vbr mannequin commented Sep 3, 2011

    Not that it matters in any way, but if the regex semantics has to be
    distinguished via "non-standard" custom flags, I would prefer even
    less wordy flags, possibly such that the short forms for the
    in-pattern flag setting would be one-letter (like all the other
    flags), and preferably based on underlying plain English words, to get
    some mnemonics (which I don't see in the numbered versions, which
    require one to keep track of rather internal library versioning).
    Unfortunately, it might be difficult to find suitable names, given the
    objections expressed against the already discussed ones. (For what it
    is worth, I thought e.g. of [t]raditional and [e]nhanced, but these
    also suffer from some of the mentioned disadvantages...)
    vbr

    @stevendaprano
    Member

    Matthew Barnett wrote:

    > So, VERSION0 and VERSION1, with "(?V0)" and "(?V1)" in the pattern?

    Seems reasonable to me.

    +1

    @mattchaput
    Mannequin

    mattchaput mannequin commented Sep 15, 2011

    Not sure if this is better as a separate feature request or a comment here, but... the new version of .NET includes an option to specify a time limit on evaluation of regexes (not sure if this is a feature in other regex libs). This would be useful especially when you're executing regexes configured by the user and you don't know if/when they might go exponential. Something like this maybe:

    # Raises an re.Timeout if not complete within 60 seconds
    match = myregex.match(mystring, maxseconds=60.0)
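
    (There is no such parameter in re today; maxseconds above is the
    suggested, hypothetical API.) As a rough stopgap, the wait can be
    bounded with a worker thread, with the big caveat that the thread
    cannot be interrupted and keeps burning CPU after the timeout:

        import concurrent.futures
        import re

        def match_with_timeout(pattern, string, seconds):
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(re.match, pattern, string)
            try:
                return future.result(timeout=seconds)  # raises TimeoutError
            finally:
                executor.shutdown(wait=False)          # don't block on the worker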

    @ncoghlan
    Contributor

    As part of the PEP-408 discussions, Guido approved the addition of 'regex' in 3.3 (using that name, rather than as a drop-in replacement for re) [1,2]

    That should greatly ease the backwards compatibility concerns, even if it isn't as transparent an upgrade path.

    [1] http://mail.python.org/pipermail/python-dev/2012-January/115961.html
    [2] http://mail.python.org/pipermail/python-dev/2012-January/115962.html

    @alex
    Member

    alex commented Jan 29, 2012

    So, to my reading of the compatibility PEP, this cannot be added
    wholesale unless there is a pure Python version as well. However, if
    it replaced re (read: patched) it would be valid.



    @birkenfeld
    Member

    I created a new sandbox branch to integrate regex into CPython, see "remote repo" field.

    I mainly had to adapt the test suite to use unittest.

    @ncoghlan
    Contributor

    Alex has a valid point in relation to PEP-399, since, like lzma, regex will be coming in under the "special permission" clause that allows the addition of C extension modules without pure Python equivalents. Unlike lzma, though, the new regex engine isn't a relatively simple wrapper around an existing library - supporting the new API features on other implementations is going to mean a substantial amount of work.

    In practice, I expect that a pure Python implementation of a regular expression engine would only be fast enough to be usable on PyPy. So while we'd almost certainly accept a patch that added a parallel Python implementation, I doubt it would actually help Jython or IronPython all that much - they're probably going to need versions written in Java and C# to be effective (as I believe they already have for the re module).

    @ssbr
    Mannequin

    ssbr mannequin commented Jan 29, 2012

    > In practice, I expect that a pure Python implementation of a regular expression engine would only be fast enough to be usable on PyPy.

    Not sure why this is necessarily true. I'd expect a pure-Python implementation to be maybe 200 times as slow. Many queries (those on relatively short strings that backtrack little) finish within microseconds. On this scale, a couple of orders of magnitude is not noticeable by humans (unless it adds up), and even where it gets noticeable, it's better than having nothing at all or a non-working program (up to a point).

    python -m timeit -n 1000000 -s "import re; x = re.compile(r'.*<\s*help\s*>([^\<])<\s/\s*help.*>'); data = ' '*1000 + '< help >' + 'abc'*100 + '</help>'" "x.match(data)"
    1000000 loops, best of 3: 3.27 usec per loop

    @birkenfeld
    Member

    Well, REs are very often used to process large chunks of text by repeated application. So if the whole operation takes 0.1 or 20 seconds you're going to notice :)

    @ssbr
    Mannequin

    ssbr mannequin commented Jan 29, 2012

    It'd be nice if we had some sort of representative benchmark for real-world uses of Python regexps. The JS guys have all pitched in to create such a thing for uses of regexps on the web. I don't know of any such thing for Python.

    I agree that a Python implementation wouldn't be useful for some cases. On the other hand, I believe it would be fine (or at least tolerable) for some others. I don't know the ratio between the two.

    @pitrou
    Member

    pitrou commented Jan 29, 2012

    It'd be nice if we had some sort of representative benchmark for
    real-world uses of Python regexps. The JS guys have all pitched in to
    create such a thing for uses of regexps on the web. I don't know of
    any such thing for Python.

    See http://hg.python.org/benchmarks/; there are regex benchmarks there.

    I agree that a Python implementation wouldn't be useful for some
    cases. On the other hand, I believe it would be fine (or at least
    tolerable) for some others. I don't know the ratio between the two.

    I think the ratio would be something like 2% tolerable :)

    As I said to Ezio and Georg, I think adding the regex module needs a
    PEP, even if it ends up non-controversial.

    @sandrotosi
    Copy link
    Contributor

    I've just uploaded regex into Debian: this will hopefully give some more eyes looking at the module and reporting feedback.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Nov 5, 2012

    I've been working through the "known crashers" list in the stdlib. The recursive import one was fixed with the migration to importlib in 3.3, and the compiler one will be fixed in 3.3.1 (with an enforced nesting limit). One of the remaining entries is actually a pathological failure in the re module rather than a true crasher (i.e. it doesn't segfault, and in 2.7 and 3.3 you can interrupt it with Ctrl-C):
    http://hg.python.org/cpython/file/default/Lib/test/crashers/infinite_loop_re.py

    I mention it here as another problem that adopting the regex module could resolve (as regex promptly returns None for this case).
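
    For illustration only (the crasher file uses a different pattern), the classic shape of such pathological backtracking in re is a nested quantifier:

    # Illustrative only; not the exact pattern from the crasher file.
    # The nested quantifier makes re try exponentially many ways to split
    # the 'a's between the inner and outer repetition before failing.
    import re

    pattern = re.compile(r'(a+)+b')
    # Each extra 'a' roughly doubles the running time; at ~30 characters
    # this effectively never finishes (though Ctrl-C can interrupt it).
    pattern.match('a' * 30 + 'c')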

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jun 26, 2014

    Will we actually get regex into the standard library on this pass?

    @ncoghlan
    Copy link
    Contributor

    Even with in-principle approval from Guido, this idea still depends on
    volunteers to actually write up a concrete proposal as a PEP (which
    shouldn't be too controversial, given Guido already OK'ed the idea) and
    then do the integration work to incorporate the code, tests and docs into
    CPython (not technically *hard*, but not trivial either). "pip install
    regex" starts looking fairly attractive at that point :)

    @serhiy-storchaka
    Copy link
    Member

    Here is my (slowly implemented) plan:

    1. Recommend regex as an advanced replacement for re (bpo-22594).

    2. Fix all obvious bugs in the re module if this doesn't break backward compatibility (bpo-12728, bpo-14260, and many already closed issues).

    3. Deprecate and then forbid behavior which looks like a bug, doesn't match regex in V1 mode, and can't be fixed without breaking backward compatibility (bpo-22407, bpo-22493, bpo-22818).

    4. Unify minor details with regex (bpo-22364, bpo-22578).

    5. Fork regex and drop all advanced nonstandard features (such as fuzzy matching). Too many features make learning and using the module harder; they should stay in an advanced module (regex). Two such features are sketched after this list.

    6. Write benchmarks which cover all corner cases and compare re with regex case by case, then optimize the slower module. Currently re is faster than regex for all the simple examples I tried (maybe as a result of bpo-18685), but in the overall benchmark results (msg109447) re is slower.

    7. Maybe implement some standard features which were rejected in favor of this issue (bpo-433028, bpo-433030). re should conform at least to Level 1 of UTS #18 (http://www.unicode.org/reports/tr18/#Basic_Unicode_Support).

    In the best case, in 3.7 or 3.8 we could replace re with a simplified regex. Or by then re will be free of bugs and warts.
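
    To make step 5 concrete, here is a hedged sketch of two of the regex-only features mentioned there (requires the third-party regex module; behavior as per its documentation):

    # Sketch of two features present in regex but absent from re.
    import regex

    # Fuzzy matching: allow at most one error (insert/delete/substitute).
    print(regex.fullmatch(r'(?:cats){e<=1}', 'cat') is not None)  # True

    # V1 mode: set operations inside character classes, here difference.
    print(regex.findall(r'(?V1)[[a-z]--[aeiou]]+', 'python'))  # ['pyth', 'n']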

    @pitrou
    Copy link
    Member

    pitrou commented Nov 8, 2014

    Here is my (slowly implemented) plan:

    Exciting. Perhaps you should post your plan on python-dev.

    In any case, huge thanks for your work on the re module.

    @serhiy-storchaka
    Copy link
    Member

    Exciting. Perhaps you should post your plan on python-dev.

    Thank you, Antoine. I think all interested core developers are already aware
    of this issue. A disadvantage of posting on python-dev is that it would
    require manually copying the links and maybe the titles of all the mentioned
    issues, while here they are available automatically. Oh, I'm lazy.

    @ezio-melotti
    Copy link
    Member

    So you are suggesting to fix bugs in re to make it closer to regex, and then replace re with a forked subset of regex that doesn't include advanced features, or just to fix/improve re until it matches the behavior of regex?
    If you are suggesting the former, I would also suggest checking the coverage and bringing it as close as possible to 100%.

    @serhiy-storchaka
    Copy link
    Member

    So you are suggesting to fix bugs in re to make it closer to regex, and then
    replace re with a forked subset of regex that doesn't include advanced
    features, or just to fix/improve re until it matches the behavior of regex?

    Depends on what will be easier. Maybe some bugs are so hard to fix that
    replacing re with regex is the only solution. But if a fixed re turns out
    simpler and faster than a lightened regex and contains all the necessary
    features, there will be no need for the replacement. Currently the code of
    regex looks more high-level and better structured, but the code of re looks
    simpler and is much smaller. In any case, the closer re and regex are, the
    easier the migration will be.

    @ezio-melotti
    Copy link
    Member

    Ok, regardless of what will happen, increasing test coverage is a worthy goal. We might start by looking at the regex test suite to see if we can import some tests from there.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Nov 9, 2014

    Thanks for pushing this one forward Serhiy! Your approach sounds like a
    fine plan to me.

    @timehorse
    Copy link
    Mannequin Author

    timehorse mannequin commented Nov 9, 2014

    If I recall correctly, I started this thread with a plan to update re itself with implementations of the various features listed in the top post. If you look at the list of files I uploaded, there are some complete patches for re that add various features, like atomic grouping. If we wish to bring re up to the regex standard, we could therefore start with those features.

    @Mateon1
    Copy link
    Mannequin

    Mateon1 mannequin commented Nov 13, 2014

    Well, I found a bug in this module on Python 2.7(.5) on Windows 7 64-bit: when you try to compile a regex with the flags V1|DEBUG, the module crashes as if it wanted to call a builtin named "ascii".

    The bug has happened to me several times; this is the regexp that triggered it most recently: http://paste.ubuntu.com/8993680/

    I hope it's fixed; I really love the module and have found it very useful to have PCRE-style regexes in Python.
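
    For reference, a hedged reconstruction of the trigger, not a verified reproduction (the actual pattern is behind the paste link above):

    # Reconstruction of the report: on the affected regex releases under
    # Python 2.7, this flag combination reportedly raised a NameError for
    # "ascii", a builtin that exists only on Python 3.
    import regex

    regex.compile(r'placeholder', regex.V1 | regex.DEBUG)  # pattern is arbitrary here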

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Nov 13, 2014

    @mateon1: "I hope it's fixed"? Did you report it?

    @Mateon1
    Copy link
    Mannequin

    Mateon1 mannequin commented Nov 14, 2014

    Well, I am reporting it here; is this not the correct place? Sorry if it isn't.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Nov 14, 2014

    The page on PyPI says where the project's homepage is located:

    Home Page: https://code.google.com/p/mrab-regex-hg/

    The bug was fixed in the last release.

    @vstinner
    Copy link
    Member

    It's now a third party project: https://pypi.org/project/regex/

    If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first.

    I'm closing the issue as REJECTED.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022