classification
Title: Adding a new regex module (compatible with re)
Type: enhancement Stage: patch review
Components: Library (Lib), Regular Expressions Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Devin Jeanpierre, akitada, akoumjian, alex, amaury.forgeotdarc, belopolsky, davide.rizzo, eric.snow, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jacques, jaylogan, jhalcrow, jimjjewett, loewis, mark, mattchaput, moreati, mrabarnett, ncoghlan, nneonneo, pitrou, r.david.murray, ronnix, rsc, sandro.tosi, sjmachin, steven.daprano, stiv, timehorse, tshepang, vbr, zdwiel
Priority: normal Keywords: patch

Created on 2008-04-15 11:57 by timehorse, last changed 2012-11-27 09:02 by mark.dickinson.

Files
File name Uploaded Description Edit
regex_test-20100316 moreati, 2010-03-16 15:56 Python 2.6.5 re test run against regex-20100305
issue2636-20101230.zip mrabarnett, 2010-12-30 02:25
remove_guards.diff jacques, 2010-12-31 09:23
Repositories containing patches
http://hg.python.org/sandbox/regex-integration
Messages (319)
msg65513 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-04-15 11:57
I am working on adding features to the current Regexp implementation,
which is now set to 2.2.2.  These features are to bring the Regexp code
closer in line with Perl 5.10 as well as add a few python-specific
niceties and potential speed-ups and clean-ups.

I will be posting regular patch updates to this thread when major
milestones have been reached, with a description of the feature(s) added.
Currently, the list of proposed changes is (in no particular order):

1) Fix issue 433030 (http://bugs.python.org/issue433030) by adding
support for Atomic Grouping and Possessive Quantifiers

2) Make named matches direct attributes of the match object; i.e.
instead of m.group('foo'), one will be able to write simply m.foo.

3) (maybe) make Match objects subscriptable, such that m[n] is
equivalent to m.group(n) and allow slicing.

4) Implement Perl-style back-references including relative back-references.

5) Add a well-formed, python-specific comment modifier, e.g. (?P#...);
the difference between (?P#...) and Perl/Python's (?#...) is that the
former will allow nested parentheses as well as parenthetical escaping,
so that a pattern such as '(?P# Evaluate (the following) expression,
3\) using some other technique)' is handled correctly: (?P#...) will
interpret the entire expression as a comment, whereas with (?#...)
everything following ' expression...' would be considered part of the
match.  (?P#...) will necessarily be slower than (?#...) and so should
only be used when a richer commenting style is required but verbose
mode is not desired.

6) Add official support for fast, non-repeating capture groups with the
Template option.  Template is unofficially supported and disables all
repeat operators (*, + and ?).  This would mainly consist of documenting
its behavior.

7) Modify the re compiled expression cache to better handle the
thrashing condition.  Currently, when regular expressions are compiled,
the result is cached so that if the same expression is compiled again,
it is retrieved from the cache and no extra work has to be done.  This
cache supports up to 100 entries.  Once the 100th entry is reached, the
cache is cleared and a new compile must occur.  The danger, albeit
rare, is that one may compile the 100th expression only to find the
cache cleared, so the same work must be done all over again even
though it may have been done just 3 expressions ago.  By modifying
this logic slightly, it
is possible to establish an arbitrary counter that gives a time stamp to
each compiled entry and instead of clearing the entire cache when it
reaches capacity, only eliminate the oldest half of the cache, keeping
the half that is more recent.  This should limit the possibility of
thrashing to cases where a very large number of Regular Expressions are
continually recompiled.  In addition to this, I will update the limit to
256 entries, meaning that the 128 most recent are kept.

8) Emacs/Perl style character classes, e.g. [:alnum:].  For instance,
[:alnum:], unlike \w, would not include the '_' in the character class.

9) C-Engine speed-ups.  I am commenting and cleaning up the _sre.c
Regexp engine to make it flow more linearly, rather than with all the
current gotos, and replacing the switch-case statements with lookup
tables, which tests have shown to be faster.  This will also include adding many
more comments to the C code in order to make it easier for future
developers to follow.  These changes are subject to testing and some
modifications may not be included in the final release if they are shown
to be slower than the existing code.  Also, a number of Macros are being
eliminated where appropriate.

10) Export any values shared between the Python code and the C code
that are not already exported, e.g. the default maximum repeat count
(65536); this will allow those constants to be changed in 1 central
place.

11) Various other Perl 5.10 conformance modifications, TBD.


More items may come and suggestions are welcome.

-----

Currently, I have code which implements 5) and 7), have done some work
on 10), and am almost done with 9).  When 9) is complete, I will work on
1), some
of which, such as parsing, is already done, then probably 8) and 4)
because they should not require too much work -- 4) is parser-only
AFAICT.  Then, I will attempt 2) and 3), though those will require
changes at the C-Code level.  Then I will investigate what additional
elements of 11) I can easily implement.  Finally, I will write
documentation for all of these features, including 6).

In a few days, I will provide a patch with my interim results and will
update the patches with regular updates when Milestones are reached.
msg65593 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-04-17 22:06
I am sorry to report (at least for me) that as of this moment, item
9), although not yet complete, is stable and able to pass all the
existing python regexp tests -- but not as quickly as the old engine.
Because these tests are timed, I am using the timings from the first
suite of tests to benchmark old versus new code.  Based on discussion
with Andrew Kuchling, I have decided, for the sake of simplicity, that
the "timing" of each version is the absolute minimum execution time
observed, on the theory that such a run had the most continuous CPU
cycles and thus most closely represents the true execution time.

It is this conclusion that saddens me -- not that the effort hasn't
been valuable in understanding the current engine.  Indeed, I now
understand the current engine well enough that I could proceed with
the other modifications on it as-is rather than implementing them in
the new engine.  Mind you, I will likely not bring over the copious
comments that the new engine received when I translated it to a form
without C macros and gotos, as that would require too much effort IMHO.

Anyway, all that being said, and keeping in mind that I am not 100% 
satisfied with the new engine and may still be able to wring some timing 
out of it -- not that I will spend much more time on this -- here is 
where we currently stand:

Old Engine: 6.574s
New Engine: 7.239s

This makes the old engine 665 ms faster over the entire first
test_re.py suite; put differently, the new engine is roughly 10% slower.
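For later readers, the min-of-repeats approach described here can be reproduced with the stdlib timeit module; this is a generic sketch, not the actual test_re harness used above:

```python
import timeit

# Take the minimum over several repeats: the run with the fewest
# interruptions best approximates the true execution time.
timings = timeit.repeat(
    "r.match('hello world')",
    setup="import re; r = re.compile(r'h.*d')",
    repeat=5,
    number=10000,
)
print('best of 5: %.6f s' % min(timings))
```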
msg65613 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-04-18 13:38
Here are the modification so far for item 9) in _sre.c plus some small
modifications to sre_constants.h which are only to get _sre.c to
compile; normally sre_constants.h is generated by sre_constants.py, so
this is not the final version of that file.  I had also intended to
make SRE_CHARSET and SRE_COUNT use lookup tables, as well as maybe
others, but likely nothing beyond those.  I also want to move
alloc_pos out of the self object and make it a parameter to the ALLOC
macro, and probably get rid of the op_code attribute since it is only
used in 1 place to save one subtract in a very rare case.  But I want to
resolve the 10% problem first, so would appreciate it if people could
look at the REMOVE_SRE_MATCH_MACROS section of code and compare it to
the non-REMOVE_SRE_MATCH_MACROS version of SRE_MATCH and see if you can
suggest anything to make the former (new code) faster to get me that
elusive 10%.
msg65614 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-04-18 14:23
Here is a patch to implement item 7)
msg65617 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-04-18 14:50
This simple patch adds (?P#...)-style comment support.
msg65725 - (view) Author: Jim Jewett (jimjjewett) Date: 2008-04-24 14:23
> These features are to bring the Regexp code closer in line with Perl 5.10

Why 5.1 instead of 5.8 or at least 5.6?  Is it just a scope-creep issue?

> as well as add a few python-specific

because this also adds to the scope.


> 2) Make named matches direct attributes 
> of the match object; i.e. instead of m.group('foo'), 
> one will be able to write simply m.foo.

> 3) (maybe) make Match objects subscriptable, such 
> that m[n] is equivalent to m.group(n) and allow slicing.

(2) and (3) would both be nice, but I'm not sure it makes sense to do 
*both* instead of picking one.

> 5) Add a well-formed, python-specific comment modifier, 
> e.g. (?P#...);  

[handles parens in comments without turning on verbose, but is slower]

Why?  It adds another incompatibility, so it has to be very useful or 
clear.  What exactly is the advantage over just turning on verbose?

> 9) C-Engine speed-ups. ...
> a number of Macros are being eliminated where appropriate.

Be careful on those, particular on str/unicode and different compile options.
msg65726 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-24 14:31
> > These features are to bring the Regexp code closer in line 
> > with Perl 5.10
> 
> Why 5.1 instead of 5.8 or at least 5.6?  Is it just a scope-creep issue?

5.10.0 comes after 5.8 and is the latest version (2007/12/18)! 
Yes it is confusing.
msg65727 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-04-24 16:06
Thanks Jim for your thoughts!

Amaury has already explained about Perl 5.10.0.  I suppose it's like
Macintosh version numbering, since Mac Tiger went from version 10.4.9 to
10.4.10 and 10.4.11 a few years ago.  Maybe we should call Python 2.6
Python 2.06 just in case.  But 2.6 is the known last in the 2 series so
it's not a problem for us!  :)

>> as well as add a few python-specific
>
> because this also adds to the scope.

At this point the only python-specific changes I am proposing would be
items 2, 3 (discussed below), 5 (discussed below), 6 and 7.  6 is only a
documentation change; the code is already implemented.  7 is just
better behavior.  I think it is RARE that one compiles more than 100 unique
regular expressions, but you never know as projects tend to grow over
time, and in the old code the 101st would be recompiled even if it was
just compiled 2 minutes ago.  The patch is available so I leave it to
the community to judge for themselves whether it is worth it, but as you
can see, it's not a very large change.
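The half-eviction idea in item 7 might look roughly like the following sketch (hypothetical code, not the actual patch; the names are made up, and it relies on dicts preserving insertion order, which CPython only guarantees in much later versions):

```python
_MAXCACHE = 256

def cached_compile(cache, pattern, compile_fn):
    # Cache hit: no work to do.
    if pattern in cache:
        return cache[pattern]
    # Cache full: evict only the oldest half instead of clearing it all.
    if len(cache) >= _MAXCACHE:
        for old in list(cache)[:_MAXCACHE // 2]:
            del cache[old]
    compiled = cache[pattern] = compile_fn(pattern)
    return compiled
```

After compiling 300 distinct patterns this way, the most recent entries are still cached, so a pattern compiled "2 minutes ago" is far less likely to need recompiling than with the clear-everything strategy.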

>> 2) Make named matches direct attributes 
>> of the match object; i.e. instead of m.group('foo'), 
>> one will be able to write simply m.foo.
>
>> 3) (maybe) make Match objects subscriptable, such 
>> that m[n] is equivalent to m.group(n) and allow slicing.
>
> (2) and (3) would both be nice, but I'm not sure it makes sense to do 
> *both* instead of picking one.

Well, I think named matches are better than numbered ones, so I'd
definitely go with 2.  The problem with 2, though, is that it still
leaves the rather typographically intense m.group(n), since I cannot
write m.3.  However, since capture groups are always numbered
sequentially, it models a list very nicely.  So I think for indexing by
group number, the subscripting operator makes sense.  I was not
originally suggesting m['foo'] be supported, but I can see how that may
come out of 3.  But there is a restriction that python named groups
must be valid python identifiers, and that restriction really belongs
to 2 rather than 3, because 3 would not require it but 2 would.  So at
least I want 2, but IMHO m[1] is better than m.group(1) and not in the
least a hard or confusing way of retrieving the given group.  Mind
you, the Match object is a C-struct with python binding and I'm not
exactly sure how to add either feature to it, but I'm sure the C-API
manual will help with that.
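To make the discussion concrete, here is a pure-Python sketch of what 2 and 3 could feel like, wrapping today's match object (AttrMatch is a made-up name for illustration, not part of any patch):

```python
import re

class AttrMatch:
    """Hypothetical wrapper: m.foo (item 2) and m[n] / m['foo'] (item 3)."""

    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Unknown attributes fall through to named capture groups.
        try:
            return self._match.group(name)
        except IndexError:
            raise AttributeError(name)

    def __getitem__(self, key):
        # Subscripting by group number or group name.
        return self._match.group(key)

m = AttrMatch(re.match(r'(?P<word>\w+) (\d+)', 'spam 42'))
print(m.word)     # spam
print(m[2])       # 42
print(m['word'])  # spam
```

For what it's worth, subscripting (m[2], m['word']) did eventually land in CPython's re module, in Python 3.6.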

>> 5) Add a well-formed, python-specific comment modifier, 
>> e.g. (?P#...);  
>
> [handles parens in comments without turning on verbose, but is slower]
>
> Why?  It adds another incompatibility, so it has to be very useful or 
> clear.  What exactly is the advantage over just turning on verbose?

Well, Larry Wall and Guido agreed long ago that we, the python
community, own all expressions of the form (?P...), and although it
would be my preference to make (?#...) itself understand nested
parentheses, changing the logic behind THAT would make python
non-standard.  So as far as any conflicting design, we needn't worry.

As for speed, this all occurs in the parser and does not affect the
compiler or engine.  It occurs only after a (?P has been read and then
only as the last check before failure, so it should not be much slower
except when the expression is invalid.  The actual execution time to
find the closing brace of (?P#...) is a bit slower than that for (?#...)
but not by much.

Verbose is generally a good idea for anything more than a trivial
Regular Expression.  However, it can have overhead if not included as
the first flag: an expression is always checked for verbose
post-compilation and if it is encountered, the expression is compiled a
second time, which is somewhat wasteful.  But the reason I prefer
(?P#...) over (?#...) is that I think people would tend to assume:

r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

That expression only matches "He ls)llo", so I created the (?P#...) to
make the comment match type more intuitive:

r'He(?P# 2 (TWO) ls)llo' matches "Hello".
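A quick check against today's re module backs up the intuition problem, although the failure mode is now harsher than described above: the leftover unbalanced ')' is rejected at compile time rather than matching literally (a small demonstration):

```python
import re

# A simple (?#...) comment works fine:
assert re.match(r'He(?# a comment )llo', 'Hello')

# But (?#...) ends at the FIRST ')', so nested parentheses cut the
# comment short; the leftover ' )llo' has an unbalanced ')' and the
# pattern does not even compile:
try:
    re.compile(r'He(?# nested (parens) )llo')
except re.error as exc:
    print('re.error:', exc)
```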

>> 9) C-Engine speed-ups. ...
>> a number of Macros are being eliminated where appropriate.
>
> Be careful on those, particular on str/unicode and different
> compile options.

Will do; thanks for the advice!  I have only observed the UNICODE flag
controlling whether certain code is used (besides the ones I've added)
and have tried to stay true to that when I encounter it.  Mind you,
unless I can get my extra 10% it's unlikely I'd actually go with item 9
here, even if it is easier to read IMHO.  However, I want to run the new
engine proposal through gprof to see if I can track down some bottlenecks.

At some point, I hope to get my current changes on Launchpad if I can
get that working.  If I do, I'll give a link to how people can check out
my working code here as well.
msg65734 - (view) Author: Jim Jewett (jimjjewett) Date: 2008-04-24 18:09
Python 2.6 isn't the last, but Guido has said that there won't be a 2.10.

> Match object is a C-struct with python binding
> and I'm not exactly sure how to add either feature to it

I may be misunderstanding -- isn't this just a matter of writing the 
function and setting it in the tp_as_sequence and tp_as_mapping slots?

> Larry Wall and Guido agreed long ago that we, the python
> community, own all expressions of the form (?P...)

Cool -- that reference should probably be added to the docs.  For
someone trying to learn or translate regular expressions, it helps to
know that (?P...) is explicitly a python extension (even if Perl adopts
it later).

Definitely put the example in the doc.

    r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

Maybe even without the change, as documentation of the current situation.

Does VERBOSE really have to be the first flag, or does it just have to be on 
the whole pattern instead of an internal switch?

I'm not sure I fully understand what you said about template.  Is this a 
special undocumented switch, or just an internal optimization mode that 
should be triggered whenever the repeat operators don't happen to occur?
msg65838 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-04-26 10:08
I don't know anything about regexp implementation, but if you replace a
switch-case with a function lookup table, it isn't surprising that the
new version ends up slower. A local jump is always faster than a
function call, because of the setup overhead and stack manipulation the
latter involves.

So you might try to do the cleanup while keeping the switch-case
structure, if possible.
msg65841 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-04-26 11:51
Thank you and Merci Antoine!

That is a good point.  It is compiler-specific whether a switch-case
will be turned into a series of conditional branches or an internal
jump table with lookup.  And it is true
that most compilers, if I understand correctly, use the jump-table 
approach for any switch-case over 2 or 3 entries when the cases are 
tightly grouped and near 0.  That is probably why the original code 
worked so fast.  I'll see if I can combine the best of both 
approaches.  Thanks again!
msg66033 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-05-01 14:15
I am making my changes in a Bazaar branch hosted on Launchpad.  It took
me quite a while to get things set up more-or-less logically, but there
they are, and I'm currently re-applying my local changes up to today
into the various branches I have.  Each of the 11 issues I outlined
originally has its own branch, plus a root branch from which all the
others are derived; the root serves as a place to merge in concurrent
python 2.6 alpha development and to apply any additional re changes
that don't fall into the other categories, of which I have so far found
only 2 small ones.

Anyway, if anyone is interested in monitoring my progress, it is
available at:

https://code.launchpad.net/~timehorse/

I will still post major milestones here, but one can monitor day-to-day
progress on Launchpad.  Also on Launchpad you will find more detail on
the plans for each of the 11 modifications, for the curious.

Thanks again for all the advice!
msg67309 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-05-24 21:38
I am finally making progress again, after a month of moving my patches
from my local svn repository to bazaar hosted on launchpad.net, as
stated in my last update.  I have also more or less finished probably
the easiest item, #5, so a full patch for that is available now.
First, though, I want to update my "no matter what" patch -- that is,
the changes I want to make if any changes at all are made to the
Regexp code.
msg67447 - (view) Author: Mark Summerfield (mark) Date: 2008-05-28 13:38
AFAIK if you have a regex with named capture groups there is no direct
way to relate them to the capture group numbers.
You could do (untested; Python 3 syntax):

    d = {v: k for k, v in match.groupdict().items()}
    for i in range(1, (match.lastindex or 0) + 1):
        print(i, match.group(i), d.get(match.group(i)))

One possible solution would be a grouptuples() function that returned a
tuple of 3-tuples (index, name, captured_text) with the name being None
for unnamed groups.
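A pure-Python sketch of the proposed grouptuples(), built on the existing Pattern.groupindex mapping (a hypothetical helper, not part of any patch):

```python
import re

def grouptuples(match):
    # Invert groupindex (name -> number) to map numbers back to names.
    names = {num: name for name, num in match.re.groupindex.items()}
    return [(i, names.get(i), match.group(i))
            for i in range(1, match.re.groups + 1)]

m = re.match(r'(?P<year>\d{4})-(\d{2})', '2008-05')
print(grouptuples(m))  # [(1, 'year', '2008'), (2, None, '05')]
```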

Anyway, good luck with all your improvements, I will be especially glad
if you manage to do (2) and (8) (and maybe (3)).
msg67448 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-05-28 13:57
Mark scribbled:
> One possible solution would be a grouptuples() function that returned
> a tuple of 3-tuples (index, name, captured_text) with the name being
> None for unnamed groups.

Hmm.  Well, that's not a bad idea at all IMHO, and AFAICT it would
probably be easier to do than (2).  I would still do (2), but I will
try to add your idea to one of the existing items, or spawn another
item for it, since it is kind of a distinct feature.

My preference right now is to finish off the test cases for (7)
because it is already coded, then finish the work on (1), as that was
the original reason for modification, then move on to (2) and (3) as
they are related, and then tackle (8), which I think shouldn't be too
hard.  Interestingly, the existing engine code (sre_parse.py) has a
commented-out placeholder for character classes, but it was never
properly implemented.  And I will warn that with Unicode, I THINK all
the character classes exist as unicode functions, or can be
implemented by combining several, but I'm not 100% sure; if I run into
a problem there, some character classes may initially be left out
while I work on another item.

Anyway, thanks for the input, Mark!
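As an aside for later readers: something close to item 8 can already be approximated with today's syntax; [[:alnum:]], for example, is roughly \w minus the underscore (an approximation that ignores locale subtleties):

```python
import re

# POSIX [[:alnum:]] is roughly "word characters minus the underscore":
alnum_run = re.compile(r'[^\W_]+')
print(alnum_run.findall('foo_bar-42'))  # ['foo', 'bar', '42']
```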
msg68336 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-06-17 17:43
Well, it's time for another update on my progress...

Some good news first: Atomic Grouping is now completed, tested and 
documented, and as stated above, is classified as issue2636-01 and 
related patches.  Secondly, with caveats listed below, Named Match Group 
Attributes on a match object (item 2) is also more or less complete at 
issue2636-02 -- it only lacks documentation.

Now, I want to also update my list of items.  We left off at 11: Other 
Perl-specific modifications.  Since that time, I have spawned a number 
of other branches, the first of which (issue2636-12) I am happy to 
announce is also complete!

12) Implement the changes to the documentation of re as per Jim J.
Jewett's suggestion of 2008-04-24 14:09.  Again, this has been done.

13) Implement a grouptuples(...) method as per Mark Summerfield's
suggestion of 2008-05-28 09:38.  grouptuples would take the same filtering
parameters as the other group* functions, and would return a list of 3-
tuples (unless only 1 group was requested).  It should default to all 
match groups (1..n, not group 0, the matching string).

14) As per PEP-3131 and the move to Python 3.0, python will begin to 
allow full UNICODE-compliant identifier names.  Correspondingly, it 
would be the responsibility of this item to allow UNICODE names for 
match groups.  This would allow retrieval of UNICODE names via the 
group* functions or when combined with Item 3, the getitem handler 
(m[u'...']) (03+14) and the attribute name itself (e.g. getattr(m, 
u'...')) when combined with item 2 (02+14).

15) Change the Pattern_Type, Match_Type and Scanner_Type (experimental) 
to become richer Python Types.  Specifically, add __doc__ strings to 
each of these types' methods and members.

16) Implement various FIXMEs.

16-1) Implement the FIXME such that if m is a MatchObject, del m.string
will disassociate the original matched string from the match object;
string would be the only member to allow such a change, and even then
you will not be able to modify the m.string value, only delete it.

-----

Finally, I want to make a couple of notes about Item 2:

Firstly, as noted in Item 14, I wish to add support for UNICODE match 
group names, and the current version of the C-code would not allow that; 
it would only make sense to add UNICODE support if 14 is implemented, so 
adding support for UNICODE match object attributes would depend on both 
items 2 and 14.  Thus, that would be implemented in issue2636-02+14.

Secondly, there is a FIXME which I discussed in Item 16; I gave that
problem its own item and branch.  Also, as stated in Item 15, I would
like to add more robust help code to the Match object and bind __doc__
strings to the fixed attributes.  Although this would not directly
affect the Item 2 implementation, it would probably involve moving some
code around in its vicinity.

Finally, I would like suggestions on how to handle name collisions when 
match group names are provided as attributes.  For instance, an 
expression like '(?P<pos>.*)' would match more or less any string and 
assign it to the name "pos".  But "pos" is already an attribute of the 
Match object, and therefore pos cannot be exposed as a named match
group attribute: match.pos would return the usual meaning of pos for a
match object, not the value of the capture group named "pos".
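The collision is easy to demonstrate with today's re module:

```python
import re

m = re.match(r'(?P<pos>\w+)', 'hello')
# 'pos' is already a Match attribute (the search start position), so
# attribute access could never reach the capture group of the same name:
print(m.pos)            # 0
print(m.group('pos'))   # hello
```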

I have 3 proposals as to how to handle this:

a) Simply disallow the exposure of match group name attributes if the 
names collide with an existing member of the basic Match Object 
interface.

b) Expose the reserved names through a special prefix notation, and for 
forward compatibility, expose all names via this prefix notation.  In 
other words, if the prefix was 'k', match.kpos could be used to access 
pos; if it was '_', match._pos would be used.  If Item 3 is implemented, 
it may be sufficient to allow access via match['pos'] as the canonical 
way of handling match group names using reserved words.

c) Don't expose the names directly; only expose them through a prefixed 
name, e.g. match._pos or match.kpos.

Personally, I like a) because, if Item 3 is implemented, it makes a
fairly useful shorthand for retrieving named groups whenever the name
is not reserved.  Also, we could put in a deprecation warning to inform
users that match group names colliding with Match object attributes
will eventually be disallowed.  However, I don't support restricting
the match group names any more than they already are (they must be
valid python identifiers only), so again I would go with a) and nothing
more, and that's what's implemented in issue2636-02.patch.

-----

Now, rather than posting umpteen patch files, I am posting one bz2-
compressed tar of ALL patch files for all threads, where each file is of
the form:

issue2636(-\d\d|+\d\d)*(-only)?.patch

For instance,

issue2636-01.patch is the p1 patch that is a difference between the
current Python trunk and all that would need to be implemented to
support Atomic Grouping / Possessive Quantifiers.  Combined branches
are joined with a PLUS ('+') and sub-branches concatenated with a DASH
('-').  Thus, "issue2636-01+09-01-01+10.patch" is a patch which
combines the work from Item 1: Atomic Grouping / Possessive
Quantifiers, the sub-sub-branch of Item 9: Engine Cleanups, and Item
10: Shared Constants.
Item 9 has both a child and a grandchild.  The Child (09-01) is my 
proposed engine redesign with the single loop; the grandchild (09-01-01) 
is the redesign with the triple loop.  Finally the optional "-only" flag 
means that the diff is against the core SRE modifications branch and 
thus does not include the core branch changes.

As noted above, Items 01, 02, 05, 07 and 12 should be considered more or 
less complete and ready for merging assuming I don't identify in my 
implementation of the other items that I neglected something in these.  
The rest, including the combined items, are all provided in the given 
tarball.
msg68339 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-06-17 19:07
Sorry, as I stated in the last post, I generated the patches then realized 
that I was missing the documentation for Item 2, so I have updated the 
issue2636-02.patch file and am attaching that separately until the next 
release of the patch tarball.  issue2636-02-only.patch should be ignored 
and I will only regenerate it with the correct documentation in the next 
tarball release so I can move on to either Character Classes or Relative 
Back-references.  I wanna pause Item 3 for the moment because 2, 3, 13, 
14, 15 and 16 all seem closely related and I need a break to allow my mind 
to wrap around the big picture before I try and tackle each one.
msg68358 - (view) Author: Mark Summerfield (mark) Date: 2008-06-18 07:13
[snip]
> 13) Implement a grouptuples(...) method as per Mark Summerfield's
> suggest on 2008-05-28 09:38.  grouptuples would take the same filtering
> parameters as the other group* functions, and would return a list of 3-
> tuples (unless only 1 group was requested).  It should default to all
> match groups (1..n, not group 0, the matching string).

:-)

[snip]
> Finally, I would like suggestions on how to handle name collisions when
> match group names are provided as attributes.  For instance, an
> expression like '(?P<pos>.*)' would match more or less any string and
> assign it to the name "pos".  But "pos" is already an attribute of the
> Match object, and therefore pos cannot be exposed as a named match group
> attribute, since match.pos will return the usual meaning of pos for a
> match object, not the value of the capture group names "pos".
>
> I have 3 proposals as to how to handle this:
>
> a) Simply disallow the exposure of match group name attributes if the
> names collide with an existing member of the basic Match Object
> interface.

I don't like the prefix ideas, and now that you've spelt it out I don't
like that sometimes m.foo will work and sometimes it won't.  So I
prefer m['foo'] to be the canonical way, because that guarantees your
code is always consistent.

------------------------------------------------------------
BTW I wanted to do a simple regex to match a string that might or might
not be quoted, and that could contain quotes (but not those used to
delimit it). My first attempt was illegal:

    (?P<quote>['"])?([^(?=quote)])+(?(quote)(?=quote))

It isn't hard to work round but it did highlight the fact that you can't
use captures inside character classes. I don't know if Perl allows this;
I guess if it doesn't then Python shouldn't either since GvR wants the
engine to be Perl compatible.
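For what it's worth, the effect being sought can be had without captures inside character classes, by putting a backreference inside a negative lookahead; a sketch that runs against today's re module:

```python
import re

# An optionally-quoted token: the quoted body is "any character that is
# not the opening quote", expressed via a lookahead on the backreference.
TOKEN = re.compile(
    r'''(?:(?P<q>['"])(?P<quoted>(?:(?!(?P=q)).)*)(?P=q)|(?P<bare>\S+))'''
)

print(TOKEN.match('"don\'t panic"').group('quoted'))  # don't panic
print(TOKEN.match('plain').group('bare'))             # plain
```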
msg68399 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-06-19 12:01
Thanks for weighing in Mark!  Actually, your point is valid and quite 
fair, though I would not assume that Item 3 would be included just 
because Item 2 isn't.  I will do my best to develop both, but I do not 
make the final decision as to what python includes.  That having been 
said, 3 seems very likely at this point so we may be okay, but let me 
give this one more try as I think I have a better solution to make Item 
2 more palatable.  Let's say we have 5 choices here:

> a) Simply disallow the exposure of match group name attributes if the 
> names collide with an existing member of the basic Match Object 
> interface.
>
> b) Expose the reserved names through a special prefix notation, and
> for forward compatibility, expose all names via this prefix notation. 
> In other words, if the prefix was 'k', match.kpos could be used to
> access pos; if it was '_', match._pos would be used.  If Item 3 is
> implemented, it may be sufficient to allow access via match['pos'] as
> the canonical way of handling match group names using reserved words.
>
> c) Don't expose the names directly; only expose them through a
> prefixed name, e.g. match._pos or match.kpos.

d) (As Mark suggested) we drop Item 2 completely.  I have not invested 
much work in this so that would not bother me, but IMHO I actually 
prefer Item 2 to 3 so I would really like to see it preserved in some 
form.

e) Add an option, re.MATCH_ATTRIBUTES, that is used as a Match Creation 
flag.  When the re.MATCH_ATTRIBUTES or re.A flag is included in the 
compile, or (?a) is included in the pattern, it will do 2 things.  
First, it will raise an exception if either a) there exists an unnamed 
capture group or b) the capture group name is a reserved keyword.  In 
addition to this, I would put in a hook to support a from __future__ so 
that any post 2.6 changes to the match object type can be smoothly 
integrated a version early to allow programmers to change when any 
future changes come.  Secondly, I would *conditionally* allow arbitrary 
capture group name via the __getattr__ handler IFF that flag was 
present; otherwise you could not access Capture Groups by name via 
match.foo.

I really like the idea of e) so I'm taking Item 2 out of the "ready for 
merge" category and going to put it in the queue for the modifications 
spelled out above.  I'm not too worried about our flags differing from 
Perl's: we based our first four on Perl (x, s, m, i) but subsequently 
added Unicode and Locale, which Perl does not have, and never 
implemented o (our caching semantics already give every expression 
that), e (which is specific to Perl syntax AFAICT) or g (which can be 
simulated via re.split).  So I propose we take A and implement it as 
I've specified; that is the current goal of Item 2.  Once this is done 
and working, we can decide whether it should be included in the Python 
trunk.
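A minimal sketch of the Item 2 attribute idea, showing both the convenience and the name-collision problem that choices (a)-(e) are trying to solve. The AttrMatch wrapper and its names are mine, purely for illustration; they are not the patch's actual implementation, which works inside the C match type.

```python
import re

class AttrMatch:
    """Hypothetical wrapper sketching Item 2 -- not the patch's code."""
    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Only called when normal lookup fails, so _match itself is safe.
        groups = self._match.groupdict()
        if name in groups:
            return groups[name]
        # Fall back to the real Match attribute (pos, group, span, ...).
        return getattr(self._match, name)

m = AttrMatch(re.match(r"(?P<foo>\w+)", "bar baz"))
print(m.foo)  # 'bar'
print(m.pos)  # 0 -- no group named 'pos', so the Match attribute wins

# The collision problem: a group named 'pos' shadows the Match
# object's own attribute, which is what options (a)-(e) argue about.
shadowed = AttrMatch(re.match(r"(?P<pos>\w+)", "xyz"))
print(shadowed.pos)  # 'xyz', not 0
```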

How does that sound to you, Mark and anyone else who wishes to weigh in?
msg68409 - (view) Author: Mark Summerfield (mark) Date: 2008-06-19 14:15
[snip]

It seems to me that both using a special prefix or adding an option are
adding a lot of baggage and will increase the learning curve.

The nice thing about (3) (even without slicing) is that it seems a v.
natural extension. But (2) seems magical (i.e., Perl-like rather than
Pythonic) which I really don't like.

BTW I just noticed this:

>>> "{0!r}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!s}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!a}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'

That's fair enough, but maybe for !s the output should be rx.pattern?
msg73185 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-09-13 13:40
See also #3825.
msg73295 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-16 11:59
Update 16 Sep 2008:

Based on the work for issue #3825, I would like to simply update the
item list as follows:

1) Atomic Grouping / Possessive Qualifiers (See also Issue #433030)
[Complete]

2) Match group names as attributes (e.g. match.foo) [Complete save
issues outlined above]

3) Match group indexing (e.g. match['foo'], match[3])

4) Perl-style back-references (e.g. compile(r'(a)\g{-1}')), and possibly
adding the r'\k' escape sequence for keywords.

5) Parenthesis-Aware Python Comment (e.g. r'(?P#...)') [Complete]

6) Expose support for Template expressions (expressions without repeat
operators), adding test cases and documentation for existing code.

7) Larger compiled Regexp cache (256 vs. 100) and reduced thrashing
risk. [Complete]

8) POSIX character classes (e.g. r'[[:alnum:]]')

9) Proposed Engine redesigns and cleanups (core item only contains
cleanups and comments to the current design but does not modify the design).

9-1) Single-loop Engine redesign that runs 8% slower than current.
[Complete]

9-1-1) 3-loop Engine redesign that runs 10% slower than current. [Complete]

9-2) Matthew Bernett's Engine redesign as per issue #3825

10) Have all C-Python shared constants stored in 1 place
(sre_constants.py) and generated by that into C constants
(sre_constants.h). [Complete AFAICT]

11) Scan Perl 5.10.0 for other potential additions that could be
implemented for Python.

12) Documentation suggestions by Jim J. Jewett [Complete]

13) Add grouptuples method to the Match object (i.e. match.grouptuples()
returns (<index>, <name or None>, <value>) ) suitable for iteration.

14) UNICODE match group names, as per PEP-3131.

15) Add __doc__ strings and other Python niceties to the Pattern_Type,
Match_Type and Scanner_Type (experimental).

16) Implement any remaining TODOs and FIXMEs in the Regexp modules.

16-1) Allow for the disassociation of a source string from a Match_Type,
assuming this will still leave the object in a "reasonable" state.

17) Variable-length [Positive and Negative] Look-behind assertions, as
described and implemented in Issue #3825.
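For reference, the subscripting proposed in Item 3 matches what eventually shipped in the standard library: match objects became subscriptable in Python 3.6, so on a current Python the proposed behaviour can be demonstrated directly with stdlib re.

```python
import re

m = re.match(r"(?P<word>[a-z]+) (?P<num>\d+)", "abc 42")
print(m[0])      # 'abc 42' -- whole match, like m.group(0)
print(m[1])      # 'abc'    -- numbered group
print(m["num"])  # '42'     -- named groups work as keys too
```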

---

Now, we have a combination of Items 1, 9-2 and 17 available in issue
#3825, so for now, refer to that issue for the 01+09-02+17 combined
solution.  Eventually, I hope to merge the work between this and that issue.

I sadly admit I have made no progress on this since June, because I am
managing some 30 lines of development, some of which have complex
diamond branching, e.g.:

01 is the child of Issue2636
09 is the child of Issue2636
10 is the child of Issue2636
09-01 is the child of 09
09-01-01 is the child of 09-01
01+09 is the child of 01 and 09
01+10 is the child of 01 and 10
09+10 is the child of 09 and 10
01+09-01 is the child of 01 and 09-01
01+09-01-01 is the child of 01 and 09-01-01
09-01+10 is the child of 09-01 and 10
09-01-01+10 is the child of 09-01-01 and 10

Which all seems rather simple until you wrap your head around:

01+09+10 is the child of 01, 09, 10, 01+09, 01+10 AND 09+10!

Keep in mind the reason for all this complex numbering is because many
issues cannot be implemented in a vacuum: If you want Atomic Grouping,
that's 1 implementation, if you want Shared Constants, that's a
different implementation. but if you want BOTH Atomic Grouping and
Shared Constants, that is a wholly other implementation because each
implementation affects the other.  Thus, I end up with a plethora of
branches and a nightmare when it comes to merging which is why I've been
so slow in making progress.  Bazaar seems to be very confused when it
comes to a merge in 6 parts between, for example 01, 09, 10, 01+09,
01+10 and 09+10, as above.  When it sees changes already applied in a
previous merge being applied again, instead of realizing that the
change in one branch since the last merge is EXACTLY the same change as
in the other, so that effectively there is nothing to do, Bazaar treats
code that did NOT change since the last merge as if it had changed,
tries to roll back the 01+09+10-specific changes rather than doing
nothing, and generates a conflict.  Oh, that I could only have a
version control system that understood the kind of complex branching I
require!

Anyway, that's the state of things; this is me, signing out!
msg73714 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-24 14:28
Comparing item 2 and item 3, I think that item 3 is the Pythonic choice
and item 2 is a bad idea.

Item 4: back-references in the pattern are like \1 and (?P=name), not
\g<1> or \g<name>, and in the replacement string are like \g<1> and
\g<name>, not \1 (or  (?P=name)). I'd like to suggest that
back-references in the pattern also include \g<1> and \g<name> and
\g<-1> for relative back-references. Interestingly, Perl names groups
with (?<name>...) whereas Python uses (?P<name>...). A permissible
alternative?
msg73717 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-24 15:09
Thanks for weighing in Matthew!

Yeah, I do get some flak for item 2 because originally item 3 wasn't
supposed to cover named groups, but on investigation it made sense that
it should.  I still prefer 2 overall, but the nice thing about them
being separate items is that we can accept 2 or 3, both, or neither;
and for the most part development for the first phase of 2 is complete,
though there is still IMHO the issue of Unicode group names (vis-a-vis
item 14) and the name collision problem, which I propose fixing with an
Attribute / re.A flag.  So we could end up supporting 3 by default and
2 via a flag; or 3 and 2 both, with 2 as is and name collisions hidden
(i.e. if you have r'(?P<string>...)' as your capture group, typing
m.string will still give you the original comparison string, as per the
current Python documentation), but with collision-checking via the
Attribute flag, so that r'(?A)(?P<string>...)' would not compile
because string is a reserved word.

Your interpretation of 4 matches mine, though, and I would definitely
suggest using Perl's \g<-n> notation for relative back-references, but
further, I was thinking, if not part of 4, part of the catch-all item 11
to add support for Perl's (?<name>...) as a synonym for Python's
(?P<name>...) and Perl's \k<name> for Python's (?P=name) notation.  The
evolution of Perl's name group is actually interesting.  Years ago,
Guido had a conversation with Larry Wall about using the (?P...) capture
sequence for python-specific Regular Expression blocks.  So Python went
ahead and implemented named capture groups.  Years later, the Perl folks
thought named capture groups were a neat idea and adapted them in the
(?<...>...) form because Python had restricted the (?P...) notation to
themselves so they couldn't use our even if they wanted to.  Now,
though, with Perl adapting (?<...>...), I think it inevitable that Java
and even C++ may see this as the defacto standard.  So I 100% agree, we
should consider supporting (?<name>...) in the parser.
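For a quick contrast of the two spellings on a current stdlib re (which still accepts only the (?P...) forms being discussed here):

```python
import re

# Python's spelling of a named group plus a back-reference to it:
m = re.match(r"(?P<tag>\w+)-(?P=tag)", "abc-abc")
print(m.group("tag"))  # 'abc'

# Perl 5.10's (?<tag>...) / \k<tag> spelling is rejected by stdlib re:
try:
    re.compile(r"(?<tag>\w+)-\k<tag>")
    rejected = False
except re.error:
    rejected = True
print(rejected)  # True
```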

Oh, and as I suggested in Issue 3825, I have these new item proposals:

Item 18: Add a re.REVERSE, re.R (?r) flag for reversing the direction of
the String Evaluation against a given Regular Expression pattern. See
issue 516762, as implemented in Issue 3825.

Item 19: Make various in-line flags positionally dependent, for example
(?i) makes the pattern before this case-sensitive but after it
case-insensitive. See Issue 433024, as implemented in Issue 3825.

Item 20: Allow the negation of in-line flags to cancel their effect in
conditionally flagged expressions for example (?-i). See Issue 433027,
as implemented in Issue 3825.

Item 21: Allow for scoped flagged expressions, i.e. (?i:...), where the
flag(s) is applied to the expression within the parenthesis. See Issue
433028, as implemented in Issue 3825.

Item 22: Zero-width regular expression split: splitting via a regular
expression that matches a zero-length string should return a result
equivalent to splitting at each character boundary, with a null string
at the beginning and end representing the space before the first and
after the last character. See issue 3262.

Item 23: Character class ranges over case-insensitive matches, i.e.
does "(?i)[9-A]" contain '_', whose ord is greater than the ord of 'A'
and less than the ord of 'a'? See issue 5311.

And I shall create a bazaar repository for your current development line
with the unfortunately unwieldy name of
lp:~timehorse/python/issue2636-01+09-02+17+18+19+20+21 as that would,
AFAICT, cover all the items you've fixed in your latest patch.

Anyway, great work Matthew and I look forward to working with you on
Regexp 2.7 as you do great work!
msg73721 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-24 15:48
Regarding item 22: there's also #1647489 ("zero-length match confuses
re.finditer()").

This had me stumped for a while, but I might have a solution. I'll see
whether it'll fix item 22 too.

I wasn't planning on doing any more major changes on my branch, just
tweaking and commenting and seeing whether I've missed any tricks in the
speed stakes. Half the task is finding out what's achievable, and how!
msg73730 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-09-24 16:33
Though I can't look at the code at this time, I just want to express how
good it feels that you both are doing these great things for regular
expressions in Python! Especially atomic grouping is something I've
often wished for when writing lexers for Pygments... Keep up the good work!
msg73752 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-24 19:45
Good catch on issue 1647489, Matthew; it looks like this is where that
bug fix will end up going.  But I am unsure whether the solution for
this issue will be the same as for 3262.  I think the solution here is
to add an internal flag that keeps track of whether the current
character has already participated in a zero-width match, disallowing
any zero-width match there beyond the first, while still not consuming
any characters on a zero-width match.

Thus, I have allocated this fix as Item 24, but it may be later merged
with 22 if the solutions turn out to be more or less the same, likely
via a 22+24 thread.  The main difference, though, as I see it, is that
the change in 24 may be considered a bug where the general consensus of
22 is that it is more of a feature request and given Guido's acceptance
of a flag-based approach, I suggest we allocate re.ZEROWIDTH, re.Z and
(?z) flags to turn on the behaviour you and I expect, but still think
that'd be best as a 2.7 / 3.1 solution.  I would also like to add a from
__future__ import ZeroWidthRegularExpressions or some such to make this
the default behaviour so that by version 3.2 it may indeed be considered
the default.
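For what it's worth, the Item 22 semantics proposed here are essentially what stdlib re later adopted: re.split began honouring zero-width matches in Python 3.7, so on a current Python the proposed behaviour can be checked directly.

```python
import re

# Splitting on a zero-width pattern splits at every character boundary,
# with empty strings marking the start and end (Python 3.7+ behaviour):
parts = re.split(r"", "abc")
print(parts)  # ['', 'a', 'b', 'c', '']

# A zero-width lookahead split is the common practical use:
words = re.split(r"(?=[A-Z])", "oneTwoThree")
print(words)  # ['one', 'Two', 'Three']
```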

Anyway, I've allocated all the new items in the launchpad repository so
feel free to go to http://www.bazaar-vcs.org/ and install Bazaar for
windows so you can download any of the individual item development
threads and try them out for yourself.  Also, please consider setting up
a free launchpad account of your very own so that I can perhaps create a
group that would allow us to better share development.

Thanks again Matthew for all your greatly appreciated contributions!
msg73766 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 00:06
I've moved all the development branches to the ~pythonregexp2.7 team so 
that we can work collaboratively.  You just need to install Bazaar, join 
www.launchpad.net, upload your public SSH key and then request to be added 
to the pythonregexp2.7 team.  At that point, you can check out any code 
via:

bzr co lp:~pythonregexp2.7/python/issue2636-*

This should make co-operative development easier.
msg73779 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-25 11:56
Just out of interest, is there any plan to include #1160 while we're at it?
msg73780 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 11:57
I've enumerated the current list of Item Numbers at the official
Launchpad page for this issue:

https://launchpad.net/~pythonregexp2.7

There you will find links to each development branch associated with
each item, where a broader description of each issue may be found.

I will no longer enumerate the entire list here as it has grown too long
to keep repeating; please consult that web page for the most up-to-date
list of items we will try to tackle in the Python Regexp 2.7 update.

Also, anyone wanting to join the development team who already has a
Launchpad account can just go to the Python Regexp 2.7 web site above
and request to join.  You will need Bazaar to check out, pull or branch
code from the repository, which is available at www.bazaar-vcs.org.
msg73782 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 12:23
Good catch, Matthew, and if you spot any other outstanding Regular
Expression issues feel free to mention them here.

I'll give issue 1160 an item number of 25 and think all we need to do
here is change SRE_CODE to be typedefed to an unsigned long and change
the repeat count constants (which would be easier if we assume item 10:
shared constants).
msg73791 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-25 13:43
For reference, these are all the regex-related issues that I've found
(including this one!):

id       : activity : title
#2636    : 25/09/08 : Regexp 2.7 (modifications to current re 2.2.2)
#1160    : 25/09/08 : Medium size regexp crashes python
#1647489 : 24/09/08 : zero-length match confuses re.finditer()
#3511    : 24/09/08 : Incorrect charset range handling with ignore case
flag?
#3825    : 24/09/08 : Major reworking of Python 2.5.2 re module
#433028  : 24/09/08 : SRE: (?flag:...) is not supported
#433027  : 24/09/08 : SRE: (?-flag) is not supported.
#433024  : 24/09/08 : SRE: (?flag) isn't properly scoped
#3262    : 22/09/08 : re.split doesn't split with zero-width regex
#3299    : 17/09/08 : invalid object destruction in re.finditer()
#3665    : 24/08/08 : Support \u and \U escapes in regexes
#3482    : 15/08/08 : re.split, re.sub and re.subn should support flags
#1519638 : 11/07/08 : Unmatched Group issue - workaround
#1662581 : 09/07/08 : the re module can perform poorly: O(2**n) versus
O(n**2)
#3255    : 02/07/08 : [proposal] alternative for re.sub
#2650    : 28/06/08 : re.escape should not escape underscore
#433030  : 17/06/08 : SRE: Atomic Grouping (?>...) is not supported
#1721518 : 24/04/08 : Small case which hangs
#1693050 : 24/04/08 : \w not helpful for non-Roman scripts
#2537    : 24/04/08 : re.compile(r'((x|y+)*)*') should fail
#1633953 : 23/02/08 : re.compile("(.*$){1,4}", re.MULTILINE) fails
#1282    : 06/01/08 : re module needs to support bytes / memoryview well
#814253  : 11/09/07 : Grouprefs in lookbehind assertions
#214033  : 10/09/07 : re incompatibility in sre
#1708652 : 01/05/07 : Exact matching
#694374  : 28/06/03 : Recursive regular expressions
#433029  : 14/06/01 : SRE: posix classes aren't supported
msg73794 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 14:17
Hmmm.  Well, some of those are already covered:

#2636    : self
#1160    : Item 25
#1647489 : Item 24
#3511    : Item 23
#3825    : Item 9-2
#433028  : Item 21
#433027  : Item 20
#433024  : Item 19
#3262    : Item 22
#3299    : TBD
#3665    : TBD
#3482    : TBD
#1519638 : TBD
#1662581 : TBD
#3255    : TBD
#2650    : TBD
#433030  : Item 1
#1721518 : TBD
#1693050 : TBD
#2537    : TBD
#1633953 : TBD
#1282    : TBD
#814253  : TBD (but I think you implemented this, didn't you Matthew?)
#214033  : TBD
#1708652 : TBD
#694374  : TBD
#433029  : Item 8

I'll have to get nosy and go over the rest of these to see if any of
them have already been solved, like the duplicate test case issue from a
while ago, but someone forgot to close them.  I'm thinking specifically
the '\u' escape sequence one.
msg73798 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-25 15:57
#814253 is part of the fix for variable-width lookbehind.

BTW, I've just tried a second time to register with Launchpad, but still
no reply. :-(
msg73801 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 16:32
Yes, I see in you rc2+2 diff it was added into that.  I will have to
allocate a new number for that fix though, as technically it's a
different feature than variable-length look-behind.

For now I'm having a hard time merging your diffs in with my code base.
 Lots and lots of conflicts, alas.

BTW, what UID did you try to register under at Launchpad?  Maybe I can
see if it's registered but just forgetting to send you e-mail.
msg73803 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-25 17:01
Tried bazaar@mrabarnett.plus.com twice, no reply. Succeeded with
mrabarnett@freeuk.com.
msg73805 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-25 17:36
Thanks Matthew.  You are now part of the pythonregexp2.7 team.  I want
to handle integrating Branch 01+09-02+17 myself for now and the other
branches will need to be renamed because I need to add Item 26: Capture
Groups in Look-Behind expressions, which would mean the order of your
patches are:

01+09-02+17:

regex_2.6rc2.diff
regex_2.6rc2+1.diff

01+09-02+17+26:

regex_2.6rc2+2.diff

01+09-02+17+18+26:

regex_2.6rc2+3.diff
regex_2.6rc2+4.diff

01+09-02+17+18+19+20+21+26:

regex_2.6rc2+5
regex_2.6rc2+6

It is my intention, therefore, to check a version of each of these
patches in to their corresponding repository, sequentially, starting
with 0, which is what I am working on now.

I am worried about a straight copy to each thread though, as there are
some basic cleanups provided through the core issue2636 patch, the item
1 patch and the item 9 patch.  The best way to see what these changes
are is to download
http://bugs.python.org/file10645/issue2636-patches.tar.bz2 and look at
the issue2636-01+09.patch file or, by typing the following into bazaar:

bzr diff --old lp:~pythonregexp2.7/python/base --new
lp:~pythonregexp2.7/python/issue2636+01+09

Which is more up-to-date than my June patches -- I really need to
regenerate those!
msg73827 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-25 23:59
I've been completely unable to get Bazaar to work with Launchpad:
authentication errors and bzrlib.errors.TooManyConcurrentRequests.
msg73848 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-26 13:11
Matthew,

Did you upload a public SSH key to your Launchpad account?

You're on MS Windows, right?  I can try and do an install on an MS
Windows XP box or 2 I have lying around and see how that works, but we
should try to solve a vexing thing I've noticed about Windows
development: Windows cannot understand Unix-style file permissions, so
when I check out Python on Windows and then check it back in, EVERY
Python and C file shows as "changed" by virtue of its permissions
having changed.  I would hope there's some way to tell Bazaar to ignore
'permissions' changes, because I know our edits really have nothing to
do with that.

Anyway, I'll try a few things vis-a-vis Windows to see if I get a similar
problem; there's also the https://answers.launchpad.net/bazaar forum
where you can post your Bazaar issues and see if the community can help.
 Search previous questions or click the "Ask a question" button and type
your subject.  Launchpad's UI is even smart enough to scan your question
title for similar ones so you may be able to find a solution right away
that way.  I use the Launchpad Answers section all the time and have
found it usually is a great way of getting help.
msg73853 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-26 15:16
I have it working finally!
msg73854 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-26 15:43
Great, Matthew!!

Now, I'm still in the process of setting up branches related to your
work; generally they should be created from a core and set of features
implemented for example:

To get from Version 2 to Version 3 of your Engine, I had to first check
out lp:~pythonregexp2.7/python/issue2636-01+09-02+17 and then "push" it
back onto launchpad as
lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26.  This way the
check-in logs become coherent.

So, please hold off on checking your code in until I have your current
patch-set checked in, which I should finish by today; I also need to
rename some of the projects based on the fact that you also implemented
item 26 in most of your patches.  Actually, I keep a general To-Do list
of what I am up to on the
https://code.launchpad.net/~pythonregexp2.7/python/issue2636 whiteboard,
which you can also edit, if you want to see what I'm up to.  But I'll
try to have that list complete by today, fingers crossed!  In the
meantime, would you mind seeing if you are getting the file permissions
issue by doing a checkout or pull or branch and then calling "bzr stat"
to see if this caused Bazaar to add your entire project for checkin
because the permissions changed.  Thanks and congratulations again!
msg73855 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-26 16:00
I did a search on the permissions problem:
https://answers.launchpad.net/bzr/+question/34332.
msg73861 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-26 16:28
Thanks, Matthew.  My reading of that Answer is that you should be okay
because you, I assume, installed the Windows-Native package rather than
the cygwin that I first tested.  I think the problem is specific to
Cygwin as well as the circumstances described in the article.  Still, it
should be quite easy to verify if you just check out python and then do
a stat, as this will show all files whose permissions have changed as
well as general changes.  Unfortunately, I am still working on setting
up those branches, but once I finish documenting each of the branches, I
should proceed more rapidly.
msg73875 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-26 18:04
Phew!  Okay, all you patches have been applied as I said in a previous
message, and you should now be able to check out
lp:~pythonregexp2.7/python/issue2636+01+09-02+17+18+19+20+21+24+26 where
you can then apply your latest known patch (rc2+7) to add a fix for the
findall / finditer bug.

However, please review my changes to:

a) lp:~pythonregexp2.7/python/issue2636-01+09-02+17
b) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26
c) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+26
d) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+19+20+21+26

To make sure my mergers are what your code snapshots should be.  I did
get one conflict with patch 5 IIRC where a reverse attribute was added
to the SRE_STATE struct, and get a weird grouping error when running the
tests for (a) and (b), which I think is a typo; a compile error
regarding the afore mentioned missing reverse attribute from patch 3 or
4 in (c) and the SRE_FLAG_REVERSE seems to have been lost in (d) for
some reason.

Also, if you feel like tackling any other issues, whether they have
numbers or not, and implementing them in your current development line,
please let me know so I can get all the documentation and development
branches set up.  Thanks and good luck!
msg73955 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-28 02:51
I haven't yet found out how to turn on compression when getting the
branches, so I've only looked at
lp:~pythonregexp2.7/python/issue2636+01+09-02+17+18+19+20+21+24+26. I
did see that the SRE_FLAG_REVERSE flag was missing.

BTW, I ran re.findall(r"(?m)^(.*re\..+\\m)$", text) where text was 67MB
of emails. Python v2.5.2 took 2.4secs and the new version 5.6secs. Ouch!
I added 4 lines to _sre.c and tried again. 1.1secs. Nice! :-)
msg74025 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-29 11:47
Good work, Matthew.  Now, another Bazaar hint, IMHO, is one of my
favourite commands: switch.  I generally develop all in one directory,
rather than getting a new directory for each branch.  One does have to
be VERY careful to type "bzr info" to make sure the branch you're
editing is the one you think it is! but with "bzr switch", you do a
differential branch switch that allows you to change your development
branch quickly and painlessly.  This assumes you did a "bzr checkout"
and not a "bzr pull".  If you did a pull, you can still turn this into a
"checkout", where all VCS actions are mirrored on the server, by using
the 'bind' command.  Make sure you push your branch first.  You don't
need to worry about all this "bind"ing, "push"ing and "pull"ing if you
choose checkout, but OTOH, if your connection is over-all very slow, you
may still be better off with a "pull"ed branch rather than a
"checkout"ed one.

Anyway, good catch on those 4 lines and I'll see if I can get your
earlier branches up to date.
msg74026 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2008-09-29 12:36
Matthew, I've traced down the patch failures in my merges and now each
of the 4 versions of code on Launchpad should compile, though the first
2 do not pass all the negative look-behind tests, though your later 2
do.  Any chance you could back-port that fix to the
lp:~pythonregexp2.7/python/issue2636-01+09-02+17 branch?  If you can, I
can propagate that fix to the higher levels pretty quickly.
msg74058 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-30 00:45
issue2636-01+09-02+17_backport.diff is the backport fix.

Still unable to compress the download, so that's >200MB each time!
msg74104 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-09-30 23:42
The explanation of the zero-width bug is incorrect. What happens is this:

The functions for finditer(), findall(), etc, perform searches and want
the next one to continue from where the previous match ended. However,
if the match was actually zero-width then that would've made it search
from where the previous search _started_, and it would be stuck forever.
Therefore, after a zero-width match the caller of the search consumes a
character. Unfortunately, that can result in a character being 'missed'.
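The consume-a-character workaround described above can be sketched in Python. The iter_matches helper below is hypothetical (it is not the engine's C code); it reproduces both the infinite-loop avoidance and the 'missed character' bug being reported.

```python
import re

def iter_matches(rx, s):
    """Naive search loop using the consume-a-character workaround
    (a sketch of the behaviour described above, not the real engine)."""
    pos, n = 0, len(s)
    while pos <= n:
        m = rx.search(s, pos)
        if m is None:
            break
        yield m
        # After a zero-width match, skip one character so the next
        # search cannot get stuck at the same position...
        pos = m.end() + 1 if m.start() == m.end() else m.end()

# ...but that skip can discard a character that a later, non-empty
# match needed, which is exactly the reported bug:
spans = [m.span() for m in iter_matches(re.compile(r"|a"), "a")]
print(spans)  # [(0, 0), (1, 1)] -- the non-empty 'a' match is lost
```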

The bug in re.split() is also the result of an incorrect fix to this
zero-width problem.

I suggest that the regex code should include the fix for the zero-width
split bug; we can have code to turn it off unless a re.ZEROWIDTH flag is
present, if that's the decision.

The patch issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff includes
some speedups.
msg74174 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-10-02 16:48
I've found an interesting difference between Python and Perl regular
expressions:

    In Python:

        \Z matches at the end of the string

    In Perl:

        \Z matches at the end of the string or before a newline at the
end of the string

        \z matches at the end of the string
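The Python side of this comparison is easy to confirm against stdlib re; note that Perl's \Z ("end, or just before a trailing newline") corresponds to Python's $ without re.MULTILINE.

```python
import re

# Python's \Z is an absolute end-of-string anchor (what Perl calls \z):
print(bool(re.search(r"end\Z", "end")))   # True
print(re.search(r"end\Z", "end\n"))       # None -- newline blocks \Z

# Perl's \Z behaviour matches Python's $ (without re.MULTILINE):
print(bool(re.search(r"end$", "end\n")))  # True
```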
msg74203 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-10-02 22:49
Perl v5.10 offers the ability to have duplicate capture group numbers in
branches. For example:

    (?|(a)|(b))

would number both of the capture groups as group 1.

Something to include?
msg74204 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-10-02 22:51
I've extended the group referencing. It now has:

Forward group references

    (\2two|(one))+

\g-type group references

    (n is name or number)
    \g<n> (Python re replacement string)
    \g{n} (Perl)
    \g'n' (Perl)
    \g"n" (because ' and " are interchangeable)
    \gn   (n is single digit) (Perl)

    (n is number)
    \g<+n>
    \g<-n>
    \g{+n} (Perl)
    \g{-n} (Perl)

\k-type group references

    (n is group name)
    \k<n> (Perl)
    \k{n} (Perl)
    \k'n' (Perl)
    \k"n" (because ' and " are interchangeable)
msg74904 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2008-10-17 12:28
Further to msg74203, I can see no reason why we can't allow duplicate
capture group names if the groups are on different branches and are thus
mutually exclusive. For example:

    (?P<name>a)|(?P<name>b)

Apart from this I think that duplicate names should continue to raise an
exception.
msg80916 - (view) Author: Alex Willmer (moreati) * Date: 2009-02-01 19:25
I've been trying, and failing to understand the state of play with this
bug. The most recent upload is
issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff, but I can't seem
to apply that to anything. Nearly every hunk fails when I try against
25-maint, 26-maint or trunk. How does one apply this? Do I need to apply
mrabarnett's patches from bug 3825?
msg81112 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-03 23:07
issue2636-features.diff is based on Python 2.6. It includes:

Named Unicode characters eg \N{LATIN CAPITAL LETTER A}

Unicode character properties eg \p{Lu} (uppercase letter) and \P{Lu}
(not uppercase letter)

Other character properties not restricted to Unicode eg \p{Alnum} and 
\P{Alnum}

Issue #3511 : Incorrect charset range handling with ignore case
flag?
Issue #3665 : Support \u and \U escapes in regexes
Issue #1519638 Unmatched Group issue - workaround
Issue #1693050 \w not helpful for non-Roman scripts

The next 2 seemed a good idea at the time. :-)

Octal escape \onnn

Extended hex escape \x{n}
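Of these, the named-character escape can already be exercised from any Python 3, because \N{...} is resolved inside string literals before the pattern ever reaches the engine; stdlib re later (in Python 3.8) accepted the escape inside patterns as well.

```python
import re

# \N{...} resolved by the string literal, so any Python 3 handles this:
pat = re.compile("\N{LATIN SMALL LETTER A}+")
print(pat.findall("banana"))  # ['a', 'a', 'a']

# On Python 3.8+ the escape also works inside a raw pattern:
# re.compile(r"\N{LATIN SMALL LETTER A}+")
```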
msg81236 - (view) Author: Robert Xiao (nneonneo) * Date: 2009-02-05 23:13
I'm glad to see that the unmatched group issue is finally being addressed.

Thanks!
msg81238 - (view) Author: Russ Cox (rsc) Date: 2009-02-05 23:52
> Named Unicode characters eg \N{LATIN CAPITAL LETTER A}

These descriptions are not as stable as, say, Unicode code
point values or language names.  Are you sure it is a good idea
to depend on them not being adjusted in the future?
It's certainly nice and self-documenting, but it doesn't seem
better from a future-proofing point of view than \u0041.

Do other languages implement this?

Russ
msg81239 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-06 00:03
Python 2.6 does (and probably Python 3.x, although I haven't checked):

>>> u"\N{LATIN CAPITAL LETTER A}"
u'A'

If it's good enough for Python's Unicode string literals then it's good
enough for Python's re module.  :-)
msg81240 - (view) Author: Robert Xiao (nneonneo) * Date: 2009-02-06 00:06
In fact, it works for Python 2.4, 2.5, 2.6 and 3.0 from my rather
limited testing.

In Python 2.4:
>>> u"\N{LATIN CAPITAL LETTER A}"
u'A'
>>> u"\N{MUSICAL SYMBOL DOUBLE SHARP}"
u'\U0001d12a'

In Python 3.0:
>>> "\N{LATIN CAPITAL LETTER A}"
'A'
>>> ord("\N{MUSICAL SYMBOL DOUBLE SHARP}")
119082
msg81359 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-08 00:39
issue2636-features-2.diff is based on Python 2.6.

Bugfix. No new features.
msg81473 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-09 19:09
Besides the fact that this is probably great work, I really wonder who
will have enough time and skills to review such a huge patch... :-S

In any case, some recommendations:
- please provide patches against trunk; there is no way such big changes
will get committed against 2.6, which is in maintenance mode
- avoid, as far as possible, doing changes in style, whitespace or
indentation; this will make the patch slightly smaller or cleaner
- avoid C++-style comments (use /* ... */ instead)
- don't hesitate to add extensive comments and documentation about what
you've added

Once you think your patch is ready, you may post it to
http://codereview.appspot.com/, in the hope that it makes reviewing easier.
msg81475 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-09 19:17
One thing I forgot:
- please don't make lines longer than 80 characters :-)

Once the code has settled down, it would also be interesting to know if
performance has changed compared to the previous implementation.
msg82673 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-24 19:28
issue2636-features-3.diff is based on the 2.x trunk.

Added comments.
Restricted line lengths to no more than 80 characters
Added common POSIX character classes like [[:alpha:]].
Added further checks to reduce unnecessary backtracking.

I've decided to remove \onnn and \x{n} because they aren't supported
elsewhere in the language.
msg82739 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-02-26 01:22
issue2636-features-4.diff includes:

Bugfixes
msg74203: duplicate capture group numbers
msg74904: duplicate capture group names
msg82950 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-03-01 01:42
issue2636-features-5.diff includes:

Bugfixes
Added \G anchor (from Perl).

\G is the anchor at the start of a search, so re.search(r'\G(\w)') is
the same as re.match(r'(\w)').

re.findall normally performs a series of searches, each starting where
the previous one finished, but if the pattern starts with \G then it's
like a series of matches:

>>> re.findall(r'\w', 'abc def')
['a', 'b', 'c', 'd', 'e', 'f']
>>> re.findall(r'\G\w', 'abc def')
['a', 'b', 'c']

Notice how it failed to match at the space, so no more results.
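
For anyone without the patch, this \G-anchored findall can be approximated
with the stock re module by chaining anchored matches (a sketch; findall_G
is my name, not part of the patch):

```python
import re

def findall_G(pattern, text):
    """Approximate findall with an implicit start-anchor on each match:
    every match must begin exactly where the previous one ended."""
    rx = re.compile(pattern)
    results, pos = [], 0
    while True:
        m = rx.match(text, pos)      # match() anchors at pos, like \G
        if not m or m.end() == pos:  # stop on failure or zero-width match
            break
        results.append(m.group())
        pos = m.end()
    return results
```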
msg83271 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-03-07 02:47
issue2636-features-6.diff includes:

Bugfixes
Added group access via subscripting.

>>> m = re.search("(\D*)(?<number>\d+)(\D*)", "abc123def")
>>> len(m)
4
>>> m[0]
'abc123def'
>>> m[1]
'abc'
>>> m[2]
'123'
>>> m[3]
'def'
>>> m[1 : 4]
('abc', '123', 'def')
>>> m[ : ]
('abc123def', 'abc', '123', 'def')
>>> m["number"]
'123'
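
The subscripting shown above can be imitated on stock match objects with a
thin wrapper (an illustration only, not the patch's implementation; note
that stock re spells the named group (?P<number>...)):

```python
import re

class SubscriptableMatch:
    """Thin wrapper giving a stock match object m[...] access."""
    def __init__(self, m):
        self._m = m
    def __len__(self):
        return len(self._m.groups()) + 1  # the groups plus group 0
    def __getitem__(self, key):
        if isinstance(key, slice):
            return tuple(self._m.group(i)
                         for i in range(*key.indices(len(self))))
        return self._m.group(key)  # int index or group name

m = SubscriptableMatch(re.search(r"(\D*)(?P<number>\d+)(\D*)", "abc123def"))
```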
msg83277 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-03-07 11:27
I don't think it will be possible to accept these patches in the current
form and way in which they are presented. I randomly picked
issue2636-features-2.diff, and see that it contains lots of style and
formatting changes, which is completely taboo for this kind of contribution.

I propose to split up the patches into separate tracker issues, one
issue per proposed new feature. No need to migrate all changes to new
issues - start with the one single change that you think is already
complete, and acceptance is likely without debate. Leave a note in this
issue what change has been moved to what issue.

For each such new issue, describe what precisely the patch is supposed
to do. Make sure it is complete with respect to this specific change,
and remove any code not contributing to the change.

Also procedurally, it is not quite clear to me who is contributing these
changes: Jeffrey C. Jacobs, or Matthew Barnett. We will need copyright
forms from the original contributor.
msg83390 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2009-03-09 15:15
Martin and Matthew,

I've been far too busy in the new year to keep up with all your updates 
to this issue, but since Martin wanted some clarification on direction 
and copyright: Matthew and I are co-developers, but there is a clear 
delineation between our respective work.  The patches uploaded by 
Matthew (mrabarnett) are entirely a product of his own 
work.  The ones uploaded by me are more complicated, as I have always 
intended this to be a piecemeal project, not one patch fixes all, which 
is why I created the Bazaar repository hierarchy 
(https://launchpad.net/~pythonregexp2.7) with 36 or so branches of 
mostly independent development at various stages of completion.  Here is 
where the copyrights get more complicated, but not much so.  As I said, 
there are branches where multiple issues are combined (with the plus 
operator (+)).  In general, I consider primary development the single-
number branch and only create combined branches where I feel there may 
be a cross-dependency between one branch and the other.  Working this 
way is VERY time consuming: one spends more time merging branches than 
actually developing.  Matthew, on the other hand, has worked fairly 
linearly so his branches generally have long number trains to indicate 
all the issues solved in each.  What's more, the last time I updated the 
repository was last summer so all of Matthew's latest patches have not 
been catalogued and documented.  But, what is there that is more or less 
100% copyright and thanks to Matthew's diligent work always contains his 
first contribution, the new RegExp engine, thread 09-02.  So, any items 
which contain ...+09-02+... are pretty much Matthew's work and the rest 
are mine.

All that said, I personally like having all this development in one 
place, but also like having the separate branch development model I've 
set up in Bazaar.  If new issues are created from this one, I would thus 
hope they would still follow the outline specified on the Launchpad 
page.  I prefer keeping everything in one issue though as IMHO it makes 
things easier to keep track of.

As for the stuff I've worked on, I first should forewarn that there is a  
root patch at 
(https://code.launchpad.net/~pythonregexp2.7/python/issue2636) and as 
issue2636.patch in the tar.bz2 patch library I posted last June.  This 
patch contains various code cleanups and most notably a realignment of 
the documentation to follow 72-column rule.  I know Python's 
documentation is supposed to be 80-column, but some of the lines were 
going out even past that, and by making it 72 it allows for incremental 
expansion before having to reformat any lines.  However, textually, the 
issue2636 version of re.rst is no different than the last version it's 
based off of, which I verified by generating Sphinx hierarchies for 
both versions.  I therefore suggest this as the only change which is 
'massive restructuring' as it does not affect the actual documentation, 
it just makes it more legible in reStructuredText form.  This and other 
suggested changes in the root issue2636 thread are intended to be 
applied if at least 1 of the other issues is accepted, and as such it is 
the root branch of every other branch.  Understanding that even these 
small changes may not in fact be acceptable, I have always generated 2 
sets of patches for each issue: one diff'ed against the python snapshot 
stored in base  
(https://code.launchpad.net/~pythonregexp2.7/python/base) and one that 
is diff'ed against the issue2636 root so if the changes in issue2636 
root are none the less unacceptable, they can easily be disregarded.

Now, with respect to work ready for analysis and merging prepared by me, 
I have 4 threads ready for analysis, with documentation updated and test 
cases written and passing:

1: Atomic Grouping / Possessive Qualifiers

5: Added a Python-specific RegExp comment group, (?P#...) which supports  
parenthetical nesting (see the issue for details)

7: Better caching algorithm for the RegExp compiler with more entries in 
the cache and reduced possibility of thrashing.

12: Clarify the python documentation for RegExp comments; this was only 
a change in re.rst.

The branches 09-01 and 09-01-01 are engine redesigns that I used to 
better understand the current RegExp engine but neither is faster than 
the existing engine so they will probably be abandoned.

10 is also nearly complete and affects the implementation of 01 (whence 
01+10) if accepted, but I have not done a final analysis to determine if any other variables can be consolidated to be defined only in one place.  

Thread 2 is in a near-complete form, but has been snagged by a decision 
as to what the interface to it should be -- see the discussion above and specifically http://bugs.python.org/msg68336 and http://bugs.python.org/msg68399.  The stand-alone patch by me is the 
latest version and implements the version called (a) in those notes.  I 
prefer to implement (e).

I don't think I'd had a chance to do any significant work on any of the 
other threads and got really bogged down with changing thread 2 as 
described above, trying to maintain threads for Matthew and just 
performing all those merges in Bazaar!

So that's the news from me, and nothing new to contribute at this time, 
but if you want separate, piecemeal solutions, feel free to crack open http://bugs.python.org/file10645/issue2636-patches.tar.bz2 and grab them 
for at least items 1, 5, 7 and 12.
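
As an aside, item 1 (atomic grouping / possessive qualifiers) can at least
be approximated in the stock re module today with the lookahead-plus-
backreference idiom, since lookarounds never give back characters on
backtracking (a sketch, not any of the patches here):

```python
import re

# Once a lookahead has succeeded, backtracking never re-enters it,
# so capturing inside a lookahead and then matching the capture
# emulates an atomic group: (?=(X))\1 behaves like (?>X).
greedy = re.compile(r"a*ab")          # a* can back off, so 'aab' matches
atomic = re.compile(r"(?=(a*))\1ab")  # like (?>a*)ab: a* keeps every 'a'
```

With the atomic version, a* swallows the whole run of a's and refuses to
back off, so the trailing 'ab' can never match.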
msg83411 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-03-09 23:09
> I've been far too busy in the new year to keep up with all your updates 
> to this issue, but since Martin wanted some clarification on direction 
> and copyright,

Thanks for the clarification. So I think we should focus on Matthew's
patches first, and come back to yours when you have time to contribute
them.
msg83427 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2009-03-10 12:00
Okay, as I said, Atomic Grouping, etc., off a recent 2.6 is already 
available and I can do any cleanups requested to those already 
mentioned, I just don't want to start any new items at the moment.  As 
it is, we are still over a year from any of this seeing the light of day 
as it's not going to be merged until we start 2.7 / 3.1 alpha.

Fortunately, I think Matthew here DOES have a lot of potential to have 
everything wrapped up by then, but I think to summarize everyone's 
concern, we really would like to be able to examine each change 
incrementally, rather than as a whole.  So, for the purposes of this, I 
would recommend that you, Matthew, make a version of your new engine 
WITHOUT any Atomic Group, variable length look behind / ahead 
assertions, reverse string scanning, positional, negated or scoped 
inline flags, group key indexing or any other feature described in the 
various issues, and that we then evaluate purely on the merits of the 
engine itself whether it is worth moving to that engine, and having made 
that decision officially move all work to that design if warranted.  
Personally, I'd like to see that 'pure' engine for myself and maybe we 
can all develop an appropriate benchmark suite to test it fairly against 
the existing engine.  I also think we should consider things like 
presentation (are all lines terminated by column 80), number of 
comments, and general readability.  IMHO, the current code is conformant 
in the line length, but VERY deficient WRT comments and readability, the 
latter of which it sacrifices for speed (as well as being retrofitted for 
iteration rather than recursion).  I'm no fan of switch-case, but I 
found that by turning the various case statements into bite-sized 
functions and adding many, MANY comments, the code became MUCH more 
readable at the minor cost of speed.  As I think speed trumps 
readability (though not blindly), I abandoned my work on the engines, 
but do feel that if we are going to keep the old engine, I should try 
and adapt my comments to the old framework to make the current code a 
bit easier to understand since the framework is more or less the same 
code as in the existing engine, just re-arranged.

I think all of the things you've added to your engine, Matthew, can, 
with varying levels of difficulty be implemented in the existing Regexp 
Engine, though I'm not suggesting that we start that effort.  Simply, 
let's evaluate fairly whether your engine is worth the switch over.  
Personally, I think the engine has some potential -- though not much 
better than current WRT readability -- but we've only heard anecdotal 
evidence of its superior speed.  Even if the engine isn't faster, 
developing speed benchmarks that fairly gage any potential new engine 
would be handy for the next person to have a great idea for a rewrite, 
so perhaps while you peruse the stripped down version of your engine, 
the rest of us can work on modifying regex_tests.py, test_re.py and 
re_tests.py in Lib/test specifically for the purpose of benchmarking.

If we can focus on just these two issues ('pure' engine and fair 
benchmarks) I think I can devote some time to the latter as I've dealt a 
lot with benchmarking (WRT the compiler-cache) and test cases and hope 
to be a bit more active here.
msg83428 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-03-10 12:08
> Okay, as I said, Atomic Grouping, etc., off a recent 2.6 is already 
> available and I can do any cleanups requested to those already 
> mentioned, I just don't want to start any new items at the moment.  As 
> it is, we are still over a year from any of this seeing the light of day 
> as it's not going to be merged until we start 2.7 / 3.1 alpha.

3.1 will actually be released, if all goes well, before July of this
year. The first alpha was released a couple of days ago. The goal is to
fix most deficiencies of the 3.0 release.

See http://www.python.org/dev/peps/pep-0375/ for the planned release
schedule.
msg83429 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2009-03-10 12:14
Thanks, Antoine!  Then I think for the most part any changes to Regexp 
will have to wait for 3.2 / 2.7.
msg83988 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-03-22 23:33
An additional feature that could be borrowed, though in slightly
modified form, from Perl is case-changing controls in replacement
strings. Roughly the idea is to add these forms to the replacement string:

    \g<1> provides capture group 1

    \u\g<1> provides capture group 1 with the first character in uppercase

    \U\g<1> provides capture group 1 with all the characters in uppercase

    \l\g<1> provides capture group 1 with the first character in lowercase

    \L\g<1> provides capture group 1 with all the characters in lowercase

In Perl titlecase is achieved by using both \u and \L, and the same
could be done in Python:

    \u\L\g<1> provides capture group 1 with the first character in
uppercase after putting all the characters in all lowercase

although internally it would do proper titlecase.

I'm suggesting restricting the action to only the following group. Note
that this is actually syntactically unambiguous.
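
For comparison, the effect of the proposed \u\L combination is already
reachable in the stock re module with a replacement function (a sketch;
title_sub is my name, and it assumes the pattern has a group 1):

```python
import re

def title_sub(pattern, string):
    # Rough equivalent of replacing each match with \u\L\g<1>:
    # group 1 with its first character uppercased and the rest lowered.
    return re.sub(pattern, lambda m: m.group(1).capitalize(), string)
```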
msg83989 - (view) Author: Robert Xiao (nneonneo) * Date: 2009-03-23 00:08
Frankly, I don't really like that idea; I think it muddles up the RE
syntax to have such a group-modifying operator, and seems rather
unpythonic: the existing way to do this --  use .upper(), .lower() or
.title() to format the groups in a match object as necessary -- seems to
be much more readable and reasonable in this sense.

I think the proposed changes look good, but I agree that the focus
should be on breaking up the megapatch into more digestible feature
additions, starting from the barebones engine. Until that's done, I
doubt *anyone* will want to review it, let alone merge it into the main
Python distribution. So, I think we should hold off on any new features
until this raft of changes can be properly broken up, reviewed and
(hopefully) merged in.
msg83993 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-03-23 01:42
Ah, too Perlish! :-)

Another feature request that I've decided not to consider any further is
recursive regular expressions. There are other tools available for that
kind of thing, and I don't want the re module to go the way of Perl 6's
rules; such things belong elsewhere, IMHO.
msg84350 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-03-29 00:43
Patch issue2636-patch-1.diff contains a stripped down version of my
regex engine and the other changes that are necessary to make it work.
msg86004 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-04-15 23:13
fyi - I can't compile issue2636-patch-1.diff when applied to trunk (2.7) 
using gcc 4.0.3.  many errors.
msg86032 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-04-16 14:58
Try issue2636-patch-2.diff.
msg89632 - (view) Author: Akira Kitada (akitada) Date: 2009-06-23 16:29
Thanks for this great work!

Does Regexp 2.7 include Unicode Scripts support?
http://www.regular-expressions.info/unicode.html

Perl and Ruby support it and it's pretty handy.
msg89634 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-06-23 17:01
It includes Unicode character properties, but not the Unicode script
identification, because the Python Unicode database contains the former
but not the latter.

Although they could be added to the re module, IMHO their proper place
is in the Unicode database, from which the re module could access them.
msg89643 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2009-06-23 20:52
http://bugs.python.org/6331 is a patch that adds unicode script info to
the unicode database.
msg90954 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-07-26 19:11
issue2636-20090726.zip is a new implementation of the re engine. It
replaces re.py, sre.py, sre_constants.py, sre_parse.py and
sre_compile.py with a new re.py and replaces sre_constants.h, sre.h and
_sre.c with _re.h and _re.c.

The internal engine no longer interprets a form of bytecode but instead
follows a linked set of nodes, and it can work breadth-wise as well as
depth-first, which makes it perform much better when faced with one of
those 'pathological' regexes.

It supports scoped flags, variable-length lookbehind, Unicode
properties, named characters, atomic groups, possessive quantifiers, and
will handle zero-width splits correctly when the ZEROWIDTH flag is set.

There are a few more things to add, like allowing indexing for capture
groups, and further speed improvements might be possible (at worst it's
roughly the same speed as the existing re module).

I'll be adding some documentation about how it works and the slight
differences in behaviour later.
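
The breadth-wise idea can be illustrated with a toy Thompson-style
simulation: keep a set of live pattern positions and advance them all in
lock-step per input character, so no backtracking ever happens.  A minimal
sketch for a tiny pattern subset (literals, '.', and 'X*'); this is an
illustration of the principle only, not the engine in the zip:

```python
def bf_match(pattern, text):
    """Toy breadth-first full-match for literals, '.', and 'X*'."""
    # tokenize into (char, starred) pairs
    toks, i = [], 0
    while i < len(pattern):
        starred = pattern[i + 1:i + 2] == "*"
        toks.append((pattern[i], starred))
        i += 2 if starred else 1
    n = len(toks)

    def closure(states):
        # a starred token may also be skipped without consuming input
        out, stack = set(states), list(states)
        while stack:
            s = stack.pop()
            if s < n and toks[s][1] and s + 1 not in out:
                out.add(s + 1)
                stack.append(s + 1)
        return out

    states = closure({0})
    for ch in text:
        nxt = set()
        for s in states:
            if s < n and toks[s][0] in (ch, "."):
                nxt.add(s if toks[s][1] else s + 1)  # loop on 'X*', else advance
        states = closure(nxt)
    return n in states  # matched iff some thread reached the pattern's end
```

Because the state set is bounded by the pattern length, even many stacked
stars run in linear time over the text.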
msg90961 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-07-26 21:29
Sounds like this is an awesome piece of work!

Since the patch is obviously a very large piece and will be hard to
review, may I suggest releasing the new engine as a standalone package
and spreading the word, so that people can stress-test it?  By the time
2.7 is ready to release, if it has had considerable exposure to the
public, that will help acceptance greatly.

The Unicode script identification might not be hard to add to
unicodedata; maybe Martin can do that?
msg90985 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-07-27 16:13
issue2636-20090727.zip contains regex.py, _regex.h, _regex.c and also
_regex.pyd (for Python 2.6 on Windows). For Windows machines just put
regex.py and _regex.pyd into Python's Lib\site-packages folder. I've
changed the name so that it won't hide the re module.
msg90986 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-07-27 17:36
Agreed, a standalone release combined with a public announcement about
its availability is a must if we want to get any sort of wide spread
testing.

It'd be great if we had a fully characterized set of tests for the
behavior of the existing engine... but we don't.  So widespread testing
is important.
msg90989 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2009-07-27 17:53
We have lengthy sets of tests in Lib/test/regex_tests.py and
Lib/test/test_re.py.

While widespread testing of a standalone module would certainly be good,
I doubt that will exercise many corner cases and the more esoteric
features.  Most actual code probably uses relatively few regex pattern
constructs.
msg91028 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-07-29 00:56
issue2636-20090729.zip contains regex.py, _regex.h, _regex.c which will
work with Python 2.5 as well as Python 2.6, and also 2 builds of
_regex.pyd (for Python 2.5 and Python 2.6 on Windows).

This version supports accessing the capture groups by subscripting the
match object, for example:

>>> m = regex.match("(?<foo>.)(?<bar>.)", "abc")
>>> len(m)
3
>>> m[0]
'ab'
>>> m[1 : 3]
['a', 'b']
>>> m["foo"]
'a'
msg91035 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-07-29 11:10
Unfortunately I found a bug in regex.py, caused when I made it
compatible with Python 2.5. :-(

issue2636-20090729.zip is now corrected.
msg91038 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-07-29 13:01
Apparently Perl has a quite comprehensive set of tests at
http://perl5.git.perl.org/perl.git/blob/HEAD:/t/op/re_tests .
If we want the engine to be Perl-compatible, it might be a good idea to
reuse (part of) their tests (if their license allows it).
msg91245 - (view) Author: John Machin (sjmachin) Date: 2009-08-03 22:36
Problem is memory leak from repeated calls of e.g.
compiled_pattern.search(some_text). Task Manager performance panel shows
increasing memory usage with regex but not with re. It appears to be
cumulative i.e. changing to another pattern or text doesn't release memory.

Environment: Python 2.6.2, Windows XP SP3, latest (29 July) regex zip file.

Example:

8<-- regex_timer.py
import sys
import time
if sys.platform == 'win32':
    timer = time.clock
else:
    timer = time.time
module = __import__(sys.argv[1])
count = int(sys.argv[2])
pattern = sys.argv[3]
expected = sys.argv[4]
text = 80 * '~' + 'qwerty'
rx = module.compile(pattern)
t0 = timer()
for i in xrange(count):
    assert rx.search(text).group(0) == expected
t1 = timer()
print "%d iterations in %.6f seconds" % (count, t1 - t0)
8<---

Here are the results of running this (plus observed difference between
peak memory usage and base memory usage):

dos-prompt>\python26\python regex_timer.py regex 1000000 "~" "~"
1000000 iterations in 3.811500 seconds [60 Mb]

dos-prompt>\python26\python regex_timer.py regex 2000000 "~" "~"
2000000 iterations in 7.581335 seconds [128 Mb]

dos-prompt>\python26\python regex_timer.py re 2000000 "~" "~"
2000000 iterations in 2.549738 seconds [3 Mb]

This happens on a variety of patterns: "w", "wert", "[a-z]+", "[a-z]+t",
...
msg91250 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-08-04 01:30
issue2636-20090804.zip is a new version of the regex module.

The memory leak has been fixed.
msg91437 - (view) Author: Vlastimil Brom (vbr) Date: 2009-08-10 08:54
First, many thanks for this contribution; it's great, that the re 
module gets updated in that comprehensive way!

I'd like to report some issue with the current version 
(issue2636-20090804.zip).

Using an empty string as the search pattern ends up consuming system 
resources and the function doesn't return anything nor raise an 
exception or crash (within several minutes I tried).
The current re engine simply returns the empty matches on all character 
boundaries in this case.

I use win XPh SP3, the behaviour is the same on python 2.5.4 and 2.6.2:
It should be reproducible with the following simple code:

>>> import re
>>> import regex
>>> re.findall("", "abcde")
['', '', '', '', '', '']
>>> regex.findall("", "abcde")
_

regards
    vbr
msg91439 - (view) Author: John Machin (sjmachin) Date: 2009-08-10 10:58
Adding to vbr's report: [2.6.2, Win XP SP3] (1) bug mallocs memory
inside loop (2) also happens to regex.findall with patterns 'a{0,0}' and
'\B' (3) regex.sub('', 'x', 'abcde') has similar problem BUT 'a{0,0}'
and '\B' appear to work OK.
msg91448 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-08-10 14:18
issue2636-20090810.zip should fix the empty-string bug.
msg91450 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-08-10 15:04
issue2636-20090810#2.zip has some further improvements and bugfixes.
msg91460 - (view) Author: Vlastimil Brom (vbr) Date: 2009-08-10 19:27
I'd like to confirm, that the above reported error is fixed in 
issue2636-20090810#2.zip
While testing the new features a bit, I noticed some irregularity in 
handling the Unicode Character Properties; 
I tried randomly some of those mentioned at http://www.regular-expressions.info/unicode.html using the simple findall like above.

It seems that only the short abbreviated forms of the properties are 
supported; the long variants are handled in different ways.
Namely, the property names containing whitespace or other non-letter 
characters cause some probably unexpected exception:

>>> regex.findall(ur"\p{Ll}", u"abcDEF")
[u'a', u'b', u'c']
# works ok

\p{LowercaseLetter} isn't supported, but seems to be handled, as it 
throws "error: undefined property name" at the end of the traceback.

\p{Lowercase Letter} \p{Lowercase_Letter} \p{Lowercase-Letter} 
probably aren't handled as expected; the traceback is:

>>> regex.findall(ur"\p{Lowercase_Letter}", u"abcDEF")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python25\lib\regex.py", line 194, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Python25\lib\regex.py", line 386, in _compile
    parsed = _parse_pattern(source, info)
  File "C:\Python25\lib\regex.py", line 465, in _parse_pattern
    branches = [_parse_sequence(source, info)]
  File "C:\Python25\lib\regex.py", line 477, in _parse_sequence
    item = _parse_item(source, info)
  File "C:\Python25\lib\regex.py", line 485, in _parse_item
    element = _parse_element(source, info)
  File "C:\Python25\lib\regex.py", line 610, in _parse_element
    return _parse_escape(source, info, False)
  File "C:\Python25\lib\regex.py", line 844, in _parse_escape
    return _parse_property(source, ch == "p", here, in_set)
  File "C:\Python25\lib\regex.py", line 983, in _parse_property
    if info.local_flags & IGNORECASE and not in_set:
NameError: global name 'info' is not defined
>>> 

Of course, arbitrary strings other than properties names are handled 
identically.

Python 2.6.2 version behaves the same like 2.5.4.

vbr
msg91462 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2009-08-10 22:02
for each of these discrepancies that you're finding, please consider 
submitting them as patches that add a unittest to the existing test 
suite.  otherwise their behavior guarantees will be lost regardless of if 
the suite in this issue is adopted.  thanks!

I'll happily commit any passing re module unittest additions.
msg91463 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-08-10 22:42
issue2636-20090810#3.zip adds more Unicode character properties such as
"\p{Lowercase_Letter}", and also Unicode script ranges.

In addition, the 'findall' method now accepts an 'overlapped' argument
for finding overlapped matches. For example:

>>> regex.findall(r"(..)", "abc")
['ab']
>>> regex.findall(r"(..)", "abc", overlapped=True)
['ab', 'bc']
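
For reference, the stock re module can already emulate overlapped matching
by capturing inside a zero-width lookahead; nothing is consumed, so the
scan advances one character at a time:

```python
import re

plain = re.findall(r"(..)", "abc")      # non-overlapped
lapped = re.findall(r"(?=(..))", "abc") # overlapped via lookahead capture
```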
msg91473 - (view) Author: Vlastimil Brom (vbr) Date: 2009-08-11 11:15
Sorry for the dumb question, which may also suggest, that I'm 
unfortunately unable to contribute at this level (with zero knowledge 
of C and only "working" one for Python):
Where can I find the sources for tests etc. and how they are eventually 
to be submitted? Is some other account needed besides the one for 
bugs.python.org?

Anyway, the long character properties now work in the latest version 
issue2636-20090810#3.zip

In the mentioned overview 
http://www.regular-expressions.info/unicode.html
there is a statement for the property names: "You may omit the 
underscores or use hyphens or spaces instead." 
While I'm not sure, that it is a good thing to have that many 
variations, they should probably be handled in the same way.

Now, the whitespace (and also non ascii characters) in the property 
name seem to confuse the parser: these pass silently (don't match 
anything) and don't throw an exception like "undefined property name".

cf.

>>> regex.findall(ur"\p{Dummy Property}", u"abcDEF")
[]
>>> regex.findall(ur"\p{DümmýPrópërtý}", u"abcDEF")
[]
>>> regex.findall(ur"\p{DummyProperty}", u"abcDEF")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 195, in findall
  File "regex.pyc", line 563, in _compile
  File "regex.pyc", line 642, in _parse_pattern
  File "regex.pyc", line 654, in _parse_sequence
  File "regex.pyc", line 662, in _parse_item
  File "regex.pyc", line 787, in _parse_element
  File "regex.pyc", line 1021, in _parse_escape
  File "regex.pyc", line 1159, in _parse_property
error: undefined property name 'DummyProperty'
>>> 

vbr
msg91474 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-08-11 12:59
Take a look a the dev FAQ, linked from http://www.python.org/dev.  The
tests are in Lib/test in a distribution installed from source, but
ideally you would be (anonymously) pulling the trunk from SVN (when it
is back) and creating your patches with respect to that code as
explained in the FAQ.  You would be adding unit test code to
Lib/test/test_re.py, though it looks like re_tests.py might be an
interesting file to look at as well.

As the dev docs say, anyone can contribute, and writing tests is a great
way to start, so please don't feel like you aren't qualified to
contribute, you are.  If you have questions, come to #python-dev on
freenode.
msg91490 - (view) Author: John Machin (sjmachin) Date: 2009-08-12 03:00
What is the expected timing comparison with re? Running the Aug10#3
version on Win XP SP3 with Python 2.6.3, I see regex typically running
at only 20% to %50 of the speed of re in ASCII mode, with
not-very-atypical tests (find all Python identifiers in a line, failing
search for a Python identifier in an 80-byte text). Is the supplied
_regex.pyd from some sort of debug or unoptimised build? Here are some
results:

dos-prompt>\python26\python -mtimeit -s"import re as
x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='    def __init__(self, arg1,
arg2):\n'" "r.findall(t)"
100000 loops, best of 3: 5.32 usec per loop

dos-prompt>\python26\python -mtimeit -s"import regex as
x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='    def __init__(self, arg1,
arg2):\n'" "r.findall(t)"
100000 loops, best of 3: 12.2 usec per loop

dos-prompt>\python26\python -mtimeit -s"import re as
x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='1234567890'*8" "r.search(t)"
1000000 loops, best of 3: 1.61 usec per loop

dos-prompt>\python26\python -mtimeit -s"import regex as
x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='1234567890'*8" "r.search(t)"
100000 loops, best of 3: 7.62 usec per loop

Here's the worst case that I've found so far:

dos-prompt>\python26\python -mtimeit -s"import re as
x;r=x.compile(r'z{80}');t='z'*79" "r.search(t)"
1000000 loops, best of 3: 1.19 usec per loop

dos-prompt>\python26\python -mtimeit -s"import regex as
x;r=x.compile(r'z{80}');t='z'*79" "r.search(t)"
1000 loops, best of 3: 334 usec per loop

See Friedl: "length cognizance". Corresponding figures for match() are
1.11 and 8.5.
msg91495 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2009-08-12 12:04
</lurk>
Re: timings

Thanks for the info, John.  First of all, I really like those tests;
could you please submit a patch or other document so that we could
combine them into the python test suite?

The Python test suite, which can be run as part of 'make test' (IIRC
there is also a way to run just the two re test suites, though the
exact command escapes me), includes built-in timing output over some of
the tests, though I don't recall which ones were being timed: standard
cases or pathological (rare) ones.  Either way, we should include some
timings of a standard nature in the test suite to make Matthew's and
any other developer's work easier.

So, John, if you are not familiar with the test suite, I can look into
adding the specific cases you've developed so that we can have a more
representative timing of things.  Remember, though, that at least in
the existing engine, the re compiler caches recent compiles, so within
a single run repeatedly compiling an expression flattens the overhead
to a single compile and lookups, whereas your tests recompile at each
test (though I'm not sure what timeit is doing: if it invokes a new
instance of Python each time, it recompiles each time; if it reuses
the instance, it compiles only once).

Having not looked at Matthew's regex code recently (nice name, BTW), I
don't know if it also contains the compiled-expression cache, in which
case adding it might help timings.  Originally, the cache worked by
storing ~100 entries and cleared itself completely when full; I have a
modification which increases this to 256 (IIRC) and removes only the
128 oldest entries, to prevent thrashing at the boundary, which I
think is better if only for a particular pathological case.

In any case, don't despair at these numbers, Matthew: you have a lot of
time and potentially a lot of ways to make your engine faster by the
time 1.7 alpha is coined.  But also be forewarned, because, knowing what
I know about the current re engine and what it is further capable of, I
don't think your regex will be replacing re in 1.7 if it isn't at least
as fast as the existing engine for some standard set of agreed upon
tests, no matter how many features you can add.  I have no doubt, with a
little extra monkey grease, we could implement all new features in the
existing engine.  I don't want to have to reinvent the wheel, of course,
and if Matthew's engine can pick up some speed everybody wins!  So, keep
up the good work Matthew, as it's greatly appreciated!

Thanks all!

Jeffrey.

<lurk>
msg91496 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-08-12 12:16
> Remember, though, that
> when run as a single instance, at least in the existing engine, the re
> compiler caches recent compiles, so repeatedly compiling an expression
> flattens the overhead in a single run to a single compile and lookup,
> where as your tests recompile at each test

They don't. The pattern is compiled only once. Please take a look at
http://docs.python.org/library/timeit.html#command-line-interface
msg91497 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2009-08-12 12:29
Mea culpa, and my apologies,

The '-s' options in John's commands are indeed executed only once --
they are one-time setup lines.  The final quoted expression is what's
run multiple times.

In other words, improving caching in regex will not help.  >sigh<

Thanks, Antoine!

Jeffrey.
msg91500 - (view) Author: Collin Winter (collinwinter) * (Python committer) Date: 2009-08-12 18:01
FYI, Unladen Swallow includes several regex benchmark suites: a port of 
V8's regex benchmarks (regex_v8); some of the regexes used when tuning 
the existing sre engine 7-8 years ago (regex_effbot); and a 
regex_compile benchmark that tests regex compilation time.

See http://code.google.com/p/unladen-swallow/wiki/Benchmarks for more 
details, including how to check out and run the benchmark suite. You'll 
need to modify your experimental Python build to have "import re" import 
the proposed regex engine, rather than _sre. The benchmark command would 
look something like `./perf.py -r -b regex /control/python 
/experiment/python`, which will run all the regex benchmarks in rigorous 
mode. I'll be happy to answer any questions you have about our 
benchmarks.

I'd be very interested to see how the proposed regex engine performs on 
these tests.
msg91535 - (view) Author: Alex Willmer (moreati) * Date: 2009-08-13 21:14
I've made an installable package of Matthew Barnett's patch. It may
bring this to a wider audience.

http://pypi.python.org/pypi/regex

Next I'll look at incorporating Andrew Kuchling's suggestion of the re
tests from CPython.
msg91598 - (view) Author: Mark Summerfield (mark) Date: 2009-08-15 07:49
Hi,

I've noticed 3 differences between the re and regex engines. 
I don't know if they are intended or not, but thought it best to mention
them. (I used the issue2636-20090810#3.zip version.)

Python 2.6.2 (r262:71600, Apr 20 2009, 09:25:38) 
[GCC 4.3.2 20081105 (Red Hat 4.3.2-7)] on linux2
IDLE 2.6.2      
>>> import re, regex
>>> ############################################################ 1 of 3
>>> re1= re.compile(r"""
                    (?!<\w)(?P<name>[-\w]+)=
                    (?P<quote>(?P<single>')|(?P<double>"))?
                    (?P<value>(?(single)[^']+?|(?(double)[^"]+?|\S+)))
                    (?(quote)(?P=quote))
                    """, re.VERBOSE)
>>> re2= regex.compile(r"""
                    (?!<\w)(?P<name>[-\w]+)=
                    (?P<quote>(?P<single>')|(?P<double>"))?
                    (?P<value>(?(single)[^']+?|(?(double)[^"]+?|\S+)))
                    (?(quote)(?P=quote))
                    """, re.VERBOSE)
>>> text = "<table border='1'>"
>>> re1.findall(text)
[('border', "'", "'", '', '1')]
>>> re2.findall(text)
[]
>>> text = "<table border=1>"
>>> re1.findall(text)
[('border', '', '', '', '1>')]
>>> re2.findall(text)
[]
>>> ############################################################ 2 of 3
>>> re1 = re.compile(r"""^[ \t]*
                         (?P<parenthesis>\()?
                         [- ]?
                         (?P<area>\d{3})
                         (?(parenthesis)\))
                         [- ]?
                         (?P<local_a>\d{3})
                         [- ]?
                         (?P<local_b>\d{4})
                         [ \t]*$
                         """, re.VERBOSE)
>>> re2 = regex.compile(r"""^[ \t]*
                         (?P<parenthesis>\()?
                         [- ]?
                         (?P<area>\d{3})
                         (?(parenthesis)\))
                         [- ]?
                         (?P<local_a>\d{3})
                         [- ]?
                         (?P<local_b>\d{4})
                         [ \t]*$
                         """, re.VERBOSE)
>>> data = ("179-829-2116", "(187) 160 0880", "(286)-771-3878",
"(291) 835-9634", "353-896-0505", "(555) 555 5555", "(555) 555-5555",
"(555)-555-5555", "555 555 5555", "555 555-5555", "555-555-5555",
"601 805 3142", "(675) 372 3135", "810 329 7071", "(820) 951 3885",
"942 818-5280", "(983)8792282")
>>> for d in data:
	ans1 = re1.findall(d)
	ans2 = re2.findall(d)
	print "re=%s rx=%s %d" % (ans1, ans2, ans1 == ans2)

re=[('', '179', '829', '2116')] rx=[('', '179', '829', '2116')] 1
re=[('(', '187', '160', '0880')] rx=[] 0
re=[('(', '286', '771', '3878')] rx=[('(', '286', '771', '3878')] 1
re=[('(', '291', '835', '9634')] rx=[] 0
re=[('', '353', '896', '0505')] rx=[('', '353', '896', '0505')] 1
re=[('(', '555', '555', '5555')] rx=[] 0
re=[('(', '555', '555', '5555')] rx=[] 0
re=[('(', '555', '555', '5555')] rx=[('(', '555', '555', '5555')] 1
re=[('', '555', '555', '5555')] rx=[] 0
re=[('', '555', '555', '5555')] rx=[] 0
re=[('', '555', '555', '5555')] rx=[('', '555', '555', '5555')] 1
re=[('', '601', '805', '3142')] rx=[] 0
re=[('(', '675', '372', '3135')] rx=[] 0
re=[('', '810', '329', '7071')] rx=[] 0
re=[('(', '820', '951', '3885')] rx=[] 0
re=[('', '942', '818', '5280')] rx=[] 0
re=[('(', '983', '879', '2282')] rx=[('(', '983', '879', '2282')] 1
>>> ############################################################ 3 of 3
>>> re1 = re.compile(r"""
<img\s+[^>]*?src=(?:(?P<quote>["'])(?P<qimage>[^\1>]+?)   
(?P=quote)|(?P<uimage>[^"' >]+))[^>]*?>""", re.VERBOSE)
>>> re2 = regex.compile(r"""
<img\s+[^>]*?src=(?:(?P<quote>["'])(?P<qimage>[^\1>]+?)   
(?P=quote)|(?P<uimage>[^"' >]+))[^>]*?>""", re.VERBOSE)
>>> data = """<body> <img src='a.png'> <img alt='picture' src="b.png">
              <img alt="picture" src="Big C.png" other="xyx">
              <img src=icon.png alt=icon>
              <img src="I'm here!.jpg" alt="aren't I?">"""
>>> data = data.split("\n")
>>> data = [x.strip() for x in data]
>>> for d in data:
	ans1 = re1.findall(d)
	ans2 = re2.findall(d)
	print "re=%s rx=%s %d" % (ans1, ans2, ans1 == ans2)

re=[("'", 'a.png', '')] rx=[("'", 'a.png', '')] 1
re=[('"', 'b.png', '')] rx=[('"', 'b.png', '')] 1
re=[('"', 'Big C.png', '')] rx=[('"', 'Big C.png', '')] 1
re=[('', '', 'icon.png')] rx=[('', '', 'icon.png alt=icon')] 0
re=[('"', "I'm here!.jpg", '')] rx=[('"', "I'm here!.jpg", '')] 1

I'm sorry I haven't had the time to try to minimize the examples, but I
hope that at least they will prove helpful.

Number 3 looks like a problem with non-greedy matching; I don't know
about the others.
msg91607 - (view) Author: John Machin (sjmachin) Date: 2009-08-15 14:02
Simplification of mark's first two problems:

Problem 1: looks like regex's negative look-ahead assertion is broken
>>> re.findall(r'(?!a)\w', 'abracadabra')
['b', 'r', 'c', 'd', 'b', 'r']
>>> regex.findall(r'(?!a)\w', 'abracadabra')
[]


Problem 2: in VERBOSE mode, regex appears to be ignoring spaces inside
character classes

>>> import re, regex
>>> pat = r'(\w)([- ]?)(\w{4})'
>>> for data in ['abbbb', 'a-bbbb', 'a bbbb']:
...    print re.compile(pat).findall(data), regex.compile(pat).findall(data)
...    print re.compile(pat, re.VERBOSE).findall(data),
regex.compile(pat,regex.
VERBOSE).findall(data)
...
[('a', '', 'bbbb')] [('a', '', 'bbbb')]
[('a', '', 'bbbb')] [('a', '', 'bbbb')]
[('a', '-', 'bbbb')] [('a', '-', 'bbbb')]
[('a', '-', 'bbbb')] [('a', '-', 'bbbb')]
[('a', ' ', 'bbbb')] [('a', ' ', 'bbbb')]
[('a', ' ', 'bbbb')] []

HTH,
John
msg91610 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2009-08-15 16:12
issue2636-20090815.zip fixes the bugs found in msg91598 and msg91607.

The regex engine currently lacks some of the optimisations that the re
engine has, but I've concluded that even with them the extra work that
the engine needs to do to make it easy to switch to breadth-wise
matching when needed is slowing it down too much (if it's matching only
depth-first then it can save only the changes to the 'context', but if
it's matching breadth-wise then it needs to duplicate the entire 'context').

I'm therefore seeing whether I can have 2 engines internally, one
optimised for depth-first and the other for breadth-wise, and switch
from the former to the latter if matching is taking too long.
msg91671 - (view) Author: Alex Willmer (moreati) * Date: 2009-08-17 20:29
Matthew's 20090815.zip attachment is now on PyPI. This one, having a
more complete MANIFEST, will build for people other than me.
msg91917 - (view) Author: Vlastimil Brom (vbr) Date: 2009-08-24 12:55
I'd like to add some detail to the previous msg91473

The current behaviour of the character properties looks a bit 
surprising sometimes:

>>> 
>>> regex.findall(ur"\p{UppercaseLetter}", u"QW\p{UppercaseLetter}as")
[u'Q', u'W', u'U', u'L']
>>> regex.findall(ur"\p{Uppercase Letter}", u"QW\p{Uppercase Letter}as")
[u'\\p{Uppercase Letter}']
>>> regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p
{UppercaseÄÄÄLetter}as")
[u'\\p{Uppercase\xc4\xc4\xc4Letter}']
>>> regex.findall(ur"\p{UppercaseQQQLetter}", u"QW\p
{UppercaseQQQLetter}as")

Traceback (most recent call last):
  File "<pyshell#34>", line 1, in <module>
    regex.findall(ur"\p{UppercaseQQQLetter}", u"QW\p
{UppercaseQQQLetter}as")
...
  File "C:\Python26\lib\regex.py", line 1178, in _parse_property
    raise error("undefined property name '%s'" % name)
error: undefined property name 'UppercaseQQQLetter'
>>> 

i.e. potential property names consisting only of ASCII letters
(+ _, -) are looked up and either used or an error is raised;
other names (containing whitespace or non-ASCII letters) aren't treated
as a special expression, hence they either match their literal value
or simply don't match (without errors).
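[Editor's note: the rule described above could be sketched as a simple predicate; the name and exact character set here are assumptions for illustration, not the module's actual code.]

```python
import re

# Sketch of the fallback rule described above: a \p{...} name made
# only of ASCII letters, '_' and '-' is looked up as a property name;
# anything else falls back to matching literally.
_NAME = re.compile(r'\A[A-Za-z_-]+\Z')

def is_property_lookup(name):
    return _NAME.match(name) is not None
```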

Is this the intended behaviour?
I am not sure whether it is defined somewhere, or whether there are
some de facto standards for this...
I guess the space in the property names might be allowed (unless there
are some implications for the parser...); otherwise the fallback
handling of invalid property names as normal strings is probably the
expected way.
vbr
msg97860 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-01-16 03:00
issue2636-20100116.zip is a new version of the regex module.

I've given up on the breadth-wise matching - it was too difficult finding a pattern structure that would work well for both depth-first and breadth-wise. It probably still needs some tweaks and tidying up, but I thought I might as well release something!
msg98809 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-04 02:34
issue2636-20100204.zip is a new version of the regex module.

I've added splititer and added a build for Python 3.1.
msg99072 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-08 23:45
Hi, thanks for the update! 
Just in case it hasn't been noticed so far: using Python 2.6.4 or 2.5.4 with the regex build issue2636-20100204.zip,
I am getting the following easy-to-fix error:

Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\regex.py", line 2003
    print "Header file written at %s\n" % os.path.abspath(header_file.name))
                                                                           ^
SyntaxError: invalid syntax

After removing the extra closing paren in regex.py, line 2003, everything seems ok.
   vbr
msg99132 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-09 17:38
I'd like to add another issue I encountered with the latest version of regex - issue2636-20100204.zip

It seems that there is an error in handling some quantifiers in Python 2.5

on
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32

I get e.g.:

>>> regex.findall(ur"q*", u"qqwe")

Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    regex.findall(ur"q*", u"qqwe")
  File "C:\Python25\lib\regex.py", line 213, in findall
    return _compile(pattern, flags).findall(string, overlapped=overlapped)
  File "C:\Python25\lib\regex.py", line 633, in _compile
    p = _regex.compile(pattern, info.global_flags | info.local_flags, code, info.group_index, index_group)
RuntimeError: invalid RE code

There is the same error for other possibly "infinite" quantifiers like "q+", "q{0,}" etc. with their non-greedy and possessive variants.

On Python 2.6 and 3.1 all these patterns work without errors.

vbr
msg99148 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-10 02:20
issue2636-20100210.zip is a new version of the regex module.

The reported bugs appear to be fixed now.
msg99186 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-11 01:09
Thanks for the quick update;
I confirm the fix for both issues.
Just another finding (while testing the behaviour mentioned previously - msg91917):

The property name normalisation seems to be much more robust now; I just encountered an encoding error using a rather artificial input (in Python 2.5, 2.6):

>>> regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")
  File "C:\Python25\lib\regex.py", line 213, in findall
    return _compile(pattern, flags).findall(string, overlapped=overlapped)
  File "C:\Python25\lib\regex.py", line 599, in _compile
    parsed = _parse_pattern(source, info)
  File "C:\Python25\lib\regex.py", line 690, in _parse_pattern
    branches = [_parse_sequence(source, info)]
  File "C:\Python25\lib\regex.py", line 702, in _parse_sequence
    item = _parse_item(source, info)
  File "C:\Python25\lib\regex.py", line 710, in _parse_item
    element = _parse_element(source, info)
  File "C:\Python25\lib\regex.py", line 837, in _parse_element
    return _parse_escape(source, info, False)
  File "C:\Python25\lib\regex.py", line 1098, in _parse_escape
    return _parse_property(source, info, in_set, ch)
  File "C:\Python25\lib\regex.py", line 1240, in _parse_property
    raise error("undefined property name '%s'" % name)
error: <unprintable error object>
>>> 

Not sure how this should be fixed (i.e. whether the error message should be changed to Unicode, if applicable).

Not surprisingly, in python 3.1, there is a correct message at the end:

regex.error: undefined property name 'UppercaseÄÄÄLetter'

vbr
msg99190 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-11 02:16
I've been aware for some time that exception messages in Python 2 can't be Unicode, but I wasn't sure which encoding to use, so I've decided to use that of sys.stdout.

It appears to work OK in IDLE and at the Python prompt.

issue2636-20100211.zip is the new version of the regex module.
msg99462 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-17 04:09
issue2636-20100217.zip is a new version of the regex module.

It includes a fix for issue #7940.
msg99470 - (view) Author: Alex Willmer (moreati) * Date: 2010-02-17 13:01
I've packaged this latest revision and uploaded to PyPI http://pypi.python.org/pypi/regex
msg99479 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-17 19:35
The main text at http://pypi.python.org/pypi/regex appears to have lost its backslashes, for example:

    The Unicode escapes uxxxx and Uxxxxxxxx are supported.

instead of:

    The Unicode escapes \uxxxx and \Uxxxxxxxx are supported.
msg99481 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-17 23:43
I just tested the fix for unicode tracebacks and found some possibly weird results (not sure how/whether it should be fixed, as these inputs are indeed rather artificial...).
(win XPp SP3 Czech, Python 2.6.4)

Using the cmd console, the output is fine (for the characters it can accept and display)

>>> regex.findall(ur"\p{InBasicLatinĚ}", u"aé")
Traceback (most recent call last):
...
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
regex.error: undefined property name 'InBasicLatinĚ'
>>>

(same result for other distorted "property names" containing e.g. ěščřžýáíéúůßäëiöüîô ...)

However, in Idle the output differs depending on the characters present

>>> regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
yields the expected
...
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
error: undefined property name 'InBasicLatinÉ'

but

>>> regex.findall(ur"\p{InBasicLatinĚ}", u"ab c")

Traceback (most recent call last):
...
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
  File "C:\Python26\lib\regex.py", line 167, in __init__
    message = message.encode(sys.stdout.encoding)
  File "C:\Python26\lib\encodings\cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xcc' in position 37: character maps to <undefined>
>>> 

which might be surprising, as cp1250 should be able to encode "Ě"; maybe there is some intermediate ASCII step?

Using the wxPython PyShell I get its specific encoding error:

regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
Traceback (most recent call last):
...
  File "C:\Python26\lib\regex.py", line 1102, in _parse_escape
    return _parse_property(source, info, in_set, ch)
  File "C:\Python26\lib\regex.py", line 1244, in _parse_property
    raise error("undefined property name '%s'" % name)
  File "C:\Python26\lib\regex.py", line 167, in __init__
    message = message.encode(sys.stdout.encoding)
AttributeError: PseudoFileOut instance has no attribute 'encoding'

(the same for \p{InBasicLatinĚ} etc.)


In python 3.1 in Idle, all of these exceptions are displayed correctly, also in other scripts or with special characters.

Maybe in Python 2.x e.g. repr(...) of the Unicode error messages could be used in order to avoid these problems, but I don't know what the conventions are in these cases.
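[Editor's note: one robust approach, shown here in Python 3 style with a hypothetical helper name, is to fall back to backslash escapes instead of raising a secondary UnicodeEncodeError while formatting the message.]

```python
import sys

# Hypothetical helper: round-trip an error message through the
# console's encoding, replacing unencodable characters with
# backslash escapes rather than raising UnicodeEncodeError.
def safe_message(message, encoding=None):
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return message.encode(encoding, 'backslashreplace').decode(encoding)

print(safe_message("undefined property name 'InBasicLatin\u011a'"))
```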


Another issue I found here (unrelated to tracebacks) is backslashes or punctuation (except the handled -_) in the property names, which just lead to failed matches and no exceptions about unknown property names:
regex.findall(u"\p{InBasic.Latin}", u"ab c")
[]


I was also surprised by the added pos/endpos parameters, as I used flags as a non-keyword third parameter for the re functions in my code (probably my fault ...)

re.findall(pattern, string, flags=0)

regex.findall(pattern, string, pos=None, endpos=None, flags=0, overlapped=False)

(is there a specific reason for this order, or could it be changed to maintain compatibility with the current re module?)

I hope, at least some of these remarks make some sense;
  thanks for the continued work on this module!

   vbr
msg99494 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-18 03:03
issue2636-20100218.zip is a new version of the regex module.

I've added '.' to the permitted characters when parsing the name of a property. The name itself is no longer reported in the error message.

I've also corrected the positions of the 'pos' and 'endpos' arguments:

    regex.findall(pattern, string, flags=0, pos=None, endpos=None, overlapped=False)
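[Editor's note: for comparison, the stdlib re module exposes pos and endpos only on compiled-pattern methods, positionally after the string argument.]

```python
import re

# In stdlib re, pos/endpos come after the string on compiled
# patterns; the module-level functions don't accept them at all.
p = re.compile(r'\w+')
print(p.findall('abc def ghi', 4, 7))      # words within [4, 7)
print(p.search('abc def ghi', 4).group())  # first word from pos 4
```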
msg99548 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-19 00:29
Thanks for fixing the argument positions;
unfortunately, it seems there might be some other problem that makes my code work differently from the builtin re:
it seems that in character classes the ignorecase flag is ignored somehow:

>>> regex.findall(r"[ab]", "aB", regex.I)
['a']
>>> re.findall(r"[ab]", "aB", re.I)
['a', 'B']
>>> 

(The same with the flag set in the pattern.)

Outside of the character class the case seems to be handled normally, or am I missing something?

vbr
msg99552 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-19 01:31
issue2636-20100219.zip is a new version of the regex module.

The regex module should give the same results as the re module for backwards compatibility.

The ignorecase bug is now fixed.

This new version releases the GIL when matching on str and bytes (str and unicode in Python 2.x).
msg99665 - (view) Author: Alex Willmer (moreati) * Date: 2010-02-21 14:46
On 17 February 2010 19:35, Matthew Barnett <report@bugs.python.org> wrote:
> The main text at http://pypi.python.org/pypi/regex appears to have lost its backslashes, for example:
>
>    The Unicode escapes uxxxx and Uxxxxxxxx are supported.
>
> instead of:
>
>    The Unicode escapes \uxxxx and \Uxxxxxxxx are supported.

Matthew, as you no doubt realised, that text is read straight from the
Features.txt file. PyPI interprets it as reStructuredText, which uses
\ as an escape character in various cases. Do you intentionally write
Features.txt as reStructuredText? If so, here is a patch that escapes
the \ characters as appropriate; otherwise I'll work out how to make
PyPI read it as plain text.

Regards, Alex
-- 
Alex Willmer <alex@moreati.org.uk>
http://moreati.org.uk/blog
msg99668 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-21 16:21
To me the extension .txt means plain text. Is there a specific extension for ReStructuredText, eg .rst?
msg99835 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-22 21:24
issue2636-20100222.zip is a new version of the regex module.

This new version adds reverse searching.

The 'features' now come in ReStructuredText (.rst) and HTML.
msg99863 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-22 22:51
Is the issue2636-20100222.zip archive supposed to be complete? I can find neither the rst/html "features" nor, more importantly, the py and pyd files for the particular versions.

Anyway, I just skimmed through the regular-expressions.info documentation and found that most features which I missed in the builtin re seem to be present in the regex module;
a few possibly notable exceptions being some Unicode features:
http://www.regular-expressions.info/unicode.html
support for Unicode script properties might be needlessly complex (maybe unless http://bugs.python.org/issue6331 is implemented)

On the other hand, \X for matching any single grapheme might be useful; according to the mentioned page, the currently working equivalent would be
\P{M}\p{M}*
However, I am not sure about the compatibility concerns; it is possible that the modifier characters being part of graphemes might cause some discrepancies in the text indices etc.
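[Editor's note: the documented \P{M}\p{M}* equivalent can be approximated in plain Python with unicodedata; this is a rough sketch only, as full grapheme segmentation (UAX #29) has more rules than combining marks alone.]

```python
import unicodedata

# Rough approximation of \X as \P{M}\p{M}*: a base character
# followed by any combining marks (Unicode general category M*).
def graphemes(s):
    out = []
    for ch in s:
        if out and unicodedata.category(ch).startswith('M'):
            out[-1] += ch  # attach combining mark to its base
        else:
            out.append(ch)
    return out

print(graphemes('e\u0301a'))  # 'e' + COMBINING ACUTE forms one grapheme
```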

A feature where I personally (currently) can't find a use case is \G and continuing matches (but no doubt there would be some cases for this).

regards
   vbr
msg99872 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-22 23:28
I don't know what happened there. I didn't notice that the zip file was way too small. Here's a replacement (still called issue2636-20100222.zip).

Unicode script properties are already included, at least those whose definitions are listed at http://www.regular-expressions.info/unicode.html
I hadn't noticed \X before. I'll have a look at it.

As for \G, .findall performs searches normally, but when using \G it effectively performs contiguous matches only, which can be useful when you need it!
msg99888 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-23 00:39
OK, you've convinced me, \X is supported. :-)

issue2636-20100223.zip is a new version of the regex module.
msg99890 - (view) Author: Alex Willmer (moreati) * Date: 2010-02-23 00:47
On 22 Feb 2010, at 21:24, Matthew Barnett <report@bugs.python.org>  
wrote:

> issue2636-20100222.zip is a new version of the regex module.
>
> This new version adds reverse searching.
>
> The 'features' now come in ReStructuredText (.rst) and HTML

Thank you, Matthew. My laptop is out of action, so it will be a few  
days before I can upload a new version to PyPI.

If you would prefer to have control of the PyPI package, or to share  
control, please let me know.

Alex
msg99892 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-23 01:31
Wow, that's what can be called rapid development :-), thanks very much!
I hadn't noticed before that \G had been implemented already.
\X works fine for me; it also maintains the input string indices correctly.

We can use Unicode character properties \p{Letter} and Unicode block properties \p{InBasicLatin};
the script properties like \p{Latin} or \p{IsLatin} return "undefined property name".
I guess this would require access to the respective information in unicodedata, where it isn't available now (there also seem to be many more scripts than those mentioned at regular-expressions.info);
cf.
http://www.unicode.org/Public/UNIDATA/Scripts.txt
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt (under "# Script (sc)").

vbr
msg100066 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-24 20:25
issue2636-20100224.zip is a new version of the regex module.

It includes support for matching based on Unicode scripts as well as on Unicode blocks and properties.
msg100076 - (view) Author: Vlastimil Brom (vbr) Date: 2010-02-24 23:14
Thanks, it's indeed a very nice addition to the library...
Just a marginal remark: it seems that the script names also cover some non-BMP characters, whereas the Unicode block ranges cover only the BMP.
http://www.unicode.org/Public/UNIDATA/Blocks.txt

Am I missing something more complex, or is there a reason why the
10000.. - ..10FFFF ranges weren't included in _BLOCKS?
Maybe building these ranges is expensive, in contrast to the rare uses of these properties?

(Not that I am able to reliably test it on my "narrow" Python build on Windows, but currently, obviously, e.g. \p{InGothic} gives "undefined property name" whereas \p{Gothic} is accepted.)

vbr
msg100080 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-25 00:12
It was more of an oversight.

issue2636-20100225.zip now contains the full list of both blocks and scripts.
msg100134 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-02-26 03:20
issue2636-20100226.zip is a new version of the regex module.

It now supports the branch reset (?|...|...), enabling the different branches of an alternation to reuse group numbers.
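[Editor's note: to illustrate what branch reset buys, here is the situation without it in stdlib re, where each alternative owns its own group numbers and the caller must check both; a (?|...) pattern in regex would instead deliver either match in group 1.]

```python
import re

# Without branch reset, the two alternatives get distinct group
# numbers, so exactly one of groups 1 and 2 is populated.
m = re.match(r'(\d+)|([a-z]+)', 'abc')
print(m.groups())                 # only the second group matched
value = m.group(1) or m.group(2)  # caller must merge them by hand
print(value)
```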
msg100152 - (view) Author: Alex Willmer (moreati) * Date: 2010-02-26 14:36
On 26 February 2010 03:20, Matthew Barnett <report@bugs.python.org> wrote:
> Added file: http://bugs.python.org/file16375/issue2636-20100226.zip

This is now uploaded to PyPI http://pypi.python.org/pypi/regex/0.1.20100226
-- 
Alex Willmer <alex@moreati.org.uk>
http://moreati.org.uk/blog
msg100359 - (view) Author: Vlastimil Brom (vbr) Date: 2010-03-03 23:48
I just noticed a corner case with the newly introduced grapheme matcher \X when it is used in a character class:

>>> regex.findall("\X", "abc")
['a', 'b', 'c']
>>> regex.findall("[\X]", "abc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 218, in findall
  File "regex.pyc", line 1435, in _compile
  File "regex.pyc", line 2351, in optimise
  File "regex.pyc", line 2705, in optimise
  File "regex.pyc", line 2798, in optimise
  File "regex.pyc", line 2268, in __hash__
AttributeError: '_Sequence' object has no attribute '_key'

It obviously doesn't make much sense to use this universal literal in a character class (the same goes for "." in its metacharacter role), and http://www.regular-expressions.info/refunicode.html doesn't mention this possibility either; but the error message should probably be more descriptive, or the pattern might match "X", or "\" and "\X" (?)

I was originally thinking about the possibility of combining positive and negative character classes, where e.g. \X would be a kind of base; I am not aware of any re engine supporting this, but I eventually found Unicode guidelines for regular expressions, which also cover this:

http://unicode.org/reports/tr18/#Subtraction_and_Intersection

It also surprises me a bit that these are all included in
Basic Unicode Support: Level 1 (even with arbitrary unions, intersections, differences ...); it suggests that there is probably no implementation available (AFAIK) -- even on this basic level, according to this guideline.

Among other features on this level, the section
http://unicode.org/reports/tr18/#Supplementary_Characters
seems useful, especially the handling of the characters beyond \uffff, also in the form of surrogate pairs as single characters.

This might be useful on the narrow Python builds, but it is possible that there would be an incompatibility with the handling of these data in "narrow" Python itself.

Just some suggestions or rather remarks, as you already implemented many advanced features and are also considering some different approaches ...:-)

vbr
msg100362 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-03-04 00:41
\X shouldn't be allowed in a character class because it's equivalent to \P{M}\p{M}*. It's a bug, now fixed in issue2636-20100304.zip.

I'm not convinced about the set intersection and difference stuff. Isn't that overdoing it a little? :-)
msg100370 - (view) Author: Vlastimil Brom (vbr) Date: 2010-03-04 01:45
Actually I had that impression too, but I was mainly surprised that these requirements are on the lowest level of Unicode support. Anyway, maybe the relevance of these guidelines for real libraries is lower than I expected.

Probably the simpler cases are adequately handled with lookarounds, e.g. (?:\w(?<!\p{Greek}))+, and complex examples like symmetric differences seem to be beyond the normal scope of re anyway.
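Set subtraction of the kind TR18 describes can indeed be emulated with lookarounds even in the standard re module; a minimal sketch using plain ASCII classes (re has no \p{...} properties), subtracting the vowels from [a-z]:

```python
import re

# "[a-z] minus [aeiou]": the negative lookahead performs the
# subtraction before the class consumes the character.
consonant = re.compile(r"(?![aeiou])[a-z]")

print(consonant.findall("hello world"))  # ['h', 'l', 'l', 'w', 'r', 'l', 'd']
```

Intersection works the same way with a positive lookahead, e.g. r"(?=[a-m])[h-z]" for the overlap of two classes.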

Personally, I would find the surrogate handling more useful, but I see that it isn't really the job of the re library, given that a narrow build of Python doesn't support indexing, slicing, or len() of these characters either...

vbr
msg100452 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-03-05 03:27
issue2636-20100305.zip is a new version of the regex module.

Just a few tweaks.
msg101172 - (view) Author: Alex Willmer (moreati) * Date: 2010-03-16 15:56
I've adapted the Python 2.6.5 test_re.py as follows, 

 from test.test_support import verbose, run_unittest
-import re
-from re import Scanner
+import regex as re
+from regex import Scanner

and run it against regex-20100305. Three tests failed; the report is attached.
msg101181 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-03-16 19:31
Does regex.py have its own test suite (which also includes tests for all the problems reported in the last few messages)?
If so, the new tests could be merged in re's test_re. This will simplify the testing of regex.py and will improve the test coverage of re.py, possibly finding new bugs. It will also be useful to check if the two libraries behave in the same way.
msg101193 - (view) Author: Vlastimil Brom (vbr) Date: 2010-03-16 21:37
I am not sure about the test suite for this regex module, but it seems to me that many of the problems reported here probably don't apply to the current builtin re, as they are connected with the new features of regex.
After the suggestion in msg91462 I briefly checked the re test suite and found it very comprehensive, given the feature set. Of course, most (all?) re tests should apply to regex, but probably not vice versa.
vbr
msg101557 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-03-23 01:21
issue2636-20100323.zip is a new version of the regex module.

It now includes a test script. Most of the tests come from the existing test scripts.
msg102042 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-03-31 22:26
issue2636-20100331.zip is a new version of the regex module.

It includes speed-ups and a minor bugfix.
msg103003 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-04-13 02:21
issue2636-20100413.zip is a new version of the regex module.

It includes additional speed-ups.
msg103060 - (view) Author: Alex Willmer (moreati) * Date: 2010-04-13 16:23
On 13 April 2010 03:21, Matthew Barnett <report@bugs.python.org> wrote:
> issue2636-20100413.zip is a new version of the regex module.

Matthew, When I run test_regex.py 6 tests are failing, with Python
2.6.5 on Ubuntu Lucid and my setup.py. Attached is the output, do all
the tests pass in your build?

Alex
msg103064 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-04-13 17:10
Yes, it passed all the tests, although I've since found a minor bug that isn't covered/caught by them, so I'll need to add a few more tests.

Anyway, do:

    regex.match(ur"\p{Ll}", u"a")
    regex.match(ur'(?u)\w', u'\xe0')

really return None? Your results suggest that they won't.

I downloaded Python 2.6.5 (I was using Python 2.6.4) just in case, but it still passes (WinXP, 32-bit).
msg103078 - (view) Author: Alex Willmer (moreati) * Date: 2010-04-13 19:46
On 13 April 2010 18:10, Matthew Barnett <report@bugs.python.org> wrote:
> Anyway, do:
>
>    regex.match(ur"\p{Ll}", u"a")
>    regex.match(ur'(?u)\w', u'\xe0')
>
> really return None? Your results suggest that they won't.

Python 2.6.5 (r265:79063, Apr  3 2010, 01:56:30)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.__version__
'2.3.0'
>>> print regex.match(ur"\p{Ll}", u"a")
None
>>> print regex.match(ur'(?u)\w', u'\xe0')
None

I thought it might be a 64-bit issue, but I see the same result in a 32-bit
VM. That leaves my build process. Attached are the setup.py and
build output; unicodedata_db.h was taken from the Ubuntu source deb
for Python 2.6.5.
msg103095 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-04-13 23:33
issue2636-20100414.zip is a new version of the regex module.

I think I might have identified the cause of the problem, although I still haven't been able to reproduce it, so I can't be certain.
msg103096 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-04-13 23:34
Oops, forgot the file! :-)
msg103097 - (view) Author: Alex Willmer (moreati) * Date: 2010-04-13 23:39
On 14 April 2010 00:33, Matthew Barnett <report@bugs.python.org> wrote:
> I think I might have identified the cause of the problem, although I still haven't been able to reproduce it, so I can't be certain.

Performed 76

Passed

Looks like you got it.
msg109358 - (view) Author: Vlastimil Brom (vbr) Date: 2010-07-05 21:42
I just noticed some strange behaviour in matching character sets or alternations which contain "advanced" Unicode characters together with "simpler" ones in the search pattern. The former seem to be ignored and not matched (the original re engine matches all of them). (win XPh SP3 Czech, Python 2.7; regex issue2636-20100414)

>>> print u"".join(regex.findall(u".", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(regex.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(regex.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëē
>>> print u"".join(re.findall(u"[eèéêëēěė]", u"eèéêëēěė"))
eèéêëēěė
>>> print u"".join(re.findall(u"e|è|é|ê|ë|ē|ě|ė", u"eèéêëēěė"))
eèéêëēěė

Even stranger: if the pattern contains only these "higher" Unicode characters, everything works OK:
>>> print u"".join(regex.findall(u"ē|ě|ė", u"eèéêëēěė"))
ēěė
>>> print u"".join(regex.findall(u"[ēěė]", u"eèéêëēěė"))
ēěė


The characters in question are some accented latin letters (here in ascending codepoints), but it can be other scripts as well.
>>> print regex.findall(u".", u"eèéêëēěė")
[u'e', u'\xe8', u'\xe9', u'\xea', u'\xeb', u'\u0113', u'\u011b', u'\u0117']

The threshold isn't obvious to me; at first I thought the characters represented as Unicode escapes were problematic, whereas those with hexadecimal escapes were OK; however ē - u'\u0113' seems OK too.
(Python 3.1 behaves identically:
>>> regex.findall("[eèéêëēěė]", "eèéêëēěė")
['e', 'è', 'é', 'ê', 'ë', 'ē']
>>> regex.findall("[ēěė]", "eèéêëēěė")
['ē', 'ě', 'ė']
)

vbr
msg109363 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-06 00:02
issue2636-20100706.zip is a new version of the regex module.

I've added your examples to the unit tests. The module now passes.

Keep up the good work! :-)
msg109372 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-06 08:42
Matthew, I'd like to see at least some of these features in 3.2, but ISTM that after more than 2 years this issue is not going anywhere.
Is the module still under active development? Is it "ready"? Is it waiting for reviews and to be added to the stdlib? Is it waiting for more people to test it on PyPI?
If the final goal is adding it to the stdlib, are you planning to add it as a new module or to replace the current 're' module? (or is 'regex' just the 're' module with improvements that could be merged?)
Another alternative would be to split it into smaller patches (ideally one per feature) and integrate them one by one, but IIRC several of the patches depend on each other so it can't be done easily.

Unless there is already a plan about this (and I'm not aware of it), I'd suggest bringing this up on python-dev and deciding what to do with the 'regex' module.
msg109384 - (view) Author: Alex Willmer (moreati) * Date: 2010-07-06 11:25
I've packaged Matthew's latest revision and uploaded it to PyPI. This
version will build for Python 2 and Python 3, parallel installs will
coexist on the same machine.
msg109401 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-06 16:17
I started with trying to modify the existing re module, but I wanted to make too many changes, so in the end I decided to make a clean break and start on a new implementation which was compatible with the existing re module and which could replace the existing implementation, even under the same name.

Apart from the recent bug fix, I haven't done any further work since April on it because I think it's pretty much ready.
msg109403 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-06 16:27
So, if it's pretty much ready, do you think it could be included already in 3.2?
msg109404 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2010-07-06 16:29
Before anything else is done with it, it should probably be announced in some way. I'm not sure if anyone has opened any of these zip files, reviewed anything, run anything, or if anyone even knows this whole thing has been going on.
msg109405 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-06 16:38
Yes, as I said in the previous message it should probably be announced on python-dev to see what the others think. I don't know how much the module has been used in the wild, but since there has been a PyPI package available for a few months now and since people have reported issues here, I assume someone is using it (however I don't know how many people, or whether they used it in real applications or just played around with it).
I don't want to rush things, but if the module is "ready" I do want to start moving things forward, so that after all the necessary decisions and reviews it eventually gets merged.
msg109406 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-06 17:03
The file at:

    http://pypi.python.org/pypi/regex

was downloaded 75 times, if that's any help. (Now reset to 0 because of the bug fix.)

If it's included in 3.2 then there's the question of whether it should replace the re module and be called "re".
msg109407 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-06 17:07
If it's backward-compatible with the 're' module, all the tests of the test suite pass, and it just improves it and adds features, I don't see why not. (That's just my personal opinion though, other people might (and probably will) disagree.)

Try to send an email on python-dev and see what they say.
msg109408 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2010-07-06 17:16
My only additional opinion is that re is very much used in deployed Python applications and was written not just for correctness but also for speed.  As such, regex should be benchmarked fairly to show that it is commensurately speedy.  I wouldn't personally object to a slightly slower module, though not to one that is noticeably slower; and if it can be proven faster in the average case, that's one more check in the box for favorable inclusion.
msg109409 - (view) Author: Vlastimil Brom (vbr) Date: 2010-07-06 17:30
Thanks for the prompt fix!
It would indeed be nice to see this enhanced re module in the standard library, e.g. in 3.2, but I also really appreciate that multiple 2.x versions are supported (as my current main usage of this library involves a py2-only wx GUI).
As for the usage statistics, I for one have always downloaded the updates from here rather than PyPI, but maybe that is not the usual case.
msg109410 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-06 17:34
FWIW, I'd love seeing the updated regex module in 3.2.  Please do bring it up on python-dev.

Looking at the latest module on PyPI, I noted that the regex.py file is very long (~3500 lines), even though it is quite compressed (e.g. no blank lines between methods).  It would be good to split it up.  This would also remove the need for underscore-prefixing most of the identifiers, since they would simply live in another (private) module.
Things like the _create_header_file function should be put into utility scripts.  The C file is also very long, but I think we all know why :)

It would also be nice to see some performance comparisons -- where is the new engine faster, where does it return matches while re just loops forever, and where is the new engine slower?
msg109413 - (view) Author: Alex Willmer (moreati) * Date: 2010-07-06 17:50
On 6 July 2010 18:03, Matthew Barnett <report@bugs.python.org> wrote:
> The file at http://pypi.python.org/pypi/regex/ was downloaded 75 times, if that's any help. (Now reset to 0 because of the bug fix.)
>

Each release was downloaded between 50 and 100 times. Matthew let me
know if you'd like control of the package, or maintainer access. Other
than the odd tweet I haven't publicized the releases.
msg109447 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-07 01:45
As a crude guide of the speed difference, here's Python 2.6:

                         re          regex
bm_regex_compile.py     86.53secs  260.19secs
bm_regex_effbot.py      13.70secs    8.94secs
bm_regex_v8.py          15.66secs    9.09secs

Note that compiling regexes is a lot slower. I concentrated my efforts on the matching speed because regexes tend to be compiled only once, so it's not as important.

Matching speed should _at worst_ be comparable.
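For anyone who wants a rough figure on their own machine, the two costs can be separated with a micro-benchmark along these lines (a sketch only; the bm_regex_* scripts quoted above are far more representative):

```python
import timeit

# re.compile() caches compiled patterns, so purge the cache on each
# iteration to measure real compilation cost rather than a dict lookup.
compile_time = timeit.timeit(
    "re.purge(); re.compile(r'([a-z]+)(\\d+)')",
    setup="import re", number=1000)

# Matching cost: the pattern is compiled once and reused every time.
match_time = timeit.timeit(
    "p.match('abc123')",
    setup="import re; p = re.compile(r'([a-z]+)(\\d+)')",
    number=1000)

print(compile_time, match_time)
```

The pattern and iteration count here are arbitrary; the point is only to time compilation and matching independently, mirroring the table above.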
msg109460 - (view) Author: Mark Summerfield (mark) Date: 2010-07-07 08:57
On the PyPI page:
http://pypi.python.org/pypi/regex/0.1.20100706.1
in the "Subscripting for groups" bullet it gives this pattern:

r"(?<before>.*?)(?<num>\\d+)(?<after>.*)"

Shouldn't this be:

r"(?P<before>.*?)(?P<num>\\d+)(?P<after>.*)"

Or has a new syntax been introduced?
msg109461 - (view) Author: Mark Summerfield (mark) Date: 2010-07-07 09:13
If you do:

>>> import regex as re
>>> dir(re)

you get over 160 items, many of which begin with an underscore and so are private. Couldn't __dir__ be reimplemented to eliminate them? (I know that the current re module's dir() also returns private items, but I guess this is a legacy of not having the __dir__ special method?)
msg109463 - (view) Author: Mark Summerfield (mark) Date: 2010-07-07 09:29
I was wrong about r"(?<name>.*)". It is valid in the new engine. And the PyPI docs do say so immediately _following_ the example.

I've tried all the examples in "Programming in Python 3 second edition" using "import regex as re" and they all worked.
msg109474 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-07 13:48
Mark, __dir__ as a special method only works when defined on types, so you'd have to use a module subclass for the "regex" module :)

As I already suggested, it is probably best to move most of the private stuff into a separate module, and only import the really needed entry points into the regex module.
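Georg's module-subclass suggestion can be sketched as follows (a hypothetical illustration, not code from the regex module):

```python
import types

class _PublicNamesModule(types.ModuleType):
    # dir(module) consults type(module).__dir__, so a ModuleType
    # subclass can hide underscore-prefixed (private) names.
    def __dir__(self):
        return [name for name in self.__dict__
                if not name.startswith('_')]

mod = _PublicNamesModule('example')
mod.match = lambda *args: None
mod._private_helper = object()

print(dir(mod))  # ['match'] -- the private name is hidden
```

(Much later, PEP 562 in Python 3.7 made this simpler by honouring a plain module-level __dir__ function, but at the time a ModuleType subclass was the only route.)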
msg109657 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-09 01:20
issue2636-20100709.zip is a new version of the regex module.

I've moved most of the regex module's Python code into a private module.
msg110233 - (view) Author: Jonathan Halcrow (jhalcrow) Date: 2010-07-13 21:34
The most recent version on pypi (20100709) seems to be missing _regex_core from py_modules in setup.py.  Currently import regex fails, unable to locate _regex_core.
msg110237 - (view) Author: Alex Willmer (moreati) * Date: 2010-07-13 21:56
On 13 July 2010 22:34, Jonathan Halcrow <report@bugs.python.org> wrote:
> The most recent version on pypi (20100709) seems to be missing _regex_core from py_modules in setup.py.

Sorry, my fault. I've uploaded a corrected version
http://pypi.python.org/pypi/regex/0.1.20100709.1
msg110701 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-19 00:15
issue2636-20100719.zip is a new version of the regex module.

Just a few more tweaks for speed.
msg110704 - (view) Author: Vlastimil Brom (vbr) Date: 2010-07-19 01:37
Thanks for the update;
Just a small observation regarding some character ranges and ignorecase; probably irrelevant, but a difference from the current re anyway:

>>> zero2z = u"0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz"

>>> re.findall("(?i)[X-d]", zero2z)
[]

>>> regex.findall("(?i)[X-d]", zero2z)
[u'A', u'B', u'C', u'D', u'X', u'Y', u'Z', u'[', u'\\', u']', u'^', u'_', u'`', u'a', u'b', u'c', u'd', u'x', u'y', u'z']
>>>


re.findall("(?i)[B-d]", zero2z)
[u'B', u'C', u'D', u'b', u'c', u'd']

regex.findall("(?i)[B-d]", zero2z)
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'[', u'\\', u']', u'^', u'_', u'`', u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z']

It seems that the re module is building the character set using a case-insensitive "alphabet" in some way.

I guess the behaviour of re is buggy here, while regex is OK (tested on py 2.7, Win XPp).
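For what it's worth, the regex results above follow a simple rule: under IGNORECASE a character matches a range if any of its case variants falls inside it. A sketch of that rule (my own characterization, not the module's actual code):

```python
def ci_in_range(ch, lo, hi):
    # True if any case variant of ch lies within the range [lo-hi].
    return any(lo <= c <= hi for c in {ch, ch.lower(), ch.upper()})

zero2z = "0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz"
print("".join(c for c in zero2z if ci_in_range(c, 'X', 'd')))
# ABCDXYZ[\]^_`abcdxyz -- the same set regex reports for (?i)[X-d]
```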

vbr
msg110761 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-19 14:43
This has already been reported in issue #3511.
msg111519 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-25 02:46
issue2636-20100725.zip is a new version of the regex module.

More tweaks for speed.


                         re          regex   
bm_regex_compile.py     87.05secs  278.00secs
bm_regex_effbot.py      14.00secs    6.58secs
bm_regex_v8.py          16.11secs    6.66secs
msg111531 - (view) Author: Alex Willmer (moreati) * Date: 2010-07-25 09:20
On 25 July 2010 03:46, Matthew Barnett <report@bugs.python.org> wrote:
> issue2636-20100725.zip is a new version of the regex module.

This is now packaged and uploaded to PyPI
http://pypi.python.org/pypi/regex/0.1.20100725
msg111643 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-07-26 16:53
Does 'regex' implement "default word boundaries" (see #7255)?
msg111652 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-26 17:41
No.

Wouldn't that break compatibility with 're'?
msg111656 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2010-07-26 17:50
What about a regex flag?  Like regex.W or (?w)?
msg111660 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-07-26 18:32
That's a possibility.

I must admit that I don't entirely understand it enough to implement it (the OP said "I don't believe that the algorithm for this is a whole lot more complicated"), and I don't have a need for it myself, but if someone would like to provide some code for it, even if it's in the form of a function written in Python:

    def at_default_word_boundary(text, pos):
        ...

then I'll see what I can do! :-)
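As a naive starting point for the interface Matthew asks for, here is a sketch that only reproduces the classic \b rule, not the UAX #29 default word boundary algorithm being requested:

```python
def at_word_boundary(text, pos):
    # Classic \b semantics: a boundary exists where exactly one side
    # of pos is a word character (alphanumeric or underscore).
    # The real "default word boundaries" of UAX #29 need much more
    # context (e.g. apostrophes inside words, numeric sequences).
    def is_word(i):
        return 0 <= i < len(text) and (text[i].isalnum() or text[i] == '_')
    return is_word(pos - 1) != is_word(pos)

print(at_word_boundary("ab cd", 2))  # True: between 'b' and ' '
```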
msg111921 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-07-29 13:29
Wishlist item: could you give the regex and match classes nicer names, so that they can be referenced as `regex.Pattern` (or `regex.Regex`) and `regex.Match`?
msg113927 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-08-14 20:24
issue2636-20100814.zip is a new version of the regex module.

I've added default Unicode word boundaries and renamed the Pattern and Match classes.

Over to you, Alex. :-)
msg113931 - (view) Author: Alex Willmer (moreati) * Date: 2010-08-14 21:18
On 14 August 2010 21:24, Matthew Barnett <report@bugs.python.org> wrote:
> Over to you, Alex. :-)

Et voilà, an exciting Saturday evening
http://pypi.python.org/pypi/regex/0.1.20100814

Matthew, I'm currently keeping regex in a private bzr repository. Do
you have yours in source control? If so/not could we make yours/mine
public, and keep everything in one repository?
-- 
Alex Willmer <alex@moreati.org.uk>
msg114034 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-08-16 02:04
issue2636-20100816.zip is a new version of the regex module.

Unfortunately I came across a bug in the handing of sets. More unit tests added.
msg114766 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-08-24 02:13
issue2636-20100824.zip is a new version of the regex module.

More speedups. Getting towards Perl speed now, depending on the regex. :-)
msg116133 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-09-11 23:37
issue2636-20100912.zip is a new version of the regex module.

More speedups. I've been comparing the speed against Perl wherever possible. In some cases Perl is lightning fast, probably because regex is built into the language and it doesn't have to parse method arguments (for some short regexes a large part of the processing time is spent in PyArg_ParseTupleAndKeywords!). In other cases, where it has to use Unicode codepoints outside the 8-bit range, or character properties such as \p{Alpha}, its performance is simply appalling! :-)
msg116151 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-09-12 06:47
(?flags) are still scoped by default... a new flag to activate that behavior would really be helpful  :)
msg116223 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-09-12 18:42
Another flag? Hmm.

How about this instead: if a scoped flag appears at the end of a regex (and would therefore normally have no effect) then it's treated as though it's at the start of the regex. Thus:

    foo(?i)

is treated like:

    (?i)foo
msg116227 - (view) Author: Vlastimil Brom (vbr) Date: 2010-09-12 20:15
Not that my opinion matters, but for what it's worth, I find it rather unusual to have to use special flags to get "normal" (for some definition of normal) behaviour, while the defaults remain buggy in some way (like ZEROWIDTH). I would think backwards compatibility would not be needed under these circumstances, in such probably marginal cases (or is setting global flags at the end, or anywhere other than at the beginning of the pattern, really that frequent?). It seems that with many new features and enhancements for previously "impossible" patterns, chances are that code using regular expressions in a more advanced way would benefit from reviewing its patterns anyway (at which point the flags for "historical" behaviour could also be adjusted if really needed).

Anyway, thanks for the further improvements! (Although they broke my custom function, which previously misused the internal data of the regex module to get the Unicode script property (currently unavailable via unicodedata) :-).

Best regards,
   vbr
msg116229 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-09-12 20:47
The tests for re include these regexes:

    a.b(?s)
    a.*(?s)b

I understand what Georg said previously about some people preferring to put them at the end, but I personally wouldn't do that because some regex implementations support scoped inline flags, although others, like re, don't.

I think that second regex is a bit perverse, though! :-)

On the other matter, I could make the Unicode script and block available through a couple of functions if you need them, eg:

    # Using Python 3 here
    >>> regex.script("A")
    'Latin'
    >>> regex.block("A")
    'BasicLatin'
msg116231 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-09-12 21:16
Matthew, I understand why you want to have these flags scoped, and if you designed a regex dialect from scratch, that would be the way to go.  However, if we want to integrate this in Python 3.2 or 3.3, this is an absolute killer if it's not backwards compatible.

I can live with behavior changes that really are bug fixes, and of course with new features that were invalid syntax before, but this is changing an aspect that was designed that way (as the test case shows), and that really is not going to happen without an explicit new flag. Special-casing the "flags at the end" case is too magical to be of any help.

It will be hard enough to get your code into Python -- it is a huge new codebase for an absolutely essential module.  I'm nevertheless optimistic that it is going to happen at some point or other.  Of course, you would have to commit to maintaining it within Python for the forseeable future.

The "script" and "block" functions really belong into unicodedata; you'll have to coordinate that with Marc-Andre.

@Vlastimil: backwards compatibility is needed very much here.  Nobody wants to review all their regexes when switching from Python 3.1 to Python 3.2.  Many people will not care about the improved engine, they just expect their regexes to work as before, and that is a perfectly fine attitude.
msg116238 - (view) Author: Vlastimil Brom (vbr) Date: 2010-09-12 22:01
Thank you both for the explanations; I somehow suspected there would be some strong reasoning for the conservative approach with regard to backward compatibility.
Thanks for the block() and script() offer, Matthew, but I believe this might clutter the interface of the module, while it belongs somewhere else.
(Personally, I just solved this need by directly grabbing 
http://www.unicode.org/Public/UNIDATA/Scripts.txt using regex :-)
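The Scripts.txt scraping just mentioned is straightforward; a sketch of parsing one line of that UCD file (field layout assumed from the published format):

```python
import re

# "0041..005A    ; Latin # ..." -> a codepoint range plus script name;
# single codepoints omit the "..hhhh" part.
_LINE = re.compile(r'^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)')

def parse_scripts_line(line):
    # Returns (low, high, script) for a data line, None for
    # comment or blank lines.
    m = _LINE.match(line)
    if not m:
        return None
    lo = int(m.group(1), 16)
    hi = int(m.group(2), 16) if m.group(2) else lo
    return lo, hi, m.group(3)

print(parse_scripts_line("0041..005A    ; Latin # L& [26] LATIN CAPITAL LETTER A.."))
```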

Part of the problem for unicodedata might be that this is a different data file from UnicodeData.txt (which is currently the only one used, IIRC).

On the other hand it might be worthwhile to synchronise these features with such updates in unicodedata (block, script, Unicode range; maybe the full names of the character properties might be added too).
As Unicode 6.0 is due at the end of September, this might also reduce the effort of upgrading regex for it.

Do you think it would be appropriate/realistic to create a feature request in the bug tracker for enhancing unicodedata?
(Unfortunately, I must confess, I am unable to contribute code in this area; without C knowledge I have always failed to find any useful data in the optimised sources of unicodedata, hence I directly scanned the online data files instead.)

vbr
msg116248 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-09-12 23:14
OK, so would it be OK if there was, say, a NEW (N) flag which made the inline flags (?flags) scoped and allowed splitting on zero-width matches?
msg116252 - (view) Author: Vlastimil Brom (vbr) Date: 2010-09-12 23:34
Just a few more rather marginal findings - differences between regex and re:

>>> regex.findall(r"[\B]", "aBc")
['B']
>>> re.findall(r"[\B]", "aBc")
[]

(Python 2.7 ... on win32; regex - issue2636-20100912.zip)
I believe regex is more correct here, as uppercase \B doesn't have a special meaning within a set (unlike \b, which means backspace there), hence it should be treated as "B"; but I wanted to mention it as a difference, just in case it matters.

I also noticed another case, where regex is more permissive:

>>> regex.findall(r"[\d-h]", "ab12c-h")
['1', '2', '-', 'h']
>>> re.findall(r"[\d-h]", "ab12c-h")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "re.pyc", line 177, in findall
  File "re.pyc", line 245, in _compile
error: bad character range
>>> 

However, there might be an issue in negated sets, where the negation seems to apply to the first shorthand literal only; the rest is taken positively:

>>> regex.findall(r"[^\d-h]", "a^b12c-h")
['-', 'h']

cf. also a simplified pattern, where re seems to work correctly:

>>> regex.findall(r"[^\dh]", "a^b12c-h")
['h']
>>> re.findall(r"[^\dh]", "a^b12c-h")
['a', '^', 'b', 'c', '-']
>>> 

Or maybe regardless of the order - in the presence of shorthand literals and normal characters in negated sets, the normal characters are matched positively:

>>> regex.findall(r"[^h\s\db]", "a^b 12c-h")
['b', 'h']
>>> re.findall(r"[^h\s\db]", "a^b 12c-h")
['a', '^', 'c', '-']
>>> 

Also related to character sets, but possibly different - adding a (redundant) character that also belongs to the shorthand in a negated set seems to somehow confuse the parser:

regex.findall(r"[^b\w]", "a b")
[]
re.findall(r"[^b\w]", "a b")
[' ']

regex.findall(r"[^b\S]", "a b")
[]
re.findall(r"[^b\S]", "a b")
[' ']

>>> regex.findall(r"[^8\d]", "a 1b2")
[]
>>> re.findall(r"[^8\d]", "a 1b2")
['a', ' ', 'b']
>>> 

I didn't find any relevant tracker issues, sorry if I missed some...
I initially wanted to provide test code additions, but as I am not sure about the intended output in all cases, I am leaving it in this form;

vbr
msg116276 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-09-13 04:24
issue2636-20100913.zip is a new version of the regex module.

I've removed the ZEROWIDTH flag and added the NEW flag, which turns on the new behaviour such as splitting on zero-width matches and positional flags. If the NEW flag isn't turned on then the inline flags are global, like in the re module.

You were right about those bugs in the regex module, Vlastimil. :-(

I've left the permissiveness of the sets in, at least for the moment, or until someone complains about it!

Incidentally:

>>> re.findall(r"[\B]", "aBc")
[]
>>> re.findall(r"[\c]", "aBc")
['c']

so it is a bug in the re module (it's putting a non-word-boundary in a set).
msg116749 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-09-18 02:55
issue2636-20100918.zip is a new version of the regex module.

I've added 'pos' and 'endpos' arguments to regex.sub and regex.subn and refactored a little.

I can't think of any other features that need to be added or see any more speed improvements.

Have I missed anything important? :-)
msg117008 - (view) Author: Vlastimil Brom (vbr) Date: 2010-09-20 23:51
I like the idea of the general "new" flag introducing the reasonable, backwards-incompatible behaviour; one doesn't have to remember a list of non-standard flags to get these features.

While I recognise that the module probably can't work correctly with wide Unicode characters on a narrow Python build (py 2.7, win XP in this case), I noticed a difference from re in this regard (it might be due to the absence of the wide Unicode literal in the latter).

re.findall(u"\\U00010337", u"a\U00010337bc")
[]
re.findall(u"(?i)\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"(?i)\\U00010337", u"a\U00010337bc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 203, in findall
    return _compile(pattern, flags).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 310, in _compile
    parsed = parsed.optimise(info)
  File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
    if self.is_case_sensitive(info):
  File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
    return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

I.e. re fails to match this pattern (as it actually looks for "U00010337"); regex doesn't recognise the wide Unicode character as a surrogate pair either, but it additionally raises an error from the narrow unichr(). Not sure whether/how it should be fixed, but the difference depending on the i-flag seems unusual.

Of course it would be nice if surrogate pairs were interpreted, but I can imagine that it would open a whole can of worms, as this is not thoroughly supported by the builtin unicode type either (len, indices, slicing).

I am trying to make wide Unicode characters somehow usable in my app, mainly with hacks like an extended unichr:
("\U"+hex(67)[2:].zfill(8)).decode("unicode-escape") 
or likewise for ord
surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000

Actually, using regex, one can work around some of these limitations of len, index or slice using a list form of the string containing surrogates

regex.findall(ur"(?s)(?:\p{inHighSurrogates}\p{inLowSurrogates})|.", u"ab𐌷𐌸𐌹cd")
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']

but apparently things like wide unicode literals or character sets (even extending of the shorthands like \w etc.) are much more complicated.
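The two hacks above can be folded into one helper; a sketch of an ord() that also accepts a surrogate pair (written here for Python 3 strings, where lone surrogates are representable, but the arithmetic is the same on a narrow 2.x build):

```python
def wide_ord(s):
    # ord() for one logical character: either a single code unit,
    # or a high/low surrogate pair as stored by a narrow build.
    if (len(s) == 2
            and '\ud800' <= s[0] <= '\udbff'
            and '\udc00' <= s[1] <= '\udfff'):
        return ((ord(s[0]) - 0xD800) * 0x400
                + (ord(s[1]) - 0xDC00) + 0x10000)
    if len(s) != 1:
        raise TypeError('expected a single character or a surrogate pair')
    return ord(s)

print(hex(wide_ord('\ud800\udf37')))  # 0x10337, the pair for U+10337
```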

regards,
   vbr
msg117046 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-09-21 11:41
I use Python 3, where len("\U00010337") == 2 on a narrow build.

Yes, wide Unicode on a narrow build is a problem:

>>> regex.findall("\\U00010337", "a\U00010337bc")
[]
>>> regex.findall("(?i)\\U00010337", "a\U00010337bc")
[]

I'm not sure how (or whether!) to handle surrogate pairs. It _would_ make things more complicated.

I suppose the moral is that if you want to use wide Unicode then you really should use a wide build.
msg117050 - (view) Author: Vlastimil Brom (vbr) Date: 2010-09-21 14:17
Well, of course, surrogates probably shouldn't be handled separately in one module, independently of the rest of the standard library. (I actually don't know of such a narrow implementation, although it is mentioned in the Unicode guidelines:
http://unicode.org/reports/tr18/#Supplementary_Characters )

The main surprise on my part was the compile error, rather than the empty match as was the case with re;
but now I see that it is a consequence of the newly introduced wide unicode notation; the matching behaviour changed consistently.

(For my part, the workarounds I found seem to be sufficient in the cases where I work with wide unicode; most likely I am not going to compile a wide unicode build on Windows myself in the near future :-)
 vbr
msg118243 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-10-09 03:08
issue2636-20101009.zip is a new version of the regex module.

It appears from a posting in python-list and a closer look at the docs that string positions in the 're' module are limited to 32 bits, even on 64-bit builds. I think it's because of things like:

    Py_BuildValue("i", ...)

where 'i' indicates the size of a C int, which, at least with Windows compilers, is 32 bits in both 32-bit and 64-bit builds.

The regex module shared the same problem. I've changed such code to:

    Py_BuildValue("n", ...)

and so forth, which indicates Py_ssize_t.

Unfortunately I'm not able to confirm myself that this will fix the problem on 64 bits.
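
As a hedged illustration (not the actual C patch) of why "i" truncates, the underlying size difference is visible from Python via ctypes:

```python
import ctypes

# A C int ("i" in Py_BuildValue) is usually 32 bits even on 64-bit
# Windows, while Py_ssize_t ("n") matches the platform word size, so
# positions beyond 2**31 - 1 survive.
int_bits = ctypes.sizeof(ctypes.c_int) * 8
ssize_bits = ctypes.sizeof(ctypes.c_ssize_t) * 8
```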
msg118631 - (view) Author: Vlastimil Brom (vbr) Date: 2010-10-14 08:13
I tried to give the 64-bit version a try, but I might have encountered a more general difficulty.
I tested this on Windows 7 Home Premium (Czech); the system is 64-bit (or so I've hoped so far :-), according to System info: x64-based PC
I installed
Python 2.7 Windows X86-64 installer
from http://www.python.org/download/
which ran ok, but the header in the python shell contains "win32"

Python 2.7 (r27:82525, Jul  4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

Consequently, after copying the respective files from issue2636-20101009.zip
I get an import error:

>>> import regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python_64bit_27\lib\regex.py", line 253, in <module>
    from _regex_core import *
  File "C:\Python_64bit_27\lib\_regex_core.py", line 53, in <module>
    import _regex
ImportError: DLL load failed: %1 nenÝ platnß aplikace typu Win32.

>>> 

(The last part of the message is in Czech with broken diacritics:
 %1 is not a valid Win32 application.)

Is there something I can do in this case? I'd think the installer would refuse to install 64-bit software on a 32-bit OS or 32-bit architecture; or am I missing something obvious in the naming peculiarities (x64, 64bit etc.)?
That being said, I probably don't need to use a 64-bit version of Python; obviously, it isn't the wide unicode build mentioned earlier, hence
>>> len(u"\U00010333") # is still: 
2
>>>
And I currently don't have special memory requirements, which might be better addressed on a 64-bit system.

If there is something I can do to test regex in this environment, please, let me know;
On the same machine the 32-version is ok:
Python 2.7 (r27:82525, Jul  4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>>

regards
   vbr
msg118636 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-14 08:41
Vlastimil, what makes you think that issue2636-20101009.zip is a 64-bit version? I can only find 32-bit DLLs in it.
msg118640 - (view) Author: Vlastimil Brom (vbr) Date: 2010-10-14 08:55
Well, it seemed that way to me too;
I happened to read the last post from Matthew, msg118243, as meaning that he had made some updates which needed testing on a 64-bit system (I am unsure whether hardware architecture, OS type, Python build or something else was meant), and assumed it would have been separated out as a new directory in issue2636-20101009.zip, which is not the case.

More generally, I was somewhat confused about the "win32" in the shell header of the mentioned install.
    vbr
msg118674 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-10-14 15:42
I am not able to build or test a 64-bit version. The update was to the source files to ensure that if it is compiled for 64 bits then the string positions will also be 64-bit.

This change was prompted by a poster who tried to use the re module of a 64-bit Python build on a 30GB memmapped file but found that the string positions were still limited to 32 bits.

It looked like a 64-bit build of the regex module would have the same limitation.
msg118682 - (view) Author: Vlastimil Brom (vbr) Date: 2010-10-14 16:21
Sorry for the noise,
it seems, I can go back to the 32-bit python for now then...
vbr
msg119887 - (view) Author: Jacques Grove (jacques) Date: 2010-10-29 11:11
Do we expect this to work on 64-bit Linux and Python 2.6.5?  I've compiled and run some of my code through this, and there seem to be issues with non-greedy quantifier matching (at least relative to the old re module):

$ cat test.py
import re, regex

text = "(MY TEST)"
regexp = '\((?P<test>.{0,5}?TEST)\)'
print re.findall(regexp, text)
print regex.findall(regexp, text)


$ python test.py
['MY TEST']
[]

python 2.7 produces the same results for me.

However, making the quantifier greedy (removing the '?') gives the same result for both re and regex modules.
msg119930 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-10-29 19:36
That's a bug. I'll fix it as soon as I've reinstalled the SDK. <sigh/>
msg119947 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-10-29 22:33
issue2636-20101029.zip is a new version of the regex module.

I've also added to the unit tests.
msg119951 - (view) Author: Jacques Grove (jacques) Date: 2010-10-30 00:48
Here's another inconsistency (same setup as before, running issue2636-20101029.zip code):

$ cat test.py
import re, regex

text = "\n  S"

regexp = '[^a]{2}[A-Z]'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py
['  S']
[]


I might flush out some more as I exercise this over the next few days.
msg119956 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-10-30 03:39
issue2636-20101030.zip is a new version of the regex module.

I've also added yet more to the unit tests.
msg119958 - (view) Author: Jacques Grove (jacques) Date: 2010-10-30 04:40
And another (with issue2636-20101030.zip):

$ cat test.py 
import re, regex
text = "XYABCYPPQ\nQ DEF"
regexp = 'X(Y[^Y]+?){1,2}(\ |Q)+DEF'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py 
[('YPPQ\n', ' ')]
[]
msg120013 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-10-30 20:15
issue2636-20101030a.zip is a new version of the regex module.

This bug was a bit more difficult to fix, but I think it's OK now!
msg120037 - (view) Author: Jacques Grove (jacques) Date: 2010-10-31 05:27
Here's one that really falls in the category of "don't do that";  but I found this because I was limiting the system recursion level to somewhat less than the standard 1000 (for other reasons), and I had some shorter duplicate patterns in a big regex.  Here is the simplest case to make it blow up with the standard recursion settings:

$ cat test.py
import re, regex
regexp = '(abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ|abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ)'
re.compile(regexp)
regex.compile(regexp)

$ python test.py
<snip big traceback except for last few lines>

File "/tmp/test/src/lib/_regex_core.py", line 2024, in optimise
    subpattern = subpattern.optimise(info)
  File "/tmp/test/src/lib/_regex_core.py", line 1552, in optimise
    branches = [_Branch(branches)]
RuntimeError: maximum recursion depth exceeded
msg120038 - (view) Author: Jacques Grove (jacques) Date: 2010-10-31 06:09
And another, a bit less pathological, testcase.  Sorry for the ugly testcase; it was much worse before I boiled it down :-)

$ cat test.py 
import re, regex

text = "\nTest\nxyz\nxyz\nEnd"

regexp = '(\nTest(\n+.+?){0,2}?)?\n+End'
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py
[('\nTest\nxyz\nxyz', '\nxyz')]
[('', '')]
msg120164 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-01 20:28
issue2636-20101101.zip is a new version of the regex module.

I hope it's finally fixed this time! :-)
msg120202 - (view) Author: Jacques Grove (jacques) Date: 2010-11-02 02:49
OK, I think this might be the last one I will find for the moment:

$ cat test.py
import re, regex

text = "test?"
regexp = "test\?"
sub_value = "result\?"
print repr(re.sub(regexp, sub_value, text))
print repr(regex.sub(regexp, sub_value, text))


$ python test.py
'result\\?'
'result?'
msg120203 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-02 03:51
issue2636-20101102.zip is a new version of the regex module.
msg120204 - (view) Author: Jacques Grove (jacques) Date: 2010-11-02 04:08
Spoke too soon, although this might be a valid divergence in behavior:

$ cat test.py 
import re, regex

text = "test: 2"

print regex.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)


$ python test.py 
2 test,  
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print re.sub('(test)\W+(\d+)(?:\W+(TEST)\W+(\d))?', '\\2 \\1, \\4 \\3', text)
  File "/usr/lib64/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib64/python2.7/re.py", line 278, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/lib64/python2.7/sre_parse.py", line 787, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group
msg120206 - (view) Author: Jacques Grove (jacques) Date: 2010-11-02 04:52
Another, with backreferences:

import re, regex

text = "TEST, BEST; LEST ; Lest 123 Test, Best"
regexp = "(?i)(.{1,40}?),(.{1,40}?)(?:;)+(.{1,80}).{1,40}?\\3(\ |;)+(.{1,80}?)\\1"
print re.findall(regexp, text)
print regex.findall(regexp, text)

$ python test.py 
[('TEST', ' BEST', ' LEST', ' ', '123 ')]
[('T', ' BEST', ' ', ' ', 'Lest 123 ')]
msg120215 - (view) Author: Vlastimil Brom (vbr) Date: 2010-11-02 11:56
There seems to be a bug in the handling of numbered backreferences in sub() in issue2636-20101102.zip.
I believe it is a fairly new regression, as it would otherwise have been noticed rather soon.
(tested on Python 2.7; winXP)

>>> re.sub("([xy])", "-\\1-", "abxc")
'ab-x-c'
>>> regex.sub("([xy])", "-\\1-", "abxc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 176, in sub
    return _compile(pattern, flags).sub(repl, string, count, pos, endpos)
  File "C:\Python27\lib\regex.py", line 375, in _compile_replacement
    compiled.extend(items)
TypeError: 'int' object is not iterable
>>>

vbr
msg120216 - (view) Author: Vlastimil Brom (vbr) Date: 2010-11-02 12:08
Sorry for the noise; please forget my previous msg120215.
I somehow managed to keep an older version of _regex_core.py along with the new regex.py in the Lib directory, which are obviously incompatible.
After updating the files correctly, the mentioned examples work as expected.

vbr
msg120243 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-02 19:12
issue2636-20101102a.zip is a new version of the regex module.

msg120204 relates to issue #1519638 "Unmatched group in replacement". In 'regex' an unmatched group is treated as an empty string in a replacement template. This behaviour is more in keeping with regex implementations in other languages.

msg120206 was caused by not all group references being made case-insensitive when they should be.
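
That replacement rule can be sketched in pure Python on top of re (the helper name is invented here; the regex module applies the rule natively):

```python
import re

# Expand numbered backreferences in the template ourselves, substituting
# '' for any group that did not participate in the match, instead of
# raising "unmatched group" as re's own template expansion does.
def sub_unmatched_as_empty(pattern, template, string):
    def expand(m):
        return re.sub(r'\\(\d+)',
                      lambda g: m.group(int(g.group(1))) or '',
                      template)
    return re.sub(pattern, expand, string)
```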
msg120571 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-06 02:22
issue2636-20101106.zip is a new version of the regex module.

Fix for issue 10328, which regex also shared.
msg120969 - (view) Author: Alex Willmer (moreati) * Date: 2010-11-11 21:00
The re module throws an exception for re.compile(r'[\A\w]'). The latest
regex doesn't, but I don't think the pattern is matching correctly.
Shouldn't findall(r'[\A]\w', 'a b c') return ['a'] and
findall(r'[\A\s]\w', 'a b c') return ['a', ' b', ' c'] ?

Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> for s in [r'\A\w', r'[\A]\w', r'[\A\s]\w']: print re.findall(s, 'a b c')
...
['a']
[]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "/usr/lib/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: internal: unsupported set operator
>>> import regex
>>> for s in [r'\A\w', r'[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'a b c')
...
['a']
[]
[' b', ' c']
msg120976 - (view) Author: Vlastimil Brom (vbr) Date: 2010-11-11 22:20
Maybe I am missing something, but the results in regex seem ok to me:
\A is treated like A in a character set; when the test string is changed to "A b c", or in a case-insensitive search, the A is matched.

[\A\s]\w doesn't match the starting "a", as it is not followed by any word character:

>>> for s in [r'\A\w', r'[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'A b c')
... 
['A']
[]
[' b', ' c']
>>> for s in [r'\A\w', r'(?i)[\A]\w', r'[\A\s]\w']: print regex.findall(s, 'a b c')
... 
['a']
[]
[' b', ' c']
>>> 

In the original re there seems to be a bug/limitation in this regard (\A and also \Z in character sets aren't supported in some combinations...).

vbr
msg120984 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-11 23:13
It looks like a similar problem to msg116252 and msg116276.
msg120986 - (view) Author: Alex Willmer (moreati) * Date: 2010-11-11 23:48
On Thu, Nov 11, 2010 at 10:20 PM, Vlastimil Brom <report@bugs.python.org> wrote:
> Maybe I am missing something, but the result in regex seem ok to me:
> \A is treated like A in a character set;

I think it's me who missed something. I'd assumed that all backslash
patterns (including \A for beginning of string) maintain their meaning
in a character class. AFAICT that assumption was wrong.
msg121136 - (view) Author: Vlastimil Brom (vbr) Date: 2010-11-13 13:47
I'd have liked to suggest updating the underlying unicode data to the latest standard, 6.0, but it turns out it might be problematic for cross-version compatibility;
according to the clarification in
http://bugs.python.org/issue10400
the 3.x versions are going to be updated, while it is not allowed in the 2.x series.
I guess that would cause maintenance problems (as the needed properties are not available via unicodedata).
Anyway, while I'd like the recent unicode data to be supported (new characters, ranges, scripts, and corrected individual properties...),
I'm much happier that there is support for the 2.x series in regex...
vbr
msg121145 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-13 17:15
issue2636-20101113.zip is a new version of the regex module.

It now supports Unicode 6.0.0.
msg121149 - (view) Author: Vlastimil Brom (vbr) Date: 2010-11-13 18:13
Thank you very much!
A quick test with my custom unicodedata with 6.0 on py 2.7 seems ok.
I hope there won't be problems with "cooperation" between the more recent internal data and the original 5.2 database in Python 2.x releases.

vbr
msg121589 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-20 01:43
issue2636-20101120.zip is a new version of the regex module.

The match object now supports additional methods which return information on all the successful matches of a repeated capture group.

The API was inspired by that of .Net:

    matchobject.captures([group1, ...])

        Returns a tuple of the strings matched in a group or groups. Compare with matchobject.group([group1, ...]).

    matchobject.starts([group])

        Returns a tuple of the start positions. Compare with matchobject.start([group]).

    matchobject.ends([group])

        Returns a tuple of the end positions. Compare with matchobject.end([group]).

    matchobject.spans([group])

        Returns a tuple of the spans. Compare with matchobject.span([group]).
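
A hypothetical stand-in (not part of re or the regex module) can approximate captures() for a simple repeated group, by re-running the group's subpattern over the text the repetition consumed:

```python
import re

# Approximate the proposed matchobject.captures(group) for a repeated
# group: find the repetition's span, then re-scan it with the group's
# own subpattern. Not general; it only illustrates the semantics.
def captures_of_repeat(repeat_pattern, group_pattern, string):
    m = re.search(repeat_pattern, string)
    return re.findall(group_pattern, m.group(0)) if m else []

# re itself keeps only the last repetition of a capture group:
last_only = re.search(r'(?:(\d)\s?)+', '1 2 3').group(1)
# ...whereas the captures-style view recovers all of them:
all_reps = captures_of_repeat(r'(?:\d\s?)+', r'\d', '1 2 3')
```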
msg121832 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-21 01:54
issue2636-20101121.zip is a new version of the regex module.

The captures didn't work properly with lookarounds or atomic groups.
msg122221 - (view) Author: Steve Moran (stiv) Date: 2010-11-23 15:58
Forgive me if this is just a stupid oversight. 

I'm a linguist and use UTF-8 for "special" characters for linguistics data. This often includes multi-byte Unicode character sequences that are composed as one grapheme. For example the í̵ (if it's displaying correctly for you) is a LATIN SMALL LETTER I WITH STROKE \u0268 combined with COMBINING ACUTE ACCENT \u0301. E.g. a word I'm parsing:

jí̵-e-gɨ

I was pretty excited to find out that this regex library implements the grapheme match \X (equivalent to \P{M}\p{M}*). For the above example I needed to evaluate which sequences of characters can occur across syllable boundaries (here the hyphen "-"), so I'm aiming for:

í̵-e
e-g
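
(As an aside, the simplified \P{M}\p{M}* reading of \X mentioned above can be sketched in pure Python with unicodedata, using the general category M* as a stand-in for \p{M}; this is only a model, not the module's implementation.)

```python
import unicodedata

# Group a string into simplified grapheme clusters: each non-mark
# character absorbs the run of mark (category M*) characters after it.
def graphemes(s):
    out = []
    for ch in s:
        if out and unicodedata.category(ch).startswith('M'):
            out[-1] += ch
        else:
            out.append(ch)
    return out
```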

Just when I thought regex couldn't get any better, you awesome developers implemented an overlapped=True flag for findall and finditer.

Python 3.1.2 (r312:79147, May 19 2010, 11:50:28) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-46)] on linux2
>>> import regex
>>> s = "jí̵-e-gɨ"
>>> s
'jí̵-e-gɨ'
>>> m = regex.compile("(\X)(-)(\X)")
>>> m.findall(s, overlapped=False)
[('í̵', '-', 'e')]

But these results are weird to me:

>>> m.findall(s, overlapped=True)
[('í̵', '-', 'e'), ('í̵', '-', 'e'), ('e', '-', 'g'), ('e', '-', 'g'), ('e', '-', 'g')]

Why the extra matches? At first I figured this had something to do with the overlapping match of the grapheme, since it's multiple characters. So I tried it with with out the grapheme match:

>>> m = regex.compile("(.)(-)(.)")
>>> s2 = "a-b-cd-e-f"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b'), ('d', '-', 'e')]

That's right. But with overlap...

>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c'), ('d', '-', 'e'), ('d', '-', 'e'), ('d', '-', 'e'), ('e', '-', 'f'), ('e', '-', 'f')]

Those 'extra' matches are confusing me. 2x b-c, 3x d-e, 2x e-f? Or even more simply:

>>> s2 = "a-b-c"
>>> m.findall(s2, overlapped=False)
[('a', '-', 'b')]
>>> m.findall(s2, overlapped=True)
[('a', '-', 'b'), ('b', '-', 'c'), ('b', '-', 'c')]

Thanks!
msg122225 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-23 16:31
Please don't change the type, this issue is about the feature request of adding this regex engine to the stdlib.

I'm sure Matthew will get back to you about your question.
msg122228 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-23 18:34
issue2636-20101123.zip is a new version of the regex module.

Oops, sorry, the weird behaviour of msg122221 was a bug. :-(
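
For reference, the intended overlapped semantics can be modelled with the stdlib re by restarting the search one position past each match's start, so each starting position yields at most one result (a simplified sketch, not the module's actual implementation):

```python
import re

# A pure-re model of findall(..., overlapped=True): after each match,
# resume searching from start + 1 rather than from the match's end.
def findall_overlapped(pattern, string):
    pat = re.compile(pattern)
    results, pos = [], 0
    while pos <= len(string):
        m = pat.search(string, pos)
        if m is None:
            break
        results.append(m.groups() if pat.groups else m.group(0))
        pos = m.start() + 1
    return results
```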
msg122880 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-11-30 04:37
issue2636-20101130.zip is a new version of the regex module.

Added 'special_only' keyword parameter (default False) to regex.escape. When True, regex.escape escapes only 'special' characters, such as '?'.
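
The idea can be sketched in pure Python; the exact set of characters the real special_only option treats as special is an assumption here:

```python
# Escape only characters that carry special meaning in a pattern,
# leaving ordinary characters (letters, digits, spaces, non-ASCII) alone.
_SPECIAL = set(r'()[]{}?*+|^$\.')

def escape_special_only(s):
    return ''.join('\\' + ch if ch in _SPECIAL else ch for ch in s)
```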
msg123518 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-07 03:26
issue2636-20101207.zip is a new version of the regex module.

It includes additional checks against pathological regexes.
msg123527 - (view) Author: Zach Dwiel (zdwiel) Date: 2010-12-07 07:17
Here is the terminal log of what happens when I try to install and then import regex.  Any ideas what is going on?

$ python setup.py install
running install
running build
running build_py
creating build
creating build/lib.linux-i686-2.6
copying Python2/regex.py -> build/lib.linux-i686-2.6
copying Python2/_regex_core.py -> build/lib.linux-i686-2.6
running build_ext
building '_regex' extension
creating build/temp.linux-i686-2.6
creating build/temp.linux-i686-2.6/Python2
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.6 -c Python2/_regex.c -o build/temp.linux-i686-2.6/Python2/_regex.o
Python2/_regex.c:109: warning: ‘struct RE_State’ declared inside parameter list
Python2/_regex.c:109: warning: its scope is only this definition or declaration, which is probably not what you want
Python2/_regex.c:110: warning: ‘struct RE_State’ declared inside parameter list
Python2/_regex.c:538: warning: initialization from incompatible pointer type
Python2/_regex.c:539: warning: initialization from incompatible pointer type
Python2/_regex.c:679: warning: initialization from incompatible pointer type
Python2/_regex.c:680: warning: initialization from incompatible pointer type
Python2/_regex.c:1217: warning: initialization from incompatible pointer type
Python2/_regex.c:1218: warning: initialization from incompatible pointer type
Python2/_regex.c: In function ‘try_match’:
Python2/_regex.c:3153: warning: passing argument 1 of ‘state->encoding->at_boundary’ from incompatible pointer type
Python2/_regex.c:3153: note: expected ‘struct RE_State *’ but argument is of type ‘struct RE_State *’
Python2/_regex.c:3184: warning: passing argument 1 of ‘state->encoding->at_default_boundary’ from incompatible pointer type
Python2/_regex.c:3184: note: expected ‘struct RE_State *’ but argument is of type ‘struct RE_State *’
Python2/_regex.c: In function ‘search_start’:
Python2/_regex.c:3535: warning: assignment from incompatible pointer type
Python2/_regex.c:3581: warning: assignment from incompatible pointer type
Python2/_regex.c: In function ‘basic_match’:
Python2/_regex.c:3995: warning: assignment from incompatible pointer type
Python2/_regex.c:3996: warning: assignment from incompatible pointer type
Python2/_regex.c: At top level:
Python2/unicodedata_db.h:241: warning: ‘nfc_first’ defined but not used
Python2/unicodedata_db.h:448: warning: ‘nfc_last’ defined but not used
Python2/unicodedata_db.h:550: warning: ‘decomp_prefix’ defined but not used
Python2/unicodedata_db.h:2136: warning: ‘decomp_data’ defined but not used
Python2/unicodedata_db.h:3148: warning: ‘decomp_index1’ defined but not used
Python2/unicodedata_db.h:3333: warning: ‘decomp_index2’ defined but not used
Python2/unicodedata_db.h:4122: warning: ‘comp_index’ defined but not used
Python2/unicodedata_db.h:4241: warning: ‘comp_data’ defined but not used
Python2/unicodedata_db.h:5489: warning: ‘get_change_3_2_0’ defined but not used
Python2/unicodedata_db.h:5500: warning: ‘normalization_3_2_0’ defined but not used
Python2/_regex.c: In function ‘basic_match’:
Python2/_regex.c:4106: warning: ‘info.captures_count’ may be used uninitialized in this function
Python2/_regex.c:4720: warning: ‘info.captures_count’ may be used uninitialized in this function
Python2/_regex.c: In function ‘splitter_split’:
Python2/_regex.c:8076: warning: ‘result’ may be used uninitialized in this function
gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions build/temp.linux-i686-2.6/Python2/_regex.o -o build/lib.linux-i686-2.6/_regex.so
running install_lib
copying build/lib.linux-i686-2.6/_regex.so -> /usr/local/lib/python2.6/dist-packages
copying build/lib.linux-i686-2.6/_regex_core.py -> /usr/local/lib/python2.6/dist-packages
copying build/lib.linux-i686-2.6/regex.py -> /usr/local/lib/python2.6/dist-packages
byte-compiling /usr/local/lib/python2.6/dist-packages/_regex_core.py to _regex_core.pyc
byte-compiling /usr/local/lib/python2.6/dist-packages/regex.py to regex.pyc
running install_egg_info
Writing /usr/local/lib/python2.6/dist-packages/regex-0.1.20101123.egg-info
$ python
Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/regex-0.1.20101207-py2.6-linux-i686.egg/regex.py", line 273, in <module>
    from _regex_core import *
  File "/usr/local/lib/python2.6/dist-packages/regex-0.1.20101207-py2.6-linux-i686.egg/_regex_core.py", line 54, in <module>
    import _regex
ImportError: /usr/local/lib/python2.6/dist-packages/regex-0.1.20101207-py2.6-linux-i686.egg/_regex.so: undefined symbol: max
msg123747 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-10 20:03
issue2636-20101210.zip is a new version of the regex module.

I've extended the additional checks of the previous version.

It has been tested with Python 2.5 to Python 3.2b1.
msg124581 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-24 01:02
issue2636-20101224.zip is a new version of the regex module.

Case-insensitive matching is now faster.

The matching functions and methods now accept a keyword argument to release the GIL during matching to enable other Python threads to run concurrently:

    matches = regex.findall(pattern, string, concurrent=True)

This should be used only when it's guaranteed that the string won't change during matching.

The GIL is always released when working on instances of the builtin (immutable) string classes because that's known to be safe.
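
Usage sketch, with re standing in for regex so the snippet runs anywhere; with the regex module, passing concurrent=True to each call would let the matching itself proceed in parallel once the GIL is released:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Fan several searches out over a thread pool; with re the GIL still
# serialises the matching, but the calling pattern is the same.
texts = ['a1 b2', 'c3 d4', 'e5 f6']
pat = re.compile(r'[a-z]\d')

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(pat.findall, texts))
```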
msg124582 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-12-24 01:36
I would like to start reviewing this code, but dated zip files on a tracker make a very inefficient VC setup.  Would you consider exporting your development history to some public VC system?
msg124585 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2010-12-24 02:58
+1 on VC
msg124614 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-24 22:26
I've been trying to push the history to Launchpad, completely without success; it just won't authenticate (no such account, even though I can log in!).

I doubt that the history would be much use to you anyway.
msg124626 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-25 01:38
I suspect it would help if there were more changes, though.

I believe that to push to launchpad you have to upload an ssh key.  Not sure why you'd get "no such account", though.  Barry would probably know :)
msg124627 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-25 01:48
It does have an SSH key. It's probably something simple that I'm missing.

I think that the only change I'm likely to make is to a support script I use; it currently uses hard-coded paths, etc, to do its magic. :-)
msg124746 - (view) Author: Jacques Grove (jacques) Date: 2010-12-28 00:56
Testing issue2636-20101224.zip:

Nested modifiers seem to hang regex compilation when used in a non-capturing group, e.g.:

re.compile("(?:(?i)foo)")

or

re.compile("(?:(?u)foo)")


No problem on stock Python 2.6.5 regex engine.

The unnested version of the same regex compiles fine.
msg124750 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-28 01:45
issue2636-20101228.zip is a new version of the regex module.

Sorry for the delay, the fix took me a bit longer than I expected. :-)
msg124759 - (view) Author: Jacques Grove (jacques) Date: 2010-12-28 04:01
Another re.compile performance issue (I've seen a couple of others, but I'm still trying to simplify the test-cases):

re.compile("(?ui)(a\s?b\s?c\s?d\s?e\s?f\s?g\s?h\s?i\s?j\s?k\s?l\s?m\s?n\s?o\s?p\s?q\s?r\s?s\s?t\s?u\s?v\s?w\s?y\s?z\s?a\s?b\s?c\s?d)")

completes in around 0.01s on my machine using Python 2.6.5 standard regex library, but takes around 12 seconds using issue2636-20101228.zip
msg124816 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-28 20:05
issue2636-20101228a.zip is a new version of the regex module.

It now compiles the pattern quickly.
msg124821 - (view) Author: Jacques Grove (jacques) Date: 2010-12-28 21:57
Thanks, issue2636-20101228a.zip also resolves my compilation speed issues I had on other (very) complex regexes.

Found this one:

re.search("(X.*?Y\s*){3}(X\s*)+AB:", "XY\nX Y\nX  Y\nXY\nXX AB:")

produces a search hit with stock python 2.6.5 regex library, but not with issue2636-20101228a.zip.

re.search("(X.*?Y\s*){3,}(X\s*)+AB:", "XY\nX Y\nX  Y\nXY\nXX AB:")

matches on both, however.
msg124833 - (view) Author: Jacques Grove (jacques) Date: 2010-12-29 00:19
Here is a somewhat crazy pattern (slimmed down from something much larger and more complex, which didn't finish compiling even after several minutes): 

re.compile("(?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,>/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9])\W*(?:(?:[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{3})?)?|[Aa]{3}(?:[Aa]{5}[Aa])?|[Aa]{3}(?:[Aa](?:[Aa]{4})?)?|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{3})?)|(?:[Aa][Aa](?:[Aa](?:[Aa]{3})?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa]{3}(?:[Aa](?:[Aa]{3})?)?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?|[Aa]{3}(?:[Aa](?:[Aa](?:[Aa]{4})?)?)?|[Aa][Aa](?:[Aa](?:[Aa](?:[Aa]{3})?)?)?))\s*(\-\s*)?(?:(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??(?:(?:[\-\s\.,>/]){0,4}?)(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)|(?:[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}(?:[Aa][Aa])?|[Aa]{3}|[Aa]{4}|[Aa]{4}|[Aa]{3}(?:[Aa]{3})?|[Aa]{3}(?:[Aa](?:[Aa]{5})?)?|[Aa]{3}(?:[Aa]{4})?|[Aa]{3}(?:[Aa]{5})?|[Aa]{3}(?:[Aa]{5})?)(?:(?:[\-\s\.,>/]){0,4}?)(?:[23][0-9]|3[79]|0?[1-9])(?:[Aa][Aa]|[Aa][Aa]|[Aa][Aa])??)(?:(?:(?:[\-\s\.,>/]){0,4}?)(?:(?:68)?[7-9]\d|(?:2[79])?\d{2}))?\W*(?:[79][0-9]|2[0-4]|\d)(?:[\.:Aa])?(?:[0-5][0-9])")


Runs about 10.5 seconds on my machine with issue2636-20101228a.zip, less than 0.03 seconds with stock Python 2.6.5 regex engine.
msg124834 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-29 00:42
issue2636-20101229.zip is a new version of the regex module.

It now compiles the pattern quickly.
msg124891 - (view) Author: Jacques Grove (jacques) Date: 2010-12-30 00:06
More an observation than a bug:

I understand that we're trading memory for performance, but I've noticed that the peak memory usage is rather high, e.g.:

$ cat test.py
import os
import regex as re

def resident():
    for line in open('/proc/%d/status' % os.getpid(), 'r').readlines():
        if line.startswith("VmRSS:"):
            return line.split(":")[-1].strip()

cache = {}

print resident()
for i in xrange(0,1000):
    cache[i] = re.compile(str(i)+"(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd|sdajdwxyzlasjdf|kajsdfjkasdjkf|kasdflkasjdflkajsd|klasdfljasdf)")

print resident()


Execution output on my machine (Linux x86_64, Python 2.6.5):
4328 kB
32052 kB

with the standard re module:
3688 kB
5428 kB

So it looks like around 16x the memory per pattern vs. the standard re module.

Now the example is pretty silly, the difference is even larger for more complex regexes.  I also understand that the once the patterns are GC-ed, python can reuse the memory (pymalloc doesn't return it to the OS, unfortunately).  However, I have some applications that use large numbers (many thousands) of regexes and need to keep them cached (compiled) indefinitely (especially because compilation is expensive).  This causes some pain (long story).

I've played around with increasing RE_MIN_FAST_LENGTH, and it makes a significant difference, e.g.:

RE_MIN_FAST_LENGTH = 10:
4324 kB
25976 kB

In my use-cases, having a larger RE_MIN_FAST_LENGTH doesn't make a huge performance difference, so that might be the way I'll go.
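[Editor's note: for workloads like the one Jacques describes, where many thousands of compiled patterns are kept alive, the application side can at least bound the cache. A minimal sketch using the stdlib; `compiled` is a hypothetical helper, and eviction means occasional recompilation, so it only suits workloads that can tolerate that trade:]

```python
import functools
import re

@functools.lru_cache(maxsize=2048)
def compiled(pattern, flags=0):
    # Least-recently-used patterns are evicted once the cache fills,
    # capping peak memory at roughly maxsize * per-pattern cost.
    return re.compile(pattern, flags)
```

Repeated calls with the same arguments return the identical pattern object, so call sites can use `compiled(p).search(s)` freely without paying for recompilation on the hot path.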
msg124900 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-30 02:25
issue2636-20101230.zip is a new version of the regex module.

I've delayed the building of the tables for fast searching until their first use, which, hopefully, will mean that fewer will be actually built.
msg124904 - (view) Author: Jacques Grove (jacques) Date: 2010-12-30 04:08
Yeah, issue2636-20101230.zip DOES reduce memory usage significantly (30-50%) in my use cases; however, it also tanks performance overall by 35% for me, so I'd prefer to stick with issue2636-20101229.zip (or some variant of it).

Maybe a regex compile-time option, although that's not necessary.

Thanks for the effort.
msg124905 - (view) Author: Jacques Grove (jacques) Date: 2010-12-30 04:49
re.search('\d{4}(\s*\w)?\W*((?!\d)\w){2}', "9999XX")

matches on stock 2.6.5 regex module, but not on issue2636-20101230.zip or issue2636-20101229.zip (which I've fallen back to for now)
msg124906 - (view) Author: Jacques Grove (jacques) Date: 2010-12-30 05:24
Another one that diverges between stock regex and issue2636-20101229.zip:

re.search('A\s*?.*?(\n+.*?\s*?){0,2}\(X', 'A\n1\nS\n1 (X')
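[Editor's note: both divergences are easy to reproduce in isolation. This minimal sketch asserts the stock-engine behaviour; substituting `regex` for `re` compares a module snapshot against it:]

```python
import re

# Both searches succeed under the stock engine but failed under the
# issue2636-2010122x regex snapshots discussed above.
assert re.search(r'\d{4}(\s*\w)?\W*((?!\d)\w){2}', '9999XX')
assert re.search(r'A\s*?.*?(\n+.*?\s*?){0,2}\(X', 'A\n1\nS\n1 (X')
```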
msg124909 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2010-12-30 07:41
As belopolsky said... *please* move this development into version control.  Put it up in an Hg repo on code.google.com.  or put it on github.  *anything* other than repeatedly posting entire zip file source code drops to a bugtracker.
msg124912 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2010-12-30 08:45
Hearty +1.  I have the hope of putting this in 3.3, and for that I'd like to see how the code matures, which is much easier when in version control.
msg124929 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-30 19:13
The project is now at:

    https://code.google.com/p/mrab-regex/

Unfortunately it doesn't have the revision history. I don't know why not.
msg124931 - (view) Author: Robert Xiao (nneonneo) * Date: 2010-12-30 19:45
Do you have it in any kind of repository at all? Even a private SVN repo or something like that?
msg124936 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-30 21:39
msg124904: It would, of course, be slower on first use, but I'm surprised that it's (that much) slower afterwards.

msg124905, msg124906: I have those matching now.

msg124931: The sources are in TortoiseBzr, but I couldn't upload, so I exported to TortoiseSVN.
msg124950 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-31 04:03
Even after much uninstalling and reinstalling (and reboots) I never got TortoiseSVN to work properly, so I switched to TortoiseHg. The sources are now at:

    https://code.google.com/p/mrab-regex-hg/
msg124959 - (view) Author: Jacques Grove (jacques) Date: 2010-12-31 09:23
Thanks for putting up the hg repo, makes it much easier to follow.

Getting back to the performance regression I reported in msg124904:

I've verified that if I take the hg commit 7abd9f9bb1 , and I back out the guards changes manually, while leaving the FAST_INIT changes in, the performance is back to normal on my full regression suite (i.e. the 30-40% penalty disappears).

I've repeated my tests a few times to make sure I'm not mistaken, since the guard changes don't look like they should impact performance much, but they do.

I've attached the diff that restored the speed for me (as usual, using Python 2.6.5 on Linux x86_64)

BTW, now that we have the code on google code, can we log individual issues over there?  Might make it easier for those interested to follow certain issues than trying to comb through every individual detail in this super-issue-thread...?
msg124971 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2010-12-31 17:55
Why not? :-)
msg124988 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-01-01 02:47
Just to check, does this still work with your changes of msg124959?

    regex.search(r'\d{4}(\s*\w)?\W*((?!\d)\w){2}', "9999XX")

For me it fails to match!
msg124990 - (view) Author: Jacques Grove (jacques) Date: 2011-01-01 04:26
You're correct, after the change:

regex.search(r'\d{4}(\s*\w)?\W*((?!\d)\w){2}', "9999XX")

doesn't match (i.e. as before commit 7abd9f9bb1).

I was, however, just trying to narrow down which part of the code change killed the performance on my regression tests :-)

Happy new year to all out there.
msg125291 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-01-04 03:56
I've just done a bug fix. The issue is at:

    https://code.google.com/p/mrab-regex-hg/

BTW, Jacques, I trust that your regression tests don't test how long a regex takes to fail to match, because a bug could cause such a non-match to occur too quickly, before the regex has tried all that it should! :-)
msg126294 - (view) Author: Ronan Amicel (ronnix) Date: 2011-01-14 19:44
The regex 0.1.20110106 package fails to install with Python 2.6, due to the use of 2.7 string formatting syntax in setup.py:

    print("Copying {} to {}".format(unicodedata_db_h, SRC_DIR))

This line should be changed to:

    print("Copying {0} to {1}".format(unicodedata_db_h, SRC_DIR))

Reference: http://docs.python.org/library/string.html#formatstrings
msg126372 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-01-16 20:19
That line crept in somehow.

As it's been there since the 2010-12-24 release and you're the first one to have a problem with it (and you've already fixed it), it looks like a new upload isn't urgently needed (I don't have any other changes to make at present).
msg127045 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-01-25 19:48
I've reduced the size of some internal tables.
msg130886 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2011-03-14 20:06
Could you add me as a member or admin on the mrab-regex-hg project?  I've got a few things I want to fix in the code as I start looking into the state of this module.  gpsmith at gmail dot com is my google account.

There are some fixes in the upstream python that haven't made it into this code that I want to merge in, among other things.  I may also add a setup.py file and some scripts to make building and testing this stand-alone easier.
msg130905 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-03-14 21:25
@Gregory: I've added you to the project.

I'm currently trying to fix a problem with iterators shared across threads. As a temporary measure, the current release on PyPI doesn't enable multithreading for them.

The mrab-regex-hg project doesn't have those sources yet. I'll update them later today, either to the release on PyPI, or to a fixed version if all goes well...
msg130906 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2011-03-14 21:29
Okay. Can you push your setup.py and README and such as well?  Your pypi
release tarballs should match the hg repo and ideally include a mention of
what hg revision they are generated from. :)

-gps

msg130999 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-03-15 17:40
I've fixed the problem with iterators for both Python 3 and Python 2. They can now be shared safely across threads.

I've updated the release on PyPI.
msg135700 - (view) Author: Jonathan Halcrow (jhalcrow) Date: 2011-05-10 13:55
I'm having a problem using the current version (0.1.20110504) with python 2.5 on OSX 10.5.  When I try to import regex I get the following import error:

dlopen(<snipped>/python2.5/site-packages/_regex.so, 2): Symbol not found: _re_is_same_char_ign
  Referenced from: <snipped>/python2.5/site-packages/_regex.so
  Expected in: dynamic lookup
msg135703 - (view) Author: Jonathan Halcrow (jhalcrow) Date: 2011-05-10 14:07
It seems that _regex_unicode.c is missing from setup.py; adding it to ext_modules fixes my previous issue.
msg135704 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2011-05-10 14:08
Issues with Regexp should probably be handled on the Regexp tracker.
msg140102 - (view) Author: Alec Koumjian (akoumjian) Date: 2011-07-11 05:19
I apologize if this is the wrong place for this message. I did not see the link to a separate list.

First let me explain what I am trying to accomplish. I would like to be able to take an unknown regular expression that contains both named and unnamed groups and tag their location in the original string where a match was found. Take the following redundantly simple example:

>>> a_string = r"This is a demo sentence."
>>> pattern = r"(?<a_thing>\w+) (\w+) (?<another_thing>\w+)"
>>> m = regex.search(pattern, a_string)

What I want is a way to insert named/numbered tags into the original string, so that it looks something like this:

r"<a_thing>This</a_thing> <2>is</2> <another_thing>a</another_thing> demo sentence."

The syntax doesn't have to be exactly like that, but you get the idea. I have inserted the names and/or indices of the groups into the original string, around the spans that the groups occupy. 

This task is exceedingly difficult with the current implementation, unless I am missing something obvious. We could call the groups by index, the groups as a tuple, or the groupdict:

>>> m.group(1)
'This'
>>> m.groups()
('This', 'is', 'a')
>>> m.groupdict()
{'another_thing': 'a', 'a_thing': 'This'}

If all I wanted was to tag the groups by index, it would be a simple function. I would be able to call m.spans() for each index in the length of m.groups() and insert the <> and </> tags around the right indices.

The hard part is finding out how to find the spans of the named groups. Do any of you have a suggestion?

It would make more sense from my perspective, if each group was an object that had its own .span property. It would work like this with the above example:

>>> first = m.group(1)
>>> first.name()
'a_thing'
>>> second = m.group(2)
>>> second.name()
None
>>>

You could still call .spans() on the Match object itself, but it would query its children group objects for the data. Overall I think this would be a much more Pythonic approach, especially given that you have added subscripting and key lookup.

So instead of this:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
<type 'str'>

You could have:
>>> m['a_thing']
'This'
>>> type(m['a_thing'])
<'regex.Match.Group object'>

With the noted benefit of this:
>>> m['a_thing'].span()
(0, 4)
>>> m['a_thing'].index()
1
>>>

Maybe I'm missing a major point or functionality here, but I've been poring over the docs and don't currently think what I'm trying to achieve is possible.

Thank you for taking the time to read all this.

-Alec
msg140152 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-07-11 17:32
The new regex implementation is hosted here: https://code.google.com/p/mrab-regex-hg/

The span of m['a_thing'] is m.span('a_thing'), if that helps.

The named groups are listed on the pattern object, which can be accessed via m.re:

>>> m.re
<_regex.Pattern object at 0x0161DE30>
>>> m.re.groupindex
{'another_thing': 3, 'a_thing': 1}

so you can use that to create a reverse dict to go from the index to the name or None. (Perhaps the pattern object should have such a .group_name attribute.)
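[Editor's note: the groupindex hint above can be turned into the tagging helper Alec described. This sketch is written against the stdlib `re` API (which regex mirrors, apart from also accepting the `(?<name>...)` spelling); `tag_groups` is a hypothetical name, and it assumes the groups are non-nested and matched left to right:]

```python
import re

def tag_groups(pattern, text):
    """Wrap each matched group in <name>...</name> (or <n>...</n>) tags."""
    m = re.search(pattern, text)
    if not m:
        return text
    # Reverse groupindex to map group number -> name (absent for unnamed groups).
    index_to_name = {v: k for k, v in m.re.groupindex.items()}
    pieces, pos = [], 0
    for i in range(1, m.re.groups + 1):
        start, end = m.span(i)
        if start == -1:  # group did not participate in the match
            continue
        label = index_to_name.get(i, str(i))
        pieces.append(text[pos:start])
        pieces.append('<%s>%s</%s>' % (label, text[start:end], label))
        pos = end
    pieces.append(text[pos:])
    return ''.join(pieces)
```

With the pattern from msg140102 (in re's `(?P<name>...)` spelling) this yields `<a_thing>This</a_thing> <2>is</2> <another_thing>a</another_thing> demo sentence.`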
msg140154 - (view) Author: Alec Koumjian (akoumjian) Date: 2011-07-11 17:40
Thanks, Matthew. I did not realize I could access either of those. I should be able to build a helper function now to do what I want.
msg143090 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2011-08-28 06:12
I'm not sure if this belongs here, or on the Google code project page, so I'll add it in both places :)

Feature request: please change the NEW flag to something else. In five or six years (give or take), the re module will be long forgotten, compatibility with it will not be needed, so-called "new" features will no longer be new, and the NEW flag will just be silly.

If you care about future compatibility, some sort of version specification would be better, e.g. "VERSION=0" (current re module), "VERSION=1" (this regex module), "VERSION=2" (next generation). You could then default to VERSION=0 for the first few releases, and potentially change to VERSION=1 some time in the future.

Otherwise, I suggest swapping the sense of the flag: instead of "re behaviour unless NEW flag is given", I'd say "re behaviour only if OLD flag is given". (Old semantics will, of course, remain old even when the new semantics are no longer new.)
msg143333 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-01 17:13
I tried to run a test suite of 3kloc (not just about regexes, but regexes were used in several places) and I had only one failure:
>>> s = u'void foo ( type arg1 [, type arg2 ] )'
>>> re.sub('(?<=[][()]) |(?!,) (?!\[,)(?=[][(),])', '', s)
u'void foo(type arg1 [, type arg2])'
>>> regex.sub('(?<=[][()]) |(?!,) (?!\[,)(?=[][(),])', '', s)
u'void foo ( type arg1 [, type arg2 ] )'

Note that when the two patterns are used independently they both yield the same result on re and regex, but once they are combined the results differ:
>>> re.sub('(?<=[][()]) ', '', s)
u'void foo (type arg1 [, type arg2 ])'
>>> regex.sub('(?<=[][()]) ', '', s)
u'void foo (type arg1 [, type arg2 ])'

>>> re.sub('(?!,) (?!\[,)(?=[][(),])', '', s)
u'void foo( type arg1 [, type arg2])'
>>> regex.sub('(?!,) (?!\[,)(?=[][(),])', '', s)
u'void foo( type arg1 [, type arg2])'
msg143334 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-09-01 17:50
The regex module supports nested sets and set operations, eg. r"[[a-z]--[aeiou]]" (the letters from 'a' to 'z', except the vowels). This means that literal '[' in a set needs to be escaped.

For example, re module sees "[][()]..." as:

    [      start of set
     ]     literal ']'
     [()   literals '[', '(', ')'
    ]      end of set
    ...   ...

but the regex module sees it as:

    [      start of set
     ]     literal ']'
     [()]  nested set [()]
     ...   ...

Thus:

>>> s = u'void foo ( type arg1 [, type arg2 ] )'
>>> regex.sub(r'(?<=[][()]) |(?!,) (?!\[,)(?=[][(),])', '', s)
u'void foo ( type arg1 [, type arg2 ] )'
>>> regex.sub('(?<=[]\[()]) |(?!,) (?!\[,)(?=[]\[(),])', '', s)
u'void foo(type arg1 [, type arg2])'

If it can't parse it as a nested set, it tries again as a non-nested set (like re), but there are bound to be regexes where it could be either.
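[Editor's note: the ambiguity can be probed with the stdlib engine. Under `re` both spellings below denote the same set, and the escaped form is the one that stays unambiguous once nested sets are in play; a minimal sketch:]

```python
import re

# Under re, "[][()]" is a single set containing ']', '[', '(' and ')':
assert re.findall(r'[][()]', 'a[b]c(d)') == ['[', ']', '(', ')']

# Escaping the inner '[' is a no-op for re, but under regex's nested-set
# syntax it is what prevents "[()]" from being read as an inner set:
assert re.findall(r'[]\[()]', 'a[b]c(d)') == ['[', ']', '(', ')']
```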
msg143337 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-01 18:12
Thanks for the explanation, but isn't this a backward incompatible feature?
I think it should be enabled only when the re.NEW flag is passed.
The idiom [][...] is also quite common, so I think it might break existing programs if regex has a different behavior.
msg143340 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-09-01 18:18
> Thanks for the explanation, but isn't this a backward incompatible
> feature?
> I think it should be enabled only when the re.NEW flag is passed.
> The idiom [][...] is also quite common, so I think it might break
> different programs if regex has a different behavior.

As someone said, I'd rather have a re.COMPAT flag.  re.NEW will look
silly in a few years.
Also, we can have a warning about unescaped brackets during a
transitional period. However, it really needs the warning to be enabled
by default, IMO.
msg143343 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-01 18:40
Changing the name of the flag is fine with me.

Having a warning for unescaped brackets that trigger set operations might also be a solution (once escaped they will still work on the old re).  Maybe the same could also be done for scoped flags.

FWIW I tried to come up with a simpler regex that makes some sense and triggers unwanted set operations and I didn't come up with anything except:
>>> regex.findall('[[(]foo[)]]', '[[foo] (foo)]')
['f', 'o', 'o', '(', 'f', 'o', 'o', ')']
>>> re.findall('[[(]foo[)]]', '[[foo] (foo)]')
['(foo)]']
(but this doesn't make too much sense).  Complex regexes will still break though, so the issue needs to be addressed somehow.
msg143350 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-09-01 20:12
I think I need a show of hands.

Should the default be old behaviour (like re) or new behaviour? (It might be old now, new later.)

Should there be a NEW flag (as at present), or an OLD flag, or a VERSION parameter (0=old, 1=new, 2=?)?
msg143352 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-09-01 20:16
> I think I need a show of hands.
> 
> Should the default be old behaviour (like re) or new behaviour? (It
> might be old now, new later.)
> 
> Should there be a NEW flag (as at present), or an OLD flag, or a
> VERSION parameter (0=old, 1=new, 2=?)?

VERSION might be best, but then it should probably be a separate
argument rather than a flag.

"old now, new later" doesn't solve the issue unless we have a careful
set of warnings to point out problematic regexes.
msg143355 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2011-09-01 20:57
On 1 September 2011 16:12, Matthew Barnett <report@bugs.python.org> wrote:
>
> Matthew Barnett <python@mrabarnett.plus.com> added the comment:
>
> I think I need a show of hands.

For my part, I recommend literal flags, i.e. re.VERSION222,
re.VERSION300, etc.  Then you know exactly what you're getting and
although it may be confusing, we can then slowly deprecate
re.VERSION222 so that people can get used to the new syntax.

Returning to lurking on my own issue.  :)
msg143366 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-02 01:30
In order to replace the re module, regex must have the same behavior (except for bugs, where the correct behavior is most likely preferred, even if it's different).

Having re.OLD and warnings active by default in 3.3 (and possibly 3.4) should give enough time to fix the regex if/when necessary (either by changing the regex or by adding the re.OLD flag manually).  In 3.4 (or 3.5) we can then change the default behavior to the new semantics.

In this way we won't have to keep using the re.NEW flag on every regex.  I'm not sure if a version flag is useful, unless you are planning to add more incompatible changes.  Also each new version *flag* means one more path to add/maintain in the code.  Having a simple .regex_version attribute might be a more practical (albeit less powerful) solution.
msg143367 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2011-09-02 02:13
Matthew Barnett wrote:
> Matthew Barnett <python@mrabarnett.plus.com> added the comment:
> 
> I think I need a show of hands.
> 
> Should the default be old behaviour (like re) or new behaviour? (It might be old now, new later.)
> 
> Should there be a NEW flag (as at present), or an OLD flag, or a VERSION parameter (0=old, 1=new, 2=?)?

I prefer Antoine's suggested spelling, COMPAT, rather than OLD.

How would you write the various options? After the transition it's easy:

     # Get backwards-compatible behaviour:
     compile(string, COMPAT)
     compile(string, VERSION0)

     # Get regex non-compatible behaviour:
     compile(string)  # will be the default in the future
     compile(string, VERSION1)

But what about during the transition, when backwards-compatible 
behaviour is the default? There needs to be a way to turn compatibility 
mode off, not just turn it on.

     # Get backwards-compatible behaviour:
     compile(string)  # will be the default for a release or two
     compile(string, COMPAT)
     compile(string, VERSION0)

     # Get regex non-compatible behaviour:
     compile(string, VERSION1)

So I guess my preference is VERSION0 and VERSION1 flags, even if there 
is never going to be a VERSION2.
msg143374 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-02 03:38
Also note that some behaviors are not "old" or "compatible", but just different.  For example, why should inline flags be part of the old (or new) behavior?  Or what if I want the behavior of version 2 but not that of 0 and 1?
Also what if I want zero-width splits but not nested sets and set operations?  Or if I want inline flags but not zero-width splits?

A new set of "features" flags might be an alternative approach.  It will also make possible to add new features that are not backward compatible that can be turned on explicitly with their flag.

It would be fine for me if I had to turn on explicitly e.g. nested sets if/when I'll need to use them, and keep having the "normal" behavior otherwise.

OTOH there are three problems with this approach:
  1) it's not compatible with regex (I guess people will use the external module in Python <3.3 and the included one in 3.3+, probably expecting the same semantics).  This is also true with the OLD/COMPAT flag though;
  2) it might require other inline feature-flags;
  3) the new set of flags might be added to the other flags or be separate, so e.g. re.compile(pattern, flags=re.I|re.NESTEDSETS) or re.compile(pattern, flags=re.I, features=re.NESTEDSETS).  I'm not sure it's a good idea to add another arg though.

Matthew, is there a comprehensive list of all the bugfix/features that have a different behavior from re?
We should first check what changes are acceptable and what aren't, and depending on how many and what they are we can then decide what is the best approach (a catch-all flag or several flags to change the behavior, transition period + warning before setting it as default, etc.)
msg143377 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2011-09-02 04:04
Ezio Melotti wrote:
> Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> 
> Also note that some behaviors are not "old" or "compatible", but just different.  For example why inline flags should be the old (or new) behavior?  Or e.g. the behavior of version 2 but not 0 and 1?
> Also what if I want zero-width splits but not nested sets and sets operations?  Or if I want inline flags but not zero-width splits?

I think this is adding excessive complexity. Please consider poor 
Matthew's sanity (or whoever ends up maintaining the module long term), 
not to mention that of the users of the module.

I think it is reasonable to pick a *set* of features as a whole:

"I want the regex module to behave exactly the same as the re module"

or

"I don't care about the re module, give me all the goodies offered by 
the regex module"

but I don't think it is reasonable to expect to pick and choose 
individual features:

"I want zero-width splits but not nested sets or inline flags, and I 
want the locale flag to act like the re module, and ASCII characters to 
be treated just like in Perl, but non-ASCII characters to be treated 
just like grep, and a half double decaff half-caf soy mocha with a twist 
of lemon with a dash of half-fat unsweetened whipped cream on the side."

<wink>

If you don't want a feature, don't use it.

"Feature flags" leads to a combinational explosion that makes 
comprehensive testing all but impossible. If you have four features 
A...D, then for *each* feature you need sixteen tests:

A with flags 0000
A with flags 0001
A with flags 0010
A with flags 0011
[...]
A with flags 1111

to ensure that there are no side-effects from turning features off. The 
alternative is hard to track down bugs:

"this regular expression returns the wrong result, but only if you have 
flags A, B and G turned on and C and F turned off."
msg143389 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-02 08:45
> I think this is adding excessive complexity.

It really depends on how many incompatible features there are, and how difficult it is to turn them on/off.

> I think it is reasonable to pick a *set* of features as a whole

It's probably more practical, but otherwise I'm not sure why you would want to activate 2-5 unrelated features that might require you to rewrite your regex (assuming you are aware of what the features are, what their side effects are, and how to fix your regex) just because you need one.

The idea is to make the transition smoother and not having a pre-regex world and an incompatible post-regex world, divided by a single flag.

> If you don't want a feature, don't use it.

With only one flag you are forced to enable all the new features, including the ones you don't want.

> "Feature flags" leads to a combinational explosion that makes
> comprehensive testing all but impossible.

We already have several flags and the tests are working fine.  If the features are orthogonal they can be tested independently.

> The alternative is hard to track down bugs:
> "this regular expression returns the wrong result, but only if you
> have flags A, B and G turned on and C and F turned off."

What about: "X works, Y works, and X|Y works, but when I use the NEW flag to enable an inline flag, X|Y stops working while X and Y keep working" (hint: NEW also enabled nested sets -- see msg143333).

I'm not saying that having multiple flags is the best solution (or even a viable one), but it should be considered depending on how many incompatible features there are and what they are.
msg143423 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-09-02 17:40
The least disruptive change would be to have a NEW flag for the new behaviour, as at present, and an OLD flag for the old behaviour.

Currently the default is old behaviour, but in the future it will be new behaviour.

The differences would be:

Old behaviour                   : New behaviour
-------------                     -------------
Global inline flags             : Positional inline flags
Can't split on zero-width match : Can split on zero-width match
Simple sets                     : Nested sets and set operations

The only change would be that nested sets wouldn't be supported in the old behaviour.

There are also additional escape sequences, eg \X is no longer treated as "X", but as they look like escape sequences you really shouldn't be relying on that. (It's similar to writing Windows paths in non-raw string literals: "\T" == "\\T", but "\t" == chr(9).)
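[Editor's note: the first two rows of that table can be illustrated with the stdlib engine, since CPython's own re later adopted both behaviours (scoped inline flags in 3.6, zero-width splits in 3.7). A sketch of what the "new behaviour" column means in practice:]

```python
import re

# Positional (scoped) inline flags: case-insensitivity applies only
# inside the (?i:...) group, not to the trailing 'd'.
assert re.match(r'(?i:abc)d', 'ABCd')
assert re.match(r'(?i:abc)d', 'ABCD') is None

# Splitting on a zero-width match (here, a lookahead before each capital):
assert re.split(r'(?=[A-Z])', 'CamelCaseWord') == ['', 'Camel', 'Case', 'Word']
```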
msg143442 - (view) Author: Vlastimil Brom (vbr) Date: 2011-09-02 22:39
I'd agree with Steven (msg143377) and others that there probably shouldn't be a large library-specific set of new flags just for "housekeeping" purposes between re and regex. I would personally prefer that these flags also be settable in the pattern via (?...), which would probably be problematic with versioned flags.

Although I am trying to take advantage of the new additions where applicable, I agree that there should be a possibility to use regex in an unreflected way with the same behaviour as re (except perhaps for fixes of whatever is agreed to be enough of a bug).
On the other hand, it seems to me that the enhancements/additions can be enabled all at once, as a user consciously upgrading their regexes for the new library (or a new user not knowing re) can be expected to know the new features and their implications. I guess it is mostly trivially possible to fix/disambiguate the problematic patterns, e.g. by escaping.

As for setting the new/old behaviour, would there be a possibility to distinguish it just by importing (possibly through some magic, without the need to duplicate the code?), 
import re_in_compat_mode as re
vs:
import re_with_all_the_new_features as re

Unfortunately, I have no idea whether this is possible or viable...
With this option, the (user) code update could be just a change of the imports instead of adding the flags in all relevant places (and taking them away as redundant as the defaults evolve with the versions...).

However, it is not clear how this "aliasing" would work out with regard to the transition; maybe the long differentiated "module" names could be kept and the meaning of "import re" would change, along with the previous warnings, in some future version.

just a few thoughts...
   vbr
msg143445 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2011-09-03 00:17
Being able to set which behavior you want in a (?XXX) flag at the start of the regex is valuable so that applications that take a regex can support the new syntax automatically when the python version they are running on is updated.  The (?XXX) should override whatever re.XXX flag was provided to re.compile().

Notice I said XXX.  I'm not interested in a naming bikeshed other than agreeing with the fact that NEW will seem quaint 10 years from now, so it's best to use non-temporal names.  COMPAT, VERSION2, VERSION3, WITH_GOATS, PONY, etc. are all non-temporal and do allow us to change the default away from "old" behavior at a future date beyond 3.3.
msg143447 - (view) Author: Matthew Barnett (mrabarnett) * Date: 2011-09-03 00:30
So, VERSION0 and VERSION1, with "(?V0)" and "(?V1)" in the pattern?
msg143448 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-09-03 00:32
If these are the only 3 non-backward compatible features and the nested set one is moved under the NEW flag, I guess the approach might work without having per-feature flags.

The "NEW" could be kept for compatibility for regex (if necessary), possibly aliasing it with VERSION1 or whatever name wins the bikeshed.

If you want to control that at import time, maybe a from __future__ import new_re_semantics could be used instead of a flag, but I'm not sure about that.
msg143467 - (view) Author: Jeffrey C. Jacobs (timehorse) Date: 2011-09-03 16:22
Although V1, V2 is less wordy, technically the current behavior is version 2.2.2, so logically this should be re.VERSION222 vs. re.VERSION3 vs. re.VERSIONn, with corresponding "(?V222)", "(?V3)" and future "(?Vn)".  But that said, I think 2.2.2 can be shortened to 2, so basically start counting from there.
msg143471 - (view) Author: Vlastimil Brom (vbr) Date: 2011-09-03 18:20
Not that it matters in any way, but if the regex semantics have to be distinguished via "non-standard" custom flags, I would prefer even less wordy ones: ideally flags whose short in-pattern forms are a single letter (like all the other flags), preferably backed by plain English words to serve as mnemonics (which I don't see in numbered versions that require keeping track of the rather internal library versioning).
Unfortunately, it might be difficult to find suitable names, given the objections expressed against the ones already discussed. (For what it is worth, I thought e.g. of [t]raditional and [e]nhanced, but these also suffer from some of the mentioned disadvantages.)
vbr
msg143619 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2011-09-06 16:07
Matthew Barnett wrote:
> So, VERSION0 and VERSION1, with "(?V0)" and "(?V1)" in the pattern?

Seems reasonable to me.

+1
msg144110 - (view) Author: Matt Chaput (mattchaput) Date: 2011-09-15 22:45
Not sure if this is better as a separate feature request or a comment here, but... the new version of .NET includes an option to specify a time limit on evaluation of regexes (not sure if this is a feature in other regex libs). This would be useful especially when you're executing regexes configured by the user and you don't know if/when they might go exponential. Something like this maybe:

# Raises an re.Timeout if not complete within 60 seconds
match = myregex.match(mystring, maxseconds=60.0)
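Pending engine-level support, the closest approximation available today with the stdlib is a sketch along these lines (hypothetical helper name; note the worker thread cannot actually be cancelled, so a truly runaway match keeps consuming CPU until it finishes, which is exactly why in-engine deadline checks, as in .NET, are attractive):

```python
import re
import threading

def match_with_timeout(pattern, string, timeout):
    """Best-effort matching deadline: run re.match in a daemon thread and
    stop waiting after `timeout` seconds.  The thread itself cannot be
    aborted; we merely give up waiting for its answer."""
    result = []
    worker = threading.Thread(
        target=lambda: result.append(re.match(pattern, string)),
        daemon=True,
    )
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        raise TimeoutError('regex evaluation exceeded %.1fs' % timeout)
    return result[0]
```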
msg152210 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-01-29 06:26
As part of the PEP 408 discussions, Guido approved the addition of 'regex' in 3.3 (using that name, rather than as a drop-in replacement for re) [1,2]

That should greatly ease the backwards compatibility concerns, even if it isn't as transparent an upgrade path.

[1] http://mail.python.org/pipermail/python-dev/2012-January/115961.html
[2] http://mail.python.org/pipermail/python-dev/2012-January/115962.html
msg152211 - (view) Author: Alex Gaynor (alex) * (Python committer) Date: 2012-01-29 06:28
So, to my reading of the compatibility PEP this cannot be added wholesale,
unless there is a pure Python version as well.  However, if it replaced re
(read: patched) it would be valid.

msg152212 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-01-29 07:24
I created a new sandbox branch to integrate regex into CPython, see "remote repo" field.

I mainly had to adapt the test suite to use unittest.
msg152214 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-01-29 07:42
Alex has a valid point in relation to PEP 399, since, like lzma, regex will be coming in under the "special permission" clause that allows the addition of C extension modules without pure Python equivalents. Unlike lzma, though, the new regex engine isn't a relatively simple wrapper around an existing library - supporting the new API features on other implementations is going to mean a substantial amount of work.

In practice, I expect that a pure Python implementation of a regular expression engine would only be fast enough to be usable on PyPy. So while we'd almost certainly accept a patch that added a parallel Python implementation, I doubt it would actually help Jython or IronPython all that much - they're probably going to need versions written in Java and C# to be effective (as I believe they already have for the re module).
msg152215 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) Date: 2012-01-29 08:01
> In practice, I expect that a pure Python implementation of a regular expression engine would only be fast enough to be usable on PyPy.

Not sure why this is necessarily true. I'd expect a pure-Python implementation to be maybe 200 times as slow. Many queries (those on relatively short strings that backtrack little) finish within microseconds. On that scale, a couple of orders of magnitude is not noticeable by humans (unless it adds up), and even where it does get noticeable, it's better than having nothing at all or a non-working program (up to a point).

python -m timeit -n 1000000 -s "import re; x = re.compile(r'.*<\s*help\s*>([^<]*)<\s*/\s*help.*>'); data = ' '*1000 + '< help >' + 'abc'*100 + '</help>'" "x.match(data)"
1000000 loops, best of 3: 3.27 usec per loop
msg152217 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2012-01-29 08:31
Well, REs are very often used to process large chunks of text by repeated application.  So if the whole operation takes 0.1 or 20 seconds you're going to notice :)
msg152218 - (view) Author: Devin Jeanpierre (Devin Jeanpierre) Date: 2012-01-29 08:37
It'd be nice if we had some sort of representative benchmark for real-world uses of Python regexps. The JS guys have all pitched in to create such a thing for uses of regexps on the web. I don't know of any such thing for Python.

I agree that a Python implementation wouldn't be useful for some cases. On the other hand, I believe it would be fine (or at least tolerable) for some others. I don't know the ratio between the two.
msg152246 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-01-29 17:21
> It'd be nice if we had some sort of representative benchmark for
> real-world uses of Python regexps. The JS guys have all pitched in to
> create such a thing for uses of regexps on thew web. I don't know of
> any such thing for Python.

See http://hg.python.org/benchmarks/, there are regex benchmarks there.

> I agree that a Python implementation wouldn't be useful for some
> cases. On the other hand, I believe it would be fine (or at least
> tolerable) for some others. I don't know the ratio between the two.

I think the ratio would be something like 2% tolerable :)

As I said to Ezio and Georg, I think adding the regex module needs a
PEP, even if it ends up non-controversial.
msg157445 - (view) Author: Sandro Tosi (sandro.tosi) * (Python committer) Date: 2012-04-03 21:47
I've just uploaded regex into Debian: this will hopefully gives some more eyes looking at the module and reporting some feedbacks.
msg174888 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2012-11-05 10:58
I've been working through the "known crashers" list in the stdlib. The recursive import one was fixed with the migration to importlib in 3.3, the compiler one will be fixed in 3.3.1 (with an enforced nesting limit). One of those remaining is actually a pathological failure in the re module rather than a true crasher (i.e. it doesn't segfault, and in 2.7 and 3.3 you can interrupt it with Ctrl-C):
http://hg.python.org/cpython/file/default/Lib/test/crashers/infinite_loop_re.py

I mention it here as another problem that adopting the regex module could resolve (as regex promptly returns None for this case).
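For reference, the kind of pattern involved looks like the following (illustrative only, not the exact contents of the crashers file); nested quantifiers make the sre backtracker enumerate exponentially many ways of splitting the input before it can report failure:

```python
import re

# Illustrative pathological pattern: an outer quantifier over an inner
# quantified group.  Matching succeeds instantly, but *failing* requires
# trying every partition of the 'a' run, which grows exponentially.
PATHOLOGICAL = re.compile(r'(a+)+$')

PATHOLOGICAL.match('a' * 10)          # succeeds immediately
# PATHOLOGICAL.match('a' * 40 + 'b')  # would take on the order of 2**40 steps
```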
History
Date User Action Args
2013-04-01 18:57:16terry.reedylinkissue1528154 dependencies
2012-11-27 09:02:15mark.dickinsonsetversions: + Python 3.4, - Python 3.3
2012-11-05 10:58:36ncoghlansetmessages: + msg174888
2012-04-03 21:47:29sandro.tosisetnosy: + sandro.tosi
messages: + msg157445
2012-02-11 20:59:57tshepangsetnosy: + tshepang
2012-01-29 17:21:19pitrousetmessages: + msg152246
2012-01-29 08:37:52Devin Jeanpierresetmessages: + msg152218
2012-01-29 08:31:33georg.brandlsetmessages: + msg152217
2012-01-29 08:01:43Devin Jeanpierresetnosy: + Devin Jeanpierre
messages: + msg152215
2012-01-29 07:42:22ncoghlansetmessages: + msg152214
2012-01-29 07:40:06georg.brandlsetnosy: + loewis, georg.brandl, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, ncoghlan, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, steven.daprano, alex, r.david.murray, jacques, zdwiel, jhalcrow, stiv, davide.rizzo, mattchaput, ronnix, eric.snow, akoumjian
2012-01-29 07:39:53georg.brandlsetfiles: - issue2636-20101229.zip
2012-01-29 07:39:45georg.brandlsetfiles: - issue2636-20101228a.zip
2012-01-29 07:39:37georg.brandlsetfiles: - issue2636-20101228.zip
2012-01-29 07:39:28georg.brandlsetfiles: - issue2636-20101224.zip
2012-01-29 07:39:20georg.brandlsetfiles: - issue2636-20101210.zip
2012-01-29 07:39:09georg.brandlsetfiles: - issue2636-20101207.zip
2012-01-29 07:39:00georg.brandlsetfiles: - issue2636-20101130.zip
2012-01-29 07:38:52georg.brandlsetfiles: - issue2636-20101123.zip
2012-01-29 07:38:43georg.brandlsetfiles: - issue2636-20101121.zip
2012-01-29 07:38:35georg.brandlsetfiles: - issue2636-20101120.zip
2012-01-29 07:38:26georg.brandlsetfiles: - issue2636-20101113.zip
2012-01-29 07:38:14georg.brandlsetfiles: - issue2636-20101106.zip
2012-01-29 07:38:06georg.brandlsetfiles: - issue2636-20101102a.zip
2012-01-29 07:37:57georg.brandlsetfiles: - issue2636-20101102.zip
2012-01-29 07:37:48georg.brandlsetfiles: - issue2636-20101101.zip
2012-01-29 07:37:36georg.brandlsetfiles: - issue2636-20101030a.zip
2012-01-29 07:37:12georg.brandlsetfiles: - issue2636-20101030.zip
2012-01-29 07:37:03georg.brandlsetfiles: - issue2636-20101029.zip
2012-01-29 07:36:55georg.brandlsetfiles: - issue2636-20101009.zip
2012-01-29 07:36:46georg.brandlsetfiles: - issue2636-20100918.zip
2012-01-29 07:36:38georg.brandlsetfiles: - issue2636-20100913.zip
2012-01-29 07:36:28georg.brandlsetfiles: - issue2636-20100912.zip
2012-01-29 07:36:17georg.brandlsetfiles: - issue2636-20100824.zip
2012-01-29 07:36:02georg.brandlsetfiles: - unnamed
2012-01-29 07:35:51georg.brandlsetfiles: - issue2636-20100816.zip
2012-01-29 07:35:42georg.brandlsetfiles: - issue2636-20100814.zip
2012-01-29 07:35:34georg.brandlsetfiles: - issue2636-20100725.zip
2012-01-29 07:35:24georg.brandlsetfiles: - issue2636-20100719.zip
2012-01-29 07:35:16georg.brandlsetfiles: - issue2636-20100709.zip
2012-01-29 07:35:05georg.brandlsetfiles: - issue2636-20100706.zip
2012-01-29 07:34:56georg.brandlsetfiles: - issue2636-20100414.zip
2012-01-29 07:34:45georg.brandlsetfiles: - build.log
2012-01-29 07:34:37georg.brandlsetfiles: - setup.py
2012-01-29 07:34:28georg.brandlsetfiles: - test_regex_20100413
2012-01-29 07:33:43georg.brandlsetfiles: - issue2636-20100413.zip
2012-01-29 07:33:31georg.brandlsetfiles: - issue2636-20100331.zip
2012-01-29 07:33:22georg.brandlsetfiles: - issue2636-20100323.zip
2012-01-29 07:33:14georg.brandlsetfiles: - issue2636-20100305.zip
2012-01-29 07:33:05georg.brandlsetfiles: - issue2636-20100304.zip
2012-01-29 07:32:56georg.brandlsetfiles: - issue2636-20100226.zip
2012-01-29 07:32:44georg.brandlsetfiles: - issue2636-20100225.zip
2012-01-29 07:32:35georg.brandlsetfiles: - issue2636-20100224.zip
2012-01-29 07:32:27georg.brandlsetfiles: - issue2636-20100223.zip
2012-01-29 07:32:18georg.brandlsetnosy: - loewis, georg.brandl, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, ncoghlan, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, steven.daprano, alex, r.david.murray, jacques, zdwiel, jhalcrow, stiv, davide.rizzo, mattchaput, ronnix, eric.snow, akoumjian
-> (no value)
2012-01-29 07:31:58georg.brandlsetfiles: - issue2636-20100222.zip
2012-01-29 07:31:40georg.brandlsetfiles: - Features-backslashes.patch
2012-01-29 07:31:28georg.brandlsetfiles: - issue2636-20100219.zip
2012-01-29 07:31:15georg.brandlsetfiles: - issue2636-20100218.zip
2012-01-29 07:31:06georg.brandlsetfiles: - issue2636-20100217.zip
2012-01-29 07:30:57georg.brandlsetfiles: - issue2636-20100211.zip
2012-01-29 07:30:42georg.brandlsetfiles: - issue2636-20100210.zip
2012-01-29 07:30:29georg.brandlsetfiles: - issue2636-20100204.zip
2012-01-29 07:29:30georg.brandlsetfiles: - issue2636-20100116.zip
2012-01-29 07:29:20georg.brandlsetfiles: - issue2636-20090815.zip
2012-01-29 07:29:10georg.brandlsetfiles: - issue2636-20090810#3.zip
2012-01-29 07:28:58georg.brandlsetfiles: - issue2636-20090810#2.zip
2012-01-29 07:28:48georg.brandlsetfiles: - issue2636-20090810.zip
2012-01-29 07:28:39georg.brandlsetfiles: - issue2636-20090804.zip
2012-01-29 07:28:30georg.brandlsetfiles: - issue2636-20090729.zip
2012-01-29 07:28:21georg.brandlsetfiles: - issue2636-20090727.zip
2012-01-29 07:27:57georg.brandlsetfiles: - issue2636-20090726.zip
2012-01-29 07:27:38georg.brandlsetfiles: - issue2636-patch-2.diff
2012-01-29 07:27:24georg.brandlsetfiles: - issue2636-patch-1.diff
2012-01-29 07:27:11georg.brandlsetfiles: - issue2636-features-6.diff
2012-01-29 07:27:02georg.brandlsetfiles: - issue2636-features-5.diff
2012-01-29 07:26:46georg.brandlsetfiles: - issue2636-features-4.diff
2012-01-29 07:26:37georg.brandlsetfiles: - issue2636-features-3.diff
2012-01-29 07:26:24georg.brandlsetfiles: - issue2636-features-2.diff
2012-01-29 07:26:14georg.brandlsetfiles: - issue2636-features.diff
2012-01-29 07:26:05georg.brandlsetfiles: - issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff
2012-01-29 07:25:49georg.brandlsetfiles: - issue2636-01+09-02+17_backport.diff
2012-01-29 07:25:38georg.brandlsetfiles: - issue2636-02.patch
2012-01-29 07:25:23georg.brandlsetfiles: - issue2636-patches.tar.bz2
2012-01-29 07:24:37georg.brandlsethgrepos: + hgrepo108
messages: + msg152212
2012-01-29 06:28:40alexsetmessages: + msg152211
2012-01-29 06:26:29ncoghlansetnosy: + ncoghlan
messages: + msg152210
2011-09-15 22:45:17mattchaputsetnosy: + mattchaput
messages: + msg144110
2011-09-06 16:07:44steven.dapranosetmessages: + msg143619
2011-09-03 18:20:29vbrsetmessages: + msg143471
2011-09-03 16:22:55timehorsesetmessages: + msg143467
2011-09-03 00:32:21ezio.melottisetmessages: + msg143448
2011-09-03 00:30:50mrabarnettsetmessages: + msg143447
2011-09-03 00:17:39gregory.p.smithsetmessages: + msg143445
2011-09-02 22:39:50vbrsetmessages: + msg143442
2011-09-02 17:40:21mrabarnettsetmessages: + msg143423
2011-09-02 08:45:12ezio.melottisetmessages: + msg143389
2011-09-02 04:04:14steven.dapranosetmessages: + msg143377
2011-09-02 03:38:12ezio.melottisetmessages: + msg143374
2011-09-02 02:13:10steven.dapranosetmessages: + msg143367
2011-09-02 01:30:31ezio.melottisetmessages: + msg143366
2011-09-01 20:57:58timehorsesetmessages: + msg143355
2011-09-01 20:16:57pitrousetmessages: + msg143352
2011-09-01 20:12:57mrabarnettsetmessages: + msg143350
2011-09-01 18:40:00ezio.melottisetmessages: + msg143343
2011-09-01 18:18:25pitrousetmessages: + msg143340
2011-09-01 18:12:43ezio.melottisetmessages: + msg143337
2011-09-01 17:50:49mrabarnettsetmessages: + msg143334
2011-09-01 17:13:06ezio.melottisetmessages: + msg143333
2011-08-29 15:56:48eric.araujosettitle: Regexp 2.7 (modifications to current re 2.2.2) -> Adding a new regex module (compatible with re)
components: + Library (Lib)
versions: + Python 3.3
2011-08-28 06:12:30steven.dapranosetnosy: + steven.daprano
messages: + msg143090
2011-07-11 17:46:50eric.snowsetnosy: + eric.snow
2011-07-11 17:42:41collinwintersetnosy: - collinwinter
2011-07-11 17:40:50akoumjiansetmessages: + msg140154
2011-07-11 17:38:26brian.curtinsetnosy: - brian.curtin
2011-07-11 17:32:08mrabarnettsetmessages: + msg140152
2011-07-11 05:19:48akoumjiansetnosy: + akoumjian

messages: + msg140102
versions: - Python 3.3
2011-05-10 14:08:44brian.curtinsetnosy: + brian.curtin
messages: + msg135704
2011-05-10 14:07:01jhalcrowsetmessages: + msg135703
2011-05-10 13:55:43jhalcrowsetmessages: + msg135700
2011-03-15 17:40:50mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, alex, r.david.murray, jacques, zdwiel, jhalcrow, stiv, davide.rizzo, ronnix
messages: + msg130999
2011-03-14 21:29:28gregory.p.smithsetfiles: + unnamed

messages: + msg130906
nosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, alex, r.david.murray, jacques, zdwiel, jhalcrow, stiv, davide.rizzo, ronnix
2011-03-14 21:25:17mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, alex, r.david.murray, jacques, zdwiel, jhalcrow, stiv, davide.rizzo, ronnix
messages: + msg130905
2011-03-14 20:06:24gregory.p.smithsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, alex, r.david.murray, jacques, zdwiel, jhalcrow, stiv, davide.rizzo, ronnix
messages: + msg130886
2011-03-11 14:03:31alexsetnosy: + alex
2011-03-08 10:24:00davide.rizzosetnosy: + davide.rizzo
2011-01-25 19:48:03mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv, ronnix
messages: + msg127045
2011-01-16 20:19:36mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv, ronnix
messages: + msg126372
2011-01-14 19:44:28ronnixsetnosy: + ronnix
messages: + msg126294
2011-01-04 03:56:21mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg125291
2011-01-01 04:26:20jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124990
2011-01-01 02:47:45mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124988
2010-12-31 17:55:18mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124971
2010-12-31 09:23:53jacquessetfiles: + remove_guards.diff
nosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124959
2010-12-31 04:03:38mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124950
2010-12-30 21:39:38mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124936
2010-12-30 19:45:30nneonneosetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124931
2010-12-30 19:13:39mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124929
2010-12-30 08:45:35georg.brandlsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124912
2010-12-30 07:41:24gregory.p.smithsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124909
2010-12-30 05:24:29jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124906
2010-12-30 04:49:51jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124905
2010-12-30 04:08:57jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124904
2010-12-30 02:25:57mrabarnettsetfiles: + issue2636-20101230.zip
nosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124900
2010-12-30 00:06:55jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124891
2010-12-29 00:42:23mrabarnettsetfiles: + issue2636-20101229.zip
nosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124834
2010-12-29 00:19:37jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124833
2010-12-28 21:57:41jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124821
2010-12-28 20:05:51mrabarnettsetfiles: + issue2636-20101228a.zip
nosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124816
2010-12-28 04:01:02jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124759
2010-12-28 01:45:50mrabarnettsetfiles: + issue2636-20101228.zip
nosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124750
2010-12-28 00:56:28jacquessetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124746
2010-12-25 01:48:34mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124627
2010-12-25 01:38:59r.david.murraysetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124626
2010-12-24 22:26:55mrabarnettsetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124614
2010-12-24 02:58:41timehorsesetnosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, belopolsky, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124585
2010-12-24 01:36:55belopolskysetnosy: + belopolsky
messages: + msg124582
2010-12-24 01:02:12mrabarnettsetfiles: + issue2636-20101224.zip
nosy: loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, vbr, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jacques, zdwiel, jhalcrow, stiv
messages: + msg124581
2010-12-14 18:12:11belopolskylinkissue10704 superseder
2010-12-14 18:09:57belopolskylinkissue10703 superseder
2010-12-13 18:57:47eric.araujosetstage: patch review
type: compile error -> enhancement
versions: + Python 3.3, - Python 2.6
2010-12-10 20:03:06mrabarnettsetfiles: + issue2636-20101210.zip

messages: + msg123747
2010-12-07 07:17:13zdwielsetversions: + Python 2.6, - Python 3.2
nosy: + zdwiel

messages: + msg123527

type: enhancement -> compile error
2010-12-07 03:26:35mrabarnettsetfiles: + issue2636-20101207.zip

messages: + msg123518
2010-11-30 04:37:38mrabarnettsetfiles: + issue2636-20101130.zip

messages: + msg122880
2010-11-23 18:34:03mrabarnettsetfiles: + issue2636-20101123.zip

messages: + msg122228
2010-11-23 16:31:55r.david.murraysettype: behavior -> enhancement
messages: + msg122225
2010-11-23 15:58:01stivsettype: enhancement -> behavior

messages: + msg122221
nosy: + stiv
2010-11-21 01:54:31mrabarnettsetfiles: + issue2636-20101121.zip

messages: + msg121832
2010-11-20 01:43:19mrabarnettsetfiles: + issue2636-20101120.zip

messages: + msg121589
2010-11-13 18:13:45vbrsetmessages: + msg121149
2010-11-13 17:15:23mrabarnettsetfiles: + issue2636-20101113.zip

messages: + msg121145
2010-11-13 13:47:27vbrsetmessages: + msg121136
2010-11-11 23:48:56moreatisetmessages: + msg120986
2010-11-11 23:13:58mrabarnettsetmessages: + msg120984
2010-11-11 22:20:14vbrsetmessages: + msg120976
2010-11-11 21:00:49moreatisetmessages: + msg120969
2010-11-06 02:22:50mrabarnettsetfiles: + issue2636-20101106.zip

messages: + msg120571
2010-11-02 19:12:13mrabarnettsetfiles: + issue2636-20101102a.zip

messages: + msg120243
2010-11-02 12:08:13vbrsetmessages: + msg120216
2010-11-02 11:56:05vbrsetmessages: + msg120215
2010-11-02 04:52:19jacquessetmessages: + msg120206
2010-11-02 04:08:01jacquessetmessages: + msg120204
2010-11-02 03:51:34mrabarnettsetfiles: + issue2636-20101102.zip

messages: + msg120203
2010-11-02 02:49:45jacquessetmessages: + msg120202
2010-11-01 20:28:25mrabarnettsetfiles: + issue2636-20101101.zip

messages: + msg120164
2010-10-31 06:09:28jacquessetmessages: + msg120038
History
Date	User	Action	Args
2010-10-31 05:27:47	jacques	set	messages: + msg120037
2010-10-30 20:15:40	mrabarnett	set	files: + issue2636-20101030a.zip; messages: + msg120013
2010-10-30 04:40:19	jacques	set	messages: + msg119958
2010-10-30 03:39:06	mrabarnett	set	files: + issue2636-20101030.zip; messages: + msg119956
2010-10-30 00:48:15	jacques	set	messages: + msg119951
2010-10-29 22:33:03	mrabarnett	set	files: + issue2636-20101029.zip; messages: + msg119947
2010-10-29 19:36:46	mrabarnett	set	messages: + msg119930
2010-10-29 11:11:32	jacques	set	nosy: + jacques; messages: + msg119887
2010-10-14 16:21:52	vbr	set	messages: + msg118682
2010-10-14 15:42:29	mrabarnett	set	messages: + msg118674
2010-10-14 08:55:39	vbr	set	messages: + msg118640
2010-10-14 08:41:44	loewis	set	messages: + msg118636
2010-10-14 08:13:42	vbr	set	messages: + msg118631
2010-10-09 03:08:40	mrabarnett	set	files: + issue2636-20101009.zip; messages: + msg118243
2010-09-21 14:17:36	vbr	set	messages: + msg117050
2010-09-21 11:41:33	mrabarnett	set	messages: + msg117046
2010-09-20 23:51:35	vbr	set	messages: + msg117008
2010-09-18 02:55:58	mrabarnett	set	files: + issue2636-20100918.zip; messages: + msg116749
2010-09-13 04:24:46	mrabarnett	set	files: + issue2636-20100913.zip; messages: + msg116276
2010-09-12 23:34:27	vbr	set	messages: + msg116252
2010-09-12 23:16:20	brian.curtin	set	nosy: - brian.curtin
2010-09-12 23:14:54	mrabarnett	set	messages: + msg116248
2010-09-12 22:01:05	vbr	set	messages: + msg116238
2010-09-12 21:16:17	georg.brandl	set	messages: + msg116231
2010-09-12 20:47:44	mrabarnett	set	messages: + msg116229
2010-09-12 20:15:20	vbr	set	messages: + msg116227
2010-09-12 18:42:24	mrabarnett	set	messages: + msg116223
2010-09-12 06:47:12	georg.brandl	set	messages: + msg116151
2010-09-11 23:37:24	mrabarnett	set	files: + issue2636-20100912.zip; messages: + msg116133
2010-08-24 02:13:35	mrabarnett	set	files: + issue2636-20100824.zip; messages: + msg114766
2010-08-22 12:39:36	giampaolo.rodola	set	nosy: + giampaolo.rodola
2010-08-22 00:06:08	georg.brandl	link	issue433029 superseder
2010-08-22 00:03:35	georg.brandl	link	issue433027 superseder
2010-08-22 00:03:21	georg.brandl	link	issue433024 superseder
2010-08-21 23:53:41	georg.brandl	link	issue1721518 superseder
2010-08-21 23:46:11	georg.brandl	link	issue3825 superseder
2010-08-17 18:58:27	akuchling	set	nosy: - akuchling
2010-08-16 02:04:43	mrabarnett	set	files: + issue2636-20100816.zip; messages: + msg114034
2010-08-14 21:18:27	moreati	set	messages: + msg113931
2010-08-14 20:24:08	mrabarnett	set	files: + issue2636-20100814.zip; messages: + msg113927
2010-07-29 17:47:04	georg.brandl	link	issue6156 superseder
2010-07-29 13:29:12	georg.brandl	set	messages: + msg111921
2010-07-26 18:32:58	mrabarnett	set	messages: + msg111660
2010-07-26 17:50:07	timehorse	set	messages: + msg111656
2010-07-26 17:41:25	mrabarnett	set	messages: + msg111652
2010-07-26 16:53:33	ezio.melotti	set	messages: + msg111643
2010-07-25 09:20:35	moreati	set	messages: + msg111531
2010-07-25 02:46:13	mrabarnett	set	files: + issue2636-20100725.zip; messages: + msg111519
2010-07-19 14:43:19	mrabarnett	set	messages: + msg110761
2010-07-19 01:37:21	vbr	set	messages: + msg110704
2010-07-19 00:15:36	mrabarnett	set	files: + issue2636-20100719.zip; messages: + msg110701
2010-07-13 21:56:48	moreati	set	messages: + msg110237
2010-07-13 21:34:22	jhalcrow	set	nosy: + jhalcrow; messages: + msg110233
2010-07-09 01:20:24	mrabarnett	set	files: + issue2636-20100709.zip; messages: + msg109657
2010-07-07 13:48:15	georg.brandl	set	messages: + msg109474
2010-07-07 09:29:55	mark	set	messages: + msg109463
2010-07-07 09:13:39	mark	set	messages: + msg109461
2010-07-07 08:57:04	mark	set	messages: + msg109460
2010-07-07 01:45:21	mrabarnett	set	messages: + msg109447
2010-07-06 17:50:56	moreati	set	messages: + msg109413
2010-07-06 17:34:15	georg.brandl	set	messages: + msg109410
2010-07-06 17:30:33	vbr	set	messages: + msg109409
2010-07-06 17:16:24	timehorse	set	messages: + msg109408
2010-07-06 17:07:04	ezio.melotti	set	messages: + msg109407
2010-07-06 17:03:40	mrabarnett	set	messages: + msg109406
2010-07-06 16:38:30	ezio.melotti	set	messages: + msg109405
2010-07-06 16:29:29	brian.curtin	set	nosy: + brian.curtin; messages: + msg109404
2010-07-06 16:27:47	ezio.melotti	set	messages: + msg109403
2010-07-06 16:17:38	mrabarnett	set	messages: + msg109401
2010-07-06 11:25:56	moreati	set	messages: + msg109384
2010-07-06 08:43:00	ezio.melotti	set	messages: + msg109372
2010-07-06 00:02:32	mrabarnett	set	files: + issue2636-20100706.zip; messages: + msg109363
2010-07-05 21:42:29	vbr	set	messages: + msg109358
2010-06-19 16:41:55	ezio.melotti	set	versions: + Python 3.2, - Python 3.1, Python 2.7
2010-04-13 23:39:58	moreati	set	messages: + msg103097
2010-04-13 23:34:49	mrabarnett	set	files: + issue2636-20100414.zip; messages: + msg103096
2010-04-13 23:33:13	mrabarnett	set	messages: + msg103095
2010-04-13 19:46:45	moreati	set	files: + setup.py, build.log; messages: + msg103078
2010-04-13 17:10:36	mrabarnett	set	messages: + msg103064
2010-04-13 16:23:42	moreati	set	files: + test_regex_20100413; messages: + msg103060
2010-04-13 02:21:50	mrabarnett	set	files: + issue2636-20100413.zip; messages: + msg103003
2010-03-31 22:26:03	mrabarnett	set	files: + issue2636-20100331.zip; messages: + msg102042
2010-03-23 01:21:24	mrabarnett	set	files: + issue2636-20100323.zip; messages: + msg101557
2010-03-16 21:37:50	vbr	set	messages: + msg101193
2010-03-16 19:31:24	ezio.melotti	set	messages: + msg101181
2010-03-16 15:56:37	moreati	set	files: + regex_test-20100316; messages: + msg101172
2010-03-05 03:27:29	mrabarnett	set	files: + issue2636-20100305.zip; messages: + msg100452
2010-03-04 01:45:27	vbr	set	messages: + msg100370
2010-03-04 00:41:57	mrabarnett	set	files: + issue2636-20100304.zip; messages: + msg100362
2010-03-03 23:48:24	vbr	set	messages: + msg100359
2010-02-26 14:36:17	moreati	set	messages: + msg100152
2010-02-26 03:20:17	mrabarnett	set	files: + issue2636-20100226.zip; messages: + msg100134
2010-02-25 00:12:54	mrabarnett	set	files: + issue2636-20100225.zip; messages: + msg100080
2010-02-24 23:14:04	vbr	set	messages: + msg100076
2010-02-24 20:25:00	mrabarnett	set	files: + issue2636-20100224.zip; messages: + msg100066
2010-02-23 01:31:05	vbr	set	messages: + msg99892
2010-02-23 00:47:49	moreati	set	messages: + msg99890
2010-02-23 00:39:04	mrabarnett	set	files: + issue2636-20100223.zip; messages: + msg99888
2010-02-22 23:28:30	mrabarnett	set	files: + issue2636-20100222.zip; messages: + msg99872
2010-02-22 23:10:55	mrabarnett	set	files: - issue2636-20100222.zip
2010-02-22 22:51:33	vbr	set	messages: + msg99863
2010-02-22 21:24:31	mrabarnett	set	files: + issue2636-20100222.zip; messages: + msg99835
2010-02-21 16:21:20	mrabarnett	set	messages: + msg99668
2010-02-21 14:46:40	moreati	set	files: + Features-backslashes.patch; messages: + msg99665
2010-02-19 01:31:23	mrabarnett	set	files: + issue2636-20100219.zip; messages: + msg99552
2010-02-19 00:29:46	vbr	set	messages: + msg99548
2010-02-18 03:03:19	mrabarnett	set	files: + issue2636-20100218.zip; messages: + msg99494
2010-02-17 23:43:25	vbr	set	messages: + msg99481
2010-02-17 19:35:45	mrabarnett	set	messages: + msg99479
2010-02-17 13:01:55	moreati	set	messages: + msg99470
2010-02-17 04:09:28	mrabarnett	set	files: + issue2636-20100217.zip; messages: + msg99462
2010-02-11 02:16:55	mrabarnett	set	files: + issue2636-20100211.zip; messages: + msg99190
2010-02-11 01:09:51	vbr	set	messages: + msg99186
2010-02-10 02:20:06	mrabarnett	set	files: + issue2636-20100210.zip; messages: + msg99148
2010-02-09 17:38:03	vbr	set	messages: + msg99132
2010-02-08 23:45:59	vbr	set	messages: + msg99072
2010-02-04 02:34:44	mrabarnett	set	files: + issue2636-20100204.zip; messages: + msg98809; versions: + Python 3.1
2010-01-16 03:00:02	mrabarnett	set	files: + issue2636-20100116.zip; messages: + msg97860
2009-12-31 15:26:36	ezio.melotti	set	priority: normal
2009-08-24 12:55:50	vbr	set	messages: + msg91917
2009-08-17 20:29:51	moreati	set	messages: + msg91671
2009-08-15 16:12:29	mrabarnett	set	files: + issue2636-20090815.zip; messages: + msg91610
2009-08-15 14:02:20	sjmachin	set	messages: + msg91607
2009-08-15 07:49:47	mark	set	messages: + msg91598
2009-08-13 21:14:03	moreati	set	messages: + msg91535
2009-08-12 18:01:38	collinwinter	set	messages: + msg91500
2009-08-12 12:42:50	doerwalter	set	nosy: - doerwalter
2009-08-12 12:29:12	timehorse	set	messages: + msg91497
2009-08-12 12:16:21	pitrou	set	messages: + msg91496
2009-08-12 12:04:09	timehorse	set	messages: + msg91495
2009-08-12 03:00:20	sjmachin	set	messages: + msg91490
2009-08-11 12:59:22	r.david.murray	set	nosy: + r.david.murray; messages: + msg91474
2009-08-11 11:15:30	vbr	set	messages: + msg91473
2009-08-10 22:42:18	mrabarnett	set	files: + issue2636-20090810#3.zip; messages: + msg91463
2009-08-10 22:02:00	gregory.p.smith	set	messages: + msg91462
2009-08-10 19:27:46	vbr	set	messages: + msg91460
2009-08-10 15:04:49	mrabarnett	set	files: + issue2636-20090810#2.zip; messages: + msg91450
2009-08-10 14:18:57	mrabarnett	set	files: + issue2636-20090810.zip; messages: + msg91448
2009-08-10 10:58:09	sjmachin	set	messages: + msg91439
2009-08-10 08:54:54	vbr	set	nosy: + vbr; messages: + msg91437
2009-08-04 01:30:19	mrabarnett	set	files: + issue2636-20090804.zip; messages: + msg91250
2009-08-03 22:36:34	sjmachin	set	nosy: + sjmachin; messages: + msg91245
2009-07-29 13:01:31	ezio.melotti	set	messages: + msg91038
2009-07-29 11:10:25	mrabarnett	set	files: + issue2636-20090729.zip; messages: + msg91035
2009-07-29 11:09:49	mrabarnett	set	files: - issue2636-20090729.zip
2009-07-29 00:56:31	mrabarnett	set	files: + issue2636-20090729.zip; messages: + msg91028
2009-07-27 17:53:10	akuchling	set	messages: + msg90989
2009-07-27 17:36:54	gregory.p.smith	set	messages: + msg90986
2009-07-27 16:13:03	mrabarnett	set	files: + issue2636-20090727.zip; messages: + msg90985
2009-07-26 21:29:23	georg.brandl	set	messages: + msg90961
2009-07-26 19:11:52	mrabarnett	set	files: + issue2636-20090726.zip; messages: + msg90954
2009-06-23 20:52:48	doerwalter	set	nosy: + doerwalter; messages: + msg89643
2009-06-23 17:01:34	mrabarnett	set	messages: + msg89634
2009-06-23 16:29:08	akitada	set	nosy: + akitada; messages: + msg89632
2009-05-20 01:31:06	rhettinger	unlink	issue5337 dependencies
2009-04-16 14:58:26	mrabarnett	set	files: + issue2636-patch-2.diff; messages: + msg86032
2009-04-15 23:13:41	gregory.p.smith	set	messages: + msg86004
2009-04-15 22:59:42	gregory.p.smith	set	nosy: + gregory.p.smith
2009-03-31 21:11:02	georg.brandl	link	issue5337 dependencies
2009-03-29 00:44:33	mrabarnett	set	files: + issue2636-patch-1.diff; messages: + msg84350
2009-03-23 01:42:54	mrabarnett	set	messages: + msg83993
2009-03-23 00:08:38	nneonneo	set	messages: + msg83989
2009-03-22 23:33:29	mrabarnett	set	messages: + msg83988
2009-03-10 12:14:22	timehorse	set	messages: + msg83429
2009-03-10 12:08:04	pitrou	set	messages: + msg83428
2009-03-10 12:00:47	timehorse	set	messages: + msg83427
2009-03-09 23:09:54	loewis	set	messages: + msg83411
2009-03-09 15:15:54	timehorse	set	messages: + msg83390
2009-03-07 14:19:11	jaylogan	set	nosy: + jaylogan
2009-03-07 11:27:06	loewis	set	nosy: + loewis; messages: + msg83277
2009-03-07 02:48:16	mrabarnett	set	files: + issue2636-features-6.diff; messages: + msg83271
2009-03-01 01:42:47	mrabarnett	set	files: + issue2636-features-5.diff; messages: + msg82950
2009-02-26 01:23:14	mrabarnett	set	files: + issue2636-features-4.diff; messages: + msg82739
2009-02-26 00:42:48	collinwinter	set	nosy: + collinwinter
2009-02-24 19:29:15	mrabarnett	set	files: + issue2636-features-3.diff; messages: + msg82673
2009-02-09 19:17:44	pitrou	set	messages: + msg81475
2009-02-09 19:09:55	pitrou	set	messages: + msg81473
2009-02-08 08:44:52	ezio.melotti	set	nosy: + ezio.melotti
2009-02-08 00:39:45	mrabarnett	set	files: + issue2636-features-2.diff; messages: + msg81359
2009-02-06 00:06:03	nneonneo	set	messages: + msg81240
2009-02-06 00:03:00	mrabarnett	set	messages: + msg81239
2009-02-05 23:52:49	rsc	set	messages: + msg81238
2009-02-05 23:13:07	nneonneo	set	nosy: + nneonneo; messages: + msg81236
2009-02-03 23:08:08	mrabarnett	set	files: + issue2636-features.diff; messages: + msg81112
2009-02-01 19:25:08	moreati	set	messages: + msg80916
2008-10-18 22:54:49	moreati	set	nosy: + moreati
2008-10-17 12:28:06	mrabarnett	set	messages: + msg74904
2008-10-02 22:51:06	mrabarnett	set	messages: + msg74204
2008-10-02 22:49:59	mrabarnett	set	messages: + msg74203
2008-10-02 16:48:15	mrabarnett	set	messages: + msg74174
2008-09-30 23:42:31	mrabarnett	set	files: + issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff; messages: + msg74104
2008-09-30 00:45:09	mrabarnett	set	files: + issue2636-01+09-02+17_backport.diff; messages: + msg74058
2008-09-29 12:36:07	timehorse	set	messages: + msg74026
2008-09-29 11:48:00	timehorse	set	messages: + msg74025
2008-09-28 02:52:00	mrabarnett	set	messages: + msg73955
2008-09-26 18:04:38	timehorse	set	messages: + msg73875
2008-09-26 16:28:10	timehorse	set	messages: + msg73861
2008-09-26 16:00:54	mrabarnett	set	messages: + msg73855
2008-09-26 15:43:46	timehorse	set	messages: + msg73854
2008-09-26 15:16:18	mrabarnett	set	messages: + msg73853
2008-09-26 13:11:22	timehorse	set	messages: + msg73848
2008-09-25 23:59:05	mrabarnett	set	messages: + msg73827
2008-09-25 17:36:07	timehorse	set	messages: + msg73805
2008-09-25 17:01:11	mrabarnett	set	messages: + msg73803
2008-09-25 16:32:38	timehorse	set	messages: + msg73801
2008-09-25 15:57:45	mrabarnett	set	messages: + msg73798
2008-09-25 14:17:06	timehorse	set	messages: + msg73794
2008-09-25 13:43:28	mrabarnett	set	messages: + msg73791
2008-09-25 12:23:25	timehorse	set	messages: + msg73782
2008-09-25 11:57:54	timehorse	set	messages: + msg73780
2008-09-25 11:56:40	mrabarnett	set	messages: + msg73779
2008-09-25 00:06:33	timehorse	set	messages: + msg73766
2008-09-24 19:45:57	timehorse	set	messages: + msg73752
2008-09-24 16:33:35	georg.brandl	set	nosy: + georg.brandl; messages: + msg73730
2008-09-24 15:48:49	mrabarnett	set	messages: + msg73721
2008-09-24 15:09:28	timehorse	set	messages: + msg73717
2008-09-24 14:28:03	mrabarnett	set	nosy: + mrabarnett; messages: + msg73714
2008-09-22 21:31:44	georg.brandl	link	issue433031 superseder
2008-09-16 11:59:48	timehorse	set	title: Regexp 2.6 (modifications to current re 2.2.2) -> Regexp 2.7 (modifications to current re 2.2.2); messages: + msg73295; versions: + Python 2.7, - Python 2.6
2008-09-13 13:40:22	pitrou	set	messages: + msg73185
2008-06-19 14:15:54	mark	set	messages: + msg68409
2008-06-19 12:01:31	timehorse	set	messages: + msg68399
2008-06-18 07:13:25	mark	set	messages: + msg68358
2008-06-17 19:07:22	timehorse	set	files: + issue2636-02.patch; messages: + msg68339
2008-06-17 17:44:14	timehorse	set	files: - issue2636-07-only.diff
2008-06-17 17:44:10	timehorse	set	files: - issue2636-07.diff
2008-06-17 17:44:06	timehorse	set	files: - issue2636-05.diff
2008-06-17 17:44:03	timehorse	set	files: - issue2636.diff
2008-06-17 17:43:59	timehorse	set	files: - issue2636-05-only.diff
2008-06-17 17:43:54	timehorse	set	files: - issue2636-09.patch
2008-06-17 17:43:39	timehorse	set	files: + issue2636-patches.tar.bz2; messages: + msg68336
2008-05-29 19:00:39	timehorse	set	files: - issue2636-07.patch
2008-05-29 19:00:25	timehorse	set	files: + issue2636-07-only.diff
2008-05-29 18:59:39	timehorse	set	files: + issue2636-07.diff
2008-05-29 18:58:37	timehorse	set	files: - issue2636-05.diff
2008-05-29 18:58:22	timehorse	set	files: + issue2636-05.diff
2008-05-29 18:57:34	timehorse	set	files: - issue2636.diff
2008-05-29 18:56:29	timehorse	set	files: + issue2636.diff
2008-05-28 13:57:25	timehorse	set	messages: + msg67448
2008-05-28 13:38:46	mark	set	nosy: + mark; messages: + msg67447
2008-05-24 21:40:35	timehorse	set	files: - issue2636-05.patch
2008-05-24 21:40:24	timehorse	set	files: + issue2636-05.diff
2008-05-24 21:39:57	timehorse	set	files: + issue2636-05-only.diff
2008-05-24 21:39:09	timehorse	set	files: + issue2636.diff; messages: + msg67309
2008-05-01 14:16:21	timehorse	set	messages: + msg66033
2008-04-26 11:51:14	timehorse	set	messages: + msg65841
2008-04-26 10:08:05	pitrou	set	nosy: + pitrou; messages: + msg65838
2008-04-24 20:55:49	rsc	set	nosy: + rsc
2008-04-24 18:09:25	jimjjewett	set	messages: + msg65734
2008-04-24 16:06:27	timehorse	set	messages: + msg65727
2008-04-24 14:31:53	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc; messages: + msg65726
2008-04-24 14:23:35	jimjjewett	set	nosy: + jimjjewett; messages: + msg65725
2008-04-18 14:50:44	timehorse	set	files: + issue2636-05.patch; messages: + msg65617
2008-04-18 14:23:19	timehorse	set	files: + issue2636-07.patch; messages: + msg65614
2008-04-18 13:38:57	timehorse	set	files: + issue2636-09.patch; keywords: + patch; messages: + msg65613
2008-04-17 22:07:00	timehorse	set	messages: + msg65593
2008-04-15 13:22:10	akuchling	set	components: + Regular Expressions, - Library (Lib)
2008-04-15 12:49:43	akuchling	set	nosy: + akuchling
2008-04-15 11:57:51	timehorse	create	