Message 65513 - Python tracker


Author	timehorse
Recipients	timehorse
Date	2008-04-15.11:57:47
Message-id	<1208260672.14.0.711874677361.issue2636@psf.upfronthosting.co.za>
In-reply-to

Content

I am working on adding features to the current Regexp implementation,
which is now set to 2.2.2.  These features are meant to bring the Regexp
code closer in line with Perl 5.10, as well as to add a few
Python-specific niceties and potential speed-ups and clean-ups.

I will be posting regular patch updates to this thread when major
milestones have been reached, with a description of the feature(s) added.
Currently, the list of proposed changes is (in no particular order):

1) Fix issue 433030 (http://bugs.python.org/issue433030) by adding
support for Atomic Grouping and Possessive Quantifiers.

2) Make named matches direct attributes of the match object; i.e.
instead of m.group('foo'), one will be able to write simply m.foo.

3) (maybe) make Match objects subscriptable, such that m[n] is
equivalent to m.group(n) and allow slicing.

4) Implement Perl-style back-references including relative back-references.

5) Add a well-formed, Python-specific comment modifier, e.g. (?P#...).
The difference between (?P#...) and Perl/Python's (?#...) is that the
former will allow nested parentheses as well as parenthetical escaping,
allowing patterns of the form '(?P# Evaluate (the following) expression,
3\) using some other technique)'.  (?P#...) will interpret this entire
expression as a comment, whereas with (?#...) everything following
' expression...' would be considered part of the match.  (?P#...) will
necessarily be slower than (?#...), so it should only be used if a
richer commenting style is required but the verbose mode is not desired.

6) Add official support for fast, non-repeating capture groups with the
Template option.  Template is unofficially supported and disables all
repeat operators (*, + and ?).  This would mainly consist of documenting
its behavior.

7) Modify the re compiled expression cache to better handle the
thrashing condition.  Currently, when a regular expression is compiled,
the result is cached so that if the same expression is compiled again,
it is retrieved from the cache and no extra work has to be done.  This
cache supports up to 100 entries.  Once the 100th entry is reached, the
cache is cleared and every expression must be compiled anew.  The
danger, albeit rare, is that one may compile the 100th expression and
then have to recompile an expression that was compiled only 3
expressions ago, doing the same work all over again.  By modifying this
logic slightly, it is possible to give each compiled entry a time stamp
taken from an arbitrary counter and, instead of clearing the entire
cache when it reaches capacity, eliminate only the oldest half of the
cache, keeping the half that is more recent.  This should limit the
possibility of thrashing to cases where a very large number of regular
expressions are continually recompiled.  In addition to this, I will
raise the limit to 256 entries, meaning that the 128 most recent are
kept.

8) Emacs/Perl-style character classes, e.g. [:alphanum:].  For instance,
:alphanum: would not include the '_' in the character class, unlike \w.

9) C-Engine speed-ups.  I am commenting and cleaning up the _sre.c
Regexp engine to make it flow more linearly, rather than with all the
current gotos, and replacing the switch-case statements with lookup
tables, which tests have shown to be faster.  This will also include
adding many more comments to the C code in order to make it easier for
future developers to follow.  These changes are subject to testing, and
some modifications may not be included in the final release if they are
shown to be slower than the existing code.  Also, a number of macros are
being eliminated where appropriate.

10) Export any value shared between the Python code and the C code that
is not already exported, e.g. the default Maximum Repeat count (65536);
this will allow those constants to be changed in one central place.

11) Various other Perl 5.10 conformance modifications, TBD.
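
To make item 1) more concrete, here is a minimal sketch of what atomic grouping changes, written against the existing re module. It does not use any new syntax: it emulates an atomic group with a lookahead plus a back-reference, relying on the fact that the engine does not backtrack into a lookahead once it has succeeded. This emulation is my own illustration and not the implementation proposed for issue 433030.

```python
import re

# Ordinary greedy quantifier: \d+ can give back one digit so that the
# trailing \d still finds something to match.
backtracking = re.compile(r"\d+\d")

# Emulated atomic group: the lookahead captures the longest run of digits
# and \1 then consumes exactly that text.  Nothing can be given back, so
# the trailing \d has no digit left to match.
atomic_like = re.compile(r"(?=(\d+))\1\d")

# The same emulation still matches when what follows is not a digit.
atomic_then_x = re.compile(r"(?=(\d+))\1x")

print(backtracking.search("123") is not None)   # True
print(atomic_like.search("123") is None)        # True
print(atomic_then_x.search("123x").group())     # 123x
```

The second pattern fails on purpose: that is the behaviour an atomic group or possessive quantifier is meant to provide directly, without the extra capture group.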


More items may come and suggestions are welcome.
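
As an illustration of items 2) and 3), here is a rough pure-Python sketch of the intended behaviour. AttrMatch is a hypothetical wrapper name used only for this example; the real change would be made on the match object itself at the C level, and the slicing semantics shown here (group 0 followed by the numbered groups) are an assumption.

```python
import re


class AttrMatch:
    """Hypothetical wrapper around a match object (not part of re)."""

    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        # Named groups become attributes; anything else is delegated
        # to the wrapped match object (group, span, start, ...).
        if name in self._match.re.groupindex:
            return self._match.group(name)
        return getattr(self._match, name)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Treat group 0 plus the numbered groups as one sequence.
            everything = (self._match.group(0),) + self._match.groups()
            return everything[key]
        return self._match.group(key)


m = AttrMatch(re.match(r"(?P<foo>\w+)-(?P<bar>\d+)", "spam-42"))
print(m.foo)      # same as m.group('foo')
print(m[2])       # same as m.group(2)
print(m[1:3])     # groups 1 and 2 as a tuple
```

One design question this sketch exposes is name clashes: a group named after an existing match method, such as 'start', would shadow that method under this scheme.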

-----

Currently, I have code which implements 5) and 7), have done some work
on 10) and am almost done with 9).  When 9) is complete, I will work on
1), some of which, such as parsing, is already done, then probably 8)
and 4) because they should not require too much work -- 4) is
parser-only AFAICT.  Then, I will attempt 2) and 3), though those will
require changes at the C-Code level.  Then I will investigate what
additional elements of 11) I can easily implement.  Finally, I will
write documentation for all of these features, including 6).
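
For reference, the cache policy described in 7) can be sketched as follows. This is a simplified stand-alone illustration, not the patch: the class name HalfEvictingCache and its methods are hypothetical, and refreshing the time stamp on a cache hit is an assumption on my part.

```python
import re


class HalfEvictingCache:
    """Hypothetical pattern cache that drops the oldest half when full."""

    def __init__(self, limit=256):
        self.limit = limit
        self._counter = 0      # arbitrary counter used as a time stamp
        self._entries = {}     # (pattern, flags) -> [stamp, compiled]

    def compile(self, pattern, flags=0):
        key = (pattern, flags)
        self._counter += 1
        entry = self._entries.get(key)
        if entry is not None:
            entry[0] = self._counter   # assumption: a hit refreshes the stamp
            return entry[1]
        if len(self._entries) >= self.limit:
            self._evict_oldest_half()
        compiled = re.compile(pattern, flags)
        self._entries[key] = [self._counter, compiled]
        return compiled

    def _evict_oldest_half(self):
        # Sort keys from oldest to newest stamp and delete the first half.
        by_age = sorted(self._entries, key=lambda k: self._entries[k][0])
        for key in by_age[: len(by_age) // 2]:
            del self._entries[key]

    def __len__(self):
        return len(self._entries)


cache = HalfEvictingCache(limit=4)
for p in ["a", "b", "c", "d"]:
    cache.compile(p)
cache.compile("a")     # hit: "a" becomes the most recent entry
cache.compile("e")     # cache is full: the two oldest, "b" and "c", go
print(len(cache))      # 3
```

With the default limit of 256, a full cache shrinks to the 128 most recent entries instead of being emptied, so a recently used expression survives the clean-up.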

In a few days, I will provide a patch with my interim results and will
post updated patches as further milestones are reached.
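
To show the intended semantics of the (?P#...) comment from 5), here is a small sketch that strips such comments from a pattern before handing it to the existing re.compile. The function name strip_nested_comments is hypothetical, the real feature would live in the parser rather than in a pre-processing step, and this sketch deliberately ignores character classes such as [(?P#].

```python
import re


def strip_nested_comments(pattern):
    """Remove (?P#...) comments, honouring nesting and escapes."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        if pattern.startswith("(?P#", i):
            depth = 1
            i += 4
            while i < n and depth:
                ch = pattern[i]
                if ch == "\\":
                    i += 2      # skip the escaped character, e.g. \)
                    continue
                if ch == "(":
                    depth += 1
                elif ch == ")":
                    depth -= 1
                i += 1
            if depth:
                raise ValueError("unterminated (?P#...) comment")
        elif pattern[i] == "\\":
            out.append(pattern[i:i + 2])   # copy escapes outside comments
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


source = (r"a(?P# Evaluate (the following) expression, "
          r"3\) using some other technique)b")
print(strip_nested_comments(source))   # ab
print(re.compile(strip_nested_comments(source)).match("ab") is not None)
```

The nested '(the following)' and the escaped '\)' are both consumed as part of the comment, which is exactly where the plain (?#...) form would stop too early.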
