
Author timehorse
Recipients akuchling, amaury.forgeotdarc, jimjjewett, mark, pitrou, rsc, timehorse
Date 2008-06-17.17:43:20
Message-id <1213724620.33.0.615366054985.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
Well, it's time for another update on my progress...

Some good news first: Atomic Grouping is now completed, tested and 
documented, and as stated above, is classified as issue2636-01 and 
related patches.  Secondly, with caveats listed below, Named Match Group 
Attributes on a match object (item 2) is also more or less complete at 
issue2636-02 -- it only lacks documentation.

Now, I want to also update my list of items.  We left off at 11: Other 
Perl-specific modifications.  Since that time, I have spawned a number 
of other branches, the first of which (issue2636-12) I am happy to 
announce is also complete!

12) Implement the changes to the documentation of re as per Jim J. 
Jewett's suggestion from 2008-04-24 14:09.  Again, this has been done.

13) Implement a grouptuples(...) method as per Mark Summerfield's 
suggestion on 2008-05-28 09:38.  grouptuples would take the same 
filtering parameters as the other group* functions, and would return a 
list of 3-tuples (unless only 1 group was requested).  It should default 
to all match groups (1..n, not group 0, the matching string).
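
To make Item 13 concrete, here is a rough sketch of what such a method 
might return, written as a standalone helper on top of the existing re 
module.  The contents of each 3-tuple are not spelled out above, so the 
(text, start, end) layout used here is purely an assumption, as is the 
helper itself; it is not the proposed C implementation.

```python
import re

def grouptuples(match, *groups):
    """Hypothetical helper: one (text, start, end) tuple per requested group.

    The (text, start, end) layout is an assumption made for illustration.
    """
    # Default to every capture group 1..n, skipping group 0 (the whole match).
    requested = groups or range(1, match.re.groups + 1)
    tuples = [(match.group(g), match.start(g), match.end(g)) for g in requested]
    # Like group(), return a bare tuple when exactly one group was requested.
    if len(groups) == 1:
        return tuples[0]
    return tuples

m = re.match(r'(\w+)@(\w+)\.com', 'user@example.com')
print(grouptuples(m))     # all groups
print(grouptuples(m, 2))  # a single group
```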

14) As per PEP 3131 and the move to Python 3.0, Python will begin to 
allow full UNICODE-compliant identifier names.  Correspondingly, it 
would be the responsibility of this item to allow UNICODE names for 
match groups.  This would allow retrieval of UNICODE names via the 
group* functions or when combined with Item 3, the getitem handler 
(m[u'...']) (03+14) and the attribute name itself (e.g. getattr(m, 
u'...')) when combined with item 2 (02+14).
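
As an illustration of the behaviour Item 14 is after, the snippet below 
shows non-ASCII group names being retrieved through the group* 
functions.  It assumes a Python 3 interpreter, where str patterns accept 
such names; the group names themselves are made up for the example, and 
nothing here reflects how the patch implements it.

```python
import re

# Group names are arbitrary non-ASCII identifiers chosen for illustration.
m = re.match(r'(?P<größe>\d+)(?P<единица>\w+)', '42kg')

print(m.group('größe'))
print(m.groupdict())
```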

15) Change the Pattern_Type, Match_Type and Scanner_Type (experimental) 
to become richer Python Types.  Specifically, add __doc__ strings to 
each of these types' methods and members.

16) Implement various FIXMEs.

16-1) Implement the FIXME such that if m is a MatchObject, del m.string 
will disassociate the original matched string from the match object.  
string would be the only member that supports deletion; you would still 
not be able to assign a new value to m.string, only delete it.
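
The delete-only behaviour in 16-1 can be sketched in pure Python with a 
property that has a deleter but no setter.  The class below is a 
hypothetical stand-in, not the real match type, and having string read 
as None after deletion is an assumption; the C code may well behave 
differently.

```python
class DetachableMatch:
    """Hypothetical stand-in for a match object with a deletable string."""

    def __init__(self, string):
        self._string = string

    @property
    def string(self):
        return self._string

    @string.deleter
    def string(self):
        # Drop the reference so the original string can be garbage collected.
        # Reading None afterwards is an assumption of this sketch.
        self._string = None

m = DetachableMatch('some very large input')
del m.string              # allowed: disassociates the string
try:
    m.string = 'other'    # not allowed: no setter is defined
except AttributeError:
    print('m.string cannot be reassigned')
```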

-----

Finally, I want to make a couple of notes about Item 2:

Firstly, as noted in Item 14, I wish to add support for UNICODE match 
group names, and the current version of the C-code would not allow that; 
it would only make sense to add UNICODE support if 14 is implemented, so 
adding support for UNICODE match object attributes would depend on both 
items 2 and 14.  Thus, that would be implemented in issue2636-02+14.

Secondly, there is a FIXME which I discussed in Item 16; I gave that 
problem its own item and branch.  Also, as stated in Item 15, I would 
like to add more robust help code to the Match object and bind __doc__ 
strings to the fixed attributes.  Although this would not directly 
affect the Item 2 implementation, it would probably involve moving some 
code around in its vicinity.

Finally, I would like suggestions on how to handle name collisions when 
match group names are provided as attributes.  For instance, an 
expression like '(?P<pos>.*)' would match more or less any string and 
assign it to the name "pos".  But "pos" is already an attribute of the 
Match object, and therefore pos cannot be exposed as a named match group 
attribute, since match.pos will return the usual meaning of pos for a 
match object, not the value of the capture group named "pos".
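
The collision is easy to see with the stock re module, which already 
lets a group be called "pos" while match.pos keeps its usual meaning.  
The input string below is just an example.

```python
import re

m = re.match(r'(?P<pos>.*)', 'any string')

print(m.pos)           # the position where the search started
print(m.group('pos'))  # the text captured by the group named "pos"
```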

I have 3 proposals as to how to handle this:

a) Simply disallow the exposure of match group name attributes if the 
names collide with an existing member of the basic Match Object 
interface.

b) Expose the reserved names through a special prefix notation, and for 
forward compatibility, expose all names via this prefix notation.  In 
other words, if the prefix was 'k', match.kpos could be used to access 
pos; if it was '_', match._pos would be used.  If Item 3 is implemented, 
it may be sufficient to allow access via match['pos'] as the canonical 
way of handling match group names using reserved words.

c) Don't expose the names directly; only expose them through a prefixed 
name, e.g. match._pos or match.kpos.
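
To compare the proposals, here is a rough pure-Python sketch using a 
wrapper around an ordinary match object.  The AttrMatch class and its 
prefix argument are hypothetical names invented for this illustration; 
the actual patch works at the C level.  With no prefix the wrapper 
behaves like proposal a), and with a prefix it behaves like proposal b).

```python
import re

class AttrMatch:
    """Hypothetical wrapper exposing named groups as attributes."""

    def __init__(self, match, prefix=None):
        self._match = match
        self._prefix = prefix

    def __getattr__(self, name):
        # Only reached when normal attribute lookup on the wrapper fails.
        match, prefix = self._match, self._prefix
        # Existing Match members always win, so colliding names stay hidden.
        if hasattr(match, name):
            return getattr(match, name)
        names = match.re.groupindex
        # Proposal b): every group is also reachable through the prefix.
        if prefix and name.startswith(prefix) and name[len(prefix):] in names:
            return match.group(name[len(prefix):])
        # Proposal a): non-colliding group names are exposed directly.
        if name in names:
            return match.group(name)
        raise AttributeError(name)

raw = re.match(r'(?P<pos>\w+) (?P<word>\w+)', 'hello world')
m = AttrMatch(raw, prefix='_')
print(m.pos)    # the Match member, not the group
print(m._pos)   # the group, reached through the prefix
print(m.word)   # a non-colliding group, exposed directly
```

Proposal c) would amount to dropping the last lookup so that only the 
prefixed form works.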

Personally, I like a) because if Item 3 is implemented, the getitem 
handler makes a fairly useful shorthand for retrieving a group whose 
name collides with a Match Object member.  Also, we could put in a 
deprecation warning to inform users that match group names which 
collide with Match Object members will eventually be disallowed.  
However, I don't support restricting the match group names any more than 
they already are (they must only be valid Python identifiers), so again 
I would go with a) and nothing more, and that's what's implemented in 
issue2636-02.patch.

-----

Now, rather than posting umpteen patch files I am posting one 
bz2-compressed tar of ALL patch files for all threads, where each file 
is of the form:

issue2636(-\d\d|\+\d\d)*(-only)?\.patch

For instance,

issue2636-01.patch is the p1 patch that is a difference between the 
current Python trunk and all that would need to be implemented to 
support Atomic Grouping / Possessive Quantifiers.  Combined branches are 
joined with a PLUS ('+') and sub-branches are concatenated with a DASH 
('-').  Thus, "issue2636-01+09-01-01+10.patch" is a patch which combines 
the work from Item 1: Atomic Grouping / Possessive Quantifiers, the 
sub-sub-branch of Item 9: Engine Cleanups and Item 10: Shared Constants.  
Item 9 has both a child and a grandchild.  The child (09-01) is my 
proposed engine redesign with the single loop; the grandchild (09-01-01) 
is the redesign with the triple loop.  Finally, the optional "-only" 
flag means that the diff is against the core SRE modifications branch 
and thus does not include the core branch changes.
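
For anyone scripting against the tarball, the naming scheme can be 
decoded along the following lines.  This is only a sketch: the 
parse_patch_name helper is a hypothetical name, and the second file name 
in the usage lines is made up to show the "-only" flag rather than taken 
from the tarball.

```python
import re

# File name pattern: signed two-digit parts, an optional "-only", ".patch".
PATCH_NAME = re.compile(r'issue2636((?:[-+]\d\d)*)(-only)?\.patch')

def parse_patch_name(filename):
    """Hypothetical helper: return (branches, only) for a patch file name.

    Each branch is a list of two-digit strings running from the item
    number down through its sub-branches.
    """
    m = PATCH_NAME.fullmatch(filename)
    if m is None:
        raise ValueError('not a recognised patch name: %r' % filename)
    branch_part, only = m.groups()
    # '+' separates combined branches; '-' descends into sub-branches.
    branches = [
        branch.split('-')
        for branch in branch_part.lstrip('-').split('+')
        if branch
    ]
    return branches, only is not None

print(parse_patch_name('issue2636-01+09-01-01+10.patch'))
print(parse_patch_name('issue2636-02-only.patch'))
```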

As noted above, Items 01, 02, 05, 07 and 12 should be considered more or 
less complete and ready for merging, assuming I don't discover while 
implementing the other items that I neglected something in them.  
The rest, including the combined items, are all provided in the given 
tarball.