classification
Title: backreference to named group does not work
Type: behavior Stage: resolved
Components: Regular Expressions Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: amaury.forgeotdarc, asvetlov, docs@python, ezio.melotti, georg.brandl, mrabarnett, python-dev, steve.newcomb, terry.reedy
Priority: normal Keywords:

Created on 2012-09-17 13:33 by steve.newcomb, last changed 2013-10-06 10:10 by georg.brandl. This issue is now closed.

Files
File name Uploaded Description Edit
patch steve.newcomb, 2012-09-18 18:46
patch steve.newcomb, 2012-09-18 19:44
Messages (12)
msg170605 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2012-09-17 13:33
The '\\g<startquote>' in the below does not work:

>>> repr( re.compile( '\\<\\!ENTITY[ \\011\\012\\015]+\\%[ \\011\\012\\015]*(?P<entityName>[A-Za-z][A-Za-z0-9\\.\\-\\_\\:]*)[ \\011\\012\\015]*(?P<startquote>[\\042\\047])(?P<entityText>.+?)\\g<startquote>[ \\011\\012\\015]*\\>', re.IGNORECASE | re.DOTALL).search( '<!ENTITY % m.mixedContent "( #PCDATA | i | b)">'))
'None'

In the following, the '\\g<startquote>' has been replaced by '\\2'.  It works.

>>> repr( re.compile( '\\<\\!ENTITY[ \\011\\012\\015]+\\%[ \\011\\012\\015]*(?P<entityName>[A-Za-z][A-Za-z0-9\\.\\-\\_\\:]*)[ \\011\\012\\015]*(?P<startquote>[\\042\\047])(?P<entityText>.+?)\\2[ \\011\\012\\015]*\\>', re.IGNORECASE | re.DOTALL).search( '<!ENTITY % m.mixedContent "( #PCDATA | i | b)">'))
'<_sre.SRE_Match object at 0x7f77503d1918>'

Either this feature is broken or the re module documentation is somehow misleading me.

(Yes, I know there is an XML error in the above.  That's because it's SGML.)
msg170610 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-09-17 14:13
\g is meant to be used in re.sub(), in the replacement text (see the docs); in the search pattern, (?P=startquote) can be used to refer to a named group.
The docs of "(?P<name>...)" look clear to me.
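A minimal sketch of the suggested fix, assuming the reporter's pattern with only the backreference changed from \g<startquote> to (?P=startquote). The character classes are written as [ \t\n\r] and ["\'] rather than as octal escapes; that rewrite is an editorial simplification and not part of the original report.

```python
import re

# Same characters as the report's [ \011\012\015]: space, tab, LF, CR.
WS = r'[ \t\n\r]'

ENTITY_RE = re.compile(
    r'<!ENTITY{ws}+%{ws}*'
    r'(?P<entityName>[A-Za-z][A-Za-z0-9._:-]*){ws}*'
    r'(?P<startquote>["\'])'
    r'(?P<entityText>.+?)'
    # In-pattern backreference to the named group (not \g<startquote>).
    r'(?P=startquote){ws}*>'.format(ws=WS),
    re.IGNORECASE | re.DOTALL,
)

m = ENTITY_RE.search('<!ENTITY % m.mixedContent "( #PCDATA | i | b)">')
print(m.group('entityName'))
print(m.group('entityText'))
```

With this change the search returns a match object instead of None, and the closing quote is required to be the same character as the opening one.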
msg170630 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2012-09-18 02:01
I have re-read the documentation on re.sub().  Even now, now that I understand that the \g<groupname> syntax applies to the repl argument only, I cannot see how the documentation can be understood that way.  The paragraph in which the explanation of the \g<groupname> syntax appears does not mention the repl argument by name, and neither does the preceding paragraph. 

The paragraph before the preceding paragraph is about the pattern argument, not the repl argument, and it consists entirely of the words, "The pattern may be a string or an RE object." 

So I don't see how the explanation of the \g<groupname> syntax can be understood as applying only to the repl argument, even though you have now informed me that that's the case (which is helpful to know -- thanks!).  Indeed, the paragraph that explains the \g<groupname> syntax *still* appears to me to be discussing the pattern argument.  And it even mentions the (?P<name>...) syntax, which can only appear in a pattern, not in a repl, in the very same sentence as the \g<groupname> syntax, even though those two syntactic features appear in *different* expression languages, and no single expression language has both of them.

So there is no clear indication that it is discussing two different expression languages.  Indeed, another syntactic feature, \groupnumber, also discussed in the same paragraph, *is* found in both expression languages, so it's even more confusing to a person who knows that both (?P<groupname>...) and \groupnumber appear in the pattern expression language.  There is nothing in the documentation that would inform a person (such as myself) that the \g<groupname> syntax is not also part of the pattern expression language, just as the other two features are.
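The overlap described here can be illustrated with a short sketch: \groupnumber is accepted both inside a pattern and inside a replacement string. The input strings and variable names below are made up for illustration.

```python
import re

# \1 inside the pattern: the closing quote must equal the opening quote.
quoted = re.search(r'''(["'])(.*?)\1''', 'x = "it\'s"')
print(quoted.group(2))

# \1 and \2 inside the replacement string: reorder the captured text.
swapped = re.sub(r'(\w+)@(\w+)', r'\2 at \1', 'user@host')
print(swapped)
```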

(And why isn't \g<groupname> part of the pattern language, anyway, or at least some way to refer to a match made in a previous *named* group?  It would be very convenient to be able to do that, particularly when using a dynamically-created regexp to parse strings delimited with a choice of delimiters that must match at both ends.)

In other words, this documentation could be beneficially improved.
msg170636 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-09-18 07:42
> And why isn't \g<groupname> part of the pattern language, anyway, or at
> least some way to refer to a match made in a previous *named* group?

But this way exists: (?P=startquote) is what you want.  To me \g is an exception, and frankly I did not know about it before this bug report.
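As a rough sketch of the use case raised in the question (a dynamically created regexp whose delimiters must match at both ends), (?P=name) can be combined with re.escape. The helper name quoted_string_re, the group names, and the sample delimiters are hypothetical.

```python
import re

def quoted_string_re(delimiters):
    """Build a pattern matching text enclosed by the same delimiter at both ends."""
    alternatives = '|'.join(re.escape(d) for d in delimiters)
    # (?P=open) must match exactly what the 'open' group matched.
    return re.compile(
        r'(?P<open>%s)(?P<body>.*?)(?P=open)' % alternatives, re.DOTALL)

pat = quoted_string_re(['"', "'", '%%'])

print(pat.search('say "it\'s fine" now').group('body'))
print(pat.search("x %%a 'b' c%% y").group('body'))
print(pat.search('"abc\''))  # mismatched delimiters
```

The last call finds no match because the string opens with a double quote and ends with a single quote.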


I agree that the following sentence could be better structured:
"""
For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by its name in arguments to methods of match objects, such as m.group('id') or m.end('id'), and also by name in the regular expression itself (using (?P=id)) and replacement text given to .sub() (using \g<id>).
"""

It probably needs to be split into several pieces; contributions are welcome.
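For reference, the three contexts named in the quoted sentence can be sketched as follows, using the documentation's (?P<id>[a-zA-Z_]\w*) group. The sample strings are invented for illustration.

```python
import re

# 1. Match object methods: refer to the group by its name as a string.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
print(m.group('id'), m.end('id'))

# 2. Inside the same pattern: (?P=id) matches the text the group matched.
doubled = re.search(r'\b(?P<id>[a-zA-Z_]\w*) (?P=id)\b', 'the the cat')
print(doubled.group(0))

# 3. Inside the replacement text passed to sub(): \g<id>.
wrapped = re.sub(r'(?P<id>[a-zA-Z_]\w*)', r'<\g<id>>', 'a b')
print(wrapped)
```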
msg170657 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2012-09-18 17:07
There needed to be a way of referring to named groups in the replacement template. The existing form \groupnumber clearly wouldn't work. Other regex implementations, such as Perl, do have \g and also \k (for named groups).

In my implementation I added support for \g in regex strings.
msg170662 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2012-09-18 18:46
> But this way exists: (?P=startquote) is what you want.

I know how I missed it: I searched for "backref" in the documentation.  I did not find it in the discussion of the pattern language, because that word does not appear where (?P=name) is discussed.

> contributions are welcome.

See attached brief patch for the documentation.  It changes the example, adds a table of the three processing contexts in which named groups can be referenced, and accounts for users who, like me, may search for "backref".  (I tested everything.  I think it's correct.)

Thanks again for the advice, Amaury.
msg170666 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-09-18 19:21
Thanks for the patch!  The new formulation looks much better, but I'll let a native speaker have another check.

Some comments: I preferred the previous example "<id>" because it's not obvious what \042\047 is. And a bullet list would be less heavyweight IMO.
(Also please use "diff -u"; without context, the patch cannot be applied automatically)
msg170670 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2012-09-18 19:44
> I preferred the previous example "<id>" because it's not obvious what \042\047 is. 

Yeah, but the example I wrote has an in-pattern backreference and a real reason to use one.

In the attached patch, I have changed [\042\047] to [\'\"].  That's certainly clearer for everyone who has not memorized the ASCII table in octal!  (Oops.)
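A quick check, offered as a sketch, that the two spellings of the character class accept the same characters. Octal 042 and 047 are the ASCII codes of the double and single quote; the variable names are illustrative.

```python
import re

octal_class = re.compile(r'[\042\047]')      # spelling used in the first patch
quoted_class = re.compile(r'''[\'\"]''')     # spelling used in the revised patch

for ch in ('"', "'"):
    # Both classes should accept both quote characters.
    print(repr(ch), bool(octal_class.match(ch)), bool(quoted_class.match(ch)))

# Neither class should accept an ordinary letter.
print(octal_class.match('a'), quoted_class.match('a'))
```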

> And a bullet list would be less heavyweight IMO.

Well... I rejected that choice because there would be no clarifying columnar distinction between contexts and syntaxes.  Personally, I think the table is clearer.  It makes it easier for users to find what they need to know.

>(Also please use "diff -u"; without context, the patch cannot be applied automatically)

Oops.  Attached.
msg170945 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-09-22 01:10
I read it as a 'native speaker' and it looks fine to me. The table is clear, but I will let a doc stylist decide.
msg199061 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-06 10:08
New changeset bee2736296c5 by Georg Brandl in branch '2.7':
Closes #15956: improve documentation of named groups and how to reference them.
http://hg.python.org/cpython/rev/bee2736296c5
msg199062 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-06 10:08
New changeset f765a29309d1 by Georg Brandl in branch '3.3':
Closes #15956: improve documentation of named groups and how to reference them.
http://hg.python.org/cpython/rev/f765a29309d1
msg199063 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-10-06 10:10
Thanks for the patch.  I made a few changes, such as explaining what the example pattern does.
History
Date User Action Args
2013-10-06 10:10:21  georg.brandl  set  nosy: + georg.brandl; messages: + msg199063
2013-10-06 10:08:26  python-dev  set  messages: + msg199062
2013-10-06 10:08:15  python-dev  set  status: open -> closed; nosy: + python-dev; messages: + msg199061; resolution: fixed; stage: patch review -> resolved
2012-09-22 01:10:51  terry.reedy  set  nosy: + terry.reedy; messages: + msg170945
2012-09-18 21:42:12  asvetlov  set  nosy: + asvetlov
2012-09-18 21:35:14  amaury.forgeotdarc  set  assignee: docs@python; nosy: + docs@python; stage: resolved -> patch review
2012-09-18 19:44:14  steve.newcomb  set  files: + patch; messages: + msg170670
2012-09-18 19:21:12  amaury.forgeotdarc  set  messages: + msg170666
2012-09-18 18:46:03  steve.newcomb  set  files: + patch; messages: + msg170662
2012-09-18 17:07:58  mrabarnett  set  messages: + msg170657
2012-09-18 07:42:46  amaury.forgeotdarc  set  messages: + msg170636
2012-09-18 02:01:14  steve.newcomb  set  status: closed -> open; resolution: not a bug -> (no value); messages: + msg170630
2012-09-17 14:13:40  amaury.forgeotdarc  set  status: open -> closed; nosy: + amaury.forgeotdarc; messages: + msg170610; resolution: not a bug; stage: resolved
2012-09-17 13:33:23  steve.newcomb  create