classification
Title: Documentation about re \number
Type: behavior Stage: needs patch
Components: Documentation Versions: Python 3.1, Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: works for me
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Seth.Troisi, docs@python, ezio.melotti, georg.brandl, r.david.murray, terry.reedy
Priority: normal Keywords: patch

Created on 2011-05-24 00:07 by Seth.Troisi, last changed 2013-10-06 19:01 by georg.brandl. This issue is now closed.

Messages (8)
msg136708 - Author: Seth Troisi (Seth.Troisi) Date: 2011-05-24 00:07
It would be nice to clarify re documentation on how to use \number.

current documentation lists three half examples:
"(.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group)."

This is rather confusing (at least to me) as it might be assumed that
re.search("(.+) \1", "the the") would return a match, which it does not.

A better example would be re.search("(\w+) \\1", "the the") which does match.

The other confusing part is that the second "\" is required to make it match.

I would think that a quick example below the text would help.

>>> re.search("(\w+) \\1", "can you do the can can?") # \\1 matches the second can at the end of the sentence
<_sre.SRE_Match object at ...>

This is my first Python issue, so if I have misfiled it or left out some information, please tell me how to proceed.
msg136709 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-05-24 00:54
Read the description of strings and raw strings at the top of the re documentation for the answer to your question about \\. It would probably be better if the example regular expression were written r'(.+) \1' instead of as a bare expression, as it is now.
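
A minimal sketch of the difference described above, using the 'the the' string from the original report. The variable names are illustrative only and do not come from the thread:

```python
import re

# Non-raw literal: Python turns \1 into the single character chr(1)
# before the re module ever sees the pattern.
plain = "(.+) \1"

# Raw literal: the backslash survives, so re sees a backreference to group 1.
raw = r"(.+) \1"

plain_match = re.search(plain, "the the")
raw_match = re.search(raw, "the the")

print(repr(plain))        # shows that the pattern ends in '\x01'
print(plain_match)        # None: there is no chr(1) in the subject string
print(raw_match.group())  # the the
```

Writing the pattern as "(.+) \\1" should behave the same as the raw form, since the doubled backslash produces one literal backslash in the string.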
msg136715 - Author: Seth Troisi (Seth.Troisi) Date: 2011-05-24 02:27
Given David Murray's input I think the example would be best done as 

>>> re.search(r'(\w+) \1', "can you do the can can?") # Matches the duplicate can
<_sre.SRE_Match object at ...>


I want to stress that the documentation is not wrong, just confusing, especially for someone unfamiliar with regular expressions.
msg137155 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2011-05-28 21:59
The doc consistently does NOT quote re's in the text. Rather, they are shaded gray, both in Windows help version and html version. So this one should not be treated differently.

Most of the confusion reported is due to not reading the intro paragraphs. I almost suggested closing this without action. However, after saying to use the r prefix, the doc omits it from examples when it is not absolutely needed. In particular,

>>> m = re.search('(?<=-)\w+', 'spam-egg')

Why does \w work without being doubled or protected (and it does, I checked), while \1 does not? Hell if I know. So even though that example works, it should be changed. The doc should teach the rule "if a string contains '\', prefix it with 'r'" rather than "test and add 'r' if it fails" or "learn the exact list of when it is needed", a list that is not given and is unknown to me and to most any beginner.

I advocate the same practice in the RE How To, which also has at least one example with '\' but without 'r':
>>> p = re.compile('\d+')

I do not think we need another example other than those in the text.
msg137158 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-05-28 22:27
Why it works is due to a quirk in the handling of python strings: if an apparent escape sequence doesn't "mean anything", it is retained verbatim, including the '\' character.  This is documented in http://docs.python.org/reference/lexical_analysis.html#string-literals:

"Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.)"

It is *very* unwise to depend on this behavior for anything except debugging, therefore those examples which do are, in my opinion, wrong.
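
A small sketch of what the string-literal rules do to the sequences discussed in this thread. It deliberately avoids writing an unrecognized escape such as '\w' in a non-raw literal, because newer Python releases emit a warning for those. The variable names are illustrative assumptions:

```python
# \1 and \b are recognized string escapes, so a non-raw literal rewrites them.
octal = "\1"       # octal escape: one control character, chr(1)
backspace = "\b"   # backspace: one control character, chr(8)

# A raw literal keeps the backslash, which is what the re module needs to see.
raw_backref = r"\1"

# Doubling the backslash in a non-raw literal gives the same two characters
# as the raw spelling.
doubled = "\\w"

print(len(octal), len(backspace))      # 1 1
print(len(raw_backref), len(doubled))  # 2 2
print(doubled == r"\w")                # True
```

Under the rule proposed above (always use the r prefix when a pattern contains a backslash), none of these distinctions has to be memorized.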
msg137165 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-05-29 02:01
The regex sets (\d\w\s\D\W\S) don't match any Python escape sequence, so even though some suggest always using r'' regardless, I don't find it necessary, especially for simple regexes.
The two conflicting escape sequences to keep in mind are \b (backspace for Python, word boundary for re) and \number (octal escape for Python, reference to a group for re).
There are also other regex escape sequences that are rarely used (\B\A\Z), but these don't need to be escaped either.
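
The \b conflict mentioned above can be sketched as follows. The subject string "a foo b" and the variable names are made up for illustration:

```python
import re

text = "a foo b"

# Non-raw: Python converts \b to a backspace character (chr(8)), so the
# pattern looks for backspace, 'foo', backspace.
plain = re.search("\bfoo\b", text)

# Raw: re receives \b and treats it as a word boundary.
raw = re.search(r"\bfoo\b", text)

# The non-raw pattern does match a string that really contains backspaces.
literal = re.search("\bfoo\b", "\bfoo\b")

print(plain)              # None
print(raw.group())        # foo
print(literal is not None)  # True
```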
msg137196 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-05-29 16:36
The fact that you have to think carefully about which are escapes and which aren't tells you that you should not depend on the non-escapes not being escapes. What if we added one? The doc says preserving the \s is a debugging aid, and that is all it should be used for, IMO.
msg199109 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2013-10-06 19:01
I can't see the issue here. The RE docs are much better off with the regexes unquoted.

The '(.+) \1' example was fixed today (the string that was supposed not to match actually did match).
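
A quick check of the point above, assuming the string in question is 'the end' from the documentation sentence quoted in the first message. The string 'thethe' is an illustrative non-matching input chosen here, not something stated in this thread:

```python
import re

pattern = r"(.+) \1"

results = {}
for subject in ("the the", "55 55", "the end", "thethe"):
    m = re.search(pattern, subject)
    # Record the matched text, or None when the search finds nothing.
    results[subject] = m.group() if m else None

for subject, matched in results.items():
    print(repr(subject), "->", repr(matched))
```

With re.search, 'the end' does produce a match, because the group can capture the single 'e' before the space and the backreference then matches the 'e' after it. A subject with no space at all cannot match.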
History
Date  User  Action  Args
2013-10-06 19:01:01  georg.brandl  set  status: open -> closed; nosy: + georg.brandl; messages: + msg199109; resolution: works for me
2011-05-29 16:36:52  r.david.murray  set  messages: + msg137196
2011-05-29 02:01:00  ezio.melotti  set  messages: + msg137165
2011-05-28 22:27:27  r.david.murray  set  messages: + msg137158
2011-05-28 21:59:27  terry.reedy  set  versions: + Python 3.1, Python 2.7, Python 3.2, Python 3.3; nosy: + terry.reedy; messages: + msg137155; keywords: + patch; stage: needs patch
2011-05-24 02:27:05  Seth.Troisi  set  messages: + msg136715
2011-05-24 00:54:14  r.david.murray  set  nosy: + r.david.murray; messages: + msg136709
2011-05-24 00:12:32  ezio.melotti  set  nosy: + ezio.melotti
2011-05-24 00:07:37  Seth.Troisi  create