Issue 12162: Documentation about re \number

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56371

classification

Title:	Documentation about re \number
Type:	behavior	Stage:	needs patch
Components:	Documentation	Versions:	Python 3.1, Python 3.2, Python 3.3, Python 2.7

process

Status:	closed	Resolution:	works for me
Dependencies:		Superseder:
Assigned To:	docs@python	Nosy List:	Seth.Troisi, docs@python, ezio.melotti, georg.brandl, r.david.murray, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2011-05-24 00:07 by Seth.Troisi, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (8)
msg136708	Author: Seth Troisi (Seth.Troisi) *	Date: 2011-05-24 00:07
It would be nice to clarify the re documentation on how to use \number. The current documentation lists three half-examples: "(.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group)."

This is rather confusing (at least to me), as it might be assumed that re.search("(.+) \1", "the the") would return a match, which it does not. A better example would be re.search("(\w+) \\1", "the the"), which does match. The other confusing part is the requirement of the second "\" to make it match.

I would think that a quick example below the text would help:

>>> re.search("(\w+) \\1", "can you do the can can?")  # \\1 matches the second can at the end of the sentence
<_sre.SRE_Match object at ...>

This is my first Python issue; if I have misfiled it or left out some information, please tell me how to proceed.
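
For readers following along, here is a minimal sketch of the behavior described in this message, assuming a current Python 3 interpreter. The variable names (plain, raw, doubled, found) are illustrative only and do not come from the re documentation.

```python
import re

# In an ordinary string literal, Python itself consumes "\1" as an octal
# escape, so the pattern ends in the control character U+0001.
plain = "(.+) \1"
# A raw literal (or a doubled backslash) passes the two characters \ and 1
# through to the re module, which reads them as a backreference to group 1.
raw = r"(.+) \1"
doubled = "(.+) \\1"

print(repr(plain))                          # '(.+) \x01'
print(raw == doubled)                       # True
print(re.search(plain, "the the"))          # None
print(re.search(raw, "the the").group(0))   # the the

found = re.search(r"(\w+) \1", "can you do the can can?")
print(found.group(0), "|", found.group(1))  # can can | can
```

The non-raw pattern fails only because the target text contains no U+0001 character; the regular expression itself compiles without complaint, which is what makes the mistake easy to miss.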
msg136709	Author: R. David Murray (r.david.murray) *	Date: 2011-05-24 00:54
Read the description of strings and raw strings at the top of the re documentation for the answer to your question about \\. It would probably be better if the example regular expression was written r'(.+) \1' instead of as a bare expression as it is now.
msg136715	Author: Seth Troisi (Seth.Troisi) *	Date: 2011-05-24 02:27
Given David Murray's input, I think the example would be best done as

>>> re.search(r'(\w+) \1', "can you do the can can?")  # Matches the duplicate can
<_sre.SRE_Match object at ...>

I want to stress that the documentation is not wrong but confusing, especially for someone unfamiliar with regular expressions.
msg137155	Author: Terry J. Reedy (terry.reedy) *	Date: 2011-05-28 21:59
The doc consistently does NOT quote REs in the text. Rather, they are shaded gray, both in the Windows help version and in the HTML version. So this one should not be treated differently. Most of the confusion reported is due to not reading the intro paragraphs. I almost suggested closing this without action.

However, after saying to use the r prefix, the doc omits it from examples when not absolutely needed. In particular,

>>> m = re.search('(?<=-)\w+', 'spam-egg')

Why does \w work without being doubled or protected (and it does, I checked), while \1 does not? Hell if I know. So even though that example works, it should be changed. The doc should teach the rule "if a string contains '\', prefix it with 'r'" rather than "test and add 'r' if it fails" or "learn the exact list of when it is needed", which is not given and is unknown to me and to most any beginner.

I advocate the same practice in the RE How To, which also has at least one example with '\' but without 'r':

>>> p = re.compile('\d+')

I do not think we need another example other than those in the text.
msg137158	Author: R. David Murray (r.david.murray) *	Date: 2011-05-28 22:27
Why it works is due to a quirk in the handling of Python strings: if an apparent escape sequence doesn't "mean anything", it is retained verbatim, including the '\' character. This is documented in http://docs.python.org/reference/lexical_analysis.html#string-literals: "Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the string. (This behavior is useful when debugging: if an escape sequence is mistyped, the resulting output is more easily recognized as broken.)" It is very unwise to depend on this behavior for anything except debugging; therefore those examples which do are, in my opinion, wrong.
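
The quirk can be checked directly. The sketch below is an illustration only, not something from the docs: it evaluates the source text of two literals with ast.literal_eval, and the names sources, unrecognized, and octal are made up for the example. The literal source is held in raw strings so that the example itself contains no ambiguous escapes, and warnings are silenced because newer Python versions warn about unrecognized escapes such as '\w'.

```python
import ast
import warnings

# Source text of two string literals, exactly as a user might type them.
sources = [r"'\w'", r"'\1'"]

with warnings.catch_warnings():
    # Newer Python versions emit a warning for the unrecognized escape '\w'.
    warnings.simplefilter("ignore")
    unrecognized, octal = [ast.literal_eval(src) for src in sources]

# '\w' is not a Python escape, so the backslash survives: two characters.
print(len(unrecognized), repr(unrecognized))  # 2 '\\w'
# '\1' is an octal escape, so it collapses to the single character U+0001.
print(len(octal), repr(octal))                # 1 '\x01'
```

This is why '(?<=-)\w+' happens to work without the r prefix while '(.+) \1' does not.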
msg137165	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-05-29 02:01
The regex sets (\d\w\s\D\W\S) don't match any Python escape sequence, so even if some suggest always using r'' regardless, I don't find it necessary, especially for simple regexes. The two conflicting escape sequences to keep in mind are \b (backspace for Python, word boundary for re) and \number (octal escape for Python, reference to a group for re). There are also other regex escape sequences that are rarely used (\B\A\Z), but these don't need to be escaped either.
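
A short sketch of the \b conflict mentioned above, assuming a current Python 3 interpreter. The sample text and the names without_raw, with_raw, and numbers are invented for the illustration.

```python
import re

text = "a cat, a catalog"

# In an ordinary literal, "\b" is the backspace character U+0008.
print("\b" == "\x08")  # True

# Without the r prefix the pattern looks for backspace + "cat" + backspace,
# which does not occur in the text.
without_raw = re.findall("\bcat\b", text)
# With the r prefix, re receives \b and treats it as a word boundary.
with_raw = re.findall(r"\bcat\b", text)
print(without_raw)  # []
print(with_raw)     # ['cat']

# Class escapes such as \d are not Python escapes; a raw string still
# removes any need to remember which sequences are safe.
numbers = re.findall(r"\d+", "issue 12162, msg 136708")
print(numbers)      # ['12162', '136708']
```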
msg137196	Author: R. David Murray (r.david.murray) *	Date: 2011-05-29 16:36
The fact that you have to think carefully about which are escapes and which aren't tells you that you should not be depending on the non-escapes not being escapes. What if we added one? The doc says preserving the backslashes is a debugging aid, and that is all it should be used for, IMO.
msg199109	Author: Georg Brandl (georg.brandl) *	Date: 2013-10-06 19:01
I can't see the issue here. The RE docs are much better off with the regexes unquoted. The '(.+) \1' example was fixed today (the string supposed to not match actually did match).

History
Date	User	Action	Args
2022-04-11 14:57:17	admin	set	github: 56371
2013-10-06 19:01:01	georg.brandl	set	status: open -> closed nosy: + georg.brandl messages: + msg199109 resolution: works for me
2011-05-29 16:36:52	r.david.murray	set	messages: + msg137196
2011-05-29 02:01:00	ezio.melotti	set	messages: + msg137165
2011-05-28 22:27:27	r.david.murray	set	messages: + msg137158
2011-05-28 21:59:27	terry.reedy	set	versions: + Python 3.1, Python 2.7, Python 3.2, Python 3.3 nosy: + terry.reedy messages: + msg137155 keywords: + patch stage: needs patch
2011-05-24 02:27:05	Seth.Troisi	set	messages: + msg136715
2011-05-24 00:54:14	r.david.murray	set	nosy: + r.david.murray messages: + msg136709
2011-05-24 00:12:32	ezio.melotti	set	nosy: + ezio.melotti
2011-05-24 00:07:37	Seth.Troisi	create