classification
Title: re: Backreferences vs. escapes: a silent failure solved
Type: enhancement Stage: committed/rejected
Components: Regular Expressions, Unicode Versions: Python 3.2
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Aaron.Sherman, mrabarnett, r.david.murray, terry.reedy
Priority: normal Keywords:

Created on 2010-04-19 21:11 by Aaron.Sherman, last changed 2010-10-20 02:33 by r.david.murray. This issue is now closed.

Messages (6)
msg103640 - Author: Aaron Sherman (Aaron.Sherman) Date: 2010-04-19 21:11
I tested this under 2.6 and 3.1. Under both, a common mistake, one that I'm sure many others have made and that cost me quite some time today, is:

 re.sub(r'(foo)bar', '\1baz', 'foobar')

It's obvious, I'm sure, to many reading this that the second "r" was left out before the replacement spec. It's probably also obvious that this is going to happen quite a lot, and there are many edge cases which are equally baffling to the uninitiated (e.g. \8, \418 and \1111).
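A minimal sketch of the failure, for illustration. Without the r prefix, the replacement string never contains a backslash by the time re.sub sees it, so no error or warning is raised. The variable names are only for this example, and \8 is left out because newer Python versions warn about it as an invalid escape.

```python
import re

# '\1' in a normal string literal is the octal escape for chr(1),
# so re.sub receives a control character, not a backreference.
plain = re.sub(r'(foo)bar', '\1baz', 'foobar')

# With the r prefix the backslash survives and \1 refers to group 1.
raw = re.sub(r'(foo)bar', r'\1baz', 'foobar')

print(repr(plain))  # '\x01baz'
print(repr(raw))    # 'foobaz'

# Two of the edge cases mentioned above:
edge_a = '\418'   # '\41' is '!' and the 8 is an ordinary character
edge_b = '\1111'  # '\111' is 'I' and the last 1 is an ordinary character
print(repr(edge_a), repr(edge_b))  # '!8' 'I1'
```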

In order to avoid this, I'd like to request that such usage be deprecated, leaving only numeric escapes of the form matched by r'\\[0-7][0-7][0-7]?(?!\d)' as valid, non-deprecated uses (e.g. \01 or \111 are fine). Let's look at what that would do:

Right now, the standard library uses escape sequences with \n where n is a single digit in a handful of places like sndhdr.py and difflib.py. These are not widespread enough to count as common usage, but those few would have to change to add a leading zero before the digit.

OK, so the specific requested feature is that \xxx produces a warning where xxx is:

* any single digit or
* any invalid sequence of two or three digits (e.g. containing 8 or 9) or
* any sequence of 4 or more digits

... guiding the user to the more explicit \01, \x01 or, if they intended a literal backslash, the r notation.
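The three rules above can be sketched as a small checker. This is only an illustration of the proposal, not anything that exists in CPython: the names ACCEPTED, NUMERIC and flagged_escapes are hypothetical, and the sketch assumes its input is the source text between the quotes of a string literal.

```python
import re

# Accepted form, taken from the proposal: two or three octal digits,
# not followed by another digit.
ACCEPTED = re.compile(r'\\[0-7][0-7][0-7]?(?!\d)')

# Any backslash followed by a run of digits. An escaped backslash is
# consumed first so that the text \\1 is not misread as \1.
NUMERIC = re.compile(r'\\\\|\\\d+')

def flagged_escapes(literal_body):
    """Return the numeric escapes the proposal would warn about."""
    flagged = []
    for match in NUMERIC.finditer(literal_body):
        text = match.group()
        if text == '\\\\':
            continue  # a literal backslash, not a numeric escape
        if not ACCEPTED.fullmatch(text):
            flagged.append(text)
    return flagged

print(flagged_escapes(r'\1baz \01 \111 \8 \418 \1111 \\1'))
```

Under these assumptions \01 and \111 pass, while \1, \8, \418 and \1111 are reported.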

If you wish to go a step further, I'd suggest adding a no-op escape \e such that:

 \41\e1

would print "!1". Otherwise, there's no clean way to halt the interpretation of a digit-based escape sequence.
msg103695 - Author: Matthew Barnett (mrabarnett) Date: 2010-04-20 12:10
Octal escapes are at most 3 octal digits, so the normal way to handle "\41" + "1" is "\0411".
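As a quick check of the padding approach, the sketch below compares it with two alternatives that current Python already supports: adjacent string literals and a two-digit hex escape. The variable names are only for illustration.

```python
padded = '\0411'    # '\041' followed by the character '1'
joined = '\41' '1'  # adjacent literals are concatenated at compile time
hexed = '\x211'     # \x takes exactly two hex digits, then a plain '1'

print(padded, joined, hexed)  # !1 !1 !1
```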

Some languages support variable-length hex escapes of the form "\x{1B}", so we could add that and also "\o{41}" for octal.

BTW, in some languages "\e" is "\x1B".
msg103727 - Author: Aaron Sherman (Aaron.Sherman) Date: 2010-04-20 15:30
Matthew, thank you for replying. I still think the primary issue is the potential for confusion between single-digit escapes and backreferences, and the ease with which it could be addressed, but to cover what you said:

Quote: the normal way to handle "\41" + "1" is "\0411"

That might be the way dictated by the limitations of escape expansion as it is now, but it's entirely non-intuitive and seems more like the "exciting" edge cases (and obfuscated code opportunities) in other languages than something Python would be proud of.

With \41\e1 you would actually be able to tell visually that the 1 does not get read by the code which reads the \41. This seems to me to be a serious win for maintainability.
msg103985 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-04-22 22:47
Not sure why this was assigned to me.
msg109954 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-07-11 02:12
If you are suggesting that code like
>>> '\2'
'\x02'
should routinely produce a warning, and later an exception, this issue should be rejected. Normal code does not produce warnings. Rarity in the stdlib is irrelevant. Backslash processing of string literals is part of core syntax. Breaking existing code would not be acceptable.
msg119185 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-10-20 02:33
I agree with Terry.
History
Date                 User            Action  Args
2010-10-20 02:33:23  r.david.murray  set     status: open -> closed; nosy: + r.david.murray; messages: + msg119185; resolution: out of date; stage: committed/rejected
2010-07-11 02:12:48  terry.reedy     set     nosy: + terry.reedy; messages: + msg109954; versions: + Python 3.2, - Python 2.6, Python 3.1
2010-04-22 22:47:56  pitrou          set     nosy: - pitrou
2010-04-22 22:47:42  pitrou          set     assignee: pitrou -> ; messages: + msg103985; nosy: pitrou, mrabarnett, Aaron.Sherman
2010-04-21 18:45:42  haypo           set     title: Backreferences vs. escapes: a silent failure solved -> re: Backreferences vs. escapes: a silent failure solved
2010-04-21 18:43:55  georg.brandl    set     assignee: pitrou; nosy: + pitrou
2010-04-20 15:30:59  Aaron.Sherman   set     messages: + msg103727
2010-04-20 12:10:08  mrabarnett      set     nosy: + mrabarnett; messages: + msg103695
2010-04-19 21:11:15  Aaron.Sherman   create