classification
Title: clarification on escaping \d in regular expressions
Type: behavior
Stage: resolved
Components: Regular Expressions
Versions: Python 3.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, mrabarnett, sabakauser, serhiy.storchaka, xtreak
Priority: normal
Keywords:

Created on 2018-08-01 05:13 by sabakauser, last changed 2018-08-01 10:19 by serhiy.storchaka. This issue is now closed.

Messages (3)
msg322842 - Author: Saba Kauser (sabakauser) Date: 2018-08-01 05:13
Hello,

I have a program that works well up to Python 3.6 but fails with Python 3.7.

import re

pattern="DBMS_NAME: string(%d) %s"
sym = ['\[','\]','\(','\)']
for chr in sym:
  pattern = re.sub(chr, '\\' + chr, pattern)
  print(pattern)
  
pattern=re.sub('%s','.*?',pattern)
print(pattern)
pattern = re.sub('%d', '\\d+', pattern) 
print(pattern)
result=re.match(pattern, "DBMS_NAME: string(8) \"DB2/NT64\" ")
print(result)
result=re.match("DBMS_NAME python4: string\(\d+\) .*?", "DBMS_NAME python4: string(8) \"DB2/NT64\" ")
print(result)

Expected output:
DBMS_NAME: string(%d) %s
DBMS_NAME: string(%d) %s
DBMS_NAME: string\(%d) %s
DBMS_NAME: string\(%d\) %s
DBMS_NAME: string\(%d\) .*?
DBMS_NAME: string\(\d+\) .*?
<re.Match object; span=(0, 21), match='DBMS_NAME: string(8) '>
<re.Match object; span=(0, 29), match='DBMS_NAME python4: string(8) '>

However, execution of the statement below fails with Python 3.7:
pattern = re.sub('%d', '\\d+', pattern) 

DBMS_NAME: string(%d) %s
DBMS_NAME: string(%d) %s
DBMS_NAME: string\(%d) %s
DBMS_NAME: string\(%d\) %s
DBMS_NAME: string\(%d\) .*?
Traceback (most recent call last):
  File "c:\users\skauser\appdata\local\programs\python\python37\lib\sre_parse.py", line 1021, in parse_template
    this = chr(ESCAPES[this][1])
KeyError: '\\d'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pattern.txt", line 11, in <module>
    pattern = re.sub('%d', '\\d+', pattern)
  File "c:\users\skauser\appdata\local\programs\python\python37\lib\re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "c:\users\skauser\appdata\local\programs\python\python37\lib\re.py", line 309, in _subx
    template = _compile_repl(template, pattern)
  File "c:\users\skauser\appdata\local\programs\python\python37\lib\re.py", line 300, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "c:\users\skauser\appdata\local\programs\python\python37\lib\sre_parse.py", line 1024, in parse_template
    raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 0

If I change the statement to use three backslashes, like
pattern = re.sub('%d', '\\\d+', pattern)

the correct regular expression is generated.

Can you please comment on whether this has changed in Python 3.7, and whether we now need to escape the 'd' in '\d' as well?

Thank you!
msg322853 - Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2018-08-01 10:16
The reported behavior is reproducible on master as of ea68d83933, but not on 3.6.0. I couldn't bisect to the exact commit between 3.6.0 and 3.7.0 where this change was introduced, though. I can also see some deprecation warnings, shown below, while running the script:

➜  cpython git:(master) ./python.exe ../backups/bpo34034.py
../backups/bpo34034.py:4: DeprecationWarning: invalid escape sequence \[
  sym = ['\[','\]','\(','\)']
../backups/bpo34034.py:4: DeprecationWarning: invalid escape sequence \]
  sym = ['\[','\]','\(','\)']
../backups/bpo34034.py:4: DeprecationWarning: invalid escape sequence \(
  sym = ['\[','\]','\(','\)']
../backups/bpo34034.py:4: DeprecationWarning: invalid escape sequence \)
  sym = ['\[','\]','\(','\)']
../backups/bpo34034.py:15: DeprecationWarning: invalid escape sequence \(
  result=re.match("DBMS_NAME python4: string\(\d+\) .*?", "DBMS_NAME python4: string(8) \"DB2/NT64\" ")
DBMS_NAME: string(%d) %s
DBMS_NAME: string(%d) %s
DBMS_NAME: string\(%d) %s
DBMS_NAME: string\(%d\) %s
DBMS_NAME: string\(%d\) .*?
Traceback (most recent call last):
  File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/sre_parse.py", line 1045, in parse_template
    this = chr(ESCAPES[this][1])
KeyError: '\\d'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../backups/bpo34034.py", line 11, in <module>
    pattern = re.sub('%d', '\\d+', pattern)
  File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/re.py", line 309, in _subx
    template = _compile_repl(template, pattern)
  File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/re.py", line 300, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "/Users/karthikeyansingaravelan/stuff/python/cpython/Lib/sre_parse.py", line 1048, in parse_template
    raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \d at position 0


Thanks
msg322854 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-08-01 10:19
If you want to replace %d with literal \d, you need to repeat the backslash 4 times:

    pattern = re.sub('%d', '\\\\d+', pattern)

or use a raw string literal and repeat the backslash 2 times:

    pattern = re.sub('%d', r'\\d+', pattern)

Since the backslash has a special meaning in the replacement pattern, it needs to be escaped with a backslash, i.e. duplicated. But since it also has a special meaning in Python string literals, each of these backslashes needs to be escaped with a backslash in a non-raw string literal, i.e. the backslash ends up repeated 4 times.
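
The two layers of escaping can be checked separately. The following is a minimal sketch; the variable names (template, non_raw, raw, result) are only illustrative and do not come from the original script.

```python
import re

template = "DBMS_NAME: string(%d) %s"

# Layer 1 (Python parser): both literals produce the same 4-character
# string: backslash, backslash, 'd', '+'.
non_raw = '\\\\d+'
raw = r'\\d+'
assert non_raw == raw and len(raw) == 4

# Layer 2 (re replacement template): the doubled backslash is read as a
# single literal backslash, so the output contains \d+.
result = re.sub('%d', raw, template)
print(result)
```

The printed line should be the same as the output of the four-backslash command in the transcripts, i.e. the template with %d replaced by \d+.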

Python 3.6 is more lenient. It keeps a backslash if it is followed by a character that doesn't form a known escape sequence in a replacement string. But it emits a deprecation warning, which you can see when you run Python with the corresponding -W option.

$ python3.6 -Wa -c 'import re; print(re.sub("%d", "\d+", "DBMS_NAME: string(%d) %s"))'
<string>:1: DeprecationWarning: invalid escape sequence \d
/usr/lib/python3.6/re.py:191: DeprecationWarning: bad escape \d
  return _compile(pattern, flags).sub(repl, string, count)
DBMS_NAME: string(\d+) %s

$ python3.6 -Wa -c 'import re; print(re.sub("%d", "\\d+", "DBMS_NAME: string(%d) %s"))'
/usr/lib/python3.6/re.py:191: DeprecationWarning: bad escape \d
  return _compile(pattern, flags).sub(repl, string, count)
DBMS_NAME: string(\d+) %s

$ python3.6 -Wa -c 'import re; print(re.sub("%d", "\\\d+", "DBMS_NAME: string(%d) %s"))'
<string>:1: DeprecationWarning: invalid escape sequence \d
DBMS_NAME: string(\d+) %s

$ python3.6 -Wa -c 'import re; print(re.sub("%d", "\\\\d+", "DBMS_NAME: string(%d) %s"))'
DBMS_NAME: string(\d+) %s

Here "invalid escape sequence \d" is generated by the Python parser, and "bad escape \d" is generated by the RE engine.
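
The RE engine side can also be observed directly in code. The sketch below assumes Python 3.7 or later; build_pattern is a hypothetical helper, not part of the original script or of the re module. It sidesteps replacement-template parsing where possible: str.replace performs no escape processing, and the return value of a callable passed to re.sub is inserted as is.

```python
import re

template = "DBMS_NAME: string(%d) %s"

# RE engine layer: on Python 3.7+ an unknown escape such as \d in a
# replacement string raises re.error instead of being kept.
try:
    re.sub('%d', '\\d+', template)
    engine_error = None
except re.error as exc:
    engine_error = str(exc)
print(engine_error)

def build_pattern(template):
    """Hypothetical helper: convert a printf-like template to a regex."""
    # Backslash-escape brackets and parentheses; \g<0> is the matched text.
    pattern = re.sub(r'[\[\]()]', r'\\\g<0>', template)
    # str.replace does no escape processing, so '.*?' is inserted as written.
    pattern = pattern.replace('%s', '.*?')
    # The return value of a callable is inserted verbatim by re.sub.
    return re.sub('%d', lambda m: r'\d+', pattern)

pattern = build_pattern(template)
print(pattern)
match = re.match(pattern, 'DBMS_NAME: string(8) "DB2/NT64" ')
print(match)
```

Whether a callable or a doubled backslash reads better is a matter of taste; the callable only avoids having to think about the replacement-template layer.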
History
Date                 User              Action  Args
2018-08-01 10:19:42  serhiy.storchaka  set     status: open -> closed; nosy: + serhiy.storchaka; messages: + msg322854; resolution: not a bug; stage: resolved
2018-08-01 10:16:15  xtreak            set     messages: + msg322853
2018-08-01 08:32:09  xtreak            set     nosy: + xtreak
2018-08-01 05:13:39  sabakauser        create