classification
Title: re.escape() escapes too much
Type: enhancement Stage: resolved
Components: Library (Lib), Regular Expressions Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: 30021 Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, mrabarnett, serhiy.storchaka, terry.reedy
Priority: normal Keywords:

Created on 2017-04-05 14:17 by serhiy.storchaka, last changed 2017-06-11 18:32 by terry.reedy. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 1007 merged serhiy.storchaka, 2017-04-05 14:23
PR 2114 merged terry.reedy, 2017-06-11 17:35
Messages (4)
msg291177 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-05 14:17
re.escape() escapes all characters except ASCII letters, digits and '_'. This is excessive: it makes escaping and compiling slower and makes the pattern less human-readable. The characters "!\"%&\',/:;<=>@_`~", as well as non-ASCII characters, are always literal in a regular expression and don't need escaping.
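
To make the cost concrete, the pre-patch behaviour described above can be approximated as follows. This is a sketch only: legacy_escape is a hypothetical helper, not the stdlib code, and the "\000" form for the null character is an assumption about the pre-patch output.

```python
import string

# Characters the pre-patch re.escape() is described as leaving alone.
_ALNUM = frozenset(string.ascii_letters + string.digits + "_")

def legacy_escape(text: str) -> str:
    """Approximate the old behaviour: backslash-escape everything else."""
    parts = []
    for ch in text:
        if ch in _ALNUM:
            parts.append(ch)
        elif ch == "\0":
            # Assumed special case for the null character.
            parts.append("\\000")
        else:
            parts.append("\\" + ch)
    return "".join(parts)

print(legacy_escape("a/b: 50%"))   # a\/b\:\ 50\%
print(len(legacy_escape("абв")))   # 6 (every non-ASCII character doubled)
```

Non-ASCII text doubles in length under this scheme, which is consistent with the large differences reported for the Cyrillic samples in the timings.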

The proposed patch makes re.escape() escape only the minimal set of characters that can have special meaning in regular expressions. This includes the special characters ".\\[]{}()*+?^$|", "-" (a range in a character set), "#" (starts a comment in verbose mode) and ASCII whitespace (ignored in verbose mode).

The null character no longer needs special escaping.
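
The proposed rule can be sketched with str.translate. The helper name minimal_escape and the table below are illustrative and built only from the character list in this message; the merged patch may differ in detail.

```python
import re

# Hypothetical table: special characters, "-", "#" and ASCII whitespace,
# exactly as listed in the proposal.
_SPECIAL = ".\\[]{}()*+?^$|" + "-" + "#" + " \t\n\r\v\f"
_ESCAPE_MAP = {ord(ch): "\\" + ch for ch in _SPECIAL}

def minimal_escape(text: str) -> str:
    """Backslash-escape only the characters listed in the proposal."""
    return text.translate(_ESCAPE_MAP)

sample = "price: 10$ (approx.) #1\t50%"
pattern = minimal_escape(sample)

# The escaped pattern still matches the original text literally,
# in normal mode and in verbose mode.
assert re.fullmatch(pattern, sample)
assert re.fullmatch(pattern, sample, re.VERBOSE)

# Characters that are always literal pass through unchanged.
assert minimal_escape("x!y:z") == "x!y:z"
```

Escaping "#" and whitespace is what keeps the result safe under re.VERBOSE, where they would otherwise be treated as a comment start or ignored.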

The patch also increases the speed of re.escape() (even if it produces the same result).

$ ./python -m perf timeit -s 'from re import escape; s = "()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 42.2 us +- 0.8 us
Patched:    Median +- std dev: 11.4 us +- 0.1 us

$ ./python -m perf timeit -s 'from re import escape; s = b"()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 38.7 us +- 0.7 us
Patched:    Median +- std dev: 18.4 us +- 0.2 us

$ ./python -m perf timeit -s 'from re import escape; s = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 40.3 us +- 0.5 us
Patched:    Median +- std dev: 33.1 us +- 0.6 us

$ ./python -m perf timeit -s 'from re import escape; s = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 54.4 us +- 0.7 us
Patched:    Median +- std dev: 40.6 us +- 0.5 us

$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 156 us +- 3 us
Patched:    Median +- std dev: 43.5 us +- 0.5 us

$ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode()' -- --duplicate 100 'escape(s)'
Unpatched:  Median +- std dev: 200 us +- 4 us
Patched:    Median +- std dev: 77.0 us +- 0.6 us
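
For readers without the perf module installed, a rough equivalent of the measurements above can be sketched with the standard-library timeit module. The helper name median_escape_time and its defaults are illustrative, and absolute numbers will differ from those reported here.

```python
import re
import timeit

def median_escape_time(sample, repeat=5, number=1000):
    """Return the median per-call time of re.escape(sample), in microseconds."""
    timer = timeit.Timer(lambda: re.escape(sample))
    # Each run measures `number` calls; take the median of the runs.
    runs = sorted(timer.repeat(repeat=repeat, number=number))
    return runs[len(runs) // 2] / number * 1e6

samples = {
    "special": "()[]{}?*+-|^$\\.# \t\n\r\v\f",
    "ascii": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    "ascii bytes": b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
}
for name, sample in samples.items():
    print(f"{name}: {median_escape_time(sample):.2f} us per call")
```

Running this under an unpatched and a patched interpreter gives a comparable before/after picture, though without the statistical care that perf applies.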

The patch also speeds up compilation of the escaped string:

$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"; p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched:  Median +- std dev: 1.96 ms +- 0.02 ms
Patched:    Median +- std dev: 1.16 ms +- 0.02 ms

$ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode(); p = escape(s)' -- --duplicate 100 'compile(p)'
Unpatched:  Median +- std dev: 3.69 ms +- 0.04 ms
Patched:    Median +- std dev: 2.13 ms +- 0.03 ms
msg291624 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-04-13 18:06
New changeset 5908300e4b0891fc5ab8bd24fba8fac72012eaa7 by Serhiy Storchaka in branch 'master':
bpo-29995: re.escape() now escapes only special characters. (#1007)
https://github.com/python/cpython/commit/5908300e4b0891fc5ab8bd24fba8fac72012eaa7
msg295723 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-06-11 17:50
New changeset a895f91a46c65a6076e8c6a28af0df1a07ed60a2 by terryjreedy in branch '3.6':
[3.6]bpo-29995: Adjust IDLE test for 3.7 re.escape change [GH-1007] (#2114)
https://github.com/python/cpython/commit/a895f91a46c65a6076e8c6a28af0df1a07ed60a2
msg295727 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-06-11 18:32
Serhiy, please nosy me when you change idlelib files.
History
Date User Action Args
2017-06-11 18:32:41 terry.reedy set messages: + msg295727; versions: + Python 3.6
2017-06-11 17:50:53 terry.reedy set nosy: + terry.reedy; messages: + msg295723
2017-06-11 17:35:55 terry.reedy set pull_requests: + pull_request2167
2017-04-13 18:14:26 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2017-04-13 18:06:45 serhiy.storchaka set messages: + msg291624
2017-04-12 09:48:16 serhiy.storchaka set assignee: serhiy.storchaka; dependencies: + Add examples for re.escape()
2017-04-05 14:23:39 serhiy.storchaka set pull_requests: + pull_request1175
2017-04-05 14:17:51 serhiy.storchaka create