re.escape() escapes too much #74181

Closed
serhiy-storchaka opened this issue Apr 5, 2017 · 5 comments
Labels
3.7 (EOL) end of life stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement

Comments

@serhiy-storchaka
Member

BPO 29995
Nosy @terryjreedy, @ezio-melotti, @serhiy-storchaka, @ltworf
PRs
  • bpo-29995: re.escape() now escapes only special characters. #1007
  • [3.6]bpo-29995: Adjust IDLE test for 3.7 re.escape change [GH-1007] #2114
Dependencies
  • bpo-30021: Add examples for re.escape()

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2017-04-13.18:14:26.723>
    created_at = <Date 2017-04-05.14:17:51.084>
    labels = ['expert-regex', 'type-feature', 'library', '3.7']
    title = 're.escape() escapes too much'
    updated_at = <Date 2019-01-31.14:58:48.888>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2019-01-31.14:58:48.888>
    actor = 'LtWorf'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2017-04-13.18:14:26.723>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2017-04-05.14:17:51.084>
    creator = 'serhiy.storchaka'
    dependencies = ['30021']
    files = []
    hgrepos = []
    issue_num = 29995
    keywords = []
    message_count = 5.0
    messages = ['291177', '291624', '295723', '295727', '334629']
    nosy_count = 5.0
    nosy_names = ['terry.reedy', 'ezio.melotti', 'mrabarnett', 'serhiy.storchaka', 'LtWorf']
    pr_nums = ['1007', '2114']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue29995'
    versions = ['Python 3.6', 'Python 3.7']

    @serhiy-storchaka
    Member Author

    re.escape() escapes every character except ASCII letters, digits and '_'. This is excessive: it makes escaping and compiling slower and makes the pattern less human-readable. The characters "!\"%&\',/:;<=>@_`~", as well as non-ASCII characters, are always literal in a regular expression and don't need escaping.

    The proposed patch makes re.escape() escape only the minimal set of characters that can have special meaning in regular expressions: the special characters ".\\[]{}()*+?^$|", "-" (a range in a character set), "#" (starts a comment in verbose mode) and ASCII whitespace (ignored in verbose mode).

    The null character no longer needs special escaping.
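
The effect of the change can be illustrated with a small sketch. It assumes Python 3.7 or later, where the patched behaviour is in place; the helper name `escaped_chars` is hypothetical and is not part of the `re` module.

```python
import re


def escaped_chars(text):
    """Return the sorted, distinct characters of `text` that re.escape() changes.

    Hypothetical helper for illustration only.
    """
    return sorted({ch for ch in text if re.escape(ch) != ch})


# Characters with special meaning are still escaped.
print(re.escape("a.b*c"))  # a\.b\*c

# Assuming Python 3.7+: "/", ":" and "@" and non-ASCII letters pass through,
# so only "." is reported as changed here.
print(escaped_chars("http://example.org/naïve@host"))  # ['.']

# "-", "#" and ASCII whitespace are escaped as described above.
print(escaped_chars("x-y #z"))  # [' ', '#', '-']

# Assuming Python 3.7+: the null character is returned unchanged.
print(re.escape("\0") == "\0")  # True
```

On Python 3.6 and earlier the same calls would report many more escaped characters, which is the behaviour this issue set out to change.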

    The patch also makes re.escape() faster, even for inputs where it produces the same result:

    $ ./python -m perf timeit -s 'from re import escape; s = "()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
    Unpatched:  Median +- std dev: 42.2 us +- 0.8 us
    Patched:    Median +- std dev: 11.4 us +- 0.1 us
    
    $ ./python -m perf timeit -s 'from re import escape; s = b"()[]{}?*+-|^$\\.# \t\n\r\v\f"' -- --duplicate 100 'escape(s)'
    Unpatched:  Median +- std dev: 38.7 us +- 0.7 us
    Patched:    Median +- std dev: 18.4 us +- 0.2 us
    
    $ ./python -m perf timeit -s 'from re import escape; s = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
    Unpatched:  Median +- std dev: 40.3 us +- 0.5 us
    Patched:    Median +- std dev: 33.1 us +- 0.6 us
    
    $ ./python -m perf timeit -s 'from re import escape; s = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"' -- --duplicate 100 'escape(s)'
    Unpatched:  Median +- std dev: 54.4 us +- 0.7 us
    Patched:    Median +- std dev: 40.6 us +- 0.5 us
    
    $ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"' -- --duplicate 100 'escape(s)'
    Unpatched:  Median +- std dev: 156 us +- 3 us
    Patched:    Median +- std dev: 43.5 us +- 0.5 us
    
    $ ./python -m perf timeit -s 'from re import escape; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode()' -- --duplicate 100 'escape(s)'
    Unpatched:  Median +- std dev: 200 us +- 4 us
    Patched:    Median +- std dev: 77.0 us +- 0.6 us
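
A rough way to repeat this kind of measurement with only the standard library is sketched below. It uses `timeit` rather than the `perf` module shown above, so absolute numbers are not comparable with the figures in this issue; the helper name `time_escape` and the loop counts are assumptions made for illustration.

```python
import timeit
from re import escape

# Inputs taken from the benchmark commands above.
SPECIAL = "()[]{}?*+-|^$\\.# \t\n\r\v\f"
ASCII_ALNUM = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def time_escape(sample, number=10_000, repeat=5):
    """Return the best observed time per escape(sample) call, in microseconds.

    Hypothetical helper: takes the minimum over `repeat` runs of `number` calls.
    """
    timer = timeit.Timer(lambda: escape(sample))
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1e6


for label, sample in [
    ("special (str)", SPECIAL),
    ("special (bytes)", SPECIAL.encode()),
    ("alphanumeric (str)", ASCII_ALNUM),
    ("alphanumeric (bytes)", ASCII_ALNUM.encode()),
]:
    print(f"{label}: {time_escape(sample):.3f} us per call")
```

Running the same script under two interpreter builds gives a before/after comparison similar in spirit to the "Unpatched" and "Patched" rows.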

    The patch also speeds up compilation of the escaped string:

    $ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"; p = escape(s)' -- --duplicate 100 'compile(p)'
    Unpatched:  Median +- std dev: 1.96 ms +- 0.02 ms
    Patched:    Median +- std dev: 1.16 ms +- 0.02 ms
    
    $ ./python -m perf timeit -s 'from re import escape; from sre_compile import compile; s = "абвгґдеєжзиіїйклмнопрстуфхцчшщьюяАБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ".encode(); p = escape(s)' -- --duplicate 100 'compile(p)'
    Unpatched:  Median +- std dev: 3.69 ms +- 0.04 ms
    Patched:    Median +- std dev: 2.13 ms +- 0.03 ms

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life stdlib Python modules in the Lib dir topic-regex type-feature A feature request or enhancement labels Apr 5, 2017
    @serhiy-storchaka serhiy-storchaka self-assigned this Apr 12, 2017
    @serhiy-storchaka
    Member Author

    New changeset 5908300 by Serhiy Storchaka in branch 'master':
    bpo-29995: re.escape() now escapes only special characters. (bpo-1007)
    5908300

    @terryjreedy
    Member

    New changeset a895f91 by terryjreedy in branch '3.6':
    [3.6]bpo-29995: Adjust IDLE test for 3.7 re.escape change [GH-1007] (bpo-2114)
    a895f91

    @terryjreedy
    Member

    Serhiy, please nosy me when you change idlelib files.

    @ltworf
    Mannequin

    ltworf mannequin commented Jan 31, 2019

    Aaaand this broke my unit tests when moving from 3.6 to 3.7!
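
Tests that compare the output of re.escape() with a hard-coded string depend on the exact escaping of one Python version. One possible alternative, sketched here as an assumption about how such tests might be written and not as what any commenter did, is to assert on matching behaviour instead. The helper name `matches_literally` is hypothetical.

```python
import re


def matches_literally(text, flags=0):
    """Return True if the escaped form of `text` matches exactly `text`.

    Hypothetical helper: checks behaviour, not the exact escaped string.
    """
    return re.fullmatch(re.escape(text), text, flags) is not None


samples = [
    "a.b*c",
    "price: $5 (approx)",
    "x-y # note",
    "tab\tand\nnewline",
    "naïve/путь@host",
]

for text in samples:
    # The escaped pattern must match the original text, including in
    # verbose mode, where unescaped whitespace and "#" would be special.
    assert matches_literally(text)
    assert matches_literally(text, re.VERBOSE)

# The escaped pattern must not treat "." as a wildcard.
assert re.fullmatch(re.escape("a.b"), "axb") is None
```

Checks of this kind should hold both before and after the change, because they do not depend on which non-special characters receive a backslash.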

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022