Faster compiling of charset regexpes #63528

Closed
serhiy-storchaka opened this issue Oct 21, 2013 · 6 comments
Assignees
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), topic-regex

Comments

@serhiy-storchaka
Member

BPO 19329
Nosy @vstinner, @ezio-melotti, @serhiy-storchaka
Dependencies
  • bpo-19327: re doesn't work with big charsets
Files
  • re_mk_bitmap.patch
  • re_optimize_charset.patch
  • re_optimize_charset_2.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2013-10-27.06:24:34.571>
    created_at = <Date 2013-10-21.12:01:18.787>
    labels = ['expert-regex', 'library', 'performance']
    title = 'Faster compiling of charset regexpes'
    updated_at = <Date 2014-10-31.11:55:20.032>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2014-10-31.11:55:20.032>
    actor = 'python-dev'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2013-10-27.06:24:34.571>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2013-10-21.12:01:18.787>
    creator = 'serhiy.storchaka'
    dependencies = ['19327']
    files = ['32278', '32337', '32364']
    hgrepos = []
    issue_num = 19329
    keywords = ['patch']
    message_count = 6.0
    messages = ['200755', '201166', '201292', '201419', '201420', '230335']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue19329'
    versions = ['Python 3.4']

    @serhiy-storchaka
    Member Author

    Here is a patch which speeds up compiling of regular expressions with big charsets.

    Microbenchmark:
    $ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))" "compile(r, 0)"

    Unpatched (but with bpo-19327 fixed): 119 msec per loop
    Patched: 59.6 msec per loop
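
The measurement above can also be reproduced from a script. The following is only a minimal sketch, not the command used in this issue: it goes through the public `re.compile` and calls `re.purge()` before every compile so the pattern cache is bypassed, instead of importing the internal `sre_compile` module, which may not be importable under that name on newer Python versions. The names `BIG_CHARSET` and `time_compile` are made up for this illustration, and absolute timings will differ from the figures quoted here.

```python
import re
import timeit

# The pattern from the microbenchmark: a single character class holding
# 256 code points spread across the Basic Multilingual Plane.
BIG_CHARSET = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))


def time_compile(pattern, repeat=5, number=20):
    """Return the best observed time, in seconds, for one uncached compile."""
    def compile_uncached():
        re.purge()  # clear re's pattern cache so the pattern is really recompiled
        return re.compile(pattern)

    timings = timeit.repeat(compile_uncached, repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    seconds = time_compile(BIG_CHARSET)
    print("%.3f msec per compile" % (seconds * 1e3))
```

Running the script on an unpatched and a patched build gives two numbers that can be compared the same way as the `timeit` lines above. The measured time includes the small cost of `re.purge()` itself.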

    Compiling regular expressions with big charsets was the main cause of the slow import of the email.message module (bpo-11454).

    @serhiy-storchaka serhiy-storchaka self-assigned this Oct 21, 2013
    @serhiy-storchaka serhiy-storchaka added the stdlib (Python modules in the Lib dir), topic-regex, and performance (Performance or resource usage) labels Oct 21, 2013
    @serhiy-storchaka
    Member Author

    Here is a more complex patch which optimizes charset compiling. It affects small charsets too. Big charsets now support the same optimizations as small charsets. The optimized bitmap can now be used even if the charset contains category items or non-BMP characters.

    $ ./python -m timeit "from sre_compile import compile; r = '[0-9]+'"  "compile(r, 0)"
    Unpatched: 1000 loops, best of 3: 457 usec per loop
    Patched: 1000 loops, best of 3: 368 usec per loop
    $ ./python -m timeit "from sre_compile import compile; r = '[ \t\n\r\v\f]+'"  "compile(r, 0)"
    Unpatched: 1000 loops, best of 3: 490 usec per loop
    Patched: 1000 loops, best of 3: 413 usec per loop
    $ ./python -m timeit "from sre_compile import compile; r = '[0-9A-Za-z_]+'"  "compile(r, 0)"
    Unpatched: 1000 loops, best of 3: 760 usec per loop
    Patched: 1000 loops, best of 3: 527 usec per loop
    $ ./python -m timeit "from sre_compile import compile; r = r'[^\ud800-\udfff]*'"  "compile(r, 0)"
    Unpatched: 100 loops, best of 3: 2.07 msec per loop
    Patched: 1000 loops, best of 3: 1.44 msec per loop
    $ ./python -m timeit "from sre_compile import compile; r = '[\u0410-\u042f\u0430-\u043f\u0404\u0406\u0407\u0454\u0456\u0457\u0490\u0491]+'"  "compile(r, 0)"
    Unpatched: 100 loops, best of 3: 8.24 msec per loop
    Patched: 100 loops, best of 3: 2.13 msec per loop
    $ ./python -m timeit "from sre_compile import compile; r = '[%s]' % ''.join(map(chr, range(256, 2**16, 255)))"  "compile(r, 0)"
    Unpatched: 10 loops, best of 3: 119 msec per loop
    Patched: 10 loops, best of 3: 24.1 msec per loop
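
For readers unfamiliar with the bitmap representation mentioned above, here is a simplified illustration of the general idea. It is not the code from the patch: all function names are hypothetical, and the layout (a 256-bit bitmap for code points below 256, and a two-level table with shared 256-bit blocks for larger BMP charsets) is an assumption about how such charsets are commonly packed, written with plain Python integers for clarity.

```python
def make_bitmap(codepoints):
    """Pack code points 0..255 into one integer; bit i is set if i is a member."""
    bits = 0
    for cp in codepoints:
        if not 0 <= cp < 256:
            raise ValueError("code point out of bitmap range: %r" % cp)
        bits |= 1 << cp
    return bits


def in_bitmap(bits, cp):
    """Membership test against a bitmap built by make_bitmap()."""
    return 0 <= cp < 256 and (bits >> cp) & 1 == 1


def make_big_charset(codepoints):
    """Build a two-level table for BMP code points.

    The high byte selects an entry in a 256-item index; the entry names a
    256-bit block, and identical blocks are stored only once.
    """
    chunks = {}
    for cp in codepoints:
        if not 0 <= cp < 0x10000:
            raise ValueError("code point outside the BMP: %r" % cp)
        chunks[cp >> 8] = chunks.get(cp >> 8, 0) | (1 << (cp & 0xFF))

    blocks = [0]          # block 0 is the shared all-empty block
    block_ids = {0: 0}    # block value -> position in blocks
    index = []
    for high in range(256):
        block = chunks.get(high, 0)
        if block not in block_ids:
            block_ids[block] = len(blocks)
            blocks.append(block)
        index.append(block_ids[block])
    return index, blocks


def in_big_charset(table, cp):
    """Membership test against a table built by make_big_charset()."""
    index, blocks = table
    return 0 <= cp < 0x10000 and (blocks[index[cp >> 8]] >> (cp & 0xFF)) & 1 == 1


if __name__ == "__main__":
    digits = make_bitmap(map(ord, "0123456789"))
    assert in_bitmap(digits, ord("5")) and not in_bitmap(digits, ord("a"))

    cyrillic = make_big_charset(range(0x0410, 0x0430))
    assert in_big_charset(cyrillic, 0x0415) and not in_big_charset(cyrillic, 0x0430)
```

In this sketch a membership test costs one shift and one mask regardless of how many characters the class lists, and a sparse charset needs few distinct blocks because all empty ranges share block 0.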

    @serhiy-storchaka serhiy-storchaka changed the title from "Faster compiling of big charset regexpes" to "Faster compiling of charset regexpes" Oct 24, 2013
    @serhiy-storchaka
    Member Author

    The updated patch addresses Antoine's comments. One bug of mine is also fixed.

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 27, 2013

    New changeset d5498d9d9bb0 by Serhiy Storchaka in branch 'default':
    Issue bpo-19329: Optimized compiling charsets in regular expressions.
    http://hg.python.org/cpython/rev/d5498d9d9bb0

    @serhiy-storchaka
    Member Author

    Thank you, Antoine, for your review.

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 31, 2014

    New changeset ebd48b4f650d by Serhiy Storchaka in branch '2.7':
    Backported the optimization of compiling charsets in regular expressions
    https://hg.python.org/cpython/rev/ebd48b4f650d

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022