
re.split performance degraded significantly by capturing group #68614

Closed

PatrickMaupin mannequin opened this issue Jun 10, 2015 · 12 comments
Assignees: serhiy-storchaka
Labels: performance, topic-regex

Comments


PatrickMaupin mannequin commented Jun 10, 2015

BPO 24426
Nosy @ezio-melotti, @serhiy-storchaka
Files
  • splitter2.py: Script showing re.split capture performance problem
  • re_literal_prefix_with_groups.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-06-21.11:10:26.437>
    created_at = <Date 2015-06-10.18:53:08.978>
    labels = ['expert-regex', 'performance']
    title = 're.split performance degraded significantly by capturing group'
    updated_at = <Date 2015-06-21.11:10:26.436>
    user = 'https://bugs.python.org/PatrickMaupin'

    bugs.python.org fields:

    activity = <Date 2015-06-21.11:10:26.436>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-06-21.11:10:26.437>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2015-06-10.18:53:08.978>
    creator = 'Patrick Maupin'
    dependencies = []
    files = ['39676', '39684']
    hgrepos = []
    issue_num = 24426
    keywords = ['patch']
    message_count = 12.0
    messages = ['245137', '245139', '245141', '245144', '245145', '245184', '245198', '245310', '245316', '245319', '245326', '245589']
    nosy_count = 5.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka', 'Patrick Maupin']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue24426'
    versions = ['Python 3.6']


    PatrickMaupin mannequin commented Jun 10, 2015

    The addition of a capturing group in a re.split() pattern, e.g. using '(\n)' instead of '\n', causes a factor of 10 performance degradation.

I use re.split() a lot, but never noticed the issue before. It was extremely noticeable on 1000 patterns in a 5GB file, though, requiring 40 seconds instead of 4.

I have attached a script demonstrating the issue. I have tested on 2.7 and 3.4, but have no reason to believe it doesn't exist on other versions as well.

    Thanks,
    Pat
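The attached splitter2.py is not reproduced in this thread; a minimal sketch of the same measurement (timing harness hypothetical, data sizes taken from the scaled-down example later in the thread) might look like this:

    import re
    import time

    # Scaled-down version of the reported workload: long runs of filler
    # separated by newlines.
    data = 100 * (200000 * ' ' + '\n')

    for pattern in (r'\n', r'(\n)'):
        start = time.time()
        parts = re.split(pattern, data)
        elapsed = time.time() - start
        print('%-6s -> %d pieces in %.3f s' % (pattern, len(parts), elapsed))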

PatrickMaupin mannequin added the topic-regex and performance labels Jun 10, 2015
serhiy-storchaka commented:

The regular expression engine is optimized for the case when the pattern starts with a constant string or character set. There is no degradation when using '(\n)'; rather, '\n' benefits from an optimization that '(\n)' does not.

serhiy-storchaka commented:

    Splitting with pattern '\n(?<=(\n))' produces the same result as with pattern '(\n)' and is as fast as with pattern '\n'.
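A quick equivalence check of the suggested rewrite (a minimal sketch, not part of the original comment): the zero-width lookbehind captures the newline without giving up the fast constant-prefix search.

    import re

    s = ('a' * 100 + '\n') * 5
    # The lookbehind group captures the '\n' the engine just matched.
    assert re.split(r'\n(?<=(\n))', s) == re.split(r'(\n)', s)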


    PatrickMaupin mannequin commented Jun 10, 2015

    1. I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed.

    2. '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do.

3. My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)', because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file. Yes, I'm sure this could be optimized as well. But when I can just remove the parentheses and get a 10X speedup, and then figure out the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that the re module has performance issues with capturing groups in split(), so that casual users like me know to recover the matched strings some other way?

I assumed that, since I saw almost exactly the same performance degradation with \n as I did with this, it was a valid test case. If that was a bad assumption and this is insufficient to debug the problem, I can submit a bigger test case.

But if this issue is going to be wontfixed for some reason, there should certainly be a documentation note added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing while also capturing the 1K ten-byte strings in between the big ones.

    Thanks,
    Pat


    PatrickMaupin mannequin commented Jun 10, 2015

    Just to be perfectly clear, this is no exaggeration:

    My original file was slightly over 5GB.

    I have approximately 1050 bad strings in it, averaging around 11 characters per string.

    If I split it without capturing those 1050 strings, it takes 3.7 seconds.

    If I split it and capture those 1050 strings, it takes 39 seconds.

ISTM that 33 ms to create a capture group containing a single 11-character string is excessive, so there is probably something else going on, like excessive object copying, that just isn't noticeable with a smaller source string.

    In the small example I posted, if I replace the line:

    data = 100 * (200000 * ' ' + '\n')

    with

    data = 1000 * (500000 * ' ' + '\n')

then I get approximately the same 3.7 second vs 39 second results on that (somewhat older) machine. I didn't start out with that in the example because I thought the problem would still be obvious from the scaled-down example.

Obviously, your CPU numbers will be somewhat different. The question remains, though: why does it take around 60 million CPU cycles for each and every returned capture group? Or, to put it another way, why can I skip the capture group and grab the same string with pure Python, by looking at the lengths of the intervening strings, well over 100 times faster than the re module takes to give me that group?

    Thanks,
    Pat
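A sketch of the pure-Python workaround Patrick alludes to: split without a capturing group, then recover each separator from string offsets. The helper name is hypothetical, and it assumes the pattern itself contains no capturing groups and never produces empty matches:

    import re

    def split_with_seps(pattern, text):
        # Fast path: split without a capturing group.
        rx = re.compile(pattern)
        pieces = rx.split(text)
        out = []
        pos = 0
        for i, piece in enumerate(pieces):
            out.append(piece)
            pos += len(piece)
            if i + 1 < len(pieces):
                # The separator starts exactly where this piece ends;
                # re-match it there to recover its text.
                m = rx.match(text, pos)
                out.append(m.group())
                pos = m.end()
        return out

    # Behaves like splitting with a capturing group:
    # split_with_seps(r'\n', 'a\nb\nc') == re.split(r'(\n)', 'a\nb\nc')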

serhiy-storchaka commented:

Here is a patch that adds more optimizations for searching patterns that start with a literal string and groups. In particular, it covers the case when a pattern starts with a group containing a single character.

    Examples:

    $ ./python -m timeit -s "import re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    Unpatched: 100 loops, best of 3: 4.58 msec per loop
    Patched  : 1000 loops, best of 3: 562 usec per loop
    
    $ ./python -m timeit -s "import re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    Unpatched: 100 loops, best of 3: 3.1 msec per loop
    Patched  : 1000 loops, best of 3: 663 usec per loop

    For comparison:

    $ ./python -m timeit -s "import re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    1000 loops, best of 3: 329 usec per loop
    $ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    1000 loops, best of 3: 338 usec per loop

More complex but rare cases, such as '\n()\r' or '((\n)(\r))', are also optimized.

Fast searching can no longer be disabled.

serhiy-storchaka self-assigned this Jun 11, 2015

    PatrickMaupin mannequin commented Jun 11, 2015

Thank you for the quick response, Serhiy. I had started investigating and had come to the conclusion that it was a problem with the compiler rather than the C engine. Interestingly, my next step was going to be to use names for the compiler constants, and then I noticed you did exactly that in the 3.6 tree.

    I will try this out on my use-case tomorrow to make sure it fixes my issue, but now I'm intrigued by the inner workings of this, so I have two questions:

    1. Do you know if anybody maintains a patched version of the Python code anywhere? I could put a package up on github/PyPI, if not.

    2. Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine? I could try to do some work on this and put it with the package, if not, or point to it if so.

    Thanks and best regards,
    Pat

serhiy-storchaka commented:

    1. Do you know if anybody maintains a patched version of the Python code anywhere? I could put a package up on github/PyPI, if not.

Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on Bitbucket [1] and GitHub [2]. They don't contain unofficial patches, but perhaps there are private clones with additional patches. Of course, different Linux distributions can ship Python with their own patches. And you can maintain a private fork of CPython with your patches for your own or your company's needs.

But if you need only optimized regular expressions, I suggest you look at the regex module [3]. It is more powerful and has better Unicode support.

Results of the same microbenchmarks for regex:

    $ ./python -m timeit -s "import regex as re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    1000 loops, best of 3: 544 usec per loop
    $ ./python -m timeit -s "import regex as re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    1000 loops, best of 3: 661 usec per loop
    $ ./python -m timeit -s "import regex as re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    1000 loops, best of 3: 521 usec per loop
    $ ./python -m timeit -s "import regex as re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    1000 loops, best of 3: 743 usec per loop

regex is slightly slower than the optimized re in these cases, but it is much faster than the non-optimized re when splitting with a capturing group.

2. Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine? I could try to do some work on this and put it with the package, if not, or point to it if so.

Sorry, I didn't understand you. Do you mean documenting the codes of a compiled re pattern? That is an implementation detail and will change in the future.

    [1] https://bitbucket.org/mirror/cpython
    [2] https://github.com/python/cpython
    [3] https://pypi.python.org/pypi/regex
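For the curious: CPython does expose the parse stage, even though it is an implementation detail as noted above. A minimal sketch (output format varies across versions):

    import re
    import sre_parse  # CPython implementation detail; renamed re._parser in 3.11+

    # re.DEBUG prints the parsed pattern tree at compile time.
    re.compile(r'(\n)', re.DEBUG)

    # The same tree is available programmatically.
    tree = sre_parse.parse(r'(\n)')
    tree.dump()   # e.g. SUBPATTERN 1 / LITERAL 10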


    PatrickMaupin mannequin commented Jun 13, 2015

(stuff about CPython)

    No, I was curious about whether somebody maintained pure-Python fixes (e.g. to the re parser and compiler). Those could be in a regular package that fixed some corner cases such as the capture group you just applied a patch for.

    ... regex is more powerful and better supports Unicode.

Unfortunately, it is still not competitive. For example, in one package I maintain (github.com/pmaupin/pdfrw), I have one unittest which reads in and parses several PDF files and then writes them out to new PDF files:

    Python 2.7 with re -- 5.9 s
    Python 2.7 with regex -- 6.9 s
    Python 3.4 with re -- 7.2 s
    Python 3.4 with regex -- 8.2 s

A large part of the difference between 2.7 and 3.4 is that I'm too lazy, or it seems too error-prone, to put b'xxx' in front of every string, so the package uses the same source code for 2.7 and 3.4, which means unicode strings on 3.4 and byte strings on 2.7.
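One common way (hypothetical here, not necessarily what pdfrw does) to share one source tree across str and bytes inputs is to compile the pattern in both flavors and dispatch on the input type:

    import re

    # ASCII-only pattern template, compiled once for each string flavor.
    PATTERN = r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)'
    SPLIT_STR = re.compile(PATTERN)
    SPLIT_BYTES = re.compile(PATTERN.encode('ascii'))

    def split_any(data):
        rx = SPLIT_BYTES if isinstance(data, bytes) else SPLIT_STR
        return rx.split(data)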

Nonetheless, when you consider all the other work going on in the package, a 14% _overall_ slowdown to change to a "better" re package seems like going in the wrong direction, especially when stacked on top of the 22% slowdown for switching to Python 3.

    Do you mean documenting codes of compiled re pattern?

    Yes.

    This is implementation detail and will be changed in future.

Understood, and that's fine. If the documentation existed, it would have helped me create a pure-Python package that simply performed optimizations (like the one in your patch) against existing Python implementations, for use until regex (which is a HUGE undertaking) is ready.

    Thanks,
    Pat

serhiy-storchaka commented:

This is a reason to file a feature request against regex. In 3.3, re was slower than regex in some cases:

    $ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    Python 3.3 re   : 1000 loops, best of 3: 952 usec per loop
    Python 3.4 regex: 1000 loops, best of 3: 757 usec per loop
    Python 3.4 re   : 1000 loops, best of 3: 323 usec per loop

    And this optimization (bpo-18685 or others) can be applied to regex.

As for this particular issue, optimizing splitting with a 1-character capturing group needs changes to the C part of the re engine. The Python part of my patch is not needed for that; it is there only to generalize support for other corner cases. So this issue can't be fixed by patching only Python code.


    PatrickMaupin mannequin commented Jun 13, 2015

    OK, thanks.


    python-dev mannequin commented Jun 21, 2015

    New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default':
    Issue bpo-24426: Fast searching optimization in regular expressions now works
    https://hg.python.org/cpython/rev/7e46a503dd16

ezio-melotti transferred this issue from another repository Apr 10, 2022