
re.split performance degraded significantly by capturing group #68614

Closed

PatrickMaupin mannequin opened this issue Jun 10, 2015 · 12 comments
Assignees: serhiy-storchaka
Labels: performance, topic-regex

Comments


PatrickMaupin mannequin commented Jun 10, 2015

BPO 24426
Nosy @ezio-melotti, @serhiy-storchaka
Files
  • splitter2.py: Script showing re.split capture performance problem
  • re_literal_prefix_with_groups.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-06-21.11:10:26.437>
    created_at = <Date 2015-06-10.18:53:08.978>
    labels = ['expert-regex', 'performance']
    title = 're.split performance degraded significantly by capturing group'
    updated_at = <Date 2015-06-21.11:10:26.436>
    user = 'https://bugs.python.org/PatrickMaupin'

    bugs.python.org fields:

    activity = <Date 2015-06-21.11:10:26.436>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-06-21.11:10:26.437>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2015-06-10.18:53:08.978>
    creator = 'Patrick Maupin'
    dependencies = []
    files = ['39676', '39684']
    hgrepos = []
    issue_num = 24426
    keywords = ['patch']
    message_count = 12.0
    messages = ['245137', '245139', '245141', '245144', '245145', '245184', '245198', '245310', '245316', '245319', '245326', '245589']
    nosy_count = 5.0
    nosy_names = ['ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka', 'Patrick Maupin']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue24426'
    versions = ['Python 3.6']


    PatrickMaupin mannequin commented Jun 10, 2015

    The addition of a capturing group in a re.split() pattern, e.g. using '(\n)' instead of '\n', causes a factor of 10 performance degradation.

I use re.split() a lot, but never noticed the issue before. It was extremely noticeable on 1000 patterns in a 5GB file, though, requiring 40 seconds instead of 4.

I have attached a script demonstrating the issue. I have tested on 2.7 and 3.4, but have no reason to believe it doesn't exist on other versions as well.

    Thanks,
    Pat
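The attached splitter2.py is not reproduced in this thread; a minimal sketch of the same measurement (timing harness hypothetical, data sizes taken from the scaled-down example later in the thread) might look like this:

    import re
    import time

    # Scaled-down version of the reported workload: long runs of filler
    # separated by newlines.
    data = 100 * (200000 * ' ' + '\n')

    for pattern in (r'\n', r'(\n)'):
        start = time.time()
        parts = re.split(pattern, data)
        elapsed = time.time() - start
        print('%-6s -> %d pieces in %.3f s' % (pattern, len(parts), elapsed))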

PatrickMaupin mannequin added the topic-regex and performance labels Jun 10, 2015
serhiy-storchaka commented:

The regular expression engine is optimized for the case when the pattern starts with a constant string or character set. There is no degradation when using '(\n)'; rather, '\n' benefits from an optimization that '(\n)' does not.

serhiy-storchaka commented:

    Splitting with pattern '\n(?<=(\n))' produces the same result as with pattern '(\n)' and is as fast as with pattern '\n'.
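A quick equivalence check of the suggested rewrite (a minimal sketch, not part of the original comment): the zero-width lookbehind captures the newline without giving up the fast constant-prefix search.

    import re

    s = ('a' * 100 + '\n') * 5
    # The lookbehind group captures the '\n' the engine just matched.
    assert re.split(r'\n(?<=(\n))', s) == re.split(r'(\n)', s)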


    PatrickMaupin mannequin commented Jun 10, 2015

    1. I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed.

    2. '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do.

3. My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)', because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file. Yes, I'm sure this could be optimized as well. But when I can just remove the parentheses and get a 10X speedup, and then figure out the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that the re module has performance issues with capturing groups in split(), so that casual users like me know to recover the matched strings some other way?

I assumed that, since I saw almost exactly the same performance degradation with \n as I did with this, it was a valid test case. If that was a bad assumption and this is insufficient to debug the problem, I can submit a bigger test case.

But if this issue is going to be wontfixed for some reason, there should certainly be a documentation note added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing while also capturing the 1K ten-byte strings in between the big ones.

    Thanks,
    Pat


    PatrickMaupin mannequin commented Jun 10, 2015

    Just to be perfectly clear, this is no exaggeration:

    My original file was slightly over 5GB.

    I have approximately 1050 bad strings in it, averaging around 11 characters per string.

    If I split it without capturing those 1050 strings, it takes 3.7 seconds.

    If I split it and capture those 1050 strings, it takes 39 seconds.

ISTM that 33 ms to create a capture group containing a single 11-character string is excessive, so there is probably something else going on, like excessive object copying, that just isn't noticeable with a smaller source string.

    In the small example I posted, if I replace the line:

    data = 100 * (200000 * ' ' + '\n')

    with

    data = 1000 * (500000 * ' ' + '\n')

then I get approximately the same 3.7 second vs 39 second results on that (somewhat older) machine. I didn't start out with that in the example because I thought the problem would still be obvious from the scaled-down example.

Obviously, your CPU numbers will be somewhat different. The question remains, though: why does it take around 60 million CPU cycles for each and every returned capture group? Or, to put it another way, why can I skip the capture group and grab the same string with pure Python, by looking at the lengths of the intervening strings, well over 100 times faster than the re module takes to give me that group?

    Thanks,
    Pat
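A sketch of the pure-Python workaround Patrick alludes to: split without a capturing group, then recover each separator from string offsets. The helper name is hypothetical, and it assumes the pattern itself contains no capturing groups and never produces empty matches:

    import re

    def split_with_seps(pattern, text):
        # Fast path: split without a capturing group.
        rx = re.compile(pattern)
        pieces = rx.split(text)
        out = []
        pos = 0
        for i, piece in enumerate(pieces):
            out.append(piece)
            pos += len(piece)
            if i + 1 < len(pieces):
                # The separator starts exactly where this piece ends;
                # re-match it there to recover its text.
                m = rx.match(text, pos)
                out.append(m.group())
                pos = m.end()
        return out

    # Behaves like splitting with a capturing group:
    # split_with_seps(r'\n', 'a\nb\nc') == re.split(r'(\n)', 'a\nb\nc')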

serhiy-storchaka commented:

Here is a patch that adds more optimizations for searching patterns that start with a literal string and groups. In particular, it covers the case when a pattern starts with a group containing a single character.

    Examples:

    $ ./python -m timeit -s "import re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    Unpatched: 100 loops, best of 3: 4.58 msec per loop
    Patched  : 1000 loops, best of 3: 562 usec per loop
    
    $ ./python -m timeit -s "import re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    Unpatched: 100 loops, best of 3: 3.1 msec per loop
    Patched  : 1000 loops, best of 3: 663 usec per loop

    For comparison:

    $ ./python -m timeit -s "import re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    1000 loops, best of 3: 329 usec per loop
    $ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    1000 loops, best of 3: 338 usec per loop

More complex but rare cases, such as '\n()\r' or '((\n)(\r))', are also optimized.

Fast searching can no longer be disabled.

serhiy-storchaka self-assigned this Jun 11, 2015

    PatrickMaupin mannequin commented Jun 11, 2015

Thank you for the quick response, Serhiy. I had started investigating and had come to the conclusion that it was a problem with the compiler rather than the C engine. Interestingly, my next step was going to be to use names for the compiler constants, and then I noticed you did exactly that in the 3.6 tree.

    I will try this out on my use-case tomorrow to make sure it fixes my issue, but now I'm intrigued by the inner workings of this, so I have two questions:

    1. Do you know if anybody maintains a patched version of the Python code anywhere? I could put a package up on github/PyPI, if not.

    2. Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine? I could try to do some work on this and put it with the package, if not, or point to it if so.

    Thanks and best regards,
    Pat

serhiy-storchaka commented:

    1. Do you know if anybody maintains a patched version of the Python code anywhere? I could put a package up on github/PyPI, if not.

Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on Bitbucket [1] and GitHub [2]. They don't contain unofficial patches, but perhaps there are private clones with additional patches. Of course, different Linux distributions can ship Python with their own patches. And you can maintain a private fork of CPython with your patches for your own or your company's needs.

But if you need only optimized regular expressions, I suggest you look at the regex module [3]. It is more powerful and has better Unicode support.

Results of the same microbenchmarks for regex:

    $ ./python -m timeit -s "import regex as re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    1000 loops, best of 3: 544 usec per loop
    $ ./python -m timeit -s "import regex as re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
    1000 loops, best of 3: 661 usec per loop
    $ ./python -m timeit -s "import regex as re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    1000 loops, best of 3: 521 usec per loop
    $ ./python -m timeit -s "import regex as re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    1000 loops, best of 3: 743 usec per loop

regex is slightly slower than the optimized re in these cases, but it is much faster than the non-optimized re when splitting with a capturing group.

2. Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine? I could try to do some work on this and put it with the package, if not, or point to it if so.

Sorry, I didn't understand you. Do you mean documenting the codes of a compiled re pattern? That is an implementation detail and will change in the future.

    [1] https://bitbucket.org/mirror/cpython
    [2] https://github.com/python/cpython
    [3] https://pypi.python.org/pypi/regex
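For the curious: CPython does expose the parse stage, even though it is an implementation detail as noted above. A minimal sketch (output format varies across versions):

    import re
    import sre_parse  # CPython implementation detail; renamed re._parser in 3.11+

    # re.DEBUG prints the parsed pattern tree at compile time.
    re.compile(r'(\n)', re.DEBUG)

    # The same tree is available programmatically.
    tree = sre_parse.parse(r'(\n)')
    tree.dump()   # e.g. SUBPATTERN 1 / LITERAL 10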


    PatrickMaupin mannequin commented Jun 13, 2015

(stuff about CPython)

    No, I was curious about whether somebody maintained pure-Python fixes (e.g. to the re parser and compiler). Those could be in a regular package that fixed some corner cases such as the capture group you just applied a patch for.

    ... regex is more powerful and better supports Unicode.

Unfortunately, it is still not competitive. For example, in one package I maintain (github.com/pmaupin/pdfrw), I have one unittest which reads in and parses several PDF files and then writes them out to new PDF files:

    Python 2.7 with re -- 5.9 s
    Python 2.7 with regex -- 6.9 s
    Python 3.4 with re -- 7.2 s
    Python 3.4 with regex -- 8.2 s

A large part of the difference between 2.7 and 3.4 is that I'm too lazy, or it seems too error-prone, to put b'xxx' in front of every string, so the package uses the same source code for 2.7 and 3.4, which means unicode strings on 3.4 and byte strings on 2.7.
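One common way (hypothetical here, not necessarily what pdfrw does) to share one source tree across str and bytes inputs is to compile the pattern in both flavors and dispatch on the input type:

    import re

    # ASCII-only pattern template, compiled once for each string flavor.
    PATTERN = r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)'
    SPLIT_STR = re.compile(PATTERN)
    SPLIT_BYTES = re.compile(PATTERN.encode('ascii'))

    def split_any(data):
        rx = SPLIT_BYTES if isinstance(data, bytes) else SPLIT_STR
        return rx.split(data)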

Nonetheless, when you consider all the other work going on in the package, a 14% _overall_ slowdown to change to a "better" re package seems like going in the wrong direction, especially when stacked on top of the 22% slowdown for switching to Python 3.

    Do you mean documenting codes of compiled re pattern?

    Yes.

    This is implementation detail and will be changed in future.

Understood, and that's fine. If the documentation existed, it would have helped me create a pure-Python package that simply performed optimizations (like the one in your patch) against existing Python implementations, for use until regex (which is a HUGE undertaking) is ready.

    Thanks,
    Pat

serhiy-storchaka commented:

This is a reason to file a feature request against regex. In 3.3, re was slower than regex in some cases:

    $ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
    Python 3.3 re   : 1000 loops, best of 3: 952 usec per loop
    Python 3.4 regex: 1000 loops, best of 3: 757 usec per loop
    Python 3.4 re   : 1000 loops, best of 3: 323 usec per loop

    And this optimization (bpo-18685 or others) can be applied to regex.

As for this particular issue, optimizing splitting with a 1-character capturing group needs changes to the C part of the re engine. The Python part of my patch is not needed for that; it is there only to generalize support for other corner cases. So this issue can't be fixed by patching only Python code.


    PatrickMaupin mannequin commented Jun 13, 2015

    OK, thanks.


    python-dev mannequin commented Jun 21, 2015

    New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default':
    Issue bpo-24426: Fast searching optimization in regular expressions now works
    https://hg.python.org/cpython/rev/7e46a503dd16

ezio-melotti transferred this issue from another repository Apr 10, 2022