classification
Title: re.split performance degraded significantly by capturing group
Type: performance
Stage: resolved
Components: Regular Expressions
Versions: Python 3.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: Patrick Maupin, ezio.melotti, mrabarnett, python-dev, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2015-06-10 18:53 by Patrick Maupin, last changed 2015-06-21 11:10 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
splitter2.py Patrick Maupin, 2015-06-10 18:53 Script showing re.split capture performance problem
re_literal_prefix_with_groups.patch serhiy.storchaka, 2015-06-11 20:22 review
Messages (12)
msg245137 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-10 18:53
The addition of a capturing group in a re.split() pattern, e.g. using '(\n)' instead of '\n', causes a factor of 10 performance degradation.

I use re.split() a lot, but never noticed the issue before.  It was extremely noticeable on 1000 patterns in a 5GB file, though, requiring 40 seconds instead of 4.

I have attached a script demonstrating the issue.  I have tested on 2.7 and 3.4, but have no reason to believe it doesn't exist on other versions as well.

Thanks,
Pat
msg245139 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-10 19:18
A regular expression is optimized for the case where it starts with a constant string or character set. There is no degradation when using '(\n)'; rather, '\n' benefits from an optimization.
msg245141 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-10 19:34
Splitting with pattern '\n(?<=(\n))' produces the same result as with pattern '(\n)' and is as fast as with pattern '\n'.
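As a sanity check (this example is not from the original message), a minimal sketch verifying that the lookbehind rewrite produces the same split result as the capturing-group form:

```python
import re

s = "one\ntwo\nthree"

# The lookbehind captures the '\n' that was just consumed, so group 1
# still appears in the split output, matching the plain '(\n)' form.
assert re.split(r'\n(?<=(\n))', s) == re.split(r'(\n)', s)
print(re.split(r'\n(?<=(\n))', s))
```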
msg245144 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-10 20:16
1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed.

2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do.

3) My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)' because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file.  Yes, I'm sure this could be optimized as well.  But when I can just remove the parentheses, get a 10X speedup, and then figure out the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that the re module has performance issues with capturing groups in split(), so that casual users like me know to figure out what the matching strings are some other way?


I assumed that, since I saw almost exactly the same performance degradation with \n as I did with this, it was a valid testcase.  If that was a bad assumption and this is insufficient to debug the issue, I can submit a bigger testcase.


But if this issue is going to be wontfixed for some reason, there should certainly be a documentation note added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing while also capturing the 1K ten-byte strings in between the big ones.


Thanks,
Pat
msg245145 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-10 21:28
Just to be perfectly clear, this is no exaggeration:

My original file was slightly over 5GB.

I have approximately 1050 bad strings in it, averaging around 11 characters per string.

If I split it without capturing those 1050 strings, it takes 3.7 seconds.

If I split it and capture those 1050 strings, it takes 39 seconds.

ISTM that 33 ms to create a capture group containing a single 11-character string is excessive, so there is probably something else going on, like excessive object copying, that just isn't noticeable on a smaller source string.

In the small example I posted, if I replace the line:

data = 100 * (200000 * ' ' + '\n')

with 

data = 1000 * (500000 * ' ' + '\n')

then I get approximately the same 3.7 second vs 39 second results on that (somewhat older) machine.  I didn't start out with that in the example, because I thought the problem should still be obvious from the scaled down example.

Obviously, your CPU numbers will be somewhat different.  The question remains, though, why it takes around 60 million CPU cycles for each returned capture group.  Or, to put it another way: why can I skip the capture group and grab the same string in pure Python, by looking at the lengths of the intervening strings, well over 100 times faster than the re module can give me that group?
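The workaround described here — splitting without the capturing group (the fast path) and then recovering the delimiter text from match offsets — could be sketched roughly as follows. This is a hypothetical reconstruction for illustration, not Patrick's actual code:

```python
import re

pattern = re.compile(r'\n')   # no capturing group: takes the fast path
data = "chunk1\nchunk2\nchunk3"

# Rebuild the same list that re.split(r'(\n)', data) would return,
# using finditer() to get the delimiter spans instead of a capture group.
pieces = []
pos = 0
for m in pattern.finditer(data):
    pieces.append(data[pos:m.start()])  # text before the delimiter
    pieces.append(m.group())            # the delimiter text itself
    pos = m.end()
pieces.append(data[pos:])               # trailing text after the last match

assert pieces == re.split(r'(\n)', data)
```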

Thanks,
Pat
msg245184 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-11 20:22
Here is a patch that adds more optimizations for searching patterns that start with a literal string and groups. In particular, it covers the case where a pattern starts with a group containing a single character.

Examples:

$ ./python -m timeit -s "import re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
Unpatched: 100 loops, best of 3: 4.58 msec per loop
Patched  : 1000 loops, best of 3: 562 usec per loop

$ ./python -m timeit -s "import re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
Unpatched: 100 loops, best of 3: 3.1 msec per loop
Patched  : 1000 loops, best of 3: 663 usec per loop

For comparison:

$ ./python -m timeit -s "import re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 329 usec per loop
$ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 338 usec per loop

Also optimized are more complex but rarer cases, such as '\n()\r' or '((\n)(\r))'.

Fast searching can no longer be disabled.
msg245198 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-11 22:54
Thank you for the quick response, Serhiy.  I had started investigating and concluded that it was a problem with the compiler rather than the C engine.  Interestingly, my next step was going to be to use names for the compiler constants, and then I noticed you did exactly that in the 3.6 tree.

I will try this out on my use-case tomorrow to make sure it fixes my issue, but now I'm intrigued by the inner workings of this, so I have two questions:

1) Do you know if anybody maintains a patched version of the Python code anywhere?  I could put a package up on github/PyPI, if not.

2) Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine?  I could try to do some work on this and put it with the package, if not, or point to it if so.

Thanks and best regards,
Pat
msg245310 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-13 11:12
> 1) Do you know if anybody maintains a patched version of the Python code anywhere?  I could put a package up on github/PyPI, if not.

Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on Bitbucket [1] and GitHub [2]. They don't contain unofficial patches, but perhaps there are private clones with additional patches. Of course, different Linux distributions can provide Python with their own patches. And you can maintain a private fork of CPython with your patches for your own or your company's needs.

But if you need only optimized regular expressions, I suggest you look at the regex module [3]. It is more powerful and has better Unicode support.

Results of the same microbenchmarks for regex:

$ ./python -m timeit -s "import regex as re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 544 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 661 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 521 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 743 usec per loop

regex is slightly slower than the optimized re in these cases, but much faster than the non-optimized re when splitting with a capturing group.

> 2) Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine?  I could try to do some work on this and put it with the package, if not, or point to it if so.

Sorry, I don't understand you. Do you mean documenting the opcodes of a compiled re pattern? This is an implementation detail and will change in the future.

[1] https://bitbucket.org/mirror/cpython
[2] https://github.com/python/cpython
[3] https://pypi.python.org/pypi/regex
msg245316 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-13 14:04
> (stuff about CPython)

No, I was curious about whether somebody maintained pure-Python fixes (e.g. to the re parser and compiler).  Those could be in a regular package that fixed some corner cases such as the capture group you just applied a patch for.

> ... regex is more powerful and better supports Unicode.

Unfortunately, it is still not competitive.  For example, for one package I maintain (github.com/pmaupin/pdfrw), I have one unittest which reads in and parses several PDF files, and then outputs them to new PDF files:

Python 2.7 with re -- 5.9 s
Python 2.7 with regex -- 6.9 s
Python 3.4 with re -- 7.2 s
Python 3.4 with regex -- 8.2 s

A large reason for the difference between 2.7 and 3.4 is that I'm too lazy (or it seems too error-prone) to put the b'xxx' prefix in front of every string, so the package uses the same source code for 2.7 and 3.4, which means unicode strings on 3.4 and byte strings on 2.7.

Nonetheless, when you consider all the other work going on in the package, a 14% _overall_ slowdown to change to a "better" re package seems like going the wrong direction, especially when stacked on top of the 22% slowdown for switching to Python3.

> Do you mean documenting codes of compiled re pattern?

Yes.


> This is implementation detail and will be changed in future.

Understood, and that's fine.  If the documentation existed, it would have helped had I wanted to create a pure-Python package that simply performed optimizations (like the one in your patch) against existing Python implementations, for use until regex (which is a HUGE undertaking) is ready.

Thanks,
Pat
msg245319 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-13 15:51
This is a reason to file a feature request against regex. In 3.3, re was slower than regex in some cases:

$ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
Python 3.3 re   : 1000 loops, best of 3: 952 usec per loop
Python 3.4 regex: 1000 loops, best of 3: 757 usec per loop
Python 3.4 re   : 1000 loops, best of 3: 323 usec per loop

And this optimization (issue18685 or others) can be applied to regex.

As for this particular issue, the optimization of splitting with a 1-character capturing group needs changes to the C part of the re engine. The Python part of my patch is not needed for this; it is there only to generalize support for other corner cases. So this issue can't be fixed by patching only Python code.
msg245326 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-13 19:48
OK, thanks.
msg245589 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-06-21 11:07
New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default':
Issue #24426: Fast searching optimization in regular expressions now works
https://hg.python.org/cpython/rev/7e46a503dd16
History
Date User Action Args
2015-06-21 11:10:26  serhiy.storchaka  set     status: open -> closed
                                               resolution: fixed
                                               stage: patch review -> resolved
2015-06-21 11:07:22  python-dev        set     nosy: + python-dev
                                               messages: + msg245589
2015-06-13 19:48:40  Patrick Maupin    set     messages: + msg245326
2015-06-13 15:51:36  serhiy.storchaka  set     messages: + msg245319
2015-06-13 14:04:36  Patrick Maupin    set     messages: + msg245316
2015-06-13 11:12:38  serhiy.storchaka  set     messages: + msg245310
2015-06-11 22:54:09  Patrick Maupin    set     messages: + msg245198
2015-06-11 20:22:51  serhiy.storchaka  set     files: + re_literal_prefix_with_groups.patch
                                               versions: + Python 3.6, - Python 2.7, Python 3.4
                                               messages: + msg245184
                                               assignee: serhiy.storchaka
                                               keywords: + patch
                                               stage: patch review
2015-06-10 21:28:44  Patrick Maupin    set     messages: + msg245145
2015-06-10 20:16:59  Patrick Maupin    set     messages: + msg245144
2015-06-10 19:34:35  serhiy.storchaka  set     messages: + msg245141
2015-06-10 19:18:48  serhiy.storchaka  set     messages: + msg245139
2015-06-10 18:55:07  ezio.melotti      set     nosy: + serhiy.storchaka
2015-06-10 18:53:09  Patrick Maupin    create