classification
Title: re.split performance degraded significantly by capturing group
Type: performance
Stage: resolved
Components: Regular Expressions
Versions: Python 3.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: Patrick Maupin, ezio.melotti, mrabarnett, python-dev, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2015-06-10 18:53 by Patrick Maupin, last changed 2015-06-21 11:10 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description Edit
splitter2.py Patrick Maupin, 2015-06-10 18:53 Script showing re.split capture performance problem
re_literal_prefix_with_groups.patch serhiy.storchaka, 2015-06-11 20:22 review
Messages (12)
msg245137 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-10 18:53
The addition of a capturing group in a re.split() pattern, e.g. using '(\n)' instead of '\n', causes a factor of 10 performance degradation.

I use re.split() a lot, but never noticed the issue before.  It was extremely noticeable on 1000 patterns in a 5GB file, though, requiring 40 seconds instead of 4.

I have attached a script demonstrating the issue.  I have tested on 2.7 and 3.4, but have no reason to believe it doesn't exist on other versions as well.

Thanks,
Pat
msg245139 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-10 19:18
A regular expression is optimized for the case where it starts with a constant string or character set. There is no degradation when using '(\n)'; rather, '\n' benefits from an optimization.
msg245141 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-10 19:34
Splitting with pattern '\n(?<=(\n))' produces the same result as with pattern '(\n)' and is as fast as with pattern '\n'.
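As a sanity check (this example is not from the original message), a minimal sketch verifying that the lookbehind rewrite produces the same split result as the capturing-group form:

```python
import re

s = "one\ntwo\nthree"

# The lookbehind captures the '\n' that was just consumed, so group 1
# still appears in the split output, matching the plain '(\n)' form.
assert re.split(r'\n(?<=(\n))', s) == re.split(r'(\n)', s)
print(re.split(r'\n(?<=(\n))', s))
```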
msg245144 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-10 20:16
1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed.

2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do.

3) My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)' because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file.  Yes, I'm sure this could be optimized as well.  But when I can just remove the parentheses, get a 10X speedup, and then figure out the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that the re module has performance issues with capturing groups in split(), so that casual users like me know to figure out what the matching strings are some other way?


I assumed that, since I saw almost exactly the same performance degradation with \n as I did with this, it was a valid testcase.  If that was a bad assumption and this is insufficient to debug the issue, I can submit a bigger testcase.


But if this issue is going to be wontfixed for some reason, there should certainly be a documentation note added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing while also capturing the 1K ten-byte strings in between the big ones.


Thanks,
Pat
msg245145 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-10 21:28
Just to be perfectly clear, this is no exaggeration:

My original file was slightly over 5GB.

I have approximately 1050 bad strings in it, averaging around 11 characters per string.

If I split it without capturing those 1050 strings, it takes 3.7 seconds.

If I split it and capture those 1050 strings, it takes 39 seconds.

ISTM that 33 ms to create a capture group containing a single 11-character string is excessive, so there is probably something else going on, like excessive object copying, that just isn't noticeable on a smaller source string.

In the small example I posted, if I replace the line:

data = 100 * (200000 * ' ' + '\n')

with 

data = 1000 * (500000 * ' ' + '\n')

then I get approximately the same 3.7 second vs 39 second results on that (somewhat older) machine.  I didn't start out with that in the example, because I thought the problem should still be obvious from the scaled down example.

Obviously, your CPU numbers will be somewhat different.  The question remains, though, why it takes around 60 million CPU cycles for each returned capture group.  Or, to put it another way: why can I skip the capture group and grab the same string in pure Python, by looking at the lengths of the intervening strings, well over 100 times faster than the re module can give me that group?
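The workaround described here — splitting without the capturing group (the fast path) and then recovering the delimiter text from match offsets — could be sketched roughly as follows. This is a hypothetical reconstruction for illustration, not Patrick's actual code:

```python
import re

pattern = re.compile(r'\n')   # no capturing group: takes the fast path
data = "chunk1\nchunk2\nchunk3"

# Rebuild the same list that re.split(r'(\n)', data) would return,
# using finditer() to get the delimiter spans instead of a capture group.
pieces = []
pos = 0
for m in pattern.finditer(data):
    pieces.append(data[pos:m.start()])  # text before the delimiter
    pieces.append(m.group())            # the delimiter text itself
    pos = m.end()
pieces.append(data[pos:])               # trailing text after the last match

assert pieces == re.split(r'(\n)', data)
```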

Thanks,
Pat
msg245184 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-11 20:22
Here is a patch that adds more optimizations for searching patterns that start with a literal string and groups. In particular, it covers the case where a pattern starts with a group containing a single character.

Examples:

$ ./python -m timeit -s "import re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
Unpatched: 100 loops, best of 3: 4.58 msec per loop
Patched  : 1000 loops, best of 3: 562 usec per loop

$ ./python -m timeit -s "import re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
Unpatched: 100 loops, best of 3: 3.1 msec per loop
Patched  : 1000 loops, best of 3: 663 usec per loop

For comparison:

$ ./python -m timeit -s "import re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 329 usec per loop
$ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 338 usec per loop

Also optimized are more complex but rarer cases, such as '\n()\r' or '((\n)(\r))'.

Fast searching can no longer be disabled.
msg245198 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-11 22:54
Thank you for the quick response, Serhiy.  I had started investigating and concluded that it was a problem with the compiler rather than the C engine.  Interestingly, my next step was going to be to use names for the compiler constants, and then I noticed you did exactly that in the 3.6 tree.

I will try this out on my use-case tomorrow to make sure it fixes my issue, but now I'm intrigued by the inner workings of this, so I have two questions:

1) Do you know if anybody maintains a patched version of the Python code anywhere?  I could put a package up on github/PyPI, if not.

2) Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine?  I could try to do some work on this and put it with the package, if not, or point to it if so.

Thanks and best regards,
Pat
msg245310 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-13 11:12
> 1) Do you know if anybody maintains a patched version of the Python code anywhere?  I could put a package up on github/PyPI, if not.

Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on Bitbucket [1] and GitHub [2]. They don't contain unofficial patches, but perhaps there are private clones with additional patches. Of course, different Linux distributions can provide Python with their own patches. And you can maintain a private fork of CPython with your patches for your own or your company's needs.

But if you need only optimized regular expressions, I suggest you look at the regex module [3]. It is more powerful and has better Unicode support.

Results of the same microbenchmarks for regex:

$ ./python -m timeit -s "import regex as re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 544 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 661 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 521 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 743 usec per loop

regex is slightly slower than the optimized re in these cases, but much faster than the non-optimized re when splitting with a capturing group.

> 2) Do you know if anybody has done a good writeup on the behavior of the instruction stream to the C engine?  I could try to do some work on this and put it with the package, if not, or point to it if so.

Sorry, I don't understand you. Do you mean documenting the opcodes of a compiled re pattern? This is an implementation detail and will change in the future.

[1] https://bitbucket.org/mirror/cpython
[2] https://github.com/python/cpython
[3] https://pypi.python.org/pypi/regex
msg245316 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-13 14:04
> (stuff about CPython)

No, I was curious about whether somebody maintained pure-Python fixes (e.g. to the re parser and compiler).  Those could be in a regular package that fixed some corner cases such as the capture group you just applied a patch for.

> ... regex is more powerful and better supports Unicode.

Unfortunately, it is still not competitive.  For example, for one package I maintain (github.com/pmaupin/pdfrw), I have one unittest which reads in and parses several PDF files, and then outputs them to new PDF files:

Python 2.7 with re -- 5.9 s
Python 2.7 with regex -- 6.9 s
Python 3.4 with re -- 7.2 s
Python 3.4 with regex -- 8.2 s

A large reason for the difference between 2.7 and 3.4 is that I'm too lazy (or it seems too error-prone) to put the b'xxx' prefix in front of every string, so the package uses the same source code for 2.7 and 3.4, which means unicode strings on 3.4 and byte strings on 2.7.

Nonetheless, when you consider all the other work going on in the package, a 14% _overall_ slowdown to change to a "better" re package seems like going the wrong direction, especially when stacked on top of the 22% slowdown for switching to Python3.

> Do you mean documenting codes of compiled re pattern?

Yes.


> This is implementation detail and will be changed in future.

Understood, and that's fine.  If the documentation existed, it would have helped had I wanted to create a pure-Python package that simply performed optimizations (like the one in your patch) against existing Python implementations, for use until regex (which is a HUGE undertaking) is ready.

Thanks,
Pat
msg245319 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-06-13 15:51
This is a reason to file a feature request against regex. In 3.3, re was slower than regex in some cases:

$ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
Python 3.3 re   : 1000 loops, best of 3: 952 usec per loop
Python 3.4 regex: 1000 loops, best of 3: 757 usec per loop
Python 3.4 re   : 1000 loops, best of 3: 323 usec per loop

And this optimization (issue18685 or others) can be applied to regex.

As for this particular issue, the optimization of splitting with a 1-character capturing group needs changes to the C part of the re engine. The Python part of my patch is not needed for this; it is there only to generalize support for other corner cases. So this issue can't be fixed by patching only Python code.
msg245326 - (view) Author: Patrick Maupin (Patrick Maupin) * Date: 2015-06-13 19:48
OK, thanks.
msg245589 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-06-21 11:07
New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default':
Issue #24426: Fast searching optimization in regular expressions now works
https://hg.python.org/cpython/rev/7e46a503dd16
History
Date User Action Args
2015-06-21 11:10:26  serhiy.storchaka  set     status: open -> closed
                                               resolution: fixed
                                               stage: patch review -> resolved
2015-06-21 11:07:22  python-dev        set     nosy: + python-dev
                                               messages: + msg245589
2015-06-13 19:48:40  Patrick Maupin    set     messages: + msg245326
2015-06-13 15:51:36  serhiy.storchaka  set     messages: + msg245319
2015-06-13 14:04:36  Patrick Maupin    set     messages: + msg245316
2015-06-13 11:12:38  serhiy.storchaka  set     messages: + msg245310
2015-06-11 22:54:09  Patrick Maupin    set     messages: + msg245198
2015-06-11 20:22:51  serhiy.storchaka  set     files: + re_literal_prefix_with_groups.patch
                                               versions: + Python 3.6, - Python 2.7, Python 3.4
                                               messages: + msg245184
                                               assignee: serhiy.storchaka
                                               keywords: + patch
                                               stage: patch review
2015-06-10 21:28:44  Patrick Maupin    set     messages: + msg245145
2015-06-10 20:16:59  Patrick Maupin    set     messages: + msg245144
2015-06-10 19:34:35  serhiy.storchaka  set     messages: + msg245141
2015-06-10 19:18:48  serhiy.storchaka  set     messages: + msg245139
2015-06-10 18:55:07  ezio.melotti      set     nosy: + serhiy.storchaka
2015-06-10 18:53:09  Patrick Maupin    create