re.split performance degraded significantly by capturing group #68614
The addition of a capturing group in a re.split() pattern, e.g. using '(\n)' instead of '\n', causes a factor-of-10 performance degradation. I use re.split() a lot, but had never noticed the issue before. It was extremely noticeable on 1000 patterns in a 5GB file, though, requiring 40 seconds instead of 4. I have attached a script demonstrating the issue. I have tested on 2.7 and 3.4, but have no reason to believe the problem doesn't exist on other versions as well. Thanks,
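The attached script is not reproduced here; a minimal sketch in the same spirit (the data shape is an assumption taken from a follow-up comment below, not the original attachment) might look like:

import re
import time

# Assumed data shape: long runs of spaces separated by newlines.
data = 100 * (200000 * ' ' + '\n')

for pattern in ('\n', '(\n)'):
    start = time.time()
    parts = re.split(pattern, data)
    print('%r -> %d parts in %.3f s' % (pattern, len(parts), time.time() - start))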
The regular expression engine is optimized for the case when the pattern starts with a constant string or a character set. There is no degradation when using '(\n)'; rather, '\n' benefits from an optimization.
Splitting with the pattern '\n(?<=(\n))' produces the same result as with the pattern '(\n)' and is as fast as with the pattern '\n'.
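A quick check of that equivalence (a small illustration; the sample string is made up):

import re

s = 'aaa\nbbb\nccc'
# The 1-character match is done by the plain '\n', so the fast-path search
# still applies; the lookbehind '(?<=(\n))' then re-captures the separator.
assert re.split('(\n)', s) == ['aaa', '\n', 'bbb', '\n', 'ccc']
assert re.split('\n(?<=(\n))', s) == re.split('(\n)', s)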
I assumed that, since I saw almost exactly the same performance degradation with '\n' as I did with my real patterns, this was a valid test case. If that was a bad assumption and this is insufficient to debug the issue, I can submit a bigger test case. But if this issue is going to be closed as wontfix for some reason, a documentation note should certainly be added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10x faster than doing the same thing while also capturing the 1K ten-byte strings in between the big ones. Thanks,
Just to be perfectly clear, this is no exaggeration: my original file was slightly over 5GB, containing approximately 1050 bad strings averaging around 11 characters each. If I split it without capturing those 1050 strings, it takes 3.7 seconds. If I split it and capture those 1050 strings, it takes 39 seconds. ISTM that 33 ms to create a capture group holding a single 11-character string is excessive, so there is probably something else going on, like excessive object copying, that just isn't noticeable on a smaller source string.

In the small example I posted, if I replace the line:

data = 100 * (200000 * ' ' + '\n')

with

data = 1000 * (500000 * ' ' + '\n')

then I get approximately the same 3.7-second vs. 39-second results on that (somewhat older) machine. I didn't start out with that in the example, because I thought the problem should still be obvious from the scaled-down example. Obviously, your CPU numbers will be somewhat different.

The question remains, though: why does it take around 60 million CPU cycles for each and every returned capture group? Or, to put it another way, why can I skip the capture group and grab the same string in pure Python, by looking at the lengths of the intervening strings, well over 100 times faster than the re module can give me that group? Thanks,
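The length-based pure-Python approach mentioned above could look something like this (a sketch under the assumption that the pattern never matches the empty string; the function name is invented):

import re

def split_keeping_separators(pattern, s):
    rx = re.compile(pattern)
    pieces = rx.split(s)   # fast path: no capturing group in the pattern
    out = []
    pos = 0
    for piece in pieces[:-1]:
        out.append(piece)
        pos += len(piece)     # the separator starts right after this piece
        m = rx.match(s, pos)  # re-match just the short separator here
        out.append(m.group(0))
        pos = m.end()
    out.append(pieces[-1])
    return out

# Same result as re.split('(\n)', s), without the slow capturing-group split.
assert split_keeping_separators('\n', 'aaa\nbbb') == ['aaa', '\n', 'bbb']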
Here is a patch that adds more optimizations for searching with patterns that start with a literal string and groups. In particular, it covers the case when a pattern starts with a group containing a single character. Examples:

$ ./python -m timeit -s "import re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
Unpatched: 100 loops, best of 3: 4.58 msec per loop
Patched : 1000 loops, best of 3: 562 usec per loop
$ ./python -m timeit -s "import re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
Unpatched: 100 loops, best of 3: 3.1 msec per loop
Patched : 1000 loops, best of 3: 663 usec per loop

For comparison:

$ ./python -m timeit -s "import re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 329 usec per loop
$ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 338 usec per loop

More complex but rarer cases, such as '\n()\r' or '((\n)(\r))', are also optimized. Fast searching can no longer be disabled.
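For reference, this is what split returns for those pattern shapes (a small illustration, not part of the patch):

import re

s = 'aaa\n\rbbb'
print(re.split('(\n\r)', s))      # ['aaa', '\n\r', 'bbb']
print(re.split('\n()\r', s))      # ['aaa', '', 'bbb'] -- the empty group is kept
print(re.split('((\n)(\r))', s))  # ['aaa', '\n\r', '\n', '\r', 'bbb']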
Thank you for the quick response, Serhiy. I had started investigating and had come to the conclusion that it was a problem with the compiler rather than the C engine. Interestingly, my next step was going to be to use names for the compiler constants, and then I noticed you did that exact same thing in the 3.6 tree. I will try this out on my use case tomorrow to make sure it fixes my issue, but now I'm intrigued by the inner workings of this, so I have two questions:
Thanks and best regards,
Sorry, perhaps I misunderstood you. There are unofficial mirrors of CPython on Bitbucket [1] and GitHub [2]. They don't contain unofficial patches, but perhaps there are private clones with additional patches. Of course, different Linux distributions can provide Python with their own patches, and you can maintain a private fork of CPython with your patches for your own or your company's needs. But if you need only optimized regular expressions, I suggest you look at the regex module [3]. It is more powerful and supports Unicode better. Results of the same microbenchmarks for regex:

$ ./python -m timeit -s "import regex as re; p = re.compile('\n'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 544 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('(\n)'); s = ('a'*100 + '\n')*1000" -- "p.split(s)"
1000 loops, best of 3: 661 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 521 usec per loop
$ ./python -m timeit -s "import regex as re; p = re.compile('(\n\r)'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
1000 loops, best of 3: 743 usec per loop

regex is slightly slower than the optimized re in these cases, but it is much faster than the non-optimized re in the case of splitting with a capturing group.
Sorry, I didn't understand you. Do you mean documenting the codes of a compiled re pattern? That is an implementation detail and will change in the future.

[1] https://bitbucket.org/mirror/cpython
No, I was curious about whether somebody maintains pure-Python fixes (e.g. to the re parser and compiler). Those could live in a regular package that fixed some corner cases, such as the capture-group case you just applied a patch for.
Unfortunately, it is still not competitive. For example, for one package I maintain (github.com/pmaupin/pdfrw), I have one unittest which reads in and parses several PDF files, and then outputs them to new PDF files:

Python 2.7 with re -- 5.9 s

A large reason for the difference between 2.7 and 3.4 is the fact that I'm too lazy, or it seems too error-prone, to put the b'xxx' prefix in front of every string, so the package uses the same source code for 2.7 and 3.4, which means unicode strings on 3.4 and byte strings on 2.7. Nonetheless, when you consider all the other work going on in the package, a 14% _overall_ slowdown to change to a "better" re package seems like going in the wrong direction, especially when stacked on top of the 22% slowdown for switching to Python 3.
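The usual workaround for sharing byte-string source between 2.7 and 3.x is a tiny helper (a sketch of the common idiom, not code from pdfrw):

import sys

# On 3.x, turn a text literal into bytes; on 2.7, str already is bytes.
if sys.version_info[0] >= 3:
    def b(s):
        return s.encode('latin-1')
else:
    def b(s):
        return s

NEWLINE = b('\n')  # a bytes object on both 2.7 and 3.x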
Yes.
Understood, and that's fine. If the documentation existed, it would help if I wanted to create a pure-Python package that simply performed optimizations (like the one in your patch) against existing Python implementations, for use until regex (which is a HUGE undertaking) is ready. Thanks,
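Such a package could, for example, apply the lookbehind rewrite suggested earlier in this thread automatically (a sketch; fast_split and the rewrite rule are invented for illustration, and only the single-character capturing group is handled):

import re

# Match a pattern of the form '(c)' where c is one (possibly escaped) character.
_SINGLE_CHAR_GROUP = re.compile(r'\A\((\\?.)\)\Z', re.DOTALL)

def fast_split(pattern, s):
    m = _SINGLE_CHAR_GROUP.match(pattern)
    if m:
        c = m.group(1)
        # Rewrite the slow '(c)' into the fast 'c(?<=(c))' form.
        pattern = '%s(?<=(%s))' % (c, c)
    return re.split(pattern, s)

print(fast_split('(\n)', 'aaa\nbbb'))  # ['aaa', '\n', 'bbb']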
This is a reason to file a feature request against regex. In 3.3, re was slower than regex in some cases:

$ ./python -m timeit -s "import re; p = re.compile('\n\r'); s = ('a'*100 + '\n\r')*1000" -- "p.split(s)"
Python 3.3 re : 1000 loops, best of 3: 952 usec per loop
Python 3.4 regex: 1000 loops, best of 3: 757 usec per loop
Python 3.4 re : 1000 loops, best of 3: 323 usec per loop

And this optimization (bpo-18685 or others) can be applied to regex. As for this particular issue, the optimization of splitting with a 1-character capturing group needs changes to the C part of the re engine. The Python part of my patch is not needed for this; it is there only to generalize support of other corner cases. So this issue can't be fixed by patching only Python code.
OK, thanks.
New changeset 7e46a503dd16 by Serhiy Storchaka in branch 'default':