This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Restore re performance to pre-PEP393 level
Type: performance Stage: resolved
Components: Regular Expressions, Unicode Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: 18672 Superseder:
Assigned To: serhiy.storchaka Nosy List: Tal.Weiss, ezio.melotti, mrabarnett, pitrou, python-dev, serhiy.storchaka, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2013-08-08 13:10 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
sre_optimize.patch serhiy.storchaka, 2013-08-08 13:10 review
sre_optimize_2.patch serhiy.storchaka, 2013-10-21 17:03 review
sre_optimize_3.patch serhiy.storchaka, 2013-10-25 20:28 review
Messages (23)
msg194669 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-08 13:10
Before PEP 393 the regex functions scanned an array of char or Py_UNICODE and character testing was cheap. After PEP 393 they checks a kind of an unicode string for every tested character and processing of unicode strings becomes slower. _sre.c already generates two sets of functions from one source -- for byte and unicode strings. The proposed patch uses same technique to generate three sets of functions -- for byte/UCS1, UCS2 and UCS4 strings. This simplifies the code (now it more similar to pre-PEP393 version) and makes characters testing faster.

Benchmark example:

Python 3.2:
$ python3.2 -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000"  "f(x)"
1000 loops, best of 3: 613 usec per loop
$ python3.2 -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000"  "f(x)"
1000 loops, best of 3: 232 usec per loop
$ python3.2 -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000"  "f(x)"
1000 loops, best of 3: 217 usec per loop

Python 3.4.0a1+ unpatched:
$ ./python -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000"  "f(x)"
1000 loops, best of 3: 485 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000"  "f(x)"
1000 loops, best of 3: 790 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000"  "f(x)"
1000 loops, best of 3: 1.09 msec per loop

Python 3.4.0a1+ patched:
$ ./python -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000"  "f(x)"
1000 loops, best of 3: 250 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000"  "f(x)"
1000 loops, best of 3: 250 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000"  "f(x)"
1000 loops, best of 3: 256 usec per loop

I also propose for simplicity extract a template part of _sre.c to separated file (i.e. srelib.h) and get rid of recursion.
msg194683 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-08-08 15:05
It appears that in your tests Python 3.2 is faster with Unicode than bytestrings and that unpatched Python 3.4 is a lot slower.

I get somewhat different results (Windows XP Pro, 32-bit):

C:\Python32\python.exe -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000" "f(x)"
1000 loops, best of 3: 449 usec per loop

C:\Python32\python.exe -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000" "f(x)"
1000 loops, best of 3: 506 usec per loop

C:\Python32\python.exe -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000" "f(x)"
1000 loops, best of 3: 506 usec per loop


C:\Python34\python.exe -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000" "f(x)"
1000 loops, best of 3: 227 usec per loop

C:\Python34\python.exe -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000" "f(x)"
1000 loops, best of 3: 339 usec per loop

C:\Python34\python.exe -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000" "f(x)"
1000 loops, best of 3: 504 usec per loop

For comparison, in the regex module I don't duplicate whole sections of code, but instead have a pointer to one of 3 functions (for UCS1, UCS2 and UCS4) that gets the codepoint, except for some tight loops. Doing that might be too much of a change for re.

However, the speed appears to be a lot more consistent:

C:\Python32\python.exe -m timeit -s "import regex; f = regex.compile(b'abc').search; x = b'x'*100000" "f(x)"
10000 loops, best of 3: 113 usec per loop

C:\Python32\python.exe -m timeit -s "import regex; f = regex.compile('abc').search; x = 'x'*100000" "f(x)"
10000 loops, best of 3: 113 usec per loop

C:\Python32\python.exe -m timeit -s "import regex; f = regex.compile('abc').search; x = '\u20ac'*100000" "f(x)"
10000 loops, best of 3: 113 usec per loop


C:\Python34\python.exe -m timeit -s "import regex; f = regex.compile(b'abc').search; x = b'x'*100000" "f(x)"
10000 loops, best of 3: 113 usec per loop

C:\Python34\python.exe -m timeit -s "import regex; f = regex.compile('abc').search; x = 'x'*100000" "f(x)"
10000 loops, best of 3: 113 usec per loop

C:\Python34\python.exe -m timeit -s "import regex; f = regex.compile('abc').search; x = '\u20ac'*100000" "f(x)"
10000 loops, best of 3: 113 usec per loop
msg194727 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-09 13:29
I get the same kind of results as Serhiy:

$ python3.2 -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000"  "f(x)"
10000 loops, best of 3: 81.7 usec per loop
$ python3.2 -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000"  "f(x)"
10000 loops, best of 3: 31.1 usec per loop
$ python3.2 -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000"  "f(x)"
10000 loops, best of 3: 31.1 usec per loop

Unpatched 3.4:

$ ./python -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000"  "f(x)"
10000 loops, best of 3: 81.6 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000"  "f(x)"
10000 loops, best of 3: 163 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000"  "f(x)"
10000 loops, best of 3: 190 usec per loop

Patched 3.4:

$ ./python -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000"  "f(x)"
10000 loops, best of 3: 54.4 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000"  "f(x)"
10000 loops, best of 3: 54.2 usec per loop
$ ./python -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000"  "f(x)"
10000 loops, best of 3: 54.5 usec per loop
msg194749 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-08-09 16:04
@Antoine: Are you on the same OS as Serhiy?

IIRC, wasn't the performance regression that wxjmfauth complained about in Python 3.3 apparent on Windows, but not on Linux?
msg194757 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-09 18:57
> @Antoine: Are you on the same OS as Serhiy?

I don't know, I'm under Linux with gcc (on a x86-64 machine) :-)

> IIRC, wasn't the performance regression that wxjmfauth complained
> about in Python 3.3 apparent on Windows, but not on Linux?

I don't know, but I'm not willing to give any attention to something
reported by jmfauth. He's very far in the trollzone, as far as I'm
concerned.

However, if you are under Windows and can give it a try, it would be
nice to have performance numbers for Serhiy's patch.
msg194760 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-08-09 20:03
With the patch the results are:

C:\Python34\python.exe -m timeit -s "import re; f = re.compile(b'abc').search; x = b'x'*100000" "f(x)" 
10000 loops, best of 3: 113 usec per loop

C:\Python34\python.exe -m timeit -s "import re; f = re.compile('abc').search; x = 'x'*100000" "f(x)" 
10000 loops, best of 3: 113 usec per loop

C:\Python34\python.exe -m timeit -s "import re; f = re.compile('abc').search; x = '\u20ac'*100000" "f(x)" 
10000 loops, best of 3: 113 usec per loop

I'm most impressed! :-)
msg194762 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-09 20:11
I'm under 32-bit Linux with gcc 4.6.3.

The above test is only one example for which I expect largest difference. I suppose other tests will show a gain too.
msg194765 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-09 20:18
Ok, here are some results from the benchmarks suite:

Report on Linux fsol 3.8.0-27-generic #40-Ubuntu SMP Tue Jul 9 00:17:05 UTC 2013 x86_64 x86_64
Total CPU cores: 4

### regex_effbot ###
Min: 0.058952 -> 0.054367: 1.08x faster
Avg: 0.059060 -> 0.054378: 1.09x faster
Significant (t=132.69)
Stddev: 0.00008 -> 0.00001: 5.9597x smaller

### regex_v8 ###
Min: 0.063401 -> 0.050701: 1.25x faster
Avg: 0.066147 -> 0.053530: 1.24x faster
Significant (t=3.22)
Stddev: 0.00608 -> 0.00630: 1.0363x larger

The following not significant results are hidden, use -v to show them:
regex_compile.
msg195101 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-08-13 22:57
Using #include "_sre.c" in _sre.c looks weird. Instead of huge sections delimited by "#ifdef SRE_RECURSIVE", I would prefer something similar to the stringlib. ".h" template files included more than once. I also expect shorter files: _sre.c is close to 4000 lines of C code :-(

If you move code from _sre.c to a new file, you should use "hg cp" to keep the history. For the review, it's maybe better to continue with your SRE_RECURSIVE hack :)

--

#define SRE_CHAR Py_UCS1
#define SIZEOF_SRE_CHAR 1
..
#define SRE_CHAR Py_UCS2
#define SIZEOF_SRE_CHAR 1
...
#define SRE_CHAR Py_UCS4
#define SIZEOF_SRE_CHAR 1

The value of SIZEOF_SRE_CHAR looks suspicious.

Does test_re have some non-ASCII tests? If not, we should probably start by adding such tests!
msg195128 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-14 09:44
> Using #include "_sre.c" in _sre.c looks weird. Instead of huge sections delimited by "#ifdef SRE_RECURSIVE", I would prefer something similar to the stringlib. ".h" template files included more than once. I also expect shorter files: _sre.c is close to 4000 lines of C code :-(

Agree, but a patch will be larger and harder for the synchronization and for the review in Rietveld. I'm going first solve other issues (issue18647, issue18672) before creating a large final patch.

> The value of SIZEOF_SRE_CHAR looks suspicious.

Good catch. Actually this macro is used only in order to skip some checks for UCS4. It should not affects the correctness, only possible the performance.

> Does test_re have some non-ASCII tests? If not, we should probably start by adding such tests!

There is a small number (about 10) of tests for non-ASCII data.
msg200813 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-21 17:03
Rebased patch to tip and added non-ASCII tests for main re functions.
msg201189 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-24 21:15
Please review this patch. I will extract template part into separated file in separated commit.
msg201233 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-25 12:43
Posted review on Rietveld. See also issue #19387.
msg201289 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-25 20:28
Updated patch addresses Antoine's comments.
msg201302 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-25 23:12
Looks good to me.
msg201324 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-26 07:46
New changeset 66e2dfbb1d70 by Serhiy Storchaka in branch 'default':
Issue #18685: Restore re performance to pre-PEP 393 levels.
http://hg.python.org/cpython/rev/66e2dfbb1d70
msg201327 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2013-10-26 08:15
Sorry, I was busy with my tracemalloc PEP, I didn't havee time to review your patch. I'm happy that you restored Python 3.2 performances! Thanks Serhiy.
msg201328 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-10-26 08:19
New changeset 00e61cb3b11c by Serhiy Storchaka in branch 'default':
Issue #18685: Extract template part of _sre.c into separated sre_lib.h file.
http://hg.python.org/cpython/rev/00e61cb3b11c
msg201334 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-26 10:28
Didn't you forget to add sre_lib.h?
msg201335 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-26 10:32
Ah, sorry, no. I was fooled by the commit e-mail.
msg201336 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-26 11:00
Yes, the commit e-mail looks queer.
msg201372 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-10-26 16:50
I suppose this issue can be fixed then. Thanks for doing this!
msg201375 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-10-26 17:08
Thank you for your review Antoine and Victor.
History
Date User Action Args
2022-04-11 14:57:49adminsetgithub: 62885
2013-10-26 17:08:54serhiy.storchakasetstatus: open -> closed
resolution: fixed
messages: + msg201375

stage: commit review -> resolved
2013-10-26 16:50:24pitrousetmessages: + msg201372
2013-10-26 11:00:57serhiy.storchakasetmessages: + msg201336
2013-10-26 10:32:17pitrousetmessages: + msg201335
2013-10-26 10:28:42pitrousetmessages: + msg201334
2013-10-26 08:19:30python-devsetmessages: + msg201328
2013-10-26 08:15:17vstinnersetmessages: + msg201327
2013-10-26 07:46:23python-devsetnosy: + python-dev
messages: + msg201324
2013-10-25 23:12:20pitrousetmessages: + msg201302
stage: patch review -> commit review
2013-10-25 20:28:26serhiy.storchakasetfiles: + sre_optimize_3.patch

messages: + msg201289
2013-10-25 12:43:40pitrousetmessages: + msg201233
2013-10-24 21:15:31serhiy.storchakasetmessages: + msg201189
2013-10-23 20:32:00serhiy.storchakasetdependencies: - Pointers point out of array bound in _sre.c
2013-10-21 17:03:30serhiy.storchakasetfiles: + sre_optimize_2.patch

messages: + msg200813
2013-08-29 10:13:50serhiy.storchakasetdependencies: + Fix format specifiers for debug output in _sre.c, Pointers point out of array bound in _sre.c
2013-08-25 17:14:49Tal.Weisssetnosy: + Tal.Weiss
2013-08-14 09:44:08serhiy.storchakasetmessages: + msg195128
2013-08-13 22:57:45vstinnersetmessages: + msg195101
2013-08-09 20:18:26pitrousetmessages: + msg194765
2013-08-09 20:11:54serhiy.storchakasetmessages: + msg194762
2013-08-09 20:03:01mrabarnettsetmessages: + msg194760
2013-08-09 20:00:01vstinnersetnosy: + vstinner
2013-08-09 18:57:15pitrousetmessages: + msg194757
2013-08-09 16:04:49mrabarnettsetmessages: + msg194749
2013-08-09 13:29:51pitrousetnosy: + pitrou
messages: + msg194727
2013-08-09 08:08:50pitrousetnosy: + tim.peters
2013-08-08 15:05:54mrabarnettsetmessages: + msg194683
2013-08-08 13:10:17serhiy.storchakacreate