msg273932 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-30 16:27 |
Our most regular-expression-processing-intensive Python 2.7 code takes 2.5x more execution time in 2.7.12 than it did in 2.7.6. I discovered this after upgrading from Ubuntu 14.04 to Ubuntu 16.04. Basically this code runs thousands of compiled regular expressions on thousands of texts. Both the multiprocessing module and the re module are heavily used.
See attached profiler outputs, which look quite different in several respects. I used the profiling module to profile the same Python code, processing the same data, using the same hardware, under both Ubuntu 14.04 (Python 2.7.6) and Ubuntu 16.04 (Python 2.7.12).
It is striking, for example, that cPickle.load appears so prominently in the 2.7.12 profile -- a fact which appears to implicate the multiprocessing module somehow. But I suspect that the re module is more likely the main source of the problem, because the execution times of other production steps -- steps that do not call the multiprocessing module -- also appear to be extended to a degree that is roughly proportional to the amount of regular expression processing done in those other steps.
I will happily provide any further information I can. Any insights about this surprisingly severe performance degradation would be welcome.
|
msg273935 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2016-08-30 16:46 |
It would be helpful if you could run "hg bisect" with your set-up to isolate the change that causes the problem. Alternatively, make a small set of regular expressions that demonstrate the performance regression.
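A small, self-contained benchmark of the kind suggested here could be sketched as follows. The patterns and text below are placeholders, not the reporter's actual workload; the point is only that the same script can be run under both interpreters and the printed totals compared:

```python
# Hypothetical micro-benchmark sketch: time re.sub() over a batch of
# compiled patterns.  Run the identical script under 2.7.6 and 2.7.12
# and compare the totals.  Patterns and text are placeholders.
import re
import timeit

PATTERNS = [re.compile(p) for p in (r"\b\w+\b", r"\d+", r"<[^>]+>")]
TEXT = "<p>Sample text 123 with <b>markup</b> and words.</p>" * 100

def run_subs():
    out = TEXT
    for pat in PATTERNS:
        # module-level call style, matching the convention described
        # later in this thread
        out = re.sub(pat, "", out)
    return out

elapsed = timeit.timeit(run_subs, number=200)
print("total seconds for 200 iterations: %.3f" % elapsed)
```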
|
msg273940 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-08-30 17:24 |
According to your profile results, all re functions are 2.5-4 times faster under 2.7.12 than under 2.7.6. Maybe I misinterpret it?
Note that 96-99% of the time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of the time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 (4519.872 - 4474.890 = 44.982).
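The arithmetic above can be reproduced directly from the totals in the two attached profile reports:

```python
# Residual-time arithmetic from the two profile reports: subtract the
# time spent blocked in posix.waitpid from the grand total, leaving the
# time actually spent running Python code in the profiled process.
totals  = {"2.7.6": 2980.718, "2.7.12": 4519.872}
waitpid = {"2.7.6": 2847.099, "2.7.12": 4474.890}

for version in ("2.7.6", "2.7.12"):
    residual = totals[version] - waitpid[version]
    share = waitpid[version] / totals[version]
    print("%s: %.3f s outside waitpid (%.1f%% of total in waitpid)"
          % (version, residual, share * 100))
```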
|
msg273946 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-30 18:14 |
On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you could run "hg bisect" with your set-up to isolate the change that causes the problem.
I don't think I understand you. There's no difference in the Python
code we're using in both cases. The only differences, AFAIK, are in the
Python interpreter and in the Linux distribution. I'm not qualified to
analyze the differences in the latter items.
> Alternatively, make a small set of regular expressions that demonstrate the performance regression.
It will be hard to do that, because the code is so complex, and because
debugging in the multiprocessing context is so hairy. Still, it's the
only approach I can think of, too. Sigh. I'm thinking about it.
|
msg273951 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-30 18:33 |
On 08/30/2016 01:24 PM, Serhiy Storchaka wrote:
> Serhiy Storchaka added the comment:
>
> According to your profile results, all re functions are 2.5-4 times faster under 2.7.12 than under 2.7.6. Maybe I misinterpret it?
I can't explain the profiler's report. I'm kind of glad that you, too,
find it baffling. Is it possible that the profiler doesn't actually
work predictably in the multiprocessing context? If so, one thing I can
*easily* do is to disable multiprocessing in that code and see what the
profiler reports are then. It will take all night, but I'm beginning to
think it would be worthwhile, because it might point the finger of blame
at either the multiprocessing module or the re module, but not both at once.
(I originally provided a "disable multiprocessing" capability in that
code in order to use the Python debugger with it. It would kind of make
sense if the profiler had limitations similar to those of the debugger.)
>
> Note that 96-99% of the time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of the time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 (4519.872 - 4474.890 = 44.982).
Yeah, I'm beginning to wonder if those strange statistics, too, are
artifacts of using a single-process profiler in a multiprocessing context.
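The single-process experiment described here, profiling the regex work directly with cProfile so that no child processes confuse the report, could look roughly like this. The pattern and text are stand-ins for the real workload:

```python
# Sketch of the "disable multiprocessing" experiment: profile the regex
# work in a single process with cProfile, so the report is not skewed by
# child processes the profiler cannot see.  Pattern and text are
# placeholders for the real production workload.
import cProfile
import pstats
import re
import io  # under Python 2, use StringIO.StringIO instead

PATTERN = re.compile(r"\s+")
TEXT = "some   text   with   runs   of   spaces\n" * 1000

def workload():
    for _ in range(100):
        re.sub(PATTERN, " ", TEXT)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the five most expensive entries by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```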
|
msg274029 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-31 15:15 |
Having disabled multiprocessing, I ran new, more extensive profiles.
The biggest problem is in re.sub(), which in our codebase evidently uses
3 times as much processing time in Python 2.7.12 as it does in Python 2.7.6.
I'm attaching the profile reports for both runs.
Here's a possibly-relevant question: Are regular expressions now being
unnecessarily recompiled by re.sub()?
We never call the sub() method of a compiled regular expression; we
always call sub() as "re.sub()" and pass the compiled regular expression
as the first argument. Years ago when our code was first developed,
re.sub() did not always work correctly when invoked as a method of a
compiled regular expression object. I don't know if that's still a
problem, but the problem persisted for years, and so we have a
convention of always invoking sub() at the module level.
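The two call styles can be compared directly. For what it's worth, in the CPython sources I have checked, re.sub() with an already-compiled pattern does not recompile it -- re._compile() returns a compiled pattern object unchanged -- though the module-level call does add a per-call lookup that the method call skips. A minimal sketch with placeholder data, not the production code:

```python
# Compare the module-level convention (re.sub with a compiled pattern
# as first argument) against the equivalent method call.  Pattern and
# text are placeholders; timings will vary by machine and version.
import re
import timeit

PATTERN = re.compile(r"\d+")
TEXT = "row 12 col 345 val 6789 " * 500

module_style = lambda: re.sub(PATTERN, "#", TEXT)  # convention used here
method_style = lambda: PATTERN.sub("#", TEXT)      # equivalent method call

assert module_style() == method_style()            # identical results

t_module = timeit.timeit(module_style, number=200)
t_method = timeit.timeit(method_style, number=200)
print("re.sub(compiled, ...): %.3f s" % t_module)
print("pattern.sub(...):      %.3f s" % t_method)
```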
Steve Newcomb
|
msg274182 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-09-01 20:53 |
On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you ... make a small set of regular expressions that demonstrate the performance regression.
>
Done. Attachments:
test.py : Code that exercises re.sub() and outputs a profile report.
test_output_2.7.6.txt : Output of test.py under Python 2.7.6.
test_output_2.7.12.txt : Output of test.py under Python 2.7.12.
p17-188.htm -- test data: public information from the U.S. Internal
Revenue Service.
Equivalent hardware was used in both cases.
The outputs show that 2.7.12's re.sub() takes 1.2 times as long as
2.7.6's. It's a significant difference, but...
...it was not the dramatic degradation I expected to find in this
exercise. Therefore I attempted to tease what I was looking for out of
the profile stats I already uploaded to this site, made from actual
production runs. My attempts are all found in an hg repository that can
be downloaded from
sftp://sftp@coolheads.com//files/py-re-perform-276-2712 using password
bysIe20H .
I do not feel the latter work took me where I wanted to go, and I think
the reason is that, at least for purposes of our application, Python
2.7.12 has been so extensively refactored since Python 2.7.6. So it's
an apples-to-oranges comparison, apparently. Still, the performance
difference for re.sub() is quite dramatic, and re.sub() is the only
comparable function whose performance dramatically worsened: in our
application, 2.7.12's re.sub() takes 3.04 times as long as 2.7.6's.
The good news, of course, is that the performance of the other
*comparable* functions largely improved, often dramatically. But at
least in our application, that doesn't come close to making up for the
degradation in re.sub().
My by-the-gut bottom line: somebody who really knows the re module
should take a deep look at re.sub(). Why would re.sub(), alone, take
so much longer to run, while *every* other function in the re module
gets (often much) faster? It feels like there's a bug somewhere in
re.sub().
Steve Newcomb
|
msg274183 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-09-01 21:01 |
Oops. The correct url is sftp://coolheads.com/files/py-re-perform-276v2712/
|
msg274189 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-09-02 00:35 |
On 09/01/2016 05:01 PM, Steve Newcomb wrote:
>
>> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as
>> 2.7.6's. It's a significant difference, but...
>>
>> ...it was not the dramatic degradation I expected to find in this
>> exercise.
On second (third?) thought, the degree of degradation could easily
depend on the source data being processed. Maybe test.py does, in fact,
demonstrate the problem, but the test data I used (p17-188.htm) do not
demonstrate a terribly severe case of the problem.
|
msg379073 - (view) |
Author: Gregory P. Smith (gregory.p.smith) * |
Date: 2020-10-20 01:29 |
2.7 is end of life.
if you have regular expression performance issues with something in 3, please open a new issue.
|
|
Date | User | Action | Args
2022-04-11 14:58:35 | admin | set | github: 72085 |
2020-10-20 01:29:48 | gregory.p.smith | set | status: pending -> closed
nosy:
+ gregory.p.smith messages:
+ msg379073
resolution: wont fix stage: resolved |
2019-09-09 07:32:08 | serhiy.storchaka | set | status: open -> pending |
2016-09-02 00:35:58 | steve.newcomb | set | messages:
+ msg274189 |
2016-09-01 21:18:35 | vstinner | set | nosy:
+ vstinner type: performance
|
2016-09-01 21:01:26 | steve.newcomb | set | messages:
+ msg274183 |
2016-09-01 20:53:00 | steve.newcomb | set | files:
+ test.py, test_output_2.7.6.txt, p17-188.htm, test_output_2.7.12.txt
messages:
+ msg274182 |
2016-08-31 15:15:16 | steve.newcomb | set | files:
+ newcomb_profile_report_for_2.7.6.txt, newcomb_profile_report_for_2.7.12.txt
messages:
+ msg274029 |
2016-08-30 18:33:35 | steve.newcomb | set | messages:
+ msg273951 |
2016-08-30 18:14:47 | steve.newcomb | set | messages:
+ msg273946 |
2016-08-30 17:24:33 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages:
+ msg273940
|
2016-08-30 16:46:58 | rhettinger | set | nosy:
+ rhettinger messages:
+ msg273935
|
2016-08-30 16:27:50 | steve.newcomb | create | |