This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: regexp performance degradation between 2.7.6 and 2.7.12
Type: performance Stage: resolved
Components: Versions:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: gregory.p.smith, rhettinger, serhiy.storchaka, steve.newcomb, vstinner
Priority: normal Keywords:

Created on 2016-08-30 16:27 by steve.newcomb, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
profiles_2.7.6_vs_2.7.12 steve.newcomb, 2016-08-30 16:27 profiles of processing same data with same Python code on 2.7.6 and 2.7.12.
newcomb_profile_report_for_2.7.6.txt steve.newcomb, 2016-08-31 15:14
newcomb_profile_report_for_2.7.12.txt steve.newcomb, 2016-08-31 15:15
test.py steve.newcomb, 2016-09-01 20:52
test_output_2.7.6.txt steve.newcomb, 2016-09-01 20:52
p17-188.htm steve.newcomb, 2016-09-01 20:52
test_output_2.7.12.txt steve.newcomb, 2016-09-01 20:53
Messages (10)
msg273932 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2016-08-30 16:27
Our most regular-expression-processing-intensive Python 2.7 code takes 2.5x more execution time in 2.7.12 than it did in 2.7.6.  I discovered this after upgrading from Ubuntu 14.04 to Ubuntu 16.04.  Basically this code runs thousands of compiled regular expressions on thousands of texts.  Both the multiprocessing module and the re module are heavily used.

See attached profiler outputs, which look quite different in several respects.  I used the profiling module to profile the same Python code, processing the same data, using the same hardware, under both Ubuntu 14.04 (Python 2.7.6) and Ubuntu 16.04 (Python 2.7.12).  

It is striking, for example, that cPickle.load appears so prominently in the 2.7.12 profile -- a fact which appears to implicate the multiprocessing module somehow.  But I suspect that the re module is more likely the main source of the problem, because the execution times of other production steps -- steps that do not call the multiprocessing module -- also appear to be extended to a degree that is roughly proportional to the amount of regular expression processing done in those other steps.

I will happily provide any further information I can.  Any insights about this surprisingly severe performance degradation would be welcome.
msg273935 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2016-08-30 16:46
It would be helpful if you could run "hg bisect" with your set-up to isolate the change that causes the problem.  Alternatively, make a small set of regular expressions that demonstrate the performance regression.
msg273940 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-08-30 17:24
According to your profile results all re functions are 2.5-4 times faster under 2.7.12 than under 2.7.6. May be I misinterpret it?

Note that 96-99% of time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 (4519.872 - 4474.890 = 44.982).
msg273946 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2016-08-30 18:14
On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you could run "hg bisect" with your set-up to isolate the change that causes the problem.
I don't think I understand you.  There's no difference in the Python 
code we're using in both cases.  The only differences, AFAIK, are in the 
Python interpreter and in the Linux distribution.  I'm not qualified to 
analyze the differences in the latter items.
>    Alternatively, make a small set of regular expressions that demonstrate the performance regression.
It will be hard to do that, because the code is so complex, and because 
debugging in the multiprocessing context is so hairy. Still, it's the 
only approach I can think of, too.  Sigh.  I'm thinking about it.
msg273951 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2016-08-30 18:33
On 08/30/2016 01:24 PM, Serhiy Storchaka wrote:
> Serhiy Storchaka added the comment:
>
> According to your profile results all re functions are 2.5-4 times faster under 2.7.12 than under 2.7.6. May be I misinterpret it?
I can't explain the profiler's report.  I'm kind of glad that you, too, 
find it baffling.  Is it possible that the profiler doesn't actually 
work predictably in the multiprocessing context?  If so, one thing I can 
*easily* do is to disable multiprocessing in that code and see what the 
profiler reports are then.  It will take all night, but I'm beginning to 
think it would be worthwhile, because it might point the finger of blame 
at either the multiprocessing module or the re module, but not both at once.

(I originally provided a "disable multiprocessing" capability in that 
code in order to use the Python debugger with it.  It would kind of make 
sense if the profiler had limitations similar to those of the debugger.)
>
> Note that 96-99% of time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 (4519.872 - 4474.890 = 44.982).
Yeah, I'm beginning to wonder if those strange statistics, too, are 
artifacts of using a single-process profiler in a multiprocessing context.
msg274029 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2016-08-31 15:15
Having disabled multiprocessing, I ran new, more extensive profiles.  
The biggest problem is in re.sub(), which in our codebase evidently uses 
3 times as much processing time in Python 2.7.12 as it does in Python 2.7.6.

I'm attaching the profile reports for both runs.

Here's a possibly-relevant question: Are regular expressions now being 
unnecessarily recompiled by re.sub()?

We never call the sub() method of a compiled regular expression; we 
always call sub() as "re.sub()" and pass the compiled regular expression 
as the first argument.  Years ago when our code was first developed, 
re.sub() did not always work correctly when invoked as a method of a 
compiled regular expression object.  I don't know if that's still a 
problem, but the problem persisted for years, and so we have a 
convention of always invoking sub() at the module level.

Steve Newcomb
msg274182 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2016-09-01 20:53
On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you ... make a small set of regular expressions that demonstrate the performance regression.
>
Done.  Attachments:

test.py : Code that exercises re.sub() and outputs a profile report.

test_output_2.7.6.txt : Output of test.py under Python 2.7.6.

test_output_2.7.12.txt : Output of test.py under Python 2.7.12.

p17.188.htm -- test data: public information from the U.S. Internal 
Revenue Service.

Equivalent hardware was used in both cases.

The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
2.7.6's.  It's a significant difference, but...

...it was not the dramatic degradation I expected to find in this 
exercise.  Therefore I attempted to tease what I was looking for out of 
the profile stats I already uploaded to this site, made from actual 
production runs.  My attempts are all found in an hg repository that can 
be downloaded from 
sftp://sftp@coolheads.com//files/py-re-perform-276-2712 using password 
bysIe20H .

I do not feel the latter work took me where I wanted to go, and I think 
the reason is that, at least for purposes of our application, Python 
2.7.12 has been so extensively refactored since Python 2.7.6.  So it's 
an apples-to-oranges comparison, apparently.  Still, the performance 
difference for re.sub() is quite dramatic , and re.sub() is the only 
comparable function whose performance dramatically worsened: in our 
application, 2.7.12's re.sub() takes 3.04 times as long as 2.7.6's.

The good news, of course, is that by and large the performance of the 
other *comparable* functions largely improved, often dramatically.  But 
at least in our application, it doesn't come close to making up for the 
degradation in re.sub().

My by-the-gut bottom line: somebody who really knows the re module 
should take a deep look at re.sub().  Why would re.sub(), unlike all 
others, take so much longer to run, while *every* other function in the 
re module get (often much) faster?  It feels like there's a bug 
somewhere in re.sub().

Steve Newcomb
msg274183 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2016-09-01 21:01
Oops.  The correct url is sftp://coolheads.com/files/py-re-perform-276v2712/

On 09/01/2016 04:52 PM, Steve Newcomb wrote:
> On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
>> Raymond Hettinger added the comment:
>>
>> It would be helpful if you ... make a small set of regular 
>> expressions that demonstrate the performance regression.
>>
> Done.  Attachments:
>
> test.py : Code that exercises re.sub() and outputs a profile report.
>
> test_output_2.7.6.txt : Output of test.py under Python 2.7.6.
>
> test_output_2.7.12.txt : Output of test.py under Python 2.7.12.
>
> p17.188.htm -- test data: public information from the U.S. Internal 
> Revenue Service.
>
> Equivalent hardware was used in both cases.
>
> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
> 2.7.6's.  It's a significant difference, but...
>
> ...it was not the dramatic degradation I expected to find in this 
> exercise.  Therefore I attempted to tease what I was looking for out 
> of the profile stats I already uploaded to this site, made from actual 
> production runs.  My attempts are all found in an hg repository that 
> can be downloaded from 
> sftp://sftp@coolheads.com//files/py-re-perform-276-2712 using password 
> bysIe20H .
>
> I do not feel the latter work took me where I wanted to go, and I 
> think the reason is that, at least for purposes of our application, 
> Python 2.7.12 has been so extensively refactored since Python 2.7.6.  
> So it's an apples-to-oranges comparison, apparently.  Still, the 
> performance difference for re.sub() is quite dramatic , and re.sub() 
> is the only comparable function whose performance dramatically 
> worsened: in our application, 2.7.12's re.sub() takes 3.04 times as 
> long as 2.7.6's.
>
> The good news, of course, is that by and large the performance of the 
> other *comparable* functions largely improved, often dramatically.  
> But at least in our application, it doesn't come close to making up 
> for the degradation in re.sub().
>
> My by-the-gut bottom line: somebody who really knows the re module 
> should take a deep look at re.sub().  Why would re.sub(), unlike all 
> others, take so much longer to run, while *every* other function in 
> the re module get (often much) faster?  It feels like there's a bug 
> somewhere in re.sub().
>
> Steve Newcomb
>
msg274189 - (view) Author: Steve Newcomb (steve.newcomb) * Date: 2016-09-02 00:35
On 09/01/2016 05:01 PM, Steve Newcomb wrote:
>
>> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as 
>> 2.7.6's.  It's a significant difference, but...
>>
>> ...it was not the dramatic degradation I expected to find in this 
>> exercise.
On second (third?) thought, the degree of degradation could easily 
depend on the source data being processed.  Maybe test.py does, in fact, 
demonstrate the problem, but the test data I used (p17-118.htm) do not 
demonstrate a terribly severe case of the problem.
msg379073 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2020-10-20 01:29
2.7 is end of life.

if you have regular expression performance issues with something in 3, please open a new issue.
History
Date User Action Args
2022-04-11 14:58:35adminsetgithub: 72085
2020-10-20 01:29:48gregory.p.smithsetstatus: pending -> closed

nosy: + gregory.p.smith
messages: + msg379073

resolution: wont fix
stage: resolved
2019-09-09 07:32:08serhiy.storchakasetstatus: open -> pending
2016-09-02 00:35:58steve.newcombsetmessages: + msg274189
2016-09-01 21:18:35vstinnersetnosy: + vstinner
type: performance
2016-09-01 21:01:26steve.newcombsetmessages: + msg274183
2016-09-01 20:53:00steve.newcombsetfiles: + test.py, test_output_2.7.6.txt, p17-188.htm, test_output_2.7.12.txt

messages: + msg274182
2016-08-31 15:15:16steve.newcombsetfiles: + newcomb_profile_report_for_2.7.6.txt, newcomb_profile_report_for_2.7.12.txt

messages: + msg274029
2016-08-30 18:33:35steve.newcombsetmessages: + msg273951
2016-08-30 18:14:47steve.newcombsetmessages: + msg273946
2016-08-30 17:24:33serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg273940
2016-08-30 16:46:58rhettingersetnosy: + rhettinger
messages: + msg273935
2016-08-30 16:27:50steve.newcombcreate