msg273932 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-30 16:27 |
Our most regular-expression-processing-intensive Python 2.7 code takes 2.5x more execution time in 2.7.12 than it did in 2.7.6. I discovered this after upgrading from Ubuntu 14.04 to Ubuntu 16.04. Basically this code runs thousands of compiled regular expressions on thousands of texts. Both the multiprocessing module and the re module are heavily used.
See attached profiler outputs, which look quite different in several respects. I used the profiling module to profile the same Python code, processing the same data, using the same hardware, under both Ubuntu 14.04 (Python 2.7.6) and Ubuntu 16.04 (Python 2.7.12).
It is striking, for example, that cPickle.load appears so prominently in the 2.7.12 profile -- a fact which appears to implicate the multiprocessing module somehow. But I suspect that the re module is more likely the main source of the problem, because the execution times of other production steps -- steps that do not call the multiprocessing module -- also appear to be extended to a degree that is roughly proportional to the amount of regular expression processing done in those other steps.
I will happily provide any further information I can. Any insights about this surprisingly severe performance degradation would be welcome.
|
msg273935 - (view) |
Author: Raymond Hettinger (rhettinger) * |
Date: 2016-08-30 16:46 |
It would be helpful if you could run "hg bisect" with your set-up to isolate the change that causes the problem. Alternatively, make a small set of regular expressions that demonstrate the performance regression.
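A small, self-contained benchmark of the kind suggested here could be sketched as follows. The patterns and text below are placeholders, not the reporter's actual workload; the point is only that the same script can be run under both interpreters and the printed totals compared:

```python
# Hypothetical micro-benchmark sketch: time re.sub() over a batch of
# compiled patterns.  Run the identical script under 2.7.6 and 2.7.12
# and compare the totals.  Patterns and text are placeholders.
import re
import timeit

PATTERNS = [re.compile(p) for p in (r"\b\w+\b", r"\d+", r"<[^>]+>")]
TEXT = "<p>Sample text 123 with <b>markup</b> and words.</p>" * 100

def run_subs():
    out = TEXT
    for pat in PATTERNS:
        # module-level call style, matching the convention described
        # later in this thread
        out = re.sub(pat, "", out)
    return out

elapsed = timeit.timeit(run_subs, number=200)
print("total seconds for 200 iterations: %.3f" % elapsed)
```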
|
msg273940 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) * |
Date: 2016-08-30 17:24 |
According to your profile results, all re functions are 2.5-4 times faster under 2.7.12 than under 2.7.6. Maybe I misinterpret it?
Note that 96-99% of the time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of the time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 (4519.872 - 4474.890 = 44.982).
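The arithmetic above can be reproduced directly from the totals in the two attached profile reports:

```python
# Residual-time arithmetic from the two profile reports: subtract the
# time spent blocked in posix.waitpid from the grand total, leaving the
# time actually spent running Python code in the profiled process.
totals  = {"2.7.6": 2980.718, "2.7.12": 4519.872}
waitpid = {"2.7.6": 2847.099, "2.7.12": 4474.890}

for version in ("2.7.6", "2.7.12"):
    residual = totals[version] - waitpid[version]
    share = waitpid[version] / totals[version]
    print("%s: %.3f s outside waitpid (%.1f%% of total in waitpid)"
          % (version, residual, share * 100))
```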
|
msg273946 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-30 18:14 |
On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you could run "hg bisect" with your set-up to isolate the change that causes the problem.
I don't think I understand you. There's no difference in the Python
code we're using in both cases. The only differences, AFAIK, are in the
Python interpreter and in the Linux distribution. I'm not qualified to
analyze the differences in the latter items.
> Alternatively, make a small set of regular expressions that demonstrate the performance regression.
It will be hard to do that, because the code is so complex, and because
debugging in the multiprocessing context is so hairy. Still, it's the
only approach I can think of, too. Sigh. I'm thinking about it.
|
msg273951 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-30 18:33 |
On 08/30/2016 01:24 PM, Serhiy Storchaka wrote:
> Serhiy Storchaka added the comment:
>
> According to your profile results, all re functions are 2.5-4 times faster under 2.7.12 than under 2.7.6. Maybe I misinterpret it?
I can't explain the profiler's report. I'm kind of glad that you, too,
find it baffling. Is it possible that the profiler doesn't actually
work predictably in the multiprocessing context? If so, one thing I can
*easily* do is to disable multiprocessing in that code and see what the
profiler reports are then. It will take all night, but I'm beginning to
think it would be worthwhile, because it might point the finger of blame
at either the multiprocessing module or the re module, but not both at once.
(I originally provided a "disable multiprocessing" capability in that
code in order to use the Python debugger with it. It would kind of make
sense if the profiler had limitations similar to those of the debugger.)
>
> Note that 96-99% of the time (2847.099 of 2980.718 seconds under 2.7.6 and 4474.890 of 4519.872 seconds under 2.7.12) is spent in posix.waitpid. The rest of the time is larger under 2.7.6 (2980.718 - 2847.099 = 133.619) than under 2.7.12 (4519.872 - 4474.890 = 44.982).
Yeah, I'm beginning to wonder if those strange statistics, too, are
artifacts of using a single-process profiler in a multiprocessing context.
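The single-process experiment described here, profiling the regex work directly with cProfile so that no child processes confuse the report, could look roughly like this. The pattern and text are stand-ins for the real workload:

```python
# Sketch of the "disable multiprocessing" experiment: profile the regex
# work in a single process with cProfile, so the report is not skewed by
# child processes the profiler cannot see.  Pattern and text are
# placeholders for the real production workload.
import cProfile
import pstats
import re
import io  # under Python 2, use StringIO.StringIO instead

PATTERN = re.compile(r"\s+")
TEXT = "some   text   with   runs   of   spaces\n" * 1000

def workload():
    for _ in range(100):
        re.sub(PATTERN, " ", TEXT)

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the five most expensive entries by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```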
|
msg274029 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-08-31 15:15 |
Having disabled multiprocessing, I ran new, more extensive profiles.
The biggest problem is in re.sub(), which in our codebase evidently uses
3 times as much processing time in Python 2.7.12 as it does in Python 2.7.6.
I'm attaching the profile reports for both runs.
Here's a possibly-relevant question: Are regular expressions now being
unnecessarily recompiled by re.sub()?
We never call the sub() method of a compiled regular expression; we
always call sub() as "re.sub()" and pass the compiled regular expression
as the first argument. Years ago when our code was first developed,
re.sub() did not always work correctly when invoked as a method of a
compiled regular expression object. I don't know if that's still a
problem, but the problem persisted for years, and so we have a
convention of always invoking sub() at the module level.
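The two call styles can be compared directly. For what it's worth, in the CPython sources I have checked, re.sub() with an already-compiled pattern does not recompile it -- re._compile() returns a compiled pattern object unchanged -- though the module-level call does add a per-call lookup that the method call skips. A minimal sketch with placeholder data, not the production code:

```python
# Compare the module-level convention (re.sub with a compiled pattern
# as first argument) against the equivalent method call.  Pattern and
# text are placeholders; timings will vary by machine and version.
import re
import timeit

PATTERN = re.compile(r"\d+")
TEXT = "row 12 col 345 val 6789 " * 500

module_style = lambda: re.sub(PATTERN, "#", TEXT)  # convention used here
method_style = lambda: PATTERN.sub("#", TEXT)      # equivalent method call

assert module_style() == method_style()            # identical results

t_module = timeit.timeit(module_style, number=200)
t_method = timeit.timeit(method_style, number=200)
print("re.sub(compiled, ...): %.3f s" % t_module)
print("pattern.sub(...):      %.3f s" % t_method)
```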
Steve Newcomb
|
msg274182 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-09-01 20:53 |
On 08/30/2016 12:46 PM, Raymond Hettinger wrote:
> Raymond Hettinger added the comment:
>
> It would be helpful if you ... make a small set of regular expressions that demonstrate the performance regression.
>
Done. Attachments:
test.py : Code that exercises re.sub() and outputs a profile report.
test_output_2.7.6.txt : Output of test.py under Python 2.7.6.
test_output_2.7.12.txt : Output of test.py under Python 2.7.12.
p17-188.htm -- test data: public information from the U.S. Internal
Revenue Service.
Equivalent hardware was used in both cases.
The outputs show that 2.7.12's re.sub() takes 1.2 times as long as
2.7.6's. It's a significant difference, but...
...it was not the dramatic degradation I expected to find in this
exercise. Therefore I attempted to tease what I was looking for out of
the profile stats I already uploaded to this site, made from actual
production runs. My attempts are all found in an hg repository that can
be downloaded from
sftp://sftp@coolheads.com//files/py-re-perform-276-2712 using password
bysIe20H .
I do not feel the latter work took me where I wanted to go, and I think
the reason is that, at least for purposes of our application, Python
2.7.12 has been so extensively refactored since Python 2.7.6. So it's
an apples-to-oranges comparison, apparently. Still, the performance
difference for re.sub() is quite dramatic, and re.sub() is the only
comparable function whose performance dramatically worsened: in our
application, 2.7.12's re.sub() takes 3.04 times as long as 2.7.6's.
The good news, of course, is that the performance of the other
*comparable* functions largely improved, often dramatically. But at
least in our application, that doesn't come close to making up for the
degradation in re.sub().
My by-the-gut bottom line: somebody who really knows the re module
should take a deep look at re.sub(). Why would re.sub(), alone, take
so much longer to run, while *every* other function in the re module
gets (often much) faster? It feels like there's a bug somewhere in
re.sub().
Steve Newcomb
|
msg274183 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-09-01 21:01 |
Oops. The correct url is sftp://coolheads.com/files/py-re-perform-276v2712/
|
msg274189 - (view) |
Author: Steve Newcomb (steve.newcomb) * |
Date: 2016-09-02 00:35 |
On 09/01/2016 05:01 PM, Steve Newcomb wrote:
>
>> The outputs show that 2.7.12's re.sub() takes 1.2 times as long as
>> 2.7.6's. It's a significant difference, but...
>>
>> ...it was not the dramatic degradation I expected to find in this
>> exercise.
On second (third?) thought, the degree of degradation could easily
depend on the source data being processed. Maybe test.py does, in fact,
demonstrate the problem, but the test data I used (p17-188.htm) do not
demonstrate a terribly severe case of the problem.
|
msg379073 - (view) |
Author: Gregory P. Smith (gregory.p.smith) * |
Date: 2020-10-20 01:29 |
2.7 is end of life.
if you have regular expression performance issues with something in 3, please open a new issue.
|
|
Date | User | Action | Args
2022-04-11 14:58:35 | admin | set | github: 72085 |
2020-10-20 01:29:48 | gregory.p.smith | set | status: pending -> closed
nosy:
+ gregory.p.smith messages:
+ msg379073
resolution: wont fix stage: resolved |
2019-09-09 07:32:08 | serhiy.storchaka | set | status: open -> pending |
2016-09-02 00:35:58 | steve.newcomb | set | messages:
+ msg274189 |
2016-09-01 21:18:35 | vstinner | set | nosy:
+ vstinner type: performance
|
2016-09-01 21:01:26 | steve.newcomb | set | messages:
+ msg274183 |
2016-09-01 20:53:00 | steve.newcomb | set | files:
+ test.py, test_output_2.7.6.txt, p17-188.htm, test_output_2.7.12.txt
messages:
+ msg274182 |
2016-08-31 15:15:16 | steve.newcomb | set | files:
+ newcomb_profile_report_for_2.7.6.txt, newcomb_profile_report_for_2.7.12.txt
messages:
+ msg274029 |
2016-08-30 18:33:35 | steve.newcomb | set | messages:
+ msg273951 |
2016-08-30 18:14:47 | steve.newcomb | set | messages:
+ msg273946 |
2016-08-30 17:24:33 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages:
+ msg273940
|
2016-08-30 16:46:58 | rhettinger | set | nosy:
+ rhettinger messages:
+ msg273935
|
2016-08-30 16:27:50 | steve.newcomb | create | |