Message 245144 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Patrick Maupin
Recipients	Patrick Maupin, ezio.melotti, mrabarnett, serhiy.storchaka
Date	2015-06-10.20:16:59
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1433967419.61.0.656860695251.issue24426@psf.upfronthosting.co.za>
In-reply-to

Content
1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed. 2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do. 3) My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)' because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file. Yes, I'm sure this could be optimized, as well, but when I can just remove the parentheses and get a 10X speedup, and then figure out the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that the re module has performance issues with capturing groups with split(), and casual users like me should figure out what the matching strings are some other way? I assumed that, since I saw almost exactly the same performance degradation with \n as I did with this, that that was a valid testcase. If that was a bad assumption and this is insufficient to debug it, I can submit a bigger testcase. But if this issue is going to be wontfixed for some reason, there should certainly be a documentation note added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing, but also capturing the 1K ten-byte strings inbetween the big ones. Thanks, Pat

1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed.

2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do.

3) My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)' because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file.  Yes, I'm sure this could be optimized, as well, but when I can just remove the parentheses and get a 10X speedup, and then figure out the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that the re module has performance issues with capturing groups with split(), and casual users like me should figure out what the matching strings are some other way?


I assumed that, since I saw almost exactly the same performance degradation with \n as I did with this, that that was a valid testcase.  If that was a bad assumption and this is insufficient to debug it, I can submit a bigger testcase.


But if this issue is going to be wontfixed for some reason, there should certainly be a documentation note added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing, but also capturing the 1K ten-byte strings inbetween the big ones.


Thanks,
Pat

History
Date	User	Action	Args
2015-06-10 20:16:59	Patrick Maupin	set	recipients: + Patrick Maupin, ezio.melotti, mrabarnett, serhiy.storchaka
2015-06-10 20:16:59	Patrick Maupin	set	messageid: <1433967419.61.0.656860695251.issue24426@psf.upfronthosting.co.za>
2015-06-10 20:16:59	Patrick Maupin	link	issue24426 messages
2015-06-10 20:16:59	Patrick Maupin	create