
Author Patrick Maupin
Recipients Patrick Maupin, ezio.melotti, mrabarnett, serhiy.storchaka
Date 2015-06-10.20:16:59
Message-id <1433967419.61.0.656860695251.issue24426@psf.upfronthosting.co.za>
In-reply-to
Content
1) I have obviously oversimplified my test case, to the point where a developer thinks I'm silly enough to reach for the regex module just to split on a linefeed.

2) '\n(?<=(\n))' -- yes, of course, any casual user of the re module would immediately choose that as the most obvious thing to do.

3) My real regex is r'( [a-zA-Z0-9_]+ \[[0-9]+\][0-9:]+\].*\n)' because I am taking nasty broken output from a Cadence tool, fixing it up, and dumping it back out to a file. Yes, I'm sure this could be optimized as well, but when I can just remove the parentheses and get a 10X speedup, and then work out the string I meant to capture by looking at string lengths, shouldn't there at least be a warning that the re module has performance issues with capturing groups in split(), and that casual users like me should find the matching strings some other way?
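One way to keep the separators without a capturing group is to walk the matches directly. The sketch below is a hypothetical illustration, not the approach used in this report (which recovered the separators from string lengths). The helper name split_keeping_separators and the sample text are assumptions, and the helper assumes the pattern never matches an empty string.

```python
import re


def split_keeping_separators(pattern, text):
    """Interleave pieces and separators, as re.split does when the whole
    pattern is wrapped in one capturing group, but without using a group.

    Hypothetical helper; assumes `pattern` never matches an empty string.
    """
    result = []
    pos = 0
    for match in re.finditer(pattern, text):
        result.append(text[pos:match.start()])  # text before the separator
        result.append(match.group(0))           # the separator itself
        pos = match.end()
    result.append(text[pos:])                   # trailing piece
    return result


sample = "aaa\nbbb\nccc"
pieces = split_keeping_separators(r"\n", sample)
print(pieces)
```

For non-empty matches this should produce the same list as re.split(r"(\n)", sample). Whether it avoids the slowdown reported here has not been measured.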


Since I saw almost exactly the same performance degradation with \n as with this regex, I assumed \n was a valid test case.  If that was a bad assumption and it is insufficient for debugging, I can submit a bigger test case.


But if this issue is going to be wontfixed for some reason, there should certainly be a documentation note added, because it is not intuitive that splitting 5GB of data into 1000 strings of around 5MB each should be 10X faster than doing the same thing while also capturing the 1K ten-byte strings in between the big ones.
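The comparison described above can be reproduced at a much smaller scale with a harness like the following. This is a minimal sketch: the helper name time_split and the data sizes (1000 chunks of about 5 KB rather than 5 MB) are assumptions chosen to keep it quick, and the measured ratio will vary with Python version and machine.

```python
import re
import time


def time_split(pattern, text):
    """Return (elapsed seconds, result list) for one re.split call.

    Hypothetical helper for illustration only.
    """
    start = time.perf_counter()
    parts = re.split(pattern, text)
    return time.perf_counter() - start, parts


# Scaled-down stand-in for the real data: 1000 chunks joined by newlines.
chunk = "x" * 5000
text = "\n".join([chunk] * 1000)

plain_secs, plain_parts = time_split(r"\n", text)    # no capturing group
group_secs, group_parts = time_split(r"(\n)", text)  # capture each separator

print(f"without group: {plain_secs:.4f}s, {len(plain_parts)} items")
print(f"with group:    {group_secs:.4f}s, {len(group_parts)} items")
```

The two calls return the same big strings; the second additionally returns each separator between them, so any timing difference is attributable to the capturing group.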


Thanks,
Pat
History
Date                 User            Action  Args
2015-06-10 20:16:59  Patrick Maupin  set     recipients: + Patrick Maupin, ezio.melotti, mrabarnett, serhiy.storchaka
2015-06-10 20:16:59  Patrick Maupin  set     messageid: <1433967419.61.0.656860695251.issue24426@psf.upfronthosting.co.za>
2015-06-10 20:16:59  Patrick Maupin  link    issue24426 messages
2015-06-10 20:16:59  Patrick Maupin  create