Author Patrick Maupin
Recipients Patrick Maupin, ezio.melotti, mrabarnett, serhiy.storchaka
Date 2015-06-10.21:28:44
Just to be perfectly clear, this is no exaggeration:

My original file was slightly over 5GB.

I have approximately 1050 bad strings in it, averaging around 11 characters per string.

If I split it without capturing those 1050 strings, it takes 3.7 seconds.

If I split it and capture those 1050 strings, it takes 39 seconds.

ISTM that 33 ms to create a capture group with a single 11-character string is excessive, so there is probably something else going on, like excessive object copying, that just isn't noticeable on a smaller source string.
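For reference, the per-capture figure follows directly from the timings quoted above. This is only a sanity check of the arithmetic; the variable names are illustrative and not taken from the original example.

```python
# Timings and counts quoted in this message.
plain_seconds = 3.7      # split without capturing
capture_seconds = 39.0   # split with a capture group
captures = 1050          # number of captured strings

# Extra wall-clock time attributable to each captured string, in milliseconds.
extra_ms_per_capture = (capture_seconds - plain_seconds) / captures * 1000
print(round(extra_ms_per_capture, 1))
```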

In the small example I posted, if I replace the line:

data = 100 * (200000 * ' ' + '\n')

with:

data = 1000 * (500000 * ' ' + '\n')

then I get approximately the same 3.7 second vs 39 second results on that (somewhat older) machine.  I didn't start out with that in the example, because I thought the problem should still be obvious from the scaled down example.
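A minimal sketch of this kind of comparison is shown here. The small example itself is not reproduced in this message, so the separator pattern (`\n`) and the helper names `time_split` and `compare` are assumptions made for illustration, not the original script.

```python
import re
import time


def time_split(pattern, data):
    """Return (elapsed_seconds, number_of_pieces) for one re.split call."""
    start = time.perf_counter()
    pieces = re.split(pattern, data)
    return time.perf_counter() - start, len(pieces)


def compare(data, separator=r'\n'):
    """Time the same split without and with a capture group.

    `separator` is an assumed stand-in pattern; the original example
    may have used a different one.
    """
    plain_time, plain_count = time_split(separator, data)
    captured_time, captured_count = time_split('(' + separator + ')', data)
    return plain_time, captured_time, plain_count, captured_count
```

Calling `compare(100 * (200000 * ' ' + '\n'))` runs the scaled-down case, and swapping in the larger `data` expression gives the bigger one. Absolute timings will depend on the machine and the Python version.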

Obviously, your CPU numbers will be somewhat different.  The question remains, though, why it takes around 60 million CPU cycles for each and every returned capture group.  Or, to put it another way, why can I stop doing the capture group, and grab the same string with pure Python by looking at the string lengths of the intervening strings, well over 100 times faster than it takes for the re module to give me that group?
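One way to read the pure-Python workaround described above is sketched below: split without a capture group, then walk the source string using the lengths of the returned pieces to locate each separator. The function name `split_keeping_separators` is hypothetical, and using `pattern.match(data, pos)` to find where each separator ends is an assumption about how the lookup could be done. The sketch also assumes the pattern never matches an empty string.

```python
import re


def split_keeping_separators(pattern, data):
    """Mimic re.split with a capture group, using a non-capturing split.

    Assumes `pattern` is a compiled regex that never matches the empty string.
    """
    pieces = pattern.split(data)
    result = []
    pos = 0
    for piece in pieces[:-1]:
        result.append(piece)
        pos += len(piece)                 # the separator starts right after this piece
        match = pattern.match(data, pos)  # re-match only at the known offset
        result.append(match.group())
        pos = match.end()
    result.append(pieces[-1])
    return result


text = 'aa\n\nbbb\nc'
print(split_keeping_separators(re.compile(r'\n+'), text))
```

For this input the result is the same list that `re.split(r'(\n+)', text)` returns.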

Date                 User            Action  Args
2015-06-10 21:28:44  Patrick Maupin  set     recipients: + Patrick Maupin, ezio.melotti, mrabarnett, serhiy.storchaka
2015-06-10 21:28:44  Patrick Maupin  set     messageid: <>
2015-06-10 21:28:44  Patrick Maupin  link    issue24426 messages
2015-06-10 21:28:44  Patrick Maupin  create