Author nirai
Recipients asnakelover, brian.curtin, jackdied, nirai, pitrou
Date 2009-12-13.09:38:52
Message-id <1260697137.98.0.871265257089.issue7471@psf.upfronthosting.co.za>
In-reply-to
Content
First patch, please forgive long comment :)

I submit a small patch which speeds up readline() on my data set - a 
74MB (5MB .gz) log file with 600K lines.

The speedup is 350%.

The source of the slowness is that the (~20KB) extrabuf is allocated and 
deallocated in read() and _unread() on each call to readline().

In the patch, read() returns a slice from extrabuf and defers 
manipulation of extrabuf to _read().
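
The difference between the two strategies can be sketched as follows. This 
is a minimal illustration with hypothetical class names, not the actual 
gzip.py code; it only models how the buffer is handled:

```python
class SliceReader:
    """Patched approach: keep an offset into extrabuf and hand out slices."""
    def __init__(self, data):
        self.extrabuf = data
        self.offset = 0

    def read(self, size):
        # Return a slice of the existing buffer; nothing is rebuilt.
        chunk = self.extrabuf[self.offset:self.offset + size]
        self.offset += len(chunk)
        return chunk


class UnreadReader:
    """Current approach: carve off the head and rebuild extrabuf each call."""
    def __init__(self, data):
        self.extrabuf = data

    def read(self, size):
        chunk = self.extrabuf[:size]
        self.extrabuf = self.extrabuf[size:]   # reallocates on every read()
        return chunk

    def _unread(self, buf):
        self.extrabuf = buf + self.extrabuf    # another reallocation
```

In the first case a readline() that consumes a short line costs one slice; 
in the second it costs carving up and reconcatenating the whole ~20KB 
buffer, which is where the timings below come from.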

In the following, the first timeit() run corresponds to reading extrabuf 
slices, while the second corresponds to read() and _unread() as they are 
done today:

>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215

>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
5.843876838684082

Another speedup comes from a small shortcut in readline() for the 
typical case in which the entire line is already in extrabuf.
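
Roughly, the shortcut amounts to something like this (a simplified, 
hypothetical sketch; the real patch works inside GzipFile.readline()):

```python
def readline_fast_path(extrabuf, offset):
    """If a newline is already present in extrabuf, slice the line out
    directly instead of going through read()/_unread().

    Returns (line, new_offset), or None when more data must first be
    decompressed (the slow path).
    """
    i = extrabuf.find('\n', offset)
    if i < 0:
        return None                      # slow path: refill the buffer
    return extrabuf[offset:i + 1], i + 1
```

Since a 74MB file with 600K lines averages ~120 bytes per line, almost 
every line fits inside the ~20KB extrabuf, so this fast path is taken on 
nearly every call.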

The patch only addresses the typical case of calling readline() with no 
arguments; it does not address other problems in readline()'s logic. In 
particular, the current 512-byte chunk size is not a sweet spot: regardless 
of the size argument passed to readline(), read() will continue to 
decompress just 1024 bytes with each call, because the size of extrabuf 
swings around the target size as a result of the interaction between 
_unread() and read().