Message96326
First patch, please forgive long comment :)
I submit a small patch which speeds up readline() on my data set - a
74MB (5MB .gz) log file with 600K lines.
The speedup is 350%.
The source of the slowness is that the (~20KB) extrabuf is allocated and
deallocated in read() and _unread() on every call to readline().
With the patch, read() returns a slice of extrabuf and defers the
manipulation of extrabuf to _read().
In the following, the first timeit() corresponds to reading extrabuf
slices while the second timeit() corresponds to read() and _unread() as
they are done today:
>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215
>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
5.843876838684082
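The slice-based fast path can be sketched roughly like this (a minimal illustration with hypothetical names; the actual patch modifies the internals of GzipFile.read()/_read()):

```python
class SliceBuffer:
    """Hypothetical sketch of the patched approach: serve read() as
    slices of a retained buffer plus an offset, instead of splitting
    and re-joining the buffer (the slow x[:n]; x[n:]; x + y pattern)
    on every call."""

    def __init__(self, data):
        self.extrabuf = data  # decompressed bytes not yet consumed
        self.offset = 0       # read position within extrabuf

    def read(self, size):
        # Fast path: one slice and an offset bump; extrabuf itself
        # is left untouched until it is exhausted.
        chunk = self.extrabuf[self.offset:self.offset + size]
        self.offset += len(chunk)
        return chunk


buf = SliceBuffer(b"x" * 20000)
chunk = buf.read(100)
```

The first timeit() above measures exactly this cheap operation: a single slice of an existing string.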
Another speedup is achieved by doing a small shortcut in readline() for
the typical case in which the entire line is already in extrabuf.
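That shortcut might look something like this (a hedged sketch with a hypothetical helper name, not the actual gzip.py code; the real patch operates on GzipFile's internal state):

```python
def readline_from_extrabuf(extrabuf, offset):
    """Hypothetical fast path: if a complete line is already buffered
    in extrabuf, return it as one slice and skip the read()/_unread()
    round trip entirely."""
    i = extrabuf.find(b"\n", offset)
    if i >= 0:
        return extrabuf[offset:i + 1], i + 1  # line, new offset
    return None, offset  # no full line buffered; use the general path


line, off = readline_from_extrabuf(b"GET /index\nPOST /a\n", 0)
```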
The patch only addresses the typical case of calling readline() with no
arguments. It does not address other problems in the readline() logic. In
particular, the current 512-byte chunk size is not a sweet spot: regardless
of the size argument passed to readline(), read() will continue to
decompress just 1024 bytes with each call, as the size of extrabuf swings
around the target size argument as a result of the interaction between
_unread() and read().
History:
Date | User | Action | Args
2009-12-13 09:38:58 | nirai | set | recipients: + nirai, pitrou, jackdied, brian.curtin, asnakelover
2009-12-13 09:38:57 | nirai | set | messageid: <1260697137.98.0.871265257089.issue7471@psf.upfronthosting.co.za>
2009-12-13 09:38:55 | nirai | link | issue7471 messages
2009-12-13 09:38:54 | nirai | create |