
Author Michael.Fox
Recipients Michael.Fox, nadeem.vawda, vstinner
Date 2013-05-18.21:48:21
Message-id <CABbL6oZnJtD7CE6hUcvBhO2Px1x86g5fPBapm2f9mRCWREPuyw@mail.gmail.com>
In-reply-to <CABbL6oaiyhE_WcXdPd9fOp8XLQ-co5cSiFGfHWEo9YYFdy5erg@mail.gmail.com>
Content
I looked into it a little, and it looks like pyliblzma is a pure C
extension, whereas the new lzma library wraps liblzma but the rest is
Python. In particular, this happens for every line:

        if size < 0:
            end = self._buffer.find(b"\n", self._buffer_offset) + 1
            if end > 0:
                line = self._buffer[self._buffer_offset : end]
                self._buffer_offset = end
                self._pos += len(line)
                return line

And while that doesn't look like a lot of overhead, it's definitely
something. So, unless someone thinks a pure C extension is the right
technical direction, lzma in 3.4 is probably as fast as it's ever
going to be. I will just use the workaround of piping the file through
unxz regardless; a rough sketch of that workaround follows.
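
For what it's worth, here is a minimal sketch of both paths. My
lzmaperf.py isn't reproduced in this message, so the lzma.open()
version is only an approximation of it, and bigfile.xz is just the
file from the timings quoted below:

    # Count lines in an .xz file two ways: through the lzma module
    # (exercising the readline() path shown above) and by piping the
    # output of an external unxz process.
    import lzma
    import subprocess

    def count_lines_lzma(path):
        # Goes through lzma.py's pure-Python readline()/iteration machinery.
        with lzma.open(path, "rb") as f:
            return sum(1 for _ in f)

    def count_lines_unxz(path):
        # Workaround: let the external decompressor do the work and
        # just read lines from its stdout.
        proc = subprocess.Popen(["unxz", "-c", path], stdout=subprocess.PIPE)
        try:
            return sum(1 for _ in proc.stdout)
        finally:
            proc.stdout.close()
            proc.wait()

    if __name__ == "__main__":
        print(count_lines_lzma("bigfile.xz"))
        print(count_lines_unxz("bigfile.xz"))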

On Sat, May 18, 2013 at 2:12 PM, Michael Fox <415fox@gmail.com> wrote:
> 3.4 is much better, but still about 4x slower than 2.7:
>
> m@air:~/q/topaz/parse_datalog$ time python2.7 lzmaperf.py
> 102368
>
> real    0m0.053s
> user    0m0.052s
> sys     0m0.000s
> m@air:~/q/topaz/parse_datalog$ time
> ~/tmp/cpython-23836f17e4a2/bin/python3.4 lzmaperf.py
> 102368
>
> real    0m0.229s
> user    0m0.212s
> sys     0m0.012s
>
> The bottleneck has moved here:
>  102369    0.151    0.000    0.226    0.000 lzma.py:333(readline)
>
> I don't know if this is a strictly fair comparison. The lzma module
> and pyliblzma may not be of the same quality. I've just come across a
> real bug in pyliblzma. It doesn't apply to this test, but who knows
> what shortcuts it's taking.
>
> Finally, here's a baseline:
>
> m@air:~/q/topaz/parse_datalog$ time xzcat bigfile.xz | wc -l
> 102368
>
> real    0m0.034s
> user    0m0.024s
> sys     0m0.016s
>
> On Sat, May 18, 2013 at 12:46 PM, Nadeem Vawda <report@bugs.python.org> wrote:
>>
>> Nadeem Vawda added the comment:
>>
>> Have you tried running the benchmark against the default (3.4) branch?
>> There was some significant optimization work done in issue 16034, but
>> the changes were not backported to 3.3.
>>
>> ----------
>>
>> _______________________________________
>> Python tracker <report@bugs.python.org>
>> <http://bugs.python.org/issue18003>
>> _______________________________________
>
>
>
> --
>
> -
> Michael

-- 

-
Michael