classification
Title: Remove RLock from BZ2File
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: gregory.p.smith, malin, methane, xtreak
Priority: normal Keywords: patch

Created on 2021-04-09 05:42 by methane, last changed 2021-04-13 23:23 by methane. This issue is now closed.

Files
File name Uploaded Description Edit
dec.py methane, 2021-04-09 05:42
create.py methane, 2021-04-09 05:42
Pull Requests
URL Status Linked Edit
PR 25299 merged methane, 2021-04-09 05:52
PR 25351 merged methane, 2021-04-12 05:57
Messages (7)
msg390588 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-09 05:42
The original issue is reported here.
https://discuss.python.org/t/non-optimal-bz2-reading-speed/6869

1. Only BZ2File uses RLock()

lzma and gzip don't use RLock(). It adds significant performance overhead.
When I removed `with self._lock:`, decompression speed improved from about 148k line/sec to 200k line/sec.


2. The default __iter__ calls `readline()` for each iteration.

BZ2File.readline() is implemented in C so it is slightly slow than C implementation.

If I add this `__iter__()` to BZ2File, decompression speed improved from about 148k lines/sec (or 200k lines/sec) to 500k lines/sec.

    def __iter__(self):
        self._check_can_read()
        return iter(self._buffer)

If this __iter__ method is safe, it can be added to gzip and lzma too.
msg390596 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-09 08:07
I will create a separated issue for __iter__, because it is same to gzip and lzma.
msg390604 - (view) Author: Ma Lin (malin) * Date: 2021-04-09 10:21
This change is backwards incompatible, it may break some code silently.

If someone really needs better performance, they can write a BZ2File class without RLock by themselves, it should be easy.

FYI, zlib module was added in 1997, bz2 module was added in 2002, lzma module was added in 2011. (Just curious for these years)
msg390796 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2021-04-11 21:32
I'm not worried about compatibility on this for 3.10.  Nobody realistically expects to be able to have multiple readers from a single BZ2File stream.  They don't for the gzip or lzma equivalents.  That isn't even a normal thing to do on an actual file.  Lets go forward with this for 3.10betas and see if anyone even notices.  I doubt they will.

But the addition of __iter__ deferring to iter(self._buffer) belongs in its own PR and issue and should be done for all of bz2, gzip, and lzma all at once.

I've edited the PR branch to remove __iter__ and cleanup a couple of minor nits.
msg390797 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2021-04-11 21:35
I see you're already tracking __iter__ in https://bugs.python.org/issue43787, perfect! :)
msg390821 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-12 05:47
New changeset cc2ffcdfd78df3a18edae60df81b2f1b044b1634 by Inada Naoki in branch 'master':
bpo-43785: Improve BZ2File performance by removing RLock (GH-25299)
https://github.com/python/cpython/commit/cc2ffcdfd78df3a18edae60df81b2f1b044b1634
msg391013 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-04-13 23:23
New changeset 695d47b51e3e197de5448a1eb2f618bef6d59ac8 by Inada Naoki in branch 'master':
bpo-43785: Update bz2 document (GH-25351)
https://github.com/python/cpython/commit/695d47b51e3e197de5448a1eb2f618bef6d59ac8
History
Date User Action Args
2021-04-13 23:23:10methanesetmessages: + msg391013
2021-04-13 23:23:08methanesetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2021-04-12 05:57:40methanesetpull_requests: + pull_request24086
2021-04-12 05:47:00methanesetmessages: + msg390821
2021-04-11 21:35:59gregory.p.smithsetmessages: + msg390797
2021-04-11 21:32:40gregory.p.smithsetassignee: gregory.p.smith

messages: + msg390796
nosy: + gregory.p.smith
2021-04-09 10:21:20malinsetmessages: + msg390604
2021-04-09 08:07:30methanesetmessages: + msg390596
title: bz2 performance issue. -> Remove RLock from BZ2File
2021-04-09 07:30:17malinsetnosy: + malin
2021-04-09 07:05:55xtreaksetnosy: + xtreak
2021-04-09 05:52:46methanesetkeywords: + patch
stage: patch review
pull_requests: + pull_request24031
2021-04-09 05:42:37methanesetfiles: + create.py
type: performance
2021-04-09 05:42:16methanecreate