Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove RLock from BZ2File #87951

Closed
methane opened this issue Apr 9, 2021 · 7 comments
Closed

Remove RLock from BZ2File #87951

methane opened this issue Apr 9, 2021 · 7 comments
Assignees
Labels
3.10 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@methane
Copy link
Member

methane commented Apr 9, 2021

BPO 43785
Nosy @gpshead, @methane, @animalize, @tirkarthi
PRs
  • bpo-43785: Improve BZ2File performance #25299
  • bpo-43785: Update bz2 document #25351
  • Files
  • dec.py
  • create.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/gpshead'
    closed_at = <Date 2021-04-13.23:23:08.068>
    created_at = <Date 2021-04-09.05:42:16.058>
    labels = ['library', '3.10', 'performance']
    title = 'Remove RLock from BZ2File'
    updated_at = <Date 2021-04-13.23:23:10.845>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2021-04-13.23:23:10.845>
    actor = 'methane'
    assignee = 'gregory.p.smith'
    closed = True
    closed_date = <Date 2021-04-13.23:23:08.068>
    closer = 'methane'
    components = ['Library (Lib)']
    creation = <Date 2021-04-09.05:42:16.058>
    creator = 'methane'
    dependencies = []
    files = ['49948', '49949']
    hgrepos = []
    issue_num = 43785
    keywords = ['patch']
    message_count = 7.0
    messages = ['390588', '390596', '390604', '390796', '390797', '390821', '391013']
    nosy_count = 4.0
    nosy_names = ['gregory.p.smith', 'methane', 'malin', 'xtreak']
    pr_nums = ['25299', '25351']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue43785'
    versions = ['Python 3.10']

    @methane
    Copy link
    Member Author

    methane commented Apr 9, 2021

    The original issue is reported here.
    https://discuss.python.org/t/non-optimal-bz2-reading-speed/6869

    1. Only BZ2File uses RLock()

    lzma and gzip don't use RLock(). It adds significant performance overhead.
    When I removed `with self._lock:`, decompression speed improved from about 148k line/sec to 200k line/sec.

    1. The default iter calls readline() for each iteration.

    BZ2File.readline() is implemented in C so it is slightly slow than C implementation.

    If I add this __iter__() to BZ2File, decompression speed improved from about 148k lines/sec (or 200k lines/sec) to 500k lines/sec.

        def __iter__(self):
            self._check_can_read()
            return iter(self._buffer)

    If this __iter__ method is safe, it can be added to gzip and lzma too.

    @methane methane added 3.10 only security fixes stdlib Python modules in the Lib dir performance Performance or resource usage labels Apr 9, 2021
    @methane
    Copy link
    Member Author

    methane commented Apr 9, 2021

    I will create a separated issue for __iter__, because it is same to gzip and lzma.

    @methane methane changed the title bz2 performance issue. Remove RLock from BZ2File Apr 9, 2021
    @methane methane changed the title bz2 performance issue. Remove RLock from BZ2File Apr 9, 2021
    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Apr 9, 2021

    This change is backwards incompatible, it may break some code silently.

    If someone really needs better performance, they can write a BZ2File class without RLock by themselves, it should be easy.

    FYI, zlib module was added in 1997, bz2 module was added in 2002, lzma module was added in 2011. (Just curious for these years)

    @gpshead
    Copy link
    Member

    gpshead commented Apr 11, 2021

    I'm not worried about compatibility on this for 3.10. Nobody realistically expects to be able to have multiple readers from a single BZ2File stream. They don't for the gzip or lzma equivalents. That isn't even a normal thing to do on an actual file. Lets go forward with this for 3.10betas and see if anyone even notices. I doubt they will.

    But the addition of __iter__ deferring to iter(self._buffer) belongs in its own PR and issue and should be done for all of bz2, gzip, and lzma all at once.

    I've edited the PR branch to remove __iter__ and cleanup a couple of minor nits.

    @gpshead
    Copy link
    Member

    gpshead commented Apr 11, 2021

    I see you're already tracking __iter__ in https://bugs.python.org/issue43787, perfect! :)

    @methane
    Copy link
    Member Author

    methane commented Apr 12, 2021

    New changeset cc2ffcd by Inada Naoki in branch 'master':
    bpo-43785: Improve BZ2File performance by removing RLock (GH-25299)
    cc2ffcd

    @methane
    Copy link
    Member Author

    methane commented Apr 13, 2021

    New changeset 695d47b by Inada Naoki in branch 'master':
    bpo-43785: Update bz2 document (GH-25351)
    695d47b

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants