TarFile expose copyfileobj bufsize to improve throughput #71386

Closed
fried mannequin opened this issue Jun 3, 2016 · 4 comments
Assignees: ambv
Labels: performance (Performance or resource usage), stdlib (Python modules in the Lib dir)

Comments

@fried
Mannequin

fried mannequin commented Jun 3, 2016

BPO 27199
Nosy @gustaebel, @asvetlov, @ambv, @fried
Files
  • buftest.py: test file to generate two random tar files and test extraction time improvements
  • copybufsize.patch: patch to expose the copy buffer size
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ambv'
    closed_at = <Date 2016-09-10.02:51:11.259>
    created_at = <Date 2016-06-03.18:55:37.748>
    labels = ['library', 'performance']
    title = 'TarFile expose copyfileobj bufsize to improve throughput'
    updated_at = <Date 2016-09-10.02:51:11.258>
    user = 'https://github.com/fried'

    bugs.python.org fields:

    activity = <Date 2016-09-10.02:51:11.258>
    actor = 'lukasz.langa'
    assignee = 'lukasz.langa'
    closed = True
    closed_date = <Date 2016-09-10.02:51:11.259>
    closer = 'lukasz.langa'
    components = ['Library (Lib)']
    creation = <Date 2016-06-03.18:55:37.748>
    creator = 'fried'
    dependencies = []
    files = ['43158', '43159']
    hgrepos = []
    issue_num = 27199
    keywords = ['patch']
    message_count = 4.0
    messages = ['267134', '268234', '275546', '275547']
    nosy_count = 5.0
    nosy_names = ['lars.gustaebel', 'asvetlov', 'lukasz.langa', 'python-dev', 'fried']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue27199'
    versions = ['Python 3.6']

    @fried
    Mannequin Author

    fried mannequin commented Jun 3, 2016

    The default copy buffer size of 16 KB, while good for memory usage, is not well suited to all cases. When we increased it to 4 MB, we saw a pretty large improvement in tar file creation and extraction times on Linux servers.

    For a 1 GB tar file containing 1024 random files, each 10 MB in size:
    Time Delta for TarFile: 146.3240258693695
    Time Delta for FastTarFile 4MB copybufsize: 102.76440262794495
    Time Diff: 43.55962324142456 0.2976928975444698

    (The last line is the absolute difference, followed by that difference as a fraction of the TarFile time: roughly a 30% reduction.)
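For readers arriving after the fix landed in Python 3.6: the buffer size can be set with the `copybufsize` keyword, which `tarfile.open()` forwards to the `TarFile` constructor. The following is a minimal sketch, not the attached `buftest.py`. The helper name `roundtrip`, the file names, and the payload size are illustrative assumptions, and the 4 MB value is simply the one tried in this issue.

```python
import os
import tarfile
import tempfile

# 4 MB, the value tried in this issue; the best size depends on the workload.
COPY_BUFSIZE = 4 * 1024 * 1024


def roundtrip(payload_size=1024 * 1024):
    """Archive one random file, extract it, and report whether it survived intact."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "payload.bin")      # hypothetical file names
        archive = os.path.join(tmp, "sample.tar")
        dest = os.path.join(tmp, "out")

        data = os.urandom(payload_size)
        with open(src, "wb") as f:
            f.write(data)

        # Creation: copybufsize controls the chunk size used when copying
        # file contents into the archive.
        with tarfile.open(archive, "w", copybufsize=COPY_BUFSIZE) as tar:
            tar.add(src, arcname="payload.bin")

        # The "data" extraction filter only exists on newer or patched Pythons,
        # so pass it only when the running interpreter supports it.
        extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

        # Extraction: the same keyword controls the chunk size when copying
        # member contents out of the archive.
        with tarfile.open(archive, "r", copybufsize=COPY_BUFSIZE) as tar:
            tar.extractall(dest, **extra)

        with open(os.path.join(dest, "payload.bin"), "rb") as f:
            return f.read() == data


if __name__ == "__main__":
    print(roundtrip())
```

Whether a larger buffer helps, and by how much, depends on the storage and the file sizes involved, so it is worth timing against real data before settling on a value.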

    @fried fried mannequin added the stdlib (Python modules in the Lib dir) and performance (Performance or resource usage) labels Jun 3, 2016
    @ambv
    Contributor

    ambv commented Jun 11, 2016

    New feature -> 3.6.

    @ambv ambv self-assigned this Jun 11, 2016
    @python-dev
    Mannequin

    python-dev mannequin commented Sep 10, 2016

    New changeset 0bac85e355b5 by Łukasz Langa in branch 'default':
    Issue bpo-27199: TarFile expose copyfileobj bufsize to improve throughput
    https://hg.python.org/cpython/rev/0bac85e355b5

    @ambv
    Contributor

    ambv commented Sep 10, 2016

    Thanks for the patch!

    @ambv ambv closed this as completed Sep 10, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    bdraco added a commit to bdraco/securetar that referenced this issue Mar 30, 2023
    cpython uses copyfileobj under the hood for fast copies,
    but the default buffer size is quite low, which increases
    the amount of time spent in Python code when copying the sqlite
    database. As this is usually the bulk of the backup,
    increasing the buffer can help reduce the backup time
    quite a bit
    
    related:
    python/cpython#71386
    pvizeli pushed a commit to pvizeli/securetar that referenced this issue Mar 31, 2023
    * Make bufsize adjustable
    
    cpython uses copyfileobj under the hood for fast copies,
    but the default buffer size is quite low, which increases
    the amount of time spent in Python code when copying the sqlite
    database. As this is usually the bulk of the backup,
    increasing the buffer can help reduce the backup time
    quite a bit
    
    related:
    python/cpython#71386
    
    * coverage
    bdraco added a commit to bdraco/supervisor that referenced this issue Apr 6, 2023
    This is the same change as home-assistant/core#90613
    but for supervisor
    
    If the backup takes too long, core will release the lock on the database
    and the backup will be no good
    
    https://github.com/home-assistant/core/blob/2fc34e7cced87a8e042919e059d3a07bb760c77f/homeassistant/components/recorder/core.py#L926
    
    cpython uses copyfileobj under the hood for fast copies, but the default buffer size is quite low, which increases the amount of time spent in Python code when copying the sqlite database. As this is usually the bulk of the backup, increasing the buffer can help reduce the backup time quite a bit.
    
    Ideally this would all use sendfile under the hood as it would shift nearly all the burden out of userspace but tarfile doesn't currently try that https://github.com/python/cpython/blob/4664a7cf689946f0c9854cadee7c6aa9c276a8cf/Lib/shutil.py#L106
    
    related:
    In testing (non encrypted) improvement was at least as good as python/cpython#71386
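As a rough illustration of why a small buffer means more time in Python code, the sketch below counts how many `read()` calls a chunked copy loop makes for a given buffer size. It is not code from securetar or supervisor. `shutil.copyfileobj` and its `length` argument stand in for any chunked copy loop, and `CountingReader` and `count_reads` are hypothetical helpers introduced only for this example.

```python
import io
import shutil


class CountingReader(io.BytesIO):
    """In-memory file that counts how many read() calls the copy loop makes."""

    def __init__(self, data):
        super().__init__(data)
        self.reads = 0

    def read(self, size=-1):
        self.reads += 1
        return super().read(size)


def count_reads(data, length):
    """Copy data through shutil.copyfileobj and return the number of read() calls."""
    src = CountingReader(data)
    dst = io.BytesIO()
    shutil.copyfileobj(src, dst, length)
    assert dst.getvalue() == data  # the copy itself must be lossless
    return src.reads


if __name__ == "__main__":
    payload = b"x" * (1024 * 1024)  # 1 MiB of dummy data
    print(count_reads(payload, 16 * 1024))        # small buffer
    print(count_reads(payload, 4 * 1024 * 1024))  # large buffer
```

With a 1 MiB payload, the 16 KiB buffer needs 65 reads (64 full chunks plus the empty read that signals end of file), while the 4 MiB buffer needs 2. Each read is one trip through the interpreter loop, which is the overhead the larger buffer reduces.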
    pvizeli pushed a commit to home-assistant/supervisor that referenced this issue Apr 19, 2023
    * Speed up backups by increasing buffer size
    
    This is the same change as home-assistant/core#90613
    but for supervisor
    
    If the backup takes too long, core will release the lock on the database
    and the backup will be no good
    
    https://github.com/home-assistant/core/blob/2fc34e7cced87a8e042919e059d3a07bb760c77f/homeassistant/components/recorder/core.py#L926
    
    cpython uses copyfileobj under the hood for fast copies, but the default buffer size is quite low, which increases the amount of time spent in Python code when copying the sqlite database. As this is usually the bulk of the backup, increasing the buffer can help reduce the backup time quite a bit.
    
    Ideally this would all use sendfile under the hood as it would shift nearly all the burden out of userspace but tarfile doesn't currently try that https://github.com/python/cpython/blob/4664a7cf689946f0c9854cadee7c6aa9c276a8cf/Lib/shutil.py#L106
    
    related:
    In testing (non encrypted) improvement was at least as good as python/cpython#71386
    
    * add the const