Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tarfile stream read performance regression #78191

Closed
hajoscher mannequin opened this issue Jun 30, 2018 · 10 comments
Closed

tarfile stream read performance regression #78191

hajoscher mannequin opened this issue Jun 30, 2018 · 10 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir

Comments

@hajoscher
Copy link
Mannequin

hajoscher mannequin commented Jun 30, 2018

BPO 34010
Nosy @vstinner, @methane, @miss-islington, @hajoscher
PRs
  • bpo-34010: Fix tarfile read performance regression #8020
  • [3.7] bpo-34010: Fix tarfile read performance regression (GH-8020) #8082
  • [3.6] bpo-34010: Fix tarfile read performance regression (GH-8020) #8083
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-07-04.08:25:19.312>
    created_at = <Date 2018-06-30.09:27:00.578>
    labels = ['3.8', '3.7', 'library', 'performance']
    title = 'tarfile stream read performance regression'
    updated_at = <Date 2018-07-04.12:27:59.474>
    user = 'https://github.com/hajoscher'

    bugs.python.org fields:

    activity = <Date 2018-07-04.12:27:59.474>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-07-04.08:25:19.312>
    closer = 'methane'
    components = ['Library (Lib)']
    creation = <Date 2018-06-30.09:27:00.578>
    creator = 'hajoscher'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 34010
    keywords = ['patch', '3.2regression']
    message_count = 10.0
    messages = ['320763', '320848', '321011', '321018', '321020', '321021', '321022', '321023', '321040', '321041']
    nosy_count = 4.0
    nosy_names = ['vstinner', 'methane', 'miss-islington', 'hajoscher']
    pr_nums = ['8020', '8082', '8083']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue34010'
    versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

    @hajoscher
    Copy link
    Mannequin Author

    hajoscher mannequin commented Jun 30, 2018

    Buffer read of large files in a compressed tarfile stream performs poorly.

    The buffered read in tarfile _Stream is extending a bytes object.
    It is much more efficient to use a list followed by a join.
    Using a list can mean seconds instead of minutes.

    This performance regression was introduced in b506dc3.

    How to test:
    # create random tarfile 50Mb
    dd if=/dev/urandom of=test.bin count=50 bs=1M
    tar czvf test.tgz test.bin

    # read with tarfile as stream (note pipe symbol in 'r|gz')
    import tarfile
    tfile = tarfile.open("test.tgz", 'r|gz')
    for t in tfile:
        file = tfile.extractfile(t)
        if file:
            print(len(file.read()))

    @hajoscher hajoscher mannequin added 3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir performance Performance or resource usage labels Jun 30, 2018
    @methane
    Copy link
    Member

    methane commented Jul 1, 2018

    Nice catch.

    I confirmed this is a hard regression of performance.
    Decompressing a file must be O(n) when n=filesize, but O(n^2) now.

    While we live with this regression for a long time, I feel it's worth enough to backport.
    This can be DoS vulnerability.

    Can you write NEWS entry for it?

    @hajoscher
    Copy link
    Mannequin Author

    hajoscher mannequin commented Jul 4, 2018

    Yes, it performance is really bad for large files, and memory consumption as well. I will write something for NEWS.

    @methane methane changed the title tarfile stream read performance tarfile stream read performance regression Jul 4, 2018
    @methane
    Copy link
    Member

    methane commented Jul 4, 2018

    New changeset 12a08c4 by INADA Naoki (hajoscher) in branch 'master':
    bpo-34010: Fix tarfile read performance regression (GH-8020)
    12a08c4

    @methane
    Copy link
    Member

    methane commented Jul 4, 2018

    thanks

    @methane methane closed this as completed Jul 4, 2018
    @vstinner
    Copy link
    Member

    vstinner commented Jul 4, 2018

    https://github.com/python/cpython/pull/8020/files/77a54a39aace1a38794884218abe801b85b54e62#diff-ef64d8b610dda67977a63a9837f46349

    •        buf = "".join(t)
      

    + buf = b"".join(t)

    @hajoscher: "It never caused a problem, since this line is never called; size is never None in the function call. But still, should be fixed, I guess."

    Would it be possible to have an unit test for this modified line? Untested code is broken, as you showed :-)

    @miss-islington
    Copy link
    Contributor

    New changeset c1b75b5 by Miss Islington (bot) in branch '3.7':
    bpo-34010: Fix tarfile read performance regression (GH-8020)
    c1b75b5

    @miss-islington
    Copy link
    Contributor

    New changeset d7a0ad7 by Miss Islington (bot) in branch '3.6':
    bpo-34010: Fix tarfile read performance regression (GH-8020)
    d7a0ad7

    @methane
    Copy link
    Member

    methane commented Jul 4, 2018

    @victor I think removing unused code is better than adding test for it and maintain it.

    So I removed that unused code block in #52336.

    @vstinner
    Copy link
    Member

    vstinner commented Jul 4, 2018

    @victor I think removing unused code is better than adding test for it and maintain it.

    Sure. I reviewed your PR 8089.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants