LZMA library sometimes fails to decompress a file #66071

Closed
vnummela mannequin opened this issue Jun 25, 2014 · 20 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@vnummela
Mannequin

vnummela mannequin commented Jun 25, 2014

BPO 21872
Nosy @gpshead, @4kir4, @peterjc, @serhiy-storchaka, @MojoVampire, @animalize, @miss-islington, @websurfer5
PRs
  • bpo-21872: fix lzma library decompresses data incompletely #14048
  • [3.8] bpo-21872: fix lzma library decompresses data incompletely (GH-14048) #16054
  • [3.7] bpo-21872: fix lzma library decompresses data incompletely (GH-14048) #16055
Files
  • Archive.zip: Example lzma-compressed files, a good one and a bad one
  • more_bad_lzma_files.zip: 15 more example files that fail lzma decompression
  • decompress-example-files.py
  • 02h_ticks.bi5: http://www.dukascopy.com/datafeed/EURUSD/2014/00/22/02h_ticks.bi5
  • failed_files_more.zip: 2 more failing files
  • fix-bug.diff
  • test_bad_files.py
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-09-12.15:25:19.616>
    created_at = <Date 2014-06-25.18:28:55.827>
    labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
    title = 'LZMA library sometimes fails to decompress a file'
    updated_at = <Date 2019-09-14.04:31:45.330>
    user = 'https://bugs.python.org/vnummela'

    bugs.python.org fields:

    activity = <Date 2019-09-14.04:31:45.330>
    actor = 'malin'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-09-12.15:25:19.616>
    closer = 'gregory.p.smith'
    components = ['Library (Lib)']
    creation = <Date 2014-06-25.18:28:55.827>
    creator = 'vnummela'
    dependencies = []
    files = ['35779', '35822', '37241', '40612', '47349', '48391', '48425']
    hgrepos = []
    issue_num = 21872
    keywords = ['patch']
    message_count = 20.0
    messages = ['221566', '221583', '221597', '221599', '221784', '222052', '231466', '231467', '251784', '309005', '344530', '344668', '345491', '345971', '345972', '352176', '352183', '352185', '352200', '352405']
    nosy_count = 13.0
    nosy_names = ['gregory.p.smith', 'nadeem.vawda', 'akira', 'maubp', 'serhiy.storchaka', 'Esa.Peuha', 'josh.r', 'malin', 'vnummela', 'kenorb', 'peremen', 'miss-islington', 'Jeffrey.Kintscher']
    pr_nums = ['14048', '16054', '16055']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'commit review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue21872'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @vnummela
    Mannequin Author

    vnummela mannequin commented Jun 25, 2014

    Python lzma library sometimes fails to decompress a file, even though the file does not appear to be corrupt.

    Originally discovered with OS X 10.9 / Python 2.7.7 / backports.lzma.
    Now also reproduced on OS X / Python 3.4 / lzma; please see
    peterjc/backports.lzma#6 for more details.

    Two example files are provided, a good one and a bad one. Both are compressed using the older lzma algorithm (not xz). An attempt to decompress the 'bad' file raises "EOFError: Compressed file ended before the end-of-stream marker was reached."

    The 'bad' file appears to be ok, because

    • a direct call to XZ Utils processes the files without complaints
    • the decompressed files' contents appear to be ok.

    The example files contain tick data and have been downloaded from the Dukascopy bank's historical data feed service. The service is well known for its high data quality and is used by multiple analysis software platforms. I therefore think it is unlikely that a file integrity issue on their end would have gone unnoticed.

    The error occurs relatively rarely; only around 1 - 5 times per 1000 downloaded files.

    @vnummela vnummela mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jun 25, 2014
    @MojoVampire
    Mannequin

    MojoVampire mannequin commented Jun 25, 2014

    Just to be clear, when you say "1 - 5 times per 1000 downloaded files", have you confirmed that redownloading the same file a second time produces the same error? Just making sure we've ruled out corruption during transfer over the network; small errors might make it past one decompressor with minimal effect in the midst of a huge data file, while a more stringent error checking decompressor would reject them.

    @serhiy-storchaka
    Member

    >>> import lzma
    >>> f = lzma.open('22h_ticks_bad.bi5')
    >>> len(f.read())
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/serhiy/py/cpython/Lib/lzma.py", line 310, in read
        return self._read_all()
      File "/home/serhiy/py/cpython/Lib/lzma.py", line 251, in _read_all
        while self._fill_buffer():
      File "/home/serhiy/py/cpython/Lib/lzma.py", line 225, in _fill_buffer
        raise EOFError("Compressed file ended before the "
    EOFError: Compressed file ended before the end-of-stream marker was reached

    This is similar to bpo-1159051. We need a way to say "read as much as possible without error and raise EOFError only on next read".
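The behaviour asked for above can be approximated from user code by reading in small blocks and keeping whatever was decoded before the error. This is only a sketch: `read_tolerant` is a hypothetical helper, not part of the lzma module, and the block size is an arbitrary choice.

```python
import lzma


def read_tolerant(path, block_size=4096):
    """Return (data, complete).

    data is everything decoded before any EOFError; complete tells
    whether the end-of-stream marker was actually reached.
    """
    parts = []
    complete = True
    with lzma.open(path, "rb") as f:
        try:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                parts.append(block)
        except EOFError:
            # The input ran out before the decompressor reported end of stream.
            complete = False
    return b"".join(parts), complete
```

A caller can then decide whether partial data is acceptable, for example by comparing `len(data)` with a size known from elsewhere.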

    @vnummela
    Mannequin Author

    vnummela mannequin commented Jun 26, 2014

    My stats so far:

    As of writing this, I have attempted to decompress about 5000 downloaded files (two years of tick data). 25 'bad' files were found within this lot.

    I re-downloaded all of them, plus about 500 other files, because the minimum lot the server supplies is 24 hours, i.e. 24 files, at a time.

    I compared all these 528 file pairs using hashlib.md5 and got identical hashes for all of them.
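A minimal sketch of such a pairwise comparison, assuming the two downloads of each file are stored under separate paths. The helper names `file_md5` and `same_content` are made up for illustration and are not taken from the script actually used.

```python
import hashlib


def file_md5(path, block_size=1 << 16):
    """Return the hex MD5 digest of a file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files hash to the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

Identical digests for both downloads make corruption during transfer an unlikely explanation, which is the point of the check.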

    I guess what I should do next is to go through the decompressed data and look for suspicious anomalies, but unfortunately I don't have the tools in place to do that quite yet.

    @EsaPeuha
    Mannequin

    EsaPeuha mannequin commented Jun 28, 2014

    This code

    import _lzma
    with open('22h_ticks_bad.bi5', 'rb') as f:
        infile = f.read()
    for i in range(8191, 8195):
        decompressor = _lzma.LZMADecompressor()
        first_out = decompressor.decompress(infile[:i])
        first_len = len(first_out)
        last_out = decompressor.decompress(infile[i:])
        last_len = len(last_out)
        print(i, first_len, first_len + last_len, decompressor.eof)

    prints this

    8191 36243 45480 True
    8192 36251 45473 False
    8193 36253 45475 False
    8194 36260 45480 True

    It seems to me that this is a subtle bug in liblzma; if the input stream to the incremental decompressor is broken at the wrong place, the internal state of the decompressor is corrupted. For this particular file, it happens when the break occurs after reading 8192 or 8193 bytes, and lzma.py happens to use a buffer of 8192 bytes. There is nothing wrong with the compressed file, since lzma.py decompresses it correctly if the buffer size is set to almost any other value.
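For readers without the example files, the same experiment can be run against a stream produced by the lzma module itself. This is a sketch using the public `lzma.LZMADecompressor` rather than `_lzma`; `failing_splits` is a hypothetical helper name. A stream generated this way is not expected to reproduce the failure, so the sketch mainly shows the shape of the test.

```python
import lzma


def failing_splits(stream, fmt=lzma.FORMAT_ALONE):
    """Return the split offsets at which feeding the stream in two
    pieces does not reach the end of the stream."""
    failures = []
    for i in range(1, len(stream)):
        d = lzma.LZMADecompressor(format=fmt)
        d.decompress(stream[:i])
        if not d.eof:
            # Feed the remainder only if the first piece did not finish.
            d.decompress(stream[i:])
        if not d.eof:
            failures.append(i)
    return failures


if __name__ == "__main__":
    payload = b"tick data " * 2000
    stream = lzma.compress(payload, format=lzma.FORMAT_ALONE)
    print(failing_splits(stream))
```

To test a real file, read its bytes and pass them as `stream`.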

    @vnummela
    Mannequin Author

    vnummela mannequin commented Jul 1, 2014

    Uploading a few more 'bad' lzma files for testing.

    @4kir4
    Mannequin

    4kir4 mannequin commented Nov 21, 2014

    @esa changing the buffer size helps with some "bad" files
    but lzma module still fails on some files.

    I've uploaded decompress-example-files.py script that demonstrates it.

    @4kir4
    Mannequin

    4kir4 mannequin commented Nov 21, 2014

    If lzma._BUFFER_SIZE is less than 2048 then all example files are
    decompressed successfully (at least lzma module produces the same
    results as xz utility)

    @kenorb
    Mannequin

    kenorb mannequin commented Sep 28, 2015

    The same happens with the attached file. It fails with Python 3.5 (small buffers like 128, 255, 1023, etc.), but it seems to work in Python 3.4 with lzma._BUFFER_SIZE = 1023. So it looks like something regressed.

    @peremen
    Mannequin

    peremen mannequin commented Dec 24, 2017

    Hi, I think I encountered this bug with Ubuntu 17.10 / Python 3.6.3. The same error was raised by Python's LZMA library, while the xz command line tool can extract the problematic file. I am not sure whether the bug is present in 3.7/3.8. I am attaching the problematic archives; they should contain UTF-16LE encoded text.

    @websurfer5
    Mannequin

    websurfer5 mannequin commented Jun 4, 2019

    I adapted the example in msg221784:

    import _lzma

    with open('22h_ticks_bad.bi5', 'rb') as f:
        infile = f.read()

    # Split the input at every offset and feed it to a fresh decompressor in two calls.
    for i in range(1, 9000):
        decompressor = _lzma.LZMADecompressor()
        first_out = decompressor.decompress(infile[:i])
        first_len = len(first_out)
        last_out = decompressor.decompress(infile[i:])
        last_len = len(last_out)
        # Report only the split points where the end of the stream was not reached.
        if not decompressor.eof:
            print(i, first_len, first_len + last_len, decompressor.eof)

    which outputs this using both 3.7.3 and 3.8.0a3+ on macOS 10.14.4:

    648 2682 45479 False
    1834 7442 45479 False
    2766 11667 45473 False
    2767 11668 45474 False
    3591 15428 45473 False
    5051 21743 45473 False
    5052 21745 45475 False
    5589 24387 45475 False
    5590 24388 45476 False
    6560 28823 45476 False
    6561 28824 45477 False
    7327 32325 45474 False
    8192 36251 45473 False
    8193 36253 45475 False
    8368 37283 45475 False
    8369 37285 45477 False

    So, yes, still an active bug.

    @websurfer5 websurfer5 mannequin added the 3.7 (EOL) end of life label Jun 4, 2019
    @animalize
    Mannequin

    animalize mannequin commented Jun 5, 2019

    fix-bug.diff fixes this bug; I will submit a PR after thoroughly understanding the problem.

    @animalize
    Mannequin

    animalize mannequin commented Jun 13, 2019

    I wrote a review guide in PR 14048.

    @animalize animalize mannequin added 3.8 only security fixes 3.9 only security fixes labels Jun 13, 2019
    @animalize
    Mannequin

    animalize mannequin commented Jun 18, 2019

    I investigated this problem.

    Here are the toggle conditions:

    • The format is FORMAT_ALONE, the legacy .lzma container format.
    • The file's header records "Uncompressed Size".
    • The file doesn't have "End of Payload Marker" or "End of Stream Marker".

    Otherwise, liblzma's internal state doesn't hold any bytes that can be output.
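Whether a given file records an uncompressed size can be checked by decoding the 13-byte .lzma header (1 properties byte, a 4-byte little-endian dictionary size, an 8-byte little-endian uncompressed size where all bits set means "unknown"). The sketch below is illustrative only; `parse_alone_header` is a hypothetical helper, not an lzma module API.

```python
import lzma
import struct

UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF  # header value meaning "size not recorded"


def parse_alone_header(stream):
    """Decode the 13-byte header of a legacy .lzma (FORMAT_ALONE) stream."""
    if len(stream) < 13:
        raise ValueError("a .lzma header is 13 bytes long")
    props, dict_size, size = struct.unpack("<BIQ", stream[:13])
    return {
        "properties": props,
        "dict_size": dict_size,
        # None means the size is unknown, so the stream has to end with a marker.
        "uncompressed_size": None if size == UNKNOWN_SIZE else size,
    }


if __name__ == "__main__":
    stream = lzma.compress(b"tick data " * 100, format=lzma.FORMAT_ALONE)
    print(parse_alone_header(stream))
```

A file matching the conditions listed above would show a concrete number in `uncompressed_size`.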

    The good news is:

    • The lzma module's default compression format is FORMAT_XZ, not FORMAT_ALONE.
    • Even FORMAT_ALONE files generated by the lzma module (the underlying xz library) always have an "End of Payload Marker".
    • The FORMAT_ALONE format may be falling out of use.

    The attached file test_bad_files.py tests the DecompressReader.read(size=-1) function [1] with different max_length values (from -1 to 1000, excluding 0) to ensure that the needs_input mechanism works properly.
    Usage: set the DIR variable to the folder containing the bad files.

    [1] https://github.com/python/cpython/blob/v3.8.0b1/Lib/_compression.py#L72-L111
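The needs_input mechanism can be exercised with a small loop that caps each call's output at max_length. This is a rough sketch in the spirit of that test, not the attached test_bad_files.py; `read_with_limit` and the chunk size are assumptions made for illustration.

```python
import lzma


def read_with_limit(stream, max_length, chunk_size=64):
    """Decompress stream while asking for at most max_length bytes per call
    (max_length must be -1 or positive)."""
    if max_length == 0:
        raise ValueError("max_length must be -1 or positive")
    d = lzma.LZMADecompressor()
    out = []
    pos = 0
    while not d.eof:
        if d.needs_input:
            chunk = stream[pos:pos + chunk_size]
            pos += len(chunk)
            if not chunk:
                raise EOFError("input ended before the end-of-stream marker")
        else:
            # The decompressor can still produce output without new input.
            chunk = b""
        out.append(d.decompress(chunk, max_length))
    return b"".join(out)


if __name__ == "__main__":
    payload = b"tick data " * 500
    stream = lzma.compress(payload)
    for limit in (-1, 1, 7, 1000):
        print(limit, read_with_limit(stream, limit) == payload)
```

For the bad files, the stream would be the file's bytes and the result would be compared against the output of the xz tool.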

    @animalize
    Mannequin

    animalize mannequin commented Jun 18, 2019

    toggle conditions -> trigger conditions

    @gpshead
    Member

    gpshead commented Sep 12, 2019

    New changeset 4ffd05d by Gregory P. Smith (animalize) in branch 'master':
    bpo-21872: fix lzma library decompresses data incompletely (GH-14048)
    4ffd05d

    @miss-islington
    Contributor

    New changeset 824407f by Miss Islington (bot) in branch '3.8':
    bpo-21872: fix lzma library decompresses data incompletely (GH-14048)
    824407f

    @miss-islington
    Contributor

    New changeset a3c53a1 by Miss Islington (bot) in branch '3.7':
    bpo-21872: fix lzma library decompresses data incompletely (GH-14048)
    a3c53a1

    @gpshead
    Member

    gpshead commented Sep 12, 2019

    thanks!

    @gpshead gpshead closed this as completed Sep 12, 2019
    @animalize
    Mannequin

    animalize mannequin commented Sep 14, 2019

    Some memos:

    1. In liblzma, the missing bytes are copied inside the dict_repeat function:

    case SEQ_COPY:
    // Repeat len bytes from distance of rep0.
    if (unlikely(dict_repeat(&dict, rep0, &len))) {

    See liblzma's source code (xz-5.2 branch):
    https://git.tukaani.org/?p=xz.git;a=blob;f=src/liblzma/lzma/lzma_decoder.c

    2. The replies above noted that the xz command line tool can extract the problematic files successfully.

    This is because xz checks if (avail_out == 0) first, then checks if (avail_in == 0).
    See the uncompress function in this source code (xz-5.2 branch):
    https://git.tukaani.org/?p=xz.git;a=blob;f=src/xzdec/xzdec.c;hb=refs/heads/v5.2

    This check order happens to avoid the problem.
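Why the check order matters can be seen in a toy model. Everything below is hypothetical: `ToyDecoder` is not liblzma, it simply expands every input byte into three output bytes so that decoded bytes can remain inside the decoder after all input is consumed. Only the names avail_in and avail_out are borrowed from the code discussed above.

```python
class ToyDecoder:
    """Hypothetical stand-in for a streaming decoder (not liblzma)."""

    def __init__(self):
        self.pending = bytearray()  # decoded bytes not yet handed out

    def code(self, inp, out_room):
        # Consume all available input, then emit at most out_room bytes.
        for byte in inp:
            self.pending.extend([byte] * 3)
        inp.clear()
        chunk = bytes(self.pending[:out_room])
        del self.pending[:out_room]
        return chunk


def drain(data, out_room, check_output_first):
    """Run the decode loop with one of the two check orders."""
    if out_room <= 0:
        raise ValueError("out_room must be positive")
    decoder = ToyDecoder()
    inp = bytearray(data)
    out = bytearray()
    while True:
        chunk = decoder.code(inp, out_room)
        out += chunk
        avail_in = len(inp)
        avail_out = out_room - len(chunk)
        if check_output_first:
            if avail_out == 0:
                continue  # output buffer was filled; more may be waiting
            if avail_in == 0:
                break
        else:
            if avail_in == 0:
                break  # stops here even if decoded bytes are still pending
            if avail_out == 0:
                continue
    return bytes(out)


if __name__ == "__main__":
    print(drain(b"ab", 4, check_output_first=True))
    print(drain(b"ab", 4, check_output_first=False))
```

In this model, checking avail_in first ends the loop as soon as the input is used up and drops the bytes still pending, while checking avail_out first keeps calling the decoder until it returns a short buffer.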

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022