LZMA library sometimes fails to decompress a file #66071
The Python lzma library sometimes fails to decompress a file, even though the file does not appear to be corrupt. Originally discovered on OS X 10.9 with Python 2.7.7 and backports.lzma. Two example files are provided, a good one and a bad one. Both are compressed with the older LZMA algorithm (not xz). An attempt to decompress the 'bad' file raises "EOFError: Compressed file ended before the end-of-stream marker was reached." The 'bad' file nevertheless appears to be intact: other tools decompress it without complaint.
The example files contain tick data and were downloaded from the Dukascopy bank's historical data feed service. The service is well known for its high data quality and is used by multiple analysis software platforms, so I think it is unlikely that a file integrity issue on their end would have gone unnoticed. The error occurs relatively rarely; only around 1-5 times per 1000 downloaded files.
Just to be clear, when you say "1-5 times per 1000 downloaded files", have you confirmed that redownloading the same file a second time produces the same error? Just making sure we've ruled out corruption during transfer over the network; small errors might make it past one decompressor with minimal effect in the midst of a huge data file, while a decompressor with more stringent error checking would reject them.
>>> import lzma
>>> f = lzma.open('22h_ticks_bad.bi5')
>>> len(f.read())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/serhiy/py/cpython/Lib/lzma.py", line 310, in read
return self._read_all()
File "/home/serhiy/py/cpython/Lib/lzma.py", line 251, in _read_all
while self._fill_buffer():
File "/home/serhiy/py/cpython/Lib/lzma.py", line 225, in _fill_buffer
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached This is similar to bpo-1159051. We need a way to say "read as much as possible without error and raise EOFError only on next read". |
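One possible shape for such an API (a sketch only, not the stdlib's actual behavior; the function name and the 8192-byte chunk size are illustrative): return whatever data could be recovered together with a truncation flag, deferring the error decision to the caller.

```python
import lzma

def read_available(path, chunk_size=8192):
    """Decompress as much as possible; instead of raising mid-read,
    report truncation via a flag (hypothetical helper)."""
    out = []
    d = lzma.LZMADecompressor()
    with open(path, 'rb') as f:
        while not d.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                # Compressed input ended before the end-of-stream
                # marker: hand back the partial data plus a flag.
                return b''.join(out), True
            out.append(d.decompress(chunk))
    return b''.join(out), False
```

A caller can then decide whether a truncated result is usable or should be treated as an error.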
My stats so far: as of writing this, I have attempted to decompress about 5000 downloaded files (two years of tick data). 25 'bad' files were found in this lot. I re-downloaded all of them, plus about 500 other files, since the minimum lot the server supplies is 24 hours of files at a time. I compared all 528 file pairs using hashlib.md5 and got identical hashes for every pair. I guess what I should do next is go through the decompressed data and look for suspicious anomalies, but unfortunately I don't have the tools in place to do that quite yet.
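The redownload comparison described above can be sketched like this (the names md5sum and same_content are illustrative, not from the report):

```python
import hashlib

def md5sum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks
    so large tick-data files don't need to fit in memory."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def same_content(path_a, path_b):
    """True if the original download and the redownload hash identically."""
    return md5sum(path_a) == md5sum(path_b)
```

Identical hashes across all pairs make in-transit corruption an unlikely explanation.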
This code

import _lzma

with open('22h_ticks_bad.bi5', 'rb') as f:
    infile = f.read()
for i in range(8191, 8195):
    decompressor = _lzma.LZMADecompressor()
    first_out = decompressor.decompress(infile[:i])
    first_len = len(first_out)
    last_out = decompressor.decompress(infile[i:])
    last_len = len(last_out)
    print(i, first_len, first_len + last_len, decompressor.eof)

prints this:

8191 36243 45480 True

It seems to me that this is a subtle bug in liblzma: if the input stream to the incremental decompressor is broken at the wrong place, the internal state of the decompressor is corrupted. For this particular file, it happens when the break occurs after reading 8192 or 8193 bytes, and lzma.py happens to use a buffer of 8192 bytes. There is nothing wrong with the compressed file, since lzma.py decompresses it correctly if the buffer size is set to almost any other value.
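Since the failure depends on where the incremental input happens to be split, one workaround sketch is to feed the whole compressed payload to the decompressor in a single call, so that no split point ever falls inside the stream (the function name is illustrative; lzma.LZMADecompressor() defaults to FORMAT_AUTO, which detects both .xz and the legacy 'alone' LZMA container these files use):

```python
import lzma

def decompress_whole(path):
    """Read the entire compressed file and hand it to liblzma
    in one call, avoiding any mid-stream split point."""
    with open(path, 'rb') as f:
        data = f.read()
    d = lzma.LZMADecompressor()  # FORMAT_AUTO: .xz and legacy .lzma
    return d.decompress(data)
```

This trades memory for robustness and is only practical for files that fit comfortably in RAM.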
Uploading a few more 'bad' lzma files for testing.
@esa, changing the buffer size helps with some "bad" files. I've uploaded a decompress-example-files.py script that demonstrates it.
If lzma._BUFFER_SIZE is less than 2048, then all example files are decompressed successfully.
The same with this attached file. It fails with Python 3.5 (with small buffers like 128, 255, 1023, etc.), but it seems to work in Python 3.4 with lzma._BUFFER_SIZE = 1023. So it looks like something has regressed.
Hi, I think I encountered this bug with Ubuntu 17.10 / Python 3.6.3. The same error was triggered by Python's lzma library, while the xz command-line tool can extract the problematic file. Not sure whether the bug exists in 3.7/3.8. I am attaching the problematic archives; they should contain UTF-16LE encoded text.
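Since the xz command-line tool handles these files, a pragmatic stopgap is to shell out to it when the stdlib raises EOFError. A sketch (assumes an `xz` binary on PATH; robust_decompress is an illustrative name, not an API from this thread):

```python
import lzma
import subprocess

def robust_decompress(path):
    """Try the stdlib first; fall back to the xz CLI on EOFError."""
    try:
        with lzma.open(path) as f:
            return f.read()
    except EOFError:
        # The xz tool extracts these problematic files successfully,
        # as noted elsewhere in this thread.
        result = subprocess.run(
            ['xz', '--decompress', '--stdout', path],
            check=True, capture_output=True)
        return result.stdout
```

This is a workaround for affected deployments, not a fix for the underlying bug.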
I adapted the example in msg221784:

import _lzma

with open('22h_ticks_bad.bi5', 'rb') as f:
    infile = f.read()
for i in range(1, 9000):
    decompressor = _lzma.LZMADecompressor()
    first_out = decompressor.decompress(infile[:i])
    first_len = len(first_out)
    last_out = decompressor.decompress(infile[i:])
    last_len = len(last_out)
    if not decompressor.eof:
        print(i, first_len, first_len + last_len, decompressor.eof)

which outputs this using both 3.7.3 and 3.8.0a3+ on macOS 10.14.4:

648 2682 45479 False

So, yes, still an active bug.
fix-bug.diff fixes this bug, I will submit a PR after thoroughly understanding the problem. |
I wrote a review guide in PR 14048. |
I investigated this problem. Here are the toggle conditions:
Otherwise, liblzma's internal state doesn't hold any bytes that can be output. Good news is:
Attached file test_bad_files.py for testing. [1] https://github.com/python/cpython/blob/v3.8.0b1/Lib/_compression.py#L72-L111
toggle conditions -> trigger conditions |
thanks! |
Some memos:

1. In liblzma, the missing bytes were copied inside `case SEQ_COPY:` (line 788); see liblzma's source code (xz-5.2 branch).

2. Replies above said xz's command-line tool can extract the problematic files successfully. This is because of the order in which xz performs its checks; that check order just happens to avoid the problem.