classification
Title: BZ2File doesn't decompress some .bz2 files correctly
Type: Stage: resolved
Components: IO Versions: Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: James.Dominy, nadeem.vawda, serhiy.storchaka
Priority: normal Keywords:

Created on 2014-02-26 11:59 by James.Dominy, last changed 2014-02-28 08:14 by James.Dominy. This issue is now closed.

Files
File name             Uploaded                        Description
example-file.csv.bz2  James.Dominy, 2014-02-26 11:59  Sample data file which causes bz2 to break
Messages (9)
msg212250 - Author: James Dominy (James.Dominy) Date: 2014-02-26 11:59
bz2.BZ2File does not decompress a file (see attached) correctly. This file can be decompressed and recompressed with the standard Unix tools (bzip2 and bunzip2) without change.

Consider ...

$ python
Python 2.7.6 (default, Dec  7 2013, 22:49:16) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import bz2
>>> import hashlib
>>> len(bz2.BZ2File("example-file.csv.bz2", "r", 0).read())
900000
>>> hashlib.md5(bz2.BZ2File("example-file.csv.bz2", "r", 0).read()).hexdigest()
'e2d4ce212a040c879cb256f88c9faab9'
>>> len(bz2.BZ2File("example-file.csv.bz2", "rb", 0).read())
900000
>>> hashlib.md5(bz2.BZ2File("example-file.csv.bz2", "rb", 0).read()).hexdigest()
'e2d4ce212a040c879cb256f88c9faab9'
>>> 

It looks like bz2 is not dealing with the second block. This is not the first file I've come across that has this problem, and initially I thought it was the file not the module. I've attached a copy of the file.

I use gentoo on a 64bit intel core i5.
msg212251 - Author: James Dominy (James.Dominy) Date: 2014-02-26 12:02
Whoops, forgot to add the output from the standard command-line tools:

$ bzcat example-file.csv.bz2 | wc -c
909602
$ bzcat example-file.csv.bz2 | md5sum
48f4b69b2b8bb0b171ebc36313eb6616  -

As you can see, the file sizes and hashes do not match.
msg212299 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-26 20:25
Everything works on 3.4, but on 3.3 and 2.7 it appears to hang.
msg212301 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-26 20:48
Oh, no, I just hadn't pressed <Enter> after pasting the long test command line. ;)

Everything works on 3.3 too, but on 2.7 I got an incomplete result.

$ ./python -c 'import bz2, hashlib; d = bz2.BZ2File("../example-file.csv.bz2").read(); print len(d), hashlib.md5(d).hexdigest()'
900000 e2d4ce212a040c879cb256f88c9faab9
msg212306 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-26 20:58
Actually, this file is composed of two bzip2 streams. Python 2.7 doesn't support decompressing multi-stream inputs; that feature was added in 3.3. So this is not a bug.
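
The effect can be reproduced without the attached file. The following is a minimal sketch, assuming Python 3.3 or later (where bz2.decompress() accepts multi-stream input); the payloads are made up for illustration. A single BZ2Decompressor stops at the end of the first stream and leaves the rest of the input in unused_data, which matches the truncated result seen on 2.7.

```python
import bz2

# Two independently compressed streams, concatenated byte for byte.
# The payloads are placeholders, not the data from the attached file.
first = bz2.compress(b"first stream\n")
second = bz2.compress(b"second stream\n")
multi = first + second

# A single decompressor object handles exactly one stream.
single = bz2.BZ2Decompressor()
head = single.decompress(multi)
leftover = single.unused_data  # bytes that follow the first stream

# On Python 3.3+, bz2.decompress() continues across stream boundaries.
everything = bz2.decompress(multi)

print(head)
print(len(leftover), len(second))
print(everything)
```

Here head holds only the first payload, while everything holds both.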
msg212307 - Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2014-02-26 21:17
As Serhiy said, multi-stream support was only added to the bz2 module in 3.3,
and there is no plan to backport this functionality to 2.7.

However, the bz2file package on PyPI [1] does support multi-stream inputs,
and you can use its BZ2File class as a drop-in replacement for the built-in
one on 2.7.

[1] https://pypi.python.org/pypi/bz2file
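
A hedged sketch of the drop-in approach: prefer the bz2file backport when it is installed, and otherwise fall back to the standard library class, which only handles multi-stream files on 3.3 and later. The helper name read_bz2 is illustrative and not part of either module.

```python
try:
    # Third-party backport from PyPI; needed for multi-stream files on 2.7.
    from bz2file import BZ2File
except ImportError:
    # Standard library class; multi-stream capable on Python 3.3+ only.
    from bz2 import BZ2File


def read_bz2(path):
    """Return the fully decompressed contents of a .bz2 file as bytes."""
    f = BZ2File(path, "rb")
    try:
        return f.read()
    finally:
        f.close()
```

The explicit try/finally is used instead of a with statement only to stay conservative about which BZ2File implementation ends up being imported.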
msg212339 - Author: James Dominy (James.Dominy) Date: 2014-02-27 08:22
How does one create a multi-stream bzip2 file in the first place? And how do I tell it's multi-stream?
msg212342 - Author: Nadeem Vawda (nadeem.vawda) * (Python committer) Date: 2014-02-27 09:24
> How does one create a multi-stream bzip2 file in the first place?

If you didn't do so deliberately, I would guess that you used a parallel
compression tool like pbzip2 or lbzip2 to create your bz2 file. These tools work
by splitting the input into chunks, compressing each chunk as a separate stream,
and then concatenating these streams afterward.

Another possibility is that you just concatenated two existing bz2 files, e.g.:

    $ cat first.bz2 second.bz2 >multi.bz2


> And how do I tell it's multi-stream?

I don't know of any pre-existing tools to do this, but you can write a script
for it yourself, by feeding the file's data through a BZ2Decompressor. When the
decompress() method raises EOFError, you're at the end of the first stream. If
the decompressor's unused_data attribute is non-empty, or there is data that has
not yet been read from the input file, then it is either (a) a multi-stream bz2
file or (b) a bz2 file with other metadata tacked on to the end.

To distinguish between cases (a) and (b), take unused_data + rest_of_input_file
and feed it into a new BZ2Decompressor. If you don't get an IOError, then you've
got a multi-stream bz2 file.

(If you *do* get an IOError, then that's case (b) - someone's appended non-bz2
 data to the end of a bz2 file. For example, Gentoo and Sabayon Linux packages
 are bz2 files with package metadata appended, according to issue 19839.)
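
The procedure above could be sketched roughly as follows. This is an illustration rather than a vetted tool: it assumes Python 3.3+, where BZ2Decompressor has an eof attribute, and uses that attribute instead of waiting for EOFError. The function name count_bz2_streams is hypothetical, and the whole input is decompressed in memory for simplicity.

```python
import bz2


def count_bz2_streams(data):
    """Return (complete_streams, trailing_byte_count) for a bytes object."""
    streams = 0
    while data:
        decomp = bz2.BZ2Decompressor()
        try:
            decomp.decompress(data)
        except (IOError, OSError):
            # Case (b): what follows is not bzip2 data.
            break
        if not decomp.eof:
            # Input ended before the stream did (truncated or too short).
            break
        streams += 1
        # Whatever follows the finished stream is fed to a fresh decompressor.
        data = decomp.unused_data
    return streams, len(data)
```

A result with more than one stream indicates case (a); a non-zero trailing byte count indicates case (b). Given the diagnosis earlier in this thread, passing the bytes of example-file.csv.bz2 would be expected to report two streams.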
msg212413 - Author: James Dominy (James.Dominy) Date: 2014-02-28 08:14
Ah, I did some digging. It turns out pbzip2 is installed on the system in question, and more annoyingly, /usr/bin/bzip2 is a symlink to pbzip2. I didn't realise the file was compressed by pbzip2.

Thanks for the help.
History
Date                 User              Action  Args
2014-02-28 08:14:37  James.Dominy      set     messages: + msg212413
2014-02-27 09:24:06  nadeem.vawda      set     messages: + msg212342
2014-02-27 08:22:18  James.Dominy      set     messages: + msg212339
2014-02-26 21:17:02  nadeem.vawda      set     messages: + msg212307
2014-02-26 20:58:18  serhiy.storchaka  set     status: open -> closed; resolution: not a bug; messages: + msg212306; stage: resolved
2014-02-26 20:48:08  serhiy.storchaka  set     messages: + msg212301
2014-02-26 20:25:18  serhiy.storchaka  set     messages: + msg212299
2014-02-26 20:07:05  serhiy.storchaka  set     nosy: + nadeem.vawda, serhiy.storchaka
2014-02-26 12:02:13  James.Dominy      set     messages: + msg212251
2014-02-26 11:59:44  James.Dominy      set     title: BZ2File does decompress some .bz2 files correctly -> BZ2File doesn't decompress some .bz2 files correctly
2014-02-26 11:59:20  James.Dominy      create