Classification
Title: bz2 module fails to uncompress large files
Type:  Stage:
Components: Library (Lib)  Versions: Python 3.2, Python 2.7, Python 2.5

Process
Status: closed  Resolution: duplicate
Dependencies:  Superseder: bz2.BZ2File doesn't support multiple streams (issue 1625)
Assigned To:  Nosy List: Eric.Wolf, neologix, niemeyer, pitrou, wrobell
Priority: normal  Keywords:

Created on 2011-01-13 17:14 by wrobell, last changed 2011-03-02 07:55 by neologix. This issue is now closed.

Files
File name Uploaded Description Edit
bz2wc.py wrobell, 2011-01-13 17:14 bz2 test script
OSM_Extract.py Eric.Wolf, 2011-03-01 01:25 First pass reading OSM full-planet
strace_bz2.txt Eric.Wolf, 2011-03-01 18:18 strace output
Messages (15)
msg126186 - (view) Author: wrobell (wrobell) Date: 2011-01-13 17:14
There is a problem uncompressing large files with the bz2 module.

For example, please download the 13 GB OpenStreetMap file using the following torrent:

http://osm-torrent.torres.voyager.hr/files/planet-latest.osm.bz2.torrent

Try to count lines in the compressed file with command...

   python3.2 bz2wc.py planet-110105.osm.bz2 
   3971

... but there are many more lines in that file:

   bzip2 -dc < planet-110105.osm.bz2 | wc -l
   
The command

   bzip2 -t planet-110105.osm.bz2

validates the file successfully.
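The attached bz2wc.py is not reproduced in this report; a minimal line-counting script along these lines (a sketch, reading in fixed-size chunks to keep memory bounded, not necessarily the attached script) might look like:

```python
import bz2
import sys

def count_lines(path, chunk_size=64 * 1024):
    """Count newline characters in a bz2-compressed file, chunk by chunk."""
    total = 0
    with bz2.BZ2File(path, "rb") as f:
        # read() returns b"" at what the module considers end of data
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total

if __name__ == "__main__" and len(sys.argv) > 1:
    print(count_lines(sys.argv[1]))
```

On the affected Python versions, BZ2File stops at the first end-of-stream marker, so a script like this undercounts on files containing more than one bz2 stream.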
msg126193 - (view) Author: wrobell (wrobell) Date: 2011-01-13 19:06
Forgot to mention the real number of lines!

    bzip2 -dc < planet-110105.osm.bz2 | wc -l
    2783595867
msg129734 - (view) Author: Eric Wolf (Eric.Wolf) Date: 2011-03-01 01:25
I'm experiencing the same thing. My script works perfectly on a 165 MB file but fails after reading 900,000 bytes of a 22 GB file.

My script uses buffered BZ2File.read calls and is agnostic about end-of-lines. Opening with "rb" does not help. It is specifically written to avoid reading too much into memory at once.

I have tested this script on:
Python 2.5.1 (r251:54863) (ESRI ArcGIS version) (WinXP 64-bit)
Python 2.7.1.4 (r271:86832) (64-bit ActiveState version) (WinXP 64-bit)
Python 2.6.4 (r264:75706) (Ubuntu 9.10 64-bit)

Check here for some really big BZ2 files:

http://planet.openstreetmap.org/full-experimental/
msg129752 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-01 11:27
@Eric.Wolf

Could you try with this:

            # Read in another chunk of the file
            # NOTE: It's possible that an XML tag will be greater than buffsize
            #       This will break in that situation
-            newb = self.fp.read(self.bufpos)
+            newb = self.fp.read(self.buffsize)

Also, could you provide the output of
strace -emmap2,sbrk,brk python <script>

I could be completely wrong, but both in your case and in wrobell's case, there's a lot of _PyBytes_Resize going on, and given how PyObject_Realloc is implemented, this could lead to heavy heap fragmentation.
msg129787 - (view) Author: Eric Wolf (Eric.Wolf) Date: 2011-03-01 18:18
I tried the change you suggested. It still fails, but now at 572,320 bytes instead of 900,000. I'm not sure why the number of bytes read differs. I'll explore this more in a bit.

I also converted the BZ2 to GZ and used the gzip module. It's failing after reading 46,628,864 bytes. The GZ file is 33 GB compared to the 22 GB BZ2.

I've attached the strace output. I was getting an error with the sbrk parameter, so I left it out. Let me know if there's anything else I can provide.
msg129794 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-01 18:58
> I've attached the strace output. I was getting an error with the sbrk parameter, so I left it out.

Yeah, sbrk is not a syscall ;-)

> Let me know if there's anything else I can provide.

Stupid questions:
- have you checked the file's md5sum ?
- what does "bzip2 -cd <file> > /dev/null" return ?
msg129807 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-01 21:14
After running this under gdb, it turns out that it's actually bzlib's bzRead that's returning BZ_STREAM_END after only 900k bytes.
So it confirms what I've been suspecting, i.e. that the file is corrupt (I got the error at exactly the same offset as you - it could be a bug in bzlib, but it'd be quite surprising).
Note that google returns quite a few occurrences of corrupted OSM archives, e.g. http://www.mail-archive.com/newbies@openstreetmap.org/msg01854.html
msg129808 - (view) Author: Eric Wolf (Eric.Wolf) Date: 2011-03-01 21:22
Stupid questions are always worth asking. I did check the MD5 sum earlier and just checked it again (since I copied the file from one machine to another):

ebwolf@ubuntu:/opt$ md5sum /host/full-planet-110115-1800.osm.bz2 
0e3f81ef0dd415d8f90f1378666a400c  /host/full-planet-110115-1800.osm.bz2
ebwolf@ubuntu:/opt$ cat full-planet-110115-1800.osm.bz2.md5 
0e3f81ef0dd415d8f90f1378666a400c  full-planet-110115-1800.osm.bz2

There you have it. I was able to convert the bz2 to gzip with no errors:

bzcat full-planet-110115-1800.osm.bz2 | gzip > full-planet.osm.gz

FYI: This problem came up last year with no resolution:

http://mail.python.org/pipermail/tutor/2010-February/074610.html

Thanks for looking at this. Let me know if there's anything else you'd like me to try. In general, is it best to always read the same number of bytes? And what is the best value to pass for buffering in BZ2File? I just made up something hoping it would work.

I'm still waiting on the bzcat to /dev/null
msg129814 - (view) Author: Eric Wolf (Eric.Wolf) Date: 2011-03-01 21:56
The only problem with the theory that the file is corrupt is that at least three people have encountered exactly the same problem with three files:

http://mail.python.org/pipermail/tutor/2010-June/076343.html

Colin was using an OSM planet file from some time last year and it quit at exactly 900000 bytes.

I'm trying bzip2 -t on the file to see if it reports any problems. These things take time... the bzcat to /dev/null still hasn't completed.
msg129821 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-01 22:11
> Stupid questions are always worth asking. I did check the MD5 sum earlier
> and just checked it again (since I copied the file from one machine to
> another):
>
> ebwolf@ubuntu:/opt$ md5sum /host/full-planet-110115-1800.osm.bz2
> 0e3f81ef0dd415d8f90f1378666a400c  /host/full-planet-110115-1800.osm.bz2
> ebwolf@ubuntu:/opt$ cat full-planet-110115-1800.osm.bz2.md5
> 0e3f81ef0dd415d8f90f1378666a400c  full-planet-110115-1800.osm.bz2
>

Well, that only proves that the file wasn't corrupted during the download.
But this doesn't prove that the file on the remote server isn't
corrupt (see for example the link I gave you, the guy used rsync and
had a correct checksum but was still unable to extract the file).

> There you have it. I was able to convert the bz2 to gzip with no errors:
>
> bzcat full-planet-110115-1800.osm.bz2 | gzip > full-planet.osm.gz
>

How big is full-planet.osm.gz ?
Since bzcat uses bzlib too, it could very well have returned after
uncompressing only half the file.
A more interesting test would be
$ bzip2 -cd full-planet-110115-1800.osm.bz2 | bzip2 -c > full-planet.new.osm.bz2
$ md5sum full-planet.*.bz2

> FYI: This problem came up last year with no resolution:
>
> http://mail.python.org/pipermail/tutor/2010-February/074610.html
>

Yeah, and it was also on an OSM file.
Now, I know that OSM is probably one of the biggest providers of huge
archives, but it's surprising that every time there's a problem with
bz2, it's with an OSM file, no ?

Look at what I just found, a message from an OSM admin dating from late 2010:

"""
On 26 October 2010 13:47, Anthony <osm <at> inbox.org> wrote:
> a <at> A-PC:/media/usbdrive$ cat full-planet-101022.osm.bz2.md5
> 0a90fec8ce66bdd82984c2ee8c6bb6ac  full-planet-101022.osm.bz2
> a <at> A-PC:/media/usbdrive$ md5sum full-planet-101022.osm.bz2
> c652430b00668c30bb04816ff16cbfbe  full-planet-101022.osm.bz2
>
> Just me?
>

We had problems with the network card in that machine last night
causing some corruption; try rsyncing from
rsync://planet.openstreetmap.org/planet/full-experimental/ to get the
file into a good state.

Although best to wait a few hours, currently packet loss issues on
server's upstream network.

Regards
 Grant
"""

> In general, is it best to always read the same number of bytes?

In that case, it doesn't matter.

> And what is the best value to pass for buffering in BZ2File? I just made up
> something hoping it would work.

The default one ;-) (don't provide any)

> Colin was using an OSM planet file from some time last year and it quit at exactly 900000 bytes.

OSM again :-)
900,000 is exactly the default bz2 block size...
msg129826 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-01 22:28
Perhaps your bz2 files are simply multi-stream files? The bz2 module currently doesn't support them (it only decompresses the first stream); see issue1625 for a patch.

I'm not an expert on this, but it seems you can do:

$ bzip2 -tvvv foo.bz2 
  foo.bz2: 
    [1: huff+mtf rt+rld {0x135c15ac, 0x135c15ac}]
    combined CRCs: stored = 0x135c15ac, computed = 0x135c15ac
    [1: huff+mtf rt+rld {0x6ff631c1, 0x6ff631c1}]
    combined CRCs: stored = 0x6ff631c1, computed = 0x6ff631c1
    ok

My intuition is that if you get several lines about CRCs, it means there are several streams in the bz2 file.
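For reference, a multi-stream file can be handled in pure Python by feeding the data to a fresh bz2.BZ2Decompressor for each stream and continuing from its unused_data. This is a sketch of the general technique, not the issue1625 patch itself:

```python
import bz2

def decompress_multistream(data):
    """Decompress bytes containing one or more concatenated bz2 streams."""
    parts = []
    while data:
        dec = bz2.BZ2Decompressor()
        parts.append(dec.decompress(data))
        # unused_data holds whatever followed this stream's end marker;
        # if the input ended exactly at the marker, it is b"" and we stop
        data = dec.unused_data
    return b"".join(parts)
```

A single call to bz2.decompress, like the BZ2File of the affected versions, would only return the first stream's contents.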
msg129833 - (view) Author: Eric Wolf (Eric.Wolf) Date: 2011-03-01 23:41
I just got confirmation that OSM is using pbzip2 to generate these files. So they are multi-stream. At least that gives a final answer but doesn't solve my problem.

I saw this: http://bugs.python.org/issue1625

Does anyone know the current status of the patch supporting multistream bz2?
msg129834 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-03-01 23:47
> Does anyone know the current status of the patch supporting multistream 
> bz2?

1) patch needs updating for current 3.x (probably not difficult)
2) need to make sure the legal statement is ok with vmware (since apparently they are a bit picky about this)

Closing as duplicate of issue1625.
msg129857 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-02 06:35
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Perhaps your bz2 files are simply multi-stream files? The bz2 module
> currently doesn't support them (it only decompresses the first stream); see
> issue1625 for a patch.

That explains why it was seeing an end-of-stream so early...
Thanks for the explanation, I didn't know about multi-stream bzip2.

Charles
msg129859 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-03-02 07:55
2011/3/2 Eric Wolf <report@bugs.python.org>:
>
> Eric Wolf <ebwolf@gmail.com> added the comment:
>
> I just got confirmation that OSM is using pbzip2 to generate these files. So they are multi-stream. At least that gives a final answer but doesn't solve my problem.
>

At least on Unix, you can use this workaround:

-        self.fp = bz2.BZ2File(filename,'rb',16384*64)
+        self.fp = os.popen('bzip2 -cd ' + filename)

It's ugly and less portable, but it should work on any Unix with the
bzip2 tool installed (and bzip2 itself does handle multi-stream files).
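The same workaround can be written with subprocess instead of os.popen, which avoids shell quoting issues with the filename. A sketch (the helper name open_bzcat is made up for illustration):

```python
import subprocess

def open_bzcat(filename):
    """Return a binary file object streaming the decompressed contents of
    filename, piped through the external bzip2 tool, which decompresses
    all streams in a multi-stream file."""
    proc = subprocess.Popen(["bzip2", "-cd", filename],
                            stdout=subprocess.PIPE)
    return proc.stdout
```

The caller would then use `self.fp = open_bzcat(filename)` in place of the BZ2File call shown above.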
History
Date User Action Args
2011-03-02 07:55:38neologixsetnosy: niemeyer, pitrou, wrobell, neologix, Eric.Wolf
messages: + msg129859
2011-03-02 06:35:23neologixsetnosy: niemeyer, pitrou, wrobell, neologix, Eric.Wolf
messages: + msg129857
title: bz2 module fails to multi-stream files completely -> bz2 module fails to uncompress large files
2011-03-01 23:47:37pitrousetstatus: open -> closed
title: bz2 module fails to uncompress large files -> bz2 module fails to multi-stream files completely
nosy: niemeyer, pitrou, wrobell, neologix, Eric.Wolf
messages: + msg129834

superseder: bz2.BZ2File doesn't support multiple streams
resolution: duplicate
2011-03-01 23:41:17Eric.Wolfsetnosy: niemeyer, pitrou, wrobell, neologix, Eric.Wolf
messages: + msg129833
2011-03-01 22:28:37pitrousetnosy: + pitrou
messages: + msg129826
2011-03-01 22:11:26neologixsetnosy: niemeyer, wrobell, neologix, Eric.Wolf
messages: + msg129821
2011-03-01 21:56:44Eric.Wolfsetnosy: niemeyer, wrobell, neologix, Eric.Wolf
messages: + msg129814
2011-03-01 21:22:24Eric.Wolfsetnosy: niemeyer, wrobell, neologix, Eric.Wolf
messages: + msg129808
2011-03-01 21:14:05neologixsetnosy: niemeyer, wrobell, neologix, Eric.Wolf
messages: + msg129807
2011-03-01 18:58:23neologixsetnosy: niemeyer, wrobell, neologix, Eric.Wolf
messages: + msg129794
2011-03-01 18:18:10Eric.Wolfsetfiles: + strace_bz2.txt

versions: + Python 2.5
messages: + msg129787
nosy: niemeyer, wrobell, neologix, Eric.Wolf
2011-03-01 11:27:55neologixsetnosy: + neologix
messages: + msg129752
2011-03-01 01:25:35Eric.Wolfsetfiles: + OSM_Extract.py
nosy: + Eric.Wolf
messages: + msg129734

2011-01-13 19:06:16wrobellsetmessages: + msg126193
2011-01-13 17:39:31SilentGhostsetnosy: + niemeyer, - gustavo
2011-01-13 17:38:40SilentGhostsetnosy: + gustavo
2011-01-13 17:14:22wrobellcreate