Author neologix
Recipients Eric.Wolf, neologix, niemeyer, wrobell
Date 2011-03-01.22:11:25
Message-id <AANLkTinyf5E_fDASqL42gq09X9-wfG1+gs+TD8eUXtLm@mail.gmail.com>
In-reply-to <1299014545.23.0.590778714643.issue10900@psf.upfronthosting.co.za>
Content
> Stupid questions are always worth asking. I did check the MD5 sum earlier
> and just checked it again (since I copied the file from one machine to
> another):
>
> ebwolf@ubuntu:/opt$ md5sum /host/full-planet-110115-1800.osm.bz2
> 0e3f81ef0dd415d8f90f1378666a400c  /host/full-planet-110115-1800.osm.bz2
> ebwolf@ubuntu:/opt$ cat full-planet-110115-1800.osm.bz2.md5
> 0e3f81ef0dd415d8f90f1378666a400c  full-planet-110115-1800.osm.bz2
>

Well, that only proves that the file wasn't corrupted during the download.
It doesn't prove that the file on the remote server isn't corrupt
(see for example the link I gave you: the guy used rsync and had a
correct checksum, but was still unable to extract the file).

> There you have it. I was able to convert the bz2 to gzip with no errors:
>
> bzcat full-planet-110115-1800.osm.bz2 | gzip > full-planet.osm.gz
>

How big is full-planet.osm.gz?
The bzip2 tool itself uses bzlib, so it can very well return after having
uncompressed only half the file.
A more interesting test would be:
$ bzip2 -cd full-planet-110115-1800.osm.bz2 | bzip2 -c > full-planet.new.osm.bz2
$ md5sum full-planet.*.bz2
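
The same check can be approximated from Python. What follows is only a rough sketch, assuming Python 3.3 or later (where BZ2Decompressor has an `eof` attribute); `probe_bz2` and its return keys are names made up for this illustration. It feeds the file to a single decompressor and reports whether the end-of-stream marker was reached, how many bytes came out, and how many compressed bytes were left unread:

```python
import bz2
import hashlib
import os


def probe_bz2(path, chunk_size=1024 * 1024):
    """Decompress `path` with one BZ2Decompressor and report how far it got."""
    decomp = bz2.BZ2Decompressor()
    digest = hashlib.md5()
    out_bytes = 0
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # the file ended before the end-of-stream marker
            data = decomp.decompress(chunk)
            digest.update(data)
            out_bytes += len(data)
        # Bytes never decompressed: what the decompressor set aside after
        # the end-of-stream marker, plus what was never read from the file.
        trailing = len(decomp.unused_data) + (size - f.tell())
    return {
        "complete": decomp.eof,
        "decompressed_bytes": out_bytes,
        "trailing_bytes": trailing,
        "md5": digest.hexdigest(),
    }
```

Calling `probe_bz2("full-planet-110115-1800.osm.bz2")` and comparing `decompressed_bytes` and `md5` against the output of `bzip2 -cd` would show whether both paths stop at the same place; a non-zero `trailing_bytes` means part of the file was never decompressed.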

> FYI: This problem came up last year with no resolution:
>
> http://mail.python.org/pipermail/tutor/2010-February/074610.html
>

Yeah, and it was also on an OSM file.
Now, I know that OSM is probably one of the biggest providers of huge
archives, but it's surprising that every time there's a problem with
bz2, it's with an OSM file, no?

Look at what I just found, a message from an OSM admin dating from late 2010:

"""
On 26 October 2010 13:47, Anthony <osm <at> inbox.org> wrote:
> a <at> A-PC:/media/usbdrive$ cat full-planet-101022.osm.bz2.md5
> 0a90fec8ce66bdd82984c2ee8c6bb6ac  full-planet-101022.osm.bz2
> a <at> A-PC:/media/usbdrive$ md5sum full-planet-101022.osm.bz2
> c652430b00668c30bb04816ff16cbfbe  full-planet-101022.osm.bz2
>
> Just me?
>

We had problems with the network card in that machine last night
causing some corruption, try
rsync://planet.openstreetmap.org/planet/full-experimental/ the file
into a good state.

Although best to wait a few hours, currently packet loss issues on
server's upstream network.

Regards
 Grant
"""

> In general, is it best to always read the same number of bytes?

In that case, it doesn't matter.

> And what is the best value to pass for buffering in BZ2File? I just made up
> something hoping it would work.

The default one ;-) (don't provide any)
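
To make both answers concrete, here is a minimal sketch (written for Python 3; `count_decompressed_bytes` is a hypothetical helper, not something from the bz2 module) that opens the file without any buffering argument and reads fixed-size chunks until EOF. The chunk size only changes how many read() calls are made, not the data returned:

```python
import bz2


def count_decompressed_bytes(path, chunk_size=64 * 1024):
    """Read a .bz2 file to EOF in fixed-size chunks and return the byte count."""
    total = 0
    # No buffering argument: let BZ2File use its default.
    with bz2.BZ2File(path) as f:
        # f.read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            total += len(chunk)
    return total
```

Running it twice with different `chunk_size` values on the same file should give the same total; if it doesn't, the read size is not the culprit you are looking for.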

> Colin was using an OSM planet file from some time last year and it quit at exactly 900000 bytes.

OSM again :-)
900000 bytes is exactly the default bz2 block size...
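
For reference, the block size is declared in the first four bytes of a bzip2 stream: the magic "BZh" followed by a digit from 1 to 9, the block size being that digit times 100000 bytes. A small sketch for Python 3 (`bz2_block_size` is a helper name invented here, not part of any library):

```python
import bz2


def bz2_block_size(header):
    """Return the block size in bytes declared by a bzip2 stream header."""
    if len(header) < 4 or header[:3] != b"BZh" or not 0x31 <= header[3] <= 0x39:
        raise ValueError("not a bzip2 stream header")
    # The fourth byte is an ASCII digit '1'..'9' giving the level.
    return (header[3] - ord("0")) * 100000


# bz2.compress() defaults to compresslevel=9, hence the "BZh9" header.
print(bz2_block_size(bz2.compress(b"example")[:4]))  # prints 900000
```

Reading the first four bytes of the planet file and passing them to this function would confirm which block size it was compressed with.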
History
Date                 User      Action  Args
2011-03-01 22:11:33  neologix  set     recipients: + neologix, niemeyer, wrobell, Eric.Wolf
2011-03-01 22:11:26  neologix  link    issue10900 messages
2011-03-01 22:11:25  neologix  create