classification
Title: Reading with bz2.BZ2File() returns one garbage character
Type: Stage:
Components: Extension Modules Versions: Python 2.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jafo Nosy List: cpn, georg.brandl, jafo
Priority: high Keywords:

Created on 2006-11-15 14:19 by cpn, last changed 2007-09-17 06:48 by jafo. This issue is now closed.

Files
File name Uploaded Description
bzp.py cpn, 2006-11-15 14:21 python script to reproduce the bug
python-trunk-bz2.patch jafo, 2007-08-28 10:26
python-trunk-bz2-v2.patch jafo, 2007-08-30 09:39
Messages (8)
msg30548 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 14:19
When comparing two files which should be equal the last line is
different:

The first file is a bzip2 compressed file and is read with
bz2.BZ2File()
The second file is the same file uncompressed and read with open()

The first file named file.txt.bz2 is uncompressed with:

$ bunzip2 -k file.txt.bz2

To compare I use this script:
###############################
import bz2

f1 = bz2.BZ2File(r'file.txt.bz2', 'r')
f2 = open(r'file.txt', 'r')
lines = 0
while True:
   line1 = f1.readline()
   line2 = f2.readline()
   if line1 == '':
      break
   lines += 1
   if line1 != line2:
      print 'line number:', lines
      print repr(line1)
      print repr(line2)
f1.close()
f2.close()
##############################

Output:

$ python bzp.py
line number: 588317
'\x07'
'' 

The offending file is 5.5 MB. Sorry, I could not reproduce this problem
with a smaller file.

Tested in Fedora Core 5 and Python 2.4.3
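
For readers on Python 3, a rough equivalent of the comparison script might look like the sketch below. It is not the attached bzp.py: the function name diff_lines is hypothetical, and it assumes Python 3.3 or later for bz2.open(). Unlike the original loop, it keeps going until both files are exhausted, so a trailing extra line in either file is reported.

```python
import bz2
from itertools import zip_longest


def diff_lines(compressed_path, plain_path):
    """Return (line_number, compressed_line, plain_line) for each mismatch.

    Hypothetical helper, not part of the original report.
    """
    mismatches = []
    # Binary mode avoids newline translation and text decoding differences.
    with bz2.open(compressed_path, "rb") as f1, open(plain_path, "rb") as f2:
        # zip_longest pads the shorter file with b"", so a surplus line
        # at the end of either file shows up as a mismatch.
        pairs = zip_longest(f1, f2, fillvalue=b"")
        for number, (line1, line2) in enumerate(pairs, start=1):
            if line1 != line2:
                mismatches.append((number, line1, line2))
    return mismatches
```

Calling diff_lines('file.txt.bz2', 'file.txt') returns an empty list when the two files match line for line.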
msg30549 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 14:28
I can't upload the bz2 sample file. So it is here:
http://fahstats.com/img/file.txt.bz2 
msg30550 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 14:35
Confirmed in Windows Python 2.4 and 2.5

http://groups.google.com/group/comp.lang.python/tree/browse_frm/thread/3010fd664d78010f/4166d429b25c9ed4?rnum=1&_done=%2Fgroup%2Fcomp.lang.python%2Fbrowse_frm%2Fthread%2F3010fd664d78010f%2F4166d429b25c9ed4%3Ftvc%3D1%26#doc_7770aa47861db452
msg30551 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-11-15 17:30
With your file, I can reproduce that on Linux, Python 2.5.

Which compressor did you compress your file with?
I unpacked it with bunzip2 without problems, then recompressed it with bzip2, which resulted
in a slightly smaller (51 bytes) file, which then didn't trigger the bug.
msg30552 - (view) Author: Clodoaldo Pinto Neto (cpn) Date: 2006-11-15 17:46
I received this file already compressed. I don't know which compressor was used.
There is no error if I test the compressed file with:

$ bzip2 -t file.txt.bz2
msg55363 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-08-28 10:26
There are some bugs in the bz2 module.  The problem boils down to the
following code.  Notice how c is stored into the buffer *BEFORE* the
check to see if there was a read error:

   do {
      BZ2_bzRead(&bzerror, f->fp, &c, 1);
      f->pos++;
      *buf++ = c;
   } while (bzerror == BZ_OK && c != '\n' && buf != end);

This could be fixed by putting a "if (bzerror != BZ_OK) break;" after
the BZ2_bzRead() call.
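
To illustrate the failure mode, here is a small Python model of that loop. It is only a sketch, not the module's C code: make_reader, readline_buggy and readline_fixed are hypothetical names, the constants are stand-ins for BZ_OK and BZ_STREAM_END, and the b"\x07" initial value is an assumption standing in for whatever happens to be in the uninitialised C variable c.

```python
OK, STREAM_END = 0, 4  # stand-ins for BZ_OK and BZ_STREAM_END


def make_reader(data):
    """Return a fake one-byte reader yielding (status, bytes_read).

    At end of input it reports STREAM_END with no byte read, which is
    the case the loop above does not allow for.
    """
    it = iter(data)

    def read_one():
        b = next(it, None)
        if b is None:
            return STREAM_END, b""
        return OK, bytes([b])

    return read_one


def readline_buggy(read_one):
    buf = bytearray()
    c = b"\x07"  # assumption: models the uninitialised C variable 'c'
    while True:
        status, got = read_one()
        if got:
            c = got
        buf += c  # stored before the status is looked at
        if status != OK or c == b"\n":
            break
    return bytes(buf)


def readline_fixed(read_one):
    buf = bytearray()
    while True:
        status, got = read_one()
        if not got:  # nothing was read, so store nothing
            break
        buf += got
        if status != OK or got == b"\n":
            break
    return bytes(buf)
```

With a reader over b"a\nb\n", the third call to readline_buggy returns a stray byte where readline_fixed returns an empty line, which matches the shape of the output in the original report.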

However, I also noticed that in the universal newline section of the
code it is reading a character, incrementing f->pos, *THEN* checking if
buf == end and if so is throwing away the character.

I changed the code around so that the read loop is unified between
universal newlines and regular newlines.  I guess this is a small
performance penalty, since it's checking the newline mode for each
character read, however we're already doing a system call for every
character so one additional comparison and jump to merge duplicate code
for maintenance reasons is probably a good plan.  Especially since the
reason for this bug only existed in one of the two duplicated parts of
the code.

Please let me know if this looks good to commit.
msg55469 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-08-30 09:39
I found some problems in the previous version.  This one passes the tests,
and I've also spent time reviewing the code and I think it is correct.
Part of the problem is that only bzerror was being checked, not the
number of bytes read.  The code assumed that a byte had been read even
when bzerror was not BZ_OK, but in some cases BZ2_bzRead() reports an
error when no bytes were read.

This code passes the test and also correctly handles the bz2 file that
is the object of this bug.
msg55950 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2007-09-17 05:48
I have committed this into trunk and the 2.5 maintenance branch.  It
passes all tests and the resulting build passes the submitter-provided test.
History
Date User Action Args
2007-09-17 06:48:23 jafo set resolution: fixed
2007-09-17 05:48:14 jafo set status: open -> closed; messages: + msg55950
2007-08-30 09:39:57 jafo set files: + python-trunk-bz2-v2.patch; messages: + msg55469
2007-08-28 10:26:52 jafo set assignee: jafo
2007-08-28 10:26:08 jafo set files: + python-trunk-bz2.patch; nosy: + jafo; messages: + msg55363
2006-11-15 14:19:09 cpn create