
classification
Title: Iteration breaks with bz2.open(filename,'rt')
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: dabeaz, nadeem.vawda, pitrou, python-dev, serhiy.storchaka
Priority: normal
Keywords:

Created on 2012-08-03 09:04 by dabeaz, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name: access-log-0108.bz2
Uploaded: dabeaz, 2012-08-03 11:52
Messages (16)
msg167299 - Author: David Beazley (dabeaz) Date: 2012-08-03 09:04
The bz2 library in Python 3.3b1 doesn't properly support iteration in text mode. Example:

>>> import bz2
>>> f = bz2.open('access-log-0108.bz2')
>>> next(f)       # Works
b'140.180.132.213 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ply.html HTTP/1.1" 200 97238\n'

>>> g = bz2.open('access-log-0108.bz2','rt')
>>> next(g)       # Fails
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>>
msg167305 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-08-03 11:29
I can't seem to reproduce this with an up-to-date checkout from Mercurial:

    >>> import bz2
    >>> g = bz2.open('access-log-0108.bz2','rt')
    >>> next(g)
    '140.180.132.213 - - [24/Feb/2008:00:08:59 -0600] "GET /ply/ply.html HTTP/1.1" 200 97238\n'

(where 'access-log-0108.bz2' is a file I created with the output above as
its first line, and a couple of other lines of random junk following that)

Would it be possible for you to upload the file you used to trigger this
bug?
msg167308 - Author: David Beazley (dabeaz) Date: 2012-08-03 11:52
File attached. The file can be read in its entirety in binary mode.
msg167369 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-08-03 22:27
The cause of this problem is that BZ2File.read1() sometimes returns b"", even though
the file is not at EOF. This happens when the underlying BZ2Decompressor cannot produce
any decompressed data from just the block passed to it in _fill_buffer(); in this case, it needs to read more of the compressed stream to make progress.

It would seem that BZ2File cannot satisfy the contract of the read1() method - we
can't guarantee that a single call to the read() method of the underlying file will
allow us to return a non-empty result, whereas returning b"" is reserved for the
case where we have reached EOF.
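
The behaviour described above can be reproduced without the attached file. The following is only a sketch: the payload and the 16-byte fragment size are arbitrary assumptions chosen for illustration, not values taken from the report. It shows that BZ2Decompressor can legitimately return no data when it has seen only part of a compressed block, even though the stream has not ended.

```python
import bz2

# Hypothetical sample data standing in for the access log.
payload = b"140.180.132.213 - - GET /ply/ply.html\n" * 1000
compressed = bz2.compress(payload)

decomp = bz2.BZ2Decompressor()

# Feed only a small fragment of the compressed stream. bzip2 cannot emit
# any output until a whole block has been read, so nothing comes back.
first = decomp.decompress(compressed[:16])
eof_after_fragment = decomp.eof
print(first)               # b''
print(eof_after_fragment)  # False

# Supplying the remainder lets the decompressor make progress.
rest = decomp.decompress(compressed[16:])
print(first + rest == payload)  # True
```

An empty result combined with `eof` being false is exactly the state in which a single-read read1() would wrongly signal end of file.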

Removing the read1() method would simply trade this problem for a bigger one
(resurrecting issue 10791), so I propose amending BZ2File.read1() to make as many reads
from the underlying file as necessary to return a non-empty result.
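
A rough sketch of that proposal follows. It is not the code that was committed: `LoopingReader` and its `chunk_size` parameter are hypothetical names used only for illustration, and the class handles a single bzip2 stream with no buffering. The point is the loop, which keeps reading from the underlying file until the decompressor yields data or reports the end of the stream.

```python
import bz2
import io


class LoopingReader:
    """Hypothetical minimal reader; not the actual BZ2File implementation."""

    def __init__(self, fileobj, chunk_size=16):
        self._fp = fileobj
        # Deliberately tiny so that single reads often decompress to nothing.
        self._chunk_size = chunk_size
        self._decomp = bz2.BZ2Decompressor()

    def read1(self):
        # Keep feeding compressed chunks until some data comes out.
        # b"" is returned only once the end-of-stream marker has been seen.
        while not self._decomp.eof:
            raw = self._fp.read(self._chunk_size)
            if not raw:
                raise EOFError("compressed file ended before the "
                               "end-of-stream marker was reached")
            data = self._decomp.decompress(raw)
            if data:
                return data
        return b""


# Usage: collect everything via repeated read1() calls.
payload = b"line one\nline two\n" * 500
reader = LoopingReader(io.BytesIO(bz2.compress(payload)))
chunks = []
while True:
    chunk = reader.read1()
    if not chunk:
        break
    chunks.append(chunk)
print(b"".join(chunks) == payload)  # True
```

With a single read per call instead of the loop, the first read1() here would return b"" and a caller such as a text wrapper would treat the file as empty.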

Antoine, what do you think of this?
msg167370 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-08-03 22:29
> I propose amending BZ2File.read1() to make as many reads
> from the underlying file as necessary to return a non-empty result.

Agreed. IMO, read1()'s contract should be read as a best-effort thing, not an absolute guarantee. Returning an empty string when there is still data available is wrong.
msg167397 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-08-04 06:53
I encountered this when implementing bzip2 support in zipfile (issue14371). I solved it the same way, by rewriting read and read1 to make as many reads from the underlying file as necessary to return a non-empty result.
msg167407 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-08-04 13:39
New changeset cdf27a213bd2 by Nadeem Vawda in branch 'default':
#15546: Fix BZ2File.read1()'s handling of pathological input data.
http://hg.python.org/cpython/rev/cdf27a213bd2
msg167408 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-08-04 13:41
OK, BZ2File should now be fixed. It looks like LZMAFile and GzipFile may
be susceptible to the same problem; I'll push fixes for them shortly.
msg167461 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-08-05 00:19
New changeset 5284e65e865b by Nadeem Vawda in branch 'default':
#15546: Fix {GzipFile,LZMAFile}.read1()'s handling of pathological input data.
http://hg.python.org/cpython/rev/5284e65e865b
msg167462 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-08-05 00:28
Done.

Thanks for the bug report, David.
msg167470 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-08-05 06:20
What about peek()?
msg167493 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-08-05 12:11
Before these fixes, it looks like all three classes' peek() methods were susceptible
to the same problem as read1().

The fixes for BZ2File.read1() and LZMAFile.read1() should have fixed peek() as well;
both methods are implemented in terms of _fill_buffer().

For GzipFile, peek() is still potentially broken - I'll push a fix shortly.
msg167497 - Author: Roundup Robot (python-dev) (Python triager) Date: 2012-08-05 12:48
New changeset 8c07ff7f882f by Nadeem Vawda in branch 'default':
#15546: Also fix GzipFile.peek().
http://hg.python.org/cpython/rev/8c07ff7f882f
msg167501 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2012-08-05 13:27
I have a doubt. Couldn't this loop forever if the compressed data ends exactly at the end of a read block? Perhaps instead of "while self.extrasize <= 0:" it would be better to write "while self.extrasize <= 0 and self.fileobj is not None:"?
msg167505 - Author: Nadeem Vawda (nadeem.vawda) (Python committer) Date: 2012-08-05 14:25
No, if _read() is called once the file is already at EOF, it raises an
EOFError (http://hg.python.org/cpython/file/8c07ff7f882f/Lib/gzip.py#l433),
which will then break out of the loop.
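
The termination argument can be illustrated with a toy model. This is a sketch under stated assumptions, not the gzip.py source: `ToyReader` is a hypothetical class, and only the names `extrasize`, `extrabuf` and `_read()` mirror the ones mentioned in this thread. Each `_read()` call may add nothing to the buffer, but once the input is exhausted it raises EOFError, which ends the loop.

```python
class ToyReader:
    """Hypothetical stand-in for the buffering loop discussed above."""

    def __init__(self, chunks):
        self._chunks = list(chunks)  # pre-decompressed pieces, some empty
        self.extrabuf = b""
        self.extrasize = 0

    def _read(self):
        # Mirrors the described behaviour: raise once the input is used up.
        if not self._chunks:
            raise EOFError("Reached EOF")
        chunk = self._chunks.pop(0)
        self.extrabuf += chunk
        self.extrasize += len(chunk)

    def peek(self):
        try:
            # Loop until some data is buffered; EOFError is the only
            # other way out, so the loop cannot spin forever.
            while self.extrasize <= 0:
                self._read()
        except EOFError:
            pass
        return self.extrabuf


with_data = ToyReader([b"", b"", b"abc"]).peek()
only_empty = ToyReader([b"", b""]).peek()
print(with_data)   # b'abc'
print(only_empty)  # b''
```

In the second case every read produces nothing, yet peek() still returns because the third `_read()` call raises EOFError.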
msg180388 - Author: Roundup Robot (python-dev) (Python triager) Date: 2013-01-22 13:59
New changeset 0f25119ceee8 by Serhiy Storchaka in branch '3.2':
#15546: Fix GzipFile.peek()'s handling of pathological input data.
http://hg.python.org/cpython/rev/0f25119ceee8
History
Date                 User              Action  Args
2022-04-11 14:57:33  admin             set     github: 59751
2013-01-22 13:59:29  python-dev        set     messages: + msg180388
2012-08-05 14:25:56  nadeem.vawda      set     messages: + msg167505
2012-08-05 13:27:45  serhiy.storchaka  set     messages: + msg167501
2012-08-05 12:48:09  python-dev        set     messages: + msg167497
2012-08-05 12:11:34  nadeem.vawda      set     messages: + msg167493
2012-08-05 06:20:08  serhiy.storchaka  set     messages: + msg167470
2012-08-05 00:28:17  nadeem.vawda      set     status: open -> closed; resolution: fixed; messages: + msg167462; stage: resolved
2012-08-05 00:19:50  python-dev        set     messages: + msg167461
2012-08-04 13:41:50  nadeem.vawda      set     messages: + msg167408
2012-08-04 13:39:10  python-dev        set     nosy: + python-dev; messages: + msg167407
2012-08-04 06:53:19  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg167397
2012-08-03 22:29:47  pitrou            set     messages: + msg167370
2012-08-03 22:27:40  nadeem.vawda      set     nosy: + pitrou; messages: + msg167369
2012-08-03 11:52:10  dabeaz            set     files: + access-log-0108.bz2; messages: + msg167308
2012-08-03 11:29:17  nadeem.vawda      set     messages: + msg167305
2012-08-03 10:37:17  pitrou            set     nosy: + nadeem.vawda
2012-08-03 09:04:33  dabeaz            create