classification
Title: bz2.BZ2File.read() does not treat growing input file properly
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.5
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: Joshua.Chia, serhiy.storchaka
Priority: low
Keywords:

Created on 2014-01-07 04:04 by Joshua.Chia, last changed 2022-04-11 14:57 by admin.

Files
File name Uploaded Description
slow-copy.py Joshua.Chia, 2014-01-07 04:04
Messages (1)
msg207506 - Author: Joshua Chia (Joshua.Chia) Date: 2014-01-07 04:04
When using bz2.BZ2File to read an input file that is growing slowly, repeated read() calls eventually catch up to the end of the file and from then on fail to produce any more data, even though the input file continues to grow.

In 2.7, the symptom is that read() keeps returning no data even after the file grows. In 3.3, the symptom is "EOFError: Compressed file ended before the end-of-stream marker was reached".
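
The 3.x symptom can be seen without a growing file. The following is a minimal sketch, assuming that a truncated stream is a fair stand-in for a file whose writer has not finished yet; the payload and the helper name read_truncated are illustrative and not part of the original report.

```python
import bz2
import io

# Illustrative data: a complete bz2 stream, and the same stream with its
# tail cut off to mimic a file that is still being written.
payload = b"hello world\n" * 1000
compressed = bz2.compress(payload)
truncated = compressed[: len(compressed) // 2]


def read_truncated(raw):
    """Return the name of the exception raised by read(), or None if it succeeds."""
    try:
        with bz2.BZ2File(io.BytesIO(raw), "r") as f:
            f.read()
    except EOFError as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(read_truncated(compressed))
    print(read_truncated(truncated))
```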

The correct behavior is for read() not to consume partial compressed data, so that a later read() works properly once the input file has grown. The EOFError should not be raised until close() is called and the file is found not to end at an end-of-stream marker.
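
Until BZ2File behaves this way, similar behavior can be approximated with the lower-level bz2.BZ2Decompressor, which accepts compressed data incrementally. The sketch below is one possible workaround, not a proposed patch: the class name TailingBZ2Reader and its methods are hypothetical, and it assumes a single bz2 stream (no concatenated streams).

```python
import bz2


class TailingBZ2Reader:
    """Hypothetical helper: decompress a bz2 file that may still be growing."""

    def __init__(self, path, chunk_size=8192):
        self._raw = open(path, "rb")
        self._decompressor = bz2.BZ2Decompressor()
        self._chunk_size = chunk_size

    @property
    def finished(self):
        """True once the end-of-stream marker has been seen."""
        return self._decompressor.eof

    def read_available(self):
        """Decompress what has been appended so far; b"" if nothing is decodable yet."""
        out = []
        while not self._decompressor.eof:
            chunk = self._raw.read(self._chunk_size)
            if not chunk:
                break  # caught up with the writer; call again later
            # Incomplete compressed input is buffered inside the decompressor.
            out.append(self._decompressor.decompress(chunk))
        return b"".join(out)

    def close(self):
        """Close the file; complain only now if the stream never completed."""
        self._raw.close()
        if not self._decompressor.eof:
            raise EOFError("file did not end at an end-of-stream marker")
```

A caller would poll read_available() in a loop, sleeping whenever it returns no data, and call close() once finished is true or the writer is known to be done.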

Some existing software may depend on the current behavior, so the new behavior could break it. However, making the new behavior conditional on the constructor parameter buffering being non-zero may mitigate incompatibility problems, as passing buffering when reading currently doesn't seem to make much sense.

To repro the problem, use the attached slow-copy.py to slowly copy a large-enough source bz2 file to a destination bz2 file. Then run the following script on the slowly-growing destination bz2 file:

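import bz2
import sys
import time

CHUNK_SIZE = 8192

if len(sys.argv) != 2:
    sys.exit(1)

total = 0
with bz2.BZ2File(sys.argv[1], 'r', buffering=CHUNK_SIZE) as infile:
    # Loops forever by design: keep polling the growing file.
    while True:
        chunk_len = len(infile.read(CHUNK_SIZE))
        total += chunk_len
        # Running total, then the size of the latest read.
        print('{} {}'.format(total, chunk_len))
        if chunk_len < CHUNK_SIZE:
            # Short read: caught up with the writer, so wait for more data.
            time.sleep(0.5)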
History
Date User Action Args
2022-04-11 14:57:56 admin set github: 64355
2014-11-18 20:23:18 serhiy.storchaka set priority: normal -> low; assignee: serhiy.storchaka; nosy: + serhiy.storchaka; versions: + Python 3.5, - Python 2.7, Python 3.3
2014-01-07 04:04:43 Joshua.Chia create