When bz2.BZ2File reads an input file that is growing slowly, repeated read() calls eventually catch up to the end of the file and then stop producing data, even though the input file keeps growing.
In 2.7, the symptom is that read() keeps returning no data even after the file grows. In 3.3, the symptom is "EOFError: Compressed file ended before the end-of-stream marker was reached".
The correct behavior is for read() not to consume partial compressed data, so that a later read() works properly once the input file has grown. EOFError should not be raised until close() is called and the file is found not to end at an end-of-stream marker.
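Until BZ2File behaves this way, a caller can approximate it by feeding raw bytes to bz2.BZ2Decompressor directly. The following is a minimal sketch, assuming Python 3.3 or later (for the eof attribute) and a single-stream bz2 file; drain_available is a hypothetical helper name, not part of the bz2 module.

```python
import bz2


def drain_available(raw, decompressor, chunk_size=8192):
    """Decompress every compressed byte currently readable from raw.

    Returns the decompressed bytes (possibly empty). Reaching the current
    end of the file is not an error; the caller can call again later.
    """
    pieces = []
    while not decompressor.eof:
        chunk = raw.read(chunk_size)
        if not chunk:
            break  # caught up with the writer; more data may arrive later
        # The decompressor keeps any partial block internally, so no
        # compressed input is lost between calls.
        pieces.append(decompressor.decompress(chunk))
    return b"".join(pieces)
```

To use it, open the growing file with open(path, 'rb', buffering=0), create one bz2.BZ2Decompressor(), and call drain_available() in a loop with a sleep between calls until decompressor.eof becomes true.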
Some existing software may depend on the current behavior, so the new behavior could break it. However, enabling the new behavior only when the constructor parameter buffering is non-zero may mitigate incompatibility problems, since buffering currently doesn't seem to serve much purpose when reading.
To repro the problem, use the attached slow-copy.py to slowly copy a large-enough source bz2 file to a destination bz2 file. Then run the following script on the slowly-growing destination bz2 file:
import bz2
import sys
import time

if len(sys.argv) != 2:
    exit(1)

total = 0
with bz2.BZ2File(sys.argv[1], 'r', buffering=8192) as input:
    while True:
        bytes = input.read(8192)
        bytes = len(bytes)
        total += bytes
        print('{} {}'.format(total, bytes))
        if bytes < 8192:
            time.sleep(0.5)
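For readers without the attachment, a stand-in for the slow copy step might look like the sketch below. It is not the attached slow-copy.py; the function name slow_copy, the chunk size, and the delay are all assumptions chosen for illustration.

```python
import time


def slow_copy(src_path, dst_path, chunk_size=4096, delay=0.5):
    """Copy src to dst in small chunks, pausing between writes.

    Returns the number of bytes copied.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            dst.flush()  # make the new bytes visible to a concurrent reader
            copied += len(chunk)
            time.sleep(delay)
    return copied
```

Calling slow_copy('source.bz2', 'dest.bz2') in one process while the reader script runs on dest.bz2 in another should produce a destination file that grows a few kilobytes at a time.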