classification
Title: binary compressed file reading corrupts newlines (lzma, gzip, bz2)
Type:
Stage: resolved
Components: Library (Lib)
Versions:
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: jtaylor
Priority: normal
Keywords:

Created on 2017-04-14 14:18 by jtaylor, last changed 2017-04-14 14:28 by jtaylor. This issue is now closed.

Messages (3)
msg291661 - Author: Julian Taylor (jtaylor) Date: 2017-04-14 14:18
Probably a case of 'don't do that', but reading lines from a compressed file in binary mode produces bytes with invalid newlines in encodings where '\n' is encoded as something else:

import lzma

with lzma.open("test.xz", "wt", encoding="UTF-32-LE") as f:
    f.write('0 1 2\n3 4 5')

lzma.open("test.xz", "rb").readlines()[0].decode('UTF-32-LE')

Fails with:
UnicodeDecodeError: 'utf-32-le' codec can't decode byte 0x0a in position 20: truncated data

as readlines() produces:
b'0\x00\x00\x00 \x00\x00\x001\x00\x00\x00 \x00\x00\x002\x00\x00\x00\n'
The last newline should be '\n'.encode('UTF-32-LE') == b'\n\x00\x00\x00'
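
A minimal sketch of two ways to get correctly decoded lines, assuming the encoding is known in advance and (for the second variant) that the decompressed data fits in memory. The helper names `read_lines_text_mode` and `read_lines_decode_first` are hypothetical and used only for illustration; the file is written to a temporary directory rather than the current one.

```python
import lzma
import os
import tempfile

ENCODING = "UTF-32-LE"

def read_lines_text_mode(path, encoding=ENCODING):
    # Text mode decodes first and splits on the decoded '\n' afterwards.
    with lzma.open(path, "rt", encoding=encoding) as f:
        return f.readlines()

def read_lines_decode_first(path, encoding=ENCODING):
    # Read all bytes, decode once, then split the resulting str.
    with lzma.open(path, "rb") as f:
        return f.read().decode(encoding).splitlines(keepends=True)

# Recreate the file from the report; newline="\n" disables newline
# translation so the bytes are the same on every platform.
path = os.path.join(tempfile.mkdtemp(), "test.xz")
with lzma.open(path, "wt", encoding=ENCODING, newline="\n") as f:
    f.write("0 1 2\n3 4 5")

text_lines = read_lines_text_mode(path)
decoded_lines = read_lines_decode_first(path)
```

Both variants return str lines rather than bytes. Note that str.splitlines() also splits on some line boundaries other than '\n', so the two helpers can differ on text that contains such characters.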
msg291663 - Author: Julian Taylor (jtaylor) Date: 2017-04-14 14:27
On second thought, this is not really worth an issue, as it is a general problem of readline on binary streams. Sorry for the noise.
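
A small sketch of that general behaviour, using io.BytesIO instead of a compressed file (an illustrative substitution, not part of the original report). Binary readline() splits after every b'\n' byte, so the three padding bytes of the UTF-32-LE newline end up at the start of the next line.

```python
import io

TEXT = "0 1 2\n3 4 5"
payload = TEXT.encode("UTF-32-LE")

# Binary streams know nothing about the encoding; they split after
# each single 0x0a byte.
raw_lines = io.BytesIO(payload).readlines()
first = raw_lines[0]

try:
    first.decode("UTF-32-LE")
    decode_failed = False
except UnicodeDecodeError:
    # 21 bytes is not a multiple of 4, so the last code unit is truncated.
    decode_failed = True

# Joining the pieces back together restores a decodable byte string.
rejoined = b"".join(raw_lines).decode("UTF-32-LE")
```

Here the first line is 21 bytes long and the second begins with b'\x00\x00\x00', which is why neither piece decodes on its own.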
msg291664 - Author: Julian Taylor (jtaylor) Date: 2017-04-14 14:28
see also http://bugs.python.org/issue17083
History
Date User Action Args
2017-04-14 14:28:43 jtaylor set messages: + msg291664
2017-04-14 14:27:46 jtaylor set status: open -> closed; messages: + msg291663; stage: resolved
2017-04-14 14:18:56 jtaylor create