classification
Title: When using bz2 and lzma in mode 'wt', the BOM is not written
Type: behavior Stage:
Components: IO Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, ezio.melotti, janluke, lemburg, martin.panter
Priority: normal Keywords:

Created on 2019-03-15 13:19 by janluke, last changed 2019-03-18 16:16 by vstinner.

Files
File name Uploaded Description
demonstrate_BOM_issue.py janluke, 2019-03-15 13:19 Demonstrate the issue
Messages (4)
msg337987 - (view) Author: Gianluca (janluke) Date: 2019-03-15 13:19
When bz2 and lzma files are opened in text write mode (wrapped in a TextIOWrapper), the BOM of encodings such as utf-16 and utf-32 is not written. The gzip module works as expected (it writes the BOM).

The code that demonstrates this behavior (tested with Python 3.7) is attached here and can also be found on Stack Overflow: https://stackoverflow.com/questions/55171439/python-bz2-and-lzma-in-mode-wt-dont-write-the-bom-while-gzip-does-why?noredirect=1#comment97103212_55171439
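
The attached script is not reproduced in this report. The following is only a minimal sketch of the kind of check described above, not the attachment itself. It assumes in-memory io.BytesIO buffers instead of files on disk, and the helper name written_payload is made up for illustration. Whether bz2 and lzma print True or False depends on the Python version being run.

```python
import bz2
import codecs
import gzip
import io
import lzma


def written_payload(module, text="HI", encoding="utf-16"):
    """Write text via module.open(..., 'wt') and return the decompressed bytes."""
    raw = io.BytesIO()  # in-memory stand-in for a file on disk (assumption)
    with module.open(raw, "wt", encoding=encoding) as f:
        f.write(text)
    # The compressed-file wrappers do not close a file object passed in by
    # the caller, so the buffer can still be read here.
    return module.decompress(raw.getvalue())


for module in (gzip, bz2, lzma):
    payload = written_payload(module)
    # codecs.BOM_UTF16 is the BOM in the platform's native byte order.
    has_bom = payload.startswith(codecs.BOM_UTF16)
    print(f"{module.__name__}: BOM written = {has_bom}")
```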
msg338001 - (view) Author: Gianluca (janluke) Date: 2019-03-15 16:41
As one can read in the Stack Overflow answer, using _pyio.TextIOWrapper works as expected. So it looks like this is a bug in _io.TextIOWrapper.
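
A minimal sketch of that observation, assuming an in-memory buffer and a BZ2File opened in binary write mode: the pure-Python _pyio.TextIOWrapper is placed around it by hand. Note that _pyio is a private module, so this is an illustration of the behavioural difference rather than a recommended API.

```python
import _pyio
import bz2
import codecs
import io

raw = io.BytesIO()  # in-memory stand-in for a file on disk (assumption)

# Wrap the binary compressed stream in the pure-Python text wrapper.
with _pyio.TextIOWrapper(bz2.BZ2File(raw, "w"), encoding="utf-16") as text_file:
    text_file.write("HI")

payload = bz2.decompress(raw.getvalue())
bom_written = payload.startswith(codecs.BOM_UTF16)
print("BOM written with _pyio.TextIOWrapper:", bom_written)
```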
msg338045 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2019-03-16 00:24
I suspect this is caused by TextIOWrapper guessing if it is writing the start of a file versus in the middle, and being confused by “seekable” returning False. GzipFile implements some “seek” calls in write mode, but LZMAFile and BZ2File do not.
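
The difference in “seekable” between the three file classes can be checked directly. This is a small sketch that assumes in-memory buffers as the underlying files; the dictionary names are illustrative only.

```python
import bz2
import gzip
import io
import lzma

# Open each compressed-file class in binary write mode on an in-memory buffer.
writers = {
    "gzip.GzipFile": gzip.GzipFile(fileobj=io.BytesIO(), mode="wb"),
    "bz2.BZ2File": bz2.BZ2File(io.BytesIO(), "w"),
    "lzma.LZMAFile": lzma.LZMAFile(io.BytesIO(), "w"),
}

# Record what each object reports while open for writing.
seekable = {name: f.seekable() for name, f in writers.items()}

for f in writers.values():
    f.close()

for name, value in seekable.items():
    print(name, "seekable ->", value)
```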

Using this test class:

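import io, _pyio
from io import BufferedIOBase

class Writer(BufferedIOBase):
    # offset=None simulates a non-seekable stream;
    # an integer offset is what tell() will report.
    def writable(self):
        return True
    def __init__(self, offset):
        self.offset = offset
    def seekable(self):
        result = self.offset is not None
        print('seekable ->', result)
        return result
    def tell(self):
        print('tell ->', self.offset)
        return self.offset
    def write(self, data):
        # Show the exact bytes TextIOWrapper hands to the binary layer.
        print('write', repr(data))
        return len(data)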

a BOM is inserted when “tell” returns zero:

>>> t = io.TextIOWrapper(Writer(0), 'utf-16')
seekable -> True
tell -> 0
>>> t.write('HI'); t.flush()  # Writes BOM
2
write b'\xff\xfeH\x00I\x00'

and not when “tell” returns a positive number:

>>> t = io.TextIOWrapper(Writer(1), 'utf-16')
seekable -> True
tell -> 1
>>> t.write('HI'); t.flush()  # Omits BOM
2
write b'H\x00I\x00'

However the “io” and “_pyio” behaviours differ when “seekable” returns False:

>>> t = io.TextIOWrapper(Writer(None), 'utf-16')
seekable -> False
>>> t.write('HI'); t.flush()  # io omits BOM
2
write b'H\x00I\x00'
>>> t = _pyio.TextIOWrapper(Writer(None), 'utf-16')
seekable -> False
>>> t.write('HI'); t.flush()  # _pyio writes BOM
write b'\xff\xfeH\x00I\x00'
2

IMO the “_pyio” behaviour is more sensible: write a BOM because that’s what the UTF-16 codec produces.
msg338241 - (view) Author: Gianluca (janluke) Date: 2019-03-18 15:27
In case the file is not seekable, we could decide based on the file mode:
- if mode='w', write the BOM
- if mode='a', don't write the BOM

Of course, mode "a" doesn't guarantee that we are in the middle of the file, but not writing the BOM when "appending" to a file seems like consistent behavior.
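
The rule proposed above could be sketched as a small decision function. This is a hypothetical helper, not CPython code: the name should_write_bom and its parameters are invented for illustration, and treating every mode without "a" (for example "x") like "w" is an assumption.

```python
def should_write_bom(mode, seekable, position=None):
    """Hypothetical sketch of the proposed rule; not part of any library.

    mode     -- file mode string such as 'w' or 'a'
    seekable -- what the underlying stream's seekable() reports
    position -- what tell() reports; only consulted when seekable is True
    """
    if seekable:
        # Existing behaviour: a BOM belongs only at offset zero.
        return position == 0
    # Proposed fallback for non-seekable streams: decide from the mode.
    # Assumption: any mode without 'a' is treated like 'w'.
    return "a" not in mode


print(should_write_bom("w", seekable=False))
print(should_write_bom("a", seekable=False))
```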
History
Date User Action Args
2019-03-18 16:16:40 vstinner set nosy: - vstinner
2019-03-18 15:27:57 janluke set messages: + msg338241
2019-03-16 00:24:42 martin.panter set nosy: + martin.panter
messages: + msg338045
2019-03-15 21:13:39 terry.reedy set nosy: + lemburg, vstinner, benjamin.peterson, ezio.melotti
2019-03-15 16:41:33 janluke set messages: + msg338001
2019-03-15 13:19:03 janluke create