classification
Title: gzip.open(filename, "rt") fails on Python 2.7.11 on win32, invalid mode rtb
Type:
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, martin.panter, maubp, r.david.murray
Priority: normal
Keywords:

Created on 2017-04-07 09:29 by maubp, last changed 2017-04-08 08:04 by maubp. This issue is now closed.

Messages (8)
msg291259 - (view) Author: Peter (maubp) Date: 2017-04-07 09:29
Under Python 2, gzip.open defaults to giving (non-unicode) strings.

Under Python 3, gzip.open defaults to giving bytes. Therefore it was fixed to allow text mode to be specified, see http://bugs.python.org/issue13989

In order to write Python 2 and 3 compatible code to get strings from gzip, I now use:

>>> import gzip
>>> handle = gzip.open(filename, "rt")

In general mode="rt" works great, but I just found this fails under Windows XP running Python 2.7, example below using the following gzipped plain text file:

https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.gbk.gz

This works perfectly on Linux, giving strings on both Python 2 and 3. Note that I am printing with repr to confirm we have a string object:

$ python2.7 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
2.7.10 (default, Sep 28 2015, 13:58:31) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)]

Also with a slightly newer Python 2.7,

$ /mnt/apps/python/2.7/bin/python  -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
2.7.13 (default, Mar  9 2017, 15:07:48) 
[GCC 4.9.2 20150212 (Red Hat 4.9.2-6)]

$ python3.5 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
3.5.0 (default, Sep 28 2015, 11:25:31) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)]

$ python3.4 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
3.4.3 (default, Aug 21 2015, 11:12:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)]

$ python3.3 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
3.3.0 (default, Nov  7 2012, 21:52:39) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)]


This works perfectly on macOS giving strings on both Python 2 and 3:


$ python2.7 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
2.7.10 (default, Jul 30 2016, 19:40:32) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]

$ python3.6 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]


This works perfectly on Python 3 running on Windows XP,


C:\repositories\biopython\Doc\examples>c:\Python33\python.exe -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:37:12) [MSC v.1600 32 bit (Intel)]

C:\repositories\biopython\Doc\examples> C:\Python34\python.exe -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS       Z78533                   740 bp    DNA     linear   PLN 30-NOV-2006\n'
3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]



However, it fails on Windows XP running Python 2.7.11 and (after upgrading) Python 2.7.13:


C:\repositories\biopython\Doc\examples>c:\Python27\python -c "import sys; print(sys.version); import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readlines()))"
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\Python27\lib\gzip.py", line 34, in open
    return GzipFile(filename, mode, compresslevel)
  File "c:\Python27\lib\gzip.py", line 94, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
ValueError: Invalid mode ('rtb')
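
The 'rtb' in the traceback appears to come from the way Python 2.7's GzipFile normalizes the mode before calling the built-in open: it appends 'b' whenever the mode does not already contain one. Below is a minimal sketch of that step, paraphrased rather than copied from gzip.py. The function name effective_mode is hypothetical and exists only for illustration.

```python
def effective_mode(mode):
    """Approximate the mode string Python 2.7's GzipFile passes to open().

    Hypothetical helper for illustration; it is not part of the gzip module.
    """
    if mode:
        # Universal-newline mode is dropped so compressed data is not altered.
        mode = mode.replace("U", "")
    if mode and "b" not in mode:
        # Binary mode is forced, even when "t" was requested.
        mode += "b"
    # With no mode given, the call in the traceback falls back to 'rb'.
    return mode or "rb"


print(effective_mode("rt"))  # rtb
print(effective_mode("r"))   # rb
```

Under this reading, the "t" is never removed, so the underlying open call receives both the text and the binary flag at once.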


Note that the strangely contradictory mode seems to be accepted by Python 2.7 under Linux or macOS:


$ python
Python 2.7.10 (default, Sep 28 2015, 13:58:31) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open('ls_orchid.gbk.gz', 'rt')
<gzip open file 'ls_orchid.gbk.gz', mode 'rtb' at 0x7f9af30c2f60 0x7f9aed1e5e50>
>>> quit()


$ python2.7
Python 2.7.10 (default, Jul 30 2016, 19:40:32) 
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open('ls_orchid.gbk.gz', 'rt')
<gzip open file 'ls_orchid.gbk.gz', mode 'rtb' at 0x10282c6f0 0x10287ef10>
>>> quit()
msg291279 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-04-07 17:28
I don't think this is really a bug, I think it's a consequence of the different byte/string models of python2 and python3 coupled with the different binary/text models of posix and windows.
msg291283 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-04-07 18:23
In Python 3, gzip.open(filename, "rt") returns a TextIOWrapper using the system's default encoding. The decoded output is potentially very different from the byte string returned by 'text mode' in Python 2, even if using "rt" mode didn't result in the nonsensical "rtb" mode. I suggest using the default binary mode, and manually wrapping the file in an io.TextIOWrapper.
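
A minimal sketch of that suggestion, written with Python 3 in mind. The temporary file, its contents, and the "utf-8" encoding are assumptions made so the example is self-contained. On Python 2.7 the GzipFile may additionally need to be wrapped in io.BufferedReader before io.TextIOWrapper will accept it; that is an assumption, not something confirmed in this thread.

```python
import gzip
import io
import os
import tempfile

# Create a small gzipped text file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.txt.gz")
with gzip.open(path, "wb") as out:
    out.write(b"LOCUS       Z78533\nsecond line\n")

# Open in the default binary mode, then decode explicitly.
# The encoding is an assumption; use whatever the file really contains.
with io.TextIOWrapper(gzip.open(path, "rb"), encoding="utf-8") as handle:
    first_line = handle.readline()

print(repr(first_line))
```

Passing the encoding explicitly avoids depending on the system's default encoding, which is the difference from "rt" mode pointed out above.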
msg291293 - (view) Author: Peter (maubp) Date: 2017-04-07 20:53
I want a simple cross platform (Linux/Mac/Windows) and cross version (Python 2/3) way to be able to open a gzipped file and get a string handle (default encoding TextIOWrapper under Python 3 is fine). My use-case is specifically for documentation examples.

Previously I used gzip.open(filename) but with the introduction of Python 3 that stopped working because the Python 3 default was to give you bytes.

Thanks to http://bugs.python.org/issue13989, switching to gzip.open(filename, "rt") almost covered my use case, leaving Python 2 on Windows as the odd one out.

I propose that under Python 2.7, gzip.open explicitly accept but ignore "t" as part of the mode argument, in order to allow cross-platform code to work nicely.

That is, formalise the observed Python 2.7 behaviour under Linux and Mac, which ignores the "t", and change Windows so that it ignores the "t" as well.
msg291300 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-04-07 21:51
You want to hack a fake text mode, which won't do new-line translation or treat ^Z (0x1a) as EOF like a normal 2.x text mode on Windows. Can't you just use io.TextIOWrapper(gzip.open(filename))? This reads Unicode.
msg291303 - (view) Author: Peter (maubp) Date: 2017-04-07 22:13
A workaround for my use case is even simpler, something like this:

try:
    handle = gzip.open(filename, "rt")
except ValueError:
    # Workaround for Python 2.7 under Windows
    handle = gzip.open(filename, "r")
    
However, even this is troublesome for use in documentation intended to work on Python 2 and 3, over Linux, Mac and Windows.
msg291307 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-04-07 23:23
I agree this is not a bug. It is just one of the unfortunate compatibility breaks between Py 2 and 3. Mode="rt" is not one of the values that are supported according to the documentation; adding support would be a new feature.

I understand the file mode handling is stricter on Windows because the underlying OS or C library would crash.

To have code that works with Py 2 and 3, I would switch the mode depending on the version of Python:

if sys.version_info >= (3,):
    handle = gzip.open(filename, "rt")
else:
    handle = gzip.open(filename)
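
For documentation examples, the version switch above could be folded into one small function so each example only needs a single call. This is a sketch under that assumption; the name open_gzip_text and the temporary demo file are hypothetical.

```python
import gzip
import os
import sys
import tempfile


def open_gzip_text(filename):
    """Hypothetical helper: open a gzipped file for reading native strings."""
    if sys.version_info >= (3,):
        # Python 3: "rt" returns a text wrapper that yields str.
        return gzip.open(filename, "rt")
    # Python 2: the default mode already yields (byte) str objects.
    return gzip.open(filename)


# Self-contained demo using a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo_path, "wb") as out:
    out.write(b"hello\n")

handle = open_gzip_text(demo_path)
line = handle.readline()
handle.close()
print(repr(line))
```

On Python 3 the text is decoded with the default encoding, so the caveat raised earlier in this thread about differing output between Python 2 and 3 still applies.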
msg291325 - (view) Author: Peter (maubp) Date: 2017-04-08 08:04
OK, thanks. Given this is regarded as an enhancement rather than a bug fix, I understand the choice not to change this in Python 2.7.
History
Date                 User            Action  Args
2017-04-08 08:04:22  maubp           set     messages: + msg291325
2017-04-07 23:23:14  martin.panter   set     status: open -> closed
                                             nosy: + martin.panter
                                             messages: + msg291307
                                             resolution: not a bug
                                             stage: resolved
2017-04-07 22:13:25  maubp           set     messages: + msg291303
2017-04-07 21:51:12  eryksun         set     messages: + msg291300
2017-04-07 20:53:42  maubp           set     messages: + msg291293
2017-04-07 18:23:44  eryksun         set     nosy: + eryksun
                                             messages: + msg291283
2017-04-07 17:28:25  r.david.murray  set     nosy: + r.david.murray
                                             messages: + msg291279
2017-04-07 09:29:47  maubp           create