Under Python 2, gzip.open defaults to giving (non-unicode) strings.
Under Python 3, gzip.open defaults to giving bytes. Therefore it was fixed to allow text mode to be specified, see http://bugs.python.org/issue13989
In order to write Python 2 and 3 compatible code to get strings from gzip, I now use:
>>> import gzip
>>> handle = gzip.open(filename, "rt")
In general mode="rt" works great, but I just found that it fails under Windows XP running Python 2.7. The examples below use the following gzipped plain text file:
https://github.com/biopython/biopython/blob/master/Doc/examples/ls_orchid.gbk.gz
This works perfectly on Linux, giving strings on both Python 2 and 3 (note I am printing with repr to confirm we have a string object):
$ python2.7 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
2.7.10 (default, Sep 28 2015, 13:58:31)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)]
Also with a slightly newer Python 2.7,
$ /mnt/apps/python/2.7/bin/python -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
2.7.13 (default, Mar 9 2017, 15:07:48)
[GCC 4.9.2 20150212 (Red Hat 4.9.2-6)]
$ python3.5 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
3.5.0 (default, Sep 28 2015, 11:25:31)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)]
$ python3.4 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
3.4.3 (default, Aug 21 2015, 11:12:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)]
$ python3.3 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
3.3.0 (default, Nov 7 2012, 21:52:39)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)]
This works perfectly on macOS giving strings on both Python 2 and 3:
$ python2.7 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)]
$ python3.6 -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
3.6.0 (v3.6.0:41df79263a11, Dec 22 2016, 17:23:13)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
This works perfectly on Python 3 running on Windows XP,
C:\repositories\biopython\Doc\examples>c:\Python33\python.exe -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
3.3.5 (v3.3.5:62cf4e77f785, Mar 9 2014, 10:37:12) [MSC v.1600 32 bit (Intel)]
C:\repositories\biopython\Doc\examples>C:\Python34\python.exe -c "import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readline())); import sys; print(sys.version)"
'LOCUS Z78533 740 bp DNA linear PLN 30-NOV-2006\n'
3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)]
However, it fails on Windows XP running Python 2.7.11 and (after upgrading) Python 2.7.13:
C:\repositories\biopython\Doc\examples>c:\Python27\python -c "import sys; print(sys.version); import gzip; print(repr(gzip.open('ls_orchid.gbk.gz', 'rt').readlines()))"
2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:42:59) [MSC v.1500 32 bit (Intel)]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "c:\Python27\lib\gzip.py", line 34, in open
return GzipFile(filename, mode, compresslevel)
File "c:\Python27\lib\gzip.py", line 94, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
ValueError: Invalid mode ('rtb')
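As far as I can tell from reading gzip.py, the 'rtb' comes from GzipFile appending "b" to any mode that does not already contain it before calling the built-in open. The following is only my illustrative re-creation of that adjustment (the function name py27_gzip_mode is hypothetical, and this is not the standard library code itself):

```python
def py27_gzip_mode(mode):
    """Sketch of how Python 2.7's GzipFile appears to adjust the mode.

    Hypothetical helper for illustration only; it mimics the behaviour
    implied by the traceback rather than quoting the stdlib source.
    """
    if mode:
        # Universal newlines are stripped so the raw file stays binary.
        mode = mode.replace("U", "")
    if mode and "b" not in mode:
        # Force binary mode on the underlying file object.
        mode += "b"
    # Fall back to 'rb' when no mode is given, as seen in the traceback.
    return mode or "rb"


print(py27_gzip_mode("rt"))
```

With "rt" as input this gives the contradictory "rtb" that the built-in open rejects on Windows.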
Note that the strangely contradictory mode seems to be accepted by Python 2.7 under Linux or macOS:
$ python
Python 2.7.10 (default, Sep 28 2015, 13:58:31)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open('ls_orchid.gbk.gz', 'rt')
<gzip open file 'ls_orchid.gbk.gz', mode 'rtb' at 0x7f9af30c2f60 0x7f9aed1e5e50>
>>> quit()
$ python2.7
Python 2.7.10 (default, Jul 30 2016, 19:40:32)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import gzip
>>> gzip.open('ls_orchid.gbk.gz', 'rt')
<gzip open file 'ls_orchid.gbk.gz', mode 'rtb' at 0x10282c6f0 0x10287ef10>
>>> quit()
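For reference, a possible workaround is to avoid passing "rt" on Python 2 at all. This is a minimal sketch, assuming that the (non-unicode) strings Python 2 returns in binary mode are acceptable and that no newline translation is needed; the helper name open_gzip_text is my own and not part of the gzip module:

```python
import gzip
import sys


def open_gzip_text(filename):
    """Open a gzipped file for reading native strings (hypothetical helper).

    Assumption: on Python 2 the plain str objects returned in binary
    mode are what the caller wants, so "rt" is never passed there.
    """
    if sys.version_info[0] < 3:
        # Python 2: binary mode already yields (non-unicode) strings,
        # and "rb" avoids the invalid 'rtb' mode on Windows.
        return gzip.open(filename, "rb")
    # Python 3: text mode is supported and yields str rather than bytes.
    return gzip.open(filename, "rt")
```

Usage would then be handle = open_gzip_text('ls_orchid.gbk.gz') in place of gzip.open('ls_orchid.gbk.gz', 'rt'). Note that under Python 2 on Windows this would not translate any "\r\n" line endings inside the compressed data.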