This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: fileinput.hook_compressed returning bytes from gz file
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.10
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: eric.araujo Nosy List: amaury.forgeotdarc, eric.araujo, jaraco, methane, mnewman, r.david.murray, tati_alchueyr
Priority: normal Keywords: patch

Created on 2009-04-14 23:48 by mnewman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
example.zip mnewman, 2009-04-14 23:47 ZIP file containing test example
issue5758_fileinput_gzip_with_encoding.patch tati_alchueyr, 2012-03-13 00:10 Fixed issue #5758. fileinput.hook_compressed returns string (and not bytes) from gz file again. review
issue5758_fileinput_gzip_with_encoding_v2.patch tati_alchueyr, 2012-03-13 07:36 New version of the patch after reviewer's considerations review
Pull Requests
URL Status Linked Edit
PR 25272 merged methane, 2021-04-08 03:31
Messages (21)
msg85978 - (view) Author: Michael Newman (mnewman) Date: 2009-04-14 23:47
The attached ZIP file contains "test.bat" which runs "test.py" with
Python 2.6 and Python 3.0.

Python 2.6 behaves as expected (see "py26.out"), since it returns
strings from both "mike.txt" and "mike.txt.gz". However, the same test
with Python 3.0 returns bytes from "mike.txt.gz", as shown in "py30.out":
Output: Hello from Mike.
Output: This is the second line.
Output: Why did the robot cross the road?
Output: b'Hello from Mike.'
Output: b'This is the second line.'
Output: b'Why did the robot cross the road?'

For reference, I tested this on Python versions:
Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on win32
Python 3.0.1 (r301:69561, Feb 13 2009, 20:04:18) [MSC v.1500 32 bit
(Intel)] on win32
msg111063 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-21 14:23
gzip.open() only implements the "rb" mode, and returns bytes.
fileinput could certainly wrap it with a io.TextIOWrapper.

Then the encoding issue arises.
fileinput.FileInput should grow an "encoding" parameter instead of always relying on the default encoding.
msg155471 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-12 20:09
Working on this
msg155532 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-13 00:10
I had a lot of help of R. David Murray, could someone review our patch, please?

We are specially concerned about backward compatibility.
msg155538 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 00:30
Thanks for the patch.  I commented on not-so-important things on the code review site (which should have sent an email to Tatiana) and wanted to raise an issue here.

Is it best that the encoding used defaults to Python’s default encoding (UTF-8) or the I/O default encoding (depends on the OS and locale), like what open does?
msg155540 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 00:34
I think it should use the same default as open, but frankly I couldn't remember the right way to achieve that.  Suggestions welcome :)
msg155543 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 00:40
We can use locale.getpreferredencoding, or be smarter and just pass down encoding=encoding to TextIOWrapper and let *it* call getpreferredencoding if it’s given a None :)
msg155548 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-13 01:03
Excellent! Thanks for all remarks, Éric! I'll follow your suggestions and will submit a new patch.
msg155553 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 01:26
Duh.  I should have remembered that the io module used None to represent the default encoding.  Thanks, Éric.
msg155554 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 01:27
Nobody can remember everything: I tried in a shell and looked at the docs to find it :)
msg155577 - (view) Author: Tatiana Al-Chueyr (tati_alchueyr) * Date: 2012-03-13 07:36
Improved patch, according to eric.araujo's suggestions and mnewman's guidance.
msg155595 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 12:11
Thanks.  Unless another core dev wants to do a complementary review I will slightly tweak the patch and commit it.  I need to finish waking up and eat some food before I do that :)

Technically adding a new argument means that this is a new feature and cannot be applied to the stable 3.2 version, but something needs to be done for this bug in 3.2 too, like a recipe in the docs for a hook_compressed that returns strings (i.e. a function that calls fileinput.hook_compressed and wraps it in a TextIOWrapper), or at least a note to warn about this bug.
msg155618 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 14:41
For 3.2 could we use the same fix, but without exposing the ability to *change* the encoding?  That is, we use TextIOWrapper but always with the default None for encoding.

It also occurs to me that this really exposes a weakness in the design.  What if the user wants to specify other open parameters?  I wonder if we should say that for better future-proofing openhooks should always take **kw.  You could even envision fileinput accepting **kw and passing them along to the openhook.  I think charset is the most important open paramenter in this context, though, so I don't think we have to solve the general problem in this fix.
msg155622 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-03-13 15:11
It also occurs to me that this fix makes the charset hook look rather odd.  We could render it redundant by passing charset to open in the non-openhook case, and mark it deprecated.

There is also a bug in the hook_encoding docs.  It says the file is opened with codecs.open, but that is not the case, regular open is used.
msg155624 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-03-13 15:20
> For 3.2 could we use the same fix, but without exposing the ability to *change* the encoding?
> That is, we use TextIOWrapper but always with the default None for encoding.
Yes!

> It also occurs to me that this really exposes a weakness in the design.  What if the user wants to
> specify other open parameters?  I wonder if we should say that for better future-proofing openhooks
> should always take **kw.  You could even envision fileinput accepting **kw and passing them along
> to the openhook.  I think charset is the most important open paramenter in this context, though, so
> I don't think we have to solve the general problem in this fix.
I concur.  I’ve never had to care about buffering for example, but mode is another parameter of open that people may want to give.  I’ll commit the minimal fix to 3.2 and merge in 3.3, and then we can discuss on a new RFE bug about adding encoding vs. **kwargs for 3.3.

Agreed on deprecating the charset hook when it becomes redundant.

Will fix the doc bug about codecs.open too.
msg348642 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-07-29 11:54
This issue is not newcomer friendly, I remove the easy keyword.
msg394286 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2021-05-25 02:32
The patch for this change has broken code that relied on the old behavior. For example:

```
import fileinput
import bz2
import pathlib

target = pathlib.Path('data.bz2')
target.write_bytes(bz2.compress(b'Foo\nBar\nBiz'))

inp = fileinput.FileInput([str(target)], openhook=fileinput.hook_compressed)
for line in inp:
    print(line.decode().strip())
```

That code passes on Python3.10a6 but on 3.10b1 fails with:

```
Traceback (most recent call last):
  File "/Users/jaraco/code/main/cmdix/text.py", line 10, in <module>
    print(line.decode().strip())
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?
```

I encountered this issue in [this function and its test](https://github.com/jaraco/cmdix/blob/dc5fac3817ff9815b2f8d9a1dfad4258c14b1693/cmdix/lib.py#L165-L193).

I'm struggling with how to adapt the code to provide a uniform interface on Python 3.6+. Perhaps a backport of hook_compressed is in order.
msg394287 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2021-05-25 02:59
A backport now exists (https://pypi.org/project/backports.hook_compressed) and addresses the issue (https://github.com/jaraco/cmdix/actions/runs/873404846).
msg394290 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-05-25 04:04
> I'm struggling with how to adapt the code to provide a uniform interface on Python 3.6+.

It's easy. Just passing `mode="rb"`.

```
FileInput(filelist, mode="rb", openhook=fileinput.hook_compressed)
```

This is better than `backports.hook_compressed`, because it returns bytes always regardless file is compressed or uncompressed.
msg394338 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2021-05-25 12:37
I did consider and confirm that mode="rb" does also provide a uniform solution, but it also regresses the behavior of uncompressed inputs, making them bytes where they were text. This approach feels like the "Python 1" compatibility approach.

At least in my case, the preference was to migrate to the future behavior, where all content is decoded to Unicode text, without having to wait for Python 3.9 to sunset.
msg394413 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021-05-26 03:46
> I did consider and confirm that mode="rb" does also provide a uniform solution, but it also regresses the behavior of uncompressed inputs, making them bytes where they were text.

Of course, I suggested to use "rb" when you want to read bytes.

* When reading binary from uncompressed file, mode="rb" just works for 3.6+.
* When reading text from uncompressed file, mode="r" just works for 3.6+.
* When reading binary from compressed file, mode="rb" just works for 3.6+.
* Reading text from compressed file is not supported until Python 3.10.
History
Date User Action Args
2022-04-11 14:56:47adminsetgithub: 50008
2021-05-26 03:46:18methanesetmessages: + msg394413
2021-05-25 12:37:27jaracosetmessages: + msg394338
2021-05-25 04:04:56methanesetmessages: + msg394290
2021-05-25 02:59:20jaracosetmessages: + msg394287
2021-05-25 02:32:44jaracosetnosy: + jaraco
messages: + msg394286
2021-04-14 05:13:39methanesetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2021-04-08 03:31:16methanesetnosy: + methane

pull_requests: + pull_request24008
stage: needs patch -> patch review
2021-04-08 01:34:40methanelinkissue36865 superseder
2021-04-07 10:01:51vstinnersetnosy: - vstinner
2021-04-07 07:54:20methanesetversions: + Python 3.10, - Python 3.2, Python 3.3
2019-07-29 11:54:09vstinnersetkeywords: - easy

messages: + msg348642
2012-03-13 15:20:46eric.araujosetmessages: + msg155624
2012-03-13 15:11:24r.david.murraysetmessages: + msg155622
2012-03-13 14:41:45r.david.murraysetmessages: + msg155618
2012-03-13 12:11:43eric.araujosetassignee: eric.araujo
messages: + msg155595
versions: + Python 3.3, - Python 3.1
2012-03-13 07:36:44tati_alchueyrsetfiles: + issue5758_fileinput_gzip_with_encoding_v2.patch

messages: + msg155577
2012-03-13 01:27:17eric.araujosetmessages: + msg155554
2012-03-13 01:26:25r.david.murraysetmessages: + msg155553
2012-03-13 01:03:13tati_alchueyrsetmessages: + msg155548
2012-03-13 00:40:43eric.araujosetmessages: + msg155543
2012-03-13 00:34:20r.david.murraysetnosy: + r.david.murray
messages: + msg155540
2012-03-13 00:30:35eric.araujosetnosy: + eric.araujo, vstinner
messages: + msg155538
2012-03-13 00:10:05tati_alchueyrsetfiles: + issue5758_fileinput_gzip_with_encoding.patch
keywords: + patch
messages: + msg155532
2012-03-12 20:09:15tati_alchueyrsetnosy: + tati_alchueyr
messages: + msg155471
2010-07-21 14:23:43amaury.forgeotdarcsetversions: - Python 2.7
nosy: + amaury.forgeotdarc

messages: + msg111063

keywords: + easy
2010-07-10 16:20:54BreamoreBoysetstage: needs patch
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 3.0
2009-04-14 23:48:01mnewmancreate