Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fileinput.hook_compressed returning bytes from gz file #50008

Closed
mnewman mannequin opened this issue Apr 14, 2009 · 21 comments
Closed

fileinput.hook_compressed returning bytes from gz file #50008

mnewman mannequin opened this issue Apr 14, 2009 · 21 comments
Assignees
Labels
3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@mnewman
Copy link
Mannequin

mnewman mannequin commented Apr 14, 2009

BPO 5758
Nosy @amauryfa, @jaraco, @merwok, @bitdancer, @methane
PRs
  • bpo-43712 : fileinput: Add encoding parameter #25272
  • Files
  • example.zip: ZIP file containing test example
  • issue5758_fileinput_gzip_with_encoding.patch: Fixed issue fileinput.hook_compressed returning bytes from gz file #50008. fileinput.hook_compressed returns string (and not bytes) from gz file again.
  • issue5758_fileinput_gzip_with_encoding_v2.patch: New version of the patch after reviewer's considerations
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/merwok'
    closed_at = <Date 2021-04-14.05:13:39.981>
    created_at = <Date 2009-04-14.23:48:01.306>
    labels = ['type-bug', 'library', '3.10']
    title = 'fileinput.hook_compressed returning bytes from gz file'
    updated_at = <Date 2021-05-26.03:46:18.079>
    user = 'https://bugs.python.org/mnewman'

    bugs.python.org fields:

    activity = <Date 2021-05-26.03:46:18.079>
    actor = 'methane'
    assignee = 'eric.araujo'
    closed = True
    closed_date = <Date 2021-04-14.05:13:39.981>
    closer = 'methane'
    components = ['Library (Lib)']
    creation = <Date 2009-04-14.23:48:01.306>
    creator = 'mnewman'
    dependencies = []
    files = ['13688', '24807', '24815']
    hgrepos = []
    issue_num = 5758
    keywords = ['patch']
    message_count = 21.0
    messages = ['85978', '111063', '155471', '155532', '155538', '155540', '155543', '155548', '155553', '155554', '155577', '155595', '155618', '155622', '155624', '348642', '394286', '394287', '394290', '394338', '394413']
    nosy_count = 7.0
    nosy_names = ['amaury.forgeotdarc', 'jaraco', 'eric.araujo', 'r.david.murray', 'mnewman', 'methane', 'tati_alchueyr']
    pr_nums = ['25272']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue5758'
    versions = ['Python 3.10']

    @mnewman
    Copy link
    Mannequin Author

    mnewman mannequin commented Apr 14, 2009

    The attached ZIP file contains "test.bat" which runs "test.py" with
    Python 2.6 and Python 3.0.

    Python 2.6 behaves as expected (see "py26.out"), since it returns
    strings from both "mike.txt" and "mike.txt.gz". However, the same test
    with Python 3.0 returns bytes from "mike.txt.gz", as shown in "py30.out":
    Output: Hello from Mike.
    Output: This is the second line.
    Output: Why did the robot cross the road?
    Output: b'Hello from Mike.'
    Output: b'This is the second line.'
    Output: b'Why did the robot cross the road?'

    For reference, I tested this on Python versions:
    Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit
    (Intel)] on win32
    Python 3.0.1 (r301:69561, Feb 13 2009, 20:04:18) [MSC v.1500 32 bit
    (Intel)] on win32

    @mnewman mnewman mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 14, 2009
    @amauryfa
    Copy link
    Member

    gzip.open() only implements the "rb" mode, and returns bytes.
    fileinput could certainly wrap it with a io.TextIOWrapper.

    Then the encoding issue arises.
    fileinput.FileInput should grow an "encoding" parameter instead of always relying on the default encoding.

    @amauryfa amauryfa added the easy label Jul 21, 2010
    @tatialchueyr
    Copy link
    Mannequin

    tatialchueyr mannequin commented Mar 12, 2012

    Working on this

    @tatialchueyr
    Copy link
    Mannequin

    tatialchueyr mannequin commented Mar 13, 2012

    I had a lot of help of R. David Murray, could someone review our patch, please?

    We are specially concerned about backward compatibility.

    @merwok
    Copy link
    Member

    merwok commented Mar 13, 2012

    Thanks for the patch. I commented on not-so-important things on the code review site (which should have sent an email to Tatiana) and wanted to raise an issue here.

    Is it best that the encoding used defaults to Python’s default encoding (UTF-8) or the I/O default encoding (depends on the OS and locale), like what open does?

    @bitdancer
    Copy link
    Member

    I think it should use the same default as open, but frankly I couldn't remember the right way to achieve that. Suggestions welcome :)

    @merwok
    Copy link
    Member

    merwok commented Mar 13, 2012

    We can use locale.getpreferredencoding, or be smarter and just pass down encoding=encoding to TextIOWrapper and let *it* call getpreferredencoding if it’s given a None :)

    @tatialchueyr
    Copy link
    Mannequin

    tatialchueyr mannequin commented Mar 13, 2012

    Excellent! Thanks for all remarks, Éric! I'll follow your suggestions and will submit a new patch.

    @bitdancer
    Copy link
    Member

    Duh. I should have remembered that the io module used None to represent the default encoding. Thanks, Éric.

    @merwok
    Copy link
    Member

    merwok commented Mar 13, 2012

    Nobody can remember everything: I tried in a shell and looked at the docs to find it :)

    @tatialchueyr
    Copy link
    Mannequin

    tatialchueyr mannequin commented Mar 13, 2012

    Improved patch, according to eric.araujo's suggestions and mnewman's guidance.

    @merwok
    Copy link
    Member

    merwok commented Mar 13, 2012

    Thanks. Unless another core dev wants to do a complementary review I will slightly tweak the patch and commit it. I need to finish waking up and eat some food before I do that :)

    Technically adding a new argument means that this is a new feature and cannot be applied to the stable 3.2 version, but something needs to be done for this bug in 3.2 too, like a recipe in the docs for a hook_compressed that returns strings (i.e. a function that calls fileinput.hook_compressed and wraps it in a TextIOWrapper), or at least a note to warn about this bug.

    @merwok merwok self-assigned this Mar 13, 2012
    @bitdancer
    Copy link
    Member

    For 3.2 could we use the same fix, but without exposing the ability to *change* the encoding? That is, we use TextIOWrapper but always with the default None for encoding.

    It also occurs to me that this really exposes a weakness in the design. What if the user wants to specify other open parameters? I wonder if we should say that for better future-proofing openhooks should always take **kw. You could even envision fileinput accepting **kw and passing them along to the openhook. I think charset is the most important open paramenter in this context, though, so I don't think we have to solve the general problem in this fix.

    @bitdancer
    Copy link
    Member

    It also occurs to me that this fix makes the charset hook look rather odd. We could render it redundant by passing charset to open in the non-openhook case, and mark it deprecated.

    There is also a bug in the hook_encoding docs. It says the file is opened with codecs.open, but that is not the case, regular open is used.

    @merwok
    Copy link
    Member

    merwok commented Mar 13, 2012

    For 3.2 could we use the same fix, but without exposing the ability to *change* the encoding?
    That is, we use TextIOWrapper but always with the default None for encoding.
    Yes!

    It also occurs to me that this really exposes a weakness in the design. What if the user wants to
    specify other open parameters? I wonder if we should say that for better future-proofing openhooks
    should always take **kw. You could even envision fileinput accepting **kw and passing them along
    to the openhook. I think charset is the most important open paramenter in this context, though, so
    I don't think we have to solve the general problem in this fix.
    I concur. I’ve never had to care about buffering for example, but mode is another parameter of open that people may want to give. I’ll commit the minimal fix to 3.2 and merge in 3.3, and then we can discuss on a new RFE bug about adding encoding vs. **kwargs for 3.3.

    Agreed on deprecating the charset hook when it becomes redundant.

    Will fix the doc bug about codecs.open too.

    @vstinner
    Copy link
    Member

    This issue is not newcomer friendly, I remove the easy keyword.

    @vstinner vstinner removed the easy label Jul 29, 2019
    @methane methane added the 3.10 only security fixes label Apr 7, 2021
    @methane methane closed this as completed Apr 14, 2021
    @jaraco
    Copy link
    Member

    jaraco commented May 25, 2021

    The patch for this change has broken code that relied on the old behavior. For example:

    import fileinput
    import bz2
    import pathlib
    
    target = pathlib.Path('data.bz2')
    target.write_bytes(bz2.compress(b'Foo\nBar\nBiz'))
    
    inp = fileinput.FileInput([str(target)], openhook=fileinput.hook_compressed)
    for line in inp:
        print(line.decode().strip())
    

    That code passes on Python3.10a6 but on 3.10b1 fails with:

    Traceback (most recent call last):
      File "/Users/jaraco/code/main/cmdix/text.py", line 10, in <module>
        print(line.decode().strip())
    AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?
    

    I encountered this issue in this function and its test.

    I'm struggling with how to adapt the code to provide a uniform interface on Python 3.6+. Perhaps a backport of hook_compressed is in order.

    @jaraco
    Copy link
    Member

    jaraco commented May 25, 2021

    @methane
    Copy link
    Member

    methane commented May 25, 2021

    I'm struggling with how to adapt the code to provide a uniform interface on Python 3.6+.

    It's easy. Just passing mode="rb".

    FileInput(filelist, mode="rb", openhook=fileinput.hook_compressed)
    

    This is better than backports.hook_compressed, because it returns bytes always regardless file is compressed or uncompressed.

    @jaraco
    Copy link
    Member

    jaraco commented May 25, 2021

    I did consider and confirm that mode="rb" does also provide a uniform solution, but it also regresses the behavior of uncompressed inputs, making them bytes where they were text. This approach feels like the "Python 1" compatibility approach.

    At least in my case, the preference was to migrate to the future behavior, where all content is decoded to Unicode text, without having to wait for Python 3.9 to sunset.

    @methane
    Copy link
    Member

    methane commented May 26, 2021

    I did consider and confirm that mode="rb" does also provide a uniform solution, but it also regresses the behavior of uncompressed inputs, making them bytes where they were text.

    Of course, I suggested to use "rb" when you want to read bytes.

    • When reading binary from uncompressed file, mode="rb" just works for 3.6+.
    • When reading text from uncompressed file, mode="r" just works for 3.6+.
    • When reading binary from compressed file, mode="rb" just works for 3.6+.
    • Reading text from compressed file is not supported until Python 3.10.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants