Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipfile.ZipExtFile.read() is missing universal newline support #51008

Closed
ryles mannequin opened this issue Aug 22, 2009 · 19 comments
Closed

zipfile.ZipExtFile.read() is missing universal newline support #51008

ryles mannequin opened this issue Aug 22, 2009 · 19 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@ryles
Copy link
Mannequin

ryles mannequin commented Aug 22, 2009

BPO 6759
Nosy @pitrou, @bitdancer, @serhiy-storchaka
Files
  • issue6759_1.diff: Patch for Lib/ziptest.py and Lib/test/test_zipfile.py
  • issue6759_2.diff
  • issue6759_3.diff: Patch 3
  • issue6759_4.diff: Patch 4
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2017-03-07.16:36:02.474>
    created_at = <Date 2009-08-22.06:22:01.396>
    labels = ['type-feature', 'library']
    title = 'zipfile.ZipExtFile.read() is missing universal newline support'
    updated_at = <Date 2017-03-07.16:36:02.472>
    user = 'https://bugs.python.org/ryles'

    bugs.python.org fields:

    activity = <Date 2017-03-07.16:36:02.472>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-03-07.16:36:02.474>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2009-08-22.06:22:01.396>
    creator = 'ryles'
    dependencies = []
    files = ['14772', '14784', '14785', '14787']
    hgrepos = []
    issue_num = 6759
    keywords = ['patch']
    message_count = 19.0
    messages = ['91854', '91880', '91962', '91971', '91976', '91977', '91981', '91987', '91990', '91994', '111491', '111504', '159594', '159608', '159613', '159614', '159755', '225472', '289171']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'r.david.murray', 'ryles', 'agillesp', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue6759'
    versions = ['Python 3.3']

    @ryles
    Copy link
    Mannequin Author

    ryles mannequin commented Aug 22, 2009

    The zipfile.ZipFile.open() behavior with mode 'U' or 'rU' is not quite
    as advertised in

    http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open

    Here is an example:

    $ echo -ne "This is an example\r\nWhich demonstrates a problem\r\nwith
    ZipFile.open(..., 'U')\r\n" > foo.txt
    $ cat -v foo.txt
    This is an example^M
    Which demonstrates a problem^M
    with ZipFile.open(..., 'U')^M
    $ zip foo.zip foo.txt
      adding: foo.txt (deflated 1%)
    $ python
    Python 2.6.2 (r262:71600, Aug 21 2009, 17:52:12)
    [GCC 4.1.2 20071124 (Red Hat 4.1.2-42)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> open("foo.txt", 'U').read()
    "This is an example\nWhich demonstrates a problem\nwith
    ZipFile.open(..., 'U')\n"
    >>> from zipfile import ZipFile
    >>> ZipFile("foo.zip").open("foo.txt", 'U').read()
    "This is an example\r\nWhich demonstrates a problem\r\nwith
    ZipFile.open(..., 'U')\r\n"
    >>>

    The open() method was added here:

    http://bugs.python.org/issue1121142

    The cause is that the universal newline implementation is specific to
    readline(), which also implements readlines() and next() as well.
    Support was never added for read(), which is independent.

    Note that test_zipfile.UniversalNewlineTests.readTest() passes. This is
    suspect because it's explicitly coded to *not* expect translation of new
    line sequences.

    @ryles ryles mannequin added the stdlib Python modules in the Lib dir label Aug 22, 2009
    @agillesp
    Copy link
    Mannequin

    agillesp mannequin commented Aug 22, 2009

    Patch for both zipfile.py and test_zipfile.py attached.

    • The universal newline logic is now in read instead of readline.
    • UniversalNewlineTests.read_test changed to check for \n rather than
      unchanged eol.

    @agillesp agillesp mannequin added the type-bug An unexpected behavior, bug, or error label Aug 22, 2009
    @ryles
    Copy link
    Mannequin Author

    ryles mannequin commented Aug 26, 2009

    Hi Art,

    Thanks for working on this. I've taken a look at the patch.

    The fix to read_test looks correct. Of course, I would consider a more
    descriptive variable name than 'b'.

    The changes to read() are an improvement, but I think we need to be
    careful when we replace "\r\n" with "\n". Basically, we've turned two
    characters into one and are now potentially one character short of
    'size' bytes. This doesn't match the behavior of file.read().

    Another thing to work out is the lack of the 'newlines' attribute,
    discussed in PEP-278.

    I've noted that bz2 seems to do a pretty good job with universal newline
    handling: python/trunk/Modules/bz2module.c.

    It's in C, however, and I don't think there's actually anything in the
    library doing UL in pure Python.

    @bitdancer
    Copy link
    Member

    The python version of the io module does (_pyio.py).

    I've set the stage to 'test needed' because the test needs to be turned
    into a unit test.

    @agillesp
    Copy link
    Mannequin

    agillesp mannequin commented Aug 26, 2009

    Hi Ryan,

    Thanks for the feedback.

    I've attached a new patch that fixes the read(nbytes) behavior--It will
    now always return the requested number of bytes regardless of newline
    replacement. There's now a unit test for this as well.

    I also added the newlines attribute per PEP-278 and a corresponding test.

    I'm not sure I understood David's comment that read_test needed to be
    turned into a unit test: it's called by several of the unit tests
    (test_read_stored, test_read_deflated, et. al.) in that class.

    @bitdancer
    Copy link
    Member

    My apologies, I misread the patch.

    @bitdancer
    Copy link
    Member

    Thanks for working on this.

    Comments on patch:

    (1) I think you should retain the full "ugly check" comment explaining
    how the \r\n spanning a buffer is handled. I think that's a helpful
    explanation for a non-obvious piece of code.

    (2) Your code for setting the newlines attributes and the tests for it
    look wrong to me. The attribute is supposed to list all separators
    seen. I think your code is going to set it to ('\r\n', '\r') when the
    separators are just '\r\n', and your test doesn't check to make sure
    newlines gets set to _only_ what was seen. (Nor does it check for the
    case of actual mixed line endings...a test for which would require some
    additional test hackery since the existing test code didn't try to test
    mixed line end files either.)

    (3) Sorry for being dense, but I don't understand your code to adjust
    the number of bytes in the read method or the test that tests it. I
    haven't tried to play with it, but it looks odd to me that you'd be
    adding '2', since I would think that the read could cover more than one
    line...or am I misunderstanding how 'read(n)' works (which is entirely
    possible)?

    @agillesp
    Copy link
    Mannequin

    agillesp mannequin commented Aug 26, 2009

    Hi David,

    Thanks for the review. Patch attached.

    (1) I've moved that comment to the check's new location.

    (2) Fixed the bug and added tests for only one separator. Also added
    test data and tests for mixed eol files.

    (3) I changed this so that when the file is opened with universal
    newline support, read(size) makes multiple calls to _do_read until size
    bytes are read or EOF is reached.

    @agillesp
    Copy link
    Mannequin

    agillesp mannequin commented Aug 26, 2009

    Just found another bug in the code that sets the newlines attribute.
    Please disregard issue6759_3.diff

    @agillesp
    Copy link
    Mannequin

    agillesp mannequin commented Aug 27, 2009

    Latest patch attached.

    • Fixed the code that populates the newlines attribute. I think I've
      covered all the cases...

    • Found another deviation from file object behavior in this module:
      Calling read with a negative size parameter does not always return the
      remainder of the file as described in
      http://docs.python.org/library/stdtypes.html#file.read I went ahead and
      fixed this--please let me know if I should open a separate issue and
      submit a separate patch.

    • Added more tests for mixed eol files, calling read with a negative
      size parameter, reading a file with only crlfs

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 24, 2010

    @Art: @david: A lot of changes have been made to test_zipfile since the last patch was produced. I'm unsure as to whether the patch could be applied to 2.7 as is, or it would be simpler to rework the patch for 2.7 and the py3k branches. Could one of you take a look please, thanks.

    @bitdancer
    Copy link
    Member

    Well, you could try applying the patch and see if it applies cleanly and passes the tests. But it would be helpful to have it reworked for py3k, since we now commit first to py3k and then backport to 2.7.

    I have not reviewed the updated patch.

    @serhiy-storchaka
    Copy link
    Member

    Support lines in zipfile looks contradictory and buggy. This complicates the code and makes the behavior of zipfile.ZipExtFile incompatible with the interface of other file-like objects. For example, the behavior of the read, read1 and peek differ from one described in io module.

    If we are working with binary data, conversion of newlines is meaningless (and how about newlines in comments?). If we are working with text, the bytes must be decoded to str. This will help io.TextIOWrapper.

    I suggest two alternatives:

    1. Deprecate universal newline support in zipfile. zipfile.ZipExtFile should always work with non-modified bytes, and who need the text, let wraps zipfile.ZipExtFile with io.TextIOWrapper.

    2. Automatically wrap a zipfile.ZipExtFile with io.TextIOWrapper if universal newlines mode is enabled. In this case, the data type will change from bytes to str. Add modes "t" and "b" to explicitly specify the data type. Add an encoding parameter (and other parameters if needed).

    @pitrou
    Copy link
    Member

    pitrou commented Apr 29, 2012

    Deprecate universal newline support in zipfile. zipfile.ZipExtFile
    should always work with non-modified bytes, and who need the text, let
    wraps zipfile.ZipExtFile with io.TextIOWrapper.

    This would be fine with me.

    @pitrou pitrou added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Apr 29, 2012
    @serhiy-storchaka
    Copy link
    Member

    This would be fine with me.

    It may be worth to deprecate PEP-278? Oh, only ten years have passed
    since 2.3, but it seems it was so long ago.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 29, 2012

    It may be worth to deprecate PEP-278? Oh, only ten years have passed
    since 2.3, but it seems it was so long ago.

    Well, I don't know if PEPs ever get deprecated. In this case, PEP-3116
    is probably the superseder.

    @serhiy-storchaka
    Copy link
    Member

    I know how to remove universal newline support, I know how after this
    correct these functions (with bpo-14371 they partially corrected), but
    I don't know how to deprecate universal newline support. What should be
    done? Can you initiate a discussion in Python-Ideas or Python-Dev? Now
    only zipfile trying to implement this PEP (and failed), io ignores 'U'
    mode, gzip checks and raises.

    With regard to this issue, I note that there are tests that check that
    read() returns unmodified data (UniversalNewlineTests.read_test in
    Lib/test/test_zipfile.py). It looks weird, but apparently, this behavior
    is expected and deliberate.

    @serhiy-storchaka
    Copy link
    Member

    Mode 'U' is deprecated now (bpo-15204).

    @serhiy-storchaka
    Copy link
    Member

    Support of the 'U' mode is removed in 3.6 (bpo-27029).

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants