readline() + seek() on codecs.EncodedFile breaks next readline() #77542

Closed
da mannequin opened this issue Apr 26, 2018 · 14 comments
Labels
3.7 (EOL): end of life; 3.8: only security fixes; stdlib: Python modules in the Lib dir; topic-IO; type-bug: An unexpected behavior, bug, or error

Comments

@da
Mannequin

da mannequin commented Apr 26, 2018

BPO 33361
Nosy @berkerpeksag, @elenaoat, @MojoVampire
PRs
  • bpo-33361: Fix bug with seeking in StreamRecoders #8278
  • [3.7] bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278) #13708

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-05-31.20:04:36.752>
    created_at = <Date 2018-04-26.01:06:03.630>
    labels = ['3.7', '3.8', 'type-bug', 'library', 'expert-IO']
    title = 'readline() + seek() on codecs.EncodedFile breaks next readline()'
    updated_at = <Date 2019-05-31.20:04:36.751>
    user = 'https://bugs.python.org/da'

    bugs.python.org fields:

    activity = <Date 2019-05-31.20:04:36.751>
    actor = 'berker.peksag'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-05-31.20:04:36.752>
    closer = 'berker.peksag'
    components = ['Library (Lib)', 'IO']
    creation = <Date 2018-04-26.01:06:03.630>
    creator = 'da'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 33361
    keywords = ['patch']
    message_count = 14.0
    messages = ['315768', '315788', '315808', '315809', '315835', '315841', '315842', '317245', '321634', '343407', '343596', '344111', '344115', '344116']
    nosy_count = 4.0
    nosy_names = ['berker.peksag', 'Elena.Oat', 'josh.r', 'da']
    pr_nums = ['8278', '13708']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue33361'
    versions = ['Python 3.7', 'Python 3.8']

    @da
    Mannequin Author

    da mannequin commented Apr 26, 2018

    It appears that calling readline() on a codecs.EncodedFile stream breaks seeking and causes subsequent attempts to iterate over the lines or call readline() to backtrack and return already consumed lines.

    A minimal example:

    from __future__ import print_function
    
    import codecs
    import io
    
    
    def run(stream):
        offset = stream.tell()
        try:
            stream.seek(0)
            header_row = stream.readline()
        finally:
            stream.seek(offset)
    
        print('Got header: %r' % header_row)
    
        if stream.tell() == 0:
            print('Skipping the header: %r' % stream.readline())
    
        for index, line in enumerate(stream, start=2):
            print('Line %d: %r' % (index, line))
    
    
    b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
    s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')
    
    run(s)
    

    Output:

    Got header: 'a,b\r\n'
    Skipping the header: '"asdf","jkl;"\r\n'    <-- this is line 2
    Line 2: 'a,b\r\n'                           <-- this is line 1
    Line 3: '"asdf","jkl;"\r\n'                 <-- now we're back to line 2
    

    As you can see, the line being skipped is actually the second line, and when we try reading from the stream again, the iterator starts from the beginning of the file.

    Even weirder, adding a second call to readline() to skip the second line shows it's going **backwards**:

    Got header: 'a,b\r\n'
    Skipping the header: '"asdf","jkl;"\r\n'    <-- this is actually line 2
    Skipping the second line: 'a,b\r\n'         <-- this is line 1
    Line 2: '"asdf","jkl;"\r\n'                 <-- this is now correct
    

    The expected output shows that we got a header, skipped it, and then read one data line.

    Got header: 'a,b\r\n'
    Skipping the header: 'a,b\r\n'
    Line 2: '"asdf","jkl;"\r\n'
    

    I'm sure this is related to the implementation of readline() because if we change this:

    header_row = stream.readline()
    

    to this:

    header_row = stream.read().splitlines()[0]
    

    then we get the expected output. If, on the other hand, we comment out the seek() in the finally clause, we also get the expected output (minus the "Skipping the header" line).
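
The same check can be written so that it verifies itself instead of printing. This is only a sketch: `read_header_then_rest` is a hypothetical helper name, and the expected values in the note below assume an interpreter that already contains the fix for this issue. In Python 3 the recoder returns bytes, so the results are bytes objects rather than the str values shown in the output above.

```python
import codecs
import io


def read_header_then_rest(stream):
    """Peek at the first line, restore the position, then read every line.

    Hypothetical helper mirroring run() from the report, but returning the
    values instead of printing them.
    """
    offset = stream.tell()
    try:
        stream.seek(0)
        header = stream.readline()
    finally:
        stream.seek(offset)
    # Iterating after the seek should start again from the restored position.
    return header, list(stream)


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
recoder = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')
header, lines = read_header_then_rest(recoder)
```

With the fix in place I would expect `header` to be the first line and `lines` to contain both lines in file order. On an affected version, `lines` instead starts with the second line, matching the misordered output above.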

    @da da mannequin added stdlib Python modules in the Lib dir topic-IO type-bug An unexpected behavior, bug, or error labels Apr 26, 2018
    @elenaoat
    Mannequin

    elenaoat mannequin commented Apr 26, 2018

    I cannot replicate this when the stream is:

    In: stream_ex = io.BytesIO(u"abc\ndef\nghi\n".encode("utf-8"))
    In: f = codecs.EncodedFile(stream_ex, 'utf-8')

    In: run(f)

    Out: Got header: b'abc\n'
    Skipping the header: b'abc\n'
    Line 2: b'def\n'
    Line 3: b'ghi\n'

    @da
    Mannequin Author

    da mannequin commented Apr 26, 2018

    That's because the stream isn't transcoding, since UTF-8 is ASCII-compatible. Try using something not ASCII-compatible as the codec e.g. 'ibm500' and it'll give incorrect results.

    b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('ibm500'))
    s = codecs.EncodedFile(b, 'ibm500')
    
    Got header: '\x81k\x82\r%'
    Skipping the header. '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
    Line 2: '\x81k\x82\r%'
    Line 3: '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
    

    @da
    Mannequin Author

    da mannequin commented Apr 26, 2018

    Update: If I run your exact code it still breaks for me:

    Got header: 'abc\n'
    Skipping the header. 'def\n'
    Line 2: 'ghi\n'
    Line 3: 'abc\n'
    Line 4: 'def\n'
    Line 5: 'ghi\n'
    

    I'm running Python 2.7.14 and 3.6.5 on OSX 10.13.4. Startup banners:

    Python 2.7.14 (default, Feb 7 2018, 14:15:12)
    [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin

    Python 3.6.5 (default, Apr 2 2018, 14:03:12)
    [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin

    @elenaoat
    Mannequin

    elenaoat mannequin commented Apr 27, 2018

    I've tried this with Python 3.6.0 on OSX 10.13.4

    @elenaoat
    Mannequin

    elenaoat mannequin commented Apr 27, 2018

    For your specific example I also get a weird result. I tried this in Python 2.7.10 and Python 3.6.0.

    @elenaoat
    Mannequin

    elenaoat mannequin commented Apr 27, 2018

    I've modified your example a little, and it's clear that readline() moves the cursor.

    from __future__ import print_function
    
    import codecs
    import io
    
    
    def run(stream):
        offset = stream.tell()
        try:
            stream.seek(0)
            header_row = stream.readline()
        finally:
            stream.seek(offset)
        print(offset)
        print(stream.tell())
        print('Got header: %r' % header_row)
    
        if stream.tell() == 0:
            print(stream.tell())
            print(stream.readline())
            print('Skipping the header: %r' % stream.readline())
    
        for index, line in enumerate(stream, start=2):
            print('Line %d: %r' % (index, line))
    
    
    b = io.BytesIO(u'ab\r\ncd\ndef\n'.encode('utf-16-le'))
    s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')
    run(s)
    
    

    The first call to readline returns cd instead of ab.
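
One way to see why the position looks wrong, sketched under the assumption that the recoder's tell() is simply forwarded to the underlying BytesIO: the reader pulls in a whole chunk of input on the first readline(), so tell() reports where the raw stream is, not where the returned line ended. The variable names are made up for illustration.

```python
import codecs
import io

payload = u'ab\r\ncd\ndef\n'.encode('utf-16-le')   # 11 characters, 22 bytes
raw = io.BytesIO(payload)
recoder = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

first_line = recoder.readline()

# The first line covers only 4 characters (8 bytes of UTF-16 input) ...
consumed_by_line = len(first_line.decode('utf-8').encode('utf-16-le'))
# ... but the underlying stream has been read well past that point.
raw_position = raw.tell()
reported_position = recoder.tell()

print(first_line, consumed_by_line, raw_position, reported_position)
```

For an input this small, the raw position ends up at the end of the data after a single readline(). The remaining lines sit in the reader's internal buffer, which would explain why a later seek() on the raw stream alone does not bring readline() back to the first line.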

    @da
    Mannequin Author

    da mannequin commented May 21, 2018

    Update: Tested this on Python 3.5.4, 3.4.8, and 3.7.0b3 on OSX 10.13.4. They also exhibit the bug. Updating the ticket accordingly.

    @da da mannequin added the 3.7 (EOL) end of life label May 21, 2018
    @MojoVampire MojoVampire mannequin changed the title from "readline() + seek() on io.EncodedFile breaks next readline()" to "readline() + seek() on codecs.EncodedFile breaks next readline()" May 21, 2018
    @da
    Mannequin Author

    da mannequin commented Jul 13, 2018

    Bug still present in 3.7.0, now seeing it in 3.8.0a0 as well.

    @da da mannequin added the 3.8 only security fixes label Jul 13, 2018
    @MojoVampire
    Mannequin

    MojoVampire mannequin commented May 24, 2019

    Possibly related to bpo-8260 ("When I use codecs.open(...) and f.readline() follow up by f.read() return bad result"), which was never fully fixed in that issue, though bpo-32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it.

    @da
    Mannequin Author

    da mannequin commented May 27, 2019

    > though bpo-32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it.

    Still seeing this in 3.7.3 so I don't think so?
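
For anyone stuck on an affected version, a possible workaround (stated here as an assumption, not something proposed or confirmed in this thread) is to call reset() on the recoder right after every seek(), so that whatever the reader has buffered is discarded. `seek_and_reset` is a hypothetical helper name, and the sketch is only meant for read-only use, since reset() also resets the writer side.

```python
import codecs
import io


def seek_and_reset(recoder, offset, whence=0):
    """Seek the recoder, then drop whatever its reader had buffered.

    Hypothetical helper; intended for read-only streams.
    """
    recoder.seek(offset, whence)
    recoder.reset()   # clears the reader's buffered characters and lines


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
recoder = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

header = recoder.readline()
seek_and_reset(recoder, 0)
lines_after_seek = list(recoder)
```

The extra reset() should be harmless on versions where seek() already clears the buffers, so the helper can stay in place after upgrading.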

    @berkerpeksag
    Member

    New changeset a6ec1ce by Berker Peksag (Ammar Askar) in branch 'master':
    bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278)

    @berkerpeksag
    Member

    New changeset a6dc5d4 by Berker Peksag (Miss Islington (bot)) in branch '3.7':
    bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278)

    @berkerpeksag
    Member

    Thank you for the report, Diego and thank you for the patch, Ammar!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022