classification
Title: readline() + seek() on codecs.EncodedFile breaks next readline()
Type: behavior Stage: resolved
Components: IO, Library (Lib) Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Elena.Oat, berker.peksag, da, josh.r
Priority: normal Keywords: patch

Created on 2018-04-26 01:06 by da, last changed 2019-05-31 20:04 by berker.peksag. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 8278 merged ammar2, 2018-07-14 00:20
PR 13708 merged miss-islington, 2019-05-31 19:44
Messages (14)
msg315768 - (view) Author: Diego Argueta (da) * Date: 2018-04-26 01:06
It appears that calling readline() on a codecs.EncodedFile stream breaks seeking and causes subsequent attempts to iterate over the lines or call readline() to backtrack and return already consumed lines.

A minimal example:

```
from __future__ import print_function

import codecs
import io


def run(stream):
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)

    print('Got header: %r' % header_row)

    if stream.tell() == 0:
        print('Skipping the header: %r' % stream.readline())

    for index, line in enumerate(stream, start=2):
        print('Line %d: %r' % (index, line))


b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')

run(s)
```

Output:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is line 2
Line 2: 'a,b\r\n'                           <-- this is line 1
Line 3: '"asdf","jkl;"\r\n'                 <-- now we're back to line 2
```

As you can see, the line being skipped is actually the second line, and when we try reading from the stream again, the iterator starts from the beginning of the file.

Even weirder, adding a second call to readline() to skip the second line shows it's going **backwards**:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is actually line 2
Skipping the second line: 'a,b\r\n'         <-- this is line 1
Line 2: '"asdf","jkl;"\r\n'                 <-- this is now correct
```

The expected output shows that we got a header, skipped it, and then read one data line.

```
Got header: 'a,b'
Skipping the header: 'a,b\r\n'
Line 2: '"asdf","jkl;"\r\n'
```

I'm sure this is related to the implementation of readline() because if we change this:

```
header_row = stream.readline()
```

to this:

```
header_row = stream.read().splitlines()[0]
```

then we get the expected output. If, on the other hand, we comment out the seek() in the finally clause, we also get the expected output (minus the "Skipping the header" line).
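The read()-based variant described in this message can be wrapped in a small helper. This is only a sketch for Python 3, not a recommended fix: `peek_header` is a hypothetical name, it loads the whole stream into memory, and unlike readline() it returns the header without its line terminator.

```python
import codecs
import io


def peek_header(stream):
    """Return the first line of `stream` and restore its position.

    Hypothetical helper: it uses read() instead of readline(), so it
    reads the whole stream into memory and drops the line terminator.
    """
    offset = stream.tell()
    try:
        stream.seek(0)
        lines = stream.read().splitlines()
    finally:
        stream.seek(offset)
    return lines[0] if lines else b''


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
stream = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

header = peek_header(stream)
first_line = stream.readline()
remaining = list(stream)

print('Got header: %r' % header)
print('Skipping the header: %r' % first_line)
for index, line in enumerate(remaining, start=2):
    print('Line %d: %r' % (index, line))
```

Because read() drains the reader's buffers before the stream is repositioned, the following readline() starts from the restored offset.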
msg315788 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-26 12:16
I cannot replicate this when the stream is:

In: stream_ex = io.BytesIO(u"abc\ndef\nghi\n".encode("utf-8"))
In: f = codecs.EncodedFile(stream_ex, 'utf-8')

In: run(f)

Out: Got header: b'abc\n'
Skipping the header: b'abc\n'
Line 2: b'def\n'
Line 3: b'ghi\n'
msg315808 - (view) Author: Diego Argueta (da) * Date: 2018-04-26 18:02
That's because the stream isn't transcoding, since UTF-8 is ASCII-compatible. Try a codec that is not ASCII-compatible, e.g. 'ibm500', and it'll give incorrect results.

```
b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('ibm500'))
s = codecs.EncodedFile(b, 'ibm500')
```

```
Got header: '\x81k\x82\r%'
Skipping the header. '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
Line 2: '\x81k\x82\r%'
Line 3: '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
```
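A quick way to see why 'ibm500' makes the effect visible: it is an EBCDIC code page, so even plain ASCII text encodes to different bytes. The check below is a small Python 3 sketch; the variable names are only for illustration.

```python
text = u'a,b\r\n'

as_ascii = text.encode('ascii')
as_utf8 = text.encode('utf-8')
as_ibm500 = text.encode('ibm500')

print(repr(as_utf8))
print(repr(as_ibm500))

# UTF-8 leaves ASCII-only text unchanged; ibm500 does not.
print(as_utf8 == as_ascii)
print(as_ibm500 == as_ascii)
```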
msg315809 - (view) Author: Diego Argueta (da) * Date: 2018-04-26 18:08
Update: If I run your exact code it still breaks for me:

```
Got header: 'abc\n'
Skipping the header. 'def\n'
Line 2: 'ghi\n'
Line 3: 'abc\n'
Line 4: 'def\n'
Line 5: 'ghi\n'
```

I'm running Python 2.7.14 and 3.6.5 on OSX 10.13.4. Startup banners:

Python 2.7.14 (default, Feb  7 2018, 14:15:12) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin

Python 3.6.5 (default, Apr  2 2018, 14:03:12) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
msg315835 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-27 11:53
I've tried this with Python 3.6.0 on OSX 10.13.4
msg315841 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-27 13:46
For your specific example I also get a weird result. I tried this in Python 2.7.10 and Python 3.6.0.
msg315842 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-27 13:53
I've modified your example a little, and it's clear that readline() moves the cursor.

```
from __future__ import print_function

import codecs
import io


def run(stream):
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)
    print(offset)
    print(stream.tell())
    print('Got header: %r' % header_row)

    if stream.tell() == 0:
        print(stream.tell())
        print(stream.readline())
        print('Skipping the header: %r' % stream.readline())

    for index, line in enumerate(stream, start=2):
        print('Line %d: %r' % (index, line))


b = io.BytesIO(u'ab\r\ncd\ndef\n'.encode('utf-16-le'))
s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')
run(s)

```
The first call to readline returns cd instead of ab.
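One plausible explanation for the cursor movement, sketched here for Python 3: the reader behind EncodedFile reads ahead and buffers decoded text, so the position of the underlying stream can end up well past the line that readline() returned. The variable names are illustrative, and the snippet only inspects positions; it does not try to reproduce the bug itself.

```python
import codecs
import io

raw = io.BytesIO(u'ab\r\ncd\ndef\n'.encode('utf-16-le'))
stream = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

first_line = stream.readline()
# Number of bytes the first line occupies in the underlying stream.
first_line_size = len(u'ab\r\n'.encode('utf-16-le'))

print('readline() returned: %r' % first_line)
print('first line occupies %d bytes' % first_line_size)
print('raw.tell(): %d' % raw.tell())
print('stream.tell(): %d' % stream.tell())
```

If the wrapper's tell() simply reports the underlying stream's position, as it does here, then an offset saved after a readline() does not point at the start of the next line.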
msg317245 - (view) Author: Diego Argueta (da) * Date: 2018-05-21 17:47
Update: Tested this on Python 3.5.4, 3.4.8, and 3.7.0b3 on OSX 10.13.4. They also exhibit the bug. Updating the ticket accordingly.
msg321634 - (view) Author: Diego Argueta (da) * Date: 2018-07-13 21:05
Bug still present in 3.7.0, now seeing it in 3.8.0a0 as well.
msg343407 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-05-24 16:20
Possibly related to #8260 ("When I use codecs.open(...) and f.readline() follow up by f.read() return bad result"), which was never fully fixed in that issue, though #32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it.
msg343596 - (view) Author: Diego Argueta (da) * Date: 2019-05-27 01:38
> though #32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it.

Still seeing this in 3.7.3 so I don't think so?
msg344111 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-05-31 19:44
New changeset a6ec1ce1ac05b1258931422e96eac215b6a05459 by Berker Peksag (Ammar Askar) in branch 'master':
bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278)
https://github.com/python/cpython/commit/a6ec1ce1ac05b1258931422e96eac215b6a05459
msg344115 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-05-31 20:03
New changeset a6dc5d4e1c9ef465dc1f1ad95c382aa8e32b178f by Berker Peksag (Miss Islington (bot)) in branch '3.7':
bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278)
https://github.com/python/cpython/commit/a6dc5d4e1c9ef465dc1f1ad95c382aa8e32b178f
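The changeset title suggests that seeks on a StreamRecoder now reach its reader and writer so their buffers can be reset. On interpreters without the fix, the same idea can be approximated with a subclass. This is a sketch under that assumption, not the merged patch: `SeekAwareRecoder` and `open_recoded` are hypothetical names.

```python
import codecs
import io


class SeekAwareRecoder(codecs.StreamRecoder):
    """Hypothetical subclass that forwards seek() to the reader and writer."""

    def seek(self, offset, whence=0):
        # StreamReader.seek() repositions the underlying stream and resets
        # the reader's buffers, discarding any read-ahead data.
        self.reader.seek(offset, whence)
        self.writer.seek(offset, whence)


def open_recoded(file, data_encoding, file_encoding):
    """Hypothetical stand-in for codecs.EncodedFile using the subclass."""
    data_info = codecs.lookup(data_encoding)
    file_info = codecs.lookup(file_encoding)
    return SeekAwareRecoder(file, data_info.encode, data_info.decode,
                            file_info.streamreader, file_info.streamwriter)


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
stream = open_recoded(raw, 'utf-8', 'utf-16-le')

stream.seek(0)
header_row = stream.readline()   # peek at the header
stream.seek(0)                   # rewind; buffers are reset as well
skipped = stream.readline()      # reads the header again
remaining = list(stream)         # only the data line is left

print('Got header: %r' % header_row)
print('Skipping the header: %r' % skipped)
for index, line in enumerate(remaining, start=2):
    print('Line %d: %r' % (index, line))
```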
msg344116 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-05-31 20:04
Thank you for the report, Diego and thank you for the patch, Ammar!
History
Date User Action Args
2019-05-31 20:04:36 berker.peksag set status: open -> closed
versions: - Python 2.7, Python 3.4, Python 3.5, Python 3.6
messages: + msg344116
resolution: fixed
stage: patch review -> resolved
2019-05-31 20:03:28 berker.peksag set messages: + msg344115
2019-05-31 19:44:22 miss-islington set pull_requests: + pull_request13595
2019-05-31 19:44:16 berker.peksag set nosy: + berker.peksag
messages: + msg344111
2019-05-27 01:38:06 da set messages: + msg343596
2019-05-24 16:20:39 josh.r set nosy: + josh.r
messages: + msg343407
2018-07-14 00:20:38 ammar2 set keywords: + patch
stage: patch review
pull_requests: + pull_request7813
2018-07-13 21:05:39 da set messages: + msg321634
versions: + Python 3.8
2018-05-21 21:33:39 josh.r set title: readline() + seek() on io.EncodedFile breaks next readline() -> readline() + seek() on codecs.EncodedFile breaks next readline()
2018-05-21 17:47:08 da set messages: + msg317245
versions: + Python 3.4, Python 3.5, Python 3.7
2018-04-27 13:53:51 Elena.Oat set messages: + msg315842
2018-04-27 13:46:27 Elena.Oat set messages: + msg315841
2018-04-27 11:53:59 Elena.Oat set messages: + msg315835
2018-04-26 18:08:38 da set messages: + msg315809
2018-04-26 18:02:41 da set messages: + msg315808
2018-04-26 12:16:14 Elena.Oat set nosy: + Elena.Oat
messages: + msg315788
2018-04-26 01:06:03 da create