Message 315768 - Python tracker


Author	da
Recipients	da
Date	2018-04-26.01:06:01
Message-id	<1524704763.78.0.682650639539.issue33361@psf.upfronthosting.co.za>

Content
It appears that calling readline() on a codecs.EncodedFile stream breaks seeking and causes subsequent attempts to iterate over the lines or call readline() to backtrack and return already consumed lines.

A minimal example:

```
from __future__ import print_function

import codecs
import io


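def run(stream):
    # Remember the current position, peek at the first line, then restore it.
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)

    print('Got header: %r' % header_row)

    # If we were at the start of the file, consume the header before iterating.
    if stream.tell() == 0:
        print('Skipping the header: %r' % stream.readline())

    # Remaining lines are data rows; numbering starts at 2 (line 1 is the header).
    for index, line in enumerate(stream, start=2):
        print('Line %d: %r' % (index, line))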


b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')

run(s)
```

Output:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is line 2
Line 2: 'a,b\r\n'                           <-- this is line 1
Line 3: '"asdf","jkl;"\r\n'                 <-- now we're back to line 2
```

As you can see, the line being skipped is actually the second line, and when we try reading from the stream again, the iterator starts from the beginning of the file.

Even weirder, adding a second call to readline() to skip the second line shows it's going **backwards**:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is actually line 2
Skipping the second line: 'a,b\r\n'         <-- this is line 1
Line 2: '"asdf","jkl;"\r\n'                 <-- this is now correct
```
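One possible factor, which this report does not confirm, is that the codec layer reads ahead of the line it returns, so the position of the underlying stream and the reader's buffered lines can drift apart. The sketch below only illustrates that read-ahead. It uses `codecs.getreader()` directly, on the assumption that `EncodedFile` reads through the same kind of stream reader. The variable names are made up for illustration.

```python
import codecs
import io

data = u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le')
raw = io.BytesIO(data)

# A plain codec stream reader over the same bytes as the example above.
reader = codecs.getreader('utf-16-le')(raw)

first_line = reader.readline()
first_line_bytes = len(first_line.encode('utf-16-le'))

# How far the raw stream has moved beyond the line that was returned.
read_ahead = raw.tell() - first_line_bytes

print('first line: %r (%d bytes)' % (first_line, first_line_bytes))
print('raw position after one readline(): %d' % raw.tell())
print('bytes consumed beyond the first line: %d' % read_ahead)
```

If the reader keeps those extra bytes in an internal buffer, a seek() that only moves the raw stream would leave that buffer holding lines from the old position, which would be consistent with the out-of-order output above.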

The expected output shows that we got a header, skipped it, and then read one data line.

```
Got header: 'a,b'
Skipping the header: 'a,b\r\n'
Line 2: '"asdf","jkl;"\r\n'
```

I'm sure this is related to the implementation of readline() because if we change this:

```
header_row = stream.readline()
```

to this:

```
header_row = stream.read().splitlines()[0]
```

then we get the expected output. If, on the other hand, we comment out the seek() in the finally clause, we also get the expected output (minus the "Skipping the header" line).
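A possible workaround, sketched here under the assumption that stale buffered data in the codec layer is what causes the backtracking, is to call `reset()` on the stream right after seeking. `reset()` is an existing method on the object returned by `codecs.EncodedFile`. The helper name `read_header_then_rows` is hypothetical, and this has not been verified against the interpreter version used in this report.

```python
import codecs
import io


def read_header_then_rows(stream):
    """Peek at the first line, restore the position, then read the data rows."""
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)
        # Assumption: dropping the codec's buffered data after the seek keeps
        # later reads in step with the position of the underlying stream.
        stream.reset()

    rows = []
    if stream.tell() == 0:
        stream.readline()  # skip the header that was already captured
    rows.extend(stream)
    return header_row, rows


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
recoded = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

header, rows = read_header_then_rows(recoded)
print('Got header: %r' % header)
for index, line in enumerate(rows, start=2):
    print('Line %d: %r' % (index, line))
```

On Python 3 the lines come back as bytes rather than str, so the printed representations carry a `b` prefix.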
