classification
Title: line buffering isn't always
Type: behavior Stage: needs patch
Components: Documentation Versions: Python 3.2, Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: loewis, pitrou, r.david.murray
Priority: normal Keywords:

Created on 2012-06-12 01:15 by r.david.murray, last changed 2012-06-13 12:56 by pitrou.

Messages (4)
msg162656 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-06-12 01:15
    rdmurray@hey:~/python/p32>cat bad.py

    This line is just ascii
    A second line for good measure.
    This comment contains undecodable stuff: "�" or "\\xe9" in "pass�"" cannot be decoded.

The last line above is in latin-1, with an é inside those quotes.

    rdmurray@hey:~/python/p32>cat bug.py      
    import sys
    with open('./bad.py', buffering=int(sys.argv[1])) as f:
        for line in f:
            print(line, end='')
    rdmurray@hey:~/python/p32>python3 bug.py -1
    Traceback (most recent call last):
      File "bug.py", line 3, in <module>
        for line in f:
      File "/usr/lib/python3.2/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 99: invalid continuation byte
    rdmurray@hey:~/python/p32>python3 bug.py 1 
    Traceback (most recent call last):
      File "bug.py", line 3, in <module>
        for line in f:
      File "/usr/lib/python3.2/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 99: invalid continuation byte
    rdmurray@hey:~/python/p32>python3 bug.py 2
    This line is just ascii
    A second line for good measure.
    Traceback (most recent call last):
      File "bug.py", line 3, in <module>
        for line in f:
      File "/usr/lib/python3.2/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: invalid continuation byte

So, line buffering does not appear to buffer line by line.

I ran into this problem because I had a much larger file that I thought
was in utf-8.  When I got the encoding error, I was annoyed that the
error message didn't really tell me which line the error was on, but I
figured, OK, I'll just set line buffering and then I'll be able to tell.
But that didn't work.  Fortunately, using '2' did work, but at a minimum
the docs need to be updated to indicate when line buffering really is
line buffering and when it isn't.
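
For anyone who lands here wanting to know which line fails to decode, one option that does not depend on the buffering argument is to open the file in binary mode, where iteration splits on b"\n" without decoding anything, and then decode each line separately. This is only a minimal sketch, assuming the file is meant to be UTF-8. The helper name find_undecodable_lines and the temporary demo file are hypothetical, made up for illustration, and are not part of the report above.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line number, byte offset within the line, raw bytes) for each bad line."""
    problems = []
    # Binary mode: iteration splits on b"\n" and performs no decoding.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first offending byte in this line.
                problems.append((lineno, exc.start, raw))
    return problems


# Hypothetical demo file: two ASCII lines, then a line holding a latin-1 0xe9 byte.
fd, demo_path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "wb") as f:
    f.write(b"This line is just ascii\n")
    f.write(b"A second line for good measure.\n")
    f.write(b"caf\xe9 is latin-1\n")

demo_problems = find_undecodable_lines(demo_path)
os.remove(demo_path)

for lineno, offset, raw in demo_problems:
    print("line %d, byte %d of the line: %r" % (lineno, offset, raw))
```

Splitting on b"\n" is safe for UTF-8 and latin-1 input, since a newline byte never occurs inside a multi-byte UTF-8 sequence. It would not be appropriate for encodings such as UTF-16.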
msg162678 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2012-06-12 14:46
Without looking at the code, it seems that

http://docs.python.org/release/3.1.5/library/io.html?highlight=io#io.TextIOWrapper

gives the answer

"If line_buffering is True, flush() is implied when a call to write contains a newline character."

So, "line buffering" may have a meaning only for writing. I don't think there is a reasonable way to implement it for reading.
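
The write-side meaning quoted above can be observed with in-memory streams. The following is a rough sketch rather than a description of how open() wires things up internally: it stacks an io.TextIOWrapper on an io.BufferedWriter on an io.BytesIO and checks when bytes actually arrive in the BytesIO. The function name bytes_reaching_raw is hypothetical.

```python
import io


def bytes_reaching_raw(line_buffering):
    """Report what has reached the underlying stream at three points in time."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(
        io.BufferedWriter(raw),
        encoding="utf-8",
        newline="\n",  # no newline translation, so the bytes are the same everywhere
        line_buffering=line_buffering,
    )
    text.write("no newline yet")
    before = raw.getvalue()   # after a write without a newline
    text.write(" ... done\n")
    after = raw.getvalue()    # after a write that contains a newline
    text.flush()
    final = raw.getvalue()    # after an explicit flush
    return before, after, final


for flag in (True, False):
    print(flag, bytes_reaching_raw(flag))
```

With line_buffering=True the data shows up in the underlying stream as soon as a write contains a newline. With line_buffering=False it only shows up after the explicit flush(), because this much data fits easily in the intermediate buffers.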
msg162682 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-06-12 15:13
That makes sense.  I'll add a mention of this to the 'open' docs that discuss the buffering parameter.
msg162706 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2012-06-13 12:56
Indeed, line buffering on the read side would be very slow (since you would have to read and decode one byte at a time from the raw stream to make sure you don't overshoot the line boundaries).
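
The read-ahead behaviour is easy to see with in-memory streams. This is a small sketch under the assumption that a BytesIO stands in for the raw file. It reads a single line through the text layer and then asks the underlying stream how far it has been consumed. The sample data and variable names are made up for illustration.

```python
import io

data = (
    b"This line is just ascii\n"
    b"A second line for good measure.\n"
    b"A third line, also ascii.\n"
)
raw = io.BytesIO(data)
text = io.TextIOWrapper(io.BufferedReader(raw), encoding="utf-8", newline="\n")

first = text.readline()
consumed = raw.tell()  # how far the underlying stream has been read so far

print(repr(first))
print("bytes in first line:", len(first.encode("utf-8")))
print("bytes consumed from the underlying stream:", consumed)
```

Only one line has been returned, but more than one line's worth of bytes has already been pulled from the underlying stream and handed to the decoder. That is consistent with the original report, where an undecodable byte on the third line raised an error before the first line was printed.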
History
Date                 User            Action  Args
2012-06-13 12:56:28  pitrou          set     messages: + msg162706
2012-06-12 15:13:33  r.david.murray  set     assignee: r.david.murray
                                             messages: + msg162682
                                             components: + Documentation
2012-06-12 14:46:24  loewis          set     nosy: + loewis
                                             messages: + msg162678
2012-06-12 01:15:24  r.david.murray  create