classification
Title: When I use codecs.open(...) and f.readline() follow up by f.read() return bad result
Type: behavior
Stage: patch review
Components: Library (Lib)
Versions: Python 3.4, Python 3.3, Python 3.2, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ajaksu2, amaury.forgeotdarc, eric.araujo, harobed, haypo, lemburg, r.david.murray, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2010-03-29 15:09 by harobed, last changed 2012-12-07 20:04 by serhiy.storchaka.

Files
File name Uploaded Description
codecs_read.patch amaury.forgeotdarc, 2010-03-31 10:00
codecs_read-2.patch amaury.forgeotdarc, 2010-03-31 11:11
Messages (11)
msg101892 - Author: harobed (harobed) Date: 2010-03-29 15:09
Here is an example; the last assert fails:

f = open('data.txt', 'w')
f.write("""line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
line 11
""")
f.close()


f = open('data.txt', 'r')
assert f.readline() == 'line 1\n'

assert f.read() == """line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
line 11
"""

f.close()

import codecs

f = codecs.open('data.txt', 'r', 'utf8')

assert f.read() == """line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
line 11
"""

f.close()

f = codecs.open('data.txt', 'r', 'utf8')

assert f.readline() == 'line 1\n'

# this assert fails
assert f.read() == """line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
line 11
"""

f.close()

Regards,
Stephane
msg101980 - Author: Daniel Diniz (ajaksu2) Date: 2010-03-31 06:12
Hi Stephane,

I think you're seeing different buffering behavior, which I suspect is correct according to docs.

codecs.open should default to line buffering[1], while open uses the system default[2].

The read() where the assert fails is returning the remaining buffer from the readline (which read 72 chars).

Asserting e.g. "f.read(1024) == ..." will give you the expected result.

[1] http://docs.python.org/library/codecs.html#codecs.open
[2] http://docs.python.org/library/functions.html#open
msg101987 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-03-31 10:00
Buffering applies when writing, not when reading a file.

There is indeed a problem in codecs.py: after a readline(), read() will return the content of the internal buffer, and not more.

The "size" parameter is a hint, and should not be used to decide whether the character buffer is enough to satisfy the read() request.
Patch is attached, with test.
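
The behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyReader is a hypothetical class, not the code in codecs.py, it handles ASCII data only, and the "fixed" branch simply treats a buffer as sufficient only when a non-negative character count was requested.

```python
import io

# Same eleven lines as the reporter's data.txt (79 characters in total).
TEXT = "".join("line %d\n" % i for i in range(1, 12))


class ToyReader:
    """Hypothetical, simplified model of a buffered reader (not codecs.py)."""

    def __init__(self, raw, fixed):
        self.stream = io.BytesIO(raw)
        self.charbuffer = ""
        self.fixed = fixed

    def readline(self, readsize=72):
        # Read ahead in chunks until a newline is buffered or the stream ends.
        while "\n" not in self.charbuffer:
            chunk = self.stream.read(readsize)
            if not chunk:
                break
            self.charbuffer += chunk.decode("ascii")
        line, sep, rest = self.charbuffer.partition("\n")
        self.charbuffer = rest  # leftover read-ahead stays in the buffer
        return line + sep

    def read(self, chars=-1):
        while True:
            if self.fixed:
                # Only a non-negative character count can be satisfied
                # from the buffer alone.
                satisfied = chars >= 0 and len(self.charbuffer) >= chars
            elif chars < 0:
                # Modelled bug: any buffered text counts as "enough".
                satisfied = bool(self.charbuffer)
            else:
                satisfied = len(self.charbuffer) >= chars
            if satisfied:
                break
            chunk = self.stream.read() if chars < 0 else self.stream.read(chars)
            if not chunk:
                break
            self.charbuffer += chunk.decode("ascii")
        if chars < 0:
            result, self.charbuffer = self.charbuffer, ""
        else:
            result = self.charbuffer[:chars]
            self.charbuffer = self.charbuffer[chars:]
        return result


def rest_after_first_line(fixed):
    """Return what read() yields after one readline() call."""
    reader = ToyReader(TEXT.encode("ascii"), fixed=fixed)
    reader.readline()
    return reader.read()
```

In this model, rest_after_first_line(False) returns only the leftover of the 72-character read-ahead, while rest_after_first_line(True) returns everything after the first line.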
msg101988 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-31 10:28
Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
> Buffering applies when writing, not when reading a file.
> 
> There is indeed a problem in codecs.py: after a readline(), read() will return the content of the internal buffer, and not more.
> 
> The "size" parameter is a hint, and should not be used to decide whether the character buffer is enough to satisfy the read() request.
> Patch is attached, with test.

Agreed.

The patch looks good except the if-line should read:

if chars >= 0 and len(self.charbuffer) >= chars:
  ...

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

msg101990 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-03-31 11:11
Updated patch.

[I also tried to avoid reading the underlying file when len(self.bytebuffer) >= size, but that does not work with multibyte characters when size=1.]
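
The multibyte problem can be seen without touching StreamReader at all. The sketch below uses the standard incremental UTF-8 decoder to show that one buffered byte may decode to zero characters, so a byte count alone cannot tell whether a one-character request can be served. The variable names are chosen for illustration only.

```python
import codecs

# "é" occupies two bytes in UTF-8.
encoded = "é".encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()

# Feeding a single byte (as a size=1 read would) yields no complete character.
first = decoder.decode(encoded[:1])

# Only the second byte completes the character.
second = decoder.decode(encoded[1:])
```

Here first is the empty string and second is "é", even though the first call already had size=1 bytes available.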
msg122823 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-11-29 16:17
I applied the diff to test_codecs in py3k, removed the u prefixes and ran: failure.  I applied the fix and the test passed.
msg138265 - Author: harobed (harobed) Date: 2011-06-13 17:55
Bump: I think this patch has not been applied in Python 3.3a0.
msg138273 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-13 19:43
According to this ticket it hasn't been applied anywhere yet (a message will be posted here when it is).
msg139465 - Author: STINNER Victor (haypo) * (Python committer) Date: 2011-06-30 08:08
See also #12446.
msg177119 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-07 19:52
I think the patch is wrong, or at least not optimal, for the case where chars is -1 but size is not.

If we want to read all data in any case, then we should call self.stream.read() without argument if chars < 0 or size < 0.

If we want to read no more than size bytes, then all loop code should be totally rewritten.

Perhaps I am wrong.
msg177123 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-12-07 20:04
As shown in issue12446, issue14475 and issue16636, there are different ways to reproduce this bug (read(size, chars) + readlines(), readline() + readlines()). All these cases should be tested.
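
A possible shape for such a check is sketched below. It is not the test from the attached patches: the helper names are hypothetical, and it wraps an in-memory io.BytesIO with codecs.getreader("utf-8") instead of calling codecs.open on a file, on the assumption that this exercises the same reader methods. On an interpreter affected by this bug, some combinations would be expected to come back incomplete.

```python
import codecs
import io

# Same eleven lines as the data.txt used in the original report.
TEXT = "".join("line %d\n" % i for i in range(1, 12))


def make_reader():
    """Return a fresh UTF-8 StreamReader over an in-memory copy of TEXT."""
    return codecs.getreader("utf-8")(io.BytesIO(TEXT.encode("utf-8")))


def readline_then_read():
    reader = make_reader()
    return reader.readline() + reader.read()


def readline_then_readlines():
    reader = make_reader()
    return reader.readline() + "".join(reader.readlines())


def sized_read_then_readlines():
    reader = make_reader()
    # read(size, chars): at most one byte from the stream, one character back.
    return reader.read(1, 1) + "".join(reader.readlines())


def incomplete_combinations():
    """Return the names of call sequences that did not reproduce TEXT."""
    checks = {
        "readline() + read()": readline_then_read,
        "readline() + readlines()": readline_then_readlines,
        "read(size, chars) + readlines()": sized_read_then_readlines,
    }
    return [name for name, check in checks.items() if check() != TEXT]
```

Each helper concatenates what the two calls return, so any data dropped between them shows up as a mismatch against TEXT.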
History
Date User Action Args
2012-12-07 20:04:06 serhiy.storchaka set messages: + msg177123
2012-12-07 20:03:53 serhiy.storchaka link issue16636 superseder
2012-12-07 20:03:38 serhiy.storchaka link issue14475 superseder
2012-12-07 20:03:22 serhiy.storchaka link issue12446 superseder
2012-12-07 19:52:34 serhiy.storchaka set nosy: + serhiy.storchaka
    messages: + msg177119
    versions: + Python 3.4
2011-06-30 08:08:06 haypo set messages: + msg139465
2011-06-13 19:43:50 r.david.murray set nosy: + r.david.murray
    messages: + msg138273
    versions: + Python 3.3, - Python 3.1
2011-06-13 19:31:20 haypo set nosy: + haypo
2011-06-13 17:55:55 harobed set messages: + msg138265
2010-11-29 16:17:29 eric.araujo set nosy: + eric.araujo
    title: When I use codecs.open(...) and f.readline() follow up by f.read() return bad result -> When I use codecs.open(...) and f.readline() follow up by f.read() return bad result
    messages: + msg122823
    versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2010-03-31 11:11:07 amaury.forgeotdarc set files: + codecs_read-2.patch
    messages: + msg101990
2010-03-31 10:28:20 lemburg set nosy: + lemburg
    title: When I use codecs.open(...) and f.readline() follow up by f.read() return bad result -> When I use codecs.open(...) and f.readline() follow up by f.read() return bad result
    messages: + msg101988
2010-03-31 10:00:19 amaury.forgeotdarc set files: + codecs_read.patch
    nosy: + amaury.forgeotdarc
    messages: + msg101987
    keywords: + patch
    stage: test needed -> patch review
2010-03-31 06:12:21 ajaksu2 set priority: normal
    nosy: + ajaksu2
    messages: + msg101980
    stage: test needed
2010-03-29 15:09:53 harobed create