classification
Title: http.client.IncompleteRead from HTTPResponse read after part reading file
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: SilentGhost, kristjan.jonsson, martin.panter, maubp, python-dev, serhiy.storchaka
Priority: normal Keywords: 3.5regression, patch

Created on 2016-03-07 10:10 by maubp, last changed 2016-03-16 08:39 by martin.panter. This issue is now closed.

Files
File name Uploaded Description Edit
issue26499.diff SilentGhost, 2016-03-07 17:49 review
issue26499_2.diff SilentGhost, 2016-03-09 08:30 review
issue26499_3.diff SilentGhost, 2016-03-09 19:58 review
issue26499_4.diff SilentGhost, 2016-03-12 15:36 review
issue26499_5.diff SilentGhost, 2016-03-13 11:38 review
issue26499_6.diff martin.panter, 2016-03-15 05:59 review
Messages (12)
msg261287 - (view) Author: Peter (maubp) Date: 2016-03-07 10:10
This is a regression in Python 3.5 tested under Linux and Mac OS X, spotted from a failing test in Biopython https://github.com/biopython/biopython/issues/773 where we would parse a file from the internet. The trigger is partially reading the network handle line by line (e.g. until an end record marker is found), and then calling handle.read() to fetch any remaining data. Self contained examples below.

Note that partially reading a file like this still works:


$ python3.5
Python 3.5.0 (default, Sep 14 2015, 12:13:24) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> from urllib.request import urlopen
>>> handle = urlopen("http://www.python.org")
>>> chunk = handle.read(50)
>>> rest = handle.read()
>>> handle.close()


However, the following variants read a few lines and then attempt to call handle.read() and fail. The URL is not important (as long as it has over four lines in these examples).

Using readline,


>>> from urllib.request import urlopen
>>> handle = urlopen("http://www.python.org")
>>> for i in range(4):
...     line = handle.readline()
... 
>>> rest = handle.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
    s = self._safe_read(self.length)
  File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected)


Using line iteration via next,


>>> from urllib.request import urlopen
>>> handle = urlopen("http://www.python.org")
>>> for i in range(4):
...      line = next(handle)
... 
>>> rest = handle.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
    s = self._safe_read(self.length)
  File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected)


Using line iteration directly,


>>> from urllib.request import urlopen
>>> count = 0
>>> handle = urlopen("http://www.python.org")
>>> for line in handle:
...     count += 1
...     if count == 4:
...         break
... 
>>> rest = handle.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
    s = self._safe_read(self.length)
  File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected)



These examples all worked on Python 3.3 and 3.4 so this is a regression.
msg261309 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-03-07 17:49
As far as I'm able to track it, it was a refactoring in issue 19009 that is responsible for this regression (rev 49017c391564). I'm adding Kristján, so that he'd have a look at the attached fix and test.
msg261401 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-03-09 07:29
Thanks for the report and the patch. It looks okay as far as it goes, but I think there are other related bugs:

## read1() doesn’t update length either ##

>>> handle = urlopen("https://www.python.org/")
>>> len(handle.read1())
7374
>>> handle.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/http/client.py", line 462, in read
    s = self._safe_read(self.length)
  File "/home/proj/python/cpython/Lib/http/client.py", line 614, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(39583 bytes read, 7374 more expected)

## readline() and read1() blindly read beyond Content-Length ##

>>> conn = HTTPSConnection("www.python.org")
>>> conn.request("GET", "/")
>>> resp = conn.getresponse()
>>> resp.readlines()  # Hangs until the connection is closed

I wonder if anyone has considered implementing HTTPResponse by wrapping or subclassing BufferedReader; that way you get well-tested readline() etc functionality for free. The HTTP protocol (Connection: close, Content-Length, chunked decoding) could be handled by an internal RawIOBase.readinto() method.
msg261405 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-03-09 08:30
Here is the updated patch. I only included the additional fix for read1 since readlines is not overwritten in the HTTPConnection. Not sure how to write test for it, does it need a much longer body (compared to the one in tests) to produce this behaviour?

The larger refactoring might be appropriate for 3.6, but I believe these smaller fixed could go into the next micro release.
msg261409 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-03-09 09:21
Yes I agree it would be best to fix the bug(s) without major refactoring if practical.

To fix the other bug, I think the parameters in read1(n) and readline(limit) need to be capped at self.length.

The inherited BufferedIOBase.readlines() implementation should just call readline() multiple times. The point is that if you call readline() when self.length is reduced to zero, it should behave like EOF, rather than waiting for more data from the socket. To test this, you might be able to use the FakeSocket class. You might have to add a second HTTP response to the data, and test that readline() etc doesn’t read any of the second response.
msg261470 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-03-09 19:58
All the highlighted issue are now fixed. The limit on n in read1 wasn't tested.

Your suggestion regarding testing went a bit over my head, Martin. So, just trying to make sure we're on the same page. ExtendedReadTest, where I thought placing these new tests, is already employing FakeSocket, but I'm not sure how one would add a second response and to what. In any case, some of the code in that class seems rather specific, so it's not clear if it could or should be reused.
msg261622 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-03-12 00:27
To add a second response you would just concatenate it to the existing response. (Your existing test already uses FakeSocket.) The FakeSocket parameter just represents data that would be sent from the server. So:

body = (
   b"HTTP/1.1 200 OK\r\n"
   b"Content-Length: 3\r\n"
   b"\r\n"
   b"abc"
   b"HTTP/1.1 408 Next response should not be read yet\r\n"
)

In fact, see the BasicTest.test_content_length_sync() case, which literally has "extradata" as that last line. I think we just need to adapt or duplicate this test to cover readline() and read1(), not just read(). Maybe also with read1(100), readline(100) cases as well.
msg261656 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-03-12 15:36
OK, here is the patch including the tests that seem to exercise the behaviour.
msg261673 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-03-13 06:08
Thanks Silent Ghost, this patch looks pretty good. I did leave a couple more comments.

Also I just realized that HTTPResponse.readline() and read1() are not really documented. The BufferedIOBase support was meant to be added in 3.5, although readline() was supposed to be supported earlier for urlopen() compatibility.
msg261695 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2016-03-13 11:38
Updated patch addresses the rietveld comments.
msg261801 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-03-15 05:59
Here is a modified version of the tests that I propose to commit soon. I trimmed the size of the tests down and simplified some of them. I also mentioned that BufferedIOBase is supported in the documentation.
msg261842 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-03-16 07:20
New changeset e95e9701b472 by Martin Panter in branch '3.5':
Issue #26499: Fixes to HTTPResponse.readline() and read1(), by Silent Ghost
https://hg.python.org/cpython/rev/e95e9701b472

New changeset 74b3523d6256 by Martin Panter in branch 'default':
Issue #26499: Merge HTTPResponse fix from 3.5
https://hg.python.org/cpython/rev/74b3523d6256
History
Date User Action Args
2016-03-16 08:39:57martin.pantersetstatus: open -> closed
resolution: fixed
stage: commit review -> resolved
2016-03-16 07:20:24python-devsetnosy: + python-dev
messages: + msg261842
2016-03-15 05:59:49martin.pantersetfiles: + issue26499_6.diff

messages: + msg261801
stage: patch review -> commit review
2016-03-13 11:38:45SilentGhostsetfiles: + issue26499_5.diff

messages: + msg261695
2016-03-13 06:08:19martin.pantersetmessages: + msg261673
2016-03-12 15:36:22SilentGhostsetfiles: + issue26499_4.diff

messages: + msg261656
2016-03-12 00:27:18martin.pantersetmessages: + msg261622
2016-03-09 19:58:22SilentGhostsetfiles: + issue26499_3.diff

messages: + msg261470
2016-03-09 09:21:58martin.pantersetmessages: + msg261409
2016-03-09 08:30:19SilentGhostsetfiles: + issue26499_2.diff

messages: + msg261405
2016-03-09 07:29:47martin.pantersetnosy: + martin.panter
messages: + msg261401
2016-03-07 17:49:15SilentGhostsetfiles: + issue26499.diff

versions: + Python 3.6
keywords: + 3.5regression, patch
nosy: + kristjan.jonsson, SilentGhost, serhiy.storchaka

messages: + msg261309
stage: patch review
2016-03-07 10:10:53maubpcreate