Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http.client.IncompleteRead from HTTPResponse read after part reading file #70686

Closed
peterjc mannequin opened this issue Mar 7, 2016 · 12 comments
Closed

http.client.IncompleteRead from HTTPResponse read after part reading file #70686

peterjc mannequin opened this issue Mar 7, 2016 · 12 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@peterjc
Copy link
Mannequin

peterjc mannequin commented Mar 7, 2016

BPO 26499
Nosy @kristjanvalur, @peterjc, @vadmium, @serhiy-storchaka
Files
  • issue26499.diff
  • issue26499_2.diff
  • issue26499_3.diff
  • issue26499_4.diff
  • issue26499_5.diff
  • issue26499_6.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-03-16.08:39:57.154>
    created_at = <Date 2016-03-07.10:10:53.557>
    labels = ['type-bug', 'library']
    title = 'http.client.IncompleteRead from HTTPResponse read after part reading file'
    updated_at = <Date 2016-03-16.08:39:57.153>
    user = 'https://github.com/peterjc'

    bugs.python.org fields:

    activity = <Date 2016-03-16.08:39:57.153>
    actor = 'martin.panter'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-03-16.08:39:57.154>
    closer = 'martin.panter'
    components = ['Library (Lib)']
    creation = <Date 2016-03-07.10:10:53.557>
    creator = 'maubp'
    dependencies = []
    files = ['42084', '42097', '42106', '42151', '42155', '42170']
    hgrepos = []
    issue_num = 26499
    keywords = ['patch', '3.5regression']
    message_count = 12.0
    messages = ['261287', '261309', '261401', '261405', '261409', '261470', '261622', '261656', '261673', '261695', '261801', '261842']
    nosy_count = 6.0
    nosy_names = ['kristjan.jonsson', 'SilentGhost', 'maubp', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue26499'
    versions = ['Python 3.5', 'Python 3.6']

    @peterjc
    Copy link
    Mannequin Author

    peterjc mannequin commented Mar 7, 2016

    This is a regression in Python 3.5 tested under Linux and Mac OS X, spotted from a failing test in Biopython biopython/biopython#773 where we would parse a file from the internet. The trigger is partially reading the network handle line by line (e.g. until an end record marker is found), and then calling handle.read() to fetch any remaining data. Self contained examples below.

    Note that partially reading a file like this still works:

    $ python3.5
    Python 3.5.0 (default, Sep 14 2015, 12:13:24) 
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 
    >>> from urllib.request import urlopen
    >>> handle = urlopen("http://www.python.org")
    >>> chunk = handle.read(50)
    >>> rest = handle.read()
    >>> handle.close()

    However, the following variants read a few lines and then attempt to call handle.read() and fail. The URL is not important (as long as it has over four lines in these examples).

    Using readline,

    >>> from urllib.request import urlopen
    >>> handle = urlopen("http://www.python.org")
    >>> for i in range(4):
    ...     line = handle.readline()
    ... 
    >>> rest = handle.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
        s = self._safe_read(self.length)
      File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
        raise IncompleteRead(b''.join(s), amt)
    http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected)

    Using line iteration via next,

    >>> from urllib.request import urlopen
    >>> handle = urlopen("http://www.python.org")
    >>> for i in range(4):
    ...      line = next(handle)
    ... 
    >>> rest = handle.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
        s = self._safe_read(self.length)
      File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
        raise IncompleteRead(b''.join(s), amt)
    http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected)

    Using line iteration directly,

    >>> from urllib.request import urlopen
    >>> count = 0
    >>> handle = urlopen("http://www.python.org")
    >>> for line in handle:
    ...     count += 1
    ...     if count == 4:
    ...         break
    ... 
    >>> rest = handle.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/xxx/lib/python3.5/http/client.py", line 446, in read
        s = self._safe_read(self.length)
      File "/Users/xxx/lib/python3.5/http/client.py", line 594, in _safe_read
        raise IncompleteRead(b''.join(s), amt)
    http.client.IncompleteRead: IncompleteRead(46698 bytes read, 259 more expected)

    These examples all worked on Python 3.3 and 3.4 so this is a regression.

    @peterjc peterjc mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Mar 7, 2016
    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Mar 7, 2016

    As far as I'm able to track it, it was a refactoring in bpo-19009 that is responsible for this regression (rev 49017c391564). I'm adding Kristján, so that he'd have a look at the attached fix and test.

    @vadmium
    Copy link
    Member

    vadmium commented Mar 9, 2016

    Thanks for the report and the patch. It looks okay as far as it goes, but I think there are other related bugs:

    ## read1() doesn’t update length either ##

    >>> handle = urlopen("https://www.python.org/")
    >>> len(handle.read1())
    7374
    >>> handle.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/proj/python/cpython/Lib/http/client.py", line 462, in read
        s = self._safe_read(self.length)
      File "/home/proj/python/cpython/Lib/http/client.py", line 614, in _safe_read
        raise IncompleteRead(b''.join(s), amt)
    http.client.IncompleteRead: IncompleteRead(39583 bytes read, 7374 more expected)

    ## readline() and read1() blindly read beyond Content-Length ##

    >> conn = HTTPSConnection("www.python.org")
    >> conn.request("GET", "/")
    >> resp = conn.getresponse()
    >> resp.readlines() # Hangs until the connection is closed

    I wonder if anyone has considered implementing HTTPResponse by wrapping or subclassing BufferedReader; that way you get well-tested readline() etc functionality for free. The HTTP protocol (Connection: close, Content-Length, chunked decoding) could be handled by an internal RawIOBase.readinto() method.

    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Mar 9, 2016

    Here is the updated patch. I only included the additional fix for read1 since readlines is not overwritten in the HTTPConnection. Not sure how to write test for it, does it need a much longer body (compared to the one in tests) to produce this behaviour?

    The larger refactoring might be appropriate for 3.6, but I believe these smaller fixed could go into the next micro release.

    @vadmium
    Copy link
    Member

    vadmium commented Mar 9, 2016

    Yes I agree it would be best to fix the bug(s) without major refactoring if practical.

    To fix the other bug, I think the parameters in read1(n) and readline(limit) need to be capped at self.length.

    The inherited BufferedIOBase.readlines() implementation should just call readline() multiple times. The point is that if you call readline() when self.length is reduced to zero, it should behave like EOF, rather than waiting for more data from the socket. To test this, you might be able to use the FakeSocket class. You might have to add a second HTTP response to the data, and test that readline() etc doesn’t read any of the second response.

    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Mar 9, 2016

    All the highlighted issue are now fixed. The limit on n in read1 wasn't tested.

    Your suggestion regarding testing went a bit over my head, Martin. So, just trying to make sure we're on the same page. ExtendedReadTest, where I thought placing these new tests, is already employing FakeSocket, but I'm not sure how one would add a second response and to what. In any case, some of the code in that class seems rather specific, so it's not clear if it could or should be reused.

    @vadmium
    Copy link
    Member

    vadmium commented Mar 12, 2016

    To add a second response you would just concatenate it to the existing response. (Your existing test already uses FakeSocket.) The FakeSocket parameter just represents data that would be sent from the server. So:

    body = (
       b"HTTP/1.1 200 OK\r\n"
       b"Content-Length: 3\r\n"
       b"\r\n"
       b"abc"
       b"HTTP/1.1 408 Next response should not be read yet\r\n"
    )

    In fact, see the BasicTest.test_content_length_sync() case, which literally has "extradata" as that last line. I think we just need to adapt or duplicate this test to cover readline() and read1(), not just read(). Maybe also with read1(100), readline(100) cases as well.

    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Mar 12, 2016

    OK, here is the patch including the tests that seem to exercise the behaviour.

    @vadmium
    Copy link
    Member

    vadmium commented Mar 13, 2016

    Thanks Silent Ghost, this patch looks pretty good. I did leave a couple more comments.

    Also I just realized that HTTPResponse.readline() and read1() are not really documented. The BufferedIOBase support was meant to be added in 3.5, although readline() was supposed to be supported earlier for urlopen() compatibility.

    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Mar 13, 2016

    Updated patch addresses the rietveld comments.

    @vadmium
    Copy link
    Member

    vadmium commented Mar 15, 2016

    Here is a modified version of the tests that I propose to commit soon. I trimmed the size of the tests down and simplified some of them. I also mentioned that BufferedIOBase is supported in the documentation.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Mar 16, 2016

    New changeset e95e9701b472 by Martin Panter in branch '3.5':
    Issue bpo-26499: Fixes to HTTPResponse.readline() and read1(), by Silent Ghost
    https://hg.python.org/cpython/rev/e95e9701b472

    New changeset 74b3523d6256 by Martin Panter in branch 'default':
    Issue bpo-26499: Merge HTTPResponse fix from 3.5
    https://hg.python.org/cpython/rev/74b3523d6256

    @vadmium vadmium closed this as completed Mar 16, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant