Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlopen returns extra, spurious bytes #48881

Closed
dato mannequin opened this issue Dec 11, 2008 · 23 comments
Closed

urlopen returns extra, spurious bytes #48881

dato mannequin opened this issue Dec 11, 2008 · 23 comments
Labels
release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@dato
Copy link
Mannequin

dato mannequin commented Dec 11, 2008

BPO 4631
Nosy @amauryfa, @pitrou, @devdanzin
Files
  • urllib_bytes.diff: Makes urlib.request.urlopen.HTTPHandler use HTTP 1.0 again
  • urllib-chunked.diff
  • test_urllib_chunked2.diff: Tests "Transfer-Encoding: chunked" handling in urllib
  • urllib-chunked2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-02-11.01:02:33.050>
    created_at = <Date 2008-12-11.11:23:01.655>
    labels = ['type-bug', 'library', 'release-blocker']
    title = 'urlopen returns extra, spurious bytes'
    updated_at = <Date 2009-02-11.01:02:32.933>
    user = 'https://bugs.python.org/dato'

    bugs.python.org fields:

    activity = <Date 2009-02-11.01:02:32.933>
    actor = 'pitrou'
    assignee = 'jhylton'
    closed = True
    closed_date = <Date 2009-02-11.01:02:33.050>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2008-12-11.11:23:01.655>
    creator = 'dato'
    dependencies = []
    files = ['12351', '12361', '12988', '13021']
    hgrepos = []
    issue_num = 4631
    keywords = ['patch']
    message_count = 23.0
    messages = ['77603', '77605', '77611', '77612', '77775', '77778', '77779', '77780', '77781', '77789', '77794', '77796', '77846', '77879', '77889', '78144', '81360', '81385', '81428', '81433', '81610', '81611', '81617']
    nosy_count = 8.0
    nosy_names = ['jhylton', 'amaury.forgeotdarc', 'pitrou', 'ajaksu2', 'craigh', 'dato', 'ResulCetin', 'trodgers']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4631'
    versions = ['Python 3.0']

    @dato
    Copy link
    Mannequin Author

    dato mannequin commented Dec 11, 2008

    This is very odd, but it was reproduced by people in #python as well.
    Compare, in python 2.5:

    >>> 
    urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
    'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'

    To the equivalent in python 3.0:

    >>> 
    urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
    b'f65\r\n'

    @dato dato mannequin added the stdlib Python modules in the Lib dir label Dec 11, 2008
    @amauryfa
    Copy link
    Member

    I don't reproduce the problem:

    >>
    urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
    b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'

    I connect through a http proxy.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Dec 11, 2008

    Confirmed:

    Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
    [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.request
    >>>
    urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()
    [b'f65\r\n', b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47
    2001\n',  ...

    Perhaps it's related to the \r at read boundaries bug?

    @exarkun
    Copy link
    Mannequin

    exarkun mannequin commented Dec 11, 2008

    The "f65" is the chunk length for the first chunk returned when
    requesting that URL. A proxy could easily hide this by switching to a
    different transfer encoding.

    @jhylton
    Copy link
    Mannequin

    jhylton mannequin commented Dec 14, 2008

    Does the same thing happen with 2.6?

    Jeremy

    On Thu, Dec 11, 2008 at 8:53 AM, Jean-Paul Calderone
    <report@bugs.python.org> wrote:

    Jean-Paul Calderone <exarkun@divmod.com> added the comment:

    The "f65" is the chunk length for the first chunk returned when
    requesting that URL. A proxy could easily hide this by switching to a
    different transfer encoding.

    ----------
    nosy: +exarkun


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue4631\>



    Python-bugs-list mailing list
    Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/jeremy%40alum.mit.edu

    @dato
    Copy link
    Mannequin Author

    dato mannequin commented Dec 14, 2008

    Does the same thing happen with 2.6?

    No, I can't reproduce with 2.6.1.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Dec 14, 2008

    Jeremy: no, it doesn't.

    Python 2.6.1+ (release26-maint:67716M, Dec 13 2008, 10:30:52)
    [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2

    ~/release26-maint$ ./python -c "import urllib; print
    urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
    From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001

    ~/release26-maint$ ./python -c "from __future__ import unicode_literals;
    import urllib; print
    urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
    From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001

    FWIW, there are trailing spurious bytes too (note read() gives bytes,
    while readlines() both bytes and strings in 3.0):
    >>> import urllib.request; content =
    urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
    
    Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
    [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import urllib.request
    
    >>> content =
    urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
    
    >>> content[-30:]
    b'PGP SIGNATURE-----\n\n\n\n\n\r\n0\r\n\r\n'
    
    >>> content[:10]
    b'f65\r\nFrom '
    
    While in 2.6:
    >>> import urllib
    >>> content =
    urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
    >>> content[-30:]
    '---END PGP SIGNATURE-----\n\n\n\n\n'

    @dato
    Copy link
    Mannequin Author

    dato mannequin commented Dec 14, 2008

    FWIW, there are trailing spurious bytes too

    And in the middle of the document as well. Each time there's a chunk, I
    guess?

    @ResulCetin
    Copy link
    Mannequin

    ResulCetin mannequin commented Dec 14, 2008

    I have the same problem with that code:

    (exchange USERNAME with your delicious username and PASSWORD with your
    delicious password):
    import urllib.request
    auth_handler = urllib.request.HTTPBasicAuthHandler()
    auth_handler.add_password('del.icio.us API', 'api.del.icio.us',
    USERNAME, PASSWORD)
    opener = urllib.request.build_opener(auth_handler)
    print(str(opener.open('https://api.del.icio.us/v1/posts/all').read(20),
    "utf-8"))

    And I don't use a proxy or anything like that. This makes python 3
    completely unusable for me. And python 2.6 gives me what I want (the
    content of that virtual file) without any extra data in front or in the
    middle of the content.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Dec 14, 2008

    Took me a bit of Wiresharking to find this out, the problem is that we
    are asking for the page using HTTP 1.1 in 3.0.

    Here's a workaround patch for those who need it quick, I have yet to
    look at urllib to see what can be fixed there.

    ---
    Index: Lib/http/client.py
    ===================================================================
    --- Lib/http/client.py (revision 67716)
    +++ Lib/http/client.py (working copy)
    @@ -600,7 +600,7 @@
    class HTTPConnection:

    _http_vsn = 11
    -    _http_vsn_str = 'HTTP/1.1'
    +    _http_vsn_str = 'HTTP/1.0'
    
    response_class = HTTPResponse
    default_port = HTTP_PORT

    This is what we send in 2.5 and 3.0:

    GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.0
    User-Agent: Python-urllib/1.17

    GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.1
    Accept-Encoding: identity
    User-Agent: Python-urllib/3.1

    @pitrou pitrou added the type-bug An unexpected behavior, bug, or error label Dec 14, 2008
    @pitrou
    Copy link
    Member

    pitrou commented Dec 14, 2008

    IMO we should downgrade urlopen to HTTP 1.0 in 3.0.1. Implementing
    chunked encoding will come later if people are interested in doing it.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Dec 14, 2008

    Clarifying the diagnosis, the offending spurious bytes are only present
    when we use 3.0's GET above.

    That's because urllib.request.HTTPHandler asks for a vanilla
    http.client.HTTPConnection, which uses HTTP 1.1.

    IIUC, either we change the request version back to 1.0 (attached patch)
    or correct the way the response is processed (is it at all?).

    I think HTTPSHandler will also suffer from this, perhaps
    [Fancy]URLopener too.

    [Antoine: cool, an edit conflict that agrees with what I was about to
    post :D]

    @jhylton jhylton mannequin self-assigned this Dec 14, 2008
    @jhylton
    Copy link
    Mannequin

    jhylton mannequin commented Dec 15, 2008

    Brief update: The Python 2.x code works because readline() is provided
    by socket._fileobject. The Python 3.x code fails because it grabs the
    HTTPResponse.fp instance variable at the end of
    AbstractHTTPHandler.do_open. That method needs to pass the response to
    addinfourl(), but needs to have support for readline / readlines before
    it can do that.

    @jhylton
    Copy link
    Mannequin

    jhylton mannequin commented Dec 15, 2008

    I have a patch here that seems to work for the specific url and that
    passes all the tests. Can anyone check whether it works for a larger
    set of cases?

    I'm a little concerned because I don't understand the new io library in
    much detail. There's an override for _checkClosed() in the HTTPResponse
    that seems a little dodgy. I'll try to get someone to review that
    specifically.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Dec 15, 2008

    I think your patch is good, but there may be another bug around:

    I wrote a script to check results of 3.x against 2.x, but many pages
    (http://groups.google.com/, http://en.wikipedia.org/) give 403:
    Forbidden for 3.x... but work with 2.x!

    If you think of this as a bug in 3.x, it could retry the request
    identifying as 2.x on 403.

    Other than that, your patch gives me identical results to 2.5/2.6 for
    128 sites I tested (only a read(100) for each).

    Interestingly, my patched version gives a file closer to the buggy
    version in size, at 12700 bytes versus 12707. Your version agrees with
    2.x and simple maths (128 x 100) in giving a 12799 bytes result. I have
    no idea why.

    HTH,
    Daniel

    @loewis loewis mannequin added the release-blocker label Dec 21, 2008
    @pitrou
    Copy link
    Member

    pitrou commented Dec 21, 2008

    The patch should have at least a test so that we don't have a regression
    on this one.

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Feb 8, 2009

    Here's a test (in test_urllib2_localnet) that fails before the patch and
    passes after, mostly lifted from test_httplib:

        def test_chunked(self):
            expected_response = b"hello world"
            chunked_start = (
                            b'a\r\n'
                            b'hello worl\r\n'
                            b'1\r\n'
                            b'd\r\n'
                            )
            response = [(200, [("Transfer-Encoding", "chunked")],
    chunked_start)]
            handler = self.start_server(response)
            data = self.urlopen("http://localhost:%s/" % handler.port)
            self.assertEquals(data, expected_response)

    Output:

    test test_urllib2_localnet failed -- Traceback (most recent call last):
      File "~/py3k/Lib/test/test_urllib2_localnet.py", line 390, in test_chunked
        self.assertEquals(data, expected_response)
    AssertionError: b'a\r\nhello worl\r\n1\r\nd\r\n' != b'hello world'

    To allow this test to work, the attached patch also touches
    FakeHTTPRequestHandler and TestUrlopen.urlopen.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 8, 2009

    On the principle, the test looks good.
    If you want to avoid the 'if "%" in value' hack, you can use the
    named-parameter form of string formatting:

    >>> "localhost:%(port)s" % dict(port=8080)
    'localhost:8080'
    >>> "localhost" % dict(port=8080)
    'localhost'

    @devdanzin
    Copy link
    Mannequin

    devdanzin mannequin commented Feb 9, 2009

    Antoine,
    Thanks for reviewing, here's an updated version.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 9, 2009

    The test looks good to me.
    I can't comment on the bugfix patch, but if it's ok to you, you can go
    ahead :)

    @pitrou
    Copy link
    Member

    pitrou commented Feb 10, 2009

    I took a look at the patch and it looks ok, apart from the
    _checkClosed() hack (but I don't think there's any immediate solution).
    It should be noted that HTTPResponse.readline() will be awfully slow
    since, as HTTPResponse doesn't have peek(), readline() will call read()
    one byte at a time...

    (slow I/O is nothing new in py3k, however :-))

    @pitrou
    Copy link
    Member

    pitrou commented Feb 10, 2009

    Here is a patch without the _checkClosed() hack. The solution is simply
    to remove redundant _checkClosed() calls in IOBase (for example,
    readline() doesn't need to do an explicit closed check as it calls
    read()).

    @pitrou
    Copy link
    Member

    pitrou commented Feb 11, 2009

    Committed in r69513, r69514. Thanks everyone!

    @pitrou pitrou closed this as completed Feb 11, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    release-blocker stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants