urlopen returns extra, spurious bytes #48881

dato · 2008-12-11T11:23:02Z

BPO	4631
Nosy	@amauryfa, @pitrou, @devdanzin
Files	urllib_bytes.diff: Makes urlib.request.urlopen.HTTPHandler use HTTP 1.0 again urllib-chunked.diff test_urllib_chunked2.diff: Tests "Transfer-Encoding: chunked" handling in urllib urllib-chunked2.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2009-02-11.01:02:33.050>
created_at = <Date 2008-12-11.11:23:01.655>
labels = ['type-bug', 'library', 'release-blocker']
title = 'urlopen returns extra, spurious bytes'
updated_at = <Date 2009-02-11.01:02:32.933>
user = 'https://bugs.python.org/dato'

bugs.python.org fields:

activity = <Date 2009-02-11.01:02:32.933>
actor = 'pitrou'
assignee = 'jhylton'
closed = True
closed_date = <Date 2009-02-11.01:02:33.050>
closer = 'pitrou'
components = ['Library (Lib)']
creation = <Date 2008-12-11.11:23:01.655>
creator = 'dato'
dependencies = []
files = ['12351', '12361', '12988', '13021']
hgrepos = []
issue_num = 4631
keywords = ['patch']
message_count = 23.0
messages = ['77603', '77605', '77611', '77612', '77775', '77778', '77779', '77780', '77781', '77789', '77794', '77796', '77846', '77879', '77889', '78144', '81360', '81385', '81428', '81433', '81610', '81611', '81617']
nosy_count = 8.0
nosy_names = ['jhylton', 'amaury.forgeotdarc', 'pitrou', 'ajaksu2', 'craigh', 'dato', 'ResulCetin', 'trodgers']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue4631'
versions = ['Python 3.0']

dato · 2008-12-11T11:23:00Z

This is very odd, but it was reproduced by people in #python as well.
Compare, in python 2.5:

>>> 
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'

To the equivalent in python 3.0:

>>> 
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
b'f65\r\n'

amauryfa · 2008-12-11T12:12:45Z

I don't reproduce the problem:

>>
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'

I connect through a http proxy.

devdanzin · 2008-12-11T13:13:23Z

Confirmed:

Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>>
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()
[b'f65\r\n', b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47
2001\n',  ...

Perhaps it's related to the \r at read boundaries bug?

exarkun · 2008-12-11T13:53:58Z

The "f65" is the chunk length for the first chunk returned when
requesting that URL. A proxy could easily hide this by switching to a
different transfer encoding.

jhylton · 2008-12-14T02:54:54Z

Does the same thing happen with 2.6?

Jeremy

On Thu, Dec 11, 2008 at 8:53 AM, Jean-Paul Calderone
<report@bugs.python.org> wrote:

Jean-Paul Calderone <exarkun@divmod.com> added the comment:

The "f65" is the chunk length for the first chunk returned when
requesting that URL. A proxy could easily hide this by switching to a
different transfer encoding.

----------
nosy: +exarkun

Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4631\>

Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/jeremy%40alum.mit.edu

dato · 2008-12-14T10:02:43Z

Does the same thing happen with 2.6?

No, I can't reproduce with 2.6.1.

devdanzin · 2008-12-14T10:08:11Z

Jeremy: no, it doesn't.

Python 2.6.1+ (release26-maint:67716M, Dec 13 2008, 10:30:52)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2

~/release26-maint$ ./python -c "import urllib; print
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001

~/release26-maint$ ./python -c "from __future__ import unicode_literals;
import urllib; print
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001

FWIW, there are trailing spurious bytes too (note read() gives bytes,
while readlines() both bytes and strings in 3.0):
>>> import urllib.request; content =
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()

Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request

>>> content =
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()

>>> content[-30:]
b'PGP SIGNATURE-----\n\n\n\n\n\r\n0\r\n\r\n'

>>> content[:10]
b'f65\r\nFrom '

While in 2.6:
>>> import urllib
>>> content =
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
>>> content[-30:]
'---END PGP SIGNATURE-----\n\n\n\n\n'

dato · 2008-12-14T10:11:42Z

FWIW, there are trailing spurious bytes too

And in the middle of the document as well. Each time there's a chunk, I
guess?

ResulCetin · 2008-12-14T10:20:45Z

I have the same problem with that code:

(exchange USERNAME with your delicious username and PASSWORD with your
delicious password):
import urllib.request
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password('del.icio.us API', 'api.del.icio.us',
USERNAME, PASSWORD)
opener = urllib.request.build_opener(auth_handler)
print(str(opener.open('https://api.del.icio.us/v1/posts/all').read(20),
"utf-8"))

And I don't use a proxy or anything like that. This makes python 3
completely unusable for me. And python 2.6 gives me what I want (the
content of that virtual file) without any extra data in front or in the
middle of the content.

devdanzin · 2008-12-14T11:31:53Z

Took me a bit of Wiresharking to find this out, the problem is that we
are asking for the page using HTTP 1.1 in 3.0.

Here's a workaround patch for those who need it quick, I have yet to
look at urllib to see what can be fixed there.

---
Index: Lib/http/client.py
===================================================================
--- Lib/http/client.py (revision 67716)
+++ Lib/http/client.py (working copy)
@@ -600,7 +600,7 @@
class HTTPConnection:

_http_vsn = 11
-    _http_vsn_str = 'HTTP/1.1'
+    _http_vsn_str = 'HTTP/1.0'

response_class = HTTPResponse
default_port = HTTP_PORT

This is what we send in 2.5 and 3.0:

GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.0
User-Agent: Python-urllib/1.17

GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.1
Accept-Encoding: identity
User-Agent: Python-urllib/3.1

pitrou · 2008-12-14T12:12:24Z

IMO we should downgrade urlopen to HTTP 1.0 in 3.0.1. Implementing
chunked encoding will come later if people are interested in doing it.

devdanzin · 2008-12-14T12:17:17Z

Clarifying the diagnosis, the offending spurious bytes are only present
when we use 3.0's GET above.

That's because urllib.request.HTTPHandler asks for a vanilla
http.client.HTTPConnection, which uses HTTP 1.1.

IIUC, either we change the request version back to 1.0 (attached patch)
or correct the way the response is processed (is it at all?).

I think HTTPSHandler will also suffer from this, perhaps
[Fancy]URLopener too.

[Antoine: cool, an edit conflict that agrees with what I was about to
post :D]

jhylton · 2008-12-15T04:01:47Z

Brief update: The Python 2.x code works because readline() is provided
by socket._fileobject. The Python 3.x code fails because it grabs the
HTTPResponse.fp instance variable at the end of
AbstractHTTPHandler.do_open. That method needs to pass the response to
addinfourl(), but needs to have support for readline / readlines before
it can do that.

jhylton · 2008-12-15T19:15:16Z

I have a patch here that seems to work for the specific url and that
passes all the tests. Can anyone check whether it works for a larger
set of cases?

I'm a little concerned because I don't understand the new io library in
much detail. There's an override for _checkClosed() in the HTTPResponse
that seems a little dodgy. I'll try to get someone to review that
specifically.

devdanzin · 2008-12-15T23:00:11Z

I think your patch is good, but there may be another bug around:

I wrote a script to check results of 3.x against 2.x, but many pages
(http://groups.google.com/, http://en.wikipedia.org/) give 403:
Forbidden for 3.x... but work with 2.x!

If you think of this as a bug in 3.x, it could retry the request
identifying as 2.x on 403.

Other than that, your patch gives me identical results to 2.5/2.6 for
128 sites I tested (only a read(100) for each).

Interestingly, my patched version gives a file closer to the buggy
version in size, at 12700 bytes versus 12707. Your version agrees with
2.x and simple maths (128 x 100) in giving a 12799 bytes result. I have
no idea why.

HTH,
Daniel

pitrou · 2008-12-21T13:24:51Z

The patch should have at least a test so that we don't have a regression
on this one.

devdanzin · 2009-02-08T01:17:55Z

Here's a test (in test_urllib2_localnet) that fails before the patch and
passes after, mostly lifted from test_httplib:

    def test_chunked(self):
        expected_response = b"hello world"
        chunked_start = (
                        b'a\r\n'
                        b'hello worl\r\n'
                        b'1\r\n'
                        b'd\r\n'
                        )
        response = [(200, [("Transfer-Encoding", "chunked")],
chunked_start)]
        handler = self.start_server(response)
        data = self.urlopen("http://localhost:%s/" % handler.port)
        self.assertEquals(data, expected_response)

Output:

test test_urllib2_localnet failed -- Traceback (most recent call last):
  File "~/py3k/Lib/test/test_urllib2_localnet.py", line 390, in test_chunked
    self.assertEquals(data, expected_response)
AssertionError: b'a\r\nhello worl\r\n1\r\nd\r\n' != b'hello world'

To allow this test to work, the attached patch also touches
FakeHTTPRequestHandler and TestUrlopen.urlopen.

pitrou · 2009-02-08T15:58:59Z

On the principle, the test looks good.
If you want to avoid the 'if "%" in value' hack, you can use the
named-parameter form of string formatting:

>>> "localhost:%(port)s" % dict(port=8080)
'localhost:8080'
>>> "localhost" % dict(port=8080)
'localhost'

devdanzin · 2009-02-09T00:24:21Z

Antoine,
Thanks for reviewing, here's an updated version.

pitrou · 2009-02-09T02:23:14Z

The test looks good to me.
I can't comment on the bugfix patch, but if it's ok to you, you can go
ahead :)

pitrou · 2009-02-10T22:31:05Z

I took a look at the patch and it looks ok, apart from the
_checkClosed() hack (but I don't think there's any immediate solution).
It should be noted that HTTPResponse.readline() will be awfully slow
since, as HTTPResponse doesn't have peek(), readline() will call read()
one byte at a time...

(slow I/O is nothing new in py3k, however :-))

pitrou · 2009-02-10T23:03:42Z

Here is a patch without the _checkClosed() hack. The solution is simply
to remove redundant _checkClosed() calls in IOBase (for example,
readline() doesn't need to do an explicit closed check as it calls
read()).

pitrou · 2009-02-11T01:02:32Z

Committed in r69513, r69514. Thanks everyone!

dato mannequin added the stdlib Python modules in the Lib dir label Dec 11, 2008

pitrou added the type-bug An unexpected behavior, bug, or error label Dec 14, 2008

jhylton mannequin self-assigned this Dec 14, 2008

loewis mannequin added the release-blocker label Dec 21, 2008

pitrou closed this as completed Feb 11, 2009

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

urlopen returns extra, spurious bytes #48881

urlopen returns extra, spurious bytes #48881

dato mannequin commented Dec 11, 2008

dato mannequin commented Dec 11, 2008

amauryfa commented Dec 11, 2008

devdanzin mannequin commented Dec 11, 2008

exarkun mannequin commented Dec 11, 2008

jhylton mannequin commented Dec 14, 2008

dato mannequin commented Dec 14, 2008

devdanzin mannequin commented Dec 14, 2008

dato mannequin commented Dec 14, 2008

ResulCetin mannequin commented Dec 14, 2008

devdanzin mannequin commented Dec 14, 2008

pitrou commented Dec 14, 2008

devdanzin mannequin commented Dec 14, 2008

jhylton mannequin commented Dec 15, 2008

jhylton mannequin commented Dec 15, 2008

devdanzin mannequin commented Dec 15, 2008

pitrou commented Dec 21, 2008

devdanzin mannequin commented Feb 8, 2009

pitrou commented Feb 8, 2009

devdanzin mannequin commented Feb 9, 2009

pitrou commented Feb 9, 2009

pitrou commented Feb 10, 2009

pitrou commented Feb 10, 2009

pitrou commented Feb 11, 2009

urlopen returns extra, spurious bytes #48881

urlopen returns extra, spurious bytes #48881

Comments

dato mannequin commented Dec 11, 2008

dato mannequin commented Dec 11, 2008

amauryfa commented Dec 11, 2008

devdanzin mannequin commented Dec 11, 2008

exarkun mannequin commented Dec 11, 2008

jhylton mannequin commented Dec 14, 2008

dato mannequin commented Dec 14, 2008

devdanzin mannequin commented Dec 14, 2008

dato mannequin commented Dec 14, 2008

ResulCetin mannequin commented Dec 14, 2008

devdanzin mannequin commented Dec 14, 2008

pitrou commented Dec 14, 2008

devdanzin mannequin commented Dec 14, 2008

jhylton mannequin commented Dec 15, 2008

jhylton mannequin commented Dec 15, 2008

devdanzin mannequin commented Dec 15, 2008

pitrou commented Dec 21, 2008

devdanzin mannequin commented Feb 8, 2009

pitrou commented Feb 8, 2009

devdanzin mannequin commented Feb 9, 2009

pitrou commented Feb 9, 2009

pitrou commented Feb 10, 2009

pitrou commented Feb 10, 2009

pitrou commented Feb 11, 2009