classification
Title: urlopen returns extra, spurious bytes
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jhylton Nosy List: ResulCetin, ajaksu2, amaury.forgeotdarc, craigh, dato, jhylton, pitrou, trodgers
Priority: release blocker Keywords: patch

Created on 2008-12-11 11:23 by dato, last changed 2009-02-11 01:02 by pitrou. This issue is now closed.

Files
File name Uploaded Description Edit
urllib_bytes.diff ajaksu2, 2008-12-14 12:17 Makes urlib.request.urlopen.HTTPHandler use HTTP 1.0 again
urllib-chunked.diff jhylton, 2008-12-15 19:15
test_urllib_chunked2.diff ajaksu2, 2009-02-09 00:24 Tests "Transfer-Encoding: chunked" handling in urllib
urllib-chunked2.diff pitrou, 2009-02-10 23:03
Messages (23)
msg77603 - (view) Author: Adeodato Simó (dato) Date: 2008-12-11 11:23
This is very odd, but it was reproduced by people in #python as well. 
Compare, in python 2.5:

>>> 
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'

To the equivalent in python 3.0:

>>> 
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
b'f65\r\n'
msg77605 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-12-11 12:12
I don't reproduce the problem:

>>>
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'

I connect through a http proxy.
msg77611 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2008-12-11 13:13
Confirmed:

Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>>
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()
[b'f65\r\n', b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47
2001\n',  ...

Perhaps it's related to the \r at read boundaries bug?
msg77612 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2008-12-11 13:53
The "f65" is the chunk length for the first chunk returned when
requesting that URL.  A proxy could easily hide this by switching to a
different transfer encoding.
msg77775 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2008-12-14 02:54
Does the same thing happen with 2.6?

Jeremy

On Thu, Dec 11, 2008 at 8:53 AM, Jean-Paul Calderone
<report@bugs.python.org> wrote:
>
> Jean-Paul Calderone <exarkun@divmod.com> added the comment:
>
> The "f65" is the chunk length for the first chunk returned when
> requesting that URL.  A proxy could easily hide this by switching to a
> different transfer encoding.
>
> ----------
> nosy: +exarkun
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue4631>
> _______________________________________
> _______________________________________________
> Python-bugs-list mailing list
> Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/jeremy%40alum.mit.edu
>
>
msg77778 - (view) Author: Adeodato Simó (dato) Date: 2008-12-14 10:02
> Does the same thing happen with 2.6?

No, I can't reproduce with 2.6.1.
msg77779 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2008-12-14 10:08
Jeremy: no, it doesn't.

Python 2.6.1+ (release26-maint:67716M, Dec 13 2008, 10:30:52)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2

~/release26-maint$ ./python -c "import urllib; print
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001

~/release26-maint$ ./python -c "from __future__ import unicode_literals;
import urllib; print
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001


FWIW, there are trailing spurious bytes too (note read() gives bytes,
while readlines() both bytes and strings in 3.0):
>>> import urllib.request; content =
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()

Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request

>>> content =
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()

>>> content[-30:]
b'PGP SIGNATURE-----\n\n\n\n\n\r\n0\r\n\r\n'

>>> content[:10]
b'f65\r\nFrom '

While in 2.6:
>>> import urllib
>>> content =
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
>>> content[-30:]
'---END PGP SIGNATURE-----\n\n\n\n\n'
msg77780 - (view) Author: Adeodato Simó (dato) Date: 2008-12-14 10:11
> FWIW, there are trailing spurious bytes too

And in the middle of the document as well. Each time there's a chunk, I
guess?
msg77781 - (view) Author: Resul Cetin (ResulCetin) Date: 2008-12-14 10:20
I have the same problem with that code:

(exchange USERNAME with your delicious username and PASSWORD with your 
delicious password):
 import urllib.request
 auth_handler = urllib.request.HTTPBasicAuthHandler()
 auth_handler.add_password('del.icio.us API', 'api.del.icio.us',
USERNAME, PASSWORD)
 opener = urllib.request.build_opener(auth_handler)
 print(str(opener.open('https://api.del.icio.us/v1/posts/all').read(20),
"utf-8"))

And I don't use a proxy or anything like that. This makes python 3
completely unusable for me. And python 2.6 gives me what I want (the
content of that virtual file) without any extra data in front or in the
middle of the content.
msg77789 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2008-12-14 11:31
Took me a bit of Wiresharking to find this out, the problem is that we
are asking for the page using HTTP 1.1 in 3.0.

Here's a workaround patch for those who need it quick, I have yet to
look at urllib to see what can be fixed there.

---
Index: Lib/http/client.py
===================================================================
--- Lib/http/client.py  (revision 67716)
+++ Lib/http/client.py  (working copy)
@@ -600,7 +600,7 @@
class HTTPConnection:

_http_vsn = 11
-    _http_vsn_str = 'HTTP/1.1'
+    _http_vsn_str = 'HTTP/1.0'

response_class = HTTPResponse
default_port = HTTP_PORT
---


This is what we send in 2.5 and 3.0:

GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.0
               User-Agent: Python-urllib/1.17

GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.1
    Accept-Encoding: identity
    User-Agent: Python-urllib/3.1
msg77794 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-14 12:12
IMO we should downgrade urlopen to HTTP 1.0 in 3.0.1. Implementing
chunked encoding will come later if people are interested in doing it.
msg77796 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2008-12-14 12:17
Clarifying the diagnosis, the offending spurious bytes are only present
when we use 3.0's GET above.

That's because urllib.request.HTTPHandler asks for a vanilla
http.client.HTTPConnection, which uses HTTP 1.1.

IIUC, either we change the request version back to 1.0 (attached patch)
or correct the way the response is processed (is it at all?).

I think HTTPSHandler will also suffer from this, perhaps
[Fancy]URLopener too.

[Antoine: cool, an edit conflict that agrees with what I was about to
post :D]
msg77846 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2008-12-15 04:01
Brief update:  The Python 2.x code works because readline() is provided
by socket._fileobject.  The Python 3.x code fails because it grabs the
HTTPResponse.fp instance variable at the end of
AbstractHTTPHandler.do_open.  That method needs to pass the response to
addinfourl(), but needs to have support for readline / readlines before
it can do that.
msg77879 - (view) Author: Jeremy Hylton (jhylton) (Python triager) Date: 2008-12-15 19:15
I have a patch here that seems to work for the specific url and that
passes all the tests.  Can anyone check whether it works for a larger
set of cases?

I'm a little concerned because I don't understand the new io library in
much detail.  There's an override for _checkClosed() in the HTTPResponse
that seems a little dodgy.  I'll try to get someone to review that
specifically.
msg77889 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2008-12-15 23:00
I think your patch is good, but there may be another bug around:

I wrote a script to check results of 3.x against 2.x, but many pages
(http://groups.google.com/, http://en.wikipedia.org/) give 403:
Forbidden for 3.x... but work with 2.x!

If you think of this as a bug in 3.x, it could retry the request
identifying as 2.x on 403.

Other than that, your patch gives me identical results to 2.5/2.6 for
128 sites I tested (only a read(100) for each).

Interestingly, my patched version gives a file closer to the buggy
version in size, at 12700 bytes versus 12707. Your version agrees with
2.x and simple maths (128 x 100) in giving a 12799 bytes result. I have
no idea why.

HTH,
Daniel
msg78144 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-12-21 13:24
The patch should have at least a test so that we don't have a regression
on this one.
msg81360 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-02-08 01:17
Here's a test (in test_urllib2_localnet) that fails before the patch and
passes after, mostly lifted from test_httplib:

    def test_chunked(self):
        expected_response = b"hello world"
        chunked_start = (
                        b'a\r\n'
                        b'hello worl\r\n'
                        b'1\r\n'
                        b'd\r\n'
                        )
        response = [(200, [("Transfer-Encoding", "chunked")],
chunked_start)]
        handler = self.start_server(response)
        data = self.urlopen("http://localhost:%s/" % handler.port)
        self.assertEquals(data, expected_response)

Output:

test test_urllib2_localnet failed -- Traceback (most recent call last):
  File "~/py3k/Lib/test/test_urllib2_localnet.py", line 390, in test_chunked
    self.assertEquals(data, expected_response)
AssertionError: b'a\r\nhello worl\r\n1\r\nd\r\n' != b'hello world'

To allow this test to work, the attached patch also touches
FakeHTTPRequestHandler and TestUrlopen.urlopen.
msg81385 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-08 15:58
On the principle, the test looks good.
If you want to avoid the 'if "%" in value' hack, you can use the
named-parameter form of string formatting:

>>> "localhost:%(port)s" % dict(port=8080)
'localhost:8080'
>>> "localhost" % dict(port=8080)
'localhost'
msg81428 - (view) Author: Daniel Diniz (ajaksu2) (Python triager) Date: 2009-02-09 00:24
Antoine,
Thanks for reviewing, here's an updated version.
msg81433 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-09 02:23
The test looks good to me.
I can't comment on the bugfix patch, but if it's ok to you, you can go
ahead :)
msg81610 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-10 22:31
I took a look at the patch and it looks ok, apart from the
_checkClosed() hack (but I don't think there's any immediate solution).
It should be noted that HTTPResponse.readline() will be awfully slow
since, as HTTPResponse doesn't have peek(), readline() will call read()
one byte at a time...

(slow I/O is nothing new in py3k, however :-))
msg81611 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-10 23:03
Here is a patch without the _checkClosed() hack. The solution is simply
to remove redundant _checkClosed() calls in IOBase (for example,
readline() doesn't need to do an explicit `closed` check as it calls
read()).
msg81617 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-02-11 01:02
Committed in r69513, r69514. Thanks everyone!
History
Date User Action Args
2009-02-11 01:02:33pitrousetstatus: pending -> closed
resolution: accepted -> fixed
messages: + msg81617
2009-02-11 00:39:28pitrousetstatus: open -> pending
resolution: accepted
2009-02-10 23:03:48pitrousetfiles: + urllib-chunked2.diff
messages: + msg81611
2009-02-10 22:31:04pitrousetmessages: + msg81610
2009-02-09 02:23:13pitrousetmessages: + msg81433
2009-02-09 00:24:38ajaksu2setfiles: - test_urllib_chunked.diff
2009-02-09 00:24:23ajaksu2setfiles: + test_urllib_chunked2.diff
messages: + msg81428
2009-02-08 15:58:59pitrousetmessages: + msg81385
2009-02-08 01:17:58ajaksu2setfiles: + test_urllib_chunked.diff
messages: + msg81360
2009-01-28 19:19:09exarkunsetnosy: - exarkun
2009-01-28 18:57:18trodgerssetnosy: + trodgers
2009-01-10 04:12:55craighsetnosy: + craigh
2008-12-21 13:24:51pitrousetmessages: + msg78144
2008-12-21 09:43:26loewissetpriority: critical -> release blocker
2008-12-15 23:00:12ajaksu2setmessages: + msg77889
2008-12-15 19:15:21jhyltonsetfiles: + urllib-chunked.diff
messages: + msg77879
2008-12-15 04:01:47jhyltonsetmessages: + msg77846
2008-12-15 03:59:44jhyltonlinkissue3761 superseder
2008-12-14 22:00:56jhyltonsetassignee: jhylton
2008-12-14 12:17:17ajaksu2setfiles: + urllib_bytes.diff
keywords: + patch
messages: + msg77796
2008-12-14 12:12:23pitrousetnosy: + pitrou
messages: + msg77794
2008-12-14 12:05:21pitrousetpriority: critical
type: behavior
2008-12-14 11:31:53ajaksu2setmessages: + msg77789
2008-12-14 10:20:45ResulCetinsetnosy: + ResulCetin
messages: + msg77781
2008-12-14 10:11:42datosetmessages: + msg77780
2008-12-14 10:08:12ajaksu2setmessages: + msg77779
2008-12-14 10:02:43datosetmessages: + msg77778
2008-12-14 02:54:55jhyltonsetnosy: + jhylton
messages: + msg77775
2008-12-11 13:53:58exarkunsetnosy: + exarkun
messages: + msg77612
2008-12-11 13:13:23ajaksu2setnosy: + ajaksu2
messages: + msg77611
2008-12-11 12:12:45amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg77605
2008-12-11 11:23:01datocreate