msg77603 - (view) |
Author: Adeodato Simó (dato) |
Date: 2008-12-11 11:23 |
This is very odd, but it was reproduced by people in #python as well.
Compare, in python 2.5:
>>>
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'
To the equivalent in python 3.0:
>>>
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
b'f65\r\n'
|
msg77605 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * |
Date: 2008-12-11 12:12 |
I don't reproduce the problem:
>>>
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'
I connect through a http proxy.
|
msg77611 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2008-12-11 13:13 |
Confirmed:
Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>>
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()
[b'f65\r\n', b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47
2001\n', ...
Perhaps it's related to the \r at read boundaries bug?
|
msg77612 - (view) |
Author: Jean-Paul Calderone (exarkun) * |
Date: 2008-12-11 13:53 |
The "f65" is the chunk length for the first chunk returned when
requesting that URL. A proxy could easily hide this by switching to a
different transfer encoding.
|
msg77775 - (view) |
Author: Jeremy Hylton (jhylton) |
Date: 2008-12-14 02:54 |
Does the same thing happen with 2.6?
Jeremy
On Thu, Dec 11, 2008 at 8:53 AM, Jean-Paul Calderone
<report@bugs.python.org> wrote:
>
> Jean-Paul Calderone <exarkun@divmod.com> added the comment:
>
> The "f65" is the chunk length for the first chunk returned when
> requesting that URL. A proxy could easily hide this by switching to a
> different transfer encoding.
>
> ----------
> nosy: +exarkun
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue4631>
> _______________________________________
> _______________________________________________
> Python-bugs-list mailing list
> Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/jeremy%40alum.mit.edu
>
>
|
msg77778 - (view) |
Author: Adeodato Simó (dato) |
Date: 2008-12-14 10:02 |
> Does the same thing happen with 2.6?
No, I can't reproduce with 2.6.1.
|
msg77779 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2008-12-14 10:08 |
Jeremy: no, it doesn't.
Python 2.6.1+ (release26-maint:67716M, Dec 13 2008, 10:30:52)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
~/release26-maint$ ./python -c "import urllib; print
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001
~/release26-maint$ ./python -c "from __future__ import unicode_literals;
import urllib; print
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()[0]"
From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001
FWIW, there are trailing spurious bytes too (note read() gives bytes,
while readlines() both bytes and strings in 3.0):
>>> import urllib.request; content =
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> content =
urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
>>> content[-30:]
b'PGP SIGNATURE-----\n\n\n\n\n\r\n0\r\n\r\n'
>>> content[:10]
b'f65\r\nFrom '
While in 2.6:
>>> import urllib
>>> content =
urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
>>> content[-30:]
'---END PGP SIGNATURE-----\n\n\n\n\n'
|
msg77780 - (view) |
Author: Adeodato Simó (dato) |
Date: 2008-12-14 10:11 |
> FWIW, there are trailing spurious bytes too
And in the middle of the document as well. Each time there's a chunk, I
guess?
|
msg77781 - (view) |
Author: Resul Cetin (ResulCetin) |
Date: 2008-12-14 10:20 |
I have the same problem with that code:
(exchange USERNAME with your delicious username and PASSWORD with your
delicious password):
import urllib.request
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password('del.icio.us API', 'api.del.icio.us',
USERNAME, PASSWORD)
opener = urllib.request.build_opener(auth_handler)
print(str(opener.open('https://api.del.icio.us/v1/posts/all').read(20),
"utf-8"))
And I don't use a proxy or anything like that. This makes python 3
completely unusable for me. And python 2.6 gives me what I want (the
content of that virtual file) without any extra data in front or in the
middle of the content.
|
msg77789 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2008-12-14 11:31 |
Took me a bit of Wiresharking to find this out, the problem is that we
are asking for the page using HTTP 1.1 in 3.0.
Here's a workaround patch for those who need it quick, I have yet to
look at urllib to see what can be fixed there.
---
Index: Lib/http/client.py
===================================================================
--- Lib/http/client.py (revision 67716)
+++ Lib/http/client.py (working copy)
@@ -600,7 +600,7 @@
class HTTPConnection:
_http_vsn = 11
- _http_vsn_str = 'HTTP/1.1'
+ _http_vsn_str = 'HTTP/1.0'
response_class = HTTPResponse
default_port = HTTP_PORT
---
This is what we send in 2.5 and 3.0:
GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.0
User-Agent: Python-urllib/1.17
GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.1
Accept-Encoding: identity
User-Agent: Python-urllib/3.1
|
msg77794 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-12-14 12:12 |
IMO we should downgrade urlopen to HTTP 1.0 in 3.0.1. Implementing
chunked encoding will come later if people are interested in doing it.
|
msg77796 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2008-12-14 12:17 |
Clarifying the diagnosis, the offending spurious bytes are only present
when we use 3.0's GET above.
That's because urllib.request.HTTPHandler asks for a vanilla
http.client.HTTPConnection, which uses HTTP 1.1.
IIUC, either we change the request version back to 1.0 (attached patch)
or correct the way the response is processed (is it at all?).
I think HTTPSHandler will also suffer from this, perhaps
[Fancy]URLopener too.
[Antoine: cool, an edit conflict that agrees with what I was about to
post :D]
|
msg77846 - (view) |
Author: Jeremy Hylton (jhylton) |
Date: 2008-12-15 04:01 |
Brief update: The Python 2.x code works because readline() is provided
by socket._fileobject. The Python 3.x code fails because it grabs the
HTTPResponse.fp instance variable at the end of
AbstractHTTPHandler.do_open. That method needs to pass the response to
addinfourl(), but needs to have support for readline / readlines before
it can do that.
|
msg77879 - (view) |
Author: Jeremy Hylton (jhylton) |
Date: 2008-12-15 19:15 |
I have a patch here that seems to work for the specific url and that
passes all the tests. Can anyone check whether it works for a larger
set of cases?
I'm a little concerned because I don't understand the new io library in
much detail. There's an override for _checkClosed() in the HTTPResponse
that seems a little dodgy. I'll try to get someone to review that
specifically.
|
msg77889 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2008-12-15 23:00 |
I think your patch is good, but there may be another bug around:
I wrote a script to check results of 3.x against 2.x, but many pages
(http://groups.google.com/, http://en.wikipedia.org/) give 403:
Forbidden for 3.x... but work with 2.x!
If you think of this as a bug in 3.x, it could retry the request
identifying as 2.x on 403.
Other than that, your patch gives me identical results to 2.5/2.6 for
128 sites I tested (only a read(100) for each).
Interestingly, my patched version gives a file closer to the buggy
version in size, at 12700 bytes versus 12707. Your version agrees with
2.x and simple maths (128 x 100) in giving a 12799 bytes result. I have
no idea why.
HTH,
Daniel
|
msg78144 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-12-21 13:24 |
The patch should have at least a test so that we don't have a regression
on this one.
|
msg81360 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2009-02-08 01:17 |
Here's a test (in test_urllib2_localnet) that fails before the patch and
passes after, mostly lifted from test_httplib:
def test_chunked(self):
expected_response = b"hello world"
chunked_start = (
b'a\r\n'
b'hello worl\r\n'
b'1\r\n'
b'd\r\n'
)
response = [(200, [("Transfer-Encoding", "chunked")],
chunked_start)]
handler = self.start_server(response)
data = self.urlopen("http://localhost:%s/" % handler.port)
self.assertEquals(data, expected_response)
Output:
test test_urllib2_localnet failed -- Traceback (most recent call last):
File "~/py3k/Lib/test/test_urllib2_localnet.py", line 390, in test_chunked
self.assertEquals(data, expected_response)
AssertionError: b'a\r\nhello worl\r\n1\r\nd\r\n' != b'hello world'
To allow this test to work, the attached patch also touches
FakeHTTPRequestHandler and TestUrlopen.urlopen.
|
msg81385 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-02-08 15:58 |
On the principle, the test looks good.
If you want to avoid the 'if "%" in value' hack, you can use the
named-parameter form of string formatting:
>>> "localhost:%(port)s" % dict(port=8080)
'localhost:8080'
>>> "localhost" % dict(port=8080)
'localhost'
|
msg81428 - (view) |
Author: Daniel Diniz (ajaksu2) * |
Date: 2009-02-09 00:24 |
Antoine,
Thanks for reviewing, here's an updated version.
|
msg81433 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-02-09 02:23 |
The test looks good to me.
I can't comment on the bugfix patch, but if it's ok to you, you can go
ahead :)
|
msg81610 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-02-10 22:31 |
I took a look at the patch and it looks ok, apart from the
_checkClosed() hack (but I don't think there's any immediate solution).
It should be noted that HTTPResponse.readline() will be awfully slow
since, as HTTPResponse doesn't have peek(), readline() will call read()
one byte at a time...
(slow I/O is nothing new in py3k, however :-))
|
msg81611 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-02-10 23:03 |
Here is a patch without the _checkClosed() hack. The solution is simply
to remove redundant _checkClosed() calls in IOBase (for example,
readline() doesn't need to do an explicit `closed` check as it calls
read()).
|
msg81617 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2009-02-11 01:02 |
Committed in r69513, r69514. Thanks everyone!
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:42 | admin | set | github: 48881 |
2009-02-11 01:02:33 | pitrou | set | status: pending -> closed resolution: accepted -> fixed messages:
+ msg81617 |
2009-02-11 00:39:28 | pitrou | set | status: open -> pending resolution: accepted |
2009-02-10 23:03:48 | pitrou | set | files:
+ urllib-chunked2.diff messages:
+ msg81611 |
2009-02-10 22:31:04 | pitrou | set | messages:
+ msg81610 |
2009-02-09 02:23:13 | pitrou | set | messages:
+ msg81433 |
2009-02-09 00:24:38 | ajaksu2 | set | files:
- test_urllib_chunked.diff |
2009-02-09 00:24:23 | ajaksu2 | set | files:
+ test_urllib_chunked2.diff messages:
+ msg81428 |
2009-02-08 15:58:59 | pitrou | set | messages:
+ msg81385 |
2009-02-08 01:17:58 | ajaksu2 | set | files:
+ test_urllib_chunked.diff messages:
+ msg81360 |
2009-01-28 19:19:09 | exarkun | set | nosy:
- exarkun |
2009-01-28 18:57:18 | trodgers | set | nosy:
+ trodgers |
2009-01-10 04:12:55 | craigh | set | nosy:
+ craigh |
2008-12-21 13:24:51 | pitrou | set | messages:
+ msg78144 |
2008-12-21 09:43:26 | loewis | set | priority: critical -> release blocker |
2008-12-15 23:00:12 | ajaksu2 | set | messages:
+ msg77889 |
2008-12-15 19:15:21 | jhylton | set | files:
+ urllib-chunked.diff messages:
+ msg77879 |
2008-12-15 04:01:47 | jhylton | set | messages:
+ msg77846 |
2008-12-15 03:59:44 | jhylton | link | issue3761 superseder |
2008-12-14 22:00:56 | jhylton | set | assignee: jhylton |
2008-12-14 12:17:17 | ajaksu2 | set | files:
+ urllib_bytes.diff keywords:
+ patch messages:
+ msg77796 |
2008-12-14 12:12:23 | pitrou | set | nosy:
+ pitrou messages:
+ msg77794 |
2008-12-14 12:05:21 | pitrou | set | priority: critical type: behavior |
2008-12-14 11:31:53 | ajaksu2 | set | messages:
+ msg77789 |
2008-12-14 10:20:45 | ResulCetin | set | nosy:
+ ResulCetin messages:
+ msg77781 |
2008-12-14 10:11:42 | dato | set | messages:
+ msg77780 |
2008-12-14 10:08:12 | ajaksu2 | set | messages:
+ msg77779 |
2008-12-14 10:02:43 | dato | set | messages:
+ msg77778 |
2008-12-14 02:54:55 | jhylton | set | nosy:
+ jhylton messages:
+ msg77775 |
2008-12-11 13:53:58 | exarkun | set | nosy:
+ exarkun messages:
+ msg77612 |
2008-12-11 13:13:23 | ajaksu2 | set | nosy:
+ ajaksu2 messages:
+ msg77611 |
2008-12-11 12:12:45 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages:
+ msg77605 |
2008-12-11 11:23:01 | dato | create | |