urlopen returns extra, spurious bytes #48881
Comments
This is very odd, but it was reproduced by people in #python as well.
>>> urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n'
To the equivalent in python 3.0:
>>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readline()
b'f65\r\n' |
I don't reproduce the problem:
I connect through a http proxy. |
Confirmed:
Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').readlines()
[b'f65\r\n', b'From mechanix@lucretia.debian.net Tue Dec 11 11:32:47 2001\n', ...
Perhaps it's related to the \r at read boundaries bug? |
The "f65" is the chunk length for the first chunk returned when |
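For readers unfamiliar with chunked transfer encoding: each chunk is preceded by its length in hexadecimal on its own line, so "f65" would announce a chunk of 3941 bytes. The following is a minimal sketch of a decoder, not the code from any patch in this thread; the function name `decode_chunked` is hypothetical, and it assumes a complete, well-formed body with no trailers.

```python
def decode_chunked(raw):
    """Strip HTTP/1.1 chunk framing from a complete chunked body.

    Hypothetical helper for illustration; assumes well-formed input
    that ends with the zero-length terminating chunk.
    """
    body = []
    pos = 0
    while True:
        # The size line looks like b"f65\r\n", optionally with ";extensions".
        line_end = raw.index(b"\r\n", pos)
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            # A zero-length chunk marks the end of the body.
            break
        body.append(raw[pos:pos + size])
        # Skip the chunk data and the CRLF that follows it.
        pos += size + 2
    return b"".join(body)


if __name__ == "__main__":
    print(int("f65", 16))
    print(decode_chunked(b"a\r\nhello worl\r\n1\r\nd\r\n0\r\n\r\n"))
```

A client that returns the raw stream without this step exposes the size lines to the caller, which matches the b'f65\r\n' seen above.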
Does the same thing happen with 2.6?
Jeremy
On Thu, Dec 11, 2008 at 8:53 AM, Jean-Paul Calderone
|
No, I can't reproduce with 2.6.1. |
Jeremy: no, it doesn't.
Python 2.6.1+ (release26-maint:67716M, Dec 13 2008, 10:30:52)
~/release26-maint$ ./python -c "import urllib; print
~/release26-maint$ ./python -c "from __future__ import unicode_literals;
FWIW, there are trailing spurious bytes too (note read() gives bytes,
while readlines() both bytes and strings in 3.0):
>>> import urllib.request; content = urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
Python 3.1a0 (py3k:67702, Dec 11 2008, 11:09:14)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib.request
>>> content = urllib.request.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
>>> content[-30:]
b'PGP SIGNATURE-----\n\n\n\n\n\r\n0\r\n\r\n'
>>> content[:10]
b'f65\r\nFrom '
While in 2.6:
>>> import urllib
>>> content = urllib.urlopen('http://bugs.debian.org/cgi-bin/bugreport.cgi?mbox=yes;bug=123456').read()
>>> content[-30:]
'---END PGP SIGNATURE-----\n\n\n\n\n' |
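The leading b'f65\r\n' and the trailing b'\r\n0\r\n\r\n' in the 3.x output are what chunk framing looks like when it is passed through undecoded. Here is a small sketch that produces the same framing from a plain payload; `encode_chunked` is a hypothetical name, and the 3941-byte payload is only an assumption chosen to match the "f65" size line.

```python
def encode_chunked(payload, chunk_size):
    """Frame payload as an HTTP/1.1 chunked body without trailers.

    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    framed = []
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        # Chunk size in hexadecimal, then CRLF, then the data, then CRLF.
        framed.append(b"%x\r\n" % len(chunk))
        framed.append(chunk + b"\r\n")
    # A zero-length chunk followed by an empty line terminates the body.
    framed.append(b"0\r\n\r\n")
    return b"".join(framed)


if __name__ == "__main__":
    framed = encode_chunked(b"x" * 3941, 4096)
    print(framed[:5], framed[-7:])
```

The head and tail of the framed bytes line up with content[:10] and content[-30:] in the 3.x session above, while the 2.6 result contains only the payload.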
And in the middle of the document as well. Each time there's a chunk, I |
I have the same problem with that code:
(exchange USERNAME with your delicious username and PASSWORD with your
And I don't use a proxy or anything like that. This makes python 3 |
Took me a bit of Wiresharking to find this out, the problem is that we
Here's a workaround patch for those who need it quick, I have yet to
---
  _http_vsn = 11
- _http_vsn_str = 'HTTP/1.1'
+ _http_vsn_str = 'HTTP/1.0'
  response_class = HTTPResponse
  default_port = HTTP_PORT
This is what we send in 2.5 and 3.0:
GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.0
GET /cgi-bin/bugreport.cgi?mbox=yes;bug=123456 HTTP/1.1 |
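The same workaround can be tried without editing the standard library, by overriding the attributes in a subclass. This is only a sketch: `_http_vsn` and `_http_vsn_str` are private attributes of `http.client.HTTPConnection` and may change between versions, the class names are hypothetical, and unlike the diff above it sets both attributes to the 1.0 values. The `send` override records the bytes instead of opening a socket, so the request line can be inspected offline.

```python
import http.client


class HTTP10Connection(http.client.HTTPConnection):
    # Private attributes; overriding them is an unsupported workaround.
    _http_vsn = 10
    _http_vsn_str = "HTTP/1.0"


class RecordingHTTP10Connection(HTTP10Connection):
    """Hypothetical test double: captures outgoing bytes, never connects."""

    def __init__(self, host):
        super().__init__(host)
        self.sent = []

    def send(self, data):
        # Record what would have been written to the socket.
        self.sent.append(data)


def request_line(path):
    conn = RecordingHTTP10Connection("example.invalid")
    conn.request("GET", path)
    # The first line of the first block sent is the request line.
    return conn.sent[0].split(b"\r\n", 1)[0]


if __name__ == "__main__":
    print(request_line("/cgi-bin/bugreport.cgi?mbox=yes;bug=123456"))
```

A server that sees an HTTP/1.0 request should not answer with a chunked body, which is why the downgrade hides the symptom rather than fixing the missing decoding.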
IMO we should downgrade urlopen to HTTP 1.0 in 3.0.1. Implementing |
Clarifying the diagnosis, the offending spurious bytes are only present
That's because urllib.request.HTTPHandler asks for a vanilla
IIUC, either we change the request version back to 1.0 (attached patch)
I think HTTPSHandler will also suffer from this, perhaps
[Antoine: cool, an edit conflict that agrees with what I was about to |
Brief update: The Python 2.x code works because readline() is provided |
I have a patch here that seems to work for the specific url and that
I'm a little concerned because I don't understand the new io library in |
I think your patch is good, but there may be another bug around:
I wrote a script to check results of 3.x against 2.x, but many pages
If you think of this as a bug in 3.x, it could retry the request
Other than that, your patch gives me identical results to 2.5/2.6 for
Interestingly, my patched version gives a file closer to the buggy
HTH, |
The patch should have at least a test so that we don't have a regression |
Here's a test (in test_urllib2_localnet) that fails before the patch and
    def test_chunked(self):
        expected_response = b"hello world"
        chunked_start = (
            b'a\r\n'
            b'hello worl\r\n'
            b'1\r\n'
            b'd\r\n'
        )
        response = [(200, [("Transfer-Encoding", "chunked")],
                     chunked_start)]
        handler = self.start_server(response)
        data = self.urlopen("http://localhost:%s/" % handler.port)
        self.assertEquals(data, expected_response)
Output:
test test_urllib2_localnet failed -- Traceback (most recent call last):
  File "~/py3k/Lib/test/test_urllib2_localnet.py", line 390, in test_chunked
    self.assertEquals(data, expected_response)
AssertionError: b'a\r\nhello worl\r\n1\r\nd\r\n' != b'hello world'
To allow this test to work, the attached patch also touches |
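For anyone wanting to check the behaviour outside the test suite, here is a standalone sketch in the spirit of the test above. It is not the code from the attached patch: the handler class and `fetch_chunked_once` are hypothetical names, it uses `http.server` rather than the helpers in test_urllib2_localnet, and it appends the terminating zero-length chunk, which the test body above leaves out.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same framing as the test above, plus the terminating chunk.
CHUNKED_BODY = b"a\r\nhello worl\r\n1\r\nd\r\n0\r\n\r\n"


class ChunkedHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        self.send_response(200)
        self.send_header("Transfer-Encoding", "chunked")
        self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(CHUNKED_BODY)

    def log_message(self, format, *args):
        # Keep the example quiet.
        pass


def fetch_chunked_once():
    """Serve one chunked response on localhost and read it with urllib."""
    server = HTTPServer(("127.0.0.1", 0), ChunkedHandler)
    server.timeout = 5  # do not wait forever if the client never connects
    thread = threading.Thread(target=server.handle_request)
    thread.start()
    try:
        # Empty ProxyHandler so environment proxy settings are ignored.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = "http://127.0.0.1:%d/" % server.server_port
        with opener.open(url, timeout=5) as response:
            return response.read()
    finally:
        thread.join()
        server.server_close()


if __name__ == "__main__":
    print(fetch_chunked_once())
```

On an interpreter with the fix, the result should be the payload alone; with the bug described in this issue, the framing bytes would appear in the returned data, as in the AssertionError above.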
On the principle, the test looks good.
>>> "localhost:%(port)s" % dict(port=8080)
'localhost:8080'
>>> "localhost" % dict(port=8080)
'localhost' |
Antoine, |
The test looks good to me. |
I took a look at the patch and it looks ok, apart from the
(slow I/O is nothing new in py3k, however :-)) |
Here is a patch without the _checkClosed() hack. The solution is simply |
Committed in r69513, r69514. Thanks everyone! |