Classification
Title: urllib2 can't handle http://www.wikispaces.com
Type: behavior    Stage:
Components: Library (Lib)    Versions: Python 3.0, Python 2.6, Python 2.5

Process
Status: closed    Resolution: fixed
Dependencies:    Superseder:
Assigned To: facundobatista    Nosy List: facundobatista, gregory.p.smith, jjlee, orsenthil, weijie90
Priority: low    Keywords: patch

Created on 2008-03-23 14:41 by weijie90, last changed 2008-12-02 20:47 by jjlee. This issue is now closed.

Files
File name Uploaded Description Edit
issue2464-facundo.diff facundobatista, 2008-08-16 19:30
issue2464-py26-FINAL.diff orsenthil, 2008-08-17 02:42
issue2464-PATCH1.diff orsenthil, 2008-08-17 02:42
issue2463-py3k.diff orsenthil, 2008-08-17 02:56
Messages (18)
msg64363 - (view) Author: Koh Wei Jie (weijie90) Date: 2008-03-23 14:41
Try the following code:

import urllib2
gmail = urllib2.urlopen("https://www.gmail.com").read()
wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()

Getting the html over HTTPS from gmail.com works, but not over HTTP from
wikispaces. Here's the traceback:
>>> wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 412, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
    return self.parent.open(new)
  File "/usr/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 412, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
    return self.parent.open(new)
  File "/usr/lib/python2.5/urllib2.py", line 374, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 392, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1100, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1075, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error (104, 'Connection reset by peer')>

Note the two 302 redirects.

I tried accessing wikispaces.com with SSL turned off in Firefox
2.0.0.12; that didn't work because SSL was required, perhaps somewhere
in between the redirects that wikispaces uses.

Why doesn't urllib2 handle the "hidden" SSL properly? (Not to be rude,
but httplib2 works.)

Thanks!
WJ
msg64676 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-29 03:13
The problem does not appear to have anything to do with SSL.  The
problem is that the chain of HTTP requests goes:

GET -> 302 -> 302 -> 301

On the final 301, urllib2's internal state is messed up such that by the
time it's in the http_error_302 method it believes the Location header
contains the following:

'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'

The \x00 and everything after it should not be there and is not there if
you look at what is sent over the socket itself.  The ?responseToken=xxx
value is leftover from the previous 302 response.  No idea where the
\x00 came from yet.  I'm debugging...
msg64677 - (view) Author: Koh Wei Jie (weijie90) Date: 2008-03-29 03:41
Please take your time, because this bug isn't critical. Thanks!
msg64679 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-29 03:48
Instrumenting the code and looking closer at the tcpdump, it's true:
wikispaces.com is returning an invalid Location: header with a null byte
in the middle of it.

The "fix" on our end should be to handle such garbage from such broken
web servers more gracefully.  Other clients seem to treat the null as an
end of string or end of that header.
msg64680 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-29 04:19
I'm not sure what the best solution for this is.  If I truncate the
header values at a \x00 character, it ends in an infinite redirect loop
(which urllib2 detects and raises on).  If I simply remove all \x00
characters, the resulting URL is not accepted by wikispaces.com due to
having an extra / in it.

Verdict: wikispaces.com is broken.

urllib2 could do better.  wget and Firefox deal with it properly, but
I'll leave deciding which patch to use up to someone who cares about
handling broken sites.

A patch implementing either behavior for dealing with nulls where they
shouldn't be:

Index: Lib/httplib.py
===================================================================
--- Lib/httplib.py      (revision 62033)
+++ Lib/httplib.py      (working copy)
@@ -291,9 +291,18 @@
                 break
             headerseen = self.isheader(line)
             if headerseen:
+                # Some bad web servers reply with headers with a \x00 null
+                # embedded in the value.  Other http clients deal with
+                # this by treating it as a value terminator, ignoring the
+                # rest so we will too.  http://bugs.python.org/issue2464.
+                if '\x00' in line:
+                    line = line[:line.find('\x00')]
+                    # if you want to just remove nulls instead use this:
+                    #line = line.replace('\x00', '')
                 # It's a legal header line, save it.
                 hlist.append(line)
-                self.addheader(headerseen, line[len(headerseen)+1:].strip())
+                value = line[len(headerseen)+1:].strip()
+                self.addheader(headerseen, value)
                 continue
             else:
                 # It's not a header line; throw it back and stop here.
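In modern Python 3 terms, the two candidate behaviors in the patch can be sketched as follows (`sanitize_header_value` is a hypothetical helper for illustration, not part of httplib):

```python
# Hypothetical helper illustrating the two strategies from the patch above:
# truncating a header value at the first NUL byte, or stripping all NULs.

def sanitize_header_value(value, strategy="truncate"):
    """Clean a header value containing embedded NUL bytes."""
    if "\x00" not in value:
        return value
    if strategy == "truncate":
        # Treat the NUL as a value terminator, as other clients do.
        return value[:value.find("\x00")]
    # Alternative: drop the NULs but keep the rest of the value.
    return value.replace("\x00", "")

location = "http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522"
print(sanitize_header_value(location))
# http://www.wikispaces.com/
print(sanitize_header_value(location, strategy="strip"))
# http://www.wikispaces.com//?responseToken=481aec3249f429316459e01c00b7e522
```

Note the extra / left behind by the strip strategy, which is exactly why wikispaces.com rejected that variant.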
msg66889 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-05-16 03:07
The issue is not just with the null character. If you observe now, the
redirection is 302-302-200 and there is no null character.
However, urllib2 is still unable to handle the multiple redirection
properly (IIRC, there is a portion of code to handle multiple
redirections and exit on an infinite loop):
>>> url = "http://www.wikispaces.com"
>>> opened = urllib.urlopen(url)
>>> print opened.geturl()
http://www.wikispaces.com?responseToken=344289da354a29c67d48928dbe72042a
>>> print opened.read()
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx/0.6.30</center>
</body>
</html>

Needs a relook, IMO.
msg71231 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-16 19:30
Senthil:

Look at that URL that the server returned in the second redirect:

http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d

Note that the "?" appears directly after the host, with no path in between.

Check item 3.2.2 of RFC 2616; it says that an HTTP URL should be:

  http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

So we should fix the URL that the server returned. Guess what: if we
put in a "/" (as the RFC mandates), everything works fine.

The patch I attach here does that. All tests pass ok.

What do you think?
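In Python 3 terms (urllib.parse replaced the old urlparse module), the fix can be sketched like this; `ensure_abs_path` is a hypothetical name used for illustration, while the committed fix lives inside urllib2's redirect handling:

```python
# Sketch of the fix: if a URL has a netloc but an empty path, insert the
# "/" that RFC 2616 section 3.2.2 requires before the query string.
from urllib.parse import urlparse, urlunparse

def ensure_abs_path(url):
    parts = urlparse(url)
    if parts.netloc and not parts.path:
        # ParseResult is a namedtuple, so _replace builds a fixed copy.
        parts = parts._replace(path="/")
    return urlunparse(parts)

broken = "http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d"
print(ensure_abs_path(broken))
# http://www.wikispaces.com/?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d
```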
msg71255 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-17 01:20
looks good to me.
msg71259 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-17 02:42
Ah, so that was a simple fix. :) I very much overlooked the problem,
even after being given all the hints at the web-sig.

I have some comments on the patch, Facundo.
1) I don't think it is a good idea to include that portion in the
http_error_302 method. That makes the fix "very" specific to "this"
issue only.
Another point is that fixing broken URLs should not be done in urllib2;
urlparse would be a better place.
So I came up with an approach wherein urllib2 does unparse(parse) of
the url and the parse methods fix the url if it is broken. (See
attached issue2464-PATCH1.diff)

But if we handle it in the urlparse methods, then we are much more
susceptible to breaking RFC conformance and breaking a lot of tests,
which is not a good idea.

So I introduced a fix_broken() method in urlparse and called it to
solve the issue, using the same logic as yours
(issue2464-py26-FINAL.diff). With a fix_broken() method in urlparse, we
will have better control whenever we want to implement a behavior that
is RFC non-conforming but implemented widely by browsers and clients.

All tests pass with issue2464-py26-FINAL.diff

Comments,please?
msg71260 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-17 02:56
Patch for py3k, but please test this before applying.
msg71261 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-17 03:39
Senthil: I don't like that.

Creating a public method called "fix_broken", introducing new behaviours
now in beta, and actually not fixing the url for every possible kind of
breakage (just adding the path if it's not there) is way too much for
this fix.

I committed the change I proposed. Maybe in the future we will have a
"generic url fixing" function; now is not the moment.
msg71295 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-17 22:38
I was pondering whether it should go in urlparse instead.  If it did, I
think it should be part of urlparse.urlunparse, to ensure that there is
always a trailing slash after the host:port regardless of what the
inputs are.

Anyway, agreed, this fixes this specific bug.  Should it be backported
to release25-maint?
msg71303 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-18 00:08
Maybe we can put it in urlunparse... do you all agree with these test cases?

def test_alwayspath(self):
    u = urlparse.urlparse("http://netloc/path;params?query#fragment")
    self.assertEqual(urlparse.urlunparse(u),
                     "http://netloc/path;params?query#fragment")

    u = urlparse.urlparse("http://netloc?query#fragment")
    self.assertEqual(urlparse.urlunparse(u),
                     "http://netloc/?query#fragment")

    u = urlparse.urlparse("http://netloc#fragment")
    self.assertEqual(urlparse.urlunparse(u), "http://netloc/#fragment")



Maybe we could backport this more general fix...
msg71304 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-18 00:51
That test case looks good to me for 2.6 and 3.0.  Also add a note to the
documentation with a versionchanged 2.6 about urlunparse always ensuring
there is a / between the netloc and the rest of the url.

I would not back port the more general urlunparse behavior change to 2.5.
msg71990 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-26 17:51
Gregory... I tried to make urlunparse fill in the path, and other
functions that use it started to fail.

As we're so close to the final releases, I'll leave this as it is right
now, which actually fixed the bug...
msg72024 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-27 12:50
That was the reason for making fix_broken in urlparse in my patch, Facundo.
I had thought it should be handled in the urlparse module, and if we
make changes in urlparse.urlunparse/urlparse.urlparse, then we are
stepping into an area that will break a lot of tests.

I am kind of +0 with the current fix in urllib2. Should we think/plan
for something in urlparse, akin to fix_broken?
msg76736 - (view) Author: John J Lee (jjlee) Date: 2008-12-02 13:00
This fix was applied in the wrong place.

URI path components, and HTTP URI path components in particular, *can*
be empty.  See RFC 3986.  So the comment in the code that was inserted
with the fix for this bug that says "possibly malformed" is incorrect,
and should instead just refer to section 3.2.2 of RFC 2616.  Also,
because 3.2.2 says "If the abs_path is not present in the URL, it MUST
be given as "/" when used as a Request-URI for a resource (section
5.1.2)", it seems clear that this transformation (add the slash if
there's no path component) should always happen when retrieving URIs,
regardless of where the URI came from -- not only for redirects.

Note that RFC 2616 incorrectly claims to refer to the definition of
abs_path from RFC 2396.  The fact that it's incorrect is made clear in
2616 itself, in section 3.2.3, when it says that abs_path can be empty.
In any case, RFC 2396 is obsoleted by RFC 3986, which is clear on this
issue, and reflects actual usage of URIs.  URIs like http://python.org
and http://python.org?spam=eggs have been in widespread use for a long
time, and typing the latter URL into firefox (3.0.2) confirms that
what's actually sent is "/?spam", whereas urllib2 still sends "?spam".

No test was added with this fix, which makes it unnecessarily hard to
work out what exactly the fix was supposed to fix.  For the record, this
is the sequence of redirections before the fix was applied (showing base
URI + redirect URI reference --> redirect URI):

'http://www.wikispaces.com' +
'https://session.wikispaces.com/session/auth?authToken=token' -->
'https://session.wikispaces.com/session/auth?authToken=token'
'https://session.wikispaces.com/session/auth?authToken=token' +
'http://www.wikispaces.com?responseToken=token' -->
'http://www.wikispaces.com?responseToken=token'

and after the fix was applied:

'http://www.wikispaces.com' +
'https://session.wikispaces.com/session/auth?authToken=token' -->
'https://session.wikispaces.com/session/auth?authToken=token'
'https://session.wikispaces.com/session/auth?authToken=token' +
'http://www.wikispaces.com/?responseToken=token' -->
'http://www.wikispaces.com/?responseToken=token'
'http://www.wikispaces.com/?responseToken=token' +
'http://www.wikispaces.com/' --> 'http://www.wikispaces.com/'
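As a quick sanity check, Python 3's urllib.parse.urljoin (the urlparse module's successor) reproduces these resolutions: an absolute redirect reference simply replaces the base URI, path-less as it is, so the "/" must be supplied by the client when forming the Request-URI.

```python
from urllib.parse import urljoin

base = "https://session.wikispaces.com/session/auth?authToken=token"
ref = "http://www.wikispaces.com?responseToken=token"

# An absolute reference replaces the base entirely -- note the still-empty path.
print(urljoin(base, ref))
# http://www.wikispaces.com?responseToken=token
```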
msg76779 - (view) Author: John J Lee (jjlee) Date: 2008-12-02 20:47
I've raised #4493 about the issue I raised in my previous comment.
History
Date User Action Args
2009-02-12 19:14:38 ajaksu2 link issue4493 dependencies
2008-12-02 20:47:14 jjlee set messages: + msg76779
2008-12-02 13:00:14 jjlee set nosy: + jjlee; messages: + msg76736
2008-08-27 12:50:15 orsenthil set messages: + msg72024
2008-08-26 17:51:47 facundobatista set messages: + msg71990
2008-08-18 00:51:46 gregory.p.smith set messages: + msg71304
2008-08-18 00:08:53 facundobatista set messages: + msg71303
2008-08-17 22:38:06 gregory.p.smith set messages: + msg71295
2008-08-17 03:39:08 facundobatista set status: open -> closed; resolution: fixed; messages: + msg71261
2008-08-17 02:56:26 orsenthil set files: + issue2463-py3k.diff; messages: + msg71260
2008-08-17 02:42:36 orsenthil set files: + issue2464-PATCH1.diff
2008-08-17 02:42:19 orsenthil set files: + issue2464-py26-FINAL.diff; messages: + msg71259
2008-08-17 01:20:47 gregory.p.smith set messages: + msg71255
2008-08-16 19:30:08 facundobatista set files: + issue2464-facundo.diff; keywords: + patch; messages: + msg71231
2008-07-03 18:16:02 facundobatista set assignee: facundobatista; nosy: + facundobatista
2008-05-16 03:07:50 orsenthil set nosy: + orsenthil; messages: + msg66889
2008-03-29 04:19:08 gregory.p.smith set priority: normal -> low; assignee: gregory.p.smith -> (no value); messages: + msg64680
2008-03-29 03:48:40 gregory.p.smith set messages: + msg64679
2008-03-29 03:41:53 weijie90 set messages: + msg64677
2008-03-29 03:13:38 gregory.p.smith set priority: normal; assignee: gregory.p.smith; messages: + msg64676; nosy: + gregory.p.smith; versions: + Python 2.6, Python 3.0
2008-03-23 14:41:09 weijie90 create