Classification
Title: urllib2 can't handle http://www.wikispaces.com
Type: behavior    Stage:
Components: Library (Lib)    Versions: Python 3.0, Python 2.6, Python 2.5

Process
Status: closed    Resolution: fixed
Dependencies:    Superseder:
Assigned To: facundobatista    Nosy List: facundobatista, gregory.p.smith, jjlee, orsenthil, weijie90
Priority: low    Keywords: patch

Created on 2008-03-23 14:41 by weijie90, last changed 2008-12-02 20:47 by jjlee. This issue is now closed.

Files
File name Uploaded Description Edit
issue2464-facundo.diff facundobatista, 2008-08-16 19:30
issue2464-py26-FINAL.diff orsenthil, 2008-08-17 02:42
issue2464-PATCH1.diff orsenthil, 2008-08-17 02:42
issue2463-py3k.diff orsenthil, 2008-08-17 02:56
Messages (18)
msg64363 - (view) Author: Koh Wei Jie (weijie90) Date: 2008-03-23 14:41
Try the following code:

import urllib2
gmail = urllib2.urlopen("https://www.gmail.com").read()
wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()

Getting the html over HTTPS from gmail.com works, but not over HTTP from
wikispaces. Here's the traceback:
>>> wikispaces = urllib2.urlopen("http://www.wikispaces.com").read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/urllib2.py", line 121, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 412, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
    return self.parent.open(new)
  File "/usr/lib/python2.5/urllib2.py", line 380, in open
    response = meth(req, response)
  File "/usr/lib/python2.5/urllib2.py", line 491, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.5/urllib2.py", line 412, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 575, in http_error_302
    return self.parent.open(new)
  File "/usr/lib/python2.5/urllib2.py", line 374, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 392, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 353, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1100, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1075, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error (104, 'Connection reset by peer')>

Note the two 302 redirects.

I tried accessing wikispaces.com with SSL turned off in Firefox
2.0.0.12; that didn't work because SSL was required, perhaps somewhere
in between the redirects that wikispaces uses.

Why doesn't urllib2 handle the "hidden" SSL properly? (Not to be rude,
but httplib2 works.)

Thanks!
WJ
msg64676 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-29 03:13
The problem does not appear to have anything to do with SSL.  The
problem is that the chain of HTTP requests goes:

GET -> 302 -> 302 -> 301

On the final 301, urllib2's internal state is messed up such that by the
time it's in the http_error_302 method it believes the Location header
contains the following:

'http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522'

The \x00 and everything after it should not be there and is not there if
you look at what is sent over the socket itself.  The ?responseToken=xxx
value is leftover from the previous 302 response.  No idea where the
\x00 came from yet.  I'm debugging...
msg64677 - (view) Author: Koh Wei Jie (weijie90) Date: 2008-03-29 03:41
Please take your time, because this bug isn't critical. Thanks!
msg64679 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-29 03:48
Instrumenting the code and looking closer at the tcpdump, it's true:
wikispaces.com is returning an invalid Location: header with a null byte
in the middle of it.

The "fix" on our end should be to handle such garbage from such broken
web servers more gracefully.  Other clients seem to treat the null as an
end of string or end of that header.
msg64680 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-03-29 04:19
I'm not sure what the best solution for this is.  If I truncate the
header values at a \x00 character, it ends in an infinite redirect loop
(which urllib2 detects and raises on).  If I simply remove all \x00
characters, the resulting URL is not accepted by wikispaces.com due to
having an extra / in it.

Verdict: wikispaces.com is broken.

urllib2 could do better.  wget and Firefox deal with it properly, but
I'll leave deciding which patch to use up to someone who cares about
handling broken sites.

A patch implementing either behavior for dealing with nulls where they
shouldn't be:

Index: Lib/httplib.py
===================================================================
--- Lib/httplib.py      (revision 62033)
+++ Lib/httplib.py      (working copy)
@@ -291,9 +291,18 @@
                 break
             headerseen = self.isheader(line)
             if headerseen:
+                # Some bad web servers reply with headers with a \x00 null
+                # embedded in the value.  Other http clients deal with
+                # this by treating it as a value terminator, ignoring the
+                # rest so we will too.  http://bugs.python.org/issue2464.
+                if '\x00' in line:
+                    line = line[:line.find('\x00')]
+                    # if you want to just remove nulls instead use this:
+                    #line = line.replace('\x00', '')
                 # It's a legal header line, save it.
                 hlist.append(line)
-                self.addheader(headerseen, line[len(headerseen)+1:].strip())
+                value = line[len(headerseen)+1:].strip()
+                self.addheader(headerseen, value)
                 continue
             else:
                 # It's not a header line; throw it back and stop here.
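In modern Python 3 terms, the two candidate behaviors in the patch can be sketched as follows (`sanitize_header_value` is a hypothetical helper for illustration, not part of httplib):

```python
# Hypothetical helper illustrating the two strategies from the patch above:
# truncating a header value at the first NUL byte, or stripping all NULs.

def sanitize_header_value(value, strategy="truncate"):
    """Clean a header value containing embedded NUL bytes."""
    if "\x00" not in value:
        return value
    if strategy == "truncate":
        # Treat the NUL as a value terminator, as other clients do.
        return value[:value.find("\x00")]
    # Alternative: drop the NULs but keep the rest of the value.
    return value.replace("\x00", "")

location = "http://www.wikispaces.com/\x00/?responseToken=481aec3249f429316459e01c00b7e522"
print(sanitize_header_value(location))
# http://www.wikispaces.com/
print(sanitize_header_value(location, strategy="strip"))
# http://www.wikispaces.com//?responseToken=481aec3249f429316459e01c00b7e522
```

Note the extra / left behind by the strip strategy, which is exactly why wikispaces.com rejected that variant.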
msg66889 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-05-16 03:07
The issue is not just with the null character. If you observe now, the
redirection is 302-302-200 and there is no null character.
However, urllib2 is still unable to handle the multiple redirection
properly (IIRC, there is a portion of code to handle multiple
redirections and exit on an infinite loop):
>>> url = "http://www.wikispaces.com"
>>> opened = urllib.urlopen(url)
>>> print opened.geturl()
http://www.wikispaces.com?responseToken=344289da354a29c67d48928dbe72042a
>>> print opened.read()
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>nginx/0.6.30</center>
</body>
</html>

Needs a relook, IMO.
msg71231 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-16 19:30
Senthil:

Look at that URL that the server returned in the second redirect:

http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d

Note that the "?" appears directly after the host, with no path in between.

Check item 3.2.2 of RFC 2616; it says that an HTTP URL should be:

  http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]

So we should fix the URL that the server returned. Guess what: if we
put in a "/" (as the RFC mandates), everything works fine.

The patch I attach here does that. All tests pass ok.

What do you think?
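In Python 3 terms (urllib.parse replaced the old urlparse module), the fix can be sketched like this; `ensure_abs_path` is a hypothetical name used for illustration, while the committed fix lives inside urllib2's redirect handling:

```python
# Sketch of the fix: if a URL has a netloc but an empty path, insert the
# "/" that RFC 2616 section 3.2.2 requires before the query string.
from urllib.parse import urlparse, urlunparse

def ensure_abs_path(url):
    parts = urlparse(url)
    if parts.netloc and not parts.path:
        # ParseResult is a namedtuple, so _replace builds a fixed copy.
        parts = parts._replace(path="/")
    return urlunparse(parts)

broken = "http://www.wikispaces.com?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d"
print(ensure_abs_path(broken))
# http://www.wikispaces.com/?responseToken=ee3fca88a9b0dc865152d8a9e5b6138d
```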
msg71255 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-17 01:20
looks good to me.
msg71259 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-17 02:42
Ah, so that was a simple fix. :) I very much overlooked the problem,
even after being given all the hints at the web-sig.

I have some comments on the patch, Facundo.
1) I don't think it is a good idea to include that portion in the
http_error_302 method. That makes the fix "very" specific to "this"
issue only.
Another point is that fixing broken URLs should not be done in urllib2;
urlparse would be a better place.
So I came up with an approach wherein urllib2 does unparse(parse) of
the url and the parse methods fix the url if it is broken. (See
attached issue2464-PATCH1.diff)

But if we handle it in the urlparse methods, then we are much more
susceptible to breaking RFC conformance and breaking a lot of tests,
which is not a good idea.

So I introduced a fix_broken() method in urlparse and called it to
solve the issue, using the same logic as yours
(issue2464-py26-FINAL.diff). With a fix_broken() method in urlparse, we
will have better control whenever we want to implement a behavior that
is RFC non-conforming but implemented widely by browsers and clients.

All tests pass with issue2464-py26-FINAL.diff

Comments,please?
msg71260 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-17 02:56
Patch for py3k, but please test this before applying.
msg71261 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-17 03:39
Senthil: I don't like that.

Creating a public method called "fix_broken", introducing new behaviours
now in beta, and actually not fixing the url for every possible kind of
breakage (just adding the path if it's not there) is way too much for
this fix.

I committed the change I proposed. Maybe in the future we will have a
"generic url fixing" function; now is not the moment.
msg71295 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-17 22:38
I was pondering whether it should go in urlparse instead.  If it did, I
think it should be part of urlparse.urlunparse, to ensure that there is
always a trailing slash after the host:port regardless of what the
inputs are.

Anyway, agreed, this fixes this specific bug.  Should it be backported
to release25-maint?
msg71303 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-18 00:08
Maybe we can put it in urlunparse... do you all agree with these test cases?

def test_alwayspath(self):
    u = urlparse.urlparse("http://netloc/path;params?query#fragment")
    self.assertEqual(urlparse.urlunparse(u),
                     "http://netloc/path;params?query#fragment")

    u = urlparse.urlparse("http://netloc?query#fragment")
    self.assertEqual(urlparse.urlunparse(u),
                     "http://netloc/?query#fragment")

    u = urlparse.urlparse("http://netloc#fragment")
    self.assertEqual(urlparse.urlunparse(u), "http://netloc/#fragment")



Maybe we could backport this more general fix...
msg71304 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2008-08-18 00:51
That test case looks good to me for 2.6 and 3.0.  Also add a note to the
documentation with a versionchanged 2.6 about urlunparse always ensuring
there is a / between the netloc and the rest of the url.

I would not back port the more general urlunparse behavior change to 2.5.
msg71990 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-26 17:51
Gregory... I tried to make urlunparse fill in the path, and other
functions that use it started to fail.

As we're so close to the final releases, I'll leave this as it is right
now, which actually fixed the bug...
msg72024 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-27 12:50
That was the reason for making fix_broken in urlparse in my patch, Facundo.
I had thought it should be handled in the urlparse module, and if we
make changes in urlparse.urlunparse/urlparse.urlparse, then we are
stepping into an area that will break a lot of tests.

I am kind of +0 with the current fix in urllib2. Should we think/plan
for something in urlparse, akin to fix_broken?
msg76736 - (view) Author: John J Lee (jjlee) Date: 2008-12-02 13:00
This fix was applied in the wrong place.

URI path components, and HTTP URI path components in particular, *can*
be empty.  See RFC 3986.  So the comment in the code that was inserted
with the fix for this bug that says "possibly malformed" is incorrect,
and should instead just refer to section 3.2.2 of RFC 2616.  Also,
because 3.2.2 says "If the abs_path is not present in the URL, it MUST
be given as "/" when used as a Request-URI for a resource (section
5.1.2)", it seems clear that this transformation (add the slash if
there's no path component) should always happen when retrieving URIs,
regardless of where the URI came from -- not only for redirects.

Note that RFC 2616 incorrectly claims to refer to the definition of
abs_path from RFC 2396.  The fact that it's incorrect is made clear in
2616 itself, in section 3.2.3, when it says that abs_path can be empty.
In any case, RFC 2396 is obsoleted by RFC 3986, which is clear on this
issue, and reflects actual usage of URIs.  URIs like http://python.org
and http://python.org?spam=eggs have been in widespread use for a long
time, and typing the latter URL into firefox (3.0.2) confirms that
what's actually sent is "/?spam", whereas urllib2 still sends "?spam".

No test was added with this fix, which makes it unnecessarily hard to
work out what exactly the fix was supposed to fix.  For the record, this
is the sequence of redirections before the fix was applied (showing base
URI + redirect URI reference --> redirect URI):

'http://www.wikispaces.com' +
'https://session.wikispaces.com/session/auth?authToken=token' -->
'https://session.wikispaces.com/session/auth?authToken=token'
'https://session.wikispaces.com/session/auth?authToken=token' +
'http://www.wikispaces.com?responseToken=token' -->
'http://www.wikispaces.com?responseToken=token'

and after the fix was applied:

'http://www.wikispaces.com' +
'https://session.wikispaces.com/session/auth?authToken=token' -->
'https://session.wikispaces.com/session/auth?authToken=token'
'https://session.wikispaces.com/session/auth?authToken=token' +
'http://www.wikispaces.com/?responseToken=token' -->
'http://www.wikispaces.com/?responseToken=token'
'http://www.wikispaces.com/?responseToken=token' +
'http://www.wikispaces.com/' --> 'http://www.wikispaces.com/'
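As a quick sanity check, Python 3's urllib.parse.urljoin (the urlparse module's successor) reproduces these resolutions: an absolute redirect reference simply replaces the base URI, path-less as it is, so the "/" must be supplied by the client when forming the Request-URI.

```python
from urllib.parse import urljoin

base = "https://session.wikispaces.com/session/auth?authToken=token"
ref = "http://www.wikispaces.com?responseToken=token"

# An absolute reference replaces the base entirely -- note the still-empty path.
print(urljoin(base, ref))
# http://www.wikispaces.com?responseToken=token
```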
msg76779 - (view) Author: John J Lee (jjlee) Date: 2008-12-02 20:47
I've raised #4493 about the issue I raised in my previous comment.
History
Date User Action Args
2009-02-12 19:14:38 ajaksu2 link issue4493 dependencies
2008-12-02 20:47:14 jjlee set messages: + msg76779
2008-12-02 13:00:14 jjlee set nosy: + jjlee; messages: + msg76736
2008-08-27 12:50:15 orsenthil set messages: + msg72024
2008-08-26 17:51:47 facundobatista set messages: + msg71990
2008-08-18 00:51:46 gregory.p.smith set messages: + msg71304
2008-08-18 00:08:53 facundobatista set messages: + msg71303
2008-08-17 22:38:06 gregory.p.smith set messages: + msg71295
2008-08-17 03:39:08 facundobatista set status: open -> closed; resolution: fixed; messages: + msg71261
2008-08-17 02:56:26 orsenthil set files: + issue2463-py3k.diff; messages: + msg71260
2008-08-17 02:42:36 orsenthil set files: + issue2464-PATCH1.diff
2008-08-17 02:42:19 orsenthil set files: + issue2464-py26-FINAL.diff; messages: + msg71259
2008-08-17 01:20:47 gregory.p.smith set messages: + msg71255
2008-08-16 19:30:08 facundobatista set files: + issue2464-facundo.diff; keywords: + patch; messages: + msg71231
2008-07-03 18:16:02 facundobatista set assignee: facundobatista; nosy: + facundobatista
2008-05-16 03:07:50 orsenthil set nosy: + orsenthil; messages: + msg66889
2008-03-29 04:19:08 gregory.p.smith set priority: normal -> low; assignee: gregory.p.smith -> (no value); messages: + msg64680
2008-03-29 03:48:40 gregory.p.smith set messages: + msg64679
2008-03-29 03:41:53 weijie90 set messages: + msg64677
2008-03-29 03:13:38 gregory.p.smith set priority: normal; assignee: gregory.p.smith; messages: + msg64676; nosy: + gregory.p.smith; versions: + Python 2.6, Python 3.0
2008-03-23 14:41:09 weijie90 create