Issue2776
Created on 2008-05-06 21:30 by ambarish, last changed 2008-05-10 15:50 by confluence.
| msg66334 |
Author: Ambarish Malpani (ambarish) |
Date: 2008-05-06 21:30 |
|
Try the following code:
import urllib
import urllib2
url = 'http://features.us.reuters.com//autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html'
data = urllib.urlopen(url).read()
data2 = urllib2.urlopen(url).read()
Fetching the URL with urllib works fine. With urllib2, the request
is malformed and I get back an HTTP 404.
The request in the second case is:
GET //autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html HTTP/1.1\r\n
Accept-Encoding: identity\r\n
Host: autos\r\n
Connection: close\r\n
....
The Host header seems to be taken from the last // rather than the first.
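For comparison, here is a minimal sketch of the split one would expect for this URL. It uses Python 3's urllib.parse.urlsplit, which is an assumption made for illustration; the report itself is about Python 2's urllib and urllib2. The host should come from the first // and the path should keep its leading double slash:

```python
from urllib.parse import urlsplit

url = 'http://features.us.reuters.com//autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html'
parts = urlsplit(url)

expected_host = parts.netloc  # host taken from the first '//'
expected_path = parts.path    # path keeps its leading '//'

print(expected_host)
print(expected_path)
```

A correct request would therefore send the value of expected_host in the Host header and expected_path in the request line.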
|
| msg66336 |
Author: Ambarish Malpani (ambarish) |
Date: 2008-05-06 21:33 |
|
Sorry, I should have added another line:
This is important to fix because I am receiving that URL, with the //,
in a Moved (HTTP 302) response, so I can't just get rid of the //.
|
| msg66343 |
Author: Martin McNickle (BitTorment) |
Date: 2008-05-06 23:22 |
|
The problem lines are in AbstractHTTPHandler.do_request_():
scheme, sel = splittype(request.get_selector())
sel_host, sel_path = splithost(sel)
if not request.has_header('Host'):
    request.add_unredirected_header('Host', sel_host or host)
When there is a double '/', sel is something like '//path/to/resource'.
splithost(sel) then gives ('path', '/to/resource'). Therefore the
'Host' header gets set to 'path'.
I don't understand why sel_host is used in preference to host. host
holds the correct value, even with the double slashes. Could someone
explain why sel_host is used at all?
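The effect described above can be reproduced with a small stand-in for splithost(). This is only a sketch: split_host_sketch and its regular expression are hypothetical and written for illustration, modelled on the behaviour described in this message rather than copied from the library.

```python
import re

# Illustrative stand-in for splithost(): treat a leading '//' as the start
# of a host, which ends at the next '/' or '?'.
_HOST_RE = re.compile(r'^//([^/?]*)(.*)$', re.DOTALL)

def split_host_sketch(selector):
    """Return (host, rest), or (None, selector) when there is no leading '//'."""
    match = _HOST_RE.match(selector)
    if match:
        return match.group(1), match.group(2)
    return None, selector

# A normal selector has a single leading slash, so no host is found.
normal = split_host_sketch('/autos/news/page.html')

# A selector with a leading double slash is misread as '//host/path'.
doubled = split_host_sketch('//autos/news/page.html')

print(normal)
print(doubled)
```

In the second case the first path segment is returned as the host, which matches the "Host: autos" line in the malformed request.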
|
| msg66537 |
Author: Adrianna Pinska (confluence) |
Date: 2008-05-10 15:50 |
|
Ordinarily, request.get_selector() returns the portion of the url after
the host, and sel_host is None. However, if a proxy is set on the
request, the request's host is set to the proxy host, get_selector()
returns the original full url, and sel_host is the host from the
original url (and different to the host set on the request).
This bug is only triggered if the double slash comes directly after the
host and there is no proxy set on the request. do_request_ does not
check what get_selector() is returning, so the output is passed through
splithost even when this is not necessary, and in this particular case
it causes undesirable behaviour.
My patch causes do_request_ only to attempt to extract the host from
get_selector() if the proxy has been set.
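The idea can be sketched roughly as follows. This is not the attached patch: host_header_sketch, its parameters and the proxy host name are hypothetical, and the code only illustrates the rule "consult the selector for a host only when a proxy is set".

```python
import re

# Illustrative pattern for '//host' at the start of a string (an assumption,
# not the library's own code).
_HOST_RE = re.compile(r'^//([^/?]*)(.*)$', re.DOTALL)

def host_header_sketch(request_host, selector, proxied):
    """Choose the Host header value for a request.

    request_host: the host set on the request (the proxy host when proxied)
    selector:     a path normally, or the full original URL when proxied
    proxied:      whether a proxy has been set on the request
    """
    if proxied:
        # The selector is a full URL: drop the scheme, then read the host.
        _, _, rest = selector.partition(':')
        match = _HOST_RE.match(rest)
        if match and match.group(1):
            return match.group(1)
    # Without a proxy the selector is only a path, so it is never parsed.
    return request_host

direct = host_header_sketch(
    'features.us.reuters.com', '//autos/news/page.html', proxied=False)
via_proxy = host_header_sketch(
    'proxy.example:3128',
    'http://features.us.reuters.com//autos/news/page.html', proxied=True)

print(direct)
print(via_proxy)
```

Both calls yield the original site's host, so a double slash directly after the host no longer affects the Host header in this sketch.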
|
|
| Date | User | Action | Args |
| 2008-05-10 15:50:01 | confluence | set | files: + double_slash_after_host.patch; keywords: + patch; messages: + msg66537; nosy: + confluence |
| 2008-05-06 23:22:28 | BitTorment | set | components: + Library (Lib), - Extension Modules; versions: + Python 2.6, Python 3.0 |
| 2008-05-06 23:22:11 | BitTorment | set | nosy: + BitTorment; messages: + msg66343 |
| 2008-05-06 21:33:33 | ambarish | set | messages: + msg66336 |
| 2008-05-06 21:30:14 | ambarish | create | |
|