classification
Title: urllib2.urlopen() gets confused with path with // in it
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.0, Python 2.6, Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: facundobatista Nosy List: BitTorment, ambarish, confluence, facundobatista, orsenthil
Priority: normal Keywords: patch

Created on 2008-05-06 21:30 by ambarish, last changed 2008-08-16 14:45 by facundobatista. This issue is now closed.

Files
File name Uploaded Description
double_slash_after_host.patch confluence, 2008-05-10 15:50
doubleslash_test.patch confluence, 2008-06-21 17:01
issue2776-py3k.diff orsenthil, 2008-08-12 15:50
Messages (7)
msg66334 - Author: Ambarish Malpani (ambarish) Date: 2008-05-06 21:30
Try the following code:
import urllib
import urllib2

url = 'http://features.us.reuters.com//autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html'

data = urllib.urlopen(url).read()
data2 = urllib2.urlopen(url).read()

The attempt to get it with urllib works fine. With urllib2, the request
is malformed and I get back an HTTP 404.

Request in the 2nd case is:
GET //autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html HTTP/1.1\r\n
Accept-Encoding: identity\r\n
Host: autos\r\n
Connection: close\r\n
....

The host line seems to be looking for the last // rather than the first.
msg66336 - Author: Ambarish Malpani (ambarish) Date: 2008-05-06 21:33
Sorry, should have added another line:
This is important to fix because I am getting that URL with a //
in a redirect (HTTP 302) response, so I can't just get rid of the //.
msg66343 - Author: Martin McNickle (BitTorment) Date: 2008-05-06 23:22
The problem lines are in AbstractHTTPHandler.do_request_():

    scheme, sel = splittype(request.get_selector())
    sel_host, sel_path = splithost(sel)
    if not request.has_header('Host'):
        request.add_unredirected_header('Host', sel_host or host)

When there is a double '/', sel is something like '//path/to/resource'.
splithost(sel) then gives ('path', '/to/resource'), so the 'Host'
header gets set to 'path'.

I don't understand why sel_host is used in preference to host.  host
holds the correct value, even with the double slashes.  Could someone
explain why sel_host is used at all?
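
To illustrate the behaviour described above, here is a minimal sketch. It does not call urllib2 itself: splithost_sketch and its regular expression are a simplified stand-in written for this example (an assumption about how splithost behaves, based on the output quoted in this message), and the host and selector values are taken from the original report.

```python
import re

# Simplified stand-in for urllib's splithost(); the pattern is an
# assumption made for this sketch, not the library source.
_HOSTPROG = re.compile(r'^//([^/?]*)(.*)$')

def splithost_sketch(url):
    """Split '//host/path' into (host, '/path'); else return (None, url)."""
    match = _HOSTPROG.match(url)
    if match:
        return match.group(1, 2)
    return None, url

# Without a proxy the selector is only the path part of the URL.
# For the URL in this report it happens to start with '//'.
sel = '//autos/news/95ED98EE-A837-11DC-BCB3-4F218271.html'
host = 'features.us.reuters.com'

sel_host, sel_path = splithost_sketch(sel)

# Mirrors the 'sel_host or host' expression quoted above: the first
# path segment is mistaken for the host and wins over the real one.
host_header = sel_host or host
print(host_header)
```

With a selector that starts with a single slash, the stand-in returns None for the host, so the expression falls back to the real host.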
msg66537 - Author: Adrianna Pinska (confluence) Date: 2008-05-10 15:50
Ordinarily, request.get_selector() returns the portion of the url after
the host, and sel_host is None.  However, if a proxy is set on the
request, the request's host is set to the proxy host, get_selector()
returns the original full url, and sel_host is the host from the
original url (and different to the host set on the request).

This bug is only triggered if the double slash comes directly after the
host and there is no proxy set on the request.  do_request_ does not
check what get_selector() is returning, so the output is passed through
splithost even when this is not necessary, and in this particular case
it causes undesirable behaviour.

My patch causes do_request_ only to attempt to extract the host from
get_selector() if the proxy has been set.
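
The logic described above could be sketched roughly as follows. This is not the attached patch: choose_host_header, splittype_sketch and splithost_sketch are hypothetical helpers written for this illustration, the proxy host name is made up, and the fallback to the request host is an assumption of the sketch.

```python
import re

# Simplified stand-ins for urllib's splittype()/splithost(); the
# patterns are assumptions made for this sketch.
_TYPEPROG = re.compile(r'^([^/:]+):')
_HOSTPROG = re.compile(r'^//([^/?]*)(.*)$')

def splittype_sketch(url):
    """Split 'scheme:rest' into (scheme, rest); else return (None, url)."""
    match = _TYPEPROG.match(url)
    if match:
        scheme = match.group(1)
        return scheme.lower(), url[len(scheme) + 1:]
    return None, url

def splithost_sketch(url):
    """Split '//host/path' into (host, '/path'); else return (None, url)."""
    match = _HOSTPROG.match(url)
    if match:
        return match.group(1, 2)
    return None, url

def choose_host_header(host, selector, has_proxy):
    """Pick the Host header value (hypothetical helper).

    Only when a proxy is set does the selector hold the full original
    URL, so only then is the host extracted from it.
    """
    sel_host = host
    if has_proxy:
        _scheme, sel = splittype_sketch(selector)
        sel_host, _sel_path = splithost_sketch(sel)
    # Falling back to the request host is an assumption of this sketch.
    return sel_host or host

# No proxy: the selector is just the path, so it is left alone.
direct = choose_host_header(
    'features.us.reuters.com', '//autos/news/x.html', has_proxy=False)

# Proxy set: the request host is the proxy, the selector is the full URL.
proxied = choose_host_header(
    'proxy.example.com:3128',
    'http://features.us.reuters.com//autos/news/x.html',
    has_proxy=True)
print(direct, proxied)
```

In both cases the sketch yields the host of the original URL, which is the outcome the patch is aiming for.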
msg68516 - Author: Adrianna Pinska (confluence) Date: 2008-06-21 17:01
I have written a test to go with my patch.
msg71056 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-08-12 15:50
I could reproduce this issue on trunk and the py3k branch. The patch attached
by Adrianna Pinska "appropriately" fixes this issue. I agree with the
logic. Attaching the patch for py3k with the same fix.

Thanks,
Senthil
msg71215 - Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-08-16 14:45
Committed in revs 65710 and 65711.

Thank you all!!
History
Date User Action Args
2008-08-16 14:45:57 facundobatista set status: open -> closed; resolution: fixed
2008-08-16 14:45:43 facundobatista set messages: + msg71215
2008-08-12 15:50:37 orsenthil set files: + issue2776-py3k.diff; messages: + msg71056
2008-07-03 17:42:25 facundobatista set assignee: facundobatista; nosy: + facundobatista, orsenthil
2008-06-21 17:01:56 confluence set files: + doubleslash_test.patch; messages: + msg68516
2008-05-10 15:50:01 confluence set files: + double_slash_after_host.patch; keywords: + patch; messages: + msg66537; nosy: + confluence
2008-05-06 23:22:28 BitTorment set components: + Library (Lib), - Extension Modules; versions: + Python 2.6, Python 3.0
2008-05-06 23:22:11 BitTorment set nosy: + BitTorment; messages: + msg66343
2008-05-06 21:33:33 ambarish set messages: + msg66336
2008-05-06 21:30:14 ambarish create