
classification
Title: urllib.open sends full URL after GET command instead of local path
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 2.6
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ajaksu2, ggenellina, olemis, orsenthil, pitrou
Priority: low
Keywords: patch

Created on 2009-01-26 19:22 by olemis, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg80586 - (view) Author: Olemis Lang (olemis) Date: 2009-01-26 19:22
Hello ... 

The first thing I have to say is that I searched the open issues and 
found nothing similar to what I am about to report. If this 
ticket is a duplicate, I apologize ...

Yesterday I was testing how to access the wiki pages in a 
Trac [1]_ site and I realized that something wrong was happening 
(a bug? ...)

Initially the behavior was as follows :

{{{
#!python
>>> u = urllib.urlopen('http://localhost:8000/trac-dev')
>>> u.read()
'Environment not found'
>>> u.close()
}}}

And tracd reported a line like this 

{{{
127.0.0.1 - - [25/Jan/2009 17:32:08] "GET http://localhost:8000/trac-dev HTTP/1.0" 404 -
}}}

Which means that a 'Not found' error code was sent back to urllib 
client.

I tried to access the same page from my browser and tracd reported

{{{
127.0.0.1 - - [25/Jan/2009 18:05:44] "GET /trac-dev HTTP/1.0" 200 -
}}}

The problem is obvious ... urllib was sending the full URL after GET
and it should send only the string after the network location.

I applied the following patch to urllib (yours will be better, I am 
sure about that ;)

{{{
#!diff

--- /usr/lib/python2.5/urllib.py        2008-07-31 13:40:40.000000000 -0500
+++ /media/urllib_unix.py     2009-01-26 09:48:54.000000000 -0500
@@ -270,6 +270,7 @@
     def open_http(self, url, data=None):
         """Use HTTP protocol."""
         import httplib
+        from urlparse import urlparse
         user_passwd = None
         proxy_passwd= None
         if isinstance(url, str):
@@ -312,12 +313,17 @@
         else:
             auth = None
         h = httplib.HTTP(host)
+        target = ''.join(sep + part for sep, part in \
+                                zip(['', ';', '?', '#'], \
+                                    urlparse(selector)[2:]) \
+                                if part)
+        print target
         if data is not None:
-            h.putrequest('POST', selector)
+            h.putrequest('POST', target)
             h.putheader('Content-Type', 'application/x-www-form-urlencoded')
             h.putheader('Content-Length', '%d' % len(data))
         else:
-            h.putrequest('GET', selector)
+            h.putrequest('GET', target)
         if proxy_auth: h.putheader('Proxy-Authorization', 'Basic %s' % proxy_auth)
         if auth: h.putheader('Authorization', 'Basic %s' % auth)
         if realhost: h.putheader('Host', realhost)


}}}

And everything was «back» to normal ...

{{{
#!python
>>> u = urllib.urlopen('http://localhost:8000/trac-dev')
>>> u.read()
    ... # Lots of beautiful HTML code ;)
>>> u.close()
}}}

... tracd outputted ...

{{{
127.0.0.1 - - [25/Jan/2009 18:05:44] "GET /trac-dev HTTP/1.0" 200 -
}}}

The same picture is shown when using both Python 2.5.1 and 2.5.2 ...
I have not installed Python 2.6.x so I am not sure whether this
issue has propagated to newer versions of Python ... and I don't 
know either whether this issue is also present in urllib2 ...

... so further research is needed, but IMO this is a serious bug :(

PS: If this is a bug ... how could it have stayed hidden so far? Is there any 
    test case written to assert this kind of thing? I checked the 
    `test.test_urllib` and `test.test_urllibnet` modules and I saw
    nothing at all ... 

.. [1] Trac
       (http://trac.edgewall.org)
msg80588 - (view) Author: Olemis Lang (olemis) Date: 2009-01-26 19:28
Ooops ... sorry, remove the print statement. The patch is as follows :

{{{
#!diff

--- /usr/lib/python2.5/urllib.py        2008-07-31 13:40:40.000000000 -0500
+++ /media/urllib_unix.py     2009-01-26 09:48:54.000000000 -0500
@@ -270,6 +270,7 @@
     def open_http(self, url, data=None):
         """Use HTTP protocol."""
         import httplib
+        from urlparse import urlparse
         user_passwd = None
         proxy_passwd= None
         if isinstance(url, str):
@@ -312,12 +313,17 @@
         else:
             auth = None
         h = httplib.HTTP(host)
+        target = ''.join(sep + part for sep, part in \
+                                zip(['', ';', '?', '#'], \
+                                    urlparse(selector)[2:]) \
+                                if part)
         if data is not None:
-            h.putrequest('POST', selector)
+            h.putrequest('POST', target)
             h.putheader('Content-Type', 'application/x-www-form-urlencoded')
             h.putheader('Content-Length', '%d' % len(data))
         else:
-            h.putrequest('GET', selector)
+            h.putrequest('GET', target)
         if proxy_auth: h.putheader('Proxy-Authorization', 'Basic %s' % proxy_auth)
         if auth: h.putheader('Authorization', 'Basic %s' % auth)
         if realhost: h.putheader('Host', realhost)


}}}

I apologize once again ...
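
For readers who want to see what the `target` expression in the patch computes, here is a minimal sketch that isolates it. It assumes Python 3, where `urlparse` lives in `urllib.parse` rather than the `urlparse` module used in the patch; the helper name `origin_form` is hypothetical and not part of any library.

```python
from urllib.parse import urlparse

def origin_form(url):
    """Rebuild path;params?query#fragment from a URL (hypothetical helper)."""
    # urlparse() returns (scheme, netloc, path, params, query, fragment);
    # the slice keeps the last four components.
    parts = urlparse(url)[2:]
    # Prefix each non-empty component with its separator, as the patch does.
    return ''.join(sep + part
                   for sep, part in zip(['', ';', '?', '#'], parts)
                   if part)

print(origin_form('http://localhost:8000/trac-dev'))
print(origin_form('http://localhost:8000/wiki/Start?format=txt#top'))
```

The expression simply drops the scheme and network location and keeps everything after them.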
msg80600 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-27 00:09
I could not reproduce this issue with either Python 2.6 or 2.5.2.
If I print host and selector near line 313, I get 'localhost:8000' and 
'/trac-dev', the expected results.
Do you have an HTTP proxy? running at the *same* port? (!)
msg80651 - (view) Author: Olemis Lang (olemis) Date: 2009-01-27 14:02
Actually I am using a proxy hosted on some other machine (i.e. not my 
PC ... sorry, I didn't mention it :S ...) I «debugged» urllib and, when 
branching at this point (see below ;) in URLopener.open_http :

{{{
#!python

# urllib.py

    def open_http(self, url, data=None):
        """Use HTTP protocol."""
        import httplib
        user_passwd = None
        proxy_passwd= None
        if isinstance(url, str):             # Branching here !!!!!!!!!!
            host, selector = splithost(url)
            if host:
                user_passwd, host = splituser(host)
                host = unquote(host)
            realhost = host
        else:
            host, selector = url


}}}

The url variable is bound to the following 2-tuple 

{{{
#!python

('172.18.2.7:3128', 'http://localhost:8000/trac-dev')
}}}

My IP is 172.18.2.99 ... so the `else` branch is the one being executed 

If you need further details ... don't hesitate to ask anything you 
want ;)

PS: What did you mean when you said?

> Do you have an HTTP proxy? running at the *same* port? (!)

I don't understand this since *I already said* that *I accessed* my Trac 
environment using my web browser (Opera 9.63, I don't know whether this 
is relevant at all ... ), *I sent you* the lines output by tracd to 
stdout (or stderr ... I am not very sure right now ... ;) and *I told 
you* that, once I applied the patch *I submitted*, everything was *back 
to normal* ...

I don't understand how all this could be possible if I were running 
tracd and an HTTP proxy on the *same* port, or even if the 
`http_proxy` envvar were set to the hostname + port where my Trac 
instance is listening for incoming connections ... 

Anyway ... CMIIW ...

I also checked that immediately before executing the following 
statements ...

{{{
#!python

# urllib.py

        h = httplib.HTTP(host)
        if data is not None:
            h.putrequest('POST', selector)
            h.putheader('Content-Type', 'application/x-www-form-urlencoded')
            h.putheader('Content-Length', '%d' % len(data))
        else:
            h.putrequest('GET', selector)

}}}

... `selector` is bound to 'http://localhost:8000/trac-dev' ... BTW the 
`else` clause *is the one executed* in this case, and this is 
consistent with tracd reports *I sent before* and is logical since 
`data` arg *is missing* in the code snippet I sent before.
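
The `(host, selector)` pair reported in this message is what one would expect once a proxy is configured for the `http` scheme. The sketch below is an illustration only and not urllib's actual code: `resolve_target` is a hypothetical helper, written for Python 3, and the proxy URL form `http://172.18.2.7:3128` is an assumption based on the address quoted above.

```python
from urllib.parse import urlsplit

def resolve_target(url, proxies):
    """Return the (host, selector) pair for `url`, honouring `proxies`."""
    parts = urlsplit(url)
    proxy = proxies.get(parts.scheme)
    if proxy:
        # Connect to the proxy and hand it the full URL as the selector.
        return urlsplit(proxy).netloc, url
    # No proxy: connect to the origin server and send only the local part.
    selector = parts.path or '/'
    if parts.query:
        selector += '?' + parts.query
    return parts.netloc, selector

proxies = {'http': 'http://172.18.2.7:3128'}  # assumed proxy setting
print(resolve_target('http://localhost:8000/trac-dev', proxies))
print(resolve_target('http://localhost:8000/trac-dev', {}))
```

With the proxy mapping the result matches the tuple shown in the message; with an empty mapping it matches the `'localhost:8000'` and `'/trac-dev'` values reported in msg80600.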
msg80653 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-01-27 14:37
I suppose 172.18.2.7:3128 is the address:port of your proxy, right?
In which case, urllib seems to do the right thing. When talking to an
HTTP proxy, requests are of the form "GET http://site.com/path", rather
than "GET /path". It's up to the proxy to strip the host part of the URL
when forwarding the request to the target server.

(but I suppose tracd could also be more permissive and allow the "GET
http://site.com/path" variant. It seems Apache does)
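
The two request forms described in this message can be sketched as follows. This is a minimal illustration in Python 3, not code from urllib; `request_line` is a hypothetical helper, and the `HTTP/1.0` default is taken from the tracd log lines earlier in the thread.

```python
from urllib.parse import urlsplit

def request_line(url, via_proxy, method='GET', version='HTTP/1.0'):
    """Build the first line of an HTTP request (hypothetical helper)."""
    if via_proxy:
        # A proxy receives the full URL so it knows where to forward.
        target = url
    else:
        # An origin server receives only the path and query.
        parts = urlsplit(url)
        target = parts.path or '/'
        if parts.query:
            target += '?' + parts.query
    return '%s %s %s' % (method, target, version)

print(request_line('http://localhost:8000/trac-dev', via_proxy=True))
print(request_line('http://localhost:8000/trac-dev', via_proxy=False))
```

The first call yields the request tracd logged with a 404, the second the one it logged with a 200.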
msg80654 - (view) Author: Olemis Lang (olemis) Date: 2009-01-27 15:11
> Quoting Antoine Pitrou ...

> I suppose 172.18.2.7:3128 is the address:port of your proxy, right?

Yes ...

> In which case, urllib seems to do the right thing. When talking to an
> HTTP proxy, requests are of the form "GET http://site.com/path", rather
> than "GET /path". It's up to the proxy to strip the host part of the URL
> when forwarding the request to the target server.

This being said ... 

> (but I suppose tracd could also be more permissive and allow the "GET
> http://site.com/path" variant. It seems Apache does)

... It works with Apache (I am talking about trac once again ...) 
therefore I will report this issue to Trac devs instead ...

Thanks a lot! Sorry if I caused you any trouble ...
msg80683 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-28 00:38
> > Do you have an HTTP proxy? running at the *same* port? (!)
>
> I don't understand this since *I already said* that *I accessed* my Trac
> environment using my web browser (Opera 9.63, I don't know whether this
> is relevant at all ... ), *I sent you* the lines output by tracd to
> stdout (or stderr ... I am not very sure right now ... ;) and *I told
> you* that, once I applied the patch *I submitted*, everything was *back
> to normal* ...

If you had configured a proxy at localhost:8000, and *also* a Trac instance at that port, and Trac had "won the race" for the port, then you would observe exactly the symptoms you describe. That is, urllib talking to port 8000 as if it were a proxy, and the Trac instance actually there getting confused.

Your patch, as you surely understand now, is not correct; in fact, the code is OK as it is. urllib builds the request in that specific way *because* it thinks there is a proxy. If the proxy is buggy, misconfigured, or nonexistent, it's not the library's fault :)

-- 
Gabriel Genellina
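
If the intent is to reach a local server directly without patching the library, the configured proxy can be bypassed for that request. The sketch below assumes Python 3's `urllib.request`; the helper name `make_direct_opener` is hypothetical. On the Python 2 releases discussed in this thread, the documented equivalent is passing `proxies={}` to `urllib.urlopen`.

```python
import urllib.request

def make_direct_opener():
    """Build an opener that never routes requests through a proxy."""
    # An empty mapping disables proxy autodetection, including any
    # http_proxy setting picked up from the environment.
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))

opener = make_direct_opener()

# Usage, assuming a server is listening at the URL from the report:
# with opener.open('http://localhost:8000/trac-dev') as response:
#     body = response.read()
```

Requests made through this opener go straight to the target host, so the server sees a request of the form `GET /trac-dev`.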
msg81798 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-12 18:37
Anyone against closing this as "works for me"?
msg82402 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-02-18 01:57
Yup, this should be closed too. Thanks.
History
Date                 User        Action  Args
2022-04-11 14:56:44  admin       set     github: 49322
2009-02-18 14:38:21  ajaksu2     set     status: pending -> closed
2009-02-18 01:57:32  orsenthil   set     messages: + msg82402
2009-02-18 01:52:33  ajaksu2     set     status: open -> pending; priority: low
2009-02-12 18:37:06  ajaksu2     set     keywords: + patch; nosy: + ajaksu2, orsenthil; stage: test needed; messages: + msg81798; versions: + Python 2.6, - Python 2.5
2009-01-28 00:38:41  ggenellina  set     messages: + msg80683
2009-01-27 15:11:52  olemis      set     messages: + msg80654
2009-01-27 14:37:56  pitrou      set     nosy: + pitrou; messages: + msg80653
2009-01-27 14:02:43  olemis      set     messages: + msg80651
2009-01-27 00:09:17  ggenellina  set     nosy: + ggenellina; messages: + msg80600
2009-01-26 19:28:43  olemis      set     messages: + msg80588
2009-01-26 19:22:53  olemis      create