urllib.open sends full URL after GET command instead of local path #49322

Closed
olemis mannequin opened this issue Jan 26, 2009 · 9 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@olemis
Mannequin

olemis mannequin commented Jan 26, 2009

BPO 5072
Nosy @orsenthil, @pitrou, @devdanzin

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2009-02-18.14:38:21.246>
created_at = <Date 2009-01-26.19:22:53.170>
labels = ['type-bug', 'library']
title = 'urllib.open sends full URL after GET command instead of local path'
updated_at = <Date 2009-02-18.14:38:21.245>
user = 'https://bugs.python.org/olemis'

bugs.python.org fields:

activity = <Date 2009-02-18.14:38:21.245>
actor = 'ajaksu2'
assignee = 'none'
closed = True
closed_date = <Date 2009-02-18.14:38:21.246>
closer = 'ajaksu2'
components = ['Library (Lib)']
creation = <Date 2009-01-26.19:22:53.170>
creator = 'olemis'
dependencies = []
files = []
hgrepos = []
issue_num = 5072
keywords = ['patch']
message_count = 9.0
messages = ['80586', '80588', '80600', '80651', '80653', '80654', '80683', '81798', '82402']
nosy_count = 5.0
nosy_names = ['ggenellina', 'orsenthil', 'pitrou', 'ajaksu2', 'olemis']
pr_nums = []
priority = 'low'
resolution = None
stage = 'test needed'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue5072'
versions = ['Python 2.6']

@olemis
Mannequin Author

olemis mannequin commented Jan 26, 2009

Hello ...

The first thing I have to say is that I searched the open issues and
found nothing similar to what I am about to report. If this
ticket is a duplicate, I apologize ...

Yesterday I was testing how to access the wiki pages in a
Trac [1]_ site and I realized that something wrong was happening
(a bug? ...)

Initially the behavior was as follows :

{{{
#!python
>>> u = urllib.urlopen('http://localhost:8000/trac-dev')
>>> u.read()
'Environment not found'
>>> u.close()
}}}

And tracd reported a line like this

{{{
127.0.0.1 - - [25/Jan/2009 17:32:08] "GET http://localhost:8000/trac-dev HTTP/1.0" 404 -
}}}

Which means that a 'Not found' error code was sent back to urllib
client.

I tried to access the same page from my browser and tracd reported

{{{
127.0.0.1 - - [25/Jan/2009 18:05:44] "GET /trac-dev HTTP/1.0" 200 -
}}}

The problem is obvious ... urllib was sending the full URL after GET
and it should send only the string after the network location.

I applied the following patch to urllib (yours will be better, I am
sure about that ;)

{{{
#!diff

--- /usr/lib/python2.5/urllib.py        2008-07-31 13:40:40.000000000 -0500
+++ /media/urllib_unix.py     2009-01-26 09:48:54.000000000 -0500
@@ -270,6 +270,7 @@
     def open_http(self, url, data=None):
         """Use HTTP protocol."""
         import httplib
+        from urlparse import urlparse
         user_passwd = None
         proxy_passwd= None
         if isinstance(url, str):
@@ -312,12 +313,17 @@
         else:
             auth = None
         h = httplib.HTTP(host)
+        target = ''.join(sep + part for sep, part in \
+                                zip(['', ';', '?', '#'], \
+                                    urlparse(selector)[2:]) \
+                                if part)
+        print target
         if data is not None:
-            h.putrequest('POST', selector)
+            h.putrequest('POST', target)
             h.putheader('Content-Type', 'application/x-www-form-urlencoded')
             h.putheader('Content-Length', '%d' % len(data))
         else:
-            h.putrequest('GET', selector)
+            h.putrequest('GET', target)
         if proxy_auth: h.putheader('Proxy-Authorization', 'Basic %s' % proxy_auth)
         if auth: h.putheader('Authorization', 'Basic %s' % auth)
         if realhost: h.putheader('Host', realhost)

}}}

And everything was «back» to normal ...

{{{
#!python
>>> u = urllib.urlopen('http://localhost:8000/trac-dev')
>>> u.read()
    ... # Lots of beautiful HTML code ;)
>>> u.close()
}}}

... tracd outputted ...

{{{
127.0.0.1 - - [25/Jan/2009 18:05:44] "GET /trac-dev HTTP/1.0" 200 -
}}}

The same picture is shown when using both Python 2.5.1 and 2.5.2 ...
I have not installed Python 2.6.x so I am not sure about whether this
issue has propagated onto newer versions of Python ... and I don't
know either whether this issue is also present in urllib2 ...

... so further research is needed, but IMO this is a serious bug :(

PS: If this is a bug ... how could it have stayed hidden so far ? Is there any
test case written to assert this kind of thing ? I checked the
test.test_urllib and test.test_urllibnet modules and I saw
nothing at all ...

.. [1] Trac
(http://trac.edgewall.org)

@olemis olemis mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 26, 2009
@olemis
Mannequin Author

olemis mannequin commented Jan 26, 2009

Ooops ... sorry, remove the print statement. The patch is as follows :

{{{
#!diff

--- /usr/lib/python2.5/urllib.py        2008-07-31 13:40:40.000000000 -0500
+++ /media/urllib_unix.py     2009-01-26 09:48:54.000000000 -0500
@@ -270,6 +270,7 @@
     def open_http(self, url, data=None):
         """Use HTTP protocol."""
         import httplib
+        from urlparse import urlparse
         user_passwd = None
         proxy_passwd= None
         if isinstance(url, str):
@@ -312,12 +313,17 @@
         else:
             auth = None
         h = httplib.HTTP(host)
+        target = ''.join(sep + part for sep, part in \
+                                zip(['', ';', '?', '#'], \
+                                    urlparse(selector)[2:]) \
+                                if part)
         if data is not None:
-            h.putrequest('POST', selector)
+            h.putrequest('POST', target)
             h.putheader('Content-Type', 'application/x-www-form-urlencoded')
             h.putheader('Content-Length', '%d' % len(data))
         else:
-            h.putrequest('GET', selector)
+            h.putrequest('GET', target)
         if proxy_auth: h.putheader('Proxy-Authorization', 'Basic %s' % proxy_auth)
         if auth: h.putheader('Authorization', 'Basic %s' % auth)
         if realhost: h.putheader('Host', realhost)

}}}

I apologize once again ...
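
For readers following the patch: the `target` expression keeps only the path, params, query and fragment that `urlparse` returns, and drops the scheme and host. Below is a minimal Python 3 sketch of that same expression. The thread's code is Python 2, where the import is `from urlparse import urlparse`; the function name `strip_to_path` is made up for illustration and is not part of urllib.

```python
from urllib.parse import urlparse


def strip_to_path(selector):
    """Rebuild a selector from everything after the network location.

    urlparse() returns (scheme, netloc, path, params, query, fragment);
    the slice [2:] keeps the last four, and each non-empty part is
    re-joined with the separator that introduces it.
    """
    parts = urlparse(selector)[2:]
    separators = ['', ';', '?', '#']
    return ''.join(sep + part for sep, part in zip(separators, parts) if part)


print(strip_to_path('http://localhost:8000/trac-dev'))  # /trac-dev
print(strip_to_path('/trac-dev'))                       # /trac-dev
```

Note that the patch applies this rewrite unconditionally, so a request that is meant to go through a proxy would also lose its scheme and host.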

@ggenellina
Mannequin

ggenellina mannequin commented Jan 27, 2009

I could not reproduce this issue with either Python 2.6 or 2.5.2.
If I print host and selector near line 313, I get 'localhost:8000' and
'/trac-dev', the expected results.
Do you have an HTTP proxy? running at the *same* port? (!)

@olemis
Mannequin Author

olemis mannequin commented Jan 27, 2009

Actually I am using a proxy hosted on another machine (i.e. not my
PC ... sorry, I didn't mention it :S ...). I «debugged» urllib and, when
branching at this point (see below ;) in URLopener.open_http :

{{{
#!python

# urllib.py

    def open_http(self, url, data=None):
        """Use HTTP protocol."""
        import httplib
        user_passwd = None
        proxy_passwd= None
        if isinstance(url, str):             # Branching here !!!!!!!!!!
            host, selector = splithost(url)
            if host:
                user_passwd, host = splituser(host)
                host = unquote(host)
            realhost = host
        else:
            host, selector = url

}}}

the url variable is bound to the following two-element tuple

{{{
#!python

('172.18.2.7:3128', 'http://localhost:8000/trac-dev')
}}}

My IP is 172.18.2.99 ... so the else branch is the one being executed
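
The two branches can be sketched as follows. This is a simplified Python 3 illustration, not the actual urllib code: `host_and_selector` is a hypothetical name, and `urlsplit` stands in for the `splithost`/`splituser` helpers in the excerpt above. The point is only that a string URL yields the origin host and a local path, while a `(proxy, url)` tuple yields the proxy address and leaves the full URL as the selector.

```python
from urllib.parse import urlsplit


def host_and_selector(url):
    """Return (host to connect to, selector to send after the method)."""
    if isinstance(url, str):
        # Direct connection: talk to the origin server, send the local path.
        parts = urlsplit(url)
        selector = parts.path or '/'
        if parts.query:
            selector += '?' + parts.query
        return parts.netloc, selector
    # Proxy case: url is already a (proxy_host, full_url) pair, so the
    # connection goes to the proxy and the selector stays a full URL.
    proxy_host, full_url = url
    return proxy_host, full_url


print(host_and_selector('http://localhost:8000/trac-dev'))
print(host_and_selector(('172.18.2.7:3128', 'http://localhost:8000/trac-dev')))
```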

If you need further details ... don't hesitate to ask anything you
want ;)

PS: What did you mean when you said the following?

> Do you have an HTTP proxy? running at the *same* port? (!)

I don't understand this, since *I already said* that *I accessed* my Trac
environment using my web browser (Opera 9.63, I don't know whether this
is relevant at all ... ), *I sent you* the lines tracd wrote to
stdout (or stderr ... I am not very sure right now ... ;) and *I told
you* that, once I applied the patch *I submitted*, everything was *back
to normal* ...

I don't understand how all this could be possible if I were running
tracd and an HTTP proxy on the same port, or even if the
http_proxy envvar were set to the hostname + port where my Trac
instance is listening for incoming connections ...

Anyway ... CMIIW ...

I also checked that immediately before executing the following
statements ...

{{{
#!python

# urllib.py

        h = httplib.HTTP(host)
        if data is not None:
            h.putrequest('POST', selector)
            h.putheader('Content-Type', 'application/x-www-form-urlencoded')
            h.putheader('Content-Length', '%d' % len(data))
        else:
            h.putrequest('GET', selector)

}}}

... selector is bound to 'http://localhost:8000/trac-dev' ... BTW the
else clause is the one executed in this case, and this is
consistent with tracd reports I sent before and is logical since
data arg is missing in the code snippet I sent before.

@pitrou
Member

pitrou commented Jan 27, 2009

I suppose 172.18.2.7:3128 is the address:port of your proxy, right?
In which case, urllib seems to do the right thing. When talking to an
HTTP proxy, requests are of the form "GET http://site.com/path", rather
than "GET /path". It's up to the proxy to strip the host part of the URL
when forwarding the request to the target server.

(but I suppose tracd could also be more permissive and allow the "GET
http://site.com/path" variant. It seems Apache does)
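
To make the two request forms concrete, here is a small Python 3 sketch that builds the request line a client would send in each case. The helper names `request_target` and `request_line` are illustrative only and do not exist in urllib or httplib.

```python
from urllib.parse import urlsplit


def request_target(url, via_proxy):
    """Return what follows the method on the HTTP request line."""
    if via_proxy:
        # A proxy needs the full URL to know where to forward the request.
        return url
    parts = urlsplit(url)
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    return target


def request_line(method, url, via_proxy=False, version='HTTP/1.0'):
    return '%s %s %s' % (method, request_target(url, via_proxy), version)


print(request_line('GET', 'http://localhost:8000/trac-dev'))
# GET /trac-dev HTTP/1.0
print(request_line('GET', 'http://localhost:8000/trac-dev', via_proxy=True))
# GET http://localhost:8000/trac-dev HTTP/1.0
```

The second line matches what tracd logged in the original report, which is consistent with the request having been built for a proxy.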

@olemis
Mannequin Author

olemis mannequin commented Jan 27, 2009

Quoting Antoine Pitrou ...

> I suppose 172.18.2.7:3128 is the address:port of your proxy, right?

Yes ...

> In which case, urllib seems to do the right thing. When talking to an
> HTTP proxy, requests are of the form "GET http://site.com/path", rather
> than "GET /path". It's up to the proxy to strip the host part of the URL
> when forwarding the request to the target server.

This being said ...

> (but I suppose tracd could also be more permissive and allow the "GET
> http://site.com/path" variant. It seems Apache does)

... It works with Apache (I am talking about trac once again ...)
therefore I will report this issue to Trac devs instead ...
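
For illustration, a server that wants to tolerate both request forms could normalize the target before routing. The Python 3 sketch below is an assumption about how that might look; `normalize_target` is a hypothetical helper and is not taken from Trac or Apache.

```python
from urllib.parse import urlsplit


def normalize_target(target):
    """Reduce a request target to path[?query], accepting both forms."""
    if target.startswith('/'):
        # Already a local path, as a direct client would send it.
        return target
    parts = urlsplit(target)
    if parts.scheme in ('http', 'https') and parts.netloc:
        # Full URL, as a client talking to a proxy would send it.
        path = parts.path or '/'
        return path + ('?' + parts.query if parts.query else '')
    # Anything else (for example "*") is passed through untouched.
    return target


print(normalize_target('http://localhost:8000/trac-dev'))  # /trac-dev
print(normalize_target('/trac-dev'))                       # /trac-dev
```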

Thnx a lot ! Sorry if I caused you any trouble ...

@ggenellina
Mannequin

ggenellina mannequin commented Jan 28, 2009

> > Do you have an HTTP proxy? running at the *same* port? (!)
>
> I don't understand this since *I already said* that *I accessed* my Trac
> environment using my web browser (Opera 9.63, I don't know whether this
> is relevant at all ... ), *I sent you* the lines outputted by tracd to
> stdout (or stderr ... I am not very sure right now ... ;) and *I told
> you* that, once I applied the patch *I submitted*, everything was *back
> to normal* ...

If you had configured a proxy at localhost:8000, and *also* a Trac instance at that port, and Trac had "won the race" for the port, then you would observe exactly the symptoms you describe. That is, urllib talking to port 8000 as if it were a proxy, and the Trac instance actually there getting confused.

Your patch, as you surely understand now, is not correct; in fact, the code is OK as it is. urllib builds the request in that specific way *because* it thinks there is a proxy. If the proxy is buggy, misconfigured, or nonexistent, it's not the library's fault :)

--
Gabriel Genellina
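
If the proxy is not wanted for a local server, one option is to tell urllib not to use any proxy for that request. The sketch below uses the standard Python 3 `urllib.request` API. In the Python 2 `urllib` discussed in this thread, `urlopen` accepts a `proxies` argument, and passing an empty dict is the documented way to use no proxies. The name `build_direct_opener` is illustrative.

```python
import urllib.request


def build_direct_opener():
    """Return an opener that ignores http_proxy and similar settings."""
    # An explicit empty mapping overrides proxies found in the environment.
    no_proxy = urllib.request.ProxyHandler({})
    return urllib.request.build_opener(no_proxy)


opener = build_direct_opener()
# opener.open('http://localhost:8000/trac-dev') would connect to
# localhost:8000 directly and send the local path after GET. It is not
# called here because it needs a server listening on that port.
```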


@devdanzin
Mannequin

devdanzin mannequin commented Feb 12, 2009

Anyone against closing this as "works for me"?

@orsenthil
Member

Yup, This should be closed too. Thanks.

@devdanzin devdanzin mannequin closed this as completed Feb 18, 2009
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022