classification
Title: httplib.py: ._tunnel() broken
Type: behavior Stage: test needed
Components: Library (Lib) Versions: Python 3.2, Python 3.1, Python 2.7, Python 2.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: brian.curtin, cameron, orsenthil
Priority: normal Keywords:

Created on 2010-01-25 04:34 by cameron, last changed 2010-01-27 03:06 by cameron.

Messages (6)
msg98264 - Author: Cameron Simpson (cameron) Date: 2010-01-25 04:33
I'm trying to do HTTPS via a proxy in Python 2.6.4 (which is supposed to incorporate this fix from issue 1424152).

While trying to debug this starting from the suds library I've been reading httplib.py and urllib2.py to figure out what's going wrong
and found myself around line 687 of httplib.py at the _tunnel()
function.

_tunnel() is broken because _set_hostport() has side effects.

_tunnel() starts with:
  self._set_hostport(self._tunnel_host, self._tunnel_port)
to arrange that the subsequent connection is made to the proxy
host and port, and that is in itself ok.

However, _set_hostport() sets the .host and .port attributes in
the HTTPConnection object.

The next action _tunnel() takes is to send the CONNECT HTTP command,
filling in the endpoint host and port from self.host and self.port.
But these values have been overwritten by the preceding _set_hostport()
call, and so we ask the proxy to connect to itself.

It seems to me that _tunnel() should be grabbing the original host and port before calling _set_hostport(), thus:

  ohost, oport = self.host, self.port
  self._set_hostport(self._tunnel_host, self._tunnel_port)
  self.send("CONNECT %s:%d HTTP/1.0\r\n\r\n" % (ohost, oport))
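
To make the expected wire format concrete, here is a minimal Python 3 sketch of the CONNECT request that should reach the proxy. The helper name build_connect_request is hypothetical and not part of httplib; the format string is the one used in the snippet above. Whatever the attribute names turn out to be, the request line should name the origin server, not the proxy.

```python
def build_connect_request(target_host, target_port):
    """Return the bytes of a bare HTTP/1.0 CONNECT request for the target."""
    # The request line names the origin server that the proxy should reach.
    line = "CONNECT %s:%d HTTP/1.0\r\n\r\n" % (target_host, target_port)
    return line.encode("ascii")


# Hypothetical target, matching the localhost:443 used later in this thread.
request = build_connect_request("localhost", 443)
print(repr(request))
```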

In fact the situation seems even worse: _tunnel() calls send(), send() calls connect(), and connect() calls _tunnel() in an infinite regress.
- Cameron Simpson
msg98266 - Author: Cameron Simpson (cameron) Date: 2010-01-25 05:46
Amendment: regarding the infinite regress, it looks like there will not be a recursion if the caller leaps straight to the .connect() method. However, if they do that then the call to _tunnel() from within connect() will happen _after_ the socket is made directly to the origin host, not via the proxy. So the behaviour seems incorrect then also; it looks very much like _tunnel() must always be called before the real socket connection is established, and .connect() calls _tunnel() afterwards, not before.
msg98268 - Author: Cameron Simpson (cameron) Date: 2010-01-25 06:57
It's looking like I have my idea of .host versus ._tunnel_host swapped. I think things are still buggy, but my interpretation of the bug is wrong or misleading.

I gather that after _set_tunnel(), .host is the proxy host and that ._tunnel_host is the original target host.

I'll follow up here in a bit when I've better characterised the problem.
I think I'm letting urllib2's complicated state stuff confuse me too...
msg98311 - Author: Cameron Simpson (cameron) Date: 2010-01-26 02:24
Well, I've established a few things:
  - I've mischaracterised this issue
  - httplib's _set_tunnel() is really meant to be called from
    urllib2, because using it directly with httplib is totally
    counterintuitive
  - a bare urllib2 setup fails with its own bug

To the first item: _tunnel() feels really fragile with that recursion issue, though it doesn't recurse when called from urllib2.

For the second, here's my test script using httplib:

  H = httplib.HTTPSConnection("localhost", 3128)
  print H
  H._set_tunnel("localhost", 443)
  H.request("GET", "/boguspath")
  os.system("lsof -p %d | grep IPv4" % (os.getpid(),))
  R = H.getresponse()
  print R.status, R.reason

As you can see, one builds the HTTPSConnection object with the proxy's details instead of those of the target URL, and then puts the target URL details in with _set_tunnel(). Am I alone in finding this strange?
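
The same division of roles can be seen without touching the network. This is a minimal sketch against Python 3's http.client (the renamed httplib), where the method is exposed publicly as set_tunnel(); it is not the 2.6.4 code under discussion. The target host example.org is an assumption chosen only to keep the two roles visually distinct, and _tunnel_host and _tunnel_port are private attributes that may change between versions.

```python
import http.client  # Python 3 name of the httplib module

# The constructor takes the PROXY address (localhost:3128, as in the script above).
conn = http.client.HTTPSConnection("localhost", 3128)

# set_tunnel() takes the TARGET host and port that the CONNECT should reach.
# example.org is a hypothetical target used only for illustration.
conn.set_tunnel("example.org", 443)

# No socket has been opened yet; these attributes only record the intent.
print("proxy :", conn.host, conn.port)
print("target:", conn._tunnel_host, conn._tunnel_port)
```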

For the third, my test code is this:

  U = urllib2.Request('https://localhost/boguspath')
  U.set_proxy('localhost:3128', 'https')
  f = urllib2.urlopen(R)
  print f.read()

which fails like this:

  Traceback (most recent call last):
    File "thttp.py", line 15, in <module>
      f = urllib2.urlopen(R)
    File "/opt/python-2.6.4/lib/python2.6/urllib2.py", line 131, in urlopen
      return _opener.open(url, data, timeout)
    File "/opt/python-2.6.4/lib/python2.6/urllib2.py", line 395, in open
      protocol = req.get_type()
  AttributeError: HTTPResponse instance has no attribute 'get_type'

The line numbers are slightly off because I've got some debugging statements in there.

Finally, I flat out do not understand urllib2's set_proxy() method:
  
    def set_proxy(self, host, type):
        if self.type == 'https' and not self._tunnel_host:
            self._tunnel_host = self.host
        else:
            self.type = type
            self.__r_host = self.__original
        self.host = host

When my code calls set_proxy, self.type is None. Now, I had naively expected the first branch to be the only branch. Could someone explain what's happening here, and what is meant to happen?
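
For comparison, here is a minimal sketch of how the two branches behave in Python 3's urllib.request, which carries essentially the same set_proxy() logic. It is not the 2.6.4 urllib2 quoted above, so details may differ. It relies on the fact that a Python 3 Request parses its URL when constructed, so type is already set by the time set_proxy() runs. _tunnel_host is a private attribute and is read here only for illustration.

```python
import urllib.request  # Python 3 home of urllib2's Request class

# https request: set_proxy() remembers the origin host for the CONNECT tunnel.
secure = urllib.request.Request("https://localhost/boguspath")
secure.set_proxy("localhost:3128", "https")
print(secure.host, secure._tunnel_host)

# http request: no tunnel; the proxy is handed the full URL as the selector.
plain = urllib.request.Request("http://localhost/boguspath")
plain.set_proxy("localhost:3128", "http")
print(plain.host, plain.selector, plain._tunnel_host)
```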

I'm thinking that this bug may turn into a doc fix instead of a behaviour fix, but I'm finding it surprisingly hard to know how urllib2 is supposed to be used.
msg98312 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-01-26 03:08
As you noticed, the _set_tunnel method is a private method not intended to be used directly. It's used by urllib2 when HTTPS through a proxy is required.
urllib2 works like this: it reads the HTTPS_PROXY environment variable (which in turn includes HTTPSProxyHandler and HTTPSProxyAuthenticationHandler) and then tries to do a urlopen on an https:// URL or a request object through the tunnel, doing a CONNECT instead of a GET.

How do you think the docs can be improved? If you have any suggestions please upload a patch.
Thanks.
msg98401 - Author: Cameron Simpson (cameron) Date: 2010-01-27 03:06
Well, following your description I've backed out my urllib2 test case to this:

  f = urllib2.urlopen('https://localhost/boguspath')
  os.system("lsof -p %d | grep IPv4" % (os.getpid(),))
  f = urllib2.urlopen(R)
  print f.read()

and it happily runs HTTPS through the proxy if I set the https_proxy envvar. So it's all well and good for the "just do what the environment suggests" use case.
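
A quick way to see what the environment is suggesting, sketched with Python 3's urllib.request (the successor to urllib2). The proxy address is an assumption matching the localhost:3128 used elsewhere in this thread, and the variable is set inside the script only so that the example is self-contained.

```python
import os
import urllib.request

# Assumed proxy address; normally this would be set in the shell instead.
os.environ["https_proxy"] = "http://localhost:3128"

# getproxies_environment() collects every <scheme>_proxy variable into a dict.
proxies = urllib.request.getproxies_environment()
print(proxies.get("https"))
```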

However, my older test:

  U = urllib2.Request('https://localhost/boguspath')
  U.set_proxy('localhost:3128', 'https')
  f = urllib2.urlopen(R)
  print f.read()

still blows up with:

  File "/opt/python-2.6.4/lib/python2.6/urllib2.py", line 381, in open
    protocol = req.get_type()
  AttributeError: HTTPResponse instance has no attribute 'get_type'

Now, this is the use case for "I have a custom proxy setup for this activity".

It seems a little odd that "req" above is an HTTPResponse instead of a Request, and that may be why there's no .get_type() method available.

I also see nothing obviously wrong with my set_proxy() call above based on the docs for the .set_proxy() method, though obviously it fails.

I think what may be needed is a small expansion of the section in the Examples area on proxies. There's a description of the use of the *_proxy envvars there (and not elsewhere, which seems wrong) and an example of providing a proxy Handler. An additional example with a functioning use of a bare .set_proxy() might help.
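
For reference, the proxy Handler route looks roughly like this in Python 3's urllib.request. This is a sketch under assumptions, not the text of the existing docs example: the proxy address is hypothetical, and nothing is fetched, the opener is only configured.

```python
import urllib.request

# Hypothetical proxy address; an empty dict would disable proxies entirely.
proxy_handler = urllib.request.ProxyHandler({"https": "http://localhost:3128"})
opener = urllib.request.build_opener(proxy_handler)

# The opener is only configured here; opener.open(url) would make the request.
print(proxy_handler.proxies)
```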
History
Date                 User          Action  Args
2010-01-27 03:06:57  cameron       set     messages: + msg98401
2010-01-26 03:08:36  orsenthil     set     messages: + msg98312
2010-01-26 02:24:06  cameron       set     messages: + msg98311
2010-01-25 06:57:44  cameron       set     messages: + msg98268
2010-01-25 06:22:17  orsenthil     set     assignee: orsenthil
                                           nosy: + orsenthil
2010-01-25 05:46:37  cameron       set     messages: + msg98266
2010-01-25 04:57:07  brian.curtin  set     priority: normal
                                           nosy: + brian.curtin
                                           versions: + Python 3.1, Python 2.7, Python 3.2
                                           stage: test needed
2010-01-25 04:34:00  cameron       create