Martin, thanks for elaborating on my thoughts!
I have dug a bit deeper into Python2's urllib code with pdb, and I think I have narrowed the issue down to what open_http does.
In my example code, replacing opener.open(url) with opener.open_http(url) gives the same problem.
I realize I did not provide you with the output of the script, so here it is:
* Python 2.7.10
python urllib_error.py
('Trying to open', 'https://www.python.org')
Traceback (most recent call last):
  File "urllib_error.py", line 30, in <module>
    opener.open_http((host, selector))
  File "/home/mazzucco/.pyenv/versions/2.7.10/lib/python2.7/urllib.py", line 364, in open_http
    return self.http_error(url, fp, errcode, errmsg, headers)
  File "/home/mazzucco/.pyenv/versions/2.7.10/lib/python2.7/urllib.py", line 381, in http_error
    return self.http_error_default(url, fp, errcode, errmsg, headers)
  File "/home/mazzucco/.pyenv/versions/2.7.10/lib/python2.7/urllib.py", line 386, in http_error_default
    raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 501, 'Not Implemented', <httplib.HTTPMessage instance at 0x7f875a67b950>)
* Python 3.4.3
python urllib_error.py
Trying to open https://www.python.org
Traceback (most recent call last):
  File "urllib_error.py", line 30, in <module>
    opener.open_http((host, selector))
  File "/home/mazzucco/.pyenv/versions/3.4.3/lib/python3.4/urllib/request.py", line 1805, in open_http
    return self._open_generic_http(http.client.HTTPConnection, url, data)
  File "/home/mazzucco/.pyenv/versions/3.4.3/lib/python3.4/urllib/request.py", line 1801, in _open_generic_http
    response.status, response.reason, response.msg, data)
  File "/home/mazzucco/.pyenv/versions/3.4.3/lib/python3.4/urllib/request.py", line 1821, in http_error
    return self.http_error_default(url, fp, errcode, errmsg, headers)
  File "/home/mazzucco/.pyenv/versions/3.4.3/lib/python3.4/urllib/request.py", line 1826, in http_error_default
    raise HTTPError(url, errcode, errmsg, headers, None)
urllib.error.HTTPError: HTTP Error 501: Not Implemented
When I unwrap the contents of httplib.HTTPMessage, the error page returned by the squid proxy says:
-------------------------------------------------------
ERROR
The requested URL could not be retrieved
The following error was encountered while trying to retrieve the URL: https://www.python.org
Unsupported Request Method and Protocol
Squid does not support all request methods for all access protocols. For example, you can not POST a Gopher request.
-------------------------------------------------------
Looking at Python2's implementation of URLopener's open_http, I can get an even more minimal failing example limited to httplib:
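import httplib

host = 'proxy.corp.com:8181' # this is not the actual proxy
selector = 'https://www.python.org'
print("Trying to open", selector)

# Connect to the proxy and send a plain GET with the https:// URL as selector,
# mirroring what URLopener.open_http does.
h = httplib.HTTP(host)
h.putrequest('GET', selector)
h.putheader('User-Agent', 'Python-urllib/1.17')
h.endheaders(None)

# Read the status line and headers of the proxy's reply.
errcode, errmsg, headers = h.getreply()
print(errcode, errmsg)
print(headers.items())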
Running the script on Python 2.7.10 prints:
('Trying to open', 'https://www.python.org')
(501, 'Not Implemented')
[('content-length', '3069'), ('via', '1.0 proxy.corp.com (squid/3.1.6)'), ('x-cache', 'MISS from proxy.corp.com'), ('content-language', 'en'), ('x-squid-error', 'ERR_UNSUP_REQ 0'), ('x-cache-lookup', 'NONE from proxy.corp.com:8181'), ('vary', 'Accept-Language'), ('server', 'squid/3.1.6'), ('proxy-connection', 'close'), ('date', 'Fri, 10 Jul 2015 09:27:14 GMT'), ('content-type', 'text/html'), ('mime-version', '1.0')]
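For comparison, here is a minimal sketch of what I would expect a client to send instead. The 501 above is consistent with the proxy receiving a plain GET for an https:// URL, whereas HTTP proxies generally expect a CONNECT request that opens a tunnel to the target host. The helper names proxy_request_line and fetch_https_via_proxy are made up for illustration, the code uses Python 3 module names (httplib in Python 2.7 has the same set_tunnel method), and I have not run the networked part against the proxy in question.

```python
# Sketch only, not the urllib implementation.
import http.client
from urllib.parse import urlsplit


def proxy_request_line(url):
    """Return the request line a client would typically send to an HTTP proxy."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        # HTTPS through a proxy: ask the proxy to open a raw tunnel.
        return "CONNECT %s:%d HTTP/1.1" % (parts.hostname, parts.port or 443)
    # Plain HTTP through a proxy: send the absolute URL as the selector.
    return "GET %s HTTP/1.1" % url


def fetch_https_via_proxy(proxy_host, proxy_port, url):
    """Hypothetical helper: GET an https:// URL through a CONNECT tunnel."""
    parts = urlsplit(url)
    conn = http.client.HTTPSConnection(proxy_host, proxy_port)
    # set_tunnel makes connect() issue CONNECT before the TLS handshake.
    conn.set_tunnel(parts.hostname, parts.port or 443)
    conn.request("GET", parts.path or "/")
    return conn.getresponse()
```

With this sketch, proxy_request_line('https://www.python.org') gives 'CONNECT www.python.org:443 HTTP/1.1', while the failing script above effectively sends 'GET https://www.python.org HTTP/1.0'.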
As I said, I found out about this when using buildout to download files over HTTPS.
Buildout uses urllib.urlretrieve on Python2 and urllib.request.urlretrieve on Python3. I guess that the latter was fixed in issue 1424152, which is why I can download with buildout on Python3.