The file-like object u returned by the urlopen() function in both Python
2.6 and 3.0 has an info() method that returns an 'HTTPMessage' object. For
example:
::: Python 2.6
>>> from urllib2 import urlopen
>>> u = urlopen("http://www.python.org")
>>> u.info()
<httplib.HTTPMessage instance at 0xce5738>
>>>
::: Python 3.0
>>> from urllib.request import urlopen
>>> u = urlopen("http://www.python.org")
>>> u.info()
<http.client.HTTPMessage object at 0x4bfa10>
>>>
So far, so good. HTTPMessage is defined in two different modules, but
that's fine (it's just library reorganization).
Two major problems:
1. There is no documentation whatsoever on HTTPMessage. No description
in the docs for httplib (Python 2.6) or http.client (Python 3.0).
2. The HTTPMessage object in Python 2.6 derives from mimetools.Message
and has a totally different programming interface from HTTPMessage in
Python 3.0, which derives from email.message.Message. Check it out:
::: Python 2.6
>>> dir(u.info())
['__contains__', '__delitem__', '__doc__', '__getitem__', '__init__',
'__iter__', '__len__', '__module__', '__setitem__', '__str__',
'addcontinue', 'addheader', 'dict', 'encodingheader', 'fp', 'get',
'getaddr', 'getaddrlist', 'getallmatchingheaders', 'getdate',
'getdate_tz', 'getencoding', 'getfirstmatchingheader', 'getheader',
'getheaders', 'getmaintype', 'getparam', 'getparamnames', 'getplist',
'getrawheader', 'getsubtype', 'gettype', 'has_key', 'headers',
'iscomment', 'isheader', 'islast', 'items', 'keys', 'maintype',
'parseplist', 'parsetype', 'plist', 'plisttext', 'readheaders',
'rewindbody', 'seekable', 'setdefault', 'startofbody', 'startofheaders',
'status', 'subtype', 'type', 'typeheader', 'unixfrom', 'values']
::: Python 3.0
>>> dir(u.info())
['__class__', '__contains__', '__delattr__', '__delitem__', '__dict__',
'__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
'__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__',
'__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__',
'__str__', '__subclasshook__', '__weakref__', '_charset',
'_default_type', '_get_params_preserve', '_headers', '_payload',
'_unixfrom', 'add_header', 'as_string', 'attach', 'defects',
'del_param', 'epilogue', 'get', 'get_all', 'get_boundary',
'get_charset', 'get_charsets', 'get_content_charset',
'get_content_maintype', 'get_content_subtype', 'get_content_type',
'get_default_type', 'get_filename', 'get_param', 'get_params',
'get_payload', 'get_unixfrom', 'getallmatchingheaders', 'is_multipart',
'items', 'keys', 'preamble', 'replace_header', 'set_boundary',
'set_charset', 'set_default_type', 'set_param', 'set_payload',
'set_type', 'set_unixfrom', 'values', 'walk']
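Code that has to run against both interfaces can paper over the difference with feature detection. The following is only a sketch: header_values() is a hypothetical helper, not part of either library, and it relies on nothing more than what the two dir() listings above show, namely get_all() on the 3.0 object and getheaders() on the 2.6 object.

```python
def header_values(msg, name):
    """Return every value of header `name` as a list (hypothetical helper)."""
    if hasattr(msg, "get_all"):
        # Python 3.x: HTTPMessage follows the email.message.Message interface.
        # get_all() returns None when the header is absent.
        return msg.get_all(name) or []
    if hasattr(msg, "getheaders"):
        # Python 2.x: HTTPMessage follows the mimetools.Message interface.
        return list(msg.getheaders(name))
    raise TypeError("unrecognized message object: %r" % (msg,))
```

Something like header_values(u.info(), "Content-Type") would then work unchanged on either version, at the cost of hiding which interface is actually in use.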
I know that getting rid of mimetools was desired, but I have no idea if
changing the API on HTTPMessage was intended or not. In any case, it's
one of the few cases in the entire library where the programming
interface to an object changes radically from 2.6 to 3.0.
I ran into this problem with code that was trying to properly determine
the charset encoding of the byte string returned by urlopen().
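For what it's worth, the charset lookup itself can be written so that it tolerates both interfaces. This is a rough sketch, not a recommended API: response_charset() is a made-up name, and the "iso-8859-1" fallback is my own assumption about what to do when the server sends no charset parameter.

```python
def response_charset(msg, default="iso-8859-1"):
    """Return the charset named in Content-Type, or `default` (an assumption)."""
    if hasattr(msg, "get_content_charset"):
        # Python 3.x: email.message.Message interface
        charset = msg.get_content_charset()
    elif hasattr(msg, "getparam"):
        # Python 2.x: mimetools.Message interface
        charset = msg.getparam("charset")
    else:
        charset = None
    # Normalize the case so both branches give comparable results.
    return charset.lower() if charset else default

# Intended use (needs network access, so left as a comment):
#   u = urlopen("http://www.python.org")
#   text = u.read().decode(response_charset(u.info()))
```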
I haven't checked whether 2to3 deals with this or not, but it might be
something for someone to look at in their copious amounts of spare time.
There is also a difference in what HTTPResponse.getheaders() returns between Python 2.7.2 and 3.2.2.
Python 2.7.2 (default, Jun 12 2011, 14:24:46) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import httplib
>>> c = httplib.HTTPConnection('www.joelverhagen.com')
>>> c.request('GET', '/sandbox/tests/cookies.php')
>>> c.getresponse().getheaders()
[('content-length', '0'), ('set-cookie', 'test_cookie1=foobar; expires=Fri, 02-Mar-2012 16:54:15 GMT, test_cookie2=barfoo; expires=Fri, 02-Mar-2012 16:54:15 GMT'), ('vary', 'Accept-Encoding'), ('server', 'Apache'), ('date', 'Fri, 02 Mar 2012 16:53:15 GMT'), ('content-type', 'text/html')]
Python 3.2.2 (default, Sep 4 2011, 09:07:29) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from http import client
>>> c = client.HTTPConnection('www.joelverhagen.com')
>>> c.request('GET', '/sandbox/tests/cookies.php')
>>> c.getresponse().getheaders()
[('Date', 'Fri, 02 Mar 2012 16:56:40 GMT'), ('Server', 'Apache'), ('Set-Cookie', 'test_cookie1=foobar; expires=Fri, 02-Mar-2012 16:57:40 GMT'), ('Set-Cookie', 'test_cookie2=barfoo; expires=Fri, 02-Mar-2012 16:57:40 GMT'), ('Vary', 'Accept-Encoding'), ('Content-Length', '0'), ('Content-Type', 'text/html')]
As you can see, HTTPResponse.getheaders() in 2.7.2 joins headers with the same name using ", ". In 3.2.2, the headers are kept separate, as two or more 2-tuples.
This causes problems if you convert the list of 2-tuples to a dict, because the keys collide (all but one of the values associated with a non-unique key are overwritten). It looks like this is caused by the use of the email header parser (which keeps the keys and values as separate 2-tuples). In Python 2.7.2, the HTTPMessage.addheader(...) function does the comma-separating.
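To make the collision concrete, here is a small sketch using a shortened copy of the 3.2.2 output above. group_headers() is a hypothetical helper of mine, not something http.client provides; it collects repeated headers into lists instead of letting dict() overwrite them.

```python
def group_headers(pairs):
    """Map each lowercased header name to the list of all its values."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name.lower(), []).append(value)
    return grouped

# Shortened copy of the list returned by getheaders() in 3.2.2.
headers = [
    ("Set-Cookie", "test_cookie1=foobar; expires=Fri, 02-Mar-2012 16:57:40 GMT"),
    ("Set-Cookie", "test_cookie2=barfoo; expires=Fri, 02-Mar-2012 16:57:40 GMT"),
    ("Vary", "Accept-Encoding"),
]

naive = dict(headers)             # only the last Set-Cookie value survives
grouped = group_headers(headers)  # both Set-Cookie values are kept
```

Joining a grouped list with ", " would reproduce the 2.7.2-style value, but for Set-Cookie the joined string is awkward to split again because the expires dates themselves contain commas.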
Is this API change intentional? Should HTTPResponse.getheaders() comma-separate the values like the HTTPResponse.getheader(...) function (in both 2.7.2 and 3.2.2)?
See also:
https://github.com/shazow/urllib3/issues/3#issuecomment-3008415