classification
Title: urllib2 header capitalization
Type: behavior
Components: Library (Lib) Versions: Python 2.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BitTorment, frispete, jjlee, orsenthil
Priority: Keywords: patch

Created on 2008-03-11 21:22 by frispete, last changed 2008-05-06 12:21 by BitTorment.

Files
File name Uploaded Description Edit Remove
urllib2-cap-headers.diff frispete, 2008-03-11 21:22 Patch proposal
issue2275.patch orsenthil, 2008-03-30 20:09
Messages
msg63466 (view) Author: Hans-Peter Jansen (frispete) Date: 2008-03-11 21:22
The urllib2 behavior related to headers is - hmm - improvable. 
It simply capitalize() the key, which leads to funny results like:
Accept-charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
while this is seemingly conforming to the specs, it's simply different 
to every other implementation of such things.. 

And we can do better. How about:
--- /usr/lib/python/urllib2.py  2008-01-10 19:03:55.000000000 +0100
+++ urllib2.py  2008-03-11 21:25:33.523890670 +0100
@@ -261,13 +261,16 @@ class Request:
     def is_unverifiable(self):
         return self.unverifiable

+    def _cap_header_key(self, key):
+        return '-'.join((ck.capitalize() for ck in key.split('-')))
+
     def add_header(self, key, val):
         # useful for something like authentication
-        self.headers[key.capitalize()] = val
+        self.headers[self._cap_header_key(key)] = val

     def add_unredirected_header(self, key, val):
         # will not be added to a redirected request
-        self.unredirected_hdrs[key.capitalize()] = val
+        self.unredirected_hdrs[self._cap_header_key(key)] = val

     def has_header(self, header_name):
         return (header_name in self.headers or

I'm not happy with the _cap_header_key name, but you get the idea.
The patch is optimized to operate with maximum locality. It's also 
attached.

I would be very grateful, if something similar could be applied.

Opinions?
msg64740 (view) Author: Senthil (orsenthil) Date: 2008-03-30 04:43
I understand your implementation of _cap_header_key function.
But should not this patch be handled in a way wherein.
key.capitalize() is just replaced by key.upper()?

Should not that suffice?
What is the difference between _cap_header_key and key.upper()?

Thank you,
Senthil
msg64747 (view) Author: Hans-Peter Jansen (frispete) Date: 2008-03-30 12:12
> But should not this patch be handled in a way wherein.
> key.capitalize() is just replaced by key.upper()?

Hmm, are you sure?
>>> "hello".upper()
'HELLO'
>>> 

but the issue is with values containing dashes:
>>> 'accept-charset'.capitalize()
'Accept-charset'
whereas the rest of the world would expect:
'Accept-Charset'
        ^

This is especially useful, if you try to emulate the behavior of a
typical browser as close as possible.
msg64748 (view) Author: John J Lee (jjlee) Date: 2008-03-30 14:56
urllib2.Request.headers is, in practice, an undocumented public
interface.  Did you run the tests?  There is room for improvement here,
but not in the way you suggest.

python[1]$ python2.6
iPython 2.6a1+ (trunk:62045M, Mar 30 2008, 03:07:23) 
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import test.test_urllib2
>>> print test.test_urllib2.test_request_headers_dict.__doc__

    The Request.headers dictionary is not a documented interface.  It should
    stay that way, because the complete set of headers are only accessible
    through the .get_header(), .has_header(), .header_items() interface.
    However, .headers pre-dates those methods, and so real code will be
using
    the dictionary.

    The introduction in 2.4 of those methods was a mistake for the same
reason:
    code that previously saw all (urllib2 user)-provided headers in .headers
    now sees only a subset (and the function interface is ugly and
incomplete).
    A better change would have been to replace .headers dict with a dict
    subclass (or UserDict.DictMixin instance?)  that preserved the .headers
    interface and also provided access to the "unredirected" headers.  It's
    probably too late to fix that, though.


    Check .capitalize() case normalization:

    >>> url = "http://example.com"
    >>> Request(url, headers={"Spam-eggs": "blah"}).headers["Spam-eggs"]
    'blah'
    >>> Request(url, headers={"spam-EggS": "blah"}).headers["Spam-eggs"]
    'blah'

    Currently, Request(url, "Spam-eggs").headers["Spam-Eggs"] raises
KeyError,
    but that could be changed in future.

    
>>>
msg64749 (view) Author: John J Lee (jjlee) Date: 2008-03-30 15:00
Specifically, these improvements could be made:

 * the headers actually sent to httplib could be normalized to
Standard-Http-Case by urllib2
 * the urllib2.Request.headers interface could support case-insensitive
key lookup
msg64760 (view) Author: Senthil (orsenthil) Date: 2008-03-30 20:06
Hi John, Greetings!
I agree with both of your suggestions.
Attached is the patch which aims to implement both in one go.
Please provide your comments on that.
If this method is okay, I shall go ahead with patches for tests and
attach it also.

Thanks,
Senthil
msg64761 (view) Author: Hans-Peter Jansen (frispete) Date: 2008-03-30 20:28
Hi Senthil,

that looks promising, and the title() trick is nice, as it fixes my 
issue.. 

Thanks,
Pete
msg66311 (view) Author: Martin McNickle (BitTorment) Date: 2008-05-06 12:21
This looks good.  I would suggest that the unredirected_hdrs would use
the CaseInsensitiveDict too.

There is still the problem (from the test documentation) that accessing
.headers directly will only show a subset of headers i.e. it won't show
any headers from .unredirected_hdrs.  Have you any suggestions as to how
they both can be accessed from the same interface?

The test documentation also says:

"Note the case normalization of header names here, to
.capitalize()-case.  This should be preserved for
backwards-compatibility.  (In the HTTP case, normalization to
.title()-case is done by urllib2 before sending headers to
httplib)."

It suggests that capitalize() should be kept for backwards
compatibility.  I have tested and the headers actually sent to the
server are in title()-case even though they are stored in the Request
object as captitalize()-case.

This would initially suggest that the case-normalization should be
removed from the patch.  However, as .headers would now be using a
case-insensitive dictionary, this would still ensure backwards compatibily.
History
Date User Action Args
2008-05-06 12:21:45BitTormentsetnosy: + BitTorment
messages: + msg66311
2008-03-30 20:28:36frispetesetmessages: + msg64761
2008-03-30 20:09:42orsenthilsetfiles: + issue2275.patch
2008-03-30 20:09:22orsenthilsetfiles: - issue2275.patch
2008-03-30 20:06:39orsenthilsetfiles: + issue2275.patch
messages: + msg64760
2008-03-30 15:00:25jjleesetmessages: + msg64749
2008-03-30 14:56:50jjleesetnosy: + jjlee
messages: + msg64748
2008-03-30 12:12:27frispetesetmessages: + msg64747
2008-03-30 04:43:46orsenthilsetnosy: + orsenthil
messages: + msg64740
2008-03-11 21:22:19frispetecreate