classification
Title: urllib2 forces title() on header names, breaking some requests
Type: enhancement Stage: needs patch
Components: Library (Lib) Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, Cal.Leeming, demian.brecht, eric.araujo, karlcow, orsenthil, r.david.murray, santa4nt, sleepycal, terry.reedy
Priority: normal Keywords:

Created on 2011-06-30 19:23 by Cal.Leeming, last changed 2014-06-24 06:39 by karlcow.

Messages (20)
msg139512 - (view) Author: Cal Leeming (Cal.Leeming) Date: 2011-06-30 19:23
I came up against a problem today whilst trying to submit a request to a remote API. The header needed to contain:

'Content-MD5' : "md5here"

But the urllib2 Request() forces capitalize() on all header names, and transformed it into "Content-Md5", which in turn made the remote web server ignore the header and break the request (the remote side is case sensitive, and we have no control over it).

I attempted to get smart by using the following patch:
class _str(str):
    def capitalize(s):
        print s
        return s

_headers = {_str("Content-MD5") : 'md5here'}

But this failed to work:


---HEADERS---
{'Content-MD5': 'nts0yj7AdzJALyNOxafDyA=='}

---URLLIB2 DEBUG---
send: 'POST /api/v1 m HTTP/1.1\r\nContent-Md5: nts0yj7AdzJALyNOxafDyA==\r\n\r\n\r\n'

Upon inspecting the urllib2.py source, I found 3 references to capitalize() which seem to cause this problem, but it seems impossible to monkey patch, nor fix without forking.

Therefore, I'd like to +1 a feature request to have an extra option at the time of the request being opened, to bypass the capitalize() on header names (maybe, header_keep_original = True or something). 

And, if anyone could suggest a possible monkey patch (which doesn't involve forking huge chunks of code), that'd be good too :)

Thanks

Cal
msg139514 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-30 19:33
Well, three occurrences means you only have three methods to patch (and two of them are trivial).  But I agree that copying the non-trivial method doesn't look fun from a maintenance perspective.

You could also try using an object that is not a subclass of str.  The problem with subclassing str is that some (most?) string methods do not do a subclass check but directly call the C implementation of the method.  I think there's an issue in the tracker somewhere about that.

The problem with not subclassing string, of course, is that you may end up implementing a lot of methods on your object to get it to play nicely with urllib2's assumption that it *is* a string.
msg139515 - (view) Author: Cal Leeming (Cal.Leeming) Date: 2011-06-30 19:39
Sorry, I should clarify.. The str() patch worked, but it failed to work within the realm of urllib2:


s = _str("Content-MD5")
print "Builtin:"
print "plain:       %s" % ( s )
print "capitalized: %s" % ( s.capitalize() )

s = str("Content-MD5")
print "Builtin:"
print "plain:       %s" % ( s )
print "capitalized: %s" % ( s.capitalize() )

Builtin:
plain:       Content-MD5
capitalized: Content-MD5
Builtin:
plain:       Content-MD5
capitalized: Content-md5

Why it works in the unit test, and not within urllib2, is totally beyond me. Especially since I put a debug call on the method, and it does get called.. yet urllib2 debug still shows it sending the wrong value.

---
capitalize() bypassed: sending value: Content-MD5
send: 'POST /api/url\r\nContent-Md5: nts0yj7AdzJALyNOxafDyA==\r\n\r\n'
---

I have a feeling that the problem may lie somewhere after the opener (like HTTPConnection or AbstractHTTPHandler), rather than the urllib2 calls to capitalize(), but not having much luck monkey patching those :X
msg139516 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-30 19:50
Well, judging by your test it isn't capitalize that's the issue.  capitalize produces Content-md5, whereas debug is showing urllib2 sending Content-Md5.  So something else is massaging the header name on send.
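
The difference between the two methods can be checked on a bare string. A minimal Python 3 sketch, using the header name from this report:

```python
name = "Content-MD5"

# capitalize() upper-cases the first character and lower-cases the rest
capitalized = name.capitalize()

# title() upper-cases the first letter of every run of letters,
# so the part after the hyphen is title-cased as well
titled = name.title()

print(capitalized)  # Content-md5
print(titled)       # Content-Md5
```

The debug output shows the title() form, which is why overriding capitalize() alone does not help.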
msg139519 - (view) Author: Cal Leeming (Cal.Leeming) Date: 2011-06-30 20:17
(short answer, I found the cause, and a suitable monkey patch) - below are details of how I did it and steps I took.

-----

Okay so I forked AbstractHTTPHandler() then patched do_request_(), at which point "request.headers" and request.header_items() have the correct header name (Content-MD5).

So I tried this:
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
opener.addheaders = [("Content-TE5", 'test'), ]

However the headers came back capitalized, so the problem is happening somewhere after addheaders. 


 > grep -R "addheaders" *.py
urllib.py:        self.addheaders = [('User-Agent', self.version)]
urllib.py:        self.addheaders.append(args)
urllib.py:        for args in self.addheaders: h.putheader(*args)
urllib.py:            for args in self.addheaders: h.putheader(*args)
urllib2.py:        self.addheaders = [('User-agent', client_version)]
urllib2.py:        for name, value in self.parent.addheaders:

> grep -R "def putheader" *.py
httplib.py:    def putheader(self, header, value):
httplib.py:    def putheader(self, header, *values):

I also then found: http://stackoverflow.com/questions/3278418/testing-urllib2-application-http-responses-loaded-from-files

I then patched this:

            class HTTPConnection(httplib.HTTPConnection):
                def putheader(self, header, value):
                    print [header, value]

This in turn brought back:
['Content-Md5', 'nts0yj7AdzJALyNOxafDyA==']

Which means it's happening before putheader(). So I patched _send_request() on HTTPConnection(), and that also brought back 'Content-Md5'. Exception trace shows:

  File "/ddcms/dev/webapp/../webapp/sites/ma/management/commands/ddcms.py", line 147, in _send_request
    _res = opener.open(req)
  -- CORRECT --
  File "/usr/local/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  -- CORRECT --
  File "/usr/local/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  -- CORRECT --
  File "/usr/local/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  -- CORRECT --
  File "/ddcms/dev/webapp/../webapp/sites/ma/management/commands/ddcms.py", line 126, in http_open
    return self.do_open(HTTPConnection, req)
  -- CORRECT --
  File "/usr/local/lib/python2.6/urllib2.py", line 1142, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  -- INVALID --
  File "/usr/local/lib/python2.6/httplib.py", line 914, in request
    self._send_request(method, url, body, headers)
  File "/ddcms/dev/webapp/../webapp/sites/ma/management/commands/ddcms.py", line 122, in _send_request
    raise


The line that causes it?

                    headers = dict(
                        (name.title(), val) for name, val in headers.items())
                    
So it would appear that title() also needs monkey patching.. Patched to use:


# Patch case sensitive headers (due to reflected API being non RFC compliant, and
# urllib2 not giving the option to choose between the two)
class _str(str):
    def capitalize(s):
        print "capitalize() bypassed: sending value: %s" % ( s )
        return s
    
    def title(s):
        print "title() bypassed: sending value: %s" % ( s )
        return s

_headers = {_str('Content-MD5') : _md5_content}

capitalize() bypassed: sending value: Content-MD5
title() bypassed: sending value: Content-MD5
send: 'POST /url/api HTTP/1.1\r\nContent-MD5: nts0yj7AdzJALyNOxafDyA==\r\n\r\n'
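
For readers on Python 3, the same workaround can be sketched against urllib.request. This assumes, as reported later in this thread for 3.3, that Request.add_header() calls capitalize() on the name; CaseStr is a hypothetical name, and the trick relies on implementation details that may change.

```python
import urllib.request


class CaseStr(str):
    """str subclass whose case-normalizing methods return the name unchanged."""

    def capitalize(self):
        return self

    def title(self):
        return self


# Building a Request does not open a connection.
req = urllib.request.Request(
    "http://example.com/",
    headers={CaseStr("Content-MD5"): "md5here"},
)

# Names as stored on the request after add_header() has run
stored_names = [name for name, _ in req.header_items()]
print(stored_names)
```

The stored name keeps its original spelling because add_header() ends up calling the overridden method rather than str.capitalize().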
msg139520 - (view) Author: Cal Leeming (Cal.Leeming) Date: 2011-06-30 20:19
So @r.david.murray, it would appear you were right :D Really, I should have looped through each method on str(), and wrapped them all to see which were being called, but lesson learned I guess.

Sooo, I guess now the question is, can we possibly get a vote on having a feature which disables this functionality from the opener level. Something like:
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1, keep_original_header_case=True))

But obviously a less tedious attribute name :)
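
A rough sketch of what such a switch could mean, written as a standalone function. normalize_header_names and keep_original_case are hypothetical names for illustration only; no such option exists in urllib2.

```python
def normalize_header_names(headers, keep_original_case=False):
    """Return a copy of headers, title-casing names unless told otherwise.

    keep_original_case is a hypothetical flag; the default mirrors the
    title() call quoted earlier in this thread.
    """
    if keep_original_case:
        return dict(headers)
    return {name.title(): value for name, value in headers.items()}


headers = {"Content-MD5": "md5here"}
default_result = normalize_header_names(headers)
preserved_result = normalize_header_names(headers, keep_original_case=True)

print(default_result)    # {'Content-Md5': 'md5here'}
print(preserved_result)  # {'Content-MD5': 'md5here'}
```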

In the meantime, if anyone else comes up against this problem, the code I pasted above will work fine for now.

Cal
msg139523 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-06-30 20:53
A feature request for a way to control this is reasonable.  However, new features can only go into 3.3.
msg139524 - (view) Author: Cal Leeming (Cal.Leeming) Date: 2011-06-30 21:00
Damn 3.3 huh? Ah well, at least it's in the pipeline ^_^

Thanks for your help on this @r.david.murray!
msg139546 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-07-01 06:09
AFAIR, that capitalize part is a requirement somewhere in the RFC. If the server did not behave in the proper manner, it may not be a good idea for the client to change (or to offer a permissive flag).
msg139547 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2011-07-01 06:12
Sorry, not "Capitalize", but the "Title" part. One can find some bugs which led to this change in urllib2.
msg139551 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-07-01 07:45
Quoting http://tools.ietf.org/html/rfc2068#section-4.2:

  Field names are case-insensitive.

Which is only logical, since they are modeled on email headers, and email header names are case insensitive.  So, the server in question is broken, yes, but that doesn't mean we can't provide a facility to allow Python to inter-operate with it.  Email, for example, preserves the case of the field names it parses or receives from the application program, but otherwise treats them case-insensitively.  However, since the current code coerces to title case, we have to provide this feature as a switchable facility defaulting to the current behavior, for backward compatibility reasons.

And someone needs to write a patch....
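
The behaviour of the email package described above can be seen with email.message.Message. A short Python 3 sketch:

```python
from email.message import Message

msg = Message()
msg["Content-MD5"] = "md5here"

# The name is stored exactly as the application supplied it
stored = msg.keys()

# Lookup ignores case
looked_up = msg["content-md5"]

print(stored)     # ['Content-MD5']
print(looked_up)  # md5here
```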
msg139578 - (view) Author: Cal Leeming (sleepycal) Date: 2011-07-01 13:21
It's fully understandable that the default won't change. I'll put this on my
todo list and write a patch in a week or two.
msg175484 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2012-11-13 01:40
The comment about urllib.request forcing .title() is consistent with 'Content-Length' and 'Content-Type' in the docs but puzzling and inconsistent given that in 3.3, header names are printed .capitalize()'ed and not .title()'ed and that has_header and get_header *require* the .capitalize() form and reject the .title() form.

import urllib.request
opener = urllib.request.build_opener()
request = urllib.request.Request("http://example.com/", headers =
        {"Content-Type": "application/x-www-form-urlencoded"})
opener.open(request, "1".encode("us-ascii"))
print(request.header_items(),
      request.has_header("Content-Type"),
      request.has_header("Content-type"),
      request.get_header("Content-Type"),
      request.get_header("Content-type"), sep='\n')
>>> 
[('Content-type', 'application/x-www-form-urlencoded'), ('Content-length', '1'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'example.com')]
False
True
None
application/x-www-form-urlencoded

Did .title in 2.7 urllib2 request get changed to .capitalize in 3.x urllib.request (without the examples in the doc being changed) or is request inconsistent within itself?

Cal did not post the 2.7 code exhibiting the problem, but when I add this code in 3.3, the output starts as shown.

request.add_header('Content-MD5', 'xxx')
print(request.header_items())
#
[('Content-md5', 'xxx'), ...

So is 3.3 sending 'Content-Md5' or 'Content-md5'?

My guess is the former, as urllib.request has the same single use of .title in .do_open as Cal quoted. The two files also have the same three uses of .capitalize in .add_header, .add_unredirected_header, and .do_request. So it seems that header names are normalized to .capitalize on entry and .title on sending, or something like that. Ugh. Is there any good justification for this?
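
The two stages can be observed without opening a connection. The sketch below assumes add_header() still applies capitalize(), as described above. The final step only imitates the title() call in do_open and does not invoke the real sending code.

```python
import urllib.request

req = urllib.request.Request("http://example.com/")
req.add_header("Content-MD5", "xxx")

# Stage 1: the name is normalized with capitalize() on entry
stored = req.header_items()
found_as_given = req.has_header("Content-MD5")
found_capitalized = req.has_header("Content-md5")

# Stage 2 (imitated): title() is applied to each name before sending
on_the_wire = {name.title(): value for name, value in stored}

print(stored)
print(found_as_given, found_capitalized)
print(on_the_wire)
```

Under that assumption the stored name is 'Content-md5', only the capitalized spelling is found by has_header(), and the imitated wire form is 'Content-Md5'.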

I do not see anything in the doc about headers names being normalized either way or about the requirements of has_/get_header. If the behavior were consistent and the same since forever, then I would say the current docs should be improved and a change would be an enhancement request. Since the behavior seems inconsistent, I am more inclined to think there is a bug.

I realize that this message expands the scope of the issue, but it is all about the handling of header names in requests.
msg183233 - (view) Author: karl (karlcow) * Date: 2013-02-28 21:13
Note that HTTP header fields are case-insensitive.
See http://tools.ietf.org/html/draft-ietf-httpbis-p1-messaging#section-3.2

   Each HTTP header field consists of a case-insensitive field name
   followed by a colon (":"), optional whitespace, and the field value.

Basically the author of a request can set them to whatever he/she wants. But we should, IMHO, respect the author intent. It might happen that someone will choose a specific combination of casing to deal with broken servers and/or proxies. So a cycle of set/get/send should not modify at all the headers.
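
A minimal sketch of a header store with that property: lookup is case-insensitive, but the spelling chosen by the author survives a set/get/send cycle. CasePreservingHeaders is a hypothetical class for illustration, not an existing urllib API.

```python
class CasePreservingHeaders:
    """Hypothetical store: case-insensitive lookup, original spelling kept."""

    def __init__(self):
        # lower-cased name -> (name as given by the author, value)
        self._items = {}

    def set(self, name, value):
        self._items[name.lower()] = (name, value)

    def get(self, name, default=None):
        entry = self._items.get(name.lower())
        return entry[1] if entry is not None else default

    def items(self):
        """Return (name, value) pairs with names exactly as they were set."""
        return list(self._items.values())


headers = CasePreservingHeaders()
headers.set("Content-MD5", "md5here")

print(headers.get("content-md5"))  # md5here
print(headers.items())             # [('Content-MD5', 'md5here')]
```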
msg183237 - (view) Author: karl (karlcow) * Date: 2013-02-28 21:47
So looking at the casing of headers, I discovered other issues. I opened another bug. http://bugs.python.org/issue17322
msg183362 - (view) Author: karl (karlcow) * Date: 2013-03-03 03:59
Are there issues related to removing the places where capitalize() and title() appear?

# title()

* http://hg.python.org/cpython/file/886df716cd09/Lib/urllib/request.py#l1239

# capitalize()

* http://hg.python.org/cpython/file/886df716cd09/Lib/urllib/request.py#l359
* http://hg.python.org/cpython/file/886df716cd09/Lib/urllib/request.py#l363
* http://hg.python.org/cpython/file/886df716cd09/Lib/urllib/request.py#l1206

Because the behavior is inconsistent, I would like to propose a patch removing them, so that the code is completely neutral with regard to header case.
msg183364 - (view) Author: karl (karlcow) * Date: 2013-03-03 04:33
The tests in http://hg.python.org/cpython/file/886df716cd09/Lib/test/test_wsgiref.py#l370 also check that everything is case-insensitive.

And the method to get the headers in wsgiref makes sure they are lower-cased before comparing:
http://hg.python.org/cpython/file/886df716cd09/Lib/wsgiref/headers.py#l82
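
For comparison, a short Python 3 sketch of that wsgiref behaviour: lookup lower-cases both sides, while the stored name is left as given.

```python
from wsgiref.headers import Headers

h = Headers([("Content-MD5", "md5here")])

# Lookup compares lower-cased names, so any spelling matches
value = h["content-md5"]

# The stored name keeps the case it was created with
names = h.keys()

print(value)  # md5here
print(names)  # ['Content-MD5']
```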
msg184807 - (view) Author: karl (karlcow) * Date: 2013-03-20 21:49
terry.reedy:


You said: "and that has_header and get_header *require* the .capitalize() form and reject the .title() form."

I made a patch for these two. 
See http://bugs.python.org/issue5550
msg220886 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-06-17 20:46
@Karl do you intend following up on this issue?
msg221400 - (view) Author: karl (karlcow) * Date: 2014-06-24 06:39
Mark,

I'm happy to follow up.
I would be in favor of removing any capitalization and leaving headers unchanged, whatever they are, because it doesn't matter per spec. Browsers do not care about the capitalization, and I haven't identified Web Compatibility issues regarding the capitalization.

That said, it seems that Cal (msg139512) had an issue. I would love to know which server/API had this behavior so I can file a bug at http://webcompat.com/

So…

Where do we stand? Feature or removing anything which modifies the capitalization of headers?
History
Date User Action Args
2014-06-24 06:39:01  karlcow  set  messages: + msg221400
2014-06-17 20:46:44  BreamoreBoy  set  nosy: + BreamoreBoy
    messages: + msg220886
2013-03-20 21:49:30  karlcow  set  messages: + msg184807
2013-03-03 04:33:23  karlcow  set  messages: + msg183364
2013-03-03 03:59:14  karlcow  set  messages: + msg183362
2013-02-28 21:47:58  karlcow  set  messages: + msg183237
2013-02-28 21:13:47  karlcow  set  nosy: + karlcow
    messages: + msg183233
2013-02-24 02:09:40  demian.brecht  set  nosy: + demian.brecht
2012-11-13 01:40:29  terry.reedy  set  nosy: + terry.reedy
    messages: + msg175484
    versions: + Python 3.4, - Python 3.3
2011-07-19 14:59:19  eric.araujo  set  nosy: + eric.araujo
2011-07-19 14:59:13  eric.araujo  set  files: - unnamed
2011-07-01 13:21:37  sleepycal  set  files: + unnamed
    messages: + msg139578
    nosy: + sleepycal
2011-07-01 07:45:43  r.david.murray  set  messages: + msg139551
2011-07-01 06:12:47  orsenthil  set  messages: + msg139547
2011-07-01 06:09:18  orsenthil  set  nosy: + orsenthil
    messages: + msg139546
2011-06-30 22:54:29  santa4nt  set  nosy: + santa4nt
2011-06-30 21:00:40  Cal.Leeming  set  messages: + msg139524
2011-06-30 20:53:45  r.david.murray  set  versions: + Python 3.3, - Python 2.7
    title: urllib2 Request() forces capitalize() on header names, breaking some requests -> urllib2 forces title() on header names, breaking some requests
    messages: + msg139523
    type: behavior -> enhancement
    stage: needs patch
2011-06-30 20:19:38  Cal.Leeming  set  messages: + msg139520
2011-06-30 20:17:20  Cal.Leeming  set  messages: + msg139519
2011-06-30 19:50:52  r.david.murray  set  messages: + msg139516
2011-06-30 19:39:52  Cal.Leeming  set  messages: + msg139515
2011-06-30 19:33:47  r.david.murray  set  nosy: + r.david.murray
    messages: + msg139514
2011-06-30 19:23:38  Cal.Leeming  create