classification
Title: httplib: unicode url will cause an ascii codec error when combined with a utf-8 string header
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: Bob.Chen, Herman Schistad, demian.brecht, iritkatriel, orsenthil, vstinner
Priority: normal Keywords: patch

Created on 2014-08-20 02:49 by Bob.Chen, last changed 2021-06-20 17:21 by iritkatriel. This issue is now closed.

Files
File name Uploaded Description Edit
httplib.py.patch Bob.Chen, 2015-01-04 06:17
Messages (11)
msg225553 - (view) Author: Bob Chen (Bob.Chen) * Date: 2014-08-20 02:49
Run the two scripts below to reproduce the problem.

If the URL you pass happens to be a unicode string (which is quite common in Python, since you may well have received it from some other API that returns unicode), and one of your headers contains a UTF-8 byte string encoded from non-ASCII text, such as u'呵呵'.encode('utf-8'), then an ASCII codec error occurs at:

File "/usr/lib/python2.7/httplib.py", line 808, in _send_output
    msg = "\r\n".join(self._buffer) 


# -*- encoding: utf-8 -*-
# should fail
import httplib, urllib
params = urllib.urlencode({'@number': 12524, '@type': 'issue', '@action': 'show'})
headers = {"Content-type": "application/x-www-form-urlencoded",
        "Accept": "text/plain", 'notes': u'呵呵'.encode('utf-8')}
conn = httplib.HTTPConnection(u"bugs.python.org")
conn.request("POST", u"http://bugs.python.org/any_url", params, headers)
response = conn.getresponse()
print response.status, response.reason



# -*- encoding: utf-8 -*-
# should be ok
import httplib, urllib
params = urllib.urlencode({'@number': 12524, '@type': 'issue', '@action': 'show'})
headers = {"Content-type": "application/x-www-form-urlencoded",
        "Accept": "text/plain", 'notes': u'呵呵'.encode('utf-8')}
conn = httplib.HTTPConnection(u"bugs.python.org")
conn.request("POST", "http://bugs.python.org/any_url", params, headers)
response = conn.getresponse()
print response.status, response.reason
msg225733 - (view) Author: Bob Chen (Bob.Chen) * Date: 2014-08-23 06:29
I personally suggest that httplib convert the URL and other elements to str at the beginning of the class init.
msg226011 - (view) Author: Bob Chen (Bob.Chen) * Date: 2014-08-28 06:20
This patch ensures that the URL is not unicode, so the 'join' does not raise an error when a UTF-8 byte string follows it.
msg226946 - (view) Author: Bob Chen (Bob.Chen) * Date: 2014-09-16 10:10
up...
msg231594 - (view) Author: Bob Chen (Bob.Chen) * Date: 2014-11-24 07:12
Could someone pick this up? It has been a long time...
msg233332 - (view) Author: Demian Brecht (demian.brecht) * (Python triager) Date: 2015-01-02 23:23
A few notes:

1. Unicode hosts are not automatically IDNA-encoded (which they /could/ be rather than relying on the programmer to be aware of this), but this really has no bearing on this specific issue
2. Unicode paths are not automatically IRI-encoded (see https://tools.ietf.org/html/rfc3987#section-3), which should also likely be automatically handled when unicode objects are encountered as the path
3. When a single unicode element is contained within a list, string_join will defer to PyUnicode_Join.

The problem here is that your pre-joined request elements look like this: [u'POST http://bugs.python.org/any_url HTTP/1.1', 'Host: bugs.python.org', 'Accept-Encoding: identity', 'Content-Length: 44', 'notes: \xe5\x91\xb5\xe5\x91\xb5', 'Content-type: application/x-www-form-urlencoded', 'Accept: text/plain', '', '']

Because there's a unicode object contained in the list at index 0, the entire list is converted to unicode, which results in the error when \xe5 is encountered by the ascii decoder.
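The promotion described above can be imitated on Python 3 for illustration. This is only a sketch: Python 3 refuses to join mixed text and bytes, so the helper `py2_style_join` (a hypothetical name, not part of httplib) performs the implicit ASCII decode by hand, which is an assumption about how closely this mirrors the 2.7 behaviour.

```python
# Simulation only: imitates Python 2's implicit promotion of byte strings
# to unicode when a single unicode element is present in the list.
def py2_style_join(sep, parts):
    """Join the way Python 2's str.join did for mixed str/unicode lists."""
    if any(isinstance(p, str) for p in parts):
        # Python 2 decoded every byte string with the default 'ascii' codec.
        promoted = [p if isinstance(p, str) else p.decode("ascii") for p in parts]
        return sep.join(promoted)
    # All elements are byte strings: nothing is promoted.
    return sep.encode("ascii").join(parts)


request_line = "POST http://bugs.python.org/any_url HTTP/1.1"  # text, like u'...'
header = "notes: 呵呵".encode("utf-8")                          # bytes, like a 2.7 str

try:
    py2_style_join("\r\n", [request_line, header])
except UnicodeDecodeError as exc:
    print("mixed list fails:", exc)

# With a byte-string request line the join succeeds.
print(py2_style_join("\r\n", [request_line.encode("ascii"), header]))
```

The first call fails on the \xe5 byte of the header, which matches the second script in the original report succeeding once the URL is a plain str.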

The proposed solution won't work: non-ASCII characters are legal in an IRI (see RFC 3987), and the patch will fail as soon as the URL contains anything outside the ASCII character set.

I think that the correct way to solve this issue is to automatically encode unicode paths (or IRIs) using urllib.quote, passing the reserved characters defined in RFC 3987 as the safe parameter:

>>> urllib.quote(u'/foo/呵/bar'.encode('utf-8'),':/?#[]@!$&\'()*+,;=')
'/foo/%E5%91%B5/bar'
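For comparison, a minimal sketch of the same call on Python 3, where the function lives in urllib.parse and encodes str input as UTF-8 by default. The helper name `quote_path` is hypothetical, and this only percent-encodes the path; it is not a complete RFC 3987 IRI-to-URI conversion.

```python
from urllib.parse import quote

# The same reserved characters that were passed as the safe parameter above.
RESERVED = ":/?#[]@!$&'()*+,;="


def quote_path(path):
    """Percent-encode non-ASCII characters, leaving reserved characters alone."""
    return quote(path, safe=RESERVED)


print(quote_path("/foo/呵/bar"))
```

Note that '%' is not in the safe set, so a path that is already percent-encoded would be encoded a second time.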
msg233390 - (view) Author: Bob Chen (Bob.Chen) * Date: 2015-01-04 05:49
Is there any possibility of wrapping urllib.quote inside httplib? Many developers don't know about this utility function. And as I mentioned above, they could have received a unicode URL from anywhere in Python, such as an API call, without noticing that it is potentially a problem.
msg233391 - (view) Author: Bob Chen (Bob.Chen) * Date: 2015-01-04 06:18
How about this patch?
msg233393 - (view) Author: Demian Brecht (demian.brecht) * (Python triager) Date: 2015-01-04 06:46
UTF-8 encoding is only one step in IRI encoding. Correct IRI encoding is nontrivial and doesn't fall under the support policy for 2.7 (bug/security fixes). I think that the best that can be done for 2.7 is to enhance the documentation around HTTPConnection.__init__ (unicode hostnames should be IDNA-encoded with the built-in IDNA encoder) and HTTPConnection.request/putrequest, noting that unicode paths should be IRI-encoded, with a link to RFC 3987.
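As a small illustration of the built-in IDNA encoder mentioned above, the sketch below encodes a hostname before it would be handed to a connection. The hostname is a made-up example, and the snippet is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical hostname, used for illustration only.
host = u"bücher.example"

# The 'idna' codec ships with the standard library; the result is an
# ASCII-only byte string.
encoded = host.encode("idna")
print(encoded)

# Plain ASCII hostnames pass through unchanged.
print(u"bugs.python.org".encode("idna"))
```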
msg284818 - (view) Author: Herman Schistad (Herman Schistad) Date: 2017-01-06 13:28
I can confirm that this patch solves the issue I've had: I can submit multipart forms when the URL is a str, but not when it is unicode.

I'm using Python 2.7.12. Applying the patch fixes the issue.

Code which breaks, assuming the file contains binary data:


# -*- encoding: utf-8 -*-
import urllib3
pool_manager = urllib3.PoolManager(num_pools=2)
url = u'http://example.org/form' # removing the 'u' fixes it
content = open('/some/binary/file').read()
fields = [
    ('foo', 'something'),
    ('bar', ('/some/binary/file', content, 'application/octet-stream'))
]
pool_manager.request("POST", url, fields=fields, encode_multipart=True, headers={})
msg396182 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-06-20 17:21
This looks like a 2.7-only issue.
History
Date  User  Action  Args
2021-06-20 17:21:42  iritkatriel  set  status: open -> closed; nosy: + iritkatriel; messages: + msg396182; resolution: out of date; stage: resolved
2017-01-06 13:28:39  Herman Schistad  set  nosy: + Herman Schistad; messages: + msg284818
2015-01-04 06:46:17  demian.brecht  set  messages: + msg233393
2015-01-04 06:18:26  Bob.Chen  set  messages: + msg233391
2015-01-04 06:17:04  Bob.Chen  set  files: + httplib.py.patch
2015-01-04 06:00:46  Bob.Chen  set  files: - httplib.py.patch
2015-01-04 05:49:47  Bob.Chen  set  messages: + msg233390
2015-01-02 23:23:08  demian.brecht  set  messages: + msg233332
2014-12-24 19:15:59  eric.araujo  set  nosy: + orsenthil
2014-11-24 07:13:51  Bob.Chen  set  type: crash -> behavior
2014-11-24 07:12:44  Bob.Chen  set  messages: + msg231594
2014-09-16 10:10:45  Bob.Chen  set  messages: + msg226946
2014-08-28 08:49:17  vstinner  set  nosy: + vstinner
2014-08-28 06:20:06  Bob.Chen  set  messages: + msg226011
2014-08-28 06:05:49  Bob.Chen  set  files: + httplib.py.patch; keywords: + patch
2014-08-23 06:29:54  Bob.Chen  set  messages: + msg225733
2014-08-20 21:53:05  demian.brecht  set  nosy: + demian.brecht
2014-08-20 02:49:37  Bob.Chen  create