classification
Title: Improve error message for http.client when posting unicode string
Type: enhancement Stage: resolved
Components: Unicode Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Emil Stenström, ezio.melotti, gvanrossum, martin.panter, python-dev, vstinner
Priority: normal Keywords: patch

Created on 2016-01-07 22:27 by Emil Stenström, last changed 2016-02-08 12:19 by martin.panter. This issue is now closed.

Files
File name Uploaded Description Edit
utfpatch.diff gvanrossum, 2016-01-08 16:43 review
utfpatch.v2.diff martin.panter, 2016-01-31 21:56 review
Messages (13)
msg257721 - (view) Author: Emil Stenström (Emil Stenström) Date: 2016-01-07 22:27
This issue is in response to this thread on python-ideas: https://mail.python.org/pipermail/python-ideas/2016-January/037678.html

Note that Cory did a lot of encoding background work here:
https://mail.python.org/pipermail/python-ideas/2016-January/037680.html

---
Bug description:

When posting an unencoded unicode string directly with python-requests you get the following stacktrace:

import requests
r = requests.post("http://example.com", data="Celebrate 🎉") 
...
  File "../lib/python3.4/http/client.py", line 1127, in _send_request
    body = body.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 14-15: ordinal not in range(256) 

This is because requests uses http.client, and http.client assumes the encoding to be latin-1 if given a unicode string. This is a very common source of bugs for beginners who assume sending in unicode would automatically encode it in utf-8, like in the libraries of many other languages.
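A minimal sketch of the failure and of the explicit-encoding workaround, using only `str.encode()` so that no network access is needed. The variable names are illustrative, and the use of UTF-8 assumes the receiving server actually expects UTF-8:

```python
# The emoji is written as an escape so the snippet is plain ASCII.
body = "Celebrate \U0001F389"

# This mirrors what http.client does with a text body: encode it as
# ISO-8859-1 (Latin-1), which cannot represent the emoji.
try:
    body.encode("iso-8859-1")
    failed_span = None
except UnicodeEncodeError as err:
    # err.start and err.end delimit the offending characters.
    failed_span = (err.start, err.end)

# Workaround: encode explicitly and pass bytes instead of str.
# Assumption: the server expects UTF-8.
encoded = body.encode("utf-8")
```

Passing `encoded` (bytes) as the request body bypasses the implicit Latin-1 step, because bytes are sent unchanged.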

The simplest fix here is to catch the UnicodeEncodeError and improve the error message to something that points beginners in the right direction.

Another option would be to:
- Keep encoding in latin-1 first, and if that fails try utf-8

Other possible solutions (that would be backwards incompatible) include:
- Changing the default encoding to utf-8 instead of latin-1
- Detect an unencoded unicode string and fail without encoding it with a descriptive error message

---

Just to show that this is a problem that exists in the wild, here are a few examples that all crash on the same line in http.client (not all going through the requests library):

- https://github.com/kennethreitz/requests/issues/2838
- https://github.com/kennethreitz/requests/issues/1822
- http://stackoverflow.com/questions/34618149/post-unicode-string-to-web-service-using-python-requests-library
- https://www.reddit.com/r/learnpython/comments/3violw/unicodeencodeerror_when_searching_ebay_with/
- https://github.com/codecov/codecov-python/issues/35
- https://github.com/google/google-api-python-client/issues/145
- https://bugs.launchpad.net/ubuntu/+source/lazr.restfulclient/+bug/1414063
msg257734 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-08 01:53
For the record, this is what Requests sent when I passed a Latin-1-encodable string:

b'POST / HTTP/1.1\r\n'
b'Host: example.com\r\n'
b'Content-Length: 11\r\n'
b'Connection: keep-alive\r\n'
b'Accept: */*\r\n'
b'Accept-Encoding: gzip, deflate\r\n'
b'User-Agent: python-requests/2.9.1\r\n'
b'\r\n'
b'Celebrate \xa9'

There is no Content-Type header field, nor any indication of the encoding used. This is also how the lower-level HTTPConnection.request() method works.

The documentation already mentions that a text string gets encoded with ISO-8859-1 (a.k.a. Latin-1): <https://docs.python.org/3.3/library/http.client.html#http.client.HTTPConnection.request>. How do you propose to improve the error message?

Encoding with either Latin-1 or UTF-8 depending on the characters sounds like a terrible idea. We may as well send the request without any body and pretend everything is okay. I don’t understand the point of changing to UTF-8 either. If you actually want UTF-8 encoded text, why not explicitly encode it yourself?

Failing for any unencoded text string would be a serious backwards compatibility problem. It would break the POST example using urlencode() at <https://docs.python.org/3/library/http.client.html#examples> for instance.

IMO the Latin-1 encoding feature is a bad API design, maybe based on a misunderstanding of HTTP. Perhaps it would be more reasonable to deprecate the automatic Latin-1 encoding, and only allow ASCII characters in a text string. That would still cater for the urlencode() scenario in the POST example.

Of the links you posted, they seem to be different problems with separate solutions:

Requests bug 2838: Perhaps the user was trying to send URL-encoded form data. If so, textual fields should be UTF-8 encoded and then percent-encoded, resulting in only ASCII codes in the “data” argument. Python has urllib.parse.urlencode() which does this.

Requests bug 1822: It sounds like the user or a library intended to send UTF-8, so they should encode it themselves.

Stack Overflow: Custom web service needed fixing, and the user had to encode as UTF-8. This is a custom agreement between the client and server, it is not up to Python.

Ebay: I’m not familiar with any Ebay API and it is not clear from the post, but I suspect the user wasn’t encoding their data properly. Maybe similar to the first case.

For the rest it is not clear what the problem or solution was. Some of them sound like they were somehow sending text when they really wanted to send arbitrary bytes, in which case UTF-8 is not going to help.
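A small sketch of the URL-encoded form data case mentioned above, assuming a hypothetical form field named `q`. `urllib.parse.urlencode()` encodes the text as UTF-8 by default and then percent-encodes it, so the result contains only ASCII characters and the implicit Latin-1 step cannot fail:

```python
from urllib.parse import urlencode

# Hypothetical form field; the emoji is written as an escape.
form = {"q": "Celebrate \U0001F389"}

# urlencode() UTF-8 encodes each value, then percent-encodes the bytes.
data = urlencode(form)

# The result is pure ASCII, so ASCII and Latin-1 encoding give the
# same bytes and neither can raise UnicodeEncodeError.
as_ascii = data.encode("ascii")
as_latin1 = data.encode("iso-8859-1")
```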
msg257735 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-08 02:22
Any solution that encodes Unicode in a way that works for some characters but fails for others has the same problem that Unicode had in Python 2. Unfortunately we're stuck with such a solution (Latin-1) and for backwards compatibility reasons we can't change it. If we were to deprecate it, we should warn for *any* data given as a Unicode string, even if it's plain ASCII (even if it's an empty string :-).

But even if we don't deprecate it, we can still change the text of the error message (but not the type of the exception used) to be more clear.

Can we please start drafting a suitable error message here?
msg257736 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-08 03:16
After reading through the linked thread, there are a few error message proposals:

Guido: "use data.encode('utf-8') if you want the data to be encoded in UTF-8". (Then of course the server might not like it.)

Andrew Barnert: A UnicodeEncodeError (or subclass of it?) with text like "HTTP body without encoding defaults to 'latin-1', which can't encode character '\u5555' in position 30: ordinal not in range(256)"

Paul Moore: Encode as ASCII and catch UnicodeEncodeError and re-raise as a TypeError "Unicode string supplied without an explicit encoding".

Emil, do you think any of these would help?
msg257763 - (view) Author: Emil Stenström (Emil Stenström) Date: 2016-01-08 16:04
I think changing the error message is enough for the short term, and deprecation of automatic encoding is the correct way in the long term.

A text that mentions "utf-8", which will likely be the correct solution, definitely gets my vote, so Guido's suggestion sounds good to me:

UnicodeEncodeError("Use data.encode('utf-8') if you want the data to be encoded in UTF-8")

Andrew's and Paul's suggestions don't point to a solution to the problem, and I think pointing to one is a great thing for something this basic. Also, the error message only gets shown when latin-1 fails, so we can't use text that speaks about "no encoding" in general.
msg257766 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-08 16:43
Here's a patch. I noticed there are lots of other places where a similar encode() call exists -- I wrapped them all using a helper function. Please review carefully.
msg257767 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-08 16:45
BTW the error and traceback will look something like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/guido/src/cpython/Lib/http/client.py", line 1138, in _send_request
    self.putheader(hdr, value)
  File "/Users/guido/src/cpython/Lib/http/client.py", line 1062, in putheader
    header = _encode(header, 'ascii', 'header')
  File "/Users/guido/src/cpython/Lib/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'ascii' codec can't encode character '\u1234' in position 3: Header ('ሴ') is not valid Latin-1. Use header.encode('utf-8') if you want to send it encoded in UTF-8.
msg257773 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-08 18:58
I think this would be okay for 3.5.2 as well.
msg257787 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-09 00:02
Personally I am skeptical whether suggesting UTF-8 for the body data is a good idea, but I can go along with it, since other people want it. But I do strongly question whether it is right to suggest UTF-8 for header fields. The RFC <https://tools.ietf.org/html/rfc7230#page-26> only mentions ASCII and Latin-1.

Newer protocols based on HTTP (RTSP comes to mind) do specify UTF-8 for the header, but that is probably out of scope of both the HTTP module and beginner-targeted errors.

If I were redoing this patch, I would drop all the changes except at the body.encode() line in Emil’s original post.
msg257788 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-01-09 00:08
Martin, please make a patch along those lines! The only reason I generalized this to headers is that one of the three Requests issues referenced in the original post seemed to be about a header value (https://github.com/kennethreitz/requests/issues/1926). But that one seems different than the other two anyways, and it's about Python 2.7 so it wouldn't be helped by anything we're doing here anyways.
msg259301 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-31 21:56
Here is my cut down version of Guido’s patch. Now it only adds the message when someone passes a text string as the HTTPConnection.request(body=...) parameter:

>>> c.request("POST", "", body="Celebrate \U0001F389")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/http/client.py", line 1098, in request
    self._send_request(method, url, body, headers)
  File "/home/proj/python/cpython/Lib/http/client.py", line 1142, in _send_request
    body = _encode(body, 'body')
  File "/home/proj/python/cpython/Lib/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f389' in position 10: Body ('🎉') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

What do people think?
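For readers without the diff at hand, here is a rough sketch of the kind of helper the traceback above implies. It is reconstructed from the traceback only and is not the committed patch; the function name `encode_with_hint` and its default argument are hypothetical:

```python
def encode_with_hint(data, name="data"):
    """Encode text as Latin-1; on failure, re-raise with a UTF-8 hint."""
    try:
        return data.encode("latin-1")
    except UnicodeEncodeError as err:
        # Same exception type and position info, but a more helpful reason.
        raise UnicodeEncodeError(
            err.encoding,
            err.object,
            err.start,
            err.end,
            "%s (%r) is not valid Latin-1. Use %s.encode('utf-8') "
            "if you want to send it encoded in UTF-8."
            % (name.title(), data[err.start:err.end], name),
        ) from None
```

Keeping the exception type as UnicodeEncodeError means existing `except` clauses keep working; only the reason text changes.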
msg259312 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2016-02-01 04:49
LGTM.
msg259838 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-02-08 11:57
New changeset 966bd147ccb5 by Martin Panter in branch '3.5':
Issue #26045: Add UTF-8 suggestion to error in http.client
https://hg.python.org/cpython/rev/966bd147ccb5

New changeset 9896ead3cc1d by Martin Panter in branch 'default':
Issue #26045: Merge http.client error addition from 3.5
https://hg.python.org/cpython/rev/9896ead3cc1d
History
Date User Action Args
2016-02-08 12:19:44 martin.panter set status: open -> closed
resolution: fixed
stage: patch review -> resolved
2016-02-08 11:57:19 python-dev set nosy: + python-dev
messages: + msg259838
2016-02-01 04:49:19 gvanrossum set messages: + msg259312
2016-01-31 21:56:58 martin.panter set files: + utfpatch.v2.diff
messages: + msg259301
2016-01-09 00:08:28 gvanrossum set messages: + msg257788
2016-01-09 00:02:27 martin.panter set messages: + msg257787
stage: patch review
2016-01-08 18:58:18 gvanrossum set messages: + msg257773
versions: + Python 3.5
2016-01-08 18:48:00 terry.reedy set versions: - Python 3.2, Python 3.3, Python 3.4, Python 3.5
2016-01-08 16:45:10 gvanrossum set messages: + msg257767
2016-01-08 16:43:24 gvanrossum set files: + utfpatch.diff
keywords: + patch
messages: + msg257766
2016-01-08 16:04:13 Emil Stenström set messages: + msg257763
2016-01-08 03:16:10 martin.panter set messages: + msg257736
2016-01-08 02:22:45 gvanrossum set messages: + msg257735
2016-01-08 01:53:21 martin.panter set nosy: + martin.panter
messages: + msg257734
2016-01-07 23:12:04 gvanrossum set nosy: + gvanrossum
2016-01-07 22:27:17 Emil Stenström create