Improve error message for http.client when posting unicode string #70233

EmilStenstrm · 2016-01-07T22:27:17Z

BPO	26045
Nosy	@gvanrossum, @vstinner, @ezio-melotti, @vadmium
Files	utfpatch.diff utfpatch.v2.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2016-02-08.12:19:44.280>
created_at = <Date 2016-01-07.22:27:17.326>
labels = ['type-feature', 'expert-unicode']
title = 'Improve error message for http.client when posting unicode string'
updated_at = <Date 2016-02-08.12:19:44.279>
user = 'https://bugs.python.org/EmilStenstrm'

bugs.python.org fields:

activity = <Date 2016-02-08.12:19:44.279>
actor = 'martin.panter'
assignee = 'none'
closed = True
closed_date = <Date 2016-02-08.12:19:44.280>
closer = 'martin.panter'
components = ['Unicode']
creation = <Date 2016-01-07.22:27:17.326>
creator = 'Emil Stenstr\xc3\xb6m'
dependencies = []
files = ['41537', '41768']
hgrepos = []
issue_num = 26045
keywords = ['patch']
message_count = 13.0
messages = ['257721', '257734', '257735', '257736', '257763', '257766', '257767', '257773', '257787', '257788', '259301', '259312', '259838']
nosy_count = 6.0
nosy_names = ['gvanrossum', 'vstinner', 'ezio.melotti', 'python-dev', 'martin.panter', 'Emil Stenstr\xc3\xb6m']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue26045'
versions = ['Python 3.5', 'Python 3.6']

EmilStenstrm · 2016-01-07T22:27:17Z

This issue is in response to this thread on python-ideas: https://mail.python.org/pipermail/python-ideas/2016-January/037678.html

Note that Cory did a lot of encoding background work here:
https://mail.python.org/pipermail/python-ideas/2016-January/037680.html

---
Bug description:

When posting an unencoded unicode string directly with python-requests you get the following stacktrace:

import requests
r = requests.post("http://example.com", data="Celebrate 🎉") 
...
  File "../lib/python3.4/http/client.py", line 1127, in _send_request
    body = body.encode('iso-8859-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 14-15: ordinal not in range(256)

This is because requests uses http.client, and http.client assumes the encoding to be latin-1 if given a unicode string. This is a very common source of bugs for beginners who assume sending in unicode would automatically encode it in utf-8, like in the libraries of many other languages.

The simplest fix here is to catch the UnicodeEncodeError and improve the error message to something that points beginners in the right direction.

Another option would be to:

Keep encoding in latin-1 first, and if that fails try utf-8

Other possible solutions (that would be backwards incompatible) includes:

Changing the default encoding to utf-8 instead of latin-1
Detect an unencoded unicode string and fail without encoding it with a descriptive error message

---

Just to show that this is a problem that exists in the wild, here are a few examples that all crashes on the same line in http.client (not all going through the requests library:

vadmium · 2016-01-08T01:53:20Z

For the record, this is what Requests sent when I passed a Latin-1-encodable string:

b'POST / HTTP/1.1\r\n'
b'Host: example.com\r\n'
b'Content-Length: 11\r\n'
b'Connection: keep-alive\r\n'
b'Accept: */*\r\n'
b'Accept-Encoding: gzip, deflate\r\n'
b'User-Agent: python-requests/2.9.1\r\n'
b'\r\n'
b'Celebrate \xa9'

There is no Content-Type header field, nor any indication of the encoding used. This is also how the lower-level HTTPConnection.request() method works.

The documentation already mentions that a text string gets encoded with ISO-8859-1 (a.k.a. Latin-1): <https://docs.python.org/3.3/library/http.client.html#http.client.HTTPConnection.request\>. How do you propose to improve the error message?

Encoding with either Latin-1 or UTF-8 depending on the characters sounds like a terrible idea. We may as well send the request without any body and pretend everything is okay. I don’t understand the point of changing to UTF-8 either. If you actually want UTF-8 encoded text, why not explicitly encode it yourself?

Failing for any unencoded text string would be a serious backwards compatibility problem. It would break the POST example using urlencode() at <https://docs.python.org/3/library/http.client.html#examples\> for instance.

IMO the Latin-1 encoding feature is a bad API design, maybe based on a misunderstanding of HTTP. Perhaps it would be more reasonable to deprecate the automatic Latin-1 encoding, and only allow ASCII characters in a text string. That would still cater for the urlencode() scenario in the POST example.

Of the links you posted, they seem to be different problems with separate solutions:

Requests bug 2838: Perhaps the user was trying to send URL-encoded form data. If so, textual fields should be UTF-8 encoded and then percent-encoded, resulting in only ASCII codes in the “data” argument. Python has urllib.parse.urlencode() which does this.

Requests bug 1822: It sounds like the user or a library intended to send UTF-8, so they should encode it themselves.

Stack Overflow: Custom web service needed fixing, and the user had to encode as UTF-8. This is a custom agreement between the client and server, it is not up to Python.

Ebay: I’m not familiar with any Ebay API and it is not clear from the post, but I suspect the user wasn’t encoding their data properly. Maybe similar to the first case.

For the rest it is not clear what the problem or solution was. Some of them sound like they were somehow sending text when they really wanted to send arbitrary bytes, in which case UTF-8 is not going to help.

gvanrossum · 2016-01-08T02:22:45Z

Any solution that encodes Unicode in a way that works for some characters but fails for others has the same problem that Unicode had in Python 3. Unfortunately we're stuck with such a solution (Latin-1) and for backwards compatibility reasons we can't change it. If we were to deprecate it, we should warn for *any* data given as a Unicode string, even if it's plain ASCII (even if it's an empty string :-).

But even if we don't deprecate it, we can still change the text of the error message (but not the type of the exception used) to be more clear.

Can we please start drafting a suitable error message here?

vadmium · 2016-01-08T03:16:10Z

After reading through the linked thread, there are a few error message proposals:

Guido: "use data.encode('utf-8') if you want the data to be encoded in UTF-8". (Then of course the server might not like it.)

Andrew Barnert: A UnicodeEncodeError (or subclass of it?) with text like "HTTP body without encoding defaults to 'latin-1', which can't encode character '\u5555' in position 30: ordinal not in range(256)")

Paul Moore: Encode as ASCII and catch UnicodeEncodeError and re-raise as a TypeError "Unicode string supplied without an explicit encoding".

Emil, do you think any of these would help?

EmilStenstrm · 2016-01-08T16:04:13Z

I think changing the error message is enough for the short term, and deprecation of automatic encoding is the correct way in the long term.

A text that mention "utf-8" which will likely be the correct solution definitely gets my vote, so Guidos suggestion sounds good to me:

UnicodeEncodeError("Use data.encode('utf-8') if you want the data to be encoded in UTF-8")

Andrew's and Pauls suggestions doesn't point to a solution to the problem, which I think is a great think for something this basic. Also, the error message only gets shown when latin-1 fails, so we can't use text that speaks about "no encoding" in general.

gvanrossum · 2016-01-08T16:43:24Z

Here's a patch. I noticed there are lots of other places where a similar encoding() call exists -- I wrapped them all using a helper function. Please review carefully.

gvanrossum · 2016-01-08T16:45:10Z

BTW the error and traceback will look something like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/guido/src/cpython/Lib/http/client.py", line 1138, in _send_request
    self.putheader(hdr, value)
  File "/Users/guido/src/cpython/Lib/http/client.py", line 1062, in putheader
    header = _encode(header, 'ascii', 'header')
  File "/Users/guido/src/cpython/Lib/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'ascii' codec can't encode character '\u1234' in position 3: Header ('ሴ') is not valid Latin-1. Use header.encode('utf-8') if you want to send it encoded in UTF-8.

gvanrossum · 2016-01-08T18:58:19Z

I think this would be okay for 3.5.2 as well.

vadmium · 2016-01-09T00:02:27Z

Personally I am skeptical if suggesting UTF-8 for the body data is a good idea, but I can go along with it, since other people want it. But I do strongly question whether it is right to suggest UTF-8 for header fields. The RFC <https://tools.ietf.org/html/rfc7230#page-26\> only mentions ASCII and Latin-1.

Newer protocols based on HTTP (RTSP comes to mind) do specify UTF-8 for the header, but that is probably out of scope of both the HTTP module and beginner-targetted errors.

If I were redoing this patch, I would drop all the changes except at the body.encode() line in Emil’s original post.

gvanrossum · 2016-01-09T00:08:28Z

Martin, please make a patch along those lines! The only reason I generalized this to headers is that one of the three Requests issues referenced in the original post seemed to be about a header value (https://github.com/kennethreitz/requests/issues/1926). But that one seems different than the other two anyways, and it's about Python 3.7 so it wouldn't be helped by anything we're doing here anyways.

vadmium · 2016-01-31T21:56:57Z

Here is my cut down version of Guido’s patch. Now it only adds the message when someone passes a text string as the HTTPConnection.request(body=...) parameter:

>>> c.request("POST", "", body="Celebrate \U0001F389")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/proj/python/cpython/Lib/http/client.py", line 1098, in request
    self._send_request(method, url, body, headers)
  File "/home/proj/python/cpython/Lib/http/client.py", line 1142, in _send_request
    body = _encode(body, 'body')
  File "/home/proj/python/cpython/Lib/http/client.py", line 161, in _encode
    (name.title(), data[err.start:err.end], name)) from None
UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f389' in position 10: Body ('🎉') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

What do people think?

gvanrossum · 2016-02-01T04:49:20Z

LGTM.

python-dev · 2016-02-08T11:57:19Z

New changeset 966bd147ccb5 by Martin Panter in branch '3.5':
Issue bpo-26045: Add UTF-8 suggestion to error in http.client
https://hg.python.org/cpython/rev/966bd147ccb5

New changeset 9896ead3cc1d by Martin Panter in branch 'default':
Issue bpo-26045: Merge http.client error addition from 3.5
https://hg.python.org/cpython/rev/9896ead3cc1d

EmilStenstrm mannequin added topic-unicode type-feature A feature request or enhancement labels Jan 7, 2016

vadmium closed this as completed Feb 8, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve error message for http.client when posting unicode string #70233

Improve error message for http.client when posting unicode string #70233

EmilStenstrm mannequin commented Jan 7, 2016

EmilStenstrm mannequin commented Jan 7, 2016

vadmium commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

vadmium commented Jan 8, 2016

EmilStenstrm mannequin commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

vadmium commented Jan 9, 2016

gvanrossum commented Jan 9, 2016

vadmium commented Jan 31, 2016

gvanrossum commented Feb 1, 2016

python-dev mannequin commented Feb 8, 2016

Improve error message for http.client when posting unicode string #70233

Improve error message for http.client when posting unicode string #70233

Comments

EmilStenstrm mannequin commented Jan 7, 2016

EmilStenstrm mannequin commented Jan 7, 2016

vadmium commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

vadmium commented Jan 8, 2016

EmilStenstrm mannequin commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

gvanrossum commented Jan 8, 2016

vadmium commented Jan 9, 2016

gvanrossum commented Jan 9, 2016

vadmium commented Jan 31, 2016

gvanrossum commented Feb 1, 2016

python-dev mannequin commented Feb 8, 2016