Message 257734 - Python tracker


Author	martin.panter
Recipients	Emil Stenström, ezio.melotti, gvanrossum, martin.panter, vstinner
Date	2016-01-08.01:53:19
Message-id	<1452218002.02.0.686677523178.issue26045@psf.upfronthosting.co.za>
In-reply-to

Content

For the record, this is what Requests sent when I passed a Latin-1-encodable string:

b'POST / HTTP/1.1\r\n'
b'Host: example.com\r\n'
b'Content-Length: 11\r\n'
b'Connection: keep-alive\r\n'
b'Accept: */*\r\n'
b'Accept-Encoding: gzip, deflate\r\n'
b'User-Agent: python-requests/2.9.1\r\n'
b'\r\n'
b'Celebrate \xa9'

There is no Content-Type header field, nor any indication of the encoding used. This is also how the lower-level HTTPConnection.request() method works.

The documentation already mentions that a text string gets encoded with ISO-8859-1 (a.k.a. Latin-1): <https://docs.python.org/3.3/library/http.client.html#http.client.HTTPConnection.request>. How do you propose to improve the error message?
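To make the documented behaviour concrete, here is a minimal sketch of the encoding step alone. It does not call http.client or Requests; it only applies str.encode() the way the documentation describes, and the snowman character is just an arbitrary example of text outside Latin-1.

```python
# Sketch only: reproduces the ISO-8859-1 encoding step, not the library code.
text = "Celebrate \u00a9"  # U+00A9 (copyright sign) fits in Latin-1

latin1_body = text.encode("iso-8859-1")
utf8_body = text.encode("utf-8")

print(latin1_body, len(latin1_body))  # b'Celebrate \xa9' 11
print(utf8_body, len(utf8_body))      # b'Celebrate \xc2\xa9' 12

# A character outside Latin-1 cannot be encoded and raises an error.
latin1_error = None
try:
    "Celebrate \u2603".encode("iso-8859-1")  # U+2603 is an arbitrary example
except UnicodeEncodeError as exc:
    latin1_error = exc
    print(type(exc).__name__)
```

The 11-byte Latin-1 result matches the Content-Length: 11 in the request dump, whereas the same text in UTF-8 would be 12 bytes.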

Encoding with either Latin-1 or UTF-8 depending on the characters sounds like a terrible idea. We may as well send the request without any body and pretend everything is okay. I don’t understand the point of changing to UTF-8 either. If you actually want UTF-8 encoded text, why not explicitly encode it yourself?
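Encoding explicitly could look something like the sketch below. The helper name build_utf8_request and the choice of text/plain are assumptions for illustration; the point is only that the caller picks the encoding and declares it, so nothing is left for the library to guess.

```python
def build_utf8_request(text):
    """Hypothetical helper: encode text as UTF-8 and label the encoding."""
    body = text.encode("utf-8")
    # Assumed media type; a real service may expect something else.
    headers = {"Content-Type": "text/plain; charset=utf-8"}
    return body, headers


body, headers = build_utf8_request("Celebrate \u00a9")
print(body)     # b'Celebrate \xc2\xa9'
print(headers)

# The pair could then be passed on as bytes, for example:
#   conn.request("POST", "/", body=body, headers=headers)
# where conn is an http.client.HTTPConnection.
```

Because the body is already a bytes object, no automatic text encoding is involved.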

Failing for any unencoded text string would be a serious backwards compatibility problem. It would break the POST example using urlencode() at <https://docs.python.org/3/library/http.client.html#examples> for instance.

IMO the Latin-1 encoding feature is bad API design, maybe based on a misunderstanding of HTTP. Perhaps it would be more reasonable to deprecate the automatic Latin-1 encoding, and only allow ASCII characters in a text string. That would still cater for the urlencode() scenario in the POST example.
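A rough sketch of what an ASCII-only rule might look like follows. This is not how http.client behaves today, and encode_text_body is a made-up name; the form fields are also invented rather than copied from the documentation example.

```python
from urllib.parse import urlencode


def encode_text_body(body):
    """Hypothetical rule: pass bytes through, accept str only if ASCII."""
    if isinstance(body, str):
        # Strict ASCII: raises UnicodeEncodeError for anything else.
        return body.encode("ascii")
    return body


# urlencode() output is plain ASCII, so this case keeps working.
form = urlencode({"number": 12524, "type": "issue"})
print(encode_text_body(form))  # b'number=12524&type=issue'

# Non-ASCII text is rejected instead of being silently Latin-1 encoded.
ascii_error = None
try:
    encode_text_body("Celebrate \u00a9")
except UnicodeEncodeError as exc:
    ascii_error = exc
```

Callers with non-ASCII text would then have to choose an encoding and pass bytes themselves.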

The links you posted seem to describe different problems with separate solutions:

Requests bug 2838: Perhaps the user was trying to send URL-encoded form data. If so, textual fields should be UTF-8 encoded and then percent-encoded, resulting in only ASCII codes in the “data” argument. Python has urllib.parse.urlencode() which does this.

Requests bug 1822: It sounds like the user or a library intended to send UTF-8, so they should encode it themselves.

Stack Overflow: The custom web service needed fixing, and the user had to encode as UTF-8. This is a custom agreement between the client and server; it is not up to Python.

eBay: I’m not familiar with any eBay API and it is not clear from the post, but I suspect the user wasn’t encoding their data properly. Maybe similar to the first case.

For the rest it is not clear what the problem or solution was. Some of them sound like they were somehow sending text when they really wanted to send arbitrary bytes, in which case UTF-8 is not going to help.
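For the URL-encoded form data case (as I guess applies to Requests bug 2838), a small sketch of what urllib.parse.urlencode() produces. The field name "message" is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

fields = {"message": "Celebrate \u00a9"}  # hypothetical form field

# urlencode() encodes text as UTF-8 by default, then percent-encodes it.
encoded = urlencode(fields)
print(encoded)  # message=Celebrate+%C2%A9

# The result contains only ASCII codes, so it is safe to send as-is.
data = encoded.encode("ascii")

# Decoding on the other side recovers the original text.
decoded = parse_qs(encoded)
print(decoded)  # {'message': ['Celebrate ©']}
```

Such a body would normally go out with Content-Type: application/x-www-form-urlencoded, as in the documentation's POST example.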

History
Date	User	Action	Args
2016-01-08 01:53:22	martin.panter	set	recipients: + martin.panter, gvanrossum, vstinner, ezio.melotti, Emil Stenström
2016-01-08 01:53:22	martin.panter	set	messageid: <1452218002.02.0.686677523178.issue26045@psf.upfronthosting.co.za>
2016-01-08 01:53:21	martin.panter	link	issue26045 messages
2016-01-08 01:53:19	martin.panter	create