Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error message for http.client when posting unicode string #70233

Closed
EmilStenstrm mannequin opened this issue Jan 7, 2016 · 13 comments
Closed

Improve error message for http.client when posting unicode string #70233

EmilStenstrm mannequin opened this issue Jan 7, 2016 · 13 comments
Labels
topic-unicode type-feature A feature request or enhancement

Comments

@EmilStenstrm
Copy link
Mannequin

EmilStenstrm mannequin commented Jan 7, 2016

BPO 26045
Nosy @gvanrossum, @vstinner, @ezio-melotti, @vadmium
Files
  • utfpatch.diff
  • utfpatch.v2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2016-02-08.12:19:44.280>
    created_at = <Date 2016-01-07.22:27:17.326>
    labels = ['type-feature', 'expert-unicode']
    title = 'Improve error message for http.client when posting unicode string'
    updated_at = <Date 2016-02-08.12:19:44.279>
    user = 'https://bugs.python.org/EmilStenstrm'

    bugs.python.org fields:

    activity = <Date 2016-02-08.12:19:44.279>
    actor = 'martin.panter'
    assignee = 'none'
    closed = True
    closed_date = <Date 2016-02-08.12:19:44.280>
    closer = 'martin.panter'
    components = ['Unicode']
    creation = <Date 2016-01-07.22:27:17.326>
    creator = 'Emil Stenstr\xc3\xb6m'
    dependencies = []
    files = ['41537', '41768']
    hgrepos = []
    issue_num = 26045
    keywords = ['patch']
    message_count = 13.0
    messages = ['257721', '257734', '257735', '257736', '257763', '257766', '257767', '257773', '257787', '257788', '259301', '259312', '259838']
    nosy_count = 6.0
    nosy_names = ['gvanrossum', 'vstinner', 'ezio.melotti', 'python-dev', 'martin.panter', 'Emil Stenstr\xc3\xb6m']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue26045'
    versions = ['Python 3.5', 'Python 3.6']

    @EmilStenstrm
    Copy link
    Mannequin Author

    EmilStenstrm mannequin commented Jan 7, 2016

    This issue is in response to this thread on python-ideas: https://mail.python.org/pipermail/python-ideas/2016-January/037678.html

    Note that Cory did a lot of encoding background work here:
    https://mail.python.org/pipermail/python-ideas/2016-January/037680.html

    ---
    Bug description:

    When posting an unencoded unicode string directly with python-requests you get the following stacktrace:

    import requests
    r = requests.post("http://example.com", data="Celebrate 🎉") 
    ...
      File "../lib/python3.4/http/client.py", line 1127, in _send_request
        body = body.encode('iso-8859-1')
    UnicodeEncodeError: 'latin-1' codec can't encode characters in position 14-15: ordinal not in range(256) 

    This is because requests uses http.client, and http.client assumes the encoding to be latin-1 if given a unicode string. This is a very common source of bugs for beginners who assume sending in unicode would automatically encode it in utf-8, like in the libraries of many other languages.

    The simplest fix here is to catch the UnicodeEncodeError and improve the error message to something that points beginners in the right direction.

    Another option would be to:

    • Keep encoding in latin-1 first, and if that fails try utf-8

    Other possible solutions (that would be backwards incompatible) includes:

    • Changing the default encoding to utf-8 instead of latin-1
    • Detect an unencoded unicode string and fail without encoding it with a descriptive error message

    ---

    Just to show that this is a problem that exists in the wild, here are a few examples that all crashes on the same line in http.client (not all going through the requests library:

    @EmilStenstrm EmilStenstrm mannequin added topic-unicode type-feature A feature request or enhancement labels Jan 7, 2016
    @vadmium
    Copy link
    Member

    vadmium commented Jan 8, 2016

    For the record, this is what Requests sent when I passed a Latin-1-encodable string:

    b'POST / HTTP/1.1\r\n'
    b'Host: example.com\r\n'
    b'Content-Length: 11\r\n'
    b'Connection: keep-alive\r\n'
    b'Accept: */*\r\n'
    b'Accept-Encoding: gzip, deflate\r\n'
    b'User-Agent: python-requests/2.9.1\r\n'
    b'\r\n'
    b'Celebrate \xa9'

    There is no Content-Type header field, nor any indication of the encoding used. This is also how the lower-level HTTPConnection.request() method works.

    The documentation already mentions that a text string gets encoded with ISO-8859-1 (a.k.a. Latin-1): <https://docs.python.org/3.3/library/http.client.html#http.client.HTTPConnection.request\>. How do you propose to improve the error message?

    Encoding with either Latin-1 or UTF-8 depending on the characters sounds like a terrible idea. We may as well send the request without any body and pretend everything is okay. I don’t understand the point of changing to UTF-8 either. If you actually want UTF-8 encoded text, why not explicitly encode it yourself?

    Failing for any unencoded text string would be a serious backwards compatibility problem. It would break the POST example using urlencode() at <https://docs.python.org/3/library/http.client.html#examples\> for instance.

    IMO the Latin-1 encoding feature is a bad API design, maybe based on a misunderstanding of HTTP. Perhaps it would be more reasonable to deprecate the automatic Latin-1 encoding, and only allow ASCII characters in a text string. That would still cater for the urlencode() scenario in the POST example.

    Of the links you posted, they seem to be different problems with separate solutions:

    Requests bug 2838: Perhaps the user was trying to send URL-encoded form data. If so, textual fields should be UTF-8 encoded and then percent-encoded, resulting in only ASCII codes in the “data” argument. Python has urllib.parse.urlencode() which does this.

    Requests bug 1822: It sounds like the user or a library intended to send UTF-8, so they should encode it themselves.

    Stack Overflow: Custom web service needed fixing, and the user had to encode as UTF-8. This is a custom agreement between the client and server, it is not up to Python.

    Ebay: I’m not familiar with any Ebay API and it is not clear from the post, but I suspect the user wasn’t encoding their data properly. Maybe similar to the first case.

    For the rest it is not clear what the problem or solution was. Some of them sound like they were somehow sending text when they really wanted to send arbitrary bytes, in which case UTF-8 is not going to help.

    @gvanrossum
    Copy link
    Member

    Any solution that encodes Unicode in a way that works for some characters but fails for others has the same problem that Unicode had in Python 3. Unfortunately we're stuck with such a solution (Latin-1) and for backwards compatibility reasons we can't change it. If we were to deprecate it, we should warn for *any* data given as a Unicode string, even if it's plain ASCII (even if it's an empty string :-).

    But even if we don't deprecate it, we can still change the text of the error message (but not the type of the exception used) to be more clear.

    Can we please start drafting a suitable error message here?

    @vadmium
    Copy link
    Member

    vadmium commented Jan 8, 2016

    After reading through the linked thread, there are a few error message proposals:

    Guido: "use data.encode('utf-8') if you want the data to be encoded in UTF-8". (Then of course the server might not like it.)

    Andrew Barnert: A UnicodeEncodeError (or subclass of it?) with text like "HTTP body without encoding defaults to 'latin-1', which can't encode character '\u5555' in position 30: ordinal not in range(256)")

    Paul Moore: Encode as ASCII and catch UnicodeEncodeError and re-raise as a TypeError "Unicode string supplied without an explicit encoding".

    Emil, do you think any of these would help?

    @EmilStenstrm
    Copy link
    Mannequin Author

    EmilStenstrm mannequin commented Jan 8, 2016

    I think changing the error message is enough for the short term, and deprecation of automatic encoding is the correct way in the long term.

    A text that mention "utf-8" which will likely be the correct solution definitely gets my vote, so Guidos suggestion sounds good to me:

    UnicodeEncodeError("Use data.encode('utf-8') if you want the data to be encoded in UTF-8")

    Andrew's and Pauls suggestions doesn't point to a solution to the problem, which I think is a great think for something this basic. Also, the error message only gets shown when latin-1 fails, so we can't use text that speaks about "no encoding" in general.

    @gvanrossum
    Copy link
    Member

    Here's a patch. I noticed there are lots of other places where a similar encoding() call exists -- I wrapped them all using a helper function. Please review carefully.

    @gvanrossum
    Copy link
    Member

    BTW the error and traceback will look something like this:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/guido/src/cpython/Lib/http/client.py", line 1138, in _send_request
        self.putheader(hdr, value)
      File "/Users/guido/src/cpython/Lib/http/client.py", line 1062, in putheader
        header = _encode(header, 'ascii', 'header')
      File "/Users/guido/src/cpython/Lib/http/client.py", line 161, in _encode
        (name.title(), data[err.start:err.end], name)) from None
    UnicodeEncodeError: 'ascii' codec can't encode character '\u1234' in position 3: Header ('ሴ') is not valid Latin-1. Use header.encode('utf-8') if you want to send it encoded in UTF-8.

    @gvanrossum
    Copy link
    Member

    I think this would be okay for 3.5.2 as well.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 9, 2016

    Personally I am skeptical if suggesting UTF-8 for the body data is a good idea, but I can go along with it, since other people want it. But I do strongly question whether it is right to suggest UTF-8 for header fields. The RFC <https://tools.ietf.org/html/rfc7230#page-26\> only mentions ASCII and Latin-1.

    Newer protocols based on HTTP (RTSP comes to mind) do specify UTF-8 for the header, but that is probably out of scope of both the HTTP module and beginner-targetted errors.

    If I were redoing this patch, I would drop all the changes except at the body.encode() line in Emil’s original post.

    @gvanrossum
    Copy link
    Member

    Martin, please make a patch along those lines! The only reason I generalized this to headers is that one of the three Requests issues referenced in the original post seemed to be about a header value (https://github.com/kennethreitz/requests/issues/1926). But that one seems different than the other two anyways, and it's about Python 3.7 so it wouldn't be helped by anything we're doing here anyways.

    @vadmium
    Copy link
    Member

    vadmium commented Jan 31, 2016

    Here is my cut down version of Guido’s patch. Now it only adds the message when someone passes a text string as the HTTPConnection.request(body=...) parameter:

    >>> c.request("POST", "", body="Celebrate \U0001F389")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/proj/python/cpython/Lib/http/client.py", line 1098, in request
        self._send_request(method, url, body, headers)
      File "/home/proj/python/cpython/Lib/http/client.py", line 1142, in _send_request
        body = _encode(body, 'body')
      File "/home/proj/python/cpython/Lib/http/client.py", line 161, in _encode
        (name.title(), data[err.start:err.end], name)) from None
    UnicodeEncodeError: 'latin-1' codec can't encode character '\U0001f389' in position 10: Body ('🎉') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

    What do people think?

    @gvanrossum
    Copy link
    Member

    LGTM.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 8, 2016

    New changeset 966bd147ccb5 by Martin Panter in branch '3.5':
    Issue bpo-26045: Add UTF-8 suggestion to error in http.client
    https://hg.python.org/cpython/rev/966bd147ccb5

    New changeset 9896ead3cc1d by Martin Panter in branch 'default':
    Issue bpo-26045: Merge http.client error addition from 3.5
    https://hg.python.org/cpython/rev/9896ead3cc1d

    @vadmium vadmium closed this as completed Feb 8, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    topic-unicode type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants