Title: http.client cannot send non-ASCII request lines
Type: behavior
Stage: patch review
Components: Library (Lib)
Versions: Python 3.8, Python 3.7
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: orsenthil, tburke
Priority: normal
Keywords: patch

Created on 2019-03-12 20:33 by tburke, last changed 2019-03-13 23:59 by tburke.

Pull Requests
URL       Status  Linked
PR 12314  open    tburke, 2019-03-13 23:59
PR 12315  open    tburke, 2019-03-13 23:59
Messages (1)
msg337802 - Author: Tim Burke (tburke) Date: 2019-03-12 20:33
While the RFCs are rather clear that non-ASCII data would be out of spec,

* that doesn't prevent a poorly-behaved client from sending non-ASCII bytes on the wire, which means
* as an application developer, it's useful to be able to mimic such a client to verify expected behavior while still using stdlib to handle things like header parsing, particularly since
* this worked perfectly well on Python 2.

The two most-obvious ways (to me, anyway) to try to send a request for /你好 (for example) are

    # Assume it will get UTF-8 encoded, as that's the default encoding
    # for urllib.parse.quote()
    conn.putrequest('GET', '/\u4f60\u597d')

    # Assume it will get Latin-1 encoded, as
    #   * that's the encoding used in http.client.parse_headers(),
    #   * that's the encoding used for PEP-3333, and
    #   * it has a one-to-one mapping with bytes
    conn.putrequest('GET', '/\xe4\xbd\xa0\xe5\xa5\xbd')

both fail with something like

    UnicodeEncodeError: 'ascii' codec can't encode characters in position ...
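
A minimal reproduction of the two failing attempts might look like the sketch below. The helper name `try_putrequest` and the host `localhost` are placeholders chosen for illustration; no socket is opened, since putrequest() only buffers the request line.

```python
import http.client


def try_putrequest(url):
    """Return the name of the exception putrequest() raises, or None."""
    # Use a fresh connection per attempt: a failed putrequest() leaves
    # the connection in a "request started" state.
    conn = http.client.HTTPConnection('localhost')
    try:
        conn.putrequest('GET', url)
    except UnicodeEncodeError as exc:
        return type(exc).__name__
    return None


# Guess 1: hope the str is UTF-8 encoded on the way out.
utf8_guess = try_putrequest('/\u4f60\u597d')

# Guess 2: hope the str is Latin-1 encoded on the way out.
latin1_guess = try_putrequest('/\xe4\xbd\xa0\xe5\xa5\xbd')

print(utf8_guess, latin1_guess)
```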

Trying to pre-encode like

    conn.putrequest('GET', b'/\xe4\xbd\xa0\xe5\xa5\xbd')

at least doesn't raise an error, but still does not do what was intended; rather than a request line like

    GET /你好 HTTP/1.1

(or its Latin-1 rendering, depending on how you choose to interpret the bytes), the server gets

    GET b'/\xe4\xbd\xa0\xe5\xa5\xbd' HTTP/1.1
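
The repr() shows up because of how str formatting treats bytes. The sketch below assumes the request line is assembled with %-formatting and then ASCII-encoded, which is how I read putrequest() in 3.7; it imitates that step in isolation rather than calling http.client, and the variable names are illustrative only.

```python
# Pre-encoded UTF-8 bytes for the path, as in the attempt above.
path = b'/\xe4\xbd\xa0\xe5\xa5\xbd'

# Formatting bytes with %s in a str template inserts repr(path),
# not the raw bytes.
request_line = '%s %s %s' % ('GET', path, 'HTTP/1.1')

# The repr is pure ASCII (backslash escapes), so encoding succeeds
# and the escaped text is what goes out on the wire.
wire = request_line.encode('ascii')

print(wire)
```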

The trouble comes down to the ASCII encoding step in putrequest(): we don't actually have any control over what the caller passes as the url (so the assumption that it is already ASCII doesn't hold), nor do we know anything about the encoding that was *intended*.

One of three fixes seems warranted:

* Switch to using Latin-1 to encode instead of ASCII (again, leaning on the precedent set in parse_headers and PEP-3333). This may make it too easy to write an out-of-spec client, however.
* Continue to use ASCII to encode, but include errors='surrogateescape' to give callers an escape hatch. This seems like a reasonably high bar to ensure that the caller actually intends to send unquoted data.
* Accept raw bytes and actually use them (rather than their repr()), allowing the caller to decide upon an appropriate encoding.
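
To compare the three options from the caller's side, here is a rough sketch of what each would require in order to put the same bytes on the wire. It only exercises the codecs involved; `build_line` is a hypothetical helper, not part of http.client, and it does not reflect the contents of either pull request.

```python
# The bytes we want in the request line: UTF-8 for /你好.
raw = b'/\xe4\xbd\xa0\xe5\xa5\xbd'


def build_line(url, encoding='ascii', errors='strict'):
    """Hypothetical stand-in for the request-line encoding step."""
    if isinstance(url, bytes):
        # Option 3: use caller-supplied bytes as-is.
        return b'GET ' + url + b' HTTP/1.1'
    return ('GET %s HTTP/1.1' % url).encode(encoding, errors)


# Option 1: Latin-1. The caller passes a str whose code points map
# one-to-one onto the desired bytes.
option1 = build_line(raw.decode('latin-1'), encoding='latin-1')

# Option 2: ASCII with surrogateescape. The caller must smuggle the
# bytes in as lone surrogates, which is hard to do by accident.
smuggled = raw.decode('ascii', errors='surrogateescape')
option2 = build_line(smuggled, errors='surrogateescape')

# Option 3: raw bytes, with the encoding chosen by the caller.
option3 = build_line(raw)

# Under option 2, an ordinary non-ASCII str is still rejected.
try:
    build_line('/\u4f60\u597d', errors='surrogateescape')
    plain_str_rejected = False
except UnicodeEncodeError:
    plain_str_rejected = True
```

All three produce the same request line here; they differ in how deliberate the caller has to be, which is the trade-off described in the list above.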

Date                 User         Action  Args
2019-03-13 23:59:52  tburke       set     pull_requests: + pull_request12289
2019-03-13 23:59:19  tburke       set     keywords: + patch
                                          stage: patch review
                                          pull_requests: + pull_request12288
2019-03-13 10:18:31  SilentGhost  set     nosy: + orsenthil
                                          versions: - Python 3.4, Python 3.5, Python 3.6
2019-03-12 20:33:23  tburke       create