Message 337802 - Python tracker

Author	tburke
Recipients	tburke
Date	2019-03-12.20:33:23
Message-id	<1552422803.93.0.420596825145.issue36274@roundup.psfhosted.org>
In-reply-to

Content

While the RFCs are rather clear that non-ASCII data would be out of spec,

* that doesn't prevent a poorly-behaved client from sending non-ASCII bytes on the wire, which means
* as an application developer, it's useful to be able to mimic such a client to verify expected behavior while still using stdlib to handle things like header parsing, particularly since
* this worked perfectly well on Python 2.

The two most obvious ways (to me, anyway) to try to send a request for /你好 (for example) are

    # Assume it will get UTF-8 encoded, as that's the default encoding
    # for urllib.parse.quote()
    conn.putrequest('GET', '/\u4f60\u597d')

    # Assume it will get Latin-1 encoded, as
    #   * that's the encoding used in http.client.parse_headers(),
    #   * that's the encoding used for PEP-3333, and
    #   * it has a one-to-one mapping with bytes
    conn.putrequest('GET', '/\xe4\xbd\xa0\xe5\xa5\xbd')

both fail with something like

    UnicodeEncodeError: 'ascii' codec can't encode characters in position ...
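The failure can be reproduced without opening a connection. This is only a sketch: `encode_request_line` is a hypothetical stand-in that assumes putrequest() formats the request line with `%s` and then encodes it as ASCII, which is what the error message suggests.

```python
def encode_request_line(method, url, version='HTTP/1.1'):
    # Hypothetical stand-in for what putrequest() is assumed to do:
    # build the request line as str, then encode it as ASCII.
    return ('%s %s %s' % (method, url, version)).encode('ascii')


attempts = {
    'utf-8 intent': '/\u4f60\u597d',
    'latin-1 intent': '/\xe4\xbd\xa0\xe5\xa5\xbd',
}

failures = {}
for label, url in attempts.items():
    try:
        encode_request_line('GET', url)
    except UnicodeEncodeError as exc:
        # Record which codec rejected the string and where it gave up.
        failures[label] = (exc.encoding, exc.start)
```

Both spellings end up in `failures`, since neither code point range is representable in ASCII regardless of which encoding the caller had in mind.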

Trying to pre-encode like

    conn.putrequest('GET', b'/\xe4\xbd\xa0\xe5\xa5\xbd')

at least doesn't raise an error, but still does not do what was intended; rather than a request line like

    GET /你好 HTTP/1.1

(or

    /ä½ å¥½

depending on how you choose to interpret the bytes), the server gets

    GET b'/\xe4\xbd\xa0\xe5\xa5\xbd' HTTP/1.1
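The repr() showing up on the wire is ordinary str formatting behavior, and can be seen in isolation. A minimal sketch, assuming the request line is built with a `'%s %s %s'` template; the variable names are just for illustration.

```python
raw = b'/\xe4\xbd\xa0\xe5\xa5\xbd'

# Formatting bytes with %s in a str template inserts repr(raw),
# not the bytes themselves.
request = '%s %s %s' % ('GET', raw, 'HTTP/1.1')

# The repr is pure ASCII, so the ASCII encode happily succeeds.
wire = request.encode('ascii')

# The two readings of the intended bytes mentioned above.
as_utf8 = raw.decode('utf-8')
as_latin1 = raw.decode('latin-1')
```

Here `wire` carries the literal characters `b'/\xe4...'` (backslashes, quotes and all) rather than the six raw bytes the caller meant to send.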

The trouble comes down to https://github.com/python/cpython/blob/v3.7.2/Lib/http/client.py#L1104-L1107, where the request line is encoded as ASCII on the assumption that any non-ASCII characters have already been dealt with. We don't actually have any control over what the caller passes as the url (so the assumption doesn't hold), nor do we know anything about the encoding that was *intended*.

One of three fixes seems warranted:

* Switch to using Latin-1 to encode instead of ASCII (again, leaning on the precedent set in parse_headers and PEP-3333). This may make it too easy to write an out-of-spec client, however.
* Continue to use ASCII to encode, but include errors='surrogateescape' to give callers an escape hatch. This seems like a reasonably high bar to ensure that the caller actually intends to send unquoted data.
* Accept raw bytes and actually use them (rather than their repr()), allowing the caller to decide upon an appropriate encoding.
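To make the trade-offs concrete, here is a rough sketch of how each option would treat the same path. The three `request_line_*` helpers are hypothetical and are not http.client APIs; they only model the encode step under the assumption that the request line is otherwise built the same way.

```python
RAW_PATH = b'/\xe4\xbd\xa0\xe5\xa5\xbd'  # UTF-8 bytes for /你好


def request_line_latin1(method, url, version='HTTP/1.1'):
    """Option 1 (hypothetical): encode the request line as Latin-1."""
    return ('%s %s %s' % (method, url, version)).encode('latin-1')


def request_line_surrogateescape(method, url, version='HTTP/1.1'):
    """Option 2 (hypothetical): keep ASCII, but honor escaped surrogates."""
    line = '%s %s %s' % (method, url, version)
    return line.encode('ascii', errors='surrogateescape')


def request_line_bytes(method, url, version='HTTP/1.1'):
    """Option 3 (hypothetical): accept bytes and use them verbatim."""
    if isinstance(url, str):
        url = url.encode('ascii')
    return method.encode('ascii') + b' ' + url + b' ' + version.encode('ascii')


expected = b'GET ' + RAW_PATH + b' HTTP/1.1'

# Option 1: the caller passes a Latin-1 "bytes as str" spelling.
line1 = request_line_latin1('GET', RAW_PATH.decode('latin-1'))

# Option 2: the caller must deliberately smuggle the bytes in as surrogates.
line2 = request_line_surrogateescape(
    'GET', RAW_PATH.decode('ascii', errors='surrogateescape'))

# Option 3: the caller hands over the bytes directly.
line3 = request_line_bytes('GET', RAW_PATH)

# Under option 2, an ordinary non-ASCII str is still rejected.
try:
    request_line_surrogateescape('GET', '/\u4f60\u597d')
    plain_unicode_rejected = False
except UnicodeEncodeError:
    plain_unicode_rejected = True
```

All three produce the same bytes for a caller who opts in. They differ in how much effort that opt-in takes: option 1 accepts any str in the Latin-1 range, while option 2 still raises for ordinary non-ASCII text.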

History
Date	User	Action	Args
2019-03-12 20:33:23	tburke	set	recipients: + tburke
2019-03-12 20:33:23	tburke	set	messageid: <1552422803.93.0.420596825145.issue36274@roundup.psfhosted.org>
2019-03-12 20:33:23	tburke	link	issue36274 messages
2019-03-12 20:33:23	tburke	create