Message 227510 - Python tracker


Author	rbcollins
Recipients	benjamin.peterson, ezio.melotti, grahamd, lemburg, ncoghlan, pitrou, pje, rbcollins, serhiy.storchaka, vstinner
Date	2014-09-25.07:04:01
Message-id	<1411628642.22.0.712198102165.issue22264@psf.upfronthosting.co.za>

Content
So this looks like it's going to instantly create bugs in programs that use it. HTTP/1.1 headers are one of:
latin1
MIME encoded (RFC2047)
invalid and working only by accident

HTTP/2 doesn't change this.

An API that encourages folk to encode into utf8 and then put that in their headers is problematic.

Consider:

    def dump_wsgistr(data, encoding, errors='strict'):
        # Encode to bytes, then reinterpret those bytes as iso-8859-1 text.
        return data.encode(encoding, errors).decode('iso-8859-1')

This takes a string that one wants to put into a header value, encodes it with a *user-specified encoding*, then decodes the resulting bytes as iso-8859-1 [at which point the WSGI server can encode it back to octets before putting it on the wire].
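To make the round trip concrete, here is a minimal sketch of what happens to a non-ASCII value. It assumes the function is meant to return its result, and the sample value 'café' is only an illustration:

```python
def dump_wsgistr(data, encoding, errors='strict'):
    # Assumption: the proposed helper returns the re-decoded string.
    return data.encode(encoding, errors).decode('iso-8859-1')

# Illustrative value only; any non-ASCII text behaves the same way.
value = dump_wsgistr('café', 'utf-8')

# The WSGI server encodes the native string back to octets as iso-8859-1,
# so the raw utf8 bytes are what end up on the wire.
wire = value.encode('iso-8859-1')

print(repr(value))  # the two utf8 bytes of 'é' show up as two latin1 characters
print(wire)
```

A recipient that reads the header as latin1 sees 'cafÃ©' rather than 'café', which is the kind of bug this API would make easy to write.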

But this is fundamentally wrong in the common case: either 'data' is itself suitable as a header value (e.g. it is ASCII - recommended per RFC 7230 section 3.2.4) or 'data' needs RFC 2047 encoding, not utf8.

There are a few special cases where folk have incorrectly shoved utf8 into header values and we need to make it possible for folk working within WSGI to do that - which is why the API is the way it is - but we shouldn't make it *easier* for them to do the wrong thing.

I'd support an API that DTRT here: it takes a string, tries US-ASCII, and falls back to MIME encoding (RFC 2047) with utf8 as the encoding parameter.
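A rough sketch of what such an API could look like, using the stdlib email.header module for the RFC 2047 step. The name encode_header_value is hypothetical and not part of any proposed patch; the sketch also does not validate control characters or enforce the 75-character limit on encoded words:

```python
from email.header import Header, decode_header

def encode_header_value(value):
    """Hypothetical helper: ASCII passes through, anything else is MIME encoded."""
    try:
        value.encode('ascii')
    except UnicodeEncodeError:
        # Fall back to an RFC 2047 encoded-word with utf8 as the charset.
        # maxlinelen=0 disables line folding in Header.encode().
        return Header(value, charset='utf-8').encode(maxlinelen=0)
    return value

plain = encode_header_value('text/plain')
encoded = encode_header_value('café')

# Round trip through the stdlib decoder to check nothing was lost.
(raw, charset), = decode_header(encoded)
decoded = raw.decode(charset)
```

Either way the result is pure ASCII, so it survives the iso-8859-1 native-string convention in WSGI unchanged.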
