Message 225867 - Python tracker

Author	ncoghlan
Recipients	benjamin.peterson, ezio.melotti, grahamd, lemburg, ncoghlan, pitrou, pje, serhiy.storchaka, vstinner
Date	2014-08-25.06:39:33
Message-id	<1408948773.68.0.530186261554.issue22264@psf.upfronthosting.co.za>

Content
After reviewing the stdlib code as Serhiy suggested and reflecting on the matter for a while, I now think it's better to frame this idea in terms of formalising the concept of a "WSGI string". That is, data that has been decoded as latin-1 not because that's necessarily correct, but because it creates a valid str object that doesn't lose any information and doesn't have any surrogate escapes in it, yet can still handle arbitrary binary data.
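To make those three properties concrete, here is a small sketch (the names `raw`, `wsgi_str` and `escaped` are just illustrative, not from the stdlib). It checks that every byte value survives a latin-1 round trip without producing surrogates, and contrasts that with a UTF-8 decode using surrogateescape:

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(range(256))

# iso-8859-1 maps each byte to the code point with the same number,
# so decoding arbitrary binary data never fails.
wsgi_str = raw.decode('iso-8859-1')

# The round trip is lossless.
assert wsgi_str.encode('iso-8859-1') == raw

# No surrogate escapes appear in the resulting str.
assert not any(0xD800 <= ord(ch) <= 0xDFFF for ch in wsgi_str)

# By contrast, decoding the same bytes as UTF-8 with surrogateescape
# also round-trips, but only by smuggling lone surrogates into the str.
escaped = raw.decode('utf-8', 'surrogateescape')
assert any(0xD800 <= ord(ch) <= 0xDFFF for ch in escaped)
assert escaped.encode('utf-8', 'surrogateescape') == raw
```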

Under that model, and using a dumps/loads inspired naming scheme (since this is effectively a serialisation format for the WSGI server/application boundary), the appropriate helpers would be:

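    def dump_wsgistr(data, encoding, errors='strict'):
        # real text -> WSGI string (latin-1 decoded bytes)
        return data.encode(encoding, errors).decode('iso-8859-1')

    def load_wsgistr(data, encoding, errors='strict'):
        # WSGI string -> real text in the given encoding
        return data.encode('iso-8859-1').decode(encoding, errors)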
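A usage sketch, assuming the helpers are meant to return the transcoded string (so they are written here with explicit return statements). The sample `path` value is made up for illustration:

```python
def dump_wsgistr(data, encoding, errors='strict'):
    # real text -> WSGI string (latin-1 decoded bytes)
    return data.encode(encoding, errors).decode('iso-8859-1')

def load_wsgistr(data, encoding, errors='strict'):
    # WSGI string -> real text in the given encoding
    return data.encode('iso-8859-1').decode(encoding, errors)

# A hypothetical non-ASCII path as the application sees it.
path = '/caf\u00e9/\u65e5\u672c'

# Serialise for the server/application boundary.
wire = dump_wsgistr(path, 'utf-8')

# Every character in a WSGI string fits in a single byte.
assert all(ord(ch) < 256 for ch in wire)

# Loading with the same encoding recovers the original text.
assert load_wsgistr(wire, 'utf-8') == path
```

The intermediate `wire` value looks like mojibake when printed, which is expected: it is a transport format, not text meant for display.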

As Victor says, using surrogateescape by default is not correct. However, some of the code in wsgiref.handlers does pass a custom errors setting, so it's appropriate to make that configurable.
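As a rough illustration of why a configurable errors argument matters, the sketch below feeds the proposed helpers (again written with explicit returns, which is an assumption about the intent) a WSGI string whose underlying bytes are not valid UTF-8. The variable names are illustrative only:

```python
def dump_wsgistr(data, encoding, errors='strict'):
    return data.encode(encoding, errors).decode('iso-8859-1')

def load_wsgistr(data, encoding, errors='strict'):
    return data.encode('iso-8859-1').decode(encoding, errors)

# A WSGI string carrying the bytes FF FE, which are not valid UTF-8.
wire = '\xff\xfe'

# The strict default surfaces the problem instead of hiding it.
try:
    load_wsgistr(wire, 'utf-8')
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False

# Callers that want a lossy but surrogate-free result can opt in to 'replace'.
replaced = load_wsgistr(wire, 'utf-8', errors='replace')

# Callers that need to preserve the bytes can opt in to surrogateescape
# explicitly and reverse it later with the matching dump call.
escaped = load_wsgistr(wire, 'utf-8', errors='surrogateescape')
restored = dump_wsgistr(escaped, 'utf-8', errors='surrogateescape')
```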

With this change, there would be several instances in wsgiref.handlers that could be changed from the current:

    data.encode(encoding).decode('iso-8859-1')

to:

    dump_wsgistr(data, encoding)

The point is that it isn't "iso-8859-1" that's significant - it's the compliance with the data format mandated by the WSGI 1.0.1 specification (which just happens to be "latin-1 decoded string").
