Message 225814 - Python tracker


Author	ncoghlan
Recipients	ncoghlan
Date	2014-08-24.12:45:41
Message-id	<1408884341.89.0.273491669506.issue22264@psf.upfronthosting.co.za>
In-reply-to

Content
The WSGI standard as updated for Python 3 (PEP 3333, WSGI 1.0.1) mandates that binary data be decoded as latin-1 text: http://www.python.org/dev/peps/pep-3333/#unicode-issues

This means that many WSGI headers will in fact contain *improperly encoded data*. Developers working directly with WSGI (rather than using a WSGI framework like Django, Flask or Pyramid) need to convert those strings back to bytes and decode them properly before passing them on to user applications.
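
To make the problem concrete, here is a small sketch of what happens to a non-ASCII request path. The sample path and the variable names are illustrative assumptions, not taken from any particular server:

```python
# Hypothetical example: the path as it arrives on the wire, UTF-8 encoded.
raw_path = "/caf\u00e9".encode("utf-8")      # b'/caf\xc3\xa9'

# What a PEP 3333 server hands to the application: the bytes decoded as latin-1.
wsgi_path = raw_path.decode("latin-1")       # '/caf\xc3\xa9' as text, i.e. mojibake

# Recovering the intended text: back to bytes, then decode with the real encoding.
fixed_path = wsgi_path.encode("latin-1").decode("utf-8")
```

The latin-1 step is lossless because every byte value maps to exactly one code point, which is why the original bytes can always be recovered.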

I suggest adding a simple "fix_encoding" function to wsgiref that covers this:

    def fix_encoding(data, encoding, errors="surrogateescape"):
        return data.encode("latin-1").decode(encoding, errors)
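
As a rough illustration of why "surrogateescape" is a reasonable default, the sketch below feeds the helper a value containing a byte that is not valid UTF-8. The helper is redefined locally because wsgiref does not currently provide it, and the sample value is made up:

```python
def fix_encoding(data, encoding, errors="surrogateescape"):
    # Same body as the proposed helper, defined here so the example is self-contained.
    return data.encode("latin-1").decode(encoding, errors)

# A made-up header value containing 0xff, which is never valid in UTF-8,
# in the latin-1 decoded form a WSGI server would present.
wsgi_value = b"abc\xff".decode("latin-1")

# Decoding does not raise; the bad byte becomes a lone surrogate.
fixed = fix_encoding(wsgi_value, "utf-8")    # 'abc\udcff'

# The original bytes are still recoverable from the fixed string.
original = fixed.encode("utf-8", "surrogateescape")
```

Passing errors="strict" instead would raise UnicodeDecodeError for the same input, which may be preferable when malformed data should be rejected outright.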

The primary intended benefit is to make WSGI-related code more self-documenting. Compare the proposal with the status quo:

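    # Proposal (fix_encoding is the suggested addition, not an existing wsgiref function):
    data = wsgiref.fix_encoding(data, "utf-8")
    # Status quo:
    data = data.encode("latin-1").decode("utf-8", "surrogateescape")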

The proposal hides the mechanical details of what is going on in order to emphasise *why* the change is needed, and provides you with a name to go look up if you want to learn more.

The latter just looks nonsensical unless you're already familiar with this particular corner of the WSGI specification.

History
Date	User	Action	Args
2014-08-24 12:45:41	ncoghlan	set	recipients: + ncoghlan
2014-08-24 12:45:41	ncoghlan	set	messageid: <1408884341.89.0.273491669506.issue22264@psf.upfronthosting.co.za>
2014-08-24 12:45:41	ncoghlan	link	issue22264 messages
2014-08-24 12:45:41	ncoghlan	create