Message 177453 - Python tracker


Author	grahamd
Recipients	claudep, grahamd
Date	2012-12-14.09:11:34
Message-id	<1355476295.03.0.691371802467.issue16679@psf.upfronthosting.co.za>

Content

You can't try UTF-8 and then fall back to ISO-8859-1. PEP 3333 requires that it always be ISO-8859-1. If an application needs it as something else, it is the web application's job to do the conversion.

The relevant part of the PEP is:

"""On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters."""

By decoding as UTF-8 you would break the requirement that only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive) are passed through.
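
As a rough illustration of that point (the request path and byte values below are made up for the example, and within_latin1 is a hypothetical helper, not something from wsgiref or the PEP), decoding the same raw bytes both ways shows which result stays inside the allowed range:

```python
# Hypothetical raw bytes from a request line: "/price/" followed by the
# UTF-8 encoding of the euro sign (U+20AC).
raw = b"/price/\xe2\x82\xac"

as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("iso-8859-1")


def within_latin1(s):
    """Return True if every code point is in \u0000 through \u00FF."""
    return all(ord(ch) <= 0xFF for ch in s)


print(within_latin1(as_utf8))    # False: U+20AC is outside the range
print(within_latin1(as_latin1))  # True: each byte became one code point <= 0xFF
```

The ISO-8859-1 result looks like mojibake to a human, but it is the only one of the two that satisfies the quoted requirement.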

So it is inconvenient if your expectation is that it will always be UTF-8, but that is how it has to work. This is because the bytes could be in some encoding other than UTF-8, yet still happen to decode successfully as UTF-8. In that case the application would get something totally different from the original, which is wrong.
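
A contrived sketch of that failure mode, assuming a client that really did send ISO-8859-1 text (the sample string is chosen only because its bytes also happen to be valid UTF-8):

```python
# What the client meant, sent as ISO-8859-1: two characters, U+00C3 and U+00A9.
intended = "\u00c3\u00a9"
sent = intended.encode("iso-8859-1")   # b"\xc3\xa9"

# A server that tries UTF-8 first gets no error, so it never falls back...
guessed = sent.decode("utf-8")         # one character, U+00E9

# ...but the text it hands to the application is not what was sent.
print(guessed == intended)             # False
print(len(intended), len(guessed))     # 2 1
```

No exception is raised anywhere, so a "try UTF-8, then fall back" strategy has no way of noticing that it guessed wrong.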

So the WSGI server can never make any assumptions about the encoding, and the WSGI application always has to be the one that converts the value to the correct Unicode string. The only way that can be done while still passing through a native string is to decode as ISO-8859-1 (which is byte preserving), allowing the application to go back to bytes and then decode to Unicode in the correct encoding.
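
On the application side, the round trip described above might look something like the following. This is only a sketch: wsgi_str_to_unicode is a hypothetical helper name, the environ dict is built by hand to stand in for what a server would supply, and UTF-8 is simply the encoding this example application has decided to expect.

```python
def wsgi_str_to_unicode(value, encoding="utf-8", errors="replace"):
    """Recover the original bytes from a WSGI native string, then decode
    them using the encoding the application expects (hypothetical helper)."""
    return value.encode("iso-8859-1").decode(encoding, errors)


# Simulate the server side: raw request bytes decoded as ISO-8859-1.
raw = "/caf\u00e9".encode("utf-8")
environ = {"PATH_INFO": raw.decode("iso-8859-1")}

# The ISO-8859-1 step is byte preserving, so the bytes come back unchanged.
assert environ["PATH_INFO"].encode("iso-8859-1") == raw

path = wsgi_str_to_unicode(environ["PATH_INFO"])
print(path == "/caf\u00e9")  # True
```

Because the server never guessed, an application expecting a different encoding could pass that encoding to the same helper and still get at the exact bytes the client sent.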
