Message 177591 - Python tracker


Author	aclover
Recipients	aclover, claudep, grahamd, pje
Date	2012-12-16.12:03:31
Message-id	<1355659412.62.0.985557563504.issue16679@psf.upfronthosting.co.za>
In-reply-to

Content
WSGI's use of ISO-8859-1 for all HTTP-byte-originated strings is very much deliberate; we needed a way to preserve the original input bytes whilst still using Unicode strings, and at the time surrogateescape was not available. The result is counter-intuitive but at least it is finally consistent; the expectation is that most web authors will be using some kind of web framework or input-reading library that will hide away the unpleasant details.
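
To make the byte-preservation point concrete, here is a minimal sketch of the round trip. The helper names (to_wsgi_str, from_wsgi_str, decode_path) and the sample path are made up for illustration and are not part of WSGI or wsgiref; the choice of UTF-8 as the application's encoding is likewise an assumption.

```python
def to_wsgi_str(raw: bytes) -> str:
    """Represent raw HTTP bytes as a native str, one code point per byte."""
    return raw.decode("iso-8859-1")


def from_wsgi_str(value: str) -> bytes:
    """Recover the exact original bytes from a WSGI-style str."""
    return value.encode("iso-8859-1")


def decode_path(value: str, encoding: str = "utf-8") -> str:
    """Re-decode a WSGI-style str with the encoding the application expects.

    The default of UTF-8 is an assumption made by this sketch; the
    application has to know or decide what its paths are encoded in.
    """
    return from_wsgi_str(value).decode(encoding, errors="replace")


# Hypothetical request path, as bytes that might arrive on the wire.
raw_path = "/café".encode("utf-8")

# What a server following this convention would hand to the application.
path_info = to_wsgi_str(raw_path)

# ISO-8859-1 maps every byte 0x00-0xFF to a code point, so nothing is lost.
assert from_wsgi_str(path_info) == raw_path

print(repr(path_info))         # looks mangled, but is fully reversible
print(decode_path(path_info))  # the application picks the real encoding
```

The intermediate string looks wrong to a human reader, which is the counter-intuitive part, but because every possible byte has an ISO-8859-1 code point the original input can always be recovered exactly.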

See http://mail.python.org/pipermail/web-sig/2007-December/thread.html#3002 and http://mail.python.org/pipermail/web-sig/2010-July/thread.html#4473 for the background discussion.

In any case, we cannot assume a path is UTF-8: not every URI is known to have come from an IRI, so RFC 3987 does not necessarily apply. UTF-8-with-Latin1-fallback is also undesirable in itself, as it adds ambiguity: an ISO-8859-1 byte sequence that by coincidence happens to be a valid UTF-8 byte sequence will get mangled.
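
The ambiguity can be shown with a small sketch. decode_with_fallback is a hypothetical decoder written only to illustrate the problem; it is not taken from wsgiref or any proposed patch.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Hypothetical 'try UTF-8, else ISO-8859-1' decoder (illustration only)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# A client that really did mean the two ISO-8859-1 characters "Ã©".
intended = "\u00c3\u00a9"
raw = intended.encode("iso-8859-1")   # b'\xc3\xa9'

# Those bytes happen to be valid UTF-8, so the fallback never triggers
# and the two intended characters collapse into a single "é".
decoded = decode_with_fallback(raw)
assert decoded == "\u00e9"
assert decoded != intended

# A different input, the lone ISO-8859-1 byte for "é", is not valid UTF-8
# and takes the fallback path, yet it yields the same string.
assert decode_with_fallback(b"\xe9") == decoded

# Two distinct byte sequences now map to one str, so the application
# can no longer tell which bytes were actually sent.
assert raw != b"\xe9"
```

With a plain ISO-8859-1 decode the two inputs stay distinct, which is why the fallback scheme loses information that the consistent scheme keeps.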

History
Date	User	Action	Args
2012-12-16 12:03:32	aclover	set	recipients: + aclover, pje, grahamd, claudep
2012-12-16 12:03:32	aclover	set	messageid: <1355659412.62.0.985557563504.issue16679@psf.upfronthosting.co.za>
2012-12-16 12:03:32	aclover	link	issue16679 messages
2012-12-16 12:03:31	aclover	create