classification
Title: Wrong URL path decoding
Type: behavior Stage:
Components: Documentation Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: aclover, claudep, docs@python, grahamd, pje
Priority: normal Keywords: patch

Created on 2012-12-14 08:28 by claudep, last changed 2012-12-17 16:44 by claudep.

Files
File name Uploaded Description
issue16679-1.diff claudep, 2012-12-14 08:59 Decoding path with 'utf-8'
Messages (8)
msg177449 - (view) Author: Claude Paroz (claudep) Date: 2012-12-14 08:28
In wsgiref/simple_server.py (WSGIRequestHandler.get_environ), Python 3 is currently populating the env['PATH_INFO'] variable by decoding the URL path, assuming it was encoded with 'iso-8859-1', which appears to be wrong, according to RFC 3986/3987.
For example, if you request the path /سلام in any modern browser, PATH_INFO will contain "/Ø³ÙØ§Ù".
'iso-8859-1' should be replaced by 'utf-8' for decoding.
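
A minimal sketch of the reported symptom, assuming the server unquotes the percent-encoded request path with 'iso-8859-1' as described above. The variable names are illustrative only; the example calls `urllib.parse.unquote` directly rather than going through wsgiref.

```python
from urllib.parse import unquote

# Percent-encoded UTF-8 bytes for the path "/سلام", as a browser would send it.
raw_path = "/%D8%B3%D9%84%D8%A7%D9%85"

# Decoding the unquoted bytes as ISO-8859-1 maps each byte to one code point,
# so the four Arabic letters become eight Latin-1 characters.
path_info = unquote(raw_path, encoding="iso-8859-1")

# Decoding the same bytes as UTF-8 gives the text the user typed.
expected = unquote(raw_path, encoding="utf-8")

print(repr(path_info))
print(repr(expected))
```

Two of the eight characters in `path_info` are C1 control characters (U+0084 and U+0085), which is why the mangled path usually looks shorter than it is when displayed.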

Note that this was introduced as part of the fix for http://bugs.python.org/issue10155
msg177450 - (view) Author: Graham Dumpleton (grahamd) Date: 2012-12-14 08:42
The requirement per PEP 3333 is that the original byte string needs to be converted to a native string (Unicode) with the ISO-8859-1 encoding. This is to ensure that the original bytes are preserved so that the WSGI application, with its own knowledge of what encoding the byte string was in, can then properly convert it to the correct encoding.

In other words, the WSGI server is not allowed to assume that the original byte string was UTF-8, because in practice it may not be and it cannot know what it is. The WSGI server must use ISO-8859-1. The WSGI application, if it needs it in UTF-8, must then convert it back to a byte string using ISO-8859-1 and then from there convert it back to a native string as UTF-8.
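
The two-step conversion on the application side could look roughly like this. `decode_path_info` is a hypothetical helper, not part of wsgiref or PEP 3333, and the `environ` dict is a hand-built stand-in for what a server would pass.

```python
def decode_path_info(environ, encoding="utf-8"):
    """Return PATH_INFO re-decoded with the encoding the application expects.

    Hypothetical helper for illustration; not part of wsgiref.
    """
    native = environ.get("PATH_INFO", "")
    # Step 1: recover the original bytes. ISO-8859-1 maps code points
    # U+0000..U+00FF to bytes 0x00..0xFF one to one.
    raw = native.encode("iso-8859-1")
    # Step 2: decode those bytes with the encoding the application
    # knows (or decides) to be correct.
    return raw.decode(encoding)


# Stand-in environ holding the "bytes in unicode form" for "/سلام".
environ = {"PATH_INFO": "/\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85"}
path = decode_path_info(environ)
print(path)
```

If the bytes are not valid in the chosen encoding, `raw.decode` raises `UnicodeDecodeError`, so the application stays in control of how to handle malformed paths.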

So if I understand what you are saying, you are suggesting a change which is incompatible with PEP 3333.

Please provide a code snippet or patch to show what you are proposing to be changed so it can be determined precisely what you are talking about.
msg177451 - (view) Author: Claude Paroz (claudep) Date: 2012-12-14 08:59
Attached are my proposed changes.

Also, I just came across http://bugs.python.org/issue3300, which finally led Python urllib.parse.quote to default to UTF-8 encoding, after a lengthy discussion.
msg177453 - (view) Author: Graham Dumpleton (grahamd) Date: 2012-12-14 09:11
You can't try UTF-8 and then fall back to ISO-8859-1. PEP 3333 requires that it always be ISO-8859-1. If an application needs it as something else, it is the web application's job to do it.

The relevant part of the PEP is:

"""On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters."""

By converting as UTF-8 you would be breaking the requirement that only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive) are passed through.

So it is inconvenient if your expectation is that it will always be UTF-8, but that is how it has to work. This is because it could be something other than UTF-8, yet still be able to be successfully converted as UTF-8. In that case the application would get something totally different from the original, which is wrong.

So, the WSGI server cannot ever make any assumptions and the WSGI application always has to be the one which converts it to the correct Unicode string. The only way that can be done and still pass through a native string, is that it is done as ISO-8859-1 (which is byte preserving), allowing the application to go back to bytes and then back to Unicode in correct encoding.
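
To illustrate the "byte preserving" property in isolation, here is a small sketch using only built-in `bytes` and `str` methods; the variable names are illustrative.

```python
every_byte = bytes(range(256))

# ISO-8859-1 maps byte N to code point U+00NN, so any byte string decodes...
as_text = every_byte.decode("iso-8859-1")
# ...and encodes back to the identical bytes.
round_tripped = as_text.encode("iso-8859-1")

# Strict UTF-8, by contrast, rejects byte sequences that are not valid UTF-8,
# so it cannot carry arbitrary bytes through a str unchanged.
try:
    every_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_tripped == every_byte, utf8_ok)
```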
msg177457 - (view) Author: Claude Paroz (claudep) Date: 2012-12-14 10:47
I may understand your reasoning when you cannot make any assumptions about the encoding of a series of bytes.

I think that the case of PATH_INFO is different, because it should comply with standards, and then you *can* make the assumption that the original path is 'utf-8'-encoded. So either leave the string undecoded, or decode it to what the standards say. It would put an unnecessary burden on WSGI apps to always require them to encode and re-decode this string.
Wouldn't it be possible to amend PEP 3333?

Hopefully we can get some other opinions about this issue.
msg177578 - (view) Author: PJ Eby (pje) * (Python committer) Date: 2012-12-16 02:48
> Wouldn't it be possible to amend PEP 3333?

Sure...  except then it would also be necessary to amend PEP 3333, and also all WSGI applications already written that assume this, any time in the last nine years.

This is a known and intended consistent property of how WSGI handles HTTP headers.  Under Python 2.x, PATH_INFO was a byte string (and still is), and to also maintain side-compatibility with Jython and IronPython, header strings are always maintained as "bytes in unicode form", with applications having responsibility to decode-recode as needed.

This isn't a minor corner of the spec, it's central to how headers are handled, and has been so long before Python 3 even existed.  To mess with it now means you break applications and frameworks that are already correctly written to follow the specs.

To put it in brief, the reported behavior is not a bug, it is a feature and by design.  A server that returns a UTF-8 decoded PATH_INFO is in violation of the spec, so the reference implementation of the spec should absolutely not do so.  ;-)
msg177591 - (view) Author: And Clover (aclover) Date: 2012-12-16 12:03
WSGI's usage of ISO-8859-1 for all HTTP-byte-originated strings is very much deliberate; we needed a way to preserve the original input bytes whilst still using unicode strings, and at the time surrogateescape was not available. The result is counter-intuitive but at least it is finally consistent; the expectation is that most web authors will be using some kind of web framework or input-reading library that will hide away the unpleasant details.

See http://mail.python.org/pipermail/web-sig/2007-December/thread.html#3002 and http://mail.python.org/pipermail/web-sig/2010-July/thread.html#4473 for the background discussion.

In any case we cannot assume a path is UTF-8 - not every URI is known to have come from an IRI so RFC 3987 does not necessarily apply. UTF-8-with-Latin1-fallback is also undesirable in itself as it adds ambiguity - an ISO-8859-1 byte sequence that by coincidence happens to be a valid UTF-8 byte sequence will get mangled.
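
A short sketch of that ambiguity. `decode_with_fallback` is a hypothetical function written only to show the failure mode; it is not taken from the attached patch.

```python
def decode_with_fallback(raw):
    """Hypothetical UTF-8-first decoder with an ISO-8859-1 fallback."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# A client that really uses ISO-8859-1 sends the two-character path
# segment U+00C3 U+00A9 as the bytes C3 A9.
intended = "/\u00c3\u00a9"
raw = intended.encode("iso-8859-1")  # b'/\xc3\xa9'

# Those bytes also happen to be valid UTF-8 (for U+00E9), so the
# fallback never triggers and the application sees a different path.
guessed = decode_with_fallback(raw)

print(repr(intended), repr(guessed))
```

After the guess, the application can no longer recover the original bytes by encoding back to ISO-8859-1, which is the property the fixed ISO-8859-1 rule is meant to guarantee.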
msg177650 - (view) Author: Claude Paroz (claudep) Date: 2012-12-17 16:44
Thanks for the explanations (and history). I realize that changing the behaviour is probably not an option.

As an example in a framework, we are currently discussing how we will cope with this in Django: https://code.djangoproject.com/ticket/19468

On the Python side, it might be worth adding an admonition about PATH_INFO and non-ASCII URLs to the wsgiref docs.
History
Date User Action Args
2012-12-17 16:44:29 claudep set nosy: + docs@python
messages: + msg177650
assignee: docs@python
components: + Documentation, - Library (Lib)
2012-12-16 12:03:32 aclover set messages: + msg177591
2012-12-16 02:48:48 pje set messages: + msg177578
2012-12-15 22:54:46 terry.reedy set nosy: + pje, aclover
2012-12-14 10:47:22 claudep set messages: + msg177457
2012-12-14 09:11:35 grahamd set messages: + msg177453
2012-12-14 08:59:43 claudep set files: + issue16679-1.diff
keywords: + patch
messages: + msg177451
2012-12-14 08:43:22 berker.peksag set versions: + Python 3.4, - Python 3.5
2012-12-14 08:42:39 grahamd set nosy: + grahamd
messages: + msg177450
2012-12-14 08:28:30 claudep create