Created on 2012-12-14 08:28 by claudep, last changed 2016-04-23 00:47 by martin.panter.
|issue16679-1.diff||claudep, 2012-12-14 08:59||Decoding path with 'utf-8'||review|
|msg177449 - (view)||Author: Claude Paroz (claudep)||Date: 2012-12-14 08:28|
In wsgiref/simple_server.py (WSGIRequestHandler.get_environ), Python 3 is currently populating the env['PATH_INFO'] variable by decoding the URL path, assuming it was encoded with 'iso-8859-1', which appears to be wrong, according to RFC 3986/3987. For example, if you request the path /سلام in any modern browser, PATH_INFO will contain "/Ø³ÙØ§Ù". 'iso-8859-1' should be replaced by 'utf-8' for decoding. Note that this was introduced as part of the fix for http://bugs.python.org/issue10155
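The mangling described above can be reproduced without running a server. This is a minimal sketch, assuming the browser sends the path as percent-encoded UTF-8 and the server decodes the resulting bytes as ISO-8859-1; the variable names are illustrative only.

```python
# Reproduce the reported mojibake without a server.
original = "/سلام"

# Bytes as they arrive after percent-decoding (assumed to be UTF-8).
raw = original.encode("utf-8")

# A Latin-1 decode maps every byte to exactly one character.
path_info = raw.decode("iso-8859-1")

print(repr(path_info))
print(len(original), len(raw), len(path_info))
```

Some of the resulting characters are C1 control codes, which is probably why the string shown in the report looks shorter than the byte count.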
|msg177450 - (view)||Author: Graham Dumpleton (grahamd)||Date: 2012-12-14 08:42|
The requirement per PEP 3333 is that the original byte string needs to be converted to a native string (Unicode) with the ISO-8859-1 encoding. This is to ensure that the original bytes are preserved so that the WSGI application, with its own knowledge of what encoding the byte string was in, can then properly convert it to the correct encoding. In other words, the WSGI server is not allowed to assume that the original byte string was UTF-8, because in practice it may not be and it cannot know what it is. The WSGI server must use ISO-8859-1. The WSGI application, if it needs it in UTF-8, must then convert it back to a byte string using ISO-8859-1 and then from there convert it back to a native string as UTF-8. So if I understand what you are saying, you are suggesting a change which is incompatible with PEP 3333. Please provide a code snippet or patch to show what you are proposing to be changed so it can be determined precisely what you are talking about.
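The application-side conversion described in this message could look roughly like the following. The helper name `decode_path_info` is hypothetical and not part of wsgiref or any framework; the UTF-8 default is the application's own assumption.

```python
def decode_path_info(environ, encoding="utf-8"):
    """Return PATH_INFO re-decoded with the encoding the application expects."""
    # Under PEP 3333 the value is a native string of Latin-1 code points,
    # so encoding back to ISO-8859-1 recovers the original bytes exactly.
    raw = environ.get("PATH_INFO", "").encode("iso-8859-1")
    # Raises UnicodeDecodeError if the bytes are not valid in `encoding`.
    return raw.decode(encoding)


# Simulate what a conforming server would hand to the application.
environ = {"PATH_INFO": "/سلام".encode("utf-8").decode("iso-8859-1")}
decoded = decode_path_info(environ)
```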
|msg177451 - (view)||Author: Claude Paroz (claudep)||Date: 2012-12-14 08:59|
Attached are my proposed changes. Also, I just came across http://bugs.python.org/issue3300, which finally led Python urllib.parse.quote to default to UTF-8 encoding, after a lengthy discussion.
|msg177453 - (view)||Author: Graham Dumpleton (grahamd)||Date: 2012-12-14 09:11|
You can't try UTF-8 and then fall back to ISO-8859-1. PEP 3333 requires it always be ISO-8859-1. If an application needs it as something else, it is the web application's job to do it. The relevant part of the PEP is: """On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.""" By converting as UTF-8 you would be breaking the requirement that only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive) are passed through. So it is inconvenient if your expectation is that it will always be UTF-8, but that is how it has to work. This is because it could be something other than UTF-8, yet still be able to be successfully converted as UTF-8. In that case the application would get something totally different to the original, which is wrong. So, the WSGI server cannot ever make any assumptions and the WSGI application always has to be the one which converts it to the correct Unicode string. The only way that can be done and still pass through a native string is that it is done as ISO-8859-1 (which is byte preserving), allowing the application to go back to bytes and then back to Unicode in the correct encoding.
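The "byte preserving" property mentioned here can be checked directly. The sketch below is only an illustration of that claim; the sample byte string `b"/caf\xe9"` is made up for the example.

```python
all_bytes = bytes(range(256))

# Every byte maps to exactly one code point in U+0000..U+00FF, and back.
as_text = all_bytes.decode("iso-8859-1")
round_trip = as_text.encode("iso-8859-1")
max_code_point = max(ord(ch) for ch in as_text)

# UTF-8, by contrast, cannot decode every byte string.
try:
    b"/caf\xe9".decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A UTF-8-decoded path can also leave the range PEP 3333 allows.
beyond_latin1 = any(ord(ch) > 0xFF for ch in "/سلام")
```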
|msg177457 - (view)||Author: Claude Paroz (claudep)||Date: 2012-12-14 10:47|
I may understand your reasoning when you cannot make any assumptions about the encoding of a series of bytes. I think that the case of PATH_INFO is different, because it should comply with standards, and then you *can* make the assumption that the original path is 'utf-8'-encoded. So either leave the string undecoded, or decode it to what the standards say. It would put an unnecessary burden on WSGI apps to always be required to encode and redecode this string. Wouldn't it be possible to amend PEP 3333? Hopefully we can get some other opinions about this issue.
|msg177578 - (view)||Author: PJ Eby (pje) *||Date: 2012-12-16 02:48|
> Wouldn't it be possible to amend PEP 3333?
Sure... except then it would also be necessary to amend PEP 3333, and also all WSGI applications already written that assume this, any time in the last nine years. This is a known and intended consistent property of how WSGI handles HTTP headers. Under Python 2.x, PATH_INFO was a byte string (and still is), and to also maintain compatibility with Jython and IronPython, header strings are always maintained as "bytes in unicode form", with applications having responsibility to decode-recode as needed. This isn't a minor corner of the spec, it's central to how headers are handled, and has been so since long before Python 3 even existed. To mess with it now means you break applications and frameworks that are already correctly written to follow the specs. To put it in brief, the reported behavior is not a bug, it is a feature and by design. A server that returns a UTF-8 decoded PATH_INFO is in violation of the spec, so the reference implementation of the spec should absolutely not do so. ;-)
|msg177591 - (view)||Author: And Clover (aclover)||Date: 2012-12-16 12:03|
WSGI's usage of ISO-8859-1 for all HTTP-byte-originated strings is very much deliberate; we needed a way to preserve the original input bytes whilst still using unicode strings, and at the time surrogateescape was not available. The result is counter-intuitive but at least it is finally consistent; the expectation is that most web authors will be using some kind of web framework or input-reading library that will hide away the unpleasant details. See http://mail.python.org/pipermail/web-sig/2007-December/thread.html#3002 and http://mail.python.org/pipermail/web-sig/2010-July/thread.html#4473 for the background discussion. In any case we cannot assume a path is UTF-8 - not every URI is known to have come from an IRI so RFC 3987 does not necessarily apply. UTF-8-with-Latin1-fallback is also undesirable in itself as it adds ambiguity - an ISO-8859-1 byte sequence that by coincidence happens to be a valid UTF-8 byte sequence will get mangled.
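The ambiguity of a UTF-8-with-Latin-1-fallback scheme can be shown with a two-character example. The function `decode_with_fallback` below is hypothetical and exists only to expose the flaw; it is not what the attached patch or any server is known to do verbatim.

```python
def decode_with_fallback(raw):
    """Hypothetical UTF-8-then-Latin-1 decoder, shown only to expose its flaw."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# A path the client really did encode as ISO-8859-1: the characters "Ã" and "©".
intended = "/Ã©"
latin1_path = intended.encode("iso-8859-1")  # b'/\xc3\xa9'

# The same two bytes happen to be valid UTF-8 for a single "é".
guessed = decode_with_fallback(latin1_path)
```

Here `guessed` differs from `intended`, and the application has no way to tell that the server guessed wrong.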
|msg177650 - (view)||Author: Claude Paroz (claudep)||Date: 2012-12-17 16:44|
Thanks for the explanations (and history). I realize that changing the behaviour is probably not an option. As an example in a framework, we are currently discussing how we will cope with this in Django: https://code.djangoproject.com/ticket/19468 On the Python side, it might be worth adding an admonition about PATH_INFO and non-ASCII URLs to the wsgiref docs.
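As an illustration of the kind of handling an application or framework might adopt, here is a small WSGI application that re-decodes PATH_INFO as UTF-8 and rejects paths that are not valid UTF-8. This is a sketch under stated assumptions: the 400 response is one possible policy, not something the thread or Django prescribes, and the `call` helper is a made-up test harness.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Echo the requested path, decoded as UTF-8 (an application-level choice)."""
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    try:
        # Undo the server's Latin-1 decoding, then apply the real encoding.
        path = environ["PATH_INFO"].encode("iso-8859-1").decode("utf-8")
    except UnicodeDecodeError:
        start_response("400 Bad Request", headers)
        return [b"Path is not valid UTF-8\n"]
    start_response("200 OK", headers)
    return [path.encode("utf-8")]


def call(path_bytes):
    """Invoke the app directly with a PEP 3333 style environ (no sockets)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path_bytes.decode("iso-8859-1")
    seen = []
    body = app(environ, lambda status, response_headers: seen.append(status))
    return seen[0], b"".join(body)
```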
|msg263862 - (view)||Author: Alexey Gorshkov (animus)||Date: 2016-04-20 21:17|
I'm from Issue 26808. I'd like to see some explanation of the QUERY_STRING value. Why is only PATH_INFO encoded in such a manner, while QUERY_STRING is passed without any changes and does not require any latin-1 to utf-8 recoding?
|msg263872 - (view)||Author: Graham Dumpleton (grahamd)||Date: 2016-04-21 03:53|
As I commented on Issue 26808, it actually looks to me like the QUERY_STRING is processed fine and it is actually PATH_INFO that is not. I am confused at this point. I hate dealing with these WSGI level details now. :-(
|msg263878 - (view)||Author: Martin Panter (martin.panter) *||Date: 2016-04-21 04:58|
PEP 3333 defers to a draft CGI specification for PATH_INFO and QUERY_STRING: <https://tools.ietf.org/html/draft-coar-cgi-v11-03>. (Dunno why it didn’t reference the final RFC 3875 instead, published 2004.) Anyway, both draft and final RFCs say “PATH_INFO is not URL-encoded”, but “the QUERY_STRING variable contains a URL-encoded search or parameter string”. Graham, maybe you are seeing Latin-1 code points in PATH_INFO that have been translated from the %XX URL syntax, and QUERY_STRING retaining the original %XX syntax.
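The difference between the two variables can be sketched with `urllib.parse.unquote`. The function `split_request_target` is a hypothetical simplification of what a gateway does, assuming the request target contains at most one `?` separator and no fragment.

```python
from urllib.parse import unquote


def split_request_target(target):
    """Split a request target into (PATH_INFO, QUERY_STRING), CGI style."""
    path, _, query = target.partition("?")
    # PATH_INFO is URL-decoded; the bytes are exposed as Latin-1 code points.
    path_info = unquote(path, encoding="iso-8859-1")
    # QUERY_STRING keeps its %XX escapes untouched.
    return path_info, query


path_only = split_request_target("/a=%D1%82%D0%B5%D1%81%D1%82")
with_query = split_request_target("/?a=%D1%82%D0%B5%D1%81%D1%82")
```

With these inputs, only `path_only[0]` contains non-ASCII characters; the query string in `with_query[1]` is still the original percent-encoded ASCII.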
|msg263879 - (view)||Author: Graham Dumpleton (grahamd)||Date: 2016-04-21 05:16|
What I get in Apache for:
    http://127.0.0.1:8000/a=тест
in Safari browser is:
    'REQUEST_URI': '/a=%D1%82%D0%B5%D1%81%D1%82',
    'PATH_INFO': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
Whereas for curl I see:
    'REQUEST_URI': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
    'PATH_INFO': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
For:
    http://127.0.0.1:8000/?a=тест
in Safari I get:
    'REQUEST_URI': '/?a=%D1%82%D0%B5%D1%81%D1%82',
    'PATH_INFO': '/'
    'QUERY_STRING': 'a=%D1%82%D0%B5%D1%81%D1%82',
and curl:
    'REQUEST_URI': '/?a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
    'PATH_INFO': '/',
    'QUERY_STRING': 'a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
So yes, curl sends as bytes rather than encoded; expected that.
Gunicorn on:
    http://127.0.0.1:8000/a=тест
sees for Safari:
    'PATH_INFO': '/a=Ñ\x82ÐµÑ\x81Ñ\x82',
    'QUERY_STRING': '',
    'RAW_URI': '/a=%D1%82%D0%B5%D1%81%D1%82',
and curl:
    'PATH_INFO': '/a=Ã\x91Â\x82Ã\x90ÂµÃ\x91Â\x81Ã\x91Â\x82',
    'QUERY_STRING': '',
    'RAW_URI': '/a=Ñ\x82ÐµÑ\x81Ñ\x82',
Gunicorn on:
    http://127.0.0.1:8000/?a=тест
sees for Safari:
    'PATH_INFO': '/',
    'QUERY_STRING': 'a=%D1%82%D0%B5%D1%81%D1%82',
    'RAW_URI': '/?a=%D1%82%D0%B5%D1%81%D1%82',
and curl:
    'PATH_INFO': '/',
    'QUERY_STRING': 'a=Ñ\x82ÐµÑ\x81Ñ\x82',
    'RAW_URI': '/?a=Ñ\x82ÐµÑ\x81Ñ\x82',
So in Apache I get the UTF-8 byte string through as Latin-1, and so can see multi-byte characters. Gunicorn is doing something different when it gets raw bytes from curl. As does wsgiref. I showed Gunicorn as it has RAW_URI, which is supposed to be the same as REQUEST_URI in Apache, but actually isn't showing the same there either. Whatever is happening, mod_wsgi still gives a good outcome.
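The two Gunicorn PATH_INFO values for `/a=тест` are consistent with the UTF-8 bytes being Latin-1-decoded once (Safari) and twice (curl). The sketch below reproduces both strings under that assumption; it does not claim to describe Gunicorn's actual code path, and the helper `undo` is hypothetical.

```python
text = "/a=тест"

# Browser case: %XX escapes are unquoted to UTF-8 bytes, exposed as Latin-1.
once = text.encode("utf-8").decode("iso-8859-1")

# curl case (assumed): the already Latin-1-decoded string is encoded as
# UTF-8 and then Latin-1-decoded a second time.
twice = once.encode("utf-8").decode("iso-8859-1")


def undo(value, rounds):
    """Reverse `rounds` layers of Latin-1 decoding applied to UTF-8 bytes."""
    for _ in range(rounds):
        value = value.encode("iso-8859-1").decode("utf-8")
    return value
```

One round of `undo` recovers the text in the browser case, while the curl case needs two.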
|msg263928 - (view)||Author: Andrew Clover (Andrew Clover)||Date: 2016-04-21 17:55|
> Why is only PATH_INFO encoded in such a manner, while QUERY_STRING is passed without any changes and does not require any latin-1 to utf-8 recoding?
Laziness: QUERY_STRING should be pure-ASCII, making any such transcoding a no-op. In principle a user agent *can* submit non-ASCII characters in a query string without %-encoding them, but it's not standards-conformant and most browsers don't usually do it (exception: apparently curl as above), so it's not worth adding a layer of hopefully-fixing-but-potentially-mangling to this variable to support a situation that shouldn't arise for normal requests. PATH_INFO only requires special handling because of the sad, sad historical artefact of the CGI spec requiring it to have URL-decoding applied to it at the gateway, thus making the non-ASCII characters pop out of the percentage woodwork. @Graham can you share more about how those test results were generated and displayed? The Gunicorn results are about what I would expect - the double-decoding of PATH_INFO is arguably undesirable when curl submits raw bytes, but ultimately that's an unspecified situation so I don't really care. The output from Apache, on the other hand, is odd - something appears to have mangled the results at the reporting stage as not only is there double-decoding but also some double-backslashes. It looks like the strings have been put through ascii(repr()) or something?
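The "no-op" point about QUERY_STRING can be checked with the percent-encoded value from the Safari results earlier in the thread. This sketch assumes the application parses the query with `urllib.parse.parse_qs`, which resolves the %XX escapes as UTF-8 by default.

```python
from urllib.parse import parse_qs

query_string = "a=%D1%82%D0%B5%D1%81%D1%82"

# Pure ASCII: the Latin-1 to UTF-8 round trip changes nothing.
recoded = query_string.encode("iso-8859-1").decode("utf-8")

# The %XX escapes are resolved later, by the application, as UTF-8.
params = parse_qs(query_string)
```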
|msg263934 - (view)||Author: Graham Dumpleton (grahamd)||Date: 2016-04-21 19:54|
Double backslashes would possibly be an artefact of some mess that happens when logging out through the Apache error log.
|2016-04-23 00:47:43||martin.panter||set||stage: needs patch|
|2016-04-21 19:54:39||grahamd||set||messages: + msg263934|
|2016-04-21 17:55:05||Andrew Clover||set||nosy:
+ Andrew Clover|
messages: + msg263928
|2016-04-21 05:16:09||grahamd||set||messages: + msg263879|
messages: + msg263878
|2016-04-21 03:53:25||grahamd||set||messages: + msg263872|
messages: + msg263862
|2016-04-20 12:44:42||martin.panter||link||issue26808 superseder|
|2016-04-09 07:11:50||martin.panter||set||title: Wrong URL path decoding -> Add advice about non-ASCII wsgiref PATH_INFO|
messages: + msg177650
components: + Documentation, - Library (Lib)
|2012-12-16 12:03:32||aclover||set||messages: + msg177591|
|2012-12-16 02:48:48||pje||set||messages: + msg177578|
+ pje, aclover|
|2012-12-14 10:47:22||claudep||set||messages: + msg177457|
|2012-12-14 09:11:35||grahamd||set||messages: + msg177453|
keywords: + patch
messages: + msg177451
|2012-12-14 08:43:22||berker.peksag||set||versions: + Python 3.4, - Python 3.5|
messages: + msg177450