Add advice about non-ASCII wsgiref PATH_INFO #60883
Comments
In wsgiref/simple_server.py (WSGIRequestHandler.get_environ), Python 3 currently populates the env['PATH_INFO'] variable by decoding the URL path on the assumption that it was encoded with 'iso-8859-1', which appears to be wrong according to RFC 3986/3987. Note that this was introduced as part of the fix for http://bugs.python.org/issue10155
Per PEP-3333, the original byte string must be converted to a native string (Unicode) using the ISO-8859-1 encoding. This ensures that the original bytes are preserved, so that the WSGI application, with its own knowledge of what encoding the byte string was in, can then properly convert it to the correct encoding. In other words, the WSGI server is not allowed to assume that the original byte string was UTF-8, because in practice it may not be, and the server cannot know what it is. The WSGI server must use ISO-8859-1. If the WSGI application needs the value as UTF-8, it must convert it back to a byte string using ISO-8859-1 and then decode those bytes as UTF-8 to get a native string. So if I understand what you are saying, you are suggesting a change which is incompatible with PEP-3333. Please provide a code snippet or patch showing what you propose to change, so it can be determined precisely what you are talking about.
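A minimal sketch of the round trip described above. The helper names `server_side_decode` and `app_side_recode` are made up for illustration and are not part of wsgiref or PEP 3333; only the `encode`/`decode` calls are standard Python.

```python
def server_side_decode(raw_path: bytes) -> str:
    """What a PEP 3333 server does: one byte becomes one code point of the same value."""
    return raw_path.decode("iso-8859-1")


def app_side_recode(path_info: str, encoding: str = "utf-8") -> str:
    """What the application does once it knows (or decides) the real encoding."""
    return path_info.encode("iso-8859-1").decode(encoding)


raw = "/caf\u00e9".encode("utf-8")      # bytes as they arrived on the wire
path_info = server_side_decode(raw)     # native str, every code point <= U+00FF

print(ascii(path_info))                   # the "bytes in unicode form" value
print(ascii(app_side_recode(path_info)))  # the text the client meant
```

Because ISO-8859-1 maps every byte value to exactly one code point, `path_info.encode("iso-8859-1")` always gives back the original bytes, whatever encoding they were really in.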
Attached are my proposed changes. Also, I just came across http://bugs.python.org/issue3300, which finally led Python's urllib.parse.quote to default to UTF-8 encoding, after a lengthy discussion.
You can't try UTF-8 and then fall back to ISO-8859-1. PEP-3333 requires that it always be ISO-8859-1. If an application needs it as something else, it is the web application's job to do that. The relevant part of the PEP is: """On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters.""" By converting as UTF-8 you would be breaking the requirement that only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive) are passed through. So it is inconvenient if your expectation is that it will always be UTF-8, but that is how it has to work. This is because the input could be something other than UTF-8, yet still decode successfully as UTF-8. In that case the application would get something totally different from the original, which is wrong. So the WSGI server cannot ever make any assumptions, and the WSGI application always has to be the one which converts it to the correct Unicode string. The only way that can be done while still passing through a native string is to decode as ISO-8859-1 (which is byte preserving), allowing the application to go back to bytes and then back to Unicode in the correct encoding.
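To make the quoted code-point rule concrete, here is a small check, assuming a hypothetical helper `is_wsgi_safe` (not part of any library), that tests whether a string stays within U+0000 through U+00FF. A UTF-8-decoded Cyrillic path fails it; the ISO-8859-1-decoded form of the same bytes passes.

```python
def is_wsgi_safe(value: str) -> bool:
    """True if every code point is representable in ISO-8859-1 (U+0000..U+00FF)."""
    return all(ord(ch) <= 0xFF for ch in value)


raw = "/\u0442\u0435\u0441\u0442".encode("utf-8")  # Cyrillic path as UTF-8 bytes

as_latin1 = raw.decode("iso-8859-1")  # what PEP 3333 asks the server to pass
as_utf8 = raw.decode("utf-8")         # what the proposed change would pass

print(is_wsgi_safe(as_latin1))
print(is_wsgi_safe(as_utf8))
```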
I understand your reasoning for cases where you cannot make any assumptions about the encoding of a series of bytes. I think that the case of PATH_INFO is different, because it should comply with standards, and then you *can* make the assumption that the original path is 'utf-8'-encoded. So either leave the string undecoded, or decode it to what the standards say. It would put an unnecessary burden on WSGI apps to always require them to encode and re-decode this string. Hopefully we can get some other opinions about this issue.
Sure... except then it would also be necessary to amend PEP-3333, and also all WSGI applications already written that assume this, any time in the last nine years. This is a known and intended consistent property of how WSGI handles HTTP headers. Under Python 2.x, PATH_INFO was a byte string (and still is), and to also maintain side-compatibility with Jython and IronPython, header strings are always maintained as "bytes in unicode form", with applications having responsibility to decode-recode as needed. This isn't a minor corner of the spec; it's central to how headers are handled, and has been since long before Python 3 even existed. To mess with it now means you break applications and frameworks that are already correctly written to follow the specs. To put it in brief, the reported behavior is not a bug, it is a feature and by design. A server that returns a UTF-8 decoded PATH_INFO is in violation of the spec, so the reference implementation of the spec should absolutely not do so. ;-)
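As an illustration of an application taking on the decode-recode responsibility, here is a toy WSGI app invoked directly, without a network. The app, the `call` helper, and the choice to answer 400 for paths that are not valid UTF-8 are all assumptions made for this sketch; `wsgiref.util.setup_testing_defaults` is the only library call.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Toy WSGI app that recovers the path it was actually sent."""
    raw = environ.get("PATH_INFO", "").encode("iso-8859-1")  # back to wire bytes
    try:
        path = raw.decode("utf-8")  # assumption: this app's URLs are UTF-8
        status = "200 OK"
    except UnicodeDecodeError:
        path = "undecodable path"
        status = "400 Bad Request"
    body = path.encode("utf-8")
    start_response(status, [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call(path_bytes):
    """Invoke the app the way a PEP 3333 server would hand over PATH_INFO."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path_bytes.decode("iso-8859-1")
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen["status"] = status

    body = b"".join(app(environ, start_response))
    return seen["status"], body.decode("utf-8")


print(call("/\u0442\u0435\u0441\u0442".encode("utf-8"))[0])
print(call(b"/\xff")[0])
```

The point of the sketch is that the encoding decision, including what to do when it fails, lives in the application, which is the one component that knows what its URLs are supposed to look like.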
WSGI's usage of ISO-8859-1 for all HTTP-byte-originated strings is very much deliberate; we needed a way to preserve the original input bytes whilst still using unicode strings, and at the time surrogateescape was not available. The result is counter-intuitive but at least it is finally consistent; the expectation is that most web authors will be using some kind of web framework or input-reading library that will hide away the unpleasant details. See http://mail.python.org/pipermail/web-sig/2007-December/thread.html#3002 and http://mail.python.org/pipermail/web-sig/2010-July/thread.html#4473 for the background discussion. In any case we cannot assume a path is UTF-8 - not every URI is known to have come from an IRI so RFC 3987 does not necessarily apply. UTF-8-with-Latin1-fallback is also undesirable in itself as it adds ambiguity - an ISO-8859-1 byte sequence that by coincidence happens to be a valid UTF-8 byte sequence will get mangled.
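The ambiguity of UTF-8-with-Latin-1-fallback can be shown in a few lines. `decode_with_fallback` is a hypothetical stand-in for the rejected strategy, not code from any patch on this issue.

```python
def decode_with_fallback(raw: bytes) -> str:
    """The rejected strategy: try UTF-8 first, fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


intended = "\u00c3\u00a9"            # two Latin-1 characters: A-tilde, copyright sign
raw = intended.encode("iso-8859-1")  # b'\xc3\xa9', which is also valid UTF-8

guessed = decode_with_fallback(raw)  # silently becomes one character, e-acute
faithful = raw.decode("iso-8859-1")  # always recovers what was sent

print(ascii(guessed), ascii(faithful))
```

With the fallback, the application can no longer get the original bytes back from `guessed` unless it also knows which branch the server took.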
Thanks for the explanations (and history). I realize that changing the behaviour is probably not an option. As an example in a framework, we are currently discussing how we will cope with this in Django: https://code.djangoproject.com/ticket/19468 On the Python side, it might be worth adding an admonition about PATH_INFO and non-ASCII URLs to the wsgiref docs.
I'm from bpo-26808. I'd like to see some explanation of one thing: what about the QUERY_STRING value? Why is only PATH_INFO encoded in such a manner, while QUERY_STRING is passed without any changes and does not require any latin-1 to utf-8 recoding?
As I commented on bpo-26808, it actually looks to me like the QUERY_STRING is processed fine and it is actually PATH_INFO that is not. I am confused at this point. I hate dealing with these WSGI level details now. :-(
PEP-3333 defers to a draft CGI specification for PATH_INFO and QUERY_STRING: <https://tools.ietf.org/html/draft-coar-cgi-v11-03>. (Dunno why it didn’t reference the final RFC 3875 instead, published 2004.) Anyway, both draft and final RFCs say “PATH_INFO is not URL-encoded”, but “the QUERY_STRING variable contains a URL-encoded search or parameter string”. Graham, maybe you are seeing Latin-1 code points in PATH_INFO that have been translated from the %XX URL syntax, and QUERY_STRING retaining the original %XX syntax.
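The asymmetry between the two variables can be sketched as follows. `split_request_target` is a hypothetical helper and the request target is an invented example; this is only an approximation of what a CGI-style gateway does, not a copy of the wsgiref code.

```python
from urllib.parse import parse_qs, unquote


def split_request_target(target: str):
    """Return (PATH_INFO, QUERY_STRING) the way a CGI-style gateway would."""
    path, _, query = target.partition("?")
    # PATH_INFO is handed over URL-decoded, so the %XX bytes "pop out";
    # decoding them as ISO-8859-1 keeps one code point per byte.
    path_info = unquote(path, encoding="iso-8859-1")
    # QUERY_STRING stays URL-encoded, hence pure ASCII for conformant clients.
    return path_info, query


target = "/%D1%82%D0%B5%D1%81%D1%82?a=%D1%82%D0%B5%D1%81%D1%82"
path_info, query_string = split_request_target(target)

print(ascii(path_info))
print(query_string)
print(ascii(parse_qs(query_string)))  # the application picks the encoding here
```

Since the query string keeps its %XX escapes, the application chooses the encoding at parse time (`parse_qs` defaults to UTF-8), and no Latin-1 detour is needed for it.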
What I get in Apache for: in Safari browser is:
    'REQUEST_URI': '/a=%D1%82%D0%B5%D1%81%D1%82',
Whereas for curl I see:
    'REQUEST_URI': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
For: in Safari I get:
    'REQUEST_URI': '/?a=%D1%82%D0%B5%D1%81%D1%82',
and for curl:
    'REQUEST_URI': '/?a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
So yes, curl sends raw bytes rather than percent-encoding them, as expected.
Gunicorn on: sees for Safari:
    'PATH_INFO': '/a=Ñ\x82еÑ\x81Ñ\x82',
and for curl:
    'PATH_INFO': '/a=Ã\x91Â\x82Ã\x90µÃ\x91Â\x81Ã\x91Â\x82',
Gunicorn on: sees for Safari:
    'PATH_INFO': '/',
and for curl:
    'PATH_INFO': '/',
So in Apache the UTF-8 byte string comes through as Latin-1, and I can see the multi-byte characters. Gunicorn is doing something different when it gets raw bytes from curl, as does wsgiref. I showed Gunicorn because it has RAW_URI, which is supposed to be the same as REQUEST_URI in Apache, but it isn't actually showing the same thing there either. Whatever is happening, mod_wsgi still gives a good outcome.
Laziness: QUERY_STRING should be pure-ASCII, making any such transcoding a no-op. In principle a user agent *can* submit non-ASCII characters in a query string without %-encoding them, but it's not standards-conformant and most browsers don't usually do it (exception: apparently curl as above), so it's not worth adding a layer of hopefully-fixing-but-potentially-mangling to this variable to support a situation that shouldn't arise for normal requests. PATH_INFO only requires special handling because of the sad, sad historical artefact of the CGI spec requiring it to have URL-decoding applied to it at the gateway, thus making the non-ASCII characters pop out of the percentage woodwork. @graham can you share more about how those test results were generated and displayed? The Gunicorn results are about what I would expect - the double-decoding of PATH_INFO is arguably undesirable when curl submits raw bytes, but ultimately that's an unspecified situation so I don't really care. The output from Apache, on the other hand, is odd - something appears to have mangled the results at the reporting stage, as not only is there double-decoding but also some double-backslashes. It looks like the strings have been put through ascii(repr()) or something?
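For readers trying to follow the double-decoding, this sketch reproduces the effect on the same test word. It assumes the pipeline was "UTF-8 bytes read as Latin-1, re-encoded as UTF-8, read as Latin-1 again"; that is a guess at what happened, not something the servers' source confirms. The last line shows how `ascii(repr(...))` yields doubled backslashes, which is the hypothesis raised above and not an established cause.

```python
text = "\u0442\u0435\u0441\u0442"  # the Cyrillic test word used in the reports

# One pass: what PEP 3333 prescribes for percent-decoded path bytes.
once = text.encode("utf-8").decode("iso-8859-1")

# Two passes: already-Latin-1 text treated as if it still needed transcoding.
twice = once.encode("utf-8").decode("iso-8859-1")

print(ascii(once))
print(ascii(twice))

# Undoing it takes the same number of reverse steps.
recovered = twice.encode("iso-8859-1").decode("utf-8")
recovered = recovered.encode("iso-8859-1").decode("utf-8")
print(recovered == text)

# Reporting artefact: repr() escapes unprintable characters with a backslash,
# and ascii() then escapes that backslash again.
print(ascii(repr(once)))
```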
The double backslashes are possibly an artefact of some mess that happens when logging out through the Apache error log.