Add advice about non-ASCII wsgiref PATH_INFO #60883

claudep · 2012-12-14T08:28:30Z

BPO	16679
Nosy	@pjeby, @bobince, @vadmium
Files	issue16679-1.diff: Decoding path with 'utf-8'

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2012-12-14.08:28:30.319>
labels = ['type-bug', 'docs']
title = 'Add advice about non-ASCII wsgiref PATH_INFO'
updated_at = <Date 2016-04-23.00:47:43.413>
user = 'https://bugs.python.org/claudep'

bugs.python.org fields:

activity = <Date 2016-04-23.00:47:43.413>
actor = 'martin.panter'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation']
creation = <Date 2012-12-14.08:28:30.319>
creator = 'claudep'
dependencies = []
files = ['28308']
hgrepos = []
issue_num = 16679
keywords = ['patch']
message_count = 14.0
messages = ['177449', '177450', '177451', '177453', '177457', '177578', '177591', '177650', '263862', '263872', '263878', '263879', '263928', '263934']
nosy_count = 8.0
nosy_names = ['pje', 'grahamd', 'aclover', 'docs@python', 'martin.panter', 'animus', 'claudep', 'Andrew Clover']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue16679'
versions = ['Python 3.4']

claudep · 2012-12-14T08:28:30Z

In wsgiref/simple_server.py (WSGIRequestHandler.get_environ), Python 3 is currently populating the env['PATH_INFO'] variable by decoding the URL path, assuming it was encoded with 'iso-8859-1', which appears to be wrong, according to RFC 3986/3987.
For example, if you request the path /سلام in any modern browser, PATH_INFO will contain "/Ø³Ù\x84Ø§Ù\x85".
'iso-8859-1' should be replaced by 'utf-8' for decoding.

Note that this was introduced as part of the fix for http://bugs.python.org/issue10155

grahamd · 2012-12-14T08:42:39Z

The requirement per PEP-3333 is that the original byte string needs to be converted to a native string (Unicode) with the ISO-8859-1 encoding. This is to ensure that the original bytes are preserved so that the WSGI application, with its own knowledge of what encoding the byte string was in, can then properly convert it to the correct encoding.

In other words, the WSGI server is not allowed to assume that the original byte string was UTF-8, because in practice it may not be and it cannot know what it is. The WSGI server must use ISO-8859-1. The WSGI application, if it needs it in UTF-8, must then convert it back to a byte string using ISO-8859-1 and then from there convert it back to a native string as UTF-8.

So if I understand what you are saying, you are suggesting a change which is incompatible with PEP-3333.

Please provide a code snippet or patch to show what you are proposing to be changed so it can be determined precisely what you are talking about.

claudep · 2012-12-14T08:59:43Z

Attached are my proposed changes.

Also, I just came across http://bugs.python.org/issue3300, which finally led Python urllib.parse.quote to default to UTF-8 encoding, after a lengthy discussion.

grahamd · 2012-12-14T09:11:34Z

You can't try UTF-8 and then fall back to ISO-8859-1. PEP-3333 requires it always be ISO-8859-1. If an application needs it as something else, it is the web application's job to do it.

The relevant part of the PEP is:

"""On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters."""

By converting as UTF-8 you would be breaking the requirement that only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive) are passed through.

So it is inconvenient if your expectation is that it will always be UTF-8, but that is how it has to work. This is because it could be something other than UTF-8, yet still be able to be successfully converted as UTF-8. In that case the application would get something totally different to the original, which is wrong.

So, the WSGI server cannot ever make any assumptions and the WSGI application always has to be the one which converts it to the correct Unicode string. The only way that can be done and still pass through a native string, is that it is done as ISO-8859-1 (which is byte preserving), allowing the application to go back to bytes and then back to Unicode in correct encoding.
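A minimal sketch of the round trip described above, assuming the client sent UTF-8 bytes. The helper name `recode_path_info` is hypothetical and is not part of wsgiref or PEP 3333; it only illustrates why ISO-8859-1 is byte preserving.

```python
def recode_path_info(path_info, encoding="utf-8"):
    """Recover the original bytes from a 'bytes as latin-1' string and
    decode them with the encoding the application expects (an assumption)."""
    # Lossless step: code points 0-255 map one-to-one onto bytes 0-255.
    raw = path_info.encode("iso-8859-1")
    return raw.decode(encoding)


# What a PEP 3333 server would hand over for a request to /سلام sent as UTF-8.
wire_bytes = "/سلام".encode("utf-8")
as_seen_by_app = wire_bytes.decode("iso-8859-1")

print(ascii(as_seen_by_app))
print(recode_path_info(as_seen_by_app))
```

If the bytes are not valid in the assumed encoding, `decode` raises `UnicodeDecodeError`, which leaves the choice of error handling to the application.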

claudep · 2012-12-14T10:47:22Z

I may understand your reasoning when you cannot make any assumptions about the encoding of a series of bytes.

I think that the case of PATH_INFO is different, because it should comply with standards, and then you *can* make the assumption that the original path is 'utf-8'-encoded. So either leave the string undecoded, or decode it to what the standards say. It would put an unnecessary burden on WSGI apps to always require them to encode and re-decode this string.
Wouldn't it be possible to amend PEP-3333?

Hopefully we can get some other opinions about this issue.

pjeby · 2012-12-16T02:48:47Z

> Wouldn't it be possible to amend PEP-3333?

Sure... except then it would also be necessary to amend PEP-3333, and also all WSGI applications already written that assume this, any time in the last nine years.

This is a known and intended consistent property of how WSGI handles HTTP headers. Under Python 2.x, PATH_INFO was a byte string (and still is), and to also maintain side-compatibility with Jython and IronPython, header strings are always maintained as "bytes in unicode form", with applications having responsibility to decode-recode as needed.

This isn't a minor corner of the spec, it's central to how headers are handled, and has been so long before Python 3 even existed. To mess with it now means you break applications and frameworks that are already correctly written to follow the specs.

To put it in brief, the reported behavior is not a bug, it is a feature and by design. A server that returns a UTF-8 decoded PATH_INFO is in violation of the spec, so the reference implementation of the spec should absolutely not do so. ;-)

bobince · 2012-12-16T12:03:32Z

WSGI's usage of ISO-8859-1 for all HTTP-byte-originated strings is very much deliberate; we needed a way to preserve the original input bytes whilst still using unicode strings, and at the time surrogateescape was not available. The result is counter-intuitive but at least it is finally consistent; the expectation is that most web authors will be using some kind of web framework or input-reading library that will hide away the unpleasant details.

See http://mail.python.org/pipermail/web-sig/2007-December/thread.html#3002 and http://mail.python.org/pipermail/web-sig/2010-July/thread.html#4473 for the background discussion.

In any case we cannot assume a path is UTF-8 - not every URI is known to have come from an IRI so RFC 3987 does not necessarily apply. UTF-8-with-Latin1-fallback is also undesirable in itself as it adds ambiguity - an ISO-8859-1 byte sequence that by coincidence happens to be a valid UTF-8 byte sequence will get mangled.
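The ambiguity of a UTF-8-with-Latin-1-fallback scheme can be shown with a two-byte example. This is only a sketch: `decode_with_fallback` is a hypothetical function written for illustration, not something wsgiref has ever provided.

```python
def decode_with_fallback(raw):
    """Hypothetical UTF-8-then-Latin-1 decoder, shown only to expose the ambiguity."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# A client that genuinely uses ISO-8859-1 sends the two characters "Ã©".
latin1_bytes = "Ã©".encode("iso-8859-1")  # b'\xc3\xa9'

# Those bytes are also valid UTF-8 for "é", so the fallback never triggers.
guessed = decode_with_fallback(latin1_bytes)

# Plain ISO-8859-1 decoding keeps the original bytes recoverable.
preserved = latin1_bytes.decode("iso-8859-1")

print(ascii(guessed), ascii(preserved))
```

With the fallback, the application can no longer tell which bytes arrived; with plain ISO-8859-1 it can always re-encode and get them back.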

claudep · 2012-12-17T16:44:29Z

Thanks for the explanations (and history). I realize that changing the behaviour is probably not an option.

As an example in a framework, we are currently discussing how we will cope with this in Django: https://code.djangoproject.com/ticket/19468

On the Python side, it might be worth adding an admonition about PATH_INFO and non-ascii URLs on the wsgiref docs.
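As a rough idea of what such advice could show, here is a sketch of a WSGI application that recovers a non-ASCII path. It assumes the application expects UTF-8 paths, and it uses a hand-built environ with `wsgiref.util.setup_testing_defaults` instead of a real server; the name `app` and the choice of `errors="replace"` are illustrative only.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    # PEP 3333: PATH_INFO is bytes decoded as ISO-8859-1, so undo that first.
    raw_path = environ.get("PATH_INFO", "").encode("iso-8859-1")
    # Assumption: this application expects UTF-8; undecodable bytes become U+FFFD.
    path = raw_path.decode("utf-8", errors="replace")
    body = ("You asked for " + path).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


# Exercise the app without a socket, using a hand-built environ.
environ = {"PATH_INFO": "/тест".encode("utf-8").decode("iso-8859-1")}
setup_testing_defaults(environ)  # fills in the remaining required keys
statuses = []
result = app(environ, lambda status, headers: statuses.append(status))
print(result[0].decode("utf-8"))
```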

animus · 2016-04-20T21:17:37Z

I'm from bpo-26808. I'd like to see some explanation on: how about the QUERY_STRING value? Why is only PATH_INFO encoded in such a manner, while QUERY_STRING is passed without any changes and does not require any latin-1 to utf-8 recoding?

grahamd · 2016-04-21T03:53:25Z

As I commented on bpo-26808, it actually looks to me like the QUERY_STRING is processed fine and it is actually PATH_INFO that is not. I am confused at this point. I hate dealing with these WSGI level details now. :-(

vadmium · 2016-04-21T04:58:10Z

PEP-3333 defers to a draft CGI specification for PATH_INFO and QUERY_STRING: <https://tools.ietf.org/html/draft-coar-cgi-v11-03>. (Dunno why it didn’t reference the final RFC 3875 instead, published 2004.) Anyway, both draft and final RFCs say “PATH_INFO is not URL-encoded”, but “the QUERY_STRING variable contains a URL-encoded search or parameter string”.

Graham, maybe you are seeing Latin-1 code points in PATH_INFO that have been translated from the %XX URL syntax, and QUERY_STRING retaining the original %XX syntax.
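The difference between the two variables can be sketched with `urllib.parse`. The request target below is made up for illustration, and the two assignments only imitate what a CGI-style gateway is described as doing; they are not taken from any particular server's source.

```python
from urllib.parse import parse_qs, unquote_to_bytes

# Illustrative request target with the same escaped text in path and query.
request_target = "/a=%D1%82%D0%B5%D1%81%D1%82?a=%D1%82%D0%B5%D1%81%D1%82"
raw_path, _, raw_query = request_target.partition("?")

# PATH_INFO: percent-decoded by the gateway, then exposed as latin-1 text.
path_info = unquote_to_bytes(raw_path).decode("iso-8859-1")

# QUERY_STRING: keeps its %XX escapes, so it stays pure ASCII.
query_string = raw_query

# The application recodes the path, and lets parse_qs unescape the query
# (parse_qs decodes percent-escapes as UTF-8 by default).
decoded_path = path_info.encode("iso-8859-1").decode("utf-8")
decoded_query = parse_qs(query_string)

print(ascii(path_info))
print(query_string)
print(decoded_path, decoded_query)
```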

grahamd · 2016-04-21T05:16:09Z

What I get in Apache for:

http://127.0.0.1:8000/a=тест

in Safari browser is:

'REQUEST_URI': '/a=%D1%82%D0%B5%D1%81%D1%82',
'PATH_INFO': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',

Whereas for curl I see:

'REQUEST_URI': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
'PATH_INFO': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',

For:

http://127.0.0.1:8000/?a=тест

in Safari get:

'REQUEST_URI': '/?a=%D1%82%D0%B5%D1%81%D1%82',
'PATH_INFO': '/'
'QUERY_STRING': 'a=%D1%82%D0%B5%D1%81%D1%82',

and curl:

'REQUEST_URI': '/?a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
'PATH_INFO': '/',
'QUERY_STRING': 'a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',

So yes curl sends as bytes rather than encoded, expected that.

Gunicorn on:

http://127.0.0.1:8000/a=тест

sees for Safari:

'PATH_INFO': '/a=Ñ\x82ÐµÑ\x81Ñ\x82',
'QUERY_STRING': '',
'RAW_URI': '/a=%D1%82%D0%B5%D1%81%D1%82',

and curl:

'PATH_INFO': '/a=Ã\x91Â\x82Ã\x90ÂµÃ\x91Â\x81Ã\x91Â\x82',
'QUERY_STRING': '',
'RAW_URI': '/a=Ñ\x82ÐµÑ\x81Ñ\x82',

Gunicorn on:

http://127.0.0.1:8000/?a=тест

sees for Safari:

'PATH_INFO': '/',
'QUERY_STRING': 'a=%D1%82%D0%B5%D1%81%D1%82',
'RAW_URI': '/?a=%D1%82%D0%B5%D1%81%D1%82',

and curl:

'PATH_INFO': '/',
'QUERY_STRING': 'a=Ñ\x82ÐµÑ\x81Ñ\x82',
'RAW_URI': '/?a=Ñ\x82ÐµÑ\x81Ñ\x82',

So in Apache I get through UTF-8 byte string as Latin-1. So can see multi byte characters. Gunicorn is doing something different when gets raw bytes from curl. As does wsgiref. Showed Gunicorn as it has RAW_URI which is supposed to be the same as REQUEST_URI in Apache, but actually isn't showing the same there either.

Whatever is happening, mod_wsgi still gives a good outcome.

AndrewClover · 2016-04-21T17:55:05Z

> Why only PATH_INFO is encoded in such a manner, but QUERY_STRING is passed without any changes and does not require any latin-1 to utf-8 recodings?

Laziness: QUERY_STRING should be pure-ASCII, making any such transcoding a no-op.

In principle a user agent *can* submit non-ASCII characters in a query string without %-encoding them, but it's not standards-conformant and most browsers don't usually do it (exception: apparently curl as above), so it's not worth adding a layer of hopefully-fixing-but-potentially-mangling to this variable to support a situation that shouldn't arise for normal requests.

PATH_INFO only requires special handling because of the sad, sad historical artefact of the CGI spec requiring it to have URL-decoding applied to it at the gateway, thus making the non-ASCII characters pop out of the percentage woodwork.

@graham can you share more about how those test results were generated and displayed? The Gunicorn results are about what I would expect - the double-decoding of PATH_INFO is arguably undesirable when curl submits raw bytes, but ultimately that's an unspecified situation so I don't really care.

The output from Apache, on the other hand, is odd - something appears to have mangled the results at the reporting stage as not only is there double-decoding but also some double-backslashes. It looks like the strings have been put through ascii(repr()) or something?
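The single-decoded and double-decoded strings in the Gunicorn output above can be reproduced in a few lines. This sketch only shows that the reported values are consistent with one and two UTF-8-to-Latin-1 passes; it makes no claim about where in Gunicorn, curl or Apache the second pass happens.

```python
original = "тест"

# One pass: UTF-8 bytes read as ISO-8859-1 (the PEP 3333 representation).
once = original.encode("utf-8").decode("iso-8859-1")

# Two passes: that text is encoded to UTF-8 again and read as ISO-8859-1 again.
twice = once.encode("utf-8").decode("iso-8859-1")

print(ascii(once))
print(ascii(twice))

# Each pass is reversible, so the original text can still be recovered.
recovered = twice.encode("iso-8859-1").decode("utf-8").encode("iso-8859-1").decode("utf-8")
print(recovered)
```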

grahamd · 2016-04-21T19:54:39Z

Double backslashes would possibly be an artefact of some mess that happens when logging out through the Apache error log.

claudep mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 14, 2012

claudep mannequin added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Dec 17, 2012

claudep mannequin assigned docspython Dec 17, 2012

vadmium changed the title ~~Wrong URL path decoding~~ Add advice about non-ASCII wsgiref PATH_INFO Apr 9, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022
