Add advice about non-ASCII wsgiref PATH_INFO #60883

Open
claudep mannequin opened this issue Dec 14, 2012 · 14 comments
Labels
docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error

Comments

@claudep
Mannequin

claudep mannequin commented Dec 14, 2012

BPO 16679
Nosy @pjeby, @bobince, @vadmium
Files
  • issue16679-1.diff: Decoding path with 'utf-8'
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2012-12-14.08:28:30.319>
    labels = ['type-bug', 'docs']
    title = 'Add advice about non-ASCII wsgiref PATH_INFO'
    updated_at = <Date 2016-04-23.00:47:43.413>
    user = 'https://bugs.python.org/claudep'

    bugs.python.org fields:

    activity = <Date 2016-04-23.00:47:43.413>
    actor = 'martin.panter'
    assignee = 'docs@python'
    closed = False
    closed_date = None
    closer = None
    components = ['Documentation']
    creation = <Date 2012-12-14.08:28:30.319>
    creator = 'claudep'
    dependencies = []
    files = ['28308']
    hgrepos = []
    issue_num = 16679
    keywords = ['patch']
    message_count = 14.0
    messages = ['177449', '177450', '177451', '177453', '177457', '177578', '177591', '177650', '263862', '263872', '263878', '263879', '263928', '263934']
    nosy_count = 8.0
    nosy_names = ['pje', 'grahamd', 'aclover', 'docs@python', 'martin.panter', 'animus', 'claudep', 'Andrew Clover']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue16679'
    versions = ['Python 3.4']

    @claudep
    Mannequin Author

    claudep mannequin commented Dec 14, 2012

    In wsgiref/simple_server.py (WSGIRequestHandler.get_environ), Python 3 is currently populating the env['PATH_INFO'] variable by decoding the URL path, assuming it was encoded with 'iso-8859-1', which appears to be wrong, according to RFC 3986/3987.
    For example, if you request the path /سلام in any modern browser, PATH_INFO will contain "/Ø³Ù\x84Ø§Ù\x85".
    'iso-8859-1' should be replaced by 'utf-8' for decoding.

    Note that this was introduced as part of the fix for http://bugs.python.org/issue10155

    @claudep claudep mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Dec 14, 2012
    @grahamd
    Mannequin

    grahamd mannequin commented Dec 14, 2012

    The requirement per PEP-3333 is that the original byte string needs to be converted to a native string (Unicode) with the ISO-8859-1 encoding. This is to ensure that the original bytes are preserved so that the WSGI application, with its own knowledge of what encoding the byte string was in, can then properly convert it to the correct encoding.

    In other words, the WSGI server is not allowed to assume that the original byte string was UTF-8, because in practice it may not be and it cannot know what it is. The WSGI server must use ISO-8859-1. The WSGI application, if it needs it in UTF-8, must then convert it back to a byte string using ISO-8859-1 and from there convert it back to a native string as UTF-8.
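
A minimal sketch of the application-side round trip described above. The helper name `recode_path_info`, the default of UTF-8, and the `errors="replace"` policy are assumptions for illustration; they are not part of wsgiref or PEP 3333.

```python
def recode_path_info(environ, encoding="utf-8"):
    """Return PATH_INFO re-decoded with the encoding the application expects.

    Hypothetical helper: the name and the errors="replace" policy are
    illustrative choices, not part of any WSGI library.
    """
    raw = environ.get("PATH_INFO", "")
    # Step 1: recover the original bytes. ISO-8859-1 maps code points
    # U+0000..U+00FF one-to-one onto byte values 0..255.
    path_bytes = raw.encode("iso-8859-1")
    # Step 2: decode those bytes with the encoding the application
    # believes the client used.
    return path_bytes.decode(encoding, errors="replace")


# Simulate what a PEP 3333 server hands to the application for /سلام:
# the UTF-8 bytes of the path, decoded as ISO-8859-1.
environ = {"PATH_INFO": "/سلام".encode("utf-8").decode("iso-8859-1")}
path = recode_path_info(environ)
```

Here `environ["PATH_INFO"]` holds the "bytes in unicode form" value, and `path` is the string the application would actually want to route on.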

    So if I understand what you are saying, you are suggesting a change which is incompatible with PEP-3333.

    Please provide a code snippet or patch to show what you are proposing to be changed so it can be determined precisely what you are talking about.

    @claudep
    Mannequin Author

    claudep mannequin commented Dec 14, 2012

    Attached are my proposed changes.

    Also, I just came across http://bugs.python.org/issue3300, which finally led Python urllib.parse.quote to default to UTF-8 encoding, after a lengthy discussion.

    @grahamd
    Mannequin

    grahamd mannequin commented Dec 14, 2012

    You can't try UTF-8 and then fall back to ISO-8859-1. PEP-3333 requires it always be ISO-8859-1. If an application needs it as something else, it is the web application's job to do it.

    The relevant part of the PEP is:

    """On Python platforms where the str or StringType type is in fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive). It is a fatal error for an application to supply strings containing any other Unicode character or code point. Similarly, servers and gateways must not supply strings to an application containing any other Unicode characters."""

    By converting as UTF-8 you would be breaking the requirement that only code points representable in ISO-8859-1 encoding (\u0000 through \u00FF, inclusive) are passed through.

    So it is inconvenient if your expectation is that it will always be UTF-8, but that is how it has to work. This is because it could be something other than UTF-8, yet still be able to be successfully converted as UTF-8. In that case the application would get something totally different to the original, which is wrong.

    So, the WSGI server cannot ever make any assumptions and the WSGI application always has to be the one which converts it to the correct Unicode string. The only way that can be done and still pass through a native string, is that it is done as ISO-8859-1 (which is byte preserving), allowing the application to go back to bytes and then back to Unicode in correct encoding.
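
The "byte preserving" property mentioned above can be checked directly. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Every possible byte value, decoded as ISO-8859-1.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")

# Each resulting code point lies in U+0000..U+00FF, the range PEP 3333 allows.
in_range = all(ord(ch) <= 0xFF for ch in as_text)

# Encoding back yields exactly the original bytes, so nothing is lost.
round_trip_ok = as_text.encode("iso-8859-1") == all_bytes

# By contrast, a path decoded as UTF-8 contains code points above U+00FF,
# which a server must not pass to the application under PEP 3333.
utf8_text = "/سلام".encode("utf-8").decode("utf-8")
outside_range = [ch for ch in utf8_text if ord(ch) > 0xFF]
```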

    @claudep
    Mannequin Author

    claudep mannequin commented Dec 14, 2012

    I may understand your reasoning when you cannot make any assumptions about the encoding of a series of bytes.

    I think that the case of PATH_INFO is different, because it should comply with standards, and then you *can* make the assumption that the original path is 'utf-8'-encoded. So either leave the string undecoded, or decode it to what the standards say. It would put an unnecessary burden on WSGI apps to require them to always encode and re-decode this string.
    Wouldn't it be possible to amend PEP-3333?

    Hopefully we can get some other opinions about this issue.

    @pjeby
    Mannequin

    pjeby mannequin commented Dec 16, 2012

    > Wouldn't it be possible to amend PEP-3333?

    Sure... except then it would also be necessary to amend PEP-3333, and also all WSGI applications already written that assume this, any time in the last nine years.

    This is a known and intended consistent property of how WSGI handles HTTP headers. Under Python 2.x, PATH_INFO was a byte string (and still is), and to maintain also side-compatibility with Jython and IronPython, header strings are always maintained as "bytes in unicode form", with applications having responsibility to decode-recode as needed.

    This isn't a minor corner of the spec, it's central to how headers are handled, and has been so long before Python 3 even existed. To mess with it now means you break applications and frameworks that are already correctly written to follow the specs.

    To put it in brief, the reported behavior is not a bug, it is a feature and by design. A server that returns a UTF-8 decoded PATH_INFO is in violation of the spec, so the reference implementation of the spec should absolutely not do so. ;-)

    @bobince
    Mannequin

    bobince mannequin commented Dec 16, 2012

    WSGI's usage of ISO-8859-1 for all HTTP-byte-originated strings is very much deliberate; we needed a way to preserve the original input bytes whilst still using unicode strings, and at the time surrogateescape was not available. The result is counter-intuitive but at least it is finally consistent; the expectation is that most web authors will be using some kind of web framework or input-reading library that will hide away the unpleasant details.

    See http://mail.python.org/pipermail/web-sig/2007-December/thread.html#3002 and http://mail.python.org/pipermail/web-sig/2010-July/thread.html#4473 for the background discussion.

    In any case we cannot assume a path is UTF-8 - not every URI is known to have come from an IRI so RFC 3987 does not necessarily apply. UTF-8-with-Latin1-fallback is also undesirable in itself as it adds ambiguity - an ISO-8859-1 byte sequence that by coincidence happens to be a valid UTF-8 byte sequence will get mangled.
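
To make the ambiguity concrete, here is a small sketch of a UTF-8-with-Latin-1-fallback decoder of the kind argued against above. The function name `utf8_with_latin1_fallback` is hypothetical and exists only for this illustration.

```python
def utf8_with_latin1_fallback(raw: bytes) -> str:
    """Hypothetical server-side guess: try UTF-8, fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# A client that uses ISO-8859-1 sends a path containing the two characters
# U+00C3 and U+00A9. On the wire these are the bytes C3 A9.
intended = "/\u00c3\u00a9"
wire = intended.encode("iso-8859-1")

# C3 A9 also happens to be valid UTF-8 (for U+00E9), so the guess
# silently produces a different string from the one the client meant.
guessed = utf8_with_latin1_fallback(wire)

# Plain ISO-8859-1 decoding keeps the bytes recoverable, leaving the
# decision to the application, which knows its own encoding.
preserved = wire.decode("iso-8859-1")
```

With the fallback approach the application can no longer tell which branch was taken, so it cannot undo a wrong guess.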

    @claudep
    Mannequin Author

    claudep mannequin commented Dec 17, 2012

    Thanks for the explanations (and history). I realize that changing the behaviour is probably not an option.

    As an example in a framework, we are currently discussing how we will cope with this in Django: https://code.djangoproject.com/ticket/19468

    On the Python side, it might be worth adding an admonition about PATH_INFO and non-ASCII URLs to the wsgiref docs.

    @claudep claudep mannequin added docs Documentation in the Doc dir and removed stdlib Python modules in the Lib dir labels Dec 17, 2012
    @claudep claudep mannequin assigned docspython Dec 17, 2012
    @vadmium vadmium changed the title Wrong URL path decoding Add advice about non-ASCII wsgiref PATH_INFO Apr 9, 2016
    @animus
    Mannequin

    animus mannequin commented Apr 20, 2016

    I'm from bpo-26808. I'd like to see some explanation of the QUERY_STRING value. Why is only PATH_INFO encoded in such a manner, while QUERY_STRING is passed without any changes and does not require any Latin-1 to UTF-8 recoding?

    @grahamd
    Mannequin

    grahamd mannequin commented Apr 21, 2016

    As I commented on bpo-26808, it actually looks to me like the QUERY_STRING is processed fine and it is actually PATH_INFO that is not. I am confused at this point. I hate dealing with these WSGI level details now. :-(

    @vadmium
    Member

    vadmium commented Apr 21, 2016

    PEP-3333 defers to a draft CGI specification for PATH_INFO and QUERY_STRING: <https://tools.ietf.org/html/draft-coar-cgi-v11-03>. (Dunno why it didn’t reference the final RFC 3875 instead, published 2004.) Anyway, both draft and final RFCs say “PATH_INFO is not URL-encoded”, but “the QUERY_STRING variable contains a URL-encoded search or parameter string”.

    Graham, maybe you are seeing Latin-1 code points in PATH_INFO that have been translated from the %XX URL syntax, and QUERY_STRING retaining the original %XX syntax.
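
The difference between the two variables can be sketched as follows. This is an illustrative approximation of what a CGI-style gateway does with a request target, not a copy of any server's code; the request target itself is a made-up example that combines the path and query cases tested in this thread.

```python
from urllib.parse import unquote

# Percent-encoded request target, as a browser would send it.
request_target = "/a=%D1%82%D0%B5%D1%81%D1%82?a=%D1%82%D0%B5%D1%81%D1%82"

path, _, query = request_target.partition("?")

# PATH_INFO is URL-decoded by the gateway. Decoding the resulting bytes as
# ISO-8859-1 makes each byte show up as one Latin-1 code point.
path_info = unquote(path, encoding="iso-8859-1")

# QUERY_STRING is passed through with its %XX escapes intact.
query_string = query

# The application can still recover the intended text from PATH_INFO.
recovered = path_info.encode("iso-8859-1").decode("utf-8")
```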

    @grahamd
    Mannequin

    grahamd mannequin commented Apr 21, 2016

    What I get in Apache for:

    http://127.0.0.1:8000/a=тест

    in Safari browser is:

    'REQUEST_URI': '/a=%D1%82%D0%B5%D1%81%D1%82',
    'PATH_INFO': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',

    Whereas for curl I see:

    'REQUEST_URI': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
    'PATH_INFO': '/a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',

    For:

    http://127.0.0.1:8000/?a=тест

    in Safari get:

    'REQUEST_URI': '/?a=%D1%82%D0%B5%D1%81%D1%82',
    'PATH_INFO': '/'
    'QUERY_STRING': 'a=%D1%82%D0%B5%D1%81%D1%82',

    and curl:

    'REQUEST_URI': '/?a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',
    'PATH_INFO': '/',
    'QUERY_STRING': 'a=\xc3\x91\\x82\xc3\x90\xc2\xb5\xc3\x91\\x81\xc3\x91\\x82',

    So yes, curl sends raw bytes rather than percent-encoding them; I expected that.

    Gunicorn on:

    http://127.0.0.1:8000/a=тест

    sees for Safari:

    'PATH_INFO': '/a=Ñ\x82ÐµÑ\x81Ñ\x82',
    'QUERY_STRING': '',
    'RAW_URI': '/a=%D1%82%D0%B5%D1%81%D1%82',

    and curl:

    'PATH_INFO': '/a=Ã\x91Â\x82Ã\x90ÂµÃ\x91Â\x81Ã\x91Â\x82',
    'QUERY_STRING': '',
    'RAW_URI': '/a=Ñ\x82ÐµÑ\x81Ñ\x82',

    Gunicorn on:

    http://127.0.0.1:8000/?a=тест

    sees for Safari:

    'PATH_INFO': '/',
    'QUERY_STRING': 'a=%D1%82%D0%B5%D1%81%D1%82',
    'RAW_URI': '/?a=%D1%82%D0%B5%D1%81%D1%82',

    and curl:

    'PATH_INFO': '/',
    'QUERY_STRING': 'a=Ñ\x82ÐµÑ\x81Ñ\x82',
    'RAW_URI': '/?a=Ñ\x82ÐµÑ\x81Ñ\x82',

    So in Apache the UTF-8 byte string comes through as Latin-1, so I can see the multi-byte characters. Gunicorn is doing something different when it gets raw bytes from curl, as does wsgiref. I showed Gunicorn because it has RAW_URI, which is supposed to be the same as REQUEST_URI in Apache, but it actually isn't showing the same value there either.

    Whatever is happening, mod_wsgi still gives a good outcome.

    @AndrewClover
    Mannequin

    AndrewClover mannequin commented Apr 21, 2016

    > Why only PATH_INFO is encoded in such a manner, but QUERY_STRING is passed without any changes and does not requires any latin-1 to utf-8 recodings?

    Laziness: QUERY_STRING should be pure-ASCII, making any such transcoding a no-op.

    In principle a user agent *can* submit non-ASCII characters in a query string without %-encoding them, but it's not standards-conformant and most browsers don't usually do it (exception: apparently curl as above), so it's not worth adding a layer of hopefully-fixing-but-potentially-mangling to this variable to support a situation that shouldn't arise for normal requests.
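
A short sketch of why the transcoding is a no-op for a standards-conformant query string. The sample value is the Safari QUERY_STRING reported earlier in this thread; relying on `parse_qs` with its default UTF-8 decoding is an assumption about what the application wants, not something WSGI mandates.

```python
from urllib.parse import parse_qs

# A conformant, fully percent-encoded query string is pure ASCII.
query_string = "a=%D1%82%D0%B5%D1%81%D1%82"
is_ascii = query_string.isascii()  # str.isascii() needs Python 3.7+

# For ASCII text, the Latin-1 to UTF-8 recoding changes nothing.
transcoded = query_string.encode("iso-8859-1").decode("utf-8")

# The application decodes the %XX escapes itself; parse_qs interprets
# them as UTF-8 by default.
params = parse_qs(query_string)
```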

    PATH_INFO only requires special handling because of the sad, sad historical artefact of the CGI spec requiring it to have URL-decoding applied to it at the gateway, thus making the non-ASCII characters pop out of the percentage woodwork.

    @graham can you share more about how those test results were generated and displayed? The Gunicorn results are about what I would expect - the double-decoding of PATH_INFO is arguably undesirable when curl submits raw bytes, but ultimately that's an unspecified situation so I don't really care.

    The output from Apache, on the other hand, is odd - something appears to have mangled the results at the reporting stage as not only is there double-decoding but also some double-backslashes. It looks like the strings have been put through ascii(repr()) or something?

    @grahamd
    Mannequin

    grahamd mannequin commented Apr 21, 2016

    Double backslashes would possibly be an artefact of some mess that happens when logging out through the Apache error log.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022