Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fixups for encoding problems to wsgiref #54364

Closed
bobince mannequin opened this issue Oct 20, 2010 · 10 comments
Closed

Add fixups for encoding problems to wsgiref #54364

bobince mannequin opened this issue Oct 20, 2010 · 10 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@bobince
Copy link
Mannequin

bobince mannequin commented Oct 20, 2010

BPO 10155
Nosy @pjeby, @orsenthil, @merwok, @bobince
Files
  • wsgiref-patches-2.7.patch: Patch against wsgiref in Python 2.7
  • wsgiref-patches-eby2692.patch: Patch against PJE's Python 2.x wsgiref branch
  • wsgiref-patches-3.2a3.proper.patch: Patch against wsgiref in py3k branch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2012-12-17.20:29:20.452>
    created_at = <Date 2010-10-20.16:23:40.437>
    labels = ['type-bug', 'library']
    title = 'Add fixups for encoding problems to wsgiref'
    updated_at = <Date 2012-12-17.20:29:20.451>
    user = 'https://github.com/bobince'

    bugs.python.org fields:

    activity = <Date 2012-12-17.20:29:20.451>
    actor = 'aclover'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-12-17.20:29:20.452>
    closer = 'aclover'
    components = ['Library (Lib)']
    creation = <Date 2010-10-20.16:23:40.437>
    creator = 'aclover'
    dependencies = []
    files = ['19304', '19309', '19348']
    hgrepos = []
    issue_num = 10155
    keywords = ['patch']
    message_count = 10.0
    messages = ['119220', '119221', '119244', '119395', '119480', '120354', '120377', '124211', '124229', '177667']
    nosy_count = 4.0
    nosy_names = ['pje', 'orsenthil', 'eric.araujo', 'aclover']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue10155'
    versions = ['Python 3.2']

    @bobince
    Copy link
    Mannequin Author

    bobince mannequin commented Oct 20, 2010

    Currently wsgiref's CGIHandler makes a WSGI environ from the CGI environ without changes.

    Unfortunately the CGI environ is wrong in a number of common circumstances:

    • on Windows, the native environ is Unicode, and different servers choose different decodings for HTTP bytes to store in the environ (most notably for PATH_INFO);

    • on Windows with Python 2.x, os.environ is read from the Unicode native environ using the ANSI encoding, which will lose/mangle non-ASCII characters;

    • on Posix with Python 3.x, os.environ is read from a native bytes environ using the filesystemencoding which is probably not ISO-8859-1.

    • on IIS, PATH_INFO inappropriately includes SCRIPT_NAME unless a hidden, rarely-used, and problematic config option is applied.

    Previously, it was not clear in PEP-333 what was supposed to happen with headers and encodings, especially under Python 3. PEP-3333 clears this up. These patches add fixups to wsgiref to try to generate the nearest to a 'correct' environ as per PEP-3333 as possible for the current platform and server software.

    They also fix simple_server to use the correct encoding for PATH_INFO, and include the fix for bpo-9022, correspondingly updating the simple_server demo app and tests to conform to PEP-3333's expectation that headers will be ISO-8859-1-decoded Unicode strings. The test_bytes_validation test is removed: as I understand it, it's no long allowed to use byte string headers/status.

    @bobince bobince mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Oct 20, 2010
    @bobince
    Copy link
    Mannequin Author

    bobince mannequin commented Oct 20, 2010

    (patch for Python 2.x, for what it's worth)

    @bobince
    Copy link
    Mannequin Author

    bobince mannequin commented Oct 20, 2010

    (same again for branch PJ Eby's wsgiref svn: same as previous 2.7 patch aside from the line numbers)

    @merwok
    Copy link
    Member

    merwok commented Oct 22, 2010

    Your patch adds a new handler, which is arguably a new feature that has to be rejected in a bugfix branch.

    @bobince
    Copy link
    Mannequin Author

    bobince mannequin commented Oct 24, 2010

    Ah, sorry, submitted wrong patch against 3.2, disregard. Here's the 'proper' version (the functionality isn't changed, just the former patch had an unused and-Falsed out clause for reading environb, which in the end I decided not to use as the surrogateescape approach already covers it just as well for values).

    @Éric: yes. Actually the whole patch is pretty much new functionality, which should not be considered for a 2.7.x bugfix release. I've submitted a patch against 2.7 for completeness and for the use of a separately-maintained post-2.7 wsgiref, but unless there is ever a Python 2.8 it should never hit stdlib.

    The status quo wrt Unicode in environ is broken and inconsistent, which an accepted PEP-3333 would finally clear up. But there may be webapps deployed that rely on their particular server's current inconsistent environ, and those shouldn't be broken by a bugfix 2.7 or 3.1 release.

    @pjeby
    Copy link
    Mannequin

    pjeby mannequin commented Nov 3, 2010

    Committed to Py3K in r86146, with added docs and a larger list of transcodable CGI variables.

    @bobince
    Copy link
    Mannequin Author

    bobince mannequin commented Nov 4, 2010

    Thanks.

    Some of those additions in _needs_transcode are potentially controversial, though. I'm not wholly sure it's the right thing to transcode these.

    Some of them may not actually come from the request, eg REMOTE_USER may be filled in by IIS's Windows authentication using a native-Unicode string from the Windows user database. Is it the right thing to turn it into UTF-8-bytes-in-Unicode for consistency with Apache? Maybe. (At least for most of the other new envvars there will never see a non-ASCII character. Or in REMOTE_IDENT's case never be used for anything.)

    The case with the REDIRECT_HTTP_ and SSL_ envvars is an interesting one. Whilst transcoding them at some point will very probably be what applications need to do if they want to actually use them, is it within CGIHandler's remit to change Apache mod-specific variables that are not specified by CGI or WSGI?

    (There might, after all, be lots of these to catch for other mods and servers, and it's *conceivable* that somebody might be re-using one of these names to set in the environment for some other purpose, in which case transcoding would be adding an unexpected mangling. We can't in the general case expect users to know to avoid envvar names are used as non-standard extensions in all servers.)

    REDIRECT_HTTP_ at least comes from the HTTP request, so I guess the consistency is good there. (But then I think the only header that actually may contain non-ASCII is REDIRECT_URL, which replaces the unescaped SCRIPT_NAME and PATH_INFO; that one isn't caught at the moment.)

    @pjeby
    Copy link
    Mannequin

    pjeby mannequin commented Dec 17, 2010

    So, do you have any suggestions for a specific change to the patch?

    @bobince
    Copy link
    Mannequin Author

    bobince mannequin commented Dec 17, 2010

    No, not specifically. My patch is conservative about what variables it recodes, yours more liberal, but it's difficult to say which is the better approach, or what PEP-3333 requires.

    If you're happy with the current patch, go ahead, let's have it for 3.2; I don't foresee significant problems with it. It's unlikely anyone is going to be re-using the SSL_ or REDIRECT_ variable names for something other than what Apache uses them for. There might be some confusion from IIS users over what encoding REMOTE_USER should be in, but I can't see any consistent resolution for that issue, and we'll certainly be in a better position than we are now.

    @bobince
    Copy link
    Mannequin Author

    bobince mannequin commented Dec 17, 2012

    (belated close-fixed)

    @bobince bobince mannequin closed this as completed Dec 17, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant