classification
Title: Add fixups for encoding problems to wsgiref
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: aclover, eric.araujo, orsenthil, pje
Priority: normal
Keywords: patch

Created on 2010-10-20 16:23 by aclover, last changed 2012-12-17 20:29 by aclover. This issue is now closed.

Files
File name Uploaded Description
wsgiref-patches-2.7.patch aclover, 2010-10-20 16:25 Patch against wsgiref in Python 2.7
wsgiref-patches-eby2692.patch aclover, 2010-10-20 23:36 Patch against PJE's Python 2.x wsgiref branch
wsgiref-patches-3.2a3.proper.patch aclover, 2010-10-24 00:06 Patch against wsgiref in py3k branch
Messages (10)
msg119220 - Author: And Clover (aclover) Date: 2010-10-20 16:23
Currently wsgiref's CGIHandler makes a WSGI environ from the CGI environ without changes.

Unfortunately the CGI environ is wrong in a number of common circumstances:

- on Windows, the native environ is Unicode, and different servers choose different decodings for HTTP bytes to store in the environ (most notably for PATH_INFO);

- on Windows with Python 2.x, os.environ is read from the Unicode native environ using the ANSI encoding, which will lose/mangle non-ASCII characters;

- on Posix with Python 3.x, os.environ is read from a native bytes environ using the filesystem encoding, which is probably not ISO-8859-1;

- on IIS, PATH_INFO inappropriately includes SCRIPT_NAME unless a hidden, rarely-used, and problematic config option is applied.
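
The IIS case in the last point can be sketched as follows. This is a minimal illustration, not the patch itself: the helper name `strip_script_name` is hypothetical, and it assumes SCRIPT_NAME appears as a literal prefix of PATH_INFO.

```python
def strip_script_name(environ):
    """Return a copy of environ with a leading SCRIPT_NAME removed from PATH_INFO.

    Hypothetical helper for illustration; assumes IIS has prepended
    SCRIPT_NAME to PATH_INFO verbatim.
    """
    fixed = dict(environ)
    path = fixed.get("PATH_INFO", "")
    script = fixed.get("SCRIPT_NAME", "")
    # Compare with a trailing slash so that "/app" does not match "/application".
    if script and (path + "/").startswith(script + "/"):
        fixed["PATH_INFO"] = path[len(script):]
    return fixed


iis_environ = {"SCRIPT_NAME": "/app.py", "PATH_INFO": "/app.py/x/y"}
print(strip_script_name(iis_environ)["PATH_INFO"])
```

The trailing-slash comparison is a design choice to avoid stripping a script name that merely shares a prefix with the first path segment.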

Previously, it was not clear in PEP 333 what was supposed to happen with headers and encodings, especially under Python 3. PEP 3333 clears this up. These patches add fixups to wsgiref that try to generate an environ as close as possible to 'correct' under PEP 3333 for the current platform and server software.
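
For the Posix/Python 3 case, the kind of fixup involved can be sketched like this. It is a rough illustration under stated assumptions, not the patch's code: the function names are hypothetical, and it assumes the value was decoded from the native bytes environ with the filesystem encoding and the surrogateescape error handler, so the original bytes can be recovered and re-decoded as ISO-8859-1.

```python
import sys


def fs_decoded_to_wsgi(value, fs_encoding=None):
    """Re-decode an os.environ string as a PEP 3333 'bytes-in-unicode' string.

    Hypothetical helper: assumes value was decoded with fs_encoding and
    the surrogateescape error handler.
    """
    encoding = fs_encoding or sys.getfilesystemencoding()
    # Recover the original bytes; surrogateescape restores undecodable bytes.
    raw = value.encode(encoding, "surrogateescape")
    # ISO-8859-1 maps every byte to exactly one code point, so this cannot fail.
    return raw.decode("iso-8859-1")


def wsgi_to_text(value, charset="utf-8"):
    """What an application might do to get real text back out of the environ."""
    return value.encode("iso-8859-1").decode(charset)


# Simulate a UTF-8 PATH_INFO as read through a UTF-8 filesystem encoding.
as_read = b"/caf\xc3\xa9".decode("utf-8", "surrogateescape")
wsgi_value = fs_decoded_to_wsgi(as_read, "utf-8")
print(ascii(wsgi_value), ascii(wsgi_to_text(wsgi_value)))
```

The round trip is lossless for any byte sequence, which is why the application, rather than the server glue, can decide the real charset.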

They also fix simple_server to use the correct encoding for PATH_INFO, and include the fix for issue 9022, correspondingly updating the simple_server demo app and tests to conform to PEP 3333's expectation that headers will be ISO-8859-1-decoded Unicode strings. The test_bytes_validation test is removed: as I understand it, it's no longer allowed to use byte string headers/status.
msg119221 - Author: And Clover (aclover) Date: 2010-10-20 16:25
(patch for Python 2.x, for what it's worth)
msg119244 - Author: And Clover (aclover) Date: 2010-10-20 23:36
(same again for PJ Eby's wsgiref svn branch: same as the previous 2.7 patch aside from the line numbers)
msg119395 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-10-22 17:40
Your patch adds a new handler, which is arguably a new feature that has to be rejected in a bugfix branch.
msg119480 - Author: And Clover (aclover) Date: 2010-10-24 00:06
Ah, sorry, I submitted the wrong patch against 3.2; disregard it. Here's the 'proper' version. The functionality isn't changed; the former patch just had an unused clause, disabled with `and False`, for reading environb, which in the end I decided not to use, as the surrogateescape approach already covers it just as well for values.

@Éric: yes. Actually the whole patch is pretty much new functionality, which should not be considered for a 2.7.x bugfix release. I've submitted a patch against 2.7 for completeness and for the use of a separately-maintained post-2.7 wsgiref, but unless there is ever a Python 2.8 it should never hit stdlib.

The status quo wrt Unicode in environ is broken and inconsistent, which an accepted PEP 3333 would finally clear up. But there may be webapps deployed that rely on their particular server's current inconsistent environ, and those shouldn't be broken by a bugfix 2.7 or 3.1 release.
msg120354 - Author: PJ Eby (pje) * (Python committer) Date: 2010-11-03 23:21
Committed to Py3K in r86146, with added docs and a larger list of transcodable CGI variables.
msg120377 - Author: And Clover (aclover) Date: 2010-11-04 03:55
Thanks.

Some of those additions in _needs_transcode are potentially controversial, though. I'm not wholly sure it's the right thing to transcode these.

Some of them may not actually come from the request, eg `REMOTE_USER` may be filled in by IIS's Windows authentication using a native-Unicode string from the Windows user database. Is it the right thing to turn it into UTF-8-bytes-in-Unicode for consistency with Apache? Maybe. (At least most of the other new envvars will never see a non-ASCII character or, in `REMOTE_IDENT`'s case, never be used for anything.)

The case with the REDIRECT_HTTP_ and SSL_ envvars is an interesting one. Whilst transcoding them at some point will very probably be what applications need to do if they want to actually use them, is it within CGIHandler's remit to change Apache mod-specific variables that are not specified by CGI or WSGI?

(There might, after all, be lots of these to catch for other mods and servers, and it's *conceivable* that somebody might be re-using one of these names to set in the environment for some other purpose, in which case transcoding would add an unexpected mangling. We can't in the general case expect users to know to avoid envvar names that are used as non-standard extensions in all servers.)

REDIRECT_HTTP_ at least comes from the HTTP request, so I guess the consistency is good there. (But then I think the only header that actually may contain non-ASCII is REDIRECT_URL, which replaces the unescaped SCRIPT_NAME and PATH_INFO; that one isn't caught at the moment.)
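
To make the trade-off concrete, here is a rough sketch of a 'liberal' name test of the kind discussed above. It is not the committed `_needs_transcode`: the function name `needs_transcode`, the `REQUEST_VARS` set, and the exact prefixes are assumptions chosen for illustration from the variables named in this thread.

```python
# Illustrative only: this set is an assumption, not the committed list.
REQUEST_VARS = frozenset({
    "SCRIPT_NAME", "PATH_INFO", "QUERY_STRING", "REMOTE_USER", "REMOTE_IDENT",
})


def needs_transcode(name):
    """Guess whether an envvar carries request bytes that should be recoded."""
    if name in REQUEST_VARS or name.startswith(("HTTP_", "SSL_")):
        return True
    if name.startswith("REDIRECT_"):
        # Apache re-exports variables with a REDIRECT_ prefix; test the rest.
        return needs_transcode(name[len("REDIRECT_"):])
    return False


for name in ("PATH_INFO", "REDIRECT_HTTP_ACCEPT", "SSL_CLIENT_S_DN",
             "REDIRECT_URL", "PATH"):
    print(name, needs_transcode(name))
```

In this sketch REDIRECT_URL falls through untouched, which mirrors the gap described above, while any unrelated variable that happens to start with SSL_ would be recoded.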
msg124211 - Author: PJ Eby (pje) * (Python committer) Date: 2010-12-17 15:36
So, do you have any suggestions for a specific change to the patch?
msg124229 - Author: And Clover (aclover) Date: 2010-12-17 16:59
No, not specifically. My patch is conservative about what variables it recodes, yours more liberal, but it's difficult to say which is the better approach, or what PEP 3333 requires.

If you're happy with the current patch, go ahead, let's have it for 3.2; I don't foresee significant problems with it. It's unlikely anyone is going to be re-using the SSL_ or REDIRECT_ variable names for something other than what Apache uses them for. There might be some confusion from IIS users over what encoding REMOTE_USER should be in, but I can't see any consistent resolution for that issue, and we'll certainly be in a better position than we are now.
msg177667 - Author: And Clover (aclover) Date: 2012-12-17 20:29
(belated close-fixed)
History
Date User Action Args
2012-12-25 15:16:34 orsenthil link issue9022 superseder
2012-12-17 20:29:20 aclover set status: open -> closed; resolution: fixed; messages: + msg177667
2010-12-17 16:59:27 aclover set nosy: pje, orsenthil, eric.araujo, aclover; messages: + msg124229
2010-12-17 15:36:30 pje set nosy: pje, orsenthil, eric.araujo, aclover; messages: + msg124211
2010-11-04 03:55:04 aclover set messages: + msg120377; versions: - Python 2.7
2010-11-03 23:21:37 pje set messages: + msg120354
2010-10-24 00:06:49 aclover set files: + wsgiref-patches-3.2a3.proper.patch; messages: + msg119480; versions: - Python 3.1
2010-10-23 23:48:22 aclover set files: - wsgiref-patches-3.2a3.patch
2010-10-22 17:40:23 eric.araujo set nosy: + eric.araujo; messages: + msg119395; versions: + Python 3.1
2010-10-21 03:57:39 orsenthil set nosy: + orsenthil
2010-10-20 23:56:24 ned.deily set nosy: + pje
2010-10-20 23:36:10 aclover set files: + wsgiref-patches-eby2692.patch; messages: + msg119244
2010-10-20 16:25:27 aclover set files: + wsgiref-patches-2.7.patch; messages: + msg119221
2010-10-20 16:23:40 aclover create