classification
Title: wsgiref on Python 3.x incorrectly implements URL handling causing mangled Unicode
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.4, Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: apollo13, aronacher, barry, cvrebert, grahamd, haypo, orsenthil, python-dev, r.david.murray, serhiy.storchaka, terry.reedy
Priority: normal
Keywords: patch

Created on 2014-01-06 09:46 by aronacher, last changed 2014-01-14 09:02 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
wsgiref_latin1.patch serhiy.storchaka, 2014-01-11 09:46
Messages (13)
msg207418 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2014-01-06 09:46
While reading through someone else's WSGI framework, I noticed that wsgiref handles URLs incorrectly.  The wsgiref.util.request_uri function does not go through the WSGI encoding dance.

Testcase through werkzeug:

>>> from wsgiref.util import request_uri
>>> from werkzeug.test import create_environ
>>> from werkzeug.urls import url_parse, url_unquote
>>> env = create_environ('/\N{SNOWMAN}')
>>> url_parse(request_uri(env)).path
'/%C3%A2%C2%98%C2%83'
>>> url_unquote(url_parse(request_uri(env)).path)
'/â\x98\x83'
>>> _ == '/\N{SNOWMAN}'
False

If this passes the tests, then I assume wsgiref has the inverse bug somewhere else.  I will look into it later, but this behavior is definitely broken.
msg207887 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-01-10 22:53
Which version and bugfix release are you using?
What is werkzeug and what does it have to do with stdlib urllib?
An stdlib test cannot depend on 3rd party code.
msg207888 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2014-01-10 22:59
> Which version and bugfix release are you using?

You can reproduce it against the current development version of Python 3.

> What is werkzeug and what does it have to do with stdlib urllib?

Werkzeug is a WSGI implementation.

> An stdlib test cannot depend on 3rd party code.

That's why the output values are shown in the clear, so you can remove the werkzeug-specific parts.  url_unquote can be replaced with urllib.parse.unquote.  None of that is relevant to the issue here though.  It was just to show that the standard library is currently in violation of PEP 3333.
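
For reference, a stdlib-only sketch of the same symptom. It does not call wsgiref at all; it assumes the path arrives as a PEP 3333 native string (UTF-8 bytes reinterpreted as latin-1) and shows what quoting that string with the default UTF-8 encoding produces. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# PEP 3333 "native string" form of '/☃': the UTF-8 bytes reinterpreted as latin-1.
wsgi_path = '/\N{SNOWMAN}'.encode('utf-8').decode('latin-1')

# Quoting with the default UTF-8 encoding re-encodes every latin-1 character,
# which yields the mangled form reported above.
mangled = quote(wsgi_path)
print(mangled)                              # /%C3%A2%C2%98%C2%83

# Unquoting therefore gives back the latin-1 mojibake, not the snowman.
print(unquote(mangled) == '/\N{SNOWMAN}')   # False
```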
msg207890 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2014-01-10 23:13
What it currently returns:

>>> from wsgiref.util import request_uri
>>> request_uri({
...  'wsgi.url_scheme': 'http',
...  'SCRIPT_NAME': '',
...  'PATH_INFO': '/\xe2\x98\x83',
...  'SERVER_PORT': '80',
...  'SERVER_NAME': 'localhost'
... })
'http://localhost/%C3%A2%C2%98%C2%83'

What it should return:

'http://localhost/%E2%98%83'
msg207891 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-10 23:31
Could you please show us the value of env?

Perhaps it is werkzeug that creates a wrongly quoted URL. request_uri() just calls urllib.parse.quote(), which works correctly.

>>> from urllib.parse import quote, unquote
>>> quote('/\N{SNOWMAN}')
'/%E2%98%83'
>>> unquote('/%E2%98%83') == '/\N{SNOWMAN}'
True

Your result looks like the output of:

>>> quote('/\N{SNOWMAN}'.encode().decode('latin1'))
'/%C3%A2%C2%98%C2%83'
msg207892 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-10 23:49
>>> from wsgiref.util import request_uri
>>> request_uri({
...  'wsgi.url_scheme': 'http',
...  'SCRIPT_NAME': '',
...  'PATH_INFO': '/\N{SNOWMAN}',
...  'SERVER_PORT': '80',
...  'SERVER_NAME': 'localhost'
... })
'http://localhost/%E2%98%83'
>>> request_uri({
...  'wsgi.url_scheme': 'http',
...  'SCRIPT_NAME': '',
...  'PATH_INFO': b'/\xe2\x98\x83',
...  'SERVER_PORT': '80',
...  'SERVER_NAME': 'localhost'
... })
'http://localhost/%E2%98%83'
msg207901 - (view) Author: Armin Ronacher (aronacher) * (Python committer) Date: 2014-01-11 08:06
Two things wrong with your example:

a) PATH_INFO on Python 3 must not be bytes
b) PATH_INFO on Python 3 must be latin1 transfer encoded.  See unicode_to_wsgi and wsgi_to_bytes functions in PEP 3333.
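
A minimal sketch of those two helpers, following the definitions in PEP 3333. The PEP uses the filesystem encoding with surrogateescape; UTF-8 is hard-coded here as an assumption so the result does not depend on the platform.

```python
def unicode_to_wsgi(u):
    """Convert text to a WSGI "bytes-as-unicode" native string."""
    # Assumption: UTF-8 instead of sys.getfilesystemencoding() as in the PEP.
    return u.encode('utf-8', 'surrogateescape').decode('iso-8859-1')

def wsgi_to_bytes(s):
    """Recover the original bytes from a WSGI native string."""
    return s.encode('iso-8859-1')

path_info = unicode_to_wsgi('/\N{SNOWMAN}')
print(ascii(path_info))           # '/\xe2\x98\x83'
print(wsgi_to_bytes(path_info))   # b'/\xe2\x98\x83'
```

So the str value '/\xe2\x98\x83' is what a conforming server would put into PATH_INFO for this path, and each character stands for exactly one byte on the wire.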
msg207903 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-11 09:46
OK, now I understand the issue. Here is a patch which fixes wsgiref.util.application_uri() and wsgiref.util.request_uri().
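
As a rough sketch of the idea only, not the contents of wsgiref_latin1.patch: passing encoding='latin1' to urllib.parse.quote() turns each character of the WSGI native string back into the byte it stands for before percent-encoding. The helper name quote_wsgi_path and the choice of safe characters are hypothetical.

```python
from urllib.parse import quote

def quote_wsgi_path(path):
    """Percent-encode a PEP 3333 native string byte for byte (hypothetical helper)."""
    # safe='/' is chosen for illustration; it is an assumption, not taken from the patch.
    return quote(path, safe='/', encoding='latin1')

print(quote_wsgi_path('/\xe2\x98\x83'))   # /%E2%98%83
```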
msg207911 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-01-11 20:04
> SCRIPT_NAME="/spammity", PATH_INFO="/späm")
Has the policy of limiting stdlib code to ascii chars, including \ escapes, except where needed for tests, been changed?
msg207912 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-11 20:09
> > SCRIPT_NAME="/spammity", PATH_INFO="/späm")
> Has the policy of limiting stdlib code to ascii chars, including \ escapes,
> except where needed for tests, been changed?

This character is already used in this file.
msg207920 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2014-01-11 22:40
And those examples are only in the tests.

Using latin-1 so that literal text survives the round trip is OK. The patch looks good to me.
msg207942 - (view) Author: Roundup Robot (python-dev) Date: 2014-01-12 10:16
New changeset 29732b43ccf2 by Serhiy Storchaka in branch '3.3':
Issue #20138: The wsgiref.application_uri() and wsgiref.request_uri()
http://hg.python.org/cpython/rev/29732b43ccf2

New changeset 73781fe1daa2 by Serhiy Storchaka in branch 'default':
Issue #20138: The wsgiref.application_uri() and wsgiref.request_uri()
http://hg.python.org/cpython/rev/73781fe1daa2

New changeset 40fb60df4755 by Serhiy Storchaka in branch '2.7':
Issue #20138: Backport tests for handling non-ASCII URLs in the
http://hg.python.org/cpython/rev/40fb60df4755
msg208085 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-14 09:02
Thank you Armin for your report.
History
Date User Action Args
2014-01-14 09:02:25 serhiy.storchaka set status: open -> closed; messages: + msg208085; components: + Library (Lib); resolution: fixed; stage: test needed -> resolved
2014-01-12 10:16:30 python-dev set nosy: + python-dev; messages: + msg207942
2014-01-12 09:41:37 serhiy.storchaka set assignee: serhiy.storchaka
2014-01-11 22:40:47 orsenthil set messages: + msg207920
2014-01-11 20:09:58 serhiy.storchaka set messages: + msg207912
2014-01-11 20:04:53 terry.reedy set messages: + msg207911
2014-01-11 11:33:06 serhiy.storchaka set nosy: + barry
2014-01-11 09:46:11 serhiy.storchaka set files: + wsgiref_latin1.patch; keywords: + patch; messages: + msg207903
2014-01-11 08:30:39 terry.reedy set nosy: + orsenthil
2014-01-11 08:29:32 r.david.murray set nosy: + r.david.murray
2014-01-11 08:06:30 aronacher set messages: + msg207901
2014-01-10 23:56:01 haypo set nosy: + haypo
2014-01-10 23:49:45 serhiy.storchaka set messages: + msg207892
2014-01-10 23:31:43 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg207891
2014-01-10 23:13:08 aronacher set messages: + msg207890
2014-01-10 22:59:49 aronacher set messages: + msg207888
2014-01-10 22:53:36 terry.reedy set versions: + Python 3.3, Python 3.4; nosy: + terry.reedy; messages: + msg207887; type: behavior; stage: test needed
2014-01-06 21:44:10 cvrebert set nosy: + cvrebert
2014-01-06 11:32:55 apollo13 set nosy: + apollo13
2014-01-06 10:48:00 grahamd set nosy: + grahamd
2014-01-06 09:46:59 aronacher create