This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: wsgiref.simple_server: mojibake with cp1252 bytes in PATH_INFO
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Anthony Sottile, martin.panter, python-dev, Александр Эри
Priority: normal Keywords: patch

Created on 2016-04-08 20:48 by Anthony Sottile, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
patch Anthony Sottile, 2016-04-08 20:48
patch Anthony Sottile, 2016-04-08 22:34
patch Anthony Sottile, 2016-04-09 01:47
patch Anthony Sottile, 2016-04-09 02:55
simple_server.py.diff Александр Эри, 2016-04-20 10:46
Messages (10)
msg263043 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2016-04-08 20:48
Patch attached with test.

In summary:

A request to the url b'/\x80' appears to the application as a request to b'\xc2\x80' -- The issue being the latin1 decoded PATH_INFO is re-encoded as UTF-8 and then decoded as latin1

(on the wire) b'\x80' -(decode latin1)-> u'\x80' -(encode utf-8)-> b'\xc2\x80' -(decode latin1)-> b'\xc2\x80'

My patch cuts out the encode(utf-8)->decode(latin1)
msg263044 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2016-04-08 20:53
There were a few typos in my previous comment (I pressed enter too quickly); here's an updated comment:

Patch attached with test.

In summary:

A request to the URL b'/\x80' appears to the application as a request to b'/\xc2\x80'. The issue is that the latin1-decoded PATH_INFO is re-encoded as UTF-8 and then decoded as latin1 again:

    (on the wire) b'\x80' -(decode latin1)-> u'\x80' -(encode utf-8)-> b'\xc2\x80' -(decode latin1)-> u'\xc2\x80'

My patch cuts out the encode(utf-8)->decode(latin1):

    (on the wire) b'\x80' -(decode latin1)-> u'\x80'
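
The two chains above can be reproduced with plain bytes/str conversions. This is only a sketch of the encoding round trip described in this message, not the wsgiref code itself; the variable names are made up for illustration.

```python
# Raw request path as it arrives on the wire (byte 0x80 is not valid UTF-8).
wire = b"/\x80"

# Step 1: the server decodes the raw bytes as Latin-1.
decoded = wire.decode("latin-1")        # '/\x80'

# Step 2 (the reported bug): the text is re-encoded as UTF-8 ...
reencoded = decoded.encode("utf-8")     # b'/\xc2\x80'

# Step 3: ... and decoded as Latin-1 again, which yields an extra character.
mojibake = reencoded.decode("latin-1")  # '/\xc2\x80'

# Without the extra encode/decode, the application would see this instead.
expected = wire.decode("latin-1")       # '/\x80'

print(repr(mojibake), repr(expected))
```

Encoding `expected` back with Latin-1 gives the original wire bytes, whereas `mojibake` does not.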
msg263048 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2016-04-08 22:34
Oops, broke b'/%80'.

Here's a better fix that now takes:

    (on the wire) b'\x80' -(decode latin1)-> u'\x80' -(encode utf-8)-> b'\xc2\x80' -(decode latin1)-> u'\xc2\x80'

to:

    (on the wire) b'\x80' -(decode latin1)-> u'\x80' -(encode latin1)-> b'\x80' -(decode latin1)-> u'\x80'
msg263050 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-08 23:50
I was going to say your original fix was the reverse of a change in r86146. But you seem to be fixing the problems before I express them :)

For the fix, I would suggest that something like unquote(path, "latin-1") would be simpler. I left some other review comments about the tests.
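
To illustrate the suggestion, here is a minimal sketch of percent-decoding a path with `urllib.parse.unquote` and an explicit Latin-1 encoding. The helper name `path_info_from_request_path` is hypothetical and this is not necessarily the code that was committed; it only shows how the `encoding` argument changes the result for b'/%80' and b'/\x80'.

```python
from urllib.parse import unquote


def path_info_from_request_path(path: str) -> str:
    """Hypothetical helper: percent-decode a Latin-1-decoded request path.

    Percent-escapes are turned into bytes and decoded as Latin-1, so each
    escaped byte maps to exactly one character. Characters that are
    already non-ASCII in the input are passed through unchanged.
    """
    return unquote(path, "latin-1")


# A percent-escaped byte and a raw (Latin-1-decoded) byte give the same result.
escaped = path_info_from_request_path("/%80")
raw = path_info_from_request_path("/\x80")

# For comparison: the default encoding is UTF-8 with errors="replace",
# so the lone byte 0x80 becomes U+FFFD and the original byte is lost.
default = unquote("/%80")

print(repr(escaped), repr(raw), repr(default))
```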
msg263054 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2016-04-09 01:47
Updates after review.
msg263055 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-09 02:41
Thanks, this version looks pretty good to me.
msg263056 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2016-04-09 02:55
Forgot to remove the pyver code (leaning a bit too much on pre-commit)
msg263596 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-04-17 03:04
New changeset 1f2cfcd5a83f by Martin Panter in branch '3.5':
Issue #26717: Stop encoding Latin-1-ized WSGI paths with UTF-8
https://hg.python.org/cpython/rev/1f2cfcd5a83f

New changeset 815a4ac67e68 by Martin Panter in branch 'default':
Issue #26717: Merge wsgiref fix from 3.5
https://hg.python.org/cpython/rev/815a4ac67e68
msg263818 - (view) Author: Александр Эри (Александр Эри) Date: 2016-04-20 10:46
Why does wsgiref use latin1? It must use utf-8.
msg263844 - (view) Author: Anthony Sottile (Anthony Sottile) * Date: 2016-04-20 14:34
PEP 3333 states that environ variables are str values decoded using
latin1:
https://www.python.org/dev/peps/pep-3333/#id19

Therefore, to get the original bytes, one must encode using latin1.
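
As a sketch of what that means for an application: the functions below are hypothetical helpers (not part of wsgiref) that recover the original bytes from `environ["PATH_INFO"]` and then decode them with whatever encoding the application expects, assumed here to be UTF-8.

```python
def path_info_bytes(environ: dict) -> bytes:
    """Hypothetical helper: recover the raw path bytes from a WSGI environ."""
    # Latin-1 maps code points 0-255 to bytes 0-255 one to one, so this
    # round trip is lossless for strings produced by a Latin-1 decode.
    return environ.get("PATH_INFO", "").encode("latin-1")


def path_info_text(environ: dict, encoding: str = "utf-8",
                   errors: str = "replace") -> str:
    """Hypothetical helper: decode the path using the application's encoding."""
    return path_info_bytes(environ).decode(encoding, errors)


# A UTF-8 path ("/café") as a compliant server would present it: each raw
# byte appears as one Latin-1 character.
environ = {"PATH_INFO": "/caf\xc3\xa9"}
print(path_info_bytes(environ))   # the original bytes
print(path_info_text(environ))    # the path as the application wants it
```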
History
Date | User | Action | Args
2022-04-11 14:58:29 | admin | set | github: 70904
2016-04-20 14:34:55 | Anthony Sottile | set | messages: + msg263844
2016-04-20 10:46:18 | Александр Эри | set | files: + simple_server.py.diff; nosy: + Александр Эри; messages: + msg263818; keywords: + patch
2016-04-17 08:23:34 | martin.panter | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2016-04-17 03:04:43 | python-dev | set | nosy: + python-dev; messages: + msg263596
2016-04-09 02:55:45 | Anthony Sottile | set | files: + patch; messages: + msg263056
2016-04-09 02:41:20 | martin.panter | set | messages: + msg263055
2016-04-09 01:47:58 | Anthony Sottile | set | files: + patch; messages: + msg263054
2016-04-08 23:50:19 | martin.panter | set | versions: - Python 3.4; nosy: + martin.panter; messages: + msg263050; type: behavior; stage: patch review
2016-04-08 22:34:31 | Anthony Sottile | set | files: + patch; messages: + msg263048
2016-04-08 20:53:29 | Anthony Sottile | set | messages: + msg263044
2016-04-08 20:48:05 | Anthony Sottile | create |