Add wsgiref.util.dump_wsgistr & load_wsgistr #66460

ncoghlan · 2014-08-24T12:45:42Z

BPO	22264
Nosy	@malemburg, @pjeby, @ncoghlan, @pitrou, @vstinner, @rbtcollins, @benjaminp, @ezio-melotti, @serhiy-storchaka

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2018-06-09.11:06:24.115>
created_at = <Date 2014-08-24.12:45:41.866>
labels = ['type-feature', 'library', 'expert-unicode']
title = 'Add wsgiref.util.dump_wsgistr & load_wsgistr'
updated_at = <Date 2018-06-09.11:06:24.113>
user = 'https://github.com/ncoghlan'

bugs.python.org fields:

activity = <Date 2018-06-09.11:06:24.113>
actor = 'ncoghlan'
assignee = 'none'
closed = True
closed_date = <Date 2018-06-09.11:06:24.115>
closer = 'ncoghlan'
components = ['Library (Lib)', 'Unicode']
creation = <Date 2014-08-24.12:45:41.866>
creator = 'ncoghlan'
dependencies = []
files = []
hgrepos = []
issue_num = 22264
keywords = []
message_count = 17.0
messages = ['225814', '225815', '225816', '225819', '225829', '225852', '225853', '225854', '225856', '225857', '225867', '227336', '227510', '227511', '227575', '242940', '319141']
nosy_count = 10.0
nosy_names = ['lemburg', 'pje', 'ncoghlan', 'pitrou', 'vstinner', 'rbcollins', 'benjamin.peterson', 'ezio.melotti', 'grahamd', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'rejected'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue22264'
versions = ['Python 3.6']

ncoghlan · 2014-08-24T12:45:42Z

The WSGI 1.1 standard mandates that binary data be decoded as latin-1 text: http://www.python.org/dev/peps/pep-3333/#unicode-issues

This means that many WSGI headers will in fact contain *improperly encoded data*. Developers working directly with WSGI (rather than using a WSGI framework like Django, Flask or Pyramid) need to convert those strings back to bytes and decode them properly before passing them on to user applications.

I suggest adding a simple "fix_encoding" function to wsgiref that covers this:

    def fix_encoding(data, encoding, errors="surrogateescape"):
        return data.encode("latin-1").decode(encoding, errors)

The primary intended benefit is to WSGI related code more self-documenting. Compare the proposal with the status quo:

    data = wsgiref.fix_encoding(data, "utf-8")
    data = data.encode("latin-1").decode("utf-8", "surrogateescape")

The proposal hides the mechanical details of what is going on in order to emphasise *why* the change is needed, and provides you with a name to go look up if you want to learn more.

The latter just looks nonsensical unless you're already familiar with this particular corner of the WSGI specification.

ncoghlan · 2014-08-24T12:56:54Z

This would be a better fit for the util submodule rather than the top level package.

ncoghlan · 2014-08-24T12:58:25Z

Last tweak, since the purpose is to fix the original incorrect decoding to latin-1, this should be defined as a decoding operation:

    def fix_decoding(data, encoding, errors="surrogateescape"):
        return data.encode("latin-1").decode(encoding, errors)

serhiy-storchaka · 2014-08-24T13:53:51Z

Could you please provide an example how this helper will improve stdlib or user code?

ncoghlan · 2014-08-24T15:23:04Z

Current cryptic incantation that requires deep knowledge of the encoding system to follow:

    data = data.encode("latin-1").decode("utf-8", "surrogateescape")

Replacement that is not only more self-documenting, but also gives you something specific to look up in order to learn more:

    data = wsgiref.util.fix_encoding(data, "utf-8")

As a WSGI server, the standard library code mostly does this in the other direction, converting data from its original web server provided encoding *to* latin-1.

grahamd · 2014-08-25T00:38:41Z

Is actually WSGI 1.0.1 and not 1.1. :-)

vstinner · 2014-08-25T00:46:18Z

I don't like "fix" in the name "fix_encoding". It is negative. Why not "decode" or "decode_wsgi"?

grahamd · 2014-08-25T00:51:33Z

From memory, the term sometimes used on the WEB-SIG when discussed was transcode.

I find the idea that it needs 'fixing' or is 'incorrect', as in 'fix the original incorrect decoding to latin-1' is a bit misleading as well. It was the only practical way of doing things that didn't cause a lot of other problems and was a deliberate decision. It wasn't a mistake.

vstinner · 2014-08-25T01:06:06Z

I don't think that applications are prepared to handle surrogate characters, so I'm not sure that the default encoding should be "surrogateescape". In my experience, text is later encoded to UTF-8 (or latin1 or ascii) and you then you an error from the encoder.

Just one example: issue bpo-11186.

vstinner · 2014-08-25T01:06:54Z

Oh, I forgot to mention that I'm not convinced that we should add such function to the Python stdlib.

ncoghlan · 2014-08-25T06:39:33Z

After reviewing the stdlib code as Serhiy suggested and reflecting on the matter for a while, I now think it's better to think of this idea in terms of formalising the concept of a "WSGI string". That is, data that has been decoded as latin-1 not because that's necessarily correct, but because it creates a valid str object that doesn't lose any information, doesn't have any surrogate escapes in it, yet can still handle arbitrary binary data.

Under that model, and using a dumps/loads inspired naming scheme (since this is effectively a serialisation format for the WSGI server/application boundary), the appropriate helpers would be:

    def dump_wsgistr(data, encoding, errors='strict'):
        data.encode(encoding, errors).decode('iso-8859-1')

    def load_wsgistr(data, encoding, errors='strict'):
        data.encode('iso-8859-1').decode(encoding, errors)

As Victor says, using surrogateescape by default is not correct. However, some of the code in wsgiref.handlers does pass a custom errors setting, so it's appropriate to make that configurable.

With this change, there would be several instances in wsgiref.handlers that could be changed from the current:

data.encode(encoding).decode('iso-8859-1')

to:

    dump_wsgistr(data, encoding)

The point is that it isn't "iso-8859-1" that's significant - it's the compliance with the data format mandated by the WSGI 1.0.1 specification (which just happens to be "latin-1 decoded string").

ncoghlan · 2014-09-23T10:01:35Z

Updated issue title to reflect current proposal

rbtcollins · 2014-09-25T07:04:02Z

So this looks like its going to instantly create bugs in programs that use it. HTTP/1.1 headers are one of:
latin1
MIME encoded (RFC2047)
invalid and working only by accident

HTTP/2 doesn't change this.

An API that encourages folk to encode into utf8 and then put that in their headers is problematic.

Consider:

    def dump_wsgistr(data, encoding, errors='strict'):
        data.encode(encoding, errors).decode('iso-8859-1')

This takes a string that one wants to put into a header value, encodes it with a *user specified encoding*, then decodes that into iso-8859-1 [at which point it can be encoded back to octets by the wsgi server before putting on the wire].

But this is fundamentally wrong in the common case: either 'data' is itself suitable as a header value (e.g. it is ASCII - recommended per RFC7230 section 3.2.4) or 'data' needs encoding via RFC 2047 encoding not via utf8.

There are a few special cases where folk have incorrectly shoved utf8 into header values and we need to make it possible for folk working within WSGI to do that - which is why the API is the way it is - but we shouldn't make it *easier* for them to do the wrong thing.

I'd support an API that DTRT here by taking a string, tries US_ASCII, with fallback to MIME encoded with utf8 as the encoding parameter.

ncoghlan · 2014-09-25T07:25:17Z

I'm not wedded to the specific algorithm - I definitely don't consider myself an HTTP or WSGI expert.

I do like the general idea of treating "wsgistr" as a serialisation format though, as that's effectively what it is at this point.

rbtcollins · 2014-09-25T20:43:09Z

So I guess the API concern I have is that there are two cases:

common spec compliant - US-ASCII + RFC2047
dealing with exceptions - UTF8 or otherwise

The former totally makes sense as a codec, though the current email implementation of it isn't quite a codec.

The latter almost wants to be a separate API, which I may propose as part of WSGI for HTTP/2; nevertheless in PEP-3333 its integral to the main API because of the bytes-encoded-as-codepoints-00-ff solution.

So, perhaps:

a codec or something looking like it that covers the common case
this would not permit specifying codecs on decode, for instance [since RFC2047 is self describing].
documentation for the uncommon case. *Possibly* a helper function.

ncoghlan · 2015-05-12T04:57:35Z

Reviewing the items I had flagged as dependencies of bpo-22555 for personal tracking purposes, I suggest we defer further consideration of this idea to 3.6 and/or the creation of a WSGI 2.0 specification.

ncoghlan · 2018-06-09T11:06:24Z

We don't have anyone clamouring for this, and the third party module https://ftfy.readthedocs.io/en/latest/ has a lot more utilities for working with messy binary inputs and incorrectly decoded text than we'd ever add to the standard library, so I'm going to reject this as only being theoretically useful, and not actually solving a practical problem for users.

ncoghlan added the type-feature A feature request or enhancement label Aug 24, 2014

ncoghlan changed the title ~~Add wsgiref.fix_encoding~~ Add wsgiref.util.fix_encoding Aug 24, 2014

ncoghlan changed the title ~~Add wsgiref.util.fix_encoding~~ Add wsgiref.util.fix_decoding Aug 24, 2014

serhiy-storchaka added stdlib Python modules in the Lib dir topic-unicode labels Aug 24, 2014

ncoghlan changed the title ~~Add wsgiref.util.fix_decoding~~ Add wsgiref.util helpers for dealing with "WSGI strings" Aug 25, 2014

ncoghlan changed the title ~~Add wsgiref.util helpers for dealing with "WSGI strings"~~ Add wsgiref.util.dump_wsgistr & load_wsgistr Sep 23, 2014

ncoghlan closed this as completed Jun 9, 2018

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add wsgiref.util.dump_wsgistr & load_wsgistr #66460

Add wsgiref.util.dump_wsgistr & load_wsgistr #66460

ncoghlan commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

serhiy-storchaka commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

grahamd mannequin commented Aug 25, 2014

vstinner commented Aug 25, 2014

grahamd mannequin commented Aug 25, 2014

vstinner commented Aug 25, 2014

vstinner commented Aug 25, 2014

ncoghlan commented Aug 25, 2014

ncoghlan commented Sep 23, 2014

rbtcollins commented Sep 25, 2014

ncoghlan commented Sep 25, 2014

rbtcollins commented Sep 25, 2014

ncoghlan commented May 12, 2015

ncoghlan commented Jun 9, 2018

Add wsgiref.util.dump_wsgistr & load_wsgistr #66460

Add wsgiref.util.dump_wsgistr & load_wsgistr #66460

Comments

ncoghlan commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

serhiy-storchaka commented Aug 24, 2014

ncoghlan commented Aug 24, 2014

grahamd mannequin commented Aug 25, 2014

vstinner commented Aug 25, 2014

grahamd mannequin commented Aug 25, 2014

vstinner commented Aug 25, 2014

vstinner commented Aug 25, 2014

ncoghlan commented Aug 25, 2014

ncoghlan commented Sep 23, 2014

rbtcollins commented Sep 25, 2014

ncoghlan commented Sep 25, 2014

rbtcollins commented Sep 25, 2014

ncoghlan commented May 12, 2015

ncoghlan commented Jun 9, 2018