Message 70862 - Python tracker

Author	gvanrossum
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-07.21:46:22
Message-id	<ca471dc20808071446i549063eo89f5060e1e72e818@mail.gmail.com>
In-reply-to	<1218143828.6.0.619997562476.issue3300@psf.upfronthosting.co.za>

Content
On Thu, Aug 7, 2008 at 2:17 PM, Bill Janssen <report@bugs.python.org> wrote:
>
> Bill Janssen <bill.janssen@gmail.com> added the comment:
>
> My main fear with this patch is that "unquote" will become seen as
> unreliable, because naive software trying to parse URLs will encounter
> uses of percent-encoding where the encoded octets are not in fact UTF-8
> bytes.  They're just some set of bytes.

Apps that want to handle these correctly have no choice but to use
unquote_to_bytes(), or to set errors='ignore' or errors='replace'.
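
A minimal sketch of those three options, assuming the API as it was
eventually released in urllib.parse (the encoding and errors arguments
are passed explicitly so nothing depends on the defaults under
discussion). The sample string is made up for illustration:

```python
from urllib.parse import unquote, unquote_to_bytes

# "%E9" is "é" in Latin-1, but a lone 0xE9 byte is not valid UTF-8.
raw = "caf%E9"

# Option 1: keep the octets as bytes and decide about decoding later.
as_bytes = unquote_to_bytes(raw)

# Option 2: decode as UTF-8 and silently drop undecodable bytes.
ignored = unquote(raw, encoding="utf-8", errors="ignore")

# Option 3: decode as UTF-8 and substitute U+FFFD for undecodable bytes.
replaced = unquote(raw, encoding="utf-8", errors="replace")

print(as_bytes, repr(ignored), repr(replaced))
```

Only the first option preserves the original octets; the other two
lose information about what was actually sent.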

Your original proposal was to make unquote() behave like
unquote_to_bytes(), which would require changes to virtually every app
using unquote(), since almost all apps assume the result is a (text)
string.

Setting the default encoding to Latin-1 would prevent these errors,
but would commit the sin of mojibake (the Japanese word for Perl code
:-). I don't like that much either.
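
To illustrate the trade-off (a sketch, assuming the released
urllib.parse.unquote() signature; the sample text is an arbitrary
choice): Latin-1 maps every byte to a character, so decoding never
fails, but UTF-8 input comes out as garbage.

```python
from urllib.parse import unquote

# Percent-encoded UTF-8 for the two characters "文字".
quoted = "%E6%96%87%E5%AD%97"

# Decoded with the encoding that was actually used: two characters.
good = unquote(quoted, encoding="utf-8", errors="strict")

# Decoded as Latin-1: no exception, but six unrelated characters.
garbled = unquote(quoted, encoding="latin-1", errors="strict")

print(repr(good), len(good))
print(repr(garbled), len(garbled))
```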

A middle ground might be to set the default encoding to ASCII --
that's closer to Martin's claim that URLs are supposed to be ASCII
only. It will fail as soon as an app receives encoded non-ASCII text.
This will require many apps to be changed, but at least it forces the
developers to think about which encoding to assume (perhaps there's
one handy in the request headers if it's a web app) or about error
handling or perhaps using unquote_to_bytes().
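
A rough sketch of how an ASCII default would behave, emulated here by
passing encoding="ascii" explicitly to the released
urllib.parse.unquote(). The helper name unquote_ascii is hypothetical
and exists only for this illustration:

```python
from urllib.parse import unquote

def unquote_ascii(quoted):
    """Decode percent-escapes, insisting that the result is pure ASCII."""
    return unquote(quoted, encoding="ascii", errors="strict")

# Escaped ASCII characters decode without trouble.
plain = unquote_ascii("hello%20world")

# Any escaped non-ASCII byte raises, forcing the caller to choose an
# encoding, an error handler, or unquote_to_bytes().
try:
    unquote_ascii("caf%C3%A9")
    failed = False
except UnicodeDecodeError:
    failed = True

print(repr(plain), failed)
```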

However I fear that this middle ground will in practice cause:

(a) more in-the-field failures, since devs are notorious for testing
with ASCII only; and

(b) the creation of a recipe for "fixing" unquote() calls that fail by
setting the encoding to UTF-8 without thinking about the alternatives,
thereby effectively recreating the UTF-8 default with much more pain.

Therefore I think that the UTF-8 default is probably the most pragmatic choice.

In the code review, I have asked Matt to change the default error
handling from errors='replace' to errors='strict'. I suppose we could
reduce outright crashes in the field by setting this to 'replace'
(even though for quote() I think it should remain 'strict'). But this
may cause more subtle failures, where apps simply receive garbage
data. At least when you're serving pages with error 500 the developers
tend to get called in. When users merely get wrong results, such
bugs may linger much longer.
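
The difference between the two error handlers can be sketched as
follows, again assuming the released urllib.parse.unquote() signature
and passing errors explicitly. The input string is invented for the
example:

```python
from urllib.parse import unquote

# Bytes 0xFF and 0xFE can never appear in valid UTF-8.
bad = "name=%FF%FE"

# errors='strict': the failure is loud and lands in the error log.
try:
    unquote(bad, encoding="utf-8", errors="strict")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "error"

# errors='replace': no exception, but the app now holds U+FFFD
# characters in place of the original data.
lenient = unquote(bad, encoding="utf-8", errors="replace")

print(outcome, repr(lenient))
```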

> A secondary concern is that it
> will invisibly produce invalid data, because it decodes some
> non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> as the wrong string value.

In my experience this is very unlikely. UTF-8 looks like total junk in
Latin-1, so it's unlikely to occur naturally. If you see something
that matches a UTF-8 sequence in Latin-1 text, it's most likely that
in fact it was incorrectly decoded earlier...
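
A small illustration of that point, using only built-in str/bytes
methods. The helper is_valid_utf8 is a hypothetical name for this
sketch, and "café" is an arbitrary sample word:

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

word = "café"

# Ordinary Latin-1 text with an accented letter is not valid UTF-8.
latin1_ok = is_valid_utf8(word.encode("latin-1"))

# UTF-8 bytes read as Latin-1 give the familiar two-character junk.
junk = word.encode("utf-8").decode("latin-1")

# Latin-1 text that does pass as UTF-8 is exactly that junk, which
# suggests an earlier wrong decode rather than real Latin-1 content.
junk_ok = is_valid_utf8(junk.encode("latin-1"))

print(latin1_ok, repr(junk), junk_ok)
```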

> Now, I have to confess that I don't know how common these use cases are
> in actual URL usage.  It would be nice if there was some organization
> that had a large collection of URLs, and could provide a test set we
> could run a scanner over :-).
>
> As a workaround, though, I've sent a message off to Larry Masinter to
> ask about this case.  He's one of the authors of the URI spec.

Looking forward to his response.

History
Date	User	Action	Args
2008-08-07 21:46:24	gvanrossum	set	recipients: + gvanrossum, lemburg, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3, mgiuca
2008-08-07 21:46:23	gvanrossum	link	issue3300 messages
2008-08-07 21:46:22	gvanrossum	create