Message 70869 - Python tracker


Author	gvanrossum
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-07.23:23:47
SpamBayes Score	1.9302338e-11
Marked as misclassified	No
Message-id	<ca471dc20808071623v74ca2f35m947484f381f2a3fe@mail.gmail.com>
In-reply-to	<4b3e516a0808071558r41892c2aud4b8060226355d6@mail.gmail.com>

Content

On Thu, Aug 7, 2008 at 3:58 PM, Bill Janssen <report@bugs.python.org> wrote:
> Bill Janssen <bill.janssen@gmail.com> added the comment:
>> Your original proposal was to make unquote() behave like
>> unquote_to_bytes(), which would require changes to virtually every app
>> using unquote(), since almost all apps assume the result is a (text)
>> string.
>
> Actually, careful apps realize that the result of "unquote" in Python 2 is a
> sequence of bytes, and do something careful with that.

Given that in 2.x it's a string, and knowing my users, I expect that
careful apps are a tiny minority.

> So only careless
> apps would break, and they'd break in such a way that their maintainers
> would have to look at the situation again, and think about it.  Seems like a
> 'good thing', to me.  And since this is Python 3, fully allowed.  I really
> don't understand your position here, I'm afraid.

My position is that although 3.0 supports Unicode much better than 2.x
(I won't ever use pretentious and meaningless phrases like "fully
supports"), that doesn't mean that you *have* to use it. I expect lots
of simple web apps don't need Unicode but do need quote/unquote
functionality to get around "forbidden" characters in query strings
like &. I don't want to force such apps to become more aware of
Unicode than absolutely necessary.
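
To illustrate the simple case described above, here is a minimal sketch. It assumes the urllib.parse module as it later shipped in Python 3; the sample value and the example.com URL are made up for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical query value containing '&', which would otherwise be read
# as a separator between query parameters.
value = "fish & chips"

# safe="" asks quote() to percent-encode every reserved character.
encoded = quote(value, safe="")
url = "http://example.com/search?q=" + encoded

# A pure-ASCII app gets its original text back without touching encodings.
decoded = unquote(encoded)
print(url)
print(decoded)
```

An app like this never sees a non-ASCII byte, which is the kind of code the argument above wants to leave undisturbed.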

>> A middle ground might be to set the default encoding to ASCII --
>> that's closer to Martin's claim that URLs are supposed to be ASCII
>> only.
>
> URLs *are* supposed to be ASCII only -- but the percent-encoded byte
> sequences in various parts of the path aren't.
>
>> This will require many apps to be changed, but at least it forces the
>> developers to think about which encoding to assume (perhaps there's
>> one handy in the request headers if it's a web app) or about error
>> handling or perhaps using unquote_to_bytes().
>
> Yes, this is closer to my line of reasoning.
>
>> However I fear that this middle ground will in practice cause:
>>
>> (a) more in-the-field failures, since devs are notorious for testing
>> with ASCII only; and
>
> Returning bytes deals with this problem.

In an unpleasant way. We might as well consider changing all APIs that
deal with URLs to insist on bytes.
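
The options being weighed here (returning bytes, a strict ASCII default, or decoding with a chosen encoding) can be compared in a short sketch. It assumes the unquote() and unquote_to_bytes() signatures as they eventually shipped in Python 3's urllib.parse, not necessarily what existed when this message was written; the input string is a hypothetical example.

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical input: 'é' percent-encoded as a single Latin-1 byte.
quoted = "caf%E9"

# Returning bytes: no encoding is assumed; the caller must decode.
raw = unquote_to_bytes(quoted)

# Decoding with an explicitly chosen encoding.
as_latin1 = unquote(quoted, encoding="latin-1")

# Shipped defaults (encoding="utf-8", errors="replace"): the invalid byte
# becomes U+FFFD instead of raising.
lenient = unquote(quoted)

# The "ASCII middle ground" with strict error handling fails on this input.
try:
    unquote(quoted, encoding="ascii", errors="strict")
    strict_ascii_failed = False
except UnicodeDecodeError:
    strict_ascii_failed = True

print(raw, as_latin1, repr(lenient), strict_ascii_failed)
```

The strict ASCII call is the in-the-field failure mode from point (a): it only surfaces once a non-ASCII byte shows up in real traffic.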

>> (b) the creation of a recipe for "fixing" unquote() calls that fail by
>> setting the encoding to UTF-8 without thinking about the alternatives,
>> thereby effectively recreating the UTF-8 default with much more pain.
>
> Could be, but at least they will have had to think about it.  There's lots of
> bad code out there, and maybe by making them think about it, some of it will
> improve.

I'd rather use a carrot than a stick. IOW I'd rather write aggressive
docs than break people's code.

>> > A secondary concern is that it
>> > will invisibly produce invalid data, because it decodes some
>> > non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
>> > as the wrong string value.
>>
>> In my experience this is very unlikely. UTF-8 looks like total junk in
>> Latin-1, so it's unlikely to occur naturally. If you see something
>> that matches a UTF-8 sequence in Latin-1 text, it's most likely that
>> in fact it was incorrectly decoded earlier...

> Latin-1 isn't the only alternate encoding in the world, and not all
> percent-encoded byte sequences in URLs are encoded strings.  I'd feel better
> if we were being guided by more than just your experience (vast though it
> may rightly be said to be!).  Say, by looking at all the URLs that Google
> knows about :-).  I'd particularly feel better if some expert in Asian use
> of the Web spoke up here...

OK, let's wait and see if one bites.
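
The Latin-1 versus UTF-8 point debated above can be checked on a single word. This is only a sketch with one made-up sample string; it says nothing about other encodings or about percent-encoded bytes that are not text at all, which is the objection raised in reply.

```python
text = "café"  # hypothetical sample

# UTF-8 bytes misread as Latin-1 come out as recognizable junk.
utf8_bytes = text.encode("utf-8")
misread = utf8_bytes.decode("latin-1")

# Ordinary Latin-1 bytes are usually not valid UTF-8 at all.
latin1_bytes = text.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# The silent-misdecoding case needs Latin-1 text that already looks like
# the junk above: its bytes happen to form a valid UTF-8 sequence.
collision = "Ã©".encode("latin-1").decode("utf-8")

print(repr(misread), latin1_is_valid_utf8, repr(collision))
```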

History
Date	User	Action	Args
2008-08-07 23:23:49	gvanrossum	set	recipients: + gvanrossum, lemburg, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3, mgiuca
2008-08-07 23:23:48	gvanrossum	link	issue3300 messages
2008-08-07 23:23:47	gvanrossum	create