Message 71054 - Python tracker

Author	mgiuca
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-12.15:20:08
Message-id	<1218554429.43.0.154361702641.issue3300@psf.upfronthosting.co.za>

Content

Bill, this debate is getting snippy, and going nowhere. We could argue
about what is the "pure" and "correct" thing to do, but we have a
limited time frame here, so I suggest we just look at the important facts.

1. There is an overwhelming consensus (including from me) that a
str->bytes version is acceptable to have in the library (whether or not
it's the "correct solution").
2. There is an overwhelming consensus (including from you) that a
str->str version is acceptable to have in the library (whether or not
it's the "correct solution").
3. By default, the str->str version breaks much less code, so both of us
decided to use it by default.

To this end, both of our patches:

1. Have a str->bytes version available.
2. Have a str->str version available.
3. Have "quote" and "unquote" functions call the str->str version.

So it seems we have agreed on that. Therefore, there should be no more
arguing about which is "more right".

So all your arguments seem to be essentially saying "the str->bytes
methods work perfectly; I don't care whether the str->str methods are
correct or not". The fact that your string versions quote UTF-8 and
unquote Latin-1 shows just how un-seriously you take the str->str methods.
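
To make the asymmetry concrete, here is a minimal sketch. It uses the
encoding parameter that urllib.parse.quote and urllib.parse.unquote accept
in current Python 3 releases, not the code from either patch under
discussion, and the sample string is an assumption chosen for illustration.

```python
from urllib.parse import quote, unquote

original = "caf\u00e9"  # "café", one non-ASCII character

# Quote using UTF-8: the single character becomes two escaped bytes.
quoted = quote(original, encoding="utf-8")

# Unquote using Latin-1: each escaped byte is decoded as its own character,
# so the round trip does not return the original string.
mismatched = unquote(quoted, encoding="latin-1")

# Unquote using UTF-8: the round trip is lossless.
matched = unquote(quoted, encoding="utf-8")

print(quoted)      # caf%C3%A9
print(mismatched)  # cafÃ©
print(matched)     # café
```

With mismatched encodings, unquote(quote(s)) is not the identity for any
string containing non-ASCII characters.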

Well the fact is that a) a great many users do NOT SHARE your ideals and
will default to using "quote" and "unquote" rather than the bytes
functions, and b) all of the rest of the library uses "quote" and
"unquote". So from a practical sense, how these methods behave is of the
utmost importance - they are more important than any new functions we
introduce at this point.

For example, the cgi.FieldStorage and the http.server modules will
implicitly call unquote and quote.

That means whether you, or I, or Guido, or The King Of The Internet
likes it or not, we have to have a "most reasonable" solution to the
problem of quoting and unquoting strings.

> Good thing we don't need to [handle unescaped non-ASCII characters in
> unquote]; URIs consist of ASCII characters.

Once again, practicality beats purity. I'd argue that it's a *good* (not
strictly required) idea to not mangle input unless we have to.

> > * Question: How does unquote_bytes deal with unescaped characters?

> Not sure I understand this question...

I meant unescaped non-ASCII characters, as discussed above (e.g.
unquote_bytes('\u0123')).
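
For reference, this is how the question plays out in current Python 3
releases, where the function is named urllib.parse.unquote_to_bytes. This
is a sketch of today's documented behaviour, not of either patch as it
stood at the time of this message.

```python
from urllib.parse import unquote, unquote_to_bytes

raw = "\u0123"  # unescaped non-ASCII character, no %XX escapes at all

# For str input, unescaped non-ASCII characters are encoded as UTF-8 bytes.
as_bytes = unquote_to_bytes(raw)

# The str->str function leaves unescaped characters untouched.
as_str = unquote(raw)

# Escaped and unescaped characters can be mixed in one input.
mixed = unquote_to_bytes("\u0123%41")

print(as_bytes)  # b'\xc4\xa3'
print(as_str == raw)  # True
print(mixed)  # b'\xc4\xa3A'
```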

> Your test cases probably aren't testing things I feel it's necessary
> to test. I'm interested in having the old test cases for urllib
> pass, as well as providing the ability to unquote_to_bytes().

I'm sorry, but you're missing the point of test-driven development. If
you think there is a bug, you don't just fix it and say "look, the old
test cases still pass!" You write new FAILING test cases to demonstrate
the bug. Then you change the code to make the test cases pass. All your
test suite proves is that you're happy with things the way they are.
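
As a sketch of what such a test could look like, the following unittest
case pins down the round-trip behaviour being argued about. The class name,
test names and sample strings are hypothetical and are not taken from either
patch's test suite. Against an implementation that quotes UTF-8 but unquotes
Latin-1, the first test would fail, which is the point of writing it before
changing the code.

```python
import unittest
from urllib.parse import quote, unquote


class QuoteRoundTripTest(unittest.TestCase):
    """Hypothetical tests for the str->str quote/unquote pair."""

    def test_non_ascii_round_trip(self):
        # unquote(quote(s)) should return s for arbitrary Unicode text.
        for text in ("caf\u00e9", "\u0123", "\u6f22\u5b57"):
            with self.subTest(text=text):
                self.assertEqual(unquote(quote(text)), text)

    def test_quote_default_is_utf8(self):
        # With no encoding given, quote should emit UTF-8 escapes.
        self.assertEqual(quote("\u00e9"), "%C3%A9")
```

Saved in a module, this can be run with python -m unittest.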

> Matt, your patch is not some God-given thing here.

No, I am merely suggesting that it's had a great deal more thought put
into it -- not just my thought, but all the other people in the past
month who've suggested different approaches and brought up discussion
points. Including yourself -- it was your suggestion in the first place
to have the str->bytes functions, which I agree are important.

> > <snip> - Quote uses cache

> I see no real advantage there, except that it has a built-in
> memory leak. Just use a function.

Good point. Well the merits of using a cache are completely independent
from the behavioural aspects. I simply changed the existing code as
little as possible. Hence this patch will have the same performance
strengths/weaknesses as all previous versions, and the performance can
be tuned after 3.0 if necessary. (Not urgent).

On statistics about UTF-8 versus other encodings. Yes, I agree, there
are lots of URIs floating around out there, in many different encodings.
Unfortunately, we can't implicitly handle them all (and I'm talking once
more explicitly about the str->str transform here). We need to pick one
as the default. Whether Latin-1 is more popular than UTF-8 *for the time
being* is no good reason to pick Latin-1. It is called a "legacy
encoding" for a reason. It is being phased out and should NOT be
supported from here on in as the default encoding in a major web
programming language.

(Also there is no point in claiming to be "Unicode compliant" then
turning around and supporting a charset with 256 symbols by default).

Because Python's urllib will mostly be used in the context of building
web apps, it is up to the programmer to decide what encoding to use for
h(is|er) web app. For future apps, this should almost certainly be UTF-8
(if it isn't, the website won't be able to accept form input across all
characters, so isn't Unicode compliant anyway).
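
A short sketch of why a Latin-1 default cannot cover all characters, again
using the encoding and errors parameters of urllib.parse.quote as they exist
in current Python 3 releases (an assumption relative to the patches
discussed here):

```python
from urllib.parse import quote

text = "\u0123"  # a character outside the 256-symbol Latin-1 repertoire

# UTF-8 can represent any Unicode character.
utf8_quoted = quote(text, encoding="utf-8")

# Latin-1 cannot encode this character, so strict quoting raises.
try:
    quote(text, encoding="latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# With errors="replace" the character is silently lost as an escaped "?".
lossy = quote(text, encoding="latin-1", errors="replace")

print(utf8_quoted)    # %C4%A3
print(latin1_failed)  # True
print(lossy)          # %3F
```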

The problem you mention of browsers submitting URIs encoded based on the
charset is simply something we have to live with. A server will never be
able to deal with that unless the URIs are coming from pages which *it
served*. As this is very often the case, then as I said above, the app
should serve UTF-8 and there'll be no problems. Also note that ALL the
browsers I tested (FF/Saf/IE) use UTF-8 no matter what, if you directly
type Unicode characters into the address bar.

History
Date	User	Action	Args
2008-08-12 15:20:29	mgiuca	set	recipients: + mgiuca, lemburg, gvanrossum, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3
2008-08-12 15:20:29	mgiuca	set	messageid: <1218554429.43.0.154361702641.issue3300@psf.upfronthosting.co.za>
2008-08-12 15:20:13	mgiuca	link	issue3300 messages
2008-08-12 15:20:09	mgiuca	create