Author janssen
Recipients gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date 2008-08-12.02:30:52
SpamBayes Score 4.44089e-16
Marked as misclassified No
Message-id <4b3e516a0808111930m459bbc98sef6934c15a158e65@mail.gmail.com>
In-reply-to <1218306864.44.0.78350530603.issue3300@psf.upfronthosting.co.za>
Content
On Sat, Aug 9, 2008 at 11:34 AM, Matt Giuca <report@bugs.python.org> wrote:

>
> Matt Giuca <matt.giuca@gmail.com> added the comment:
>
> Bill, I had a look at your patch. I see you've decided to make
> quote_as_string the default? In that case, I don't know why you had to
> rewrite everything to implement the same basic behaviour as my patch.
>

Just to get things clear in my head.  I don't care what code is used, so
long as it doesn't break the Web.

> (My latest few patches support bytes both ways). Anyway, I have a lot of
> issues with your implementation.
>
> * Why did you replace all of the existing machinery? Particularly the
> way quote creates Quoter objects and stores them in a cache. I haven't
> done any speed tests, but I assume that was all there for performance
> reasons.

I see no real advantage there, except that it has a built-in memory leak.
Just use a function.

> * The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or
> bytes, and outputs a URI as a bytes, while quote_as_string outputs a URI
> as a str. This is the first time in the whole discussion we've
> represented a URI as bytes, not a str. URIs are not byte sequences, they
> are character sequences (see discussion below).

Just there for consistency.  Of course,  URIs are actually strings of
*ASCII* characters, which sort of specifies an encoding right there, so
using ASCII octets isn't all that far-fetched...

> * The names unquote_as_* and quote_as_* are confusing. Use unquote_to_*
> and quote_from_* to avoid ambiguity.

Good point.

> * Are unquote_as_string and unquote both part of your public interface?

Yes.  Good programmers will switch to unquote_as_string or unquote_as_bytes
(or, with your amendment, unquote_to_string and unquote_to_bytes, while
unquote is just there for backwards compatibility for naifs.

* As Antoine pointed out above, it's too limiting for quote to force
> UTF-8. Add a 'charset' parameter.

That's what quote_as_string is for.

> * Add an 'errors' parameter too, to give the caller control over how
> strict to be.

No.  Errors should be seen.  This is data, not whimsey.

>
> * unquote and unquote_plus are missing 'charset' param, which should be
> passed along to unquote_as_string.

No, they're only there for backwards compatibility, and the 2.x versions
don't have a charset param. *

>I do like the addition of a "plus" argument, as opposed to the

> separate unquote_plus and quote_plus functions. I'd swap the arguments
> to unquote around so charset is first and then plus, so you can write
> unquote(mystring, 'utf-8') without using a keyword argument.

Unquote doesn't take a charset (in my design).

>
> * In unquote: The "raw_unicode_escape" encoding makes no sense. It does
> exactly the same thing as Latin-1, except it also looks for b"\\uxxxx"
> in the string and converts that into a Unicode character. So your code
> behaves like this:
>
> >>> urllib.parse.unquote('%5Cu00fc')
> 'ü'
> (Should output "\u00fc")
> >>> urllib.parse.unquote('%5Cu')
> UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in
> position 11-12: truncated \uXXXX
> (Should output "\u")
>
> I suspect the email package (where you got the inspiration to use
> 'rawunicodeescape') has this same crazy problem, but that isn't my
> concern today!
>
> Aside from this weirdness, you're essentially defaulting unquote to
> Latin-1. As I've said countless times, unquote needs to be the inverse
> of quote, or you get this behaviour:

> >>> urllib.parse.unquote(urllib.parse.quote('ü'))
> 'ü'
>
> Once again, I refer you to my favourite web server example.
>
> import http.server
> s = http.server.HTTPServer(('',8000),
>        http.server.SimpleHTTPRequestHandler)
> s.serve_forever()
>
> Run this in a directory with a non-Latin-1 filename (eg. "漢字"), and
> you will get a 404 when you click on the file.

Using unquote() and expecting a string is basically broken already.  So what
I'm trying to do here is to avoid losing data in the translation.

> * One issue I worked very hard to solve is how to deal with unescaped
> non-ASCII characters in unquote. Technically this is an invalid URI, so
> I'm not sure how important it is, but it's nice to be able to assume the
> unquote function won't mess with them. For example,
> unquote_as_string("\u6f22%C3%BC", charset="latin-1") should give
> "\u6f22\u00fc" (or at least it would be nice). Yours raises
> "UnicodeEncodeError: 'ascii' codec can't encode character". (I assume
> this is a wanted property, given that the existing test suite tests that
> unquote can handle ALL unescaped ASCII characters (that's what
> escape_string in test_unquoting is for) - I merely extend this concept
> to be able to handle all unescaped Unicode characters). Note that it's
> impossible to allow such lenience if you implement unquote_as_string as
> calling unquote_as_bytes and then decoding.

Good thing we don't need to; URIs consist of ASCII characters.

>
> * Question: How does unquote_bytes deal with unescaped characters?

Not sure I understand this question...

>
> * Needs a lot more test cases, and documentation for your changes. I
> suggest you plug my new test cases for urllib in and see if you can make
> your code pass all the things I test for (and if not, have a good reason).

Your test cases probably aren't testing things I feel it's necessary to
test. I'm interested in having the old test cases for urllib pass, as well
as providing the ability to unquote_to_bytes().

> In addition, a major problem I have is with this dangerous assumption
> that RFC 3986 specifies a byte->str encoding. You keep stating
> assumptions like this:
>

> Remember that the RFC for percent-encoding really takes
> > bytes in, and produces bytes out.  The string-in and string-out
> > versions are to support naive programming (what a nice way of
> > putting it!).
>

Those aren't assumptions, they're simple statements of fact.

>
> You assume that my patch, the string version of quote/unquote, is a
> "hack" in order to satisfy the naive souls who only want to deal with
> strings, while your method is the "pure and correct" solution.

Yes.

>
> It is clear from reading RFC 3986 that URIs are characters, not octets
> (that's not debatable - as we see URIs written on billboards, they're
> definately characters). What *is* debatable is whether the source data
> for a URI is a character or a byte. I'm arguing it can be either.
>

I wasn't talking about URIs (you're right, they're clearly sequences of
characters), I was talking about the process of percent-encoding, which
takes a sequence of octets, and produces an ASCII string, and the process of
percent-decoding, which takes that ASCII string, and produces that same
original sequence of octets.  Don't confuse the two issues.

You seem to have arrived at the same outcome, but from a very different
> standpoint that your unquote_as_bytes is "correct" and unquote_as_string
> is a nasty hack

Nasty?  I suppose that's in the eye of the beholder.  No, I think that this
whole situation has come about because Python 1 and Python 2 had this
unfortunate ambiguity about what a "str" was.  urllib.unquote() has always
produced a sequence of bytes, but unfortunately too many careless
programmers have regarded it as a sequence of characters.  Unquote_as_string
simply (and, IMO,  usefully), conflates two operations:  percent-decode, and
string decode.  Non-careless programmers have always written something like:

   s = urllib.unquote(frag).decode("UTF-8", "strict")

That is, they understood what was going on, and handled it correctly.  I'd
suggest removing "unquote" completely, and leaving everyone porting to
Python 3 to make their own choice between unquote_to_bytes and
unquote_to_string.

> Bill:
> > Barring any other information, I think that the "segments" in the
> > path of an "http" URL must also be assumed to be binary; that is,
> > any octet is allowed, and no character set can be presumed.
>
> Once again, practicality beats purity. Segments in HTTP URLs usually
> refer to file paths. Files (in all modern operating systems) are Unicode
> filenames, not byte filenames. So you HAVE to assume the character set
> at some point. The best assumption we can make (as shown by all the
> major browsers' defaults) is UTF-8.
>

You're not suggesting we build something that works for 80% of the cases,
and breaks on the other 20%?  When we don't need to?

> So this probably indicates a bug in urllib.parse:
> > urllib.parse.urlparse(<snip>)
> > should throw an exception somewhere.
>
> No, this is just assuming the behaviour of unquote (which is called by
> urlparse) to preserve unescaped characters as-is. So this "bug" is
> actually in your own implementation of unquote (and mine too). But I
> refute that that's desirable behaviour. I think you SHOULD NOT have
> removed that test case. It is indicating a bug in your code (which is
> what it's there for).

This has nothing to do with unquote().  The octets in question are not
percent-encoded.

> Look, I'm sorry to be so harsh, but I don't see what this alternative
> patch accomplishes. My analysis probably comes off as self-righteous,
> but then, I've spent the past month reading documentation and testing
> every aspect of this issue rigorously, and all I see is you ignoring all
> my work and rewriting it in a less-robust way.

Sorry to be so harsh, but... wow, take some lessons in humility!  It's not
about you or your patch or how long you spend reading stuff, it's about
Python. And, without referring explicitly to your patch, it really doesn't
matter how long someone spends on something, if they get it wrong.  I think
you are conflating things in a way which will break stuff (and not break
other stuff, that's true).  We're arguing about which way of breaking things
is better, I think.

> Your point here was to make the str->bytes and bytes->str versions of
> unquote and quote, respectively, work, right? Well I've already got
> those working too. Please tell me what you think of my implementation so
> we can get it approved before the deadline.

Matt, your patch is not some God-given thing here.  There are lots of
patches waiting to be looked at, and this is a fairly trivial concern.  I'm
guessing there are lots of URIs out there that contain non-UTF-8 octets,
particularly in Japan, which has used ISO-2022-JP for years (though Antoine
advances an example of Latin-1 in URLs right here).  Be nice if someone who
worked at Google (hint!, hint!) looked through their massive collection of
URLs and validated this hypothesis, or not.
Files
File name Uploaded
unnamed janssen, 2008-08-12.02:30:45
History
Date User Action Args
2008-08-12 02:30:56janssensetrecipients: + janssen, lemburg, gvanrossum, loewis, jimjjewett, orsenthil, pitrou, thomaspinckney3, mgiuca
2008-08-12 02:30:54janssenlinkissue3300 messages
2008-08-12 02:30:52janssencreate