On Sat, Aug 9, 2008 at 11:34 AM, Matt Giuca <report@bugs.python.org> wrote:

Matt Giuca <matt.giuca@gmail.com> added the comment:

> Bill, I had a look at your patch. I see you've decided to make
> quote_as_string the default? In that case, I don't know why you had to
> rewrite everything to implement the same basic behaviour as my patch.

Just to get things clear in my head: I don't care what code is used, so long as it doesn't break the Web.
 
> (My latest few patches support bytes both ways). Anyway, I have a lot of
> issues with your implementation.
>
> * Why did you replace all of the existing machinery? Particularly the
> way quote creates Quoter objects and stores them in a cache. I haven't
> done any speed tests, but I assume that was all there for performance
> reasons.

I see no real advantage there, except that it has a built-in memory leak.  Just use a function. 
 
> * The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or
> bytes, and outputs a URI as a bytes, while quote_as_string outputs a URI
> as a str. This is the first time in the whole discussion we've
> represented a URI as bytes, not a str. URIs are not byte sequences, they
> are character sequences (see discussion below).

Just there for consistency.  Of course,  URIs are actually strings of *ASCII* characters, which sort of specifies an encoding right there, so using ASCII octets isn't all that far-fetched...
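Purely for illustration (this is not from either patch): because a percent-encoded URI is pure ASCII, the same URI moves between str and bytes losslessly via the ascii codec, so the two representations carry the same information.

    # illustrative only -- an encoded URI round-trips through the ascii codec
    uri_str = '/caf%C3%A9'                  # percent-encoded, ASCII-only str
    uri_bytes = uri_str.encode('ascii')     # b'/caf%C3%A9'
    assert uri_bytes.decode('ascii') == uri_str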


> * The names unquote_as_* and quote_as_* are confusing. Use unquote_to_*
> and quote_from_* to avoid ambiguity.

Good point.

> * Are unquote_as_string and unquote both part of your public interface?

Yes.  Good programmers will switch to unquote_as_string or unquote_as_bytes (or, with your amendment, unquote_to_string and unquote_to_bytes), while unquote is just there for backwards compatibility for naifs.

> * As Antoine pointed out above, it's too limiting for quote to force
> UTF-8. Add a 'charset' parameter.

That's what quote_as_string is for.
 
> * Add an 'errors' parameter too, to give the caller control over how
> strict to be.

No.  Errors should be seen.  This is data, not whimsey.
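For reference, a rough sketch of what the parameters Matt is asking for might look like on the quoting side, if they were simply passed through to str.encode (hypothetical names, not from either patch; quote_from_bytes stands in for whatever the bytes-level quoter ends up being called):

    # hypothetical sketch only -- charset/errors just feed str.encode
    def quote_from_string(s, safe='/', charset='utf-8', errors='strict'):
        octets = s.encode(charset, errors)     # text -> octets first
        return quote_from_bytes(octets, safe)  # then percent-encode the octets

With errors='strict', a character the chosen charset can't encode blows up immediately (the "errors should be seen" position); 'replace' or 'ignore' would let it pass silently (the flexibility Matt is asking for).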

> * unquote and unquote_plus are missing 'charset' param, which should be
> passed along to unquote_as_string.

No, they're only there for backwards compatibility, and the 2.x versions don't have a charset param.

> I do like the addition of a "plus" argument, as opposed to the
> separate unquote_plus and quote_plus functions. I'd swap the arguments
> to unquote around so charset is first and then plus, so you can write
> unquote(mystring, 'utf-8') without using a keyword argument.

Unquote doesn't take a charset (in my design).
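For what it's worth, the "plus" handling being discussed is tiny either way; whether it ends up as a separate function or a keyword argument, it amounts to something like this (sketch only):

    # sketch: '+' -> ' ' translation applied before ordinary percent-decoding
    def unquote_plus(s):
        return unquote(s.replace('+', ' '))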

> * In unquote: The "raw_unicode_escape" encoding makes no sense. It does
> exactly the same thing as Latin-1, except it also looks for b"\\uxxxx"
> in the string and converts that into a Unicode character. So your code
> behaves like this:
>
> >>> urllib.parse.unquote('%5Cu00fc')
> 'ü'
> (Should output "\u00fc")
> >>> urllib.parse.unquote('%5Cu')
> UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in
> position 11-12: truncated \uXXXX
> (Should output "\u")
>
> I suspect the email package (where you got the inspiration to use
> 'rawunicodeescape') has this same crazy problem, but that isn't my
> concern today!
>
> Aside from this weirdness, you're essentially defaulting unquote to
> Latin-1. As I've said countless times, unquote needs to be the inverse
> of quote, or you get this behaviour:
>
> >>> urllib.parse.unquote(urllib.parse.quote('ü'))
> 'Ã¼'
>
> Once again, I refer you to my favourite web server example.
>
> import http.server
> s = http.server.HTTPServer(('',8000),
>        http.server.SimpleHTTPRequestHandler)
> s.serve_forever()
>
> Run this in a directory with a non-Latin-1 filename (eg. "漢字"), and
> you will get a 404 when you click on the file.

Using unquote() and expecting a string is basically broken already.  So what I'm trying to do here is to avoid losing data in the translation.
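(As an aside, the codec behaviour Matt describes above is easy to check at the interpreter: raw_unicode_escape really is Latin-1 plus interpretation of literal \uXXXX sequences.)

>>> b'caf\xe9'.decode('latin-1')
'café'
>>> b'caf\xe9'.decode('raw_unicode_escape')   # same result here
'café'
>>> b'\\u00fc'.decode('latin-1')              # Latin-1 leaves the escape alone
'\\u00fc'
>>> b'\\u00fc'.decode('raw_unicode_escape')   # the extra \uXXXX handling
'ü'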


> * One issue I worked very hard to solve is how to deal with unescaped
> non-ASCII characters in unquote. Technically this is an invalid URI, so
> I'm not sure how important it is, but it's nice to be able to assume the
> unquote function won't mess with them. For example,
> unquote_as_string("\u6f22%C3%BC", charset="latin-1") should give
> "\u6f22\u00fc" (or at least it would be nice). Yours raises
> "UnicodeEncodeError: 'ascii' codec can't encode character". (I assume
> this is a wanted property, given that the existing test suite tests that
> unquote can handle ALL unescaped ASCII characters (that's what
> escape_string in test_unquoting is for) - I merely extend this concept
> to be able to handle all unescaped Unicode characters). Note that it's
> impossible to allow such lenience if you implement unquote_as_string as
> calling unquote_as_bytes and then decoding.

Good thing we don't need to; URIs consist of ASCII characters.
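(For illustration: the UnicodeEncodeError Matt quotes is what you get whenever an unescaped non-ASCII character has to pass through the ASCII codec on its way to octets, e.g.:)

>>> '\u6f22%C3%BC'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u6f22' in position 0: ordinal not in range(128)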

> * Question: How does unquote_bytes deal with unescaped characters?

Not sure I understand this question...

> * Needs a lot more test cases, and documentation for your changes. I
> suggest you plug my new test cases for urllib in and see if you can make
> your code pass all the things I test for (and if not, have a good reason).

Your test cases probably aren't testing things I feel it's necessary to test. I'm interested in having the old test cases for urllib pass, as well as providing the ability to unquote_to_bytes().
 
> In addition, a major problem I have is with this dangerous assumption
> that RFC 3986 specifies a byte->str encoding. You keep stating
> assumptions like this:
>
> > Remember that the RFC for percent-encoding really takes
> > bytes in, and produces bytes out.  The string-in and string-out
> > versions are to support naive programming (what a nice way of
> > putting it!).

Those aren't assumptions, they're simple statements of fact.

> You assume that my patch, the string version of quote/unquote, is a
> "hack" in order to satisfy the naive souls who only want to deal with
> strings, while your method is the "pure and correct" solution.

Yes.
 

> It is clear from reading RFC 3986 that URIs are characters, not octets
> (that's not debatable - as we see URIs written on billboards, they're
> definitely characters). What *is* debatable is whether the source data
> for a URI is a character or a byte. I'm arguing it can be either.

I wasn't talking about URIs (you're right, they're clearly sequences of characters), I was talking about the process of percent-encoding, which takes a sequence of octets, and produces an ASCII string, and the process of percent-decoding, which takes that ASCII string, and produces that same original sequence of octets.  Don't confuse the two issues.
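To make that concrete, here is a minimal, purely illustrative sketch of the octets-in/ASCII-out model (not either patch); note that no character set appears anywhere in it:

    # illustrative sketch of percent-coding as a pure octet transform
    ALWAYS_SAFE = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                   b'abcdefghijklmnopqrstuvwxyz'
                   b'0123456789_.-~')

    def percent_encode(octets):
        # bytes in -> ASCII str out
        return ''.join(chr(b) if b in ALWAYS_SAFE else '%%%02X' % b
                       for b in octets)

    def percent_decode(ascii_str):
        # ASCII str in -> the original bytes out
        out, i = bytearray(), 0
        while i < len(ascii_str):
            if ascii_str[i] == '%':
                out.append(int(ascii_str[i+1:i+3], 16))
                i += 3
            else:
                out.append(ord(ascii_str[i]))
                i += 1
        return bytes(out)

Any charset question arises only before encoding (text -> octets) or after decoding (octets -> text), never inside this step.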

> You seem to have arrived at the same outcome, but from a very different
> standpoint that your unquote_as_bytes is "correct" and unquote_as_string
> is a nasty hack.

Nasty?  I suppose that's in the eye of the beholder.  No, I think that this whole situation has come about because Python 1 and Python 2 had this unfortunate ambiguity about what a "str" was.  urllib.unquote() has always produced a sequence of bytes, but unfortunately too many careless programmers have regarded it as a sequence of characters.  Unquote_as_string simply (and, IMO, usefully) conflates two operations: percent-decode, and string decode.  Non-careless programmers have always written something like:

   s = urllib.unquote(frag).decode("UTF-8", "strict")

That is, they understood what was going on, and handled it correctly.  I'd suggest removing "unquote" completely, and leaving everyone porting to Python 3 to make their own choice between unquote_to_bytes and unquote_to_string.
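In other words (hypothetical names, same idiom as above), the string flavour is just those two steps rolled into one function:

    # sketch: the "conflated" operation = percent-decode, then text-decode
    def unquote_to_string(s, charset='utf-8', errors='strict'):
        return unquote_to_bytes(s).decode(charset, errors)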


> Bill:
> > Barring any other information, I think that the "segments" in the
> > path of an "http" URL must also be assumed to be binary; that is,
> > any octet is allowed, and no character set can be presumed.

> Once again, practicality beats purity. Segments in HTTP URLs usually
> refer to file paths. Files (in all modern operating systems) have Unicode
> filenames, not byte filenames. So you HAVE to assume the character set
> at some point. The best assumption we can make (as shown by all the
> major browsers' defaults) is UTF-8.

You're not suggesting we build something that works for 80% of the cases, and breaks on the other 20%?  When we don't need to?

> > So this probably indicates a bug in urllib.parse:
> > urllib.parse.urlparse(<snip>)
> > should throw an exception somewhere.

> No, this is just assuming the behaviour of unquote (which is called by
> urlparse) to preserve unescaped characters as-is. So this "bug" is
> actually in your own implementation of unquote (and mine too). But I
> refute that that's desirable behaviour. I think you SHOULD NOT have
> removed that test case. It is indicating a bug in your code (which is
> what it's there for).

This has nothing to do with unquote().  The octets in question are not percent-encoded.
 
> Look, I'm sorry to be so harsh, but I don't see what this alternative
> patch accomplishes. My analysis probably comes off as self-righteous,
> but then, I've spent the past month reading documentation and testing
> every aspect of this issue rigorously, and all I see is you ignoring all
> my work and rewriting it in a less-robust way.

Sorry to be so harsh, but... wow, take some lessons in humility!  It's not about you or your patch or how long you spend reading stuff, it's about Python. And, without referring explicitly to your patch, it really doesn't matter how long someone spends on something, if they get it wrong.  I think you are conflating things in a way which will break stuff (and not break other stuff, that's true).  We're arguing about which way of breaking things is better, I think.
 
> Your point here was to make the str->bytes and bytes->str versions of
> unquote and quote, respectively, work, right? Well I've already got
> those working too. Please tell me what you think of my implementation so
> we can get it approved before the deadline.

Matt, your patch is not some God-given thing here.  There are lots of patches waiting to be looked at, and this is a fairly trivial concern.  I'm guessing there are lots of URIs out there that contain non-UTF-8 octets, particularly in Japan, which has used ISO-2022-JP for years (though Antoine advances an example of Latin-1 in URLs right here).  It would be nice if someone who worked at Google (hint, hint!) looked through their massive collection of URLs and validated this hypothesis, or not.