Message 70949 - Python tracker

Author	mgiuca
Recipients	gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-09.18:34:19
Message-id	<1218306864.44.0.78350530603.issue3300@psf.upfronthosting.co.za>
In-reply-to

Content

Bill, I had a look at your patch. I see you've decided to make
quote_as_string the default? In that case, I don't know why you had to
rewrite everything to implement the same basic behaviour as my patch.
(My latest few patches support bytes both ways). Anyway, I have a lot of
issues with your implementation.

* Why did you replace all of the existing machinery? Particularly the
way quote creates Quoter objects and stores them in a cache. I haven't
done any speed tests, but I assume that was all there for performance
reasons.

* The idea of quote_as_bytes is malformed. quote_as_bytes takes a str or
bytes, and outputs a URI as bytes, while quote_as_string outputs a URI
as a str. This is the first time in the whole discussion we've
represented a URI as bytes, not a str. URIs are not byte sequences, they
are character sequences (see discussion below). I think only
quote_as_string is valid.

* The names unquote_as_* and quote_as_* are confusing. Use unquote_to_*
and quote_from_* to avoid ambiguity.

* Are unquote_as_string and unquote both part of your public interface?
That seems like unnecessary duplication.

* As Antoine pointed out above, it's too limiting for quote to force
UTF-8. Add a 'charset' parameter.

* Add an 'errors' parameter too, to give the caller control over how
strict to be.

* unquote and unquote_plus are missing 'charset' param, which should be
passed along to unquote_as_string.

* I do like the addition of a "plus" argument, as opposed to the
separate unquote_plus and quote_plus functions. I'd swap the arguments
to unquote around so charset is first and then plus, so you can write
unquote(mystring, 'utf-8') without using a keyword argument.
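
A minimal sketch of the argument order suggested above. unquote_sketch is a
hypothetical name, and it simply wraps the unquote and unquote_plus that
ship in today's urllib.parse (which spell the parameter `encoding` rather
than `charset`); it is not the signature from either patch.

```python
from urllib.parse import unquote, unquote_plus

def unquote_sketch(string, charset="utf-8", errors="replace", plus=False):
    """Hypothetical signature: charset precedes plus, so it can be positional."""
    decode = unquote_plus if plus else unquote
    return decode(string, encoding=charset, errors=errors)

latin = unquote_sketch("%FC", "latin-1")         # charset passed positionally
spaced = unquote_sketch("a+b%20c", plus=True)    # '+' treated as a space
literal = unquote_sketch("a+b")                  # '+' left alone by default
```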

* In unquote: The "raw_unicode_escape" encoding makes no sense. It does
exactly the same thing as Latin-1, except it also looks for b"\\uxxxx"
in the string and converts that into a Unicode character. So your code
behaves like this:

>>> urllib.parse.unquote('%5Cu00fc')
'ü'
(Should output "\u00fc")
>>> urllib.parse.unquote('%5Cu')
UnicodeDecodeError: 'rawunicodeescape' codec can't decode bytes in
position 11-12: truncated \uXXXX
(Should output "\u")

I suspect the email package (where you got the inspiration to use
'rawunicodeescape') has this same crazy problem, but that isn't my
concern today!
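
The difference between the two codecs can be checked directly, without
urllib. This is only a sketch of the codec behaviour described above; the
variable names are illustrative.

```python
raw = b"\\u00fc"  # six ASCII bytes: backslash, u, 0, 0, f, c

as_latin1 = raw.decode("latin-1")                 # bytes map one-to-one to characters
as_raw_escape = raw.decode("raw_unicode_escape")  # the escape is interpreted

# A lone backslash-u is rejected outright by raw_unicode_escape.
try:
    b"\\u".decode("raw_unicode_escape")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

Latin-1 leaves the six characters untouched, while raw_unicode_escape
collapses them into a single 'ü' and refuses the truncated form.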

Aside from this weirdness, you're essentially defaulting unquote to
Latin-1. As I've said countless times, unquote needs to be the inverse
of quote, or you get this behaviour:

>>> urllib.parse.unquote(urllib.parse.quote('ü'))
'Ã¼'

Once again, I refer you to my favourite web server example.

import http.server
s = http.server.HTTPServer(('',8000),
        http.server.SimpleHTTPRequestHandler)
s.serve_forever()

Run this in a directory with a non-Latin-1 filename (e.g. "漢字"), and
you will get a 404 when you click on the file.
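
The inverse property can be sketched with the quote and unquote that ship
in today's urllib.parse, where both default to UTF-8. Passing
encoding="latin-1" to unquote is used here only to reproduce the mismatch
described above.

```python
from urllib.parse import quote, unquote

name = "漢字"
escaped = quote(name)            # the str is encoded as UTF-8 before escaping
round_trip = unquote(escaped)    # decoded as UTF-8 again: the inverse of quote

# Decoding with a different charset than quote used gives mojibake.
mismatched = unquote(quote("ü"), encoding="latin-1")
```

A server that unquotes with the wrong charset looks up a filename that does
not exist, which is where the 404 comes from.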

* One issue I worked very hard to solve is how to deal with unescaped
non-ASCII characters in unquote. Technically this is an invalid URI, so
I'm not sure how important it is, but it's nice to be able to assume the
unquote function won't mess with them. For example,
unquote_as_string("\u6f22%C3%BC", charset="latin-1") should give
"\u6f22\u00fc" (or at least it would be nice). Yours raises
"UnicodeEncodeError: 'ascii' codec can't encode character". (I assume
this is a wanted property, given that the existing test suite tests that
unquote can handle ALL unescaped ASCII characters (that's what
escape_string in test_unquoting is for) - I merely extend this concept
to be able to handle all unescaped Unicode characters). Note that it's
impossible to allow such lenience if you implement unquote_as_string as
calling unquote_as_bytes and then decoding.
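
A sketch of the leniency described above, using the unquote from today's
urllib.parse. Note that under Latin-1 the escape for 'ü' is %FC (%C3%BC is
its UTF-8 form), so the sketch uses %FC with Latin-1. unquote_via_bytes is
a hypothetical helper that mimics the "unquote to bytes, then decode"
approach; it is not taken from either patch.

```python
from urllib.parse import unquote, unquote_to_bytes

mixed = "\u6f22%FC"  # unescaped '漢' followed by the Latin-1 escape for 'ü'

# Only the escaped octets are decoded; the unescaped character passes through.
lenient = unquote(mixed, encoding="latin-1")

def unquote_via_bytes(string, charset):
    """Hypothetical helper: unquote to bytes first, then decode everything."""
    return unquote_to_bytes(string.encode("ascii")).decode(charset)

try:
    unquote_via_bytes(mixed, "latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True  # '漢' cannot be represented as ASCII
```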

* Question: How does unquote_as_bytes deal with unescaped characters?
(Since this is a str->bytes transform, you need to encode them somehow).
I don't have a good answer for you here, which is one reason I think
it's wrong to treat a URI as an octet encoding. I treat them as UTF-8.
You treat them as ASCII. Since URIs are supposed to only contain ASCII,
the answers "ASCII", "Latin-1" and "UTF-8" are all as good as each
other, but as I said above, I prefer to be lenient and allow non-ASCII
URIs as input.
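
For comparison, the unquote_to_bytes in today's urllib.parse takes the
UTF-8 answer when it is handed a str. This sketch only observes that
behaviour; it says nothing about what either patch did at the time.

```python
from urllib.parse import unquote_to_bytes

# Unescaped non-ASCII characters in a str are encoded as UTF-8,
# while escaped octets are passed through untouched.
from_mixed = unquote_to_bytes("\u6f22%FC")

# For pure-ASCII input the choice of ASCII, Latin-1 or UTF-8 makes no difference.
from_ascii_str = unquote_to_bytes("abc%FC")
from_ascii_bytes = unquote_to_bytes(b"abc%FC")
```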

* Needs a lot more test cases, and documentation for your changes. I
suggest you plug my new test cases for urllib in and see if you can make
your code pass all the things I test for (and if not, have a good reason).

In addition, a major problem I have is with this dangerous assumption
that RFC 3986 specifies a byte->str encoding. You keep stating
assumptions like this:

> Remember that the RFC for percent-encoding really takes
> bytes in, and produces bytes out.  The string-in and string-out
> versions are to support naive programming (what a nice way of
> putting it!).

You assume that my patch, the string version of quote/unquote, is a
"hack" in order to satisfy the naive souls who only want to deal with
strings, while your method is the "pure and correct" solution. This is
in no way certain.

It is clear from reading RFC 3986 that URIs are characters, not octets
(that's not debatable - as we see URIs written on billboards, they're
definitely characters). What *is* debatable is whether the source data
for a URI is a character or a byte. I'm arguing it can be either.

The process of percent-encoding is taking a character, converting it
into an octet, and percent-encoding that octet into a 3-character
sequence. I quote:

> Percent-encoded octets may be used within a URI to represent
> characters outside the range of the US-ASCII coded character set
> if this representation is allowed by the scheme or by the protocol
> element in which the URI is referenced.  Such a definition should
> specify the character encoding used to map those characters to
> octets prior to being percent-encoded for the URI.

It could not be clearer that the RFC allows us to view percent encoding
as a str->str transform.

I also see later on (in section 2.1) the RFC talking about "data octets"
being percent-encoded, so I assume it's also valid to speak about a
bytes->str transform.
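
Both readings can be sketched in a few lines. This is an illustration of
the process described above, not code from either patch:
percent_encode_bytes and percent_encode_str are hypothetical names, and
the unreserved set follows RFC 3986 section 2.3.

```python
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def percent_encode_bytes(data: bytes) -> str:
    """bytes -> str: each octet becomes itself or a 3-character %XX sequence."""
    return "".join(chr(b) if chr(b) in UNRESERVED else f"%{b:02X}" for b in data)

def percent_encode_str(text: str, encoding: str = "utf-8") -> str:
    """str -> str: map characters to octets first, then percent-encode them."""
    return percent_encode_bytes(text.encode(encoding))

utf8_form = percent_encode_str("ü")               # two octets in UTF-8
latin1_form = percent_encode_str("ü", "latin-1")  # one octet in Latin-1
octet_form = percent_encode_bytes(b"\xfc")        # source data already octets
```

The str->str version is just the bytes->str version with an explicit
character encoding in front, which is the step the RFC asks the scheme to
specify.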

Therefore, I believe my quote and unquote are as correct as can be with
respect to the RFC, and I've also supplied unquote_to_bytes and
quote_from_bytes to satisfy the other (IMHO less clear, and certainly
less useful) reading of the RFC.

You seem to have arrived at the same outcome, but from a very different
standpoint: that your unquote_as_bytes is "correct" and unquote_as_string
is a nasty hack.

Bill:
> Barring any other information, I think that the "segments" in the
> path of an "http" URL must also be assumed to be binary; that is,
> any octet is allowed, and no character set can be presumed.

Once again, practicality beats purity. Segments in HTTP URLs usually
refer to file paths. Files (in all modern operating systems) have Unicode
filenames, not byte filenames. So you HAVE to assume the character set
at some point. The best assumption we can make (as shown by all the
major browsers' defaults) is UTF-8.
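
The trade-off can be sketched with today's urllib.parse: the octet view
makes no assumption but yields nothing usable as a Unicode filename, while
the UTF-8 view yields a name directly and degrades gracefully on invalid
input. The segment values are illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes

segment = "%E6%BC%A2%E5%AD%97.txt"

as_octets = unquote_to_bytes(segment)  # no charset assumed: raw octets only
as_name = unquote(segment)             # UTF-8 assumed: a usable filename

# An octet sequence that is not valid UTF-8 is replaced (the default
# errors="replace") rather than raising.
lossy = unquote("%FC.txt")
```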

> So this probably indicates a bug in urllib.parse:
> urllib.parse.urlparse(<snip>)
> should throw an exception somewhere.

No, this is just assuming the behaviour of unquote (which is called by
urlparse) to preserve unescaped characters as-is. So this "bug" is
actually in your own implementation of unquote (and mine too). But I
refute that that's desirable behaviour. I think you SHOULD NOT have
removed that test case. It is indicating a bug in your code (which is
what it's there for).

Look, I'm sorry to be so harsh, but I don't see what this alternative
patch accomplishes. My analysis probably comes off as self-righteous,
but then, I've spent the past month reading documentation and testing
every aspect of this issue rigorously, and all I see is you ignoring all
my work and rewriting it in a less-robust way.

Your point here was to make the str->bytes and bytes->str versions of
unquote and quote, respectively, work, right? Well I've already got
those working too. Please tell me what you think of my implementation so
we can get it approved before the deadline.

Matt

History
Date	User	Action	Args
2008-08-09 18:34:25	mgiuca	set	recipients: + mgiuca, lemburg, gvanrossum, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3
2008-08-09 18:34:24	mgiuca	set	messageid: <1218306864.44.0.78350530603.issue3300@psf.upfronthosting.co.za>
2008-08-09 18:34:23	mgiuca	link	issue3300 messages
2008-08-09 18:34:19	mgiuca	create