Author lemburg
Recipients gvanrossum, janssen, jimjjewett, lemburg, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date 2008-08-07.21:43:25
Message-id <489B6C7C.4030803@egenix.com>
In-reply-to <1218143828.6.0.619997562476.issue3300@psf.upfronthosting.co.za>
Content
On 2008-08-07 23:17, Bill Janssen wrote:
> Bill Janssen <bill.janssen@gmail.com> added the comment:
> 
> My main fear with this patch is that "unquote" will become seen as
> unreliable, because naive software trying to parse URLs will encounter
> uses of percent-encoding where the encoded octets are not in fact UTF-8
> bytes.  They're just some set of bytes. 

unquote will have to be able to deal with old-style URLs that
use the Latin-1 encoding. HTML uses (or used to use) the Latin-1
encoding as default and that's how URLs ended up using it as well:

http://www.w3schools.com/TAGS/ref_urlencode.asp

I'd suggest having it try UTF-8 decoding first and fall back to
Latin-1 decoding if that fails.
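
A minimal sketch of that fallback, written against today's Python 3
standard library rather than the patch under discussion. The helper name
unquote_with_fallback is hypothetical; urllib.parse.unquote_to_bytes is
the only library call it relies on.

```python
from urllib.parse import unquote_to_bytes


def unquote_with_fallback(quoted):
    """Percent-decode a URL component: try UTF-8 first, then Latin-1.

    Hypothetical helper for illustration only; it is not part of urllib.
    """
    raw = unquote_to_bytes(quoted)  # resolve %XX escapes to raw octets
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this
        # second attempt cannot fail.
        return raw.decode("latin-1")


print(unquote_with_fallback("caf%C3%A9"))  # UTF-8 encoded e-acute
print(unquote_with_fallback("caf%E9"))     # Latin-1 encoded e-acute
```

Both calls should yield the same text, since the second input is not
valid UTF-8 and therefore takes the Latin-1 branch.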

> A secondary concern is that it
> will invisibly produce invalid data, because it decodes some
> non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> as the wrong string value.

It's rather unlikely that someone will have used a Latin-1 encoded
URL which happens to decode as valid UTF-8: the Latin-1 character
combinations that form valid UTF-8 sequences don't really make any
sense when used as text, e.g.

Ã¤Ã¶Ã¼
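
The point can be checked with a short sketch, assuming Python 3. The
helper name looks_like_utf8 and the sample string are illustrative
choices, not something from the patch.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the byte string is a valid UTF-8 sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Ordinary Latin-1 text: three umlauts, one byte each.
latin1_bytes = "äöü".encode("latin-1")   # b'\xe4\xf6\xfc'
print(looks_like_utf8(latin1_bytes))

# The same umlauts as UTF-8 take two bytes each. Read back as Latin-1,
# those bytes form the kind of text that would have to appear in a
# Latin-1 URL for it to be mistaken for UTF-8.
utf8_bytes = "äöü".encode("utf-8")       # b'\xc3\xa4\xc3\xb6\xc3\xbc'
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)
print(looks_like_utf8(mojibake.encode("latin-1")))
```

The first check fails to decode as UTF-8, so real Latin-1 text falls
through to the fallback. Only the nonsense-looking string passes as UTF-8.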

> Now, I have to confess that I don't know how common these use cases are
> in actual URL usage.  It would be nice if there was some organization
> that had a large collection of URLs, and could provide a test set we
> could run a scanner over :-).
> 
> As a workaround, though, I've sent a message off to Larry Masinter to
> ask about this case.  He's one of the authors of the URI spec.
History
Date User Action Args
2008-08-07 21:43:33 lemburg set recipients: + lemburg, gvanrossum, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3, mgiuca
2008-08-07 21:43:25 lemburg link issue3300 messages
2008-08-07 21:43:25 lemburg create