Setting the default encoding to Latin-1 would prevent these errors,
but would commit the sin of mojibake (the Japanese word for Perl code
:-). I don't like that much either.
No, that would be wrong: it would return a string just for the sake of returning a string. Remember, percent-encoded data is not necessarily a string, and is not necessarily in any known encoding.
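To make the mojibake concern concrete, here is a minimal sketch. It assumes the urllib.parse API as it exists in current Python 3 (unquote() with an encoding argument, plus unquote_to_bytes()); the sample value is purely illustrative.

```python
from urllib.parse import unquote, unquote_to_bytes

# Illustrative input: the UTF-8 bytes of U+3042 (HIRAGANA LETTER A),
# percent-encoded.
quoted = "%E3%81%82"

# Latin-1 maps every byte to a code point, so this never raises,
# but the result is three unrelated characters rather than one.
as_latin1 = unquote(quoted, encoding="latin-1")

# Decoding with the encoding the data was actually written in.
as_utf8 = unquote(quoted, encoding="utf-8")

# No guessing at all: hand the caller the raw bytes.
raw = unquote_to_bytes(quoted)

print(repr(as_latin1), repr(as_utf8), raw)
```

Latin-1 "succeeds" on any input, which is exactly why it hides the problem instead of solving it.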
A middle ground might be to set the default encoding to ASCII --
that's closer to Martin's claim that URLs are supposed to be ASCII
only.
URLs *are* supposed to be ASCII only -- but the percent-encoded byte sequences in various parts of the path aren't.
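A small sketch of that distinction, assuming quote_from_bytes() and unquote_to_bytes() from current Python 3's urllib.parse; the payload and the example.com host are made up for illustration.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical payload: arbitrary bytes that are not text in any
# particular encoding (for instance, part of a binary key).
payload = b"\xff\x00\x80"

# Percent-encoding turns the bytes into a pure-ASCII path segment.
segment = quote_from_bytes(payload)
url = "http://example.com/" + segment

# The URL itself is ASCII only...
url_is_ascii = all(ord(ch) < 128 for ch in url)

# ...and the original bytes come back unchanged, with no decoding step.
round_trip = unquote_to_bytes(segment)

print(url, url_is_ascii, round_trip)
```

The URL is ASCII, but what it carries is not a string until somebody decides on an encoding for it.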
This will require many apps to be changed, but at least it forces the
developers to think about which encoding to assume (perhaps there's
one handy in the request headers if it's a web app) or about error
handling or perhaps using unquote_to_bytes().
Yes, this is closer to my line of reasoning.
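Roughly what an ASCII default would look like to a caller, sketched against current Python 3's urllib.parse. The helper name unquote_ascii is hypothetical; it only simulates the proposed default by passing encoding="ascii" and errors="strict" explicitly.

```python
from urllib.parse import unquote, unquote_to_bytes

def unquote_ascii(s):
    """Hypothetical stand-in for unquote() with a strict ASCII default."""
    return unquote(s, encoding="ascii", errors="strict")

# Pure-ASCII escapes decode without trouble.
plain = unquote_ascii("/a%20b")

# Anything outside ASCII fails loudly instead of being guessed at.
try:
    unquote_ascii("/caf%C3%A9")
    failed = False
except UnicodeDecodeError:
    failed = True

# The developer then has to choose: pick an encoding, or keep bytes.
chosen = unquote("/caf%C3%A9", encoding="utf-8", errors="strict")
raw = unquote_to_bytes("/caf%C3%A9")

print(plain, failed, chosen, raw)
```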
However I fear that this middle ground will in practice cause:
(a) more in-the-field failures, since devs are notorious for testing
with ASCII only; and
Returning bytes deals with this problem.
(b) the creation of a recipe for "fixing" unquote() calls that fail by
setting the encoding to UTF-8 without thinking about the alternatives,
thereby effectively recreating the UTF-8 default with much more pain.
Could be, but at least they will have had to think about it. There's lots of bad code out there, and maybe by making them think about it, some of it will improve.
> A secondary concern is that it
> will invisibly produce invalid data, because it decodes some
> non-UTF-8-encoded string that happens to only use UTF-8-valid sequences
> as the wrong string value.
In my experience this is very unlikely. UTF-8 looks like total junk in
Latin-1, so it's unlikely to occur naturally. If you see something
that matches a UTF-8 sequence in Latin-1 text, it's most likely that
in fact it was incorrectly decoded earlier...
Latin-1 isn't the only alternate encoding in the world, and not all percent-encoded byte sequences in URLs are encoded strings. I'd feel better if we were being guided by more than just your experience (vast though it may rightly be said to be!). Say, by looking at all the URLs that Google knows about :-). I'd particularly feel better if some expert in Asian use of the Web spoke up here...
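As a small illustration of the ambiguity (not evidence about how often it occurs in practice), the sketch below decodes one percent-encoded byte pair under three codecs. It assumes current Python 3's urllib.parse.unquote() and the standard "gbk" codec; the sample bytes are chosen only because they happen to be valid in all three.

```python
from urllib.parse import unquote

# Two percent-encoded bytes with no indication of their encoding.
quoted = "%C3%A9"

# Strict decoding succeeds under each codec, with a different result
# each time, so a successful decode does not prove the guess was right.
readings = {
    enc: unquote(quoted, encoding=enc, errors="strict")
    for enc in ("utf-8", "latin-1", "gbk")
}

for enc, text in readings.items():
    print(enc, repr(text))
```

How common such collisions are in real URLs is an empirical question, which this sketch does not answer.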