Author mgiuca
Recipients loewis, mgiuca, orsenthil, thomaspinckney3
Date 2008-07-09.15:51:47
Message-id <1215618712.42.0.51291899833.issue3300@psf.upfronthosting.co.za>
In-reply-to
Content
OK, I've gone back over the patch and decided to add the "encoding" and
"errors" arguments from the str.encode/decode methods as optional
arguments to quote and unquote. This is a much bigger change than I
originally intended, but I think it makes things much better because
we'll get UTF-8 by default (which as far as I can tell is by far the
most common encoding).

(Tom Pinckney just made the same suggestion right as I'm typing this up!)

So my new patch is a bit more extensive, and changes the interface (in a
backwards-compatible way). Both quote and unquote now support "encoding"
and "errors" arguments, defaulting to "utf-8" and "replace", respectively.
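
To illustrate what the two new arguments do, here is a minimal sketch using urllib.parse as it exists in current Python 3, which accepts "encoding" and "errors" on both functions. Both arguments are passed explicitly, since the defaults in the released module may not match the ones in this patch.

```python
from urllib.parse import quote, unquote

# Encode the same character under two different character encodings.
utf8_quoted = quote("é", encoding="utf-8", errors="replace")
latin1_quoted = quote("é", encoding="latin-1", errors="replace")

# Decoding with the matching encoding recovers the character.
utf8_roundtrip = unquote(utf8_quoted, encoding="utf-8", errors="replace")
latin1_roundtrip = unquote(latin1_quoted, encoding="latin-1", errors="replace")

# A lone Latin-1 octet is not valid UTF-8, so errors="replace"
# substitutes U+FFFD instead of raising.
mismatched = unquote(latin1_quoted, encoding="utf-8", errors="replace")

print(utf8_quoted, latin1_quoted, repr(mismatched))
```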

Implementation detail: This changes the Quoter class a lot; it now
hashes four fields to ensure it doesn't use the wrong cache.

Also fixed an issue with the previous patch where non-ASCII-compatible
encodings broke for code points < 128.
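
As an illustration of that case, the sketch below uses UTF-16 (little-endian, to avoid a byte order mark) as an example of a non-ASCII-compatible encoding. It runs against the current urllib.parse rather than the patched module, so it shows the intended behaviour, not the patch itself.

```python
from urllib.parse import quote, unquote

# In UTF-16-LE, "a" is the two octets 0x61 0x00. The first is an
# unreserved ASCII octet; the second must be percent-encoded.
quoted = quote("a", encoding="utf-16-le")

# Decoding with the same encoding reassembles the original character.
roundtrip = unquote(quoted, encoding="utf-16-le")

print(quoted, roundtrip)
```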

I then ran the full test suite and discovered that two other modules'
test cases broke. I've fixed them so the full suite passes, but I
suspect there may be more issues (see below).

* Lib/test/test_http_cookiejar.py: A test case was written explicitly
expecting Latin-1 encoding. I've changed this test case to expect UTF-8.
* Lib/email/utils.py: I extensively analysed this code and discovered
that it kind of "cheats" - it uses the Latin-1 encoding and treats it as
octets, then applies its own encoding scheme. So to fix this, I changed
the email module to call quote and unquote with encoding="latin-1".
Hence it has the same behaviour as before.
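
The Latin-1 trick works because Latin-1 maps code points 0-255 one-to-one onto octets 0-255. A minimal sketch of that property, using the current urllib.parse (the variable names are only for illustration and are not taken from the email module):

```python
from urllib.parse import quote, unquote

# Every possible octet, carried inside a str via Latin-1.
octets = bytes(range(256))
as_text = octets.decode("latin-1")

# Quoting with encoding="latin-1" percent-encodes exactly those octets.
quoted = quote(as_text, safe="", encoding="latin-1")

# Unquoting with the same encoding gives the original octets back.
restored = unquote(quoted, encoding="latin-1").encode("latin-1")
```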

Some potential issues:

* I have not updated the documentation yet. If this idea is to go ahead,
the docs will need to show these new optional arguments. (I'll do that
myself but haven't yet).
* While the full test suite passes, I'm sure there will be many more
issues since I've changed the interface. Therefore I don't recommend
accepting this patch just yet. I plan to do an investigation into all
uses (within the standard lib) of quote and unquote to see if there are
any other compatibility issues, particularly within urllib. Hence I'll
respond to this again in a few days.
* The new patch to the "safe" argument of quote allows non-ASCII characters
to be made safe. This correspondingly allows the construction of URIs
with non-ASCII characters. Is it better to allow users to do this if
they really want, or just mysteriously fail to let those characters through?
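
For comparison, the sketch below shows what the current urllib.parse does with a non-ASCII character in "safe". As far as I can tell from the released implementation, such characters are silently dropped from "safe", which is the second of the two options in the question above and differs from what this patch does.

```python
from urllib.parse import quote

# ASCII characters in "safe" are passed through unencoded.
ascii_safe = quote("a/b", safe="/")
ascii_unsafe = quote("a/b", safe="")

# A non-ASCII character in "safe" has no effect in the released module:
# the character is still percent-encoded.
non_ascii_safe = quote("é", safe="é", encoding="utf-8")

print(ascii_safe, ascii_unsafe, non_ascii_safe)
```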

I would also like to have a separate pair of functions, unquote_raw and
quote_raw, which work on bytes objects instead of strings. (unquote_raw
would take a str and produce a bytes, while quote_raw would take a bytes
and produce a str). As URI encoding is fundamentally an octet encoding,
not a character encoding, this is the only way to do URI encoding
without choosing a Unicode character encoding. (I see some modules such
as "email" treating the implicit Latin-1 encoding as byte encoding,
which is a bit dodgy - they could benefit from raw functions). But as
that requires further changes to the interface, I'll save it for another
day.
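
To make the idea concrete: the names quote_raw and unquote_raw are only the proposal above, but the current urllib.parse has quote_from_bytes and unquote_to_bytes, which fill a similar role. A minimal sketch using those, with no character encoding involved at any point:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Arbitrary octets, not valid as UTF-8 text.
raw = b"\xe9\xff/path"

# bytes -> str: each non-safe octet becomes %XX ("/" is safe by default).
quoted = quote_from_bytes(raw)

# str -> bytes: percent-escapes are turned back into single octets.
restored = unquote_to_bytes(quoted)

print(quoted, restored)
```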

Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets
(previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
is "utf-8".

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8). Also
fixed characters with code points of 256 and above not responding to
"safe", and also not being cached.

Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
cases which expected output in ISO-8859-1; they now expect UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).