Message 69473 - Python tracker


Author	mgiuca
Recipients	loewis, mgiuca, orsenthil, thomaspinckney3
Date	2008-07-09.15:51:47
Message-id	<1215618712.42.0.51291899833.issue3300@psf.upfronthosting.co.za>
In-reply-to

Content

OK, I've gone back over the patch and decided to add the "encoding" and
"errors" arguments from the str.encode/decode methods as optional
arguments to quote and unquote. This is a much bigger change than I
originally intended, but I think it makes things much better because
we'll get UTF-8 by default (which as far as I can tell is by far the
most common encoding).

(Tom Pinckney just made the same suggestion right as I'm typing this up!)

So my new patch is a bit more extensive, and changes the interface (in a
backwards-compatible way). Both quote and unquote now support "encoding"
and "errors" arguments, defaulting to "utf-8" and "replace", respectively.
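
For readers who want to try the two arguments, here is a minimal sketch
against the urllib.parse that ships with current Python 3. That module may
differ in detail from this patch (for instance, its quote documents
errors="strict" as the default, not "replace"), so treat it as an
illustration of the idea, not of the patch itself.

```python
from urllib.parse import quote, unquote

# Default encoding is UTF-8: U+00E9 becomes two percent-encoded octets.
utf8_quoted = quote("caf\u00e9")                        # 'caf%C3%A9'
# An explicit encoding changes which octets get percent-encoded.
latin1_quoted = quote("caf\u00e9", encoding="latin-1")  # 'caf%E9'

# unquote decodes the octets using the same pair of arguments.
utf8_text = unquote("caf%C3%A9")
latin1_text = unquote("caf%E9", encoding="latin-1")

# A lone %E9 is not valid UTF-8; errors="replace" substitutes U+FFFD
# instead of raising.
replaced = unquote("caf%E9", encoding="utf-8", errors="replace")
```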

Implementation detail: This changes the Quoter class a lot; it now
hashes four fields to ensure it doesn't use the wrong cache.
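
The message does not say which four fields are hashed, so the following is
only a sketch of the general idea and not the Quoter class from the patch.
It assumes the cache key is (character, safe, encoding, errors); the names
quote_char, sketch_quote and ALWAYS_SAFE are hypothetical.

```python
from functools import lru_cache

# Hypothetical set of characters that are never percent-encoded.
ALWAYS_SAFE = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789_.-"
)

@lru_cache(maxsize=None)
def quote_char(char, safe, encoding, errors):
    """Percent-encode one character; all four arguments form the cache key."""
    if char in ALWAYS_SAFE or char in safe:
        return char
    octets = char.encode(encoding, errors)
    return "".join("%{:02X}".format(b) for b in octets)

def sketch_quote(text, safe="/", encoding="utf-8", errors="replace"):
    """Quote a string one character at a time through the shared cache."""
    return "".join(quote_char(c, safe, encoding, errors) for c in text)

as_utf8 = sketch_quote("\u00e9")
as_latin1 = sketch_quote("\u00e9", encoding="latin-1")
```

If the encoding were left out of the key, the second call would return the
cached UTF-8 result for the same character, which is the kind of "wrong
cache" problem described above.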

Also fixed an issue with the previous patch where non-ASCII-compatible
encodings broke for code points < 128.
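
As background on why code points below 128 need care: in an encoding that
is not ASCII-compatible, even a plain "A" does not encode to the single
octet 0x41. The snippet below only illustrates the encodings; it does not
reproduce the bug or the fix in the patch.

```python
# Encode the same ASCII character under several encodings.
samples = {
    enc: "A".encode(enc)
    for enc in ("utf-8", "latin-1", "utf-16-le", "utf-32-be")
}

# ASCII-compatible encodings give the single octet 0x41;
# UTF-16 and UTF-32 pad it with zero octets.
for enc, octets in samples.items():
    print(enc, octets)
```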

I then ran the full test suite and discovered that test cases in two
other modules broke. I've fixed them so the full suite passes, but I
suspect there may be more issues (see below).

* Lib/test/test_http_cookiejar.py: A test case was written explicitly
expecting Latin-1 encoding. I've changed this test case to expect UTF-8.
* Lib/email/utils.py: I extensively analysed this code and discovered
that it kind of "cheats" - it uses the Latin-1 encoding and treats it as
octets, then applies its own encoding scheme. So to fix this, I changed
the email module to call quote and unquote with encoding="latin-1".
Hence it has the same behaviour as before.
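
The "Latin-1 as octets" trick mentioned for the email module can be
illustrated as follows. This is a sketch using the quote and unquote from
current Python 3, not the code in Lib/email/utils.py.

```python
from urllib.parse import quote, unquote

# Latin-1 maps every octet 0x00-0xFF to the code point with the same
# number, so a str can carry arbitrary bytes through a text interface.
raw = bytes(range(256))
as_text = raw.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# With encoding="latin-1", quote percent-encodes exactly those octets,
# and unquote turns them back into the same str.
quoted = quote(as_text, encoding="latin-1")
restored = unquote(quoted, encoding="latin-1")
```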

Some potential issues:

* I have not updated the documentation yet. If this idea is to go ahead,
the docs will need to show these new optional arguments. (I'll do that
myself but haven't yet).
* While the full test suite passes, I'm sure there will be many more
issues since I've changed the interface. Therefore I don't recommend
accepting this patch just yet. I plan to investigate all uses of quote
and unquote within the standard library to see if there are any other
compatibility issues, particularly within urllib. I'll respond to this
again in a few days.
* The new patch to the "safe" argument of quote allows non-ASCII
characters to be made safe. This correspondingly allows the construction
of URIs with non-ASCII characters. Is it better to allow users to do this
if they really want, or just mysteriously fail to let those characters
through?
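
For comparison with the last point: the urllib.parse in current Python 3
documents "safe" as additional ASCII characters, and non-ASCII characters
listed there are quietly ignored. A quick check against that module (not
against this patch):

```python
from urllib.parse import quote

# "/" is ASCII and listed in safe, so it is left alone.
# U+00E9 is also listed in safe, but being non-ASCII it is still
# percent-encoded as its UTF-8 octets.
result = quote("\u00e9/\u00fc", safe="/\u00e9")
```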

I would also like to have a separate pair of functions, unquote_raw and
quote_raw, which work on bytes objects instead of strings. (unquote_raw
would take a str and produce a bytes, while quote_raw would take a bytes
and produce a str). As URI encoding is fundamentally an octet encoding,
not a character encoding, this is the only way to do URI encoding
without choosing a Unicode character encoding. (I see some modules such
as "email" treating the implicit Latin-1 encoding as byte encoding,
which is a bit dodgy - they could benefit from raw functions). But as
that requires further changes to the interface, I'll save it for another
day.
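
The names quote_raw and unquote_raw are proposals only. Current Python 3
does provide bytes-oriented functions under different names,
quote_from_bytes and unquote_to_bytes, which show the same idea of
percent-encoding octets without choosing a character encoding:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Arbitrary octets, including ones that are not valid UTF-8.
raw = b"\xff\x00abc"

quoted = quote_from_bytes(raw)        # bytes -> str
restored = unquote_to_bytes(quoted)   # str -> bytes
```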

Patch (parse.py.patch2) is for branch /branches/py3k, revision 64820.

Commit log:

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets
(previously implicitly decoded as ISO-8859-1). As per RFC 3986, default
is "utf-8".

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded (previously characters in range(128, 256)
were encoded as ISO-8859-1, and characters above that as UTF-8). Also
fixed characters above 255 not responding to "safe", and also not
being cached.
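
To show why the previous mixed behaviour was a problem, here is a
hypothetical reconstruction based only on the description above (it is
not the old urllib code, and mixed_quote and SAFE_ASCII are made-up
names). Two different strings end up with the same quoted form, so the
result cannot be decoded reliably.

```python
# Hypothetical set of ASCII characters left unquoted.
SAFE_ASCII = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-/"
)

def mixed_quote(text):
    """Sketch of the old behaviour: Latin-1 below 256, UTF-8 above."""
    pieces = []
    for ch in text:
        if ch in SAFE_ASCII:
            pieces.append(ch)
            continue
        # Code points below 256 become one Latin-1 octet, the rest UTF-8.
        octets = ch.encode("latin-1" if ord(ch) < 256 else "utf-8")
        pieces.extend("%{:02X}".format(b) for b in octets)
    return "".join(pieces)

euro = mixed_quote("\u20ac")                 # one character, U+20AC
lookalike = mixed_quote("\u00e2\u0082\u00ac")  # three Latin-1 characters
```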

Lib/test/test_urllib.py, Lib/test/test_http_cookiejar.py: Updated test
cases that expected output in ISO-8859-1 to expect UTF-8.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
