Message 70833 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	mgiuca
Recipients	gvanrossum, janssen, jimjjewett, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date	2008-08-07.14:59:55
SpamBayes Score	8.743839e-12
Marked as misclassified	No
Message-id	<1218121198.79.0.14201620078.issue3300@psf.upfronthosting.co.za>
In-reply-to

Content
Following Guido and Antoine's reviews, I've written a new patch which fixes most of the issues raised. The ones I didn't fix I have noted below, and commented on the review site (http://codereview.appspot.com/2827/). Note: I intend to address all of these issues after some discussion. Outstanding issues raised by the reviews: Doc/library/urllib.parse.rst: Should unquote accept a bytes/bytearray as well as a str? Lib/email/utils.py: Should encode_rfc2231 with charset=None accept strings with non-ASCII characters, and just encode them to UTF-8? Lib/test/test_http_cookiejar.py: Does RFC 2965 let me get away with changing the test case to expect UTF-8? (I'm pretty sure it doesn't care what encoding is used). Lib/test/test_urllib.py: Should quote raise a TypeError if given a bytes with encoding/errors arguments? (Motivation: TypeError is what you usually raise if you supply too many args to a function). Lib/urllib/parse.py: (As discussed above) Should quote accept safe characters outside the ASCII range (thereby potentially producing invalid URIs)? ------ Commit log for patch8: Fix for issue 3300. urllib.parse.unquote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the decoding of percent-encoded octets. As per RFC 3986, default is "utf-8" (previously implicitly decoded as ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as "%aF") weren't being decoded at all. urllib.parse.quote: Added "encoding" and "errors" optional arguments, allowing the caller to determine the encoding of non-ASCII characters before being percent-encoded. Default is "utf-8" (previously characters in range(128, 256) were encoded as ISO-8859-1, and characters above that as UTF-8). Also characters/bytes above 128 are no longer allowed to be "safe". Also now allows either bytes or strings. Added functions urllib.parse.quote_from_bytes, urllib.parse.unquote_to_bytes. All quote/unquote functions now exported from the module. Doc/library/urllib.parse.rst: Updated docs on quote and unquote to reflect new interface, added quote_from_bytes and unquote_to_bytes. Lib/test/test_urllib.py: Added many new test cases testing encoding and decoding Unicode strings with various encodings, as well as testing the new functions. Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py, Lib/test/test_wsgiref.py: Updated and added test cases to deal with UTF-8-encoded URIs. Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote with encoding="latin-1", to preserve existing behaviour (which the whole email module is dependent upon).

Following Guido and Antoine's reviews, I've written a new patch which
fixes *most* of the issues raised. The ones I didn't fix I have noted
below, and commented on the review site
(http://codereview.appspot.com/2827/). Note: I intend to address all of
these issues after some discussion.

Outstanding issues raised by the reviews:

Doc/library/urllib.parse.rst:
Should unquote accept a bytes/bytearray as well as a str?

Lib/email/utils.py:
Should encode_rfc2231 with charset=None accept strings with non-ASCII
characters, and just encode them to UTF-8?

Lib/test/test_http_cookiejar.py:
Does RFC 2965 let me get away with changing the test case to expect
UTF-8? (I'm pretty sure it doesn't care what encoding is used).

Lib/test/test_urllib.py:
Should quote raise a TypeError if given a bytes with encoding/errors
arguments? (Motivation: TypeError is what you usually raise if you
supply too many args to a function).

Lib/urllib/parse.py:
(As discussed above) Should quote accept safe characters outside the
ASCII range (thereby potentially producing invalid URIs)?

------

Commit log for patch8:

Fix for issue 3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as
"%aF") weren't being decoded at all.

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also characters/bytes above 128 are no longer allowed to be
"safe". Also now allows either bytes or strings.

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes. All quote/unquote functions now exported
from the module.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).

History
Date	User	Action	Args
2008-08-07 14:59:58	mgiuca	set	recipients: + mgiuca, gvanrossum, loewis, jimjjewett, janssen, orsenthil, pitrou, thomaspinckney3
2008-08-07 14:59:58	mgiuca	set	messageid: <1218121198.79.0.14201620078.issue3300@psf.upfronthosting.co.za>
2008-08-07 14:59:58	mgiuca	link	issue3300 messages
2008-08-07 14:59:56	mgiuca	create