Author mgiuca
Recipients gvanrossum, janssen, jimjjewett, loewis, mgiuca, orsenthil, pitrou, thomaspinckney3
Date 2008-08-07.14:59:55
Message-id <1218121198.79.0.14201620078.issue3300@psf.upfronthosting.co.za>
Content
Following Guido and Antoine's reviews, I've written a new patch which
fixes *most* of the issues raised. The ones I didn't fix I have noted
below, and commented on the review site
(http://codereview.appspot.com/2827/). Note: I intend to address all of
these issues after some discussion.

Outstanding issues raised by the reviews:

Doc/library/urllib.parse.rst:
Should unquote accept a bytes/bytearray as well as a str?

Lib/email/utils.py:
Should encode_rfc2231 with charset=None accept strings with non-ASCII
characters, and just encode them to UTF-8?

Lib/test/test_http_cookiejar.py:
Does RFC 2965 let me get away with changing the test case to expect
UTF-8? (I'm pretty sure it doesn't care what encoding is used).

Lib/test/test_urllib.py:
Should quote raise a TypeError if given a bytes with encoding/errors
arguments? (Motivation: TypeError is what you usually raise if you
supply too many args to a function).

Lib/urllib/parse.py:
(As discussed above) Should quote accept safe characters outside the
ASCII range (thereby potentially producing invalid URIs)?

------

Commit log for patch8:

Fix for issue 3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1). Also fixed a bug in which mixed-case hex digits (such as
"%aF") weren't being decoded at all.

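To illustrate the unquote change described above, here is a minimal sketch. It assumes a Python 3 release that includes this change; the variable names are only for illustration.

```python
from urllib.parse import unquote

# Default decoding is UTF-8, so a multi-byte escape sequence
# becomes a single character (the euro sign here).
euro = unquote("%E2%82%AC")

# The caller can choose another encoding for the same kind of octet.
e_acute = unquote("%E9", encoding="latin-1")

# Mixed-case hex digits are decoded like any other escape.
macron = unquote("%aF", encoding="latin-1")

# With the default errors="replace", an octet that is not valid
# UTF-8 on its own is replaced by U+FFFD.
replaced = unquote("%E9")

print(repr(euro), repr(e_acute), repr(macron), repr(replaced))
```
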
urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before being percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Characters/bytes outside the ASCII range (128 and above) are
no longer allowed to be "safe". quote now accepts either bytes or strings.

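A short sketch of the quote behaviour described above, again assuming a Python 3 release that includes this change (the sample string is arbitrary):

```python
from urllib.parse import quote

text = "caf\u00e9"  # "café"

# Default: non-ASCII characters are encoded as UTF-8, then percent-encoded.
utf8_quoted = quote(text)

# The caller can request a different encoding explicitly.
latin1_quoted = quote(text, encoding="latin-1")

# Bytes input is percent-encoded directly, with no text encoding step.
bytes_quoted = quote(b"caf\xe9")

print(utf8_quoted, latin1_quoted, bytes_quoted)
```
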
Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes. All quote/unquote functions now exported
from the module.
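
As a rough illustration of the two bytes-oriented functions, the following sketch round-trips arbitrary binary data. It assumes a Python 3 release that includes this change; the sample bytes are arbitrary.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw = b"\xff\x00 data"

# Bytes in, percent-encoded str out; no character encoding is involved.
quoted = quote_from_bytes(raw)

# Percent-encoded str in, raw bytes out.
restored = unquote_to_bytes(quoted)

# Mixed-case hex digits are accepted here as well.
mixed = unquote_to_bytes("%fF")

print(quoted, restored, mixed)
```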

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added many new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
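
The reason encoding="latin-1" preserves the old behaviour can be sketched as follows: Latin-1 maps each code point below 256 to exactly one byte, so quoting and unquoting with it round-trips any such string. This is only an illustration of that property, not the code in Lib/email/utils.py.

```python
from urllib.parse import quote, unquote

# Every character that Latin-1 can represent (code points 0..255).
all_chars = "".join(chr(i) for i in range(256))

# Quote with no safe characters so that "/" is escaped too.
quoted = quote(all_chars, safe="", encoding="latin-1")
restored = unquote(quoted, encoding="latin-1")

# The quoted form is pure ASCII and decodes back to the original string.
print(all(ord(c) < 128 for c in quoted), restored == all_chars)
```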