Message 70497 - Python tracker


Author	mgiuca
Recipients	loewis, mgiuca, orsenthil, thomaspinckney3
Date	2008-07-31.11:27:32
Message-id	<1217503657.06.0.212871069563.issue3300@psf.upfronthosting.co.za>

Content

OK, after a long discussion on the mailing list, Guido gave this the OK,
with the proviso that there also be str->bytes and bytes->str versions of
these functions. So I've written those.

http://mail.python.org/pipermail/python-dev/2008-July/081601.html

quote itself now accepts either a str or a bytes object. quote_from_bytes
is a new function which is just an alias for quote. (Is this acceptable?)
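
To illustrate the str/bytes behaviour described above, here is a minimal sketch. It runs against urllib.parse as shipped in current Python 3 releases, which may differ in detail from patch6; in particular, it only relies on quote and quote_from_bytes giving the same result for bytes input, not on one being a literal alias of the other.

```python
from urllib.parse import quote, quote_from_bytes

# A str is encoded (UTF-8 by default) and then percent-encoded.
from_str = quote("a b/\u00e9")

# A bytes object is percent-encoded octet by octet, with no encoding step.
from_bytes = quote(b"a b/\xe9")

# quote_from_bytes produces the same result as quote for bytes input.
via_from_bytes = quote_from_bytes(b"a b/\xe9")

print(from_str)        # a%20b/%C3%A9
print(from_bytes)      # a%20b/%E9
print(via_from_bytes)  # a%20b/%E9
```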

unquote is still str->str. I've added a totally separate function
unquote_to_bytes which is str->bytes.
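
A small sketch of the two return types, assuming the urllib.parse of a current Python 3 release (the patch under discussion may behave differently in edge cases):

```python
from urllib.parse import unquote, unquote_to_bytes

escaped = "%C3%A9t%C3%A9"

# str -> str: the escaped octets are decoded (UTF-8 by default).
text = unquote(escaped)

# str -> bytes: the escaped octets are returned raw, with no decoding.
raw = unquote_to_bytes(escaped)

print(type(text).__name__)  # str
print(type(raw).__name__)   # bytes
```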

Note there is a slight issue here: I didn't quite know what to do with
unescaped non-ASCII characters in the input to unquote_to_bytes - they
need to somehow be converted to bytes. I chose to encode them using
UTF-8, on the basis that they technically shouldn't be in a URI anyway.
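
The UTF-8 choice can be seen in a short sketch. This uses unquote_to_bytes from a current Python 3 release, which is assumed here to match the behaviour described for the patch:

```python
from urllib.parse import unquote_to_bytes

# U+00E9 appears literally (unescaped) in the input, next to real escapes.
result = unquote_to_bytes("caf\u00e9%20au%20lait")

# The unescaped character comes out as its two UTF-8 bytes, C3 A9.
print(result)  # b'caf\xc3\xa9 au lait'
```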

Note that my new unquote doesn't have this problem; it's carefully
written to preserve the Unicode characters, even if they aren't
expressible in the given encoding (which explains some of the code bloat).

This makes unquote(s, encoding=e) necessarily more robust than
unquote_to_bytes(s).decode(e) in terms of unescaped non-ASCII characters
in the input.
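
A sketch of the difference, using an input that mixes an unescaped character not expressible in Latin-1 with a Latin-1 escape. The input string is a made-up example, and the code runs against the urllib.parse of a current Python 3 release rather than patch6 itself:

```python
from urllib.parse import unquote, unquote_to_bytes

# U+4E2D cannot be expressed in Latin-1; %E9 is the Latin-1 escape for U+00E9.
s = "\u4e2d%E9"
e = "latin-1"

# unquote decodes only the escaped octets and leaves U+4E2D untouched.
direct = unquote(s, encoding=e)

# unquote_to_bytes first turns U+4E2D into UTF-8 bytes (E4 B8 AD), which the
# later Latin-1 decode then misreads as three separate characters.
roundabout = unquote_to_bytes(s).decode(e)

print(direct == "\u4e2d\u00e9")                  # True
print(roundabout == "\u00e4\u00b8\u00ad\u00e9")  # True
print(direct == roundabout)                      # False
```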

I've also added new test cases and documentation for these two new
functions (included in patch6).

On an entirely personal note, can whoever checks this in please mention
my name in the commit log - I've put in at least 30 hours researching
and writing this patch, and I'd like for this not to go uncredited :)

Commit log for patch6:

Fix for issue 3300.

urllib.parse.unquote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the decoding of percent-encoded octets.
As per RFC 3986, default is "utf-8" (previously implicitly decoded as
ISO-8859-1).
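
As a rough illustration of the new unquote arguments, the sketch below uses a current Python 3 release. One assumption to flag: the default errors="replace" shown here is that of released Python, and the commit log above does not say what patch6 uses.

```python
from urllib.parse import unquote

# Default: escaped octets are decoded as UTF-8.
utf8_default = unquote("%C3%A9")

# The old ISO-8859-1 interpretation must now be requested explicitly.
latin1 = unquote("%E9", encoding="latin-1")

# A lone E9 octet is not valid UTF-8; with errors="replace" it becomes U+FFFD.
replaced = unquote("%E9")

# With errors="strict" the same input raises instead.
try:
    unquote("%E9", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(utf8_default == latin1 == "\u00e9")  # True
print(replaced == "\ufffd")                # True
print(strict_failed)                       # True
```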

urllib.parse.quote: Added "encoding" and "errors" optional arguments,
allowing the caller to determine the encoding of non-ASCII characters
before they are percent-encoded. Default is "utf-8" (previously characters
in range(128, 256) were encoded as ISO-8859-1, and characters above that
as UTF-8). Also, characters/bytes of 128 and above are no longer allowed
to be "safe". quote now also accepts either bytes or strings.
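
A matching sketch for quote, again run against a current Python 3 release and so only an approximation of patch6. The way a non-ASCII "safe" character is handled here (silently ignored) is the released behaviour; the patch may have rejected it differently.

```python
from urllib.parse import quote

# Default: every non-ASCII character is encoded as UTF-8, then escaped.
utf8_default = quote("\u00e9")
above_255 = quote("\u4e2d")

# The old Latin-1 treatment of range(128, 256) must be requested explicitly.
latin1 = quote("\u00e9", encoding="latin-1")

# A non-ASCII character listed in "safe" is still escaped.
not_safe = quote("\u00e9", safe="\u00e9")

print(utf8_default)  # %C3%A9
print(above_255)     # %E4%B8%AD
print(latin1)        # %E9
print(not_safe)      # %C3%A9
```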

Added functions urllib.parse.quote_from_bytes,
urllib.parse.unquote_to_bytes.

Doc/library/urllib.parse.rst: Updated docs on quote and unquote to
reflect new interface, added quote_from_bytes and unquote_to_bytes.

Lib/test/test_urllib.py: Added several new test cases testing encoding
and decoding Unicode strings with various encodings, as well as testing
the new functions.

Lib/test/test_http_cookiejar.py, Lib/test/test_cgi.py,
Lib/test/test_wsgiref.py: Updated and added test cases to deal with
UTF-8-encoded URIs.

Lib/email/utils.py: Calls urllib.parse.quote and urllib.parse.unquote
with encoding="latin-1", to preserve existing behaviour (which the whole
email module is dependent upon).
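
The reason Latin-1 preserves the old behaviour is that it maps every octet to exactly one character and back. The sketch below shows that round-trip property only; it is not the code in Lib/email/utils.py, and it runs against a current Python 3 release.

```python
from urllib.parse import quote, unquote

# Every character in range(256) corresponds to exactly one Latin-1 byte.
all_latin1 = "".join(chr(i) for i in range(256))

# safe="" so that "/" is escaped too; the quoted form is pure ASCII.
quoted = quote(all_latin1, safe="", encoding="latin-1")
restored = unquote(quoted, encoding="latin-1")

print(quoted.isascii())        # True
print(restored == all_latin1)  # True
```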

History
Date	User	Action	Args
2008-07-31 11:27:37	mgiuca	set	recipients: + mgiuca, loewis, orsenthil, thomaspinckney3
2008-07-31 11:27:37	mgiuca	set	messageid: <1217503657.06.0.212871069563.issue3300@psf.upfronthosting.co.za>
2008-07-31 11:27:36	mgiuca	link	issue3300 messages
2008-07-31 11:27:32	mgiuca	create